
International Handbooks on Information Systems

Series Editors
Peter Bernus · Jacek Blazewicz · Günter Schmidt · Michael Shaw

Springer
Berlin
Heidelberg
New York
Hong Kong
London
Milan
Paris
Tokyo
Titles in the Series

P. Bernus, K. Mertins and G. Schmidt (Eds.)


Handbook on Architectures of Information Systems
ISBN 3-540-64453-9

M. Shaw, R. Blanning, T. Strader and A. Whinston (Eds.)


Handbook on Electronic Commerce
ISBN 3-540-65822-X

J. Blazewicz, K. Ecker, B. Plateau and D. Trystram (Eds.)


Handbook on Parallel and Distributed Processing
ISBN 3-540-66441-6

H. H. Adelsberger, B. Collis and J. M. Pawlowski (Eds.)


Handbook on Information Technologies for Education and Training
ISBN 3-540-67803-4

C. W. Holsapple (Ed.)
Handbook on Knowledge Management 1
Knowledge Matters
ISBN 3-540-43527-1
Handbook on Knowledge Management 2
Knowledge Directions
ISBN 3-540-43527-1

P. Bernus, L. Nemes and G. Schmidt (Eds.)


Handbook on Enterprise Architecture
ISBN 3-540-00343-6

J. Blazewicz, W. Kubiak, T. Morzy and M. Rusinkiewicz (Eds.)


Handbook on Data Management in Information Systems
ISBN 3-540-43893-9
Jacek Blazewicz · Wieslaw Kubiak
Tadeusz Morzy . Marek Rusinkiewicz
Editors

Handbook
on Data Management
in Information
Systems
With 157 Figures
and 9 Tables

Springer
Professor Jacek Blazewicz e-mail: blazewic@put.poznan.pl
Institute of Bioorganic Chemistry
Polish Academy of Sciences
ul. Noskowskiego 12
61-704 Poznan, Poland

Professor Wieslaw Kubiak e-mail: wkubiak@morgan.ucs.mun.ca


Memorial University of Newfoundland
Faculty of Business Administration
St. John's
NF AlB 3X5, Canada
Professor Tadeusz Morzy e-mail: morzy@put.poznan.pl
Poznan University of Technology
Institute of Computing Science
ul. Piotrowo 3a
60-965 Poznan, Poland

Professor Marek Rusinkiewicz e-mail: marek@research.telecordia.com


Telcordia Technologies
Information and Computer Science Laboratory
445 South Street MCC-1J346B
Morristown, NJ 07960, USA

ISBN 978-3-642-53441-6 ISBN 978-3-540-24742-5 (eBook)


DOI 10.1007/978-3-540-24742-5

Cataloging-in-Publication Data applied for


A catalog record for this book is available from the Library of Congress.
Bibliographic information published by Die Deutsche Bibliothek
Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie;
detailed bibliographic data available in the Internet at http://dnb.ddb.de
This work is subject to copyright. All rights are reserved, whether the whole or part of the material
is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication
of this publication or parts thereof is permitted only under the provisions of the German Copyright
Law of September 9, 1965, in its current version, and permission for use must always be obtained
from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.
Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science + Business Media GmbH
http://www.springer.de
© Springer-Verlag Berlin Heidelberg 2003
Softcover reprint of the hardcover 1st edition 2003
The use of general descriptive names, registered names, trademarks, etc. in this publication does
not imply, even in the absence of a specific statement, that such names are exempt from the
relevant protective laws and regulations and therefore free for general use.
Cover design: Erich Kirchner, Heidelberg
SPIN 10886050 4213130 - 5 4 3 2 1 0 - Printed on acid-free paper
Foreword
This book is the sixth of a running series of volumes dedicated to selected
topics of information theory and practice. The objective of the series is to pro-
vide a reference source for problem solvers in business, industry, government,
and professional researchers and graduate students.
The first volume, Handbook on Architectures of Information Systems,
presents a balanced number of contributions from academia and practition-
ers. The structure of the material follows a differentiation between model-
ing languages, tools and methodologies. The second volume, Handbook on
Electronic Commerce, examines electronic commerce storefront, on-line busi-
ness, consumer interface, business-to-business networking, digital payment,
legal issues, information product development and electronic business mod-
els. The third volume, Handbook on Parallel and Distributed Processing,
presents basic concepts, methods, and recent developments in the field of
parallel and distributed processing as well as some important applications of
parallel and distributed computing. In particular, the book examines such
fundamental issues in the above area as languages for parallel processing,
parallel operating systems, architecture of parallel and distributed systems,
parallel database and multimedia systems, networking aspects of parallel and
distributed systems, efficiency of parallel algorithms. The fourth volume on
Information Technologies for Education and Training is devoted to a pre-
sentation of current and future research and applications in the field of ed-
ucational technology. The fifth double volume on Knowledge Management
contains an extensive, fundamental coverage of the knowledge management
field.
The present volume of the International Handbooks on Data Manage-
ment, like the previous ones, is a joint venture of an international board of
editors, gathering prominent authors from academia and practice, who are well-
known specialists in the field of data management. The technology for data
management has evolved during the last 30 years from simple file systems through
hierarchical, network, and relational database systems to the new generation
data management technology. This transition was driven by two factors: the
increasing requirements of new data management applications on one side,
and recent developments in database, networking and computer technolo-
gies on the other side. Advances in data management technology have led
to new exciting applications such as multimedia systems, digital libraries, e-
commerce, workflow management systems, decision support systems, etc. The
intention of the Handbook is to provide practitioners, scientists and gradu-
ate students with a comprehensive overview of basic methods, concepts, tools
and techniques applied currently for data management and their use in in-
formation system management and development. The handbook contains 11
chapters that cover a wide spectrum of topics ranging from core database
technologies such as data modeling, relational, object-oriented, parallel and
distributed database systems to advanced database systems and XML pro-
cessing, multimedia database systems, workflow management, data warehous-
ing, mobile computing, and data mining. Each chapter includes a compre-
hensive overview of the issue covered, proposed solutions to problems, and
directions for further research and development. We hope the handbook will
help readers to better understand the current status of the data management
field and directions of its development.
Summing up, the Handbook is indispensable for academics and profes-
sionals who are interested in learning leading experts' coherent and individual
view of the topic.
We would like to express our sincere thanks to the people who have con-
tributed to the preparation of this volume. First, we would like to thank the authors
for their submissions. We also want to thank Dr. Müller from Springer-Verlag for
his encouragement to prepare the volume. Special thanks are addressed to
Mr. Piotr Krzyżagórski for his excellent job in carefully editing and converting
the chapters into a single uniform style of the Springer-Verlag format.

Jacek Blazewicz
Wieslaw Kubiak
Tadeusz Morzy
Marek Rusinkiewicz
Contents

Foreword ...................................................... V

1. Management of Data: State-of-the-Art and Emerging Trends 1


Jacek Blazewicz, Tadeusz Morzy
1 Introduction................................................. 2
2 Survey of the Volume. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 12

2. Database Systems: from File Systems to Modern Database


Systems ...................................................... 18
Zbyszko Królikowski, Tadeusz Morzy
1 Introduction - Database Concepts ............................. 19
2 Database System Generations. . . . . . . . . . . . . . . . .. . . . . . . . . . . . . ... 21
3 Network Database Systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 22
4 Hierarchical Database Systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 25
5 Relational Database Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 29
6 Object-Oriented Database Systems. . . . . . . . . . . . . . . . . . . . . . . . . . . .. 33
7 Federated, Mediated Database Systems and Data Warehouses .. . .. 38
8 Conclusions................................................. 47

3. Data Modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 49
Jeffrey Parsons
1 Introduction................................................. 50
2 Early Concerns in Data Management . . . . . . . . . . . . . . . . . . . . . . . . . .. 50
3 Abstraction in Data Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 52
4 Semantic Data Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 56
5 Models of Reality and Perception .... . . . . . . . . . . . . . . . . . . . . . . . . .. 62
6 Toward Cognition-Based Data Management. ... .. . ... .. .. .... ... 66
7 A Cognitive Approach to Data Modeling. . . . . . . . . . . . . . . . . . . . . . .. 70
8 Research Directions .... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 72

4. Object-Oriented Database Systems. . . . . . . . . . . . . . . . . . . . . . . .. 78


Alfons Kemper, Guido Moerkotte
1 Introduction and Motivation .................................. 80
2 Object-Oriented Data Modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 85
3 The Query Language OQL .................................... 106
4 Physical Object Management .................................. 117
5 Architecture of Client-Server-Systems ........................... 135
6 Indexing .................................................... 139
7 Dealing with Set-Valued Attributes ............................ " 160
8 Query Optimization .......................................... 164
9 Conclusion .................................................. 186

5. High Performance Parallel Database Management Systems 194


Shahram Ghandeharizadeh, Shan Gao, Chris Gahagan, Russ Krauss
1 Introduction ................................................. 195
2 Partitioning Strategies ........................................ 196
3 Join Using Inter-Operator Parallelism .......................... 201
4 ORE: a Framework for Data Migration ......................... 203
5 Conclusions and Future Research Directions ..................... 216

6. Advanced Database Systems ............................... 221


Gottfried Vossen
1 Introduction................................................. 222
2 Preliminaries................................................ 227
3 Data Models and Modeling for Complex Objects ................. 234
4 Advanced Query Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
5 Advanced Database Server Capabilities ......................... 262
6 Conclusions and Outlook ..................................... 274
7. Parallel and Distributed Multimedia Database Systems .... 284
Odej Kao
1 Introduction................................................. 286
2 Media Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
3 MPEG as an Example of Media Compression. . . . . . . . . . . . . . . . . . . . 292
4 Organisation and Retrieval of Multimedia Data . . . . . . . . . . . . . . . . . . 298
5 Data Models for Multimedia Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
6 Multimedia Retrieval Sequence Using Images as an Example ...... 308
7 Requirements for Multimedia Applications ...................... 318
8 Parallel and Distributed Processing of Multimedia Data .......... 321
9 Parallel and Distributed Techniques for Multimedia Databases ..... 337
10 Case Study: CAIRO - Cluster Architecture for Image Retrieval and
Organisation .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
11 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359

8. Workflow Technology: the Support for Collaboration ....... 365


Dimitrios Georgakopoulos, Andrzej Cichocki, Marek Rusinkiewicz
1 Introduction................................................. 367
2 Application Scenario and Collaboration Requirements ............ 368
3 Commercial Technologies Addressing Collaboration Requirements . . 371
4 Evaluation of Current Workflow Management Technology . . . . . . . . . 372
5 Research Problems, Related Work, and Directions ................ 381
6 Summary................................................... 383

9. Data Warehouses .......................................... 387


Ulrich Dorndorf, Erwin Pesch
1 Introduction................................................. 389
2 Basics...................................................... 389
3 The Database of a Data Warehouse ............................ 394
4 The Data Warehouse Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404


5 Data Analysis of a Data Warehouse ............................ 411
6 Building a Data Warehouse ................................... 418
7 Future Research Directions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
8 Conclusions................................................. 423
10. Mobile Computing . ....................................... 431
Omran Bukhres, Evaggelia Pitoura, Arkady Zaslavsky
1 Introduction ................................................. 433
2 Mobile Computing Infrastructure .............................. 437
3 Mobile Computing Software Architectures
and Models ................................................. 444
4 Disconnected Operation ...................................... 454
5 Weak Connectivity ........................................... 462
6 Data Delivery by Broadcast ................................... 468
7 Mobile Computing Resources and Pointers . . . . . . . . . . . . . . . . . . . . . . 476
8 Conclusions ................................................. 479
11. Data Mining .............................................. 487
Tadeusz Morzy, Maciej Zakrzewicz
1 Introduction ................................................. 488
2 Mining Associations .......................................... 490
3 Classification and Prediction .................................. 517
4 Clustering .................................................. 540
5 Conclusions ................................................. 558

Index ......................................................... 567

List of Contributors .......................................... 577


1. Management of Data: State-of-the-Art and
Emerging Trends

Jacek Blazewicz 1 and Tadeusz Morzy 2

1 Institute of Bioorganic Chemistry, Polish Academy of Sciences, Laboratory of
Bioinformatics, Poznan, Poland
2 Institute of Computing Science, Poznan University of Technology, Poznan,
Poland

1. Introduction ....................................................... 2
1.1 Database Systems..... .... .... ... ............ ....... ........... 3
1.2 Beyond Database Systems ...................................... 8
1.3 The Future Research .......................................... 11
2. Survey of the Volume ............................................. 12

Abstract. This chapter presents an introduction to the area of data management.
The aim of the chapter is to recall the evolution of data management during the
past decades in order to present future trends and emerging fields of research.
In the second part of the chapter a brief survey of the volume is presented.

1 Introduction
One of the most important applications of computers is the management of
data in various forms, such as: records, documents, scientific and business
data, voice, videos, and images. The systems that are used to store, manage,
manipulate, analyze, and visualize the data are called data management sys-
tems. During the last 40 years the technology for managing data has evolved
from simple file systems to multimedia databases, complex workflow systems,
and large integrated distributed systems. Nowadays, they allow for an effi-
cient, reliable, and secure access to globally distributed complex data.
The history of data management research is one of exceptional productiv-
ity and startling economic impact [SSU96]. Achievements in data management
research underpin fundamental advances in communication systems, financial
management, administration systems, medicine, law, knowledge-based sys-
tems, and a host of other civilian and defense applications. They also serve as
the foundation for considerable progress in the basic science fields from com-
puting to biology [SSU91,SSU96,BBC+98]. The research on data manage-
ment has led to database systems becoming arguably the most important
development in the field of software engineering as well as the most important
technology used to build information systems. Now, it would be unthinkable
to manage the large volumes of data that keep corporations running with-
out support from commercial database management systems (DBMSs). The
field of database system research and development is an example of an enormous
success story over its 30-year history, both in terms of significant theoretical
results and of significant practical commercial values. These achievements are
documented and discussed in [EN00,Gra96,SSU91,SSU96].
The major strength of database systems is the ability to provide fast
and unprocedural concurrent access to data while ensuring reliable storage
and accurate maintainability of data. The features of the database system
technology, efficiency and consistency, have enabled development of huge fi-
nancial systems, reservation systems, and other business systems. During the
last decade, database systems have evolved from simple business-data
processing systems that operate on well-structured traditional data such as
numbers and character strings, to more complex object-relational systems
that operate on multimedia "documents", videos, geographic/spatial data,
time-series, voice, etc. Recent advances in database technology have been
leading to new exciting applications of database systems: geographic infor-
mation systems, CIM systems, CASE systems, data warehouses and OLAP
systems, data mining systems, mobile systems, workflow systems, etc. How-
ever, despite the popularity and flexibility of database systems, which are
now able to cope with data of increasing complexity, still a large portion of
data is being stored and processed in places other than database systems (flat
files, data repositories, etc.). While the trend of building more powerful and
flexible database management systems is justified by the increasing demands
of their users, there is also a need for new data management solutions for new
information environments. There are many examples of new data-intensive
applications in which data management solutions are conspicuous by their
absence. Recently, the Asilomar report [BBC+98] has pointed out that the
fundamental data management issues have changed dramatically in the last
decade and the field needs to radically broaden its research focus to new
information environments that require new data management solutions.
In the following, we will recall the evolution of the data management
during the past decades and briefly present and discuss the current trends of
research.

1.1 Database Systems


There have been six distinct phases in the evolution of data management
(J. Gray [Gra96]). The first phase was characterized by manual processing
of data. The second phase used mechanical and electromechanical equip-
ment, like punched-card machines, to process data. Each data record was
represented as binary patterns on a punched card, and special sorters and
tabulators were used to sort and tabulate the cards. The third phase stored
data on magnetic tape and used stored-program computers to process data.
It is the beginning of the file-oriented processing model and file-based sys-
tems. The fourth phase introduced the concept of on-line data processing
systems. It was the beginning of database systems. The first hierarchical
database management system, called IMS (Information Management Sys-
tem), was released by IBM in the late 1960s. IMS managed data organized
as hierarchies of records. The main reason for this organization, called hier-
archical data model, was to allow the use of serial storage devices such as
magnetic tape. This restriction was subsequently dropped. The key contri-
bution of IMS was the idea that data should be managed independently of
any single application. Previously, applications owned private data files that
often duplicated data from other files. Another significant development in
the field of data management at this time was the emergence of IDS from
General Electric. This development has led to a new type of database system
known as the network database management system. The network database
management system managed data organized as networks of records. The
network database system was developed partially to address the need of rep-
resenting more complex relationships among data than could be modeled
with hierarchical data model, and partially to impose a database standard.
To help establish this standard, the Conference on Data System Languages
(CODASYL) formed in the mid-1960s a List Processing Task Force, subse-
quently renamed the Data Base Task Group (DBTG). The DBTG defined
two distinct languages: a data definition language (DDL) and a data ma-
nipulation language (DML). Moreover, the DBTG crystallized the concept
of schemas. The DBTG proposal distinguished a logical schema, describ-
ing the logical organization of the entire database, a subschema, describing
the part of the database as seen by the user or application program, and a
physical schema, which describes the physical layout of the database records
on storage devices. This logical-physical-subschema mechanism defined by
DBTG provided data independence. A number of DBMSs were subsequently
developed following the DBTG proposal. These systems are known as CO-
DASYL or DBTG systems. The IMS and CODASYL systems represented
the first-generation of DBMSs. The main disadvantage of both the IMS and
CODASYL data models was the graph-based logical organization of data, in
which programs could navigate among records by following the relationships
among them. This navigational interface to database systems was too diffi-
cult even for programmers. To answer even simple queries they had to write
complex programs to navigate these databases.
The fifth phase of data management evolution is related to relational
databases. In 1970, E.F. Codd published a paper in which the relational
data model was outlined. The relational data model gave database users and
programmers high-level set-oriented data access to databases organized as
sets of tables (relations). Many experimental relational DBMSs were imple-
mented thereafter, with the first commercial products appearing in the late
1970s and early 1980s. The database research community in academia and
industry, inspired by the relational data model, developed many important
results and new ideas that changed the database technology, but that can be
also applied to other information environments: the standard query language
SQL and a theory of query language expressibility and complexity, query
processing and optimization techniques, concurrent transaction management
techniques, transactional recovery techniques, distributed and parallel pro-
cessing techniques, etc. The list is not exhaustive, but rather illustrates some
of the major technologies that have been developed by database research and
development. The relational data model is still the most commonly supported
among commercial database vendors. Relational DBMSs are referred to as
second-generation DBMSs.
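
To make the contrast concrete, the following minimal sketch (not taken from the handbook; it uses Python's standard sqlite3 module purely as an illustration, with invented table names and data) places record-at-a-time navigation, of the kind hierarchical and CODASYL programmers had to write, next to a single declarative, set-oriented query in the relational style.

    import sqlite3

    # Record-at-a-time navigation, in the spirit of hierarchical/CODASYL APIs:
    # the application itself walks owner/member links, one record at a time.
    departments = [
        {"name": "Sales",    "employees": [{"name": "Anna"}]},
        {"name": "Research", "employees": [{"name": "Jan"}, {"name": "Eva"}]},
    ]
    for dept in departments:                      # navigate owner records
        if dept["name"] == "Research":
            for emp in dept["employees"]:         # then their member records
                print(emp["name"])

    # Set-oriented, declarative access in the relational style: one query,
    # and the system (not the program) decides how to evaluate it.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE department(dept_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE employee(emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
        INSERT INTO department VALUES (1,'Sales'),(2,'Research');
        INSERT INTO employee VALUES (10,'Anna',1),(11,'Jan',2),(12,'Eva',2);
    """)
    for (name,) in conn.execute(
            "SELECT e.name FROM employee e JOIN department d "
            "ON e.dept_id = d.dept_id WHERE d.name = 'Research'"):
        print(name)
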
According to Jim Gray [Gra96], we are now in the sixth phase of data
management evolution. The phase began in the mid-1980s with a new data
model, called object-oriented data model, based on object-oriented program-
ming principles. The relational database systems have several shortcomings.
First of all, they have limited modeling capabilities. Second, they offer a pre-
defined, limited set of data types. Although SQL added new data types for
time, time intervals, timestamps, dates, currency, different types of numbers
and character strings, this set of data types is still insufficient for some ap-
plications. Moreover, the relational database systems have a clear distinction
between programs and data. However, with new fields of database systems
applications, this separation between programs and data became problem-
atic. New applications require new data types together with a definition of
their behavior. In other words, DBMSs should let users create their own
application-specific data types that would then be managed by the DBMS.
The object-oriented data model assumes the unification of programs and data.
Many experimental and commercial object-oriented DBMSs (OODBMS)
have emerged in the late 1980s and early 1990s. However, the market was very
slow to accept the new data model due to its limitations. Very few companies
decided to move their mission-critical business applications to OODBMS
platform. Meanwhile, the vendors of traditional relational DBMSs, in re-
sponse to the needs of new database application fields and as an attempt
to address some of the key deficiencies of relational DBMSs that are due
to the inherent restrictions in the relational data model, extended their re-
lational products with core object-oriented concepts found in OODBMSs.
These concepts include encapsulation of data and programs into an object,
object identity, multiple inheritance, abstract data types, and nested objects.
This evolution of relational DBMSs has led to a "new" hybrid DBMS called
object-relational DBMS (ORDBMS) [Kim95]. Now, vendors of ORDBMSs
are augmenting their products with object-oriented database design and man-
agement tools. Both OODBMSs and ORDBMSs represent third-generation
DBMSs. There is a general agreement in the database research community
that the foundation of the post-relational database technology is a unified re-
lational and object-oriented database system: a system that has all major
features of today's relational database systems (query optimization, trans-
action management, meta data management, views, authorization, transac-
tional recovery, triggers, etc.) extended with the concepts of encapsulation,
inheritance, arbitrary data types with a definition of their behavior, and
nested objects.
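
As an illustration of what "arbitrary data types with a definition of their behavior" means in practice, the following Python sketch (illustrative only: the Point and Region types and the point-in-polygon method are invented for this example and belong to no particular ORDBMS) shows an application-specific, nested data type that encapsulates both state and behaviour, the kind of type an object-relational system is expected to store and query alongside ordinary rows.

    from dataclasses import dataclass

    @dataclass
    class Point:
        x: float
        y: float

    @dataclass
    class Region:                  # a nested object: a polygon built from Points
        vertices: list

        def contains(self, p: Point) -> bool:
            # Behaviour defined with the type: ray-casting point-in-polygon test.
            inside = False
            n = len(self.vertices)
            for i in range(n):
                a, b = self.vertices[i], self.vertices[(i + 1) % n]
                if (a.y > p.y) != (b.y > p.y):
                    x_cross = a.x + (p.y - a.y) * (b.x - a.x) / (b.y - a.y)
                    if p.x < x_cross:
                        inside = not inside
            return inside

    city_limits = Region([Point(0, 0), Point(4, 0), Point(4, 4), Point(0, 4)])
    print(city_limits.contains(Point(1, 1)))   # True: the query uses the type's own method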

It was recognized in several reports published by the database research
community [SSU91,SSU96,SZ96,BBC+98] that there is a growing trend in
the computer industry to provide support for non-numerical data manage-
ment. The availability of low-cost data capture devices (digital cameras,
sensors, scanners, digital audio, etc.), combined with low-cost mass stor-
age and high-resolution displays devices, created new classes of applica-
tions that require new facilities for multimedia data management, on one
side, and typified the limits of current data management technology, on
the other side. Multimedia data means arbitrary data types and data from
arbitrary data sources. They include: numerical data, strings, text, image,
audio, graphics, video, time-series, sets, arrays, charts, graphs, and com-
pound documents that are comprised of such data. Arbitrary data sources
include: databases, file systems, sensors, spreadsheets, Web documents, data-
generating and data-consuming programs, data satellite feeds, on-line pub-
lishers [CB02,EN00,LBK02,Kim95]. To meet the needs of these new applica-
tions new multimedia database management systems are necessary. The next-
generation multimedia DBMS will very likely be built on an ORDBMS with
support for management of multimedia data. This support will include: the
ability to represent arbitrary data types and specification of procedures that
interact with arbitrary data sources, the ability to query, update, insert, and
delete multimedia data, the ability to specify and execute abstract operations
on multimedia data, and the ability to deal with heterogeneous data sources
in a uniform way. In the years ahead, multimedia DBMSs are expected to
dominate the database marketplace.

This brief outline of the history of database systems research does not
cover all developments and achievements in the database field. The database
research has developed several different "types" of DBMSs for specific ap-
plication areas [CB02,EN00,LBK02]. Temporal database systems are used to
support applications that require some aspect of time when organizing their
data. Temporal data models incorporate time as a first-class element of the
system - not only as a data type, but also real time. Therefore, they can
store and manage a history of database changes, and allow users to query
both current and past database states. Some temporal database models also
allow users to store future expected information. Spatial database systems
were developed to meet the needs of applications that store and manage
data that have spatial (multidimensional) characteristics.
These database systems are used in such applications as weather informa-
tion systems, environmental information systems, cartographic information
systems. For example, cartographic information systems store maps together
with two or three-dimensional spatial descriptions of their objects (countries,
rivers, cities, roads, etc.). A special kind of spatial database system is used for
building Geographic Information Systems (GIS). They can store and manage
data originating from digital satellite images, roads, transportation networks,
etc. Building GIS requires advanced features in data storage, management,
and visualization that are not supported by traditional DBMSs. Moreover,
very often, the new GIS applications process data that has both temporal
and spatial characteristics. The system supporting such applications requires
new functionality for storing and managing those data types. In the 1990s sev-
eral research prototypes were developed that combined spatial and temporal
DBMSs to create the new type of DBMSs called spatio-temporal database
systems. Real-time database systems (RTDBSs) are used to process trans-
actions (applications) having timing constraints associated with them and
accessing data whose values and validity change in time. These constraints,
usually expressed in the form of a deadline, arise from the need to make the
results of transactions available to the system that has to perform appropri-
ate controlling decisions in time. The importance of real-time database systems is
the result of an increasing number of real-time applications maintaining and
processing large volumes of data. The applications concern: computer inte-
grated manufacturing, factory automation and robotics, workflow systems,
aerospace systems, military command and control, medical monitoring, traf-
fic control, etc. RTDBSs were created as the result of integrating real-time
systems with traditional database systems. Active database systems are used
to support applications that require some kind of activity on the side of data.
Active database systems provide additional functionality for specifying the
so-called active rules. The rules, also referred to as ECA rules (Event-Condition-
Action), specify actions that are automatically triggered by certain events


that occur. Active database systems can be used in controlling industrial and
manufacturing processes, medical monitoring, stock systems, etc. For exam-
ple, an active database system may be used to monitor the blood pressure in
a medical monitoring system. The application can periodically insert in the
database the blood pressure reading records from sensors, and an active rule can
be defined that is triggered whenever the pressure goes above the user-defined
threshold (see the sketch below). The integration of logic programming and database technology
has led to deductive database systems. Deductive database systems provide
functionality for specifying the so-called deductive rules. The deductive rules
are used to deduce new facts from the facts stored in the database. Deduc-
tive database systems can be used in several application domains such as
enterprise modeling, hypothesis testing, software reuse, electronic commerce.
Deductive object-oriented database systems (DOODs) came about through
the integration of the object-oriented paradigm and logic programming due
to the observation that object-oriented and deductive database systems have
complementary strengths and weaknesses. While early database systems were
strictly centralized - they had a single-CPU database system architecture -
the majority of new systems today operate in an environment where multiple
CPUs are working in parallel to provide database services. Database systems
with multiple CPUs that are physically close together, i.e. they enable CPUs
to communicate without the overhead of exchanging messages over a network,
are generally said to be parallel database systems, while systems with multiple
CPUs that are geographically distributed and communicate with each other
by exchanging messages over a network are said to be distributed database
systems. The development of parallel and distributed database systems was
possible due to the developments of database systems architectures, as well as
advances in distributed and parallel data processing [AW98,BEP+00]. These
systems were developed for various reasons ranging from organizational de-
centralization and economical processing to greater autonomy of sites. Usu-
ally, distributed and parallel database systems offer higher data availability,
reliability, autonomy, performance, and flexibility compared with centralized
database systems. Finally, recent advances in wireless technology have led to
mobile database systems that allow users to establish communication with
other users, or main data repositories, while they are mobile. This feature of
mobile database systems is particularly useful to geographically distributed
organizations whose employees are mobile but require from time to time di-
rect access to the organization's data. Typical examples are traffic police, taxi
dispatchers, information brokering applications, etc.

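A minimal sketch of such an Event-Condition-Action rule, in the spirit of the blood-pressure example above; the rule structure, table name, and threshold value are hypothetical, invented for illustration rather than taken from any actual active DBMS.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class ECARule:
        event: str                               # e.g. "insert on blood_pressure"
        condition: Callable[[dict], bool]        # evaluated against the new record
        action: Callable[[dict], None]           # fired when the condition holds

    THRESHOLD = 140  # user-defined systolic threshold (assumed value)

    rules = [
        ECARule(
            event="insert on blood_pressure",
            condition=lambda rec: rec["systolic"] > THRESHOLD,
            action=lambda rec: print(f"ALERT: patient {rec['patient']} at {rec['systolic']}"),
        )
    ]

    def insert(table: str, record: dict) -> None:
        """Insert a record and trigger any matching active rules."""
        # ... store the record in the table here ...
        for rule in rules:
            if rule.event == f"insert on {table}" and rule.condition(record):
                rule.action(record)

    insert("blood_pressure", {"patient": "P-17", "systolic": 152})   # fires the rule
    insert("blood_pressure", {"patient": "P-17", "systolic": 118})   # no action
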
This brief survey of different "types" of database systems developed dur-
ing the last 30 years illustrates the effort made by the database
research community in the field of data management. It also illustrates the
connection between basic research and commercial success. All the above
mentioned database systems came from the academic and industrial research
labs. They have roots in experimental studies and prototype implementa-
tions, which evolve, in turn, into commercial products.

1.2 Beyond Database Systems

Traditionally, database systems are used to store and manage large volumes
of data, and, as we outlined above, much database research was focused in
this direction. However, the concepts and solutions developed in the field of
database systems are of significant importance in many fields of computer
science. They can be applied and extended in different interesting ways. Re-
cently, new important fields of data management have emerged. Each has a
new environment for which data management technology, especially database
technology, had to be adapted: data warehousing and OLAP, data mining,
and workflow management. We discuss briefly each in turn.

Data warehousing and OLAP. Data warehousing is a collection of deci-
sion support technologies, aimed at enabling knowledge workers (decision
makers) to make better and faster decisions [CB02,CD97,Kur99]. We observe
the explosive growth both in the number of data warehousing products and
services offered, and in acceptance of these technologies by industry. Data
warehousing technologies have been successfully used in many industries and
applications: retail (to store cash-register transactions for further user pro-
filing and inventory management), manufacturing (for order shipment and
customer support), financial services (for risk analysis and fraud detection),
telecommunication (for call analysis and fraud detection), healthcare (for
spending analysis), and, finally, data integration. Comprehensive analysis of
the organization, its business, its requirements, and market trends, requires
access to all the organization's data, wherever it is located. Moreover, it is
required to access not only the current values of data but also historical data.
The core of data warehousing technology is a data warehouse, which holds
data drawn from one or more external data sources together with historical,
consolidated and summarized data. Since data warehouses contain consoli-
dated data over potentially long periods of time, they tend to be orders of
magnitude larger than operational databases. The data warehouse provides
storage, management, and responsiveness to complex queries that can access
millions of records and perform a lot of joins and aggregates. To facilitate
complex analysis and visualization, a data warehouse often uses a multidi-
mensional data model (data cube), which supports on-line analytical pro-
cessing (OLAP), the functional and performance requirements of which are
quite different from those of the traditional on-line transaction processing.
Data warehousing is currently a very active field of research. Research prob-
lems associated with creating, maintaining, and using data warehouses are
partly similar to those involved with database systems. In fact, a data
warehouse can be considered as a "large" database system with additional
functionality. However, the well-known problems of index selection, data par-
titioning, materialized view maintenance, data integration, query processing,
parallel query processing, received renewed attention in data warehousing
research. Some research problems are specific to data warehousing: data ac-
quisition and data cleaning, evolution of data warehouse schema, multidimen-
sional query optimization, design of wrappers, data quality management. New
trends in data warehousing are the adaptation and integration of active and
temporal database technologies with data warehousing and the extension of
data warehousing technology with transaction management.
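
A minimal sketch of the multidimensional (data cube) view described above, assuming an invented sales fact table with product, country, and quarter dimensions and using Python's sqlite3 module as a stand-in engine; a real OLAP server would answer such roll-up queries over far larger, consolidated historical data.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (product TEXT, country TEXT, quarter TEXT, units INTEGER);
        INSERT INTO sales VALUES
            ('wine', 'Poland',  '2003Q1', 1200),
            ('wine', 'Austria', '2003Q1',  950),
            ('wine', 'Poland',  '2003Q2', 1100),
            ('beer', 'Poland',  '2003Q1', 3000);
    """)

    # Roll-up along two dimensions of the (product, country, quarter) cube.
    query = """
        SELECT country, quarter, SUM(units)
        FROM sales
        WHERE product = 'wine'
        GROUP BY country, quarter
        ORDER BY country, quarter
    """
    for country, quarter, total in conn.execute(query):
        print(country, quarter, total)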

Data mining. Over the last decades, many organizations have generated
and collected a large amount of data in the form of files, documents, and
databases. From the point of view of decision makers, simple storing of in-
formation in databases and data warehouses does not provide the benefits
an organization is seeking. To realize the value of stored data, it is necessary
to extract the knowledge hidden within databases and/or data warehouses
[HK00,Hol03a,Hol03b,WF00]. Useful knowledge can be partially discovered
using OLAP tools. This kind of analysis is often called the query-driven data
analysis. However, as the amount and complexity of the data stored in large
databases and data warehouses grows, it becomes increasingly difficult, if not
impossible, for decision makers to manually identify trends, patterns, regu-
larities, rules, constraints, relationships in the data using query and reporting
tools. Data mining is one of the best ways to extract or discover meaningful
knowledge from huge amounts of data. Data mining is the process of dis-
covering frequently occurring, previously unknown, and interesting patterns,
relationships, rules, anomalies, and regularities in large databases and data
warehouses. The main goal of this analysis is to help human analysts to un-
derstand the data. To illustrate the difference between OLAP and data
mining analysis, let us consider typical queries formulated in both technolo-
gies. A typical OLAP query is the following: How many bottles of wine did
we sell in the 1st quarter of 2003 in Poland vs. Austria? Typical data mining
queries are: How do the buyers of wine in Poland and Austria differ? What
else do the buyers of wine in Poland buy along with wine? How can the buy-
ers of wine be characterized? Which clients are likely to respond to our next
promotional mailing, and why?
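
To make the contrast tangible, the following minimal Python sketch (with invented basket data) answers the second question above by simple co-occurrence counting, the most elementary form of the association mining treated in Chapter 11.

    from collections import Counter

    baskets = [
        {"wine", "cheese", "bread"},
        {"wine", "cheese"},
        {"beer", "chips"},
        {"wine", "grapes", "cheese"},
    ]

    co_occurrence = Counter()
    for basket in baskets:
        if "wine" in basket:
            co_occurrence.update(basket - {"wine"})

    # Items most frequently bought together with wine, with their support counts.
    for item, count in co_occurrence.most_common():
        print(item, count)
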
Data mining technology can be used in many industries and applica-
tions [HK00]: marketing, manufacturing, financial services, telecommunica-
tion, healthcare, scientific research, and, even, sport. Data mining is now
one of the most exciting new areas of data management. It is still evolving,
building on ideas from the latest scientific research. It incorporates the latest
development taken from artificial intelligence, statistics, optimization, paral-
lel processing, database systems, and data warehousing. From the conceptual
point of view, data mining can be perceived as advanced database querying,
since the resulting knowledge in fact exists in the database or data ware-
house, however, it is difficult to retrieve this knowledge manually. Therefore,
at present, there is a very promising idea of integrating data mining methods
and tools with database/data warehouse technologies to benefit functionality
of both of them in knowledge discovery process. It is concerned with such
issues as new types of indices supporting data mining algorithms, parallel
processing, visualization of data mining results, query languages supporting
ad hoc data mining queries, etc. This leads to the concept of on-line data
mining (OLAM), fully supported by an extended DBMS architecture. We
expect that, in the near future, the integration of database and data ware-
house technologies with data mining will create a new type of "database
system" able to store and manage both data and knowledge extracted from
this data.

Workflow management. The business processes often involve the coordi-
nated execution of multiple tasks performed by different processing entities,
which may be people or software systems, such as a DBMS, an applica-
tion program, or an electronic mail system. For example, a simple rental
agreement for a property consists of several steps. The client contacts the ap-
propriate member of staff appointed to manage the desired property and fills
a special form. The member of staff contacts the company's credit controller
to check credibility of the client using a database system. Then, the controller
decides to approve or reject the application and informs the member of staff
of the final decision, who passes the decision on to the client. Also, a simple
purchase agreement consists of several steps: buyer request, bid, agree, ship,
invoice, and pay. All these examples illustrate the typical request-response
model of computing called a workflow [LR00,RS95].
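
A minimal sketch of the request-response model just described, expressing the purchase-agreement steps as a sequence of tasks executed by different processing entities; the Task and Workflow classes and the performer names are hypothetical, and a real workflow management system would add the coordination, failure, and open-nesting semantics discussed below.

    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class Task:
        name: str
        performer: str                 # person or system responsible for the step
        run: Callable[[dict], None]    # the work itself, acting on shared case data

    @dataclass
    class Workflow:
        tasks: List[Task] = field(default_factory=list)

        def execute(self, case: dict) -> None:
            for task in self.tasks:               # purely sequential for illustration
                print(f"{task.performer}: {task.name}")
                task.run(case)

    purchase = Workflow([
        Task("request", "buyer",     lambda c: c.update(item="wine", qty=12)),
        Task("bid",     "seller",    lambda c: c.update(price=8.0)),
        Task("agree",   "buyer",     lambda c: c.update(accepted=True)),
        Task("ship",    "warehouse", lambda c: c.update(shipped=True)),
        Task("invoice", "billing",   lambda c: c.update(invoice_no="INV-001")),
        Task("pay",     "buyer",     lambda c: c.update(paid=True)),
    ])

    case_data: dict = {}
    purchase.execute(case_data)
    print(case_data)
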
There are two main problems involved in workflow systems: the specifi-
cation of the workflow and the execution of the workflow. Both problems
are complicated by the fact that many organizations use multiple, indepen-
dently managed systems to automate different parts of the workflow. It is
obvious that workflows require special kinds of data management that sup-
port sequences of related tasks. In other words, workflows require their own
"workflow management systems" that support their specific requirements.
First of all, workflow processes require, in terms of execution, special open
nesting semantics (new transaction models) that permit partial results of the
workflow to be visible outside the workflow, allowing components of the work-
flow to commit independently. Then, workflow processes require special tools
for their specification, creation, and management. Summarizing, a compre-
hensive transactional workflow system should support multitask, multisystem
activities where (1) different tasks may have different execution behavior and
properties, (2) the tasks may be executed on different processing entities, (3)
application or user-defined coordination of the execution of different tasks is
provided, and (4) application or user-defined failure and execution atomicity
are supported. In the near future, we expect an evolution toward application
development models that provide the extended transaction and workflow ca-
pabilities to suit the needs of complex applications accessing heterogeneous
systems.

1.3 The Future Research


The data management research, in the past few decades, has developed into
one of the great success stories of computer science, both in terms of signif-
icant theoretical results and practical commercial value. However, the tech-
nological and information environment of the world is changing rapidly. Ad-
vances in computer hardware, communication networks, software engineering
methodology have enabled the evolution of data management from manual
data processing to complex data management systems. This progress in com-
puter hardware, communication networks, software engineering methodology
is expected to continue for many more years. We observe the phenomenon
of the "information explosion", which refers to the increasing amount of in-
formation available now in a digital form. This information explosion is the
result of:
• low-cost computing and storage hardware and easy-to-use software, which
have made computers available and accessible to almost everybody,
• low-cost Internet access, which makes it easy and attractive to put all
information into cyberspace, and makes it accessible to almost everybody,
and
• availability of simple, easy-to-use interfaces (WWW browsers).
Data management has advanced in parallel with these changes developing
new solutions, techniques and technologies to meet the requirements of new
applications and increasing demands of users. Still, much of the current re-
search effort is aimed at increasing the functionality and performance of
DBMSs, and related data management technologies. Another important
and emerging goal of current research is to make DBMSs easier to use.
Users expect to add new applications and new services with almost no ef-
fort; they expect automated management with intuitive graphical interfaces
for all administration, operation, tuning, and design tasks. Users expect sim-
ple and powerful tools to browse, search, manage, and visualize the data.
However, the information demands of the changing world are also stressing
the limitations of current data management technology. Many data manage-
ment challenges remain. The most recent report on database research, the
"Asilomar Report" [BBC+98], emphasizes the new challenges and demands
for research in database and information systems today: "the database re-
search community should embrace a broader research agenda - broadening
the definition of database management to embrace all the content of the Web
and other on-line stores, and rethinking fundamental assumptions in light of
technology shifts". The fundamental data management issues have changed
dramatically in the last decade. To meet the needs of the future information
society, data management software must be developed in several directions
simultaneously, including the following challenges:
• Management of changes to the database and data warehouse schemas
with the same facility as current systems manage changes in data;
• Defining the data models for new data types and integrating them with
the database and data warehouse technology;
• Developing new data mining and data analysis algorithms adapted to
deal with huge databases on secondary and tertiary storage devices;
• Scaling databases in size, space, and diversity;
• Developing new, more flexible workflow models;
• Integrating the information from multiple external data sources;
• Developing new solutions for large heterogeneous federated systems
(query optimization, data cleaning, data quality management, etc.);
• Integration of structured and semistructured data over the Internet;
• Developing models and data processing algorithms for continuous data
streams;
• Developing new solutions and data management capabilities for e-
commerce applications.
The above-mentioned list of challenging problems is, of course, very limited.
It is simply infeasible to enumerate all interesting issues and challenges that
should be addressed by database researchers to meet the increasing demands
of the future information society. For example, one of the most promising and
rapidly developing areas, for which new data storage and processing capabil-
ities should be provided, is the domain of bioinformatics [BFK03]. Bioinfor-
matics addresses the management of genetic information, with special
emphasis on DNA sequence analysis. It needs to be broadened into a wider
scope to embrace all types of biological data: its modeling, storage, retrieval,
and management.
Finally, we would like to recall the recommended near-future goal for
the database and data management community formulated in the "Asilomar
Report" [BBC+98]: "make it easy for everyone to store, organize, access, and
analyze the majority of human information on-line" .

2 Survey of the Volume


Data management has evolved from simple record-oriented navigational data-
base systems (hierarchical and network systems) to set-oriented systems that
gave way to relational database systems. The relational database systems are
now evolving to object-relational and multimedia database systems. During
last years, database systems have been used as platforms for managing data
for conventional transaction-oriented business applications. As organizations
and companies have become more sophisticated, pressure to provide infor-
mation sharing across different, heterogeneous data sources has mounted.
The problem of how to integrate and provide organization-wide uniform ac-
cess to data and software resources distributed across heterogeneous and
autonomous external data sources (file systems, databases, Web pages, etc.)
can be addressed in a few different ways: federated databases, data ware-
housing, and mediation. Chapter 2 introduces basic database concepts and
provides a brief overview of basic data models (hierarchical, network, rela-
tional, and object-oriented). Then, the chapter discusses the basic approaches
to information integration.

The process of modeling and formalizing data requirements with a formal
conceptual modeling tool is called data modeling and is an intrinsic part of the
(information system) database design process. Good database design requires
a thorough understanding of organizational and user data requirements. This
includes identifying what data associated with an enterprise is important,
what data should be maintained for the enterprise, and what business rules
apply to the data. This information is used to develop a high-level description
of the data to be stored in the database, along with the constraints that hold
over this data. This step is known as data modeling, and is carried out using
an abstract data model, called a conceptual data model, which allows one to
describe the structure of information to be stored in a database at a high level
of abstraction. Formalizing organizational data requirements with conceptual
data models serves two important functions. Firstly, conceptual data models
help users and system developers identify data requirements. They encourage
high-level problem structuring and help to establish a common ground on
which users and developers can communicate to one another about data and
system functions. Secondly, conceptual models are useful in understanding
how existing systems are designed. Even a very simple stand-alone system can
be better explained and represented with an abstract data model.
Over the years, data modeling has evolved from simple data models that
focus on machine-oriented constructs to much more sophisticated data mod-
els focusing on capturing the structure of knowledge as perceived by users
for whom the database is developed. Chapter 3 presents an overview of the
basic data models and traces the evolution of data modeling in recent
years. It also discusses the increasing level of abstraction demonstrated by
recent developments in data modeling, as well as presents a framework for
understanding the ontological and conceptual foundation of data modeling
as human activities of creating models of the world. The chapter concludes
by outlining directions for future research in data modeling.

Chapter 4 focuses on the current status and research and development
agenda for object-oriented database technology. The chapter introduces ba-
sic concepts of object-oriented data modeling, illustrated by examples. Then,
the chapter provides the description of the OQL query language for object-
oriented database systems. Further, the chapter addresses the technical is-
sues of object-oriented database systems like physical object management,
architecture of client-server systems, indexing techniques for object-oriented
database systems, and examines query optimization issues.

One of the most important trends in databases is the increased use of par-
allel processing and data partitioning techniques in database management
systems. Parallel DBMSs are based on the premise that single-processor
systems can no longer meet the growing requirements for cost-effective scal-
ability, reliability, and performance. A powerful and financially attractive
alternative to a single-processor-driven DBMS is a parallel DBMS driven
by multiple processors. With the predicted future database sizes and com-
plexity of queries, the scalability of parallel database systems to hundreds
and thousands of processors is essential for satisfying the projected demands.
Parallel DBMSs can improve the performance of complex query execution
through parallel implementation of various operations (load, scan, join, sort)
that allows multiple processors to automatically share the processing work-
load. Chapter 5 describes three key components of a high performance parallel
database management system: data partitioning techniques, algorithms for
parallel processing of a join operation, and a data migration technique that
controls the placement of data to respond to changing workloads and evolving
hardware platforms.
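
As a small illustration of the partitioning idea, the following Python sketch (node count, keys, and hashing scheme are assumptions of the example) hash-partitions a table's rows across processing nodes so that each node can scan and process its own fragment in parallel.

    import hashlib

    NUM_NODES = 4   # assumed number of processing nodes

    def node_for(key: str) -> int:
        """Map a partitioning-attribute value to a node by hashing."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_NODES

    rows = [("cust-001", "Anna"), ("cust-002", "Jan"), ("cust-003", "Eva"),
            ("cust-004", "Piotr"), ("cust-005", "Maria")]

    partitions = {n: [] for n in range(NUM_NODES)}
    for key, value in rows:
        partitions[node_for(key)].append((key, value))

    for node, part in partitions.items():
        print(f"node {node}: {part}")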

Database systems have emerged into ubiquitous components of any large
software system over the last decades. They offer comprehensive capabili-
ties for storing, retrieving, querying, and processing data. However, recent
changes in the computing infrastructure (computer hardware, data and com-
munication networks, development of the Web, development of electronic
commerce platforms, etc.) accompanied by increasing demands of the user
community with regard to data management, require new capabilities and func-
tionality of database and data management systems. Advanced database sys-
tems, discussed in Chapter 6, try to meet the requirements of present-day
applications by offering advanced functionality in terms of languages, sys-
tem features, new data types support, data integration capabilities, etc. The
chapter surveys the state-of-the-art in these areas.

Multimedia database systems are the systems that manage multimedia
information, facilitate multimedia for presentations, and use specific tools for
storage, management, and retrieval of multimedia data. Chapter 7 presents
an overview of different techniques and their interoperability necessary for
the design and implementation of multimedia database systems. The chapter
describes the characteristics of multimedia data, data models for multimedia
data, requirements for multimedia applications, and presents algorithms and
structures for multimedia retrieval. Further, different aspects of distributed
and parallel processing of multimedia data are discussed, and different ap-
proaches to the parallel execution of retrieval operations for multimedia data
are considered. Finally, the chapter presents a case study of a cluster-based
prototype, called CAIRO, for image retrieval.
Chapter 8 presents the main concepts of the workflow technology, and
evaluates the current state of this technology from the point of view of the
requirements of new advanced applications. Based on this evaluation, the
problems that could not be adequately addressed by the existing commer-
cial products are identified. The chapter concludes by outlining directions for
future research in workflow management.
Chapter 9 presents an overview of the basic concepts and techniques neces-
sary for the design, implementation, and operation of data warehouse systems.
The chapter describes data warehouse architectures, components of the data
warehouse architecture, and data analysis tools used for analyzing and eval-
uating data stored in a data warehouse. The chapter concludes with the dis-
cussion of concepts and procedures for building data warehouses, and future
research directions.
The integration of wireless technology, database technology, and distrib-
uted processing has led to mobile computing and mobile databases. Mobile
computing systems are becoming increasingly commonplace as people more often
conduct their activities away from their offices and homes while requiring ac-
cess to some data repositories. Mobile computing may be considered a varia-
tion of distributed computing. However, there are a number of hardware and
software problems that must be resolved before the capabilities of mobile
computing can be fully utilized. Some of these problems, associated with data
management, transaction management, transactional recovery, and query opti-
mization, are partially similar to those encountered in distributed database
systems. However, in a mobile computing environment, these problems become
more difficult to solve. Some research problems are specific to mobile com-
puting: migrating applications, migrating objects and agents, the relatively
short active life of the power supply, the relatively narrow bandwidth of
the wireless communication channels. All these problems pose many research
challenges. Chapter 10 addresses data and transaction management issues
in a mobile computing environment, analyzes the past and present of mobile
computing, mobile computing devices, architectures for mobile computing,
and advanced applications for mobile computing platforms.
As we have already mentioned, over the past decade, many organiza-
tions, companies, and institutions collected huge volumes of data describing
their operations, products, and customers. At the same time, scientists and
engineers in many fields have been capturing increasingly complex experimental
data sets, such as those describing brain activity in humans. The new field of data
mining addresses the question of how best to use these data to discover general
regularities, anomalies, trends, and rules, and to improve the process of decision
making. Chapter 11 presents an overview of the data mining process and funda-
mental data mining problems. Further, the basic data mining techniques are
discussed and some of them (mining associations, classification, and cluster-
ing) are presented in detail. The chapter concludes with the discussion of
future research directions.
We should note that there are also some other subject areas relevant to
the research and development agenda for the next-generation data manage-
ment systems, namely: deductive and object-deductive systems, XML and
semistructured data management, genome data management, database tun-
ing and administration, real-time database systems, data stream issues. Un-
fortunately, it was impossible, due to the scope limitation of the handbook, to
present all aspects of data management issues. Therefore, the reader should
not take the absence of chapters on these topics to mean that they are
unimportant. The handbook covers a large body of knowledge
on currently available data management technologies and, we hope, will be
useful for understanding further developments in the field of data management.

References

[AW98] Abdelguerfi, M., Wong, K.-F. (eds.), Parallel database techniques, IEEE Computer Press, Los Alamitos, 1998.
[BBC+98] Bernstein, P., Brodie, M., Ceri, S., et al., The Asilomar report on database research, SIGMOD Record 27(4), 1998, 74-80.
[BEP+00] Blazewicz, J., Ecker, K., Plateau, B., Trystram, D. (eds.), Handbook on parallel and distributed processing, Springer-Verlag, 2000.
[BFK03] Blazewicz, J., Formanowicz, P., Kasprzak, M., Selected combinatorial problems of computational biology, European Journal of Operational Research, 2003 (to appear).
[CD97] Chaudhuri, S., Dayal, U., An overview of data warehousing and OLAP technology, SIGMOD Record 26(1), 1997, 65-74.
[CB02] Connolly, T., Begg, C., Database systems: a practical approach to design, implementation, and management, 3rd ed., Addison-Wesley, 2002.
[EN00] Elmasri, R., Navathe, S., Fundamentals of database systems, 3rd ed., Addison-Wesley, 2000.
[Gra96] Gray, J., Evolution of data management, IEEE Computer 29(10), 1996, 38-46.
[HK00] Han, J., Kamber, M., Data mining: concepts and techniques, Morgan Kaufmann Pub., 2000.
[Hol03a] Holsapple, C.W. (ed.), Handbook on knowledge management: knowledge matters, Springer-Verlag, 2003.
[Hol03b] Holsapple, C.W. (ed.), Handbook on knowledge management: knowledge directions, Springer-Verlag, 2003.
[Kim95] Kim, W. (ed.), Modern database systems: the object model, interoperability, and beyond, ACM Press, New York, 1995.
[KS97] Korth, H.F., Silberschatz, A., Database research faces the information explosion, Communications of the ACM 40(2), 1997, 139-142.
[Kur99] Kurz, A., Data warehousing - enabling technology, MITP-Verlag, 1999.
[LBK02] Lewis, P.M., Bernstein, A., Kifer, M., Databases and transaction processing: an application-oriented approach, Addison-Wesley, 2002.
[LR00] Leymann, F., Roller, D., Production workflow - concepts and techniques, Prentice Hall, Upper Saddle River, NJ, 2000.
[RS95] Rusinkiewicz, M., Sheth, A., Specification and execution of transactional workflows, in: W. Kim (ed.), Modern database systems, Addison-Wesley, Reading, MA, 1995, 592-620.
[SSU91] Silberschatz, A., Stonebraker, M.J., Ullman, J., Database systems: achievements and opportunities, SIGMOD Record 19(4), 1991, 6-22 (also in Communications of the ACM 34(10), 1991, 110-120).
[SSU96] Silberschatz, A., Stonebraker, M.J., Ullman, J. (eds.), Database research: achievements and opportunities into the 21st century, SIGMOD Record 25(1), 1996, 52-63.
[SZ96] Silberschatz, A., Zdonik, S.B., Strategic directions in database systems - breaking out of the box, ACM Computing Surveys 28(4), 1996, 764-778.
[WF00] Witten, I.H., Frank, E., Data mining: practical machine learning tools and techniques with Java implementations, Morgan Kaufmann Pub., 2000.
2. Database Systems: from File Systems to
Modern Database Systems

Zbyszko Krolikowski and Tadeusz Morzy

Institute of Computing Science, Poznan University of Technology, Poznan, Poland

1. Introduction - Database Concepts ................................. 19
2. Database System Generations ..................................... 21
3. Network Database Systems ........................................ 22
4. Hierarchical Database Systems .................................... 25
5. Relational Database Systems ...................................... 29
6. Object-Oriented Database Systems................................ 33
7. Federated, Mediated Database Systems and Data Warehouses ..... 38
7.1 Federated Database System................................... 38
7.2 Mediated System .............................................. 42
7.3 Data Warehouse System ....................................... 44
8. Conclusions....................................................... 47

Abstract. Database systems have evolved from simple record-oriented naviga-
tional database systems (hierarchical and network systems) into set-oriented
relational database systems. The relational database systems are
now evolving into object-relational and multimedia database systems. During the
last years, database systems have been used as platforms for managing data for
conventional transaction-oriented business applications. As organizations and com-
panies have become more sophisticated, pressure to provide data integration across
different, heterogeneous data sources has mounted. The problem of how to provide
organization-wide uniform access to heterogeneous and autonomous external data
sources (file systems, databases, Web pages, etc.) can be addressed, generally, in
three different ways: federated databases, data warehousing systems, and mediated
systems.
This chapter introduces the reader to basic concepts of database systems and
data integration. We start with a presentation of central ideas and foundations
of database systems. After tracing the database systems' evolution, we briefly ex-
plore the background, characteristics, advantages and disadvantages of the main
database models: hierarchical, network, relational, and object-oriented. Then, the
chapter discusses the basic approaches to data integration and then describes
each approach in more detail.

1 Introduction - Database Concepts

During the past forty years, databases have ceased to be simple file systems
and have become collections of data that simultaneously serve a community of
users and several distinct applications. For example, an insurance company
might store in its database the data for policies, investments, personnel, and
planning. Although databases can vary in size from very small to very large,
most databases are shared by multiple users or applications [Br086].
Typically, a database is a resource for an enterprise, in relation to which
the three following human roles are distinguished: a
database administrator, application programmers, and end users. A database
administrator is responsible for designing and maintaining the database. Ap-
plication programmers design and implement database transactions and ap-
plication interfaces, whereas end users use prepared applications and, possi-
bly, high level database query languages. The design of database applications
can be stated as follows. Given the information and processing requirements
of an information system, construct a representation of the application that
captures the static and dynamic properties needed to support the required
transactions and queries. A database represents the properties common to all
applications, hence it is independent of any particular application. The pro-
cess of capturing and representing these properties in the database is called
database design.
The representation that results from database design must be able to
meet ever-changing requirements of both the existing and new applications.
A major objective of database design is to assure data independence, which
concerns isolating the database and the associated applications from logical
and physical changes. Ideally, the database could be changed logically (e.g.
add objects) or physically (e.g. change access structures) without affecting
applications, and applications could be added or modified without affecting
the database.
Static properties include the following: objects, object properties (called
attributes), and relationships among objects. Dynamic properties encompass
query and update operations on objects as well as relationships among oper-
ations (e.g. to form complex operations called transactions). Properties that
cannot be expressed conveniently as objects or operations are expressed as
semantic integrity constraints. A semantic integrity constraint is a logical
condition expressed over objects (i.e., database states) and operations.
The result of database design is a schema that defines the static properties
and specifications for transactions and queries that define the dynamic prop-
erties. A schema consists of definitions of all application object types, includ-
ing their attributes, relationships, and static constraints. Thus, a database is
a data repository that corresponds to the schema. A database consists of in-
stances of objects and relationships defined in the schema. A particular class
of processes within an application may need to access only some of the static
properties of a predetermined subset of the objects. Such a subset, which
is called a subschema or view, is derived from the schema much as a query
is defined. Logical database integrity is connected with the schema concept.
A database exhibits logical integrity if the values in the database are legal
instances of the types in the schema and if all semantic integrity constraints
are satisfied.

A database aims at answering queries and supporting database transac-
tions. A query can be expressed as a logical expression over the objects and
relationships defined in the schema and results in identifying a logical subset
of the database. A transaction consists of several database queries and up-
date operations over objects in a subschema and is used to define application
events or operations. Transactions are atomic since all steps of a transaction
must be completed successfully or the entire transaction must be aborted
(i.e., no part of a transaction is committed before the whole transaction is
completed).
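To make the atomicity requirement concrete, consider the following SQL-style sketch of a funds-transfer transaction (the account table and its columns are invented for this illustration and do not come from the text):

START TRANSACTION;
UPDATE account SET balance = balance - 100 WHERE account_no = 'A-101';
UPDATE account SET balance = balance + 100 WHERE account_no = 'A-202';
COMMIT;

If any step fails before the COMMIT, the system (or the application) issues a ROLLBACK instead, so that no partial effect of the transaction ever becomes visible in the database.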

A data model is a collection of mathematically well defined concepts that
express the static and dynamic properties and integrity constraints for an ap-
plication. They include concepts for defining schemas, subschemas, integrity
constraints, queries and transactions. A data model provides a syntactic and
semantic basis for tools and techniques used to support the design and use
of a database. Tools associated with data models are languages for defining,
manipulating, querying, and supporting the evolution of databases. The majority
of existing database management systems provide a Data Definition Lan-
guage (DDL) for defining schemas and subschemas, a Data Manipulation
Language (DML) for writing database programs, and a Query Language (QL)
for writing queries. Many database languages combine both query and up-
date operations. These languages can be provided on a stand-alone basis,
embedded as call statement in a host language, or integrated directly into a
high level programming language.
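As a purely illustrative example (the employee table and its columns are hypothetical, and the statements are written in generic SQL-style syntax), the three kinds of languages could be exemplified as follows:

CREATE TABLE employee (emp_no INTEGER PRIMARY KEY,      -- DDL: schema definition
                       name   VARCHAR(30),
                       salary DECIMAL(9,2));

INSERT INTO employee VALUES (1001, 'Smith', 3200.00);    -- DML: updating the database

SELECT name FROM employee WHERE salary > 3000;           -- QL: querying the database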

Hierarchical, network, relational, post-relational and object-oriented data
models have been developed. Detailed discussions of data model concepts
can be found in [Dat95,EN99,KS86,TL76,Ull89]. The classical data models
are based on common concepts (e.g., records, attributes, relationships, and
unique valued identifying fields) that were inherited from their ancestors,
simple file systems. Nevertheless, the notation and some concepts are specific
to each model. A database management system (DBMS) is a system that
implements the tools associated with a data model, e.g., the DDL, DML and
QL, and the processors needed to implement schemas and execute transactions
and queries. Consequently, we take into consideration hierarchical, relational,
post-relational and object database management systems. A short overview
of such systems will be given in the next sections.

2 Database System Generations

Database systems constitute a widely accepted tool for the computer-aided
management of large, formatted collections of data. As in numerous other
areas of computer science, their historical development has been closely con-
nected with the development of computer hardware and software. With
respect to hardware development, it is now common to talk about "computer
generations" , and in a similar way several "database system generations" can
already be distinguished. In this chapter, a brief historical perspective of this
evolution will be presented.
The field of databases has always been influenced by several other dis-
ciplines, e.g. hardware-oriented areas, such as the development of fast, sec-
ondary memory, in particular magnetic disks, for storing large amounts of
data. Data structures and operating systems are closely related to this devel-
opment. Data structures are used to manage data in secondary memory in
such a way that an efficient update and retrieval becomes feasible. Operat-
ing systems nowadays provide, for example, multiprogramming, which is an
important function used by a database system.
The "history" of database systems to date can be divided into five gen-
erations, which roughly correspond to the five decades of computing starting
from the 1950s. The first two decades were concerned with predecessors of
database systems. A central role in this development was played by the on-
going evolution of hardware and software on the one hand, and a continuous
change in user requirements for data processing on the other.
The first computing generation concerns the 1950s, when the major task of
any computer system was to process data (mainly calculating and counting)
under the control of a program. Each individual program was either directly
provided with the data set it operated upon, or it read its data from some
secondary memory into the main memory of the computer, processed it,
and finally wrote the eventually modified set back to secondary memory.
"Secondary memory" then referred to punched cards or to magnetic tapes,
both of which allowed sequential processing only. Thus, the first file systems
exclusively allowed a sequential access to the records of a file.
The early 1960s marked the second generation, which was different from
the first one in several aspects. On the one hand, it became possible to use
computers in interactive mode as well as batch mode. On the other hand, the
development of magnetic disks as fast secondary memory resulted in more
sophisticated file systems, which now supported multiple access. A direct
access file allows access to a record in that file directly via its address on the
disk, without having to read or to browse through all the records which are
physically located in front of it. Such an address can be located, for example,
in a special index file or found by using a hash function.
Both generations were thus characterized by the availability of file systems
only, which strictly speaking are the forerunners of database systems. The
static association of certain data files with individual programs is of vital
importance for the use of a file system.
The third generation roughly coincides with the 1970s, although it actu-
ally started in the middle of the 1960s. It is characterized by the introduction
of a distinction between logical and physical information, which occurred in
parallel with an increasing need to manage large collections of data. During that
time, data models (i.e., hierarchical and network) were used for the first time
to describe physical structures from a logical point of view. However, the then
emerging approaches such as the hierarchical or the network model have to
be classified as "implementation-oriented".
Starting from this distinction between the logical meaning of data (i.e., the
syntax and the semantics of its description) and its current, physical value,
systems were developed that could integrate all the data of a given application
into one collection, henceforth termed a database. A database provided
individual users of this collection only with a particular "view" of it.
The fourth generation reached the marketplace in the 1980s. The systems of
this generation, now generally called database systems, in addition to storing
data redundancy-free under centralized control, make a clear distinction be-
tween a physical and a logical data model, which is particularly true for the
relational model of data. Systems based on this model are typically provided
with a high degree of physical data independence and the availability of pow-
erful languages. The fourth generation also saw an increasing penetration of the
database area from a theoretical point of view, which in particular resulted
in a now comprehensive theory of relational databases.
The third generation might be termed "pre-relational" and the fourth
one may be called "relational". The fifth generation, which is beginning to
emerge in the 1990s, is termed "post-relational". As the relational model in
particular, and systems based on it, have produced nice tools and solutions
for a large number of commercial applications, people have begun to under-
stand that various other areas of application could benefit from database
technology. This is resulting in the development of object-oriented systems,
logic-oriented systems, and extensible systems.

3 Network Database Systems

Several commercial database systems based on the network model emerged
in the sixties. These systems were studied extensively by the Database Task
Group (DBTG). The first database standard specification, called the CODA-
SYL DBTG 1971 report, was written by the DBTG. Since then, a number
of changes have been suggested to that report and the last official version of
the report was published in 1978.
Computer Associates developed the network database system IDMS
(Integrated Database Management System), which ran on IBM mainframes
under most of the standard IBM operating systems. It is probably the best
known example of what is usually referred to as a "CODASYL system" -
that is, a system based on the proposals of the DBTG of the Programming Lan-
guage Committee (later renamed the COBOL Committee) of the "Conference on
Data Systems Languages" (CODASYL), the organization responsible for the
definition of COBOL.
A data-structure diagram is a schema for a network database. The network
data structure can be regarded as an extended form of the hierarchical data
structure. Such a diagram consists of two basic components: boxes, which
correspond to record types, and lines, which correspond to links. A data-
structure diagram specifies the overall logical structure of the database and
serves the same purpose as an entity-relationship diagram does nowadays.
A network database consists of a collection of records, which are connected
with each other through links. A link is an association between exactly two
records. Records are organized in the form of an arbitrary graph. More pre-
cisely, a network database consists of the following two sets. The first one is a set
of multiple occurrences of each of several types of record. The second one is
a set of multiple occurrences of each of several types of link. Each link type
involves two record types i.e., a parent record type and a child record type.
Each occurrence of a given link type consists of a single occurrence of the
parent record type, together with an ordered set of multiple occurrences of
the child record type. Thus, in the DBTG model, only one-to-one and one-
to-many links can be used, whereas many-to-many links are disallowed in
order to simplify the implementation.
Let us remark that "link", "parent" and "child" are not CODASYL-
DBTG terms. In the CODASYL terminology, links are called sets, parents
are called owners, and children are called members. Thus, a data-structure
diagram consisting of two record types that are linked together is referred to
in the DBTG model as a DBTG-set. Each DBTG-set has one record type
designated as the owner of the set, and the other record type designated as
the member of the set. A DBTG-set can have any number of set occurrences.
The DBTG model allows a field (or collection of fields) to have a set of val-
ues, rather than one single value. For example, suppose that a customer has
several addresses. In this case, the customer record type will have the (street,
city) pair of fields defined as a repeating group.
As an example, we show in Fig. 3.1 how the suppliers-and-parts database
could be represented in the network data model. The database contains three
record types, namely Suppliers (S), Parts (P) and Shipment (SP). In place
of the two foreign keys SP.S# and SP.P#, we have two link types, namely
S-SP and P-SP, where:

• each occurrence of S-SP consists of a single occurrence of S, together with
one occurrence of SP for each shipment by the supplier represented by
that S occurrence;
• each occurrence of P-SP consists of a single occurrence of P, together
with one occurrence of SP for each shipment of the part represented by
that P occurrence.

Fig.3.1. Network version of the "suppliers-and-parts" database

The data manipulation language of the DBTG model consists of a set of
operators for processing data represented in the form of record and links. The
operators are embedded in a host language. The find and get commands are
the most frequently used in DBTG systems. There are a number of different
formats for the find command. The main distinction among them is whether
individual records are to be located or whether records within a particular
set occurrence are to be located. There are various mechanisms available in
the DBTG model for updating information in the database. These include
the creation and deletion of records (via the store and erase operations) as
well as the modification (via the modify operation) of the content of existing
records. In order to insert records into and remove records from a particular
set occurrence, the connect, disconnect, and reconnect operations are made
available.
A schema, written in the IDMS schema data description language, defines
an IDMS database. The schema for a given database defines the records in
the database, the fields they contain, and the "sets" (links) in which they
participate as either "owner" (parent) or "member" (child). The schema is
compiled by the schema DDL compiler, and the output from the compilation
is stored in IDMS dictionary.
Users interact with the database via a user view of that database, defined
by a subschema. A subschema is a simple subset of the schema. Subschemas
are written in the IDMS subschema Data Definition Language (DDL) and
next they are compiled by the subschema DDL compiler. The compilation
output is stored in the IDMS dictionary.
As for data manipulation, IDMS is basically invoked by means of a host
language CALL interface. However, users do not have to code the calls di-
rectly; instead, IDMS provides a set of Data Manipulation Language (DML)
statements (such as FIND, GET, STORE) together with preprocessors to
translate those DML statements into the appropriate host language calling
sequences. The syntax of DML statements resembles the syntax of the host
language, Preprocessor is provided for the following host languages: COBOL,
PL/I, FORTRAN, and System/370 Assembler Language.
C.J. Date in [Dat95] gave, among others, the following critical comments on
network systems in general, and on CODASYL systems and IDMS in particular.
Networks are complicated; consequently, the data structures are complex.
The operators are complex; and note that they would still be complex, even
if they functioned at the set level instead of just on one record at a time.

4 Hierarchical Database Systems

A hierarchical database consists of a collection of records, which are con-
nected with each other through links. Each record is a collection of fields,
each of which contains only one data value. A link is an association between
exactly two records. The hierarchical data model is thus similar to the net-
work data model in the sense that data and relationships among data are
also represented by records and links, respectively. The hierarchical model
differs from the network model in that the records are organized as collection
of trees rather than arbitrary graphs.
A schema for a hierarchical database is a tree-structure diagram. Such a
diagram consists of two basic components: boxes, which correspond to record
types, and lines, which correspond to links. A tree-structure diagram spec-
ifies the overall logical structure of the database and it is similar to a data-
structure diagram in the network model. The main difference between these
two data models is that in the former, record types are organized in the form
of an arbitrary graph, while in the latter record types are organized in the
form of a rooted tree.
The database scheme is thus represented as a collection of tree-structure
diagrams. A single instance of a database tree exists for each such diagram.
The root of this tree is a dummy node. The children of that node are actual in-
stances of the appropriate record type. Each such instance may, in turn, have
several instances of various record types, as specified in the corresponding
tree-structure diagram. The data manipulation language consists of a num-
ber of commands that are embedded in a host language. These commands
access and manipulate database items as well as locally declared variables.
Data retrieval from the database is accomplished through the get com-
mand. The command first locates a record in the database, then sets
the currency pointer to point to it, and finally copies that record from the
database to the working area of the appropriate application program.
There are various mechanisms used for updating information in the
database, which include the creation and deletion of records as well as the
modification of the content of existing records.

One of the earliest database systems that became commercially available
was the IBM product, called Information Management System - IMS. It was
designed for the MVS environment. The first version of the system (IMS 360
Version 1) was released in 1968 - and at that time it was one of the top
three products, if not the top product, in the mainframe marketplace, both
in terms of the number of systems installed and user commitment.
Hierarchical systems were not originally constructed on the basis of a pre-
defined abstract data model. Rather, such a model was defined after the event by
a process of abstraction from implemented systems. A hierarchical database
consists of an ordered set of trees - more precisely, an ordered set consisting
of multiple occurrences of a single type of tree.
A tree type consists of a single "root" record type, together with an
ordered set of zero or more dependent subtree types. A subtree type in turn
consists of a single record type - the root of the subtree type - together with
an ordered set of zero or more lower-level dependent subtree types, and so
on. The entire tree type thus consists of a hierarchical arrangement of record
types.
The relationships formed in the tree must be such that only one-to-many
or one-to-one relationships exist between a parent and a child.
As an example, consider the education database of Fig. 4.1, which con-
tains information about the education system of an industrial company. The
education department of the company offers several training courses. Each
course is offered at a number of different locations within the organization,
and the database contains details both of offerings already given and of offer-
ings scheduled to be given in the future. The database contains the following
information:
• for each course: course number, course title, details of all immediate pre-
requisite courses, and details of all offerings;
• for each prerequisite course for a given course: course number for that
prerequisite course;
• for each offering of a given course: offering number, date, location, details
of all teachers, and details of all students;
• for each teacher: employee number and name;
• for each student: employee number, name, and grade.
The tree type for the education database has COURSE as its root record
type and has two subtree types, rooted in the PREREQUISITE and OFFERING
record types, respectively. Note that this set of two subtree types is ordered
- that is, the PREREQUISITE subtree type definitely precedes the OFFERING
subtree type (see Fig. 4.1). The subtree type rooted in PREREQUISITE is
"root only"; by contrast, the subtree type rooted in OFFERING in turn has
two lower-level subtree types, both root only, rooted in the TEACHER and
STUDENT record types, respectively.
The database thus contains five record types: COURSE, PREREQUISITE,
OFFERING, TEACHER and STUDENT. COURSE is the root record type,
Fig. 4.1. Structure of the education database (COURSE at the root, with PREREQUISITE and OFFERING as its children, and TEACHER and STUDENT as children of OFFERING)

the others are dependent record types. Furthermore, COURSE is said to
be the parent record type for the PREREQUISITE and OFFERING record types,
whereas PREREQUISITE and OFFERING are said to be child record types
for the COURSE record type. Likewise, OFFERING is the parent record
type for TEACHER and STUDENT, whereas TEACHER and STUDENT
are child record types for OFFERING. The connection between a given child
and its corresponding parent is called a link.
In the hierarchical data model the root/parent/child terminology just in-
troduced for types also applies to occurrences. Thus, each tree occurrence
consists of a single root record occurrence, together with an ordered set of
zero or more occurrences of each of the subtree types immediately dependent
on the root record type. Each of those subtree occurrences in turn consists
of a single record occurrence - the root of the subtree occurrence - together
with an ordered set of zero or more occurrences of each of the subtree types
immediately dependent on that root record type, and so on. For an illustra-
tion, see Fig. 4.2, which shows a single tree from the education database of
Fig. 4.1.
The notion of ordering is vital to the hierarchical data structure. Each
individual tree in the database can be regarded as a subtree of a hypothetical
"system" root record. Consequently, the entire database can be considered
as a single tree. It follows that the notion of hierarchical sequence defined
above applies to the entire database as well as to each individual (sub)tree.
That is, the notion of hierarchical sequence defines a total ordering for the
set of all records in the database, and database can be regarded as being log-
ically stored in accordance with that total ordering. This idea is particularly
Fig. 4.2. Sample tree for the education database

important in IMS, because many of the Information Management System
manipulative operators are defined in terms of that total ordering.
A hierarchical data manipulation language consists of a set of operators for
processing data represented in the form of trees. Examples of such operators
include the following:
• an operator to locate a specific tree in the database - e.g., to locate the
tree for course M23 (see Fig. 4.2);
• an operator to move from one tree to the next - e.g., to step from the
tree for course M23 to the tree that follows it in the hierarchical sequence
of the database;
• operators to navigate between records within such a tree by moving up
and down the various hierarchical paths - e.g., to step from the COURSE
record for course M23 to the first OFFERING record for that course;
• an operator to insert a new record at a specified position within such a
tree - e.g., to insert a new OFFERING into the tree for course M23;
• an operator to delete a specified record, and so on.
Note that such operators typically all operate at the record level. Thanks to the following
rule: "No child is allowed to exist without its parent", the hierarchical data
model includes "automatic" support for certain forms of referential integrity.
For example, if a given parent is deleted, the system will automatically delete
the entire (sub)tree that is rooted at that parent. Likewise, a child cannot be
inserted unless its parent already exists.
There are two principal definitional constructs in IMS, namely, the
database description (DBD) and the program communication block (PCB).

An IMS database is of course a hierarchical database - it consists of a hi-
erarchical arrangement of segments (i.e., records), and each segment in turn
consists of a collection of fields. Each such database is defined by means
of a DBD, which specifies the hierarchical structure of that database. How-
ever, users operate on views of those databases rather than directly on the
database. A given user's view of a given database consists basically of a
"subhierarchy", derived from the underlying hierarchy by omitting certain
segments and/or certain fields. Such a view is defined by means of a PCB,
which specifies the hierarchical structure of that view. As a result of these
two considerations, the true IMS picture is considerably more complicated than
our initial brief explanation might have suggested.
IMS is invoked via a CALL interface called DL/I (Data Language/I)
from application programs written in PL/I, Cobol, or System/370 Assembler
Language. Therefore, the user in IMS is definitely an application programmer.
The full IMS system includes not only the database management system
components, but also a data communication component.
Many details in our presentation of the Information Management System
were omitted. As a result, our explanations may make the system appear
unrealistically simple. In reality IMS is a very complex system with regard to
its internal structure as well as to the user interface. Indeed, one of Codd's
motivations for developing the relational model in the first place was precisely
to escape from the complexities of systems such as IMS.

5 Relational Database Systems

In 1970, Codd's classic paper, "A Relational Model of Data for Large Shared
Data Banks", presented the foundation for relational database systems. Since then,
many commercial relational database systems, such as Oracle, DB2, Sybase,
Informix, and Ingres, have been built. In fact, relational database systems
have dominated the database market for years. The remarkable success of
relational database technology can be attributed to such factors as having a
solid mathematical foundation and employing an easy to use query language,
Le., SQL (Structured Query Language). SQL is a declarative language in the
sense that users need only specify what data they are looking for in a database
without providing the information how to get the data. The relational data
model, basic relational operators, and the relational query language SQL are
briefly reviewed below [Dat95,EN99,KS86,Nei94, Ull89,Ram03].
In a relational database [Ram03], data are organized in table format.
Each table (or relation) consists of a set of attributes describing the table.
Each attribute corresponds to one column of the table. Each attribute is as-
sociated with a domain indicating the set of values the attribute can take.
Each row of a table is called a tuple, and it is usually used to describe one
real-world entity and/or a relationship among several entities. It is required
for any tuple and any attribute of a relation that the value of the tuple un-
der the attribute be atomic. The atomicity of an attribute value means that
no composite value or set value is allowed. For each relation, there exists
an attribute or a combination of attributes such that no two tuples in the
relation can have the same values under the attribute or the combination of
attributes. Such an attribute or combination of attributes is called a superkey
of the relation. Namely, each tuple of a relation can be uniquely identified by
its values under a superkey. If every attribute in a superkey is needed for it to
uniquely identify each tuple, then the superkey is called a key. In other words,
every key has the property that if any attribute is removed from it, then the
remaining attribute(s) can no longer uniquely identify each tuple. Clearly,
any superkey consisting of a single attribute is also a key. Each relation must
have at least one key. But a relation may have multiple keys. In this case,
one of them will be designated as the primary key, and each of the remaining
keys will be called a candidate key. Note that key and superkey are concepts
associated with a relation, not just the current set of tuples of the relation.
In other words, a key (superkey) of a relation must remain a key (su-
perkey) even when the instance of the relation changes through insertions
and deletions of tuples. Relational algebra is a collection of operations that
are used to manipulate relations. Each operation takes one or two relations as
the input and produces a new relation as the output. The operations are cho-
sen in such a way that all well-known types of queries may be expressed by
their composition in a rather straightforward manner. First, the relational
algebra contains the usual set operations: Cartesian product, union, inter-
section, and difference. Second, this algebra also includes the operations of
projection, selection, join and division. The latter are in fact characteristic
for the relational algebra and essential for its expressive power for stating
queries. If we represent the relation R as a table, then the projection of R
over the set of attributes X is interpreted as the selection of those columns
of R which correspond to the attributes X and elimination of duplicate rows
in a table obtained by such selection. Similarly, the operation of selection
applied to R may be interpreted as elimination of those rows from the table
R, which do not satisfy the specified condition.
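As a brief, illustrative sketch of these notions (the supplier relation, its attributes and the key choices are invented for the example and are not part of the text), keys can be declared and the projection and selection operations expressed in SQL as follows:

CREATE TABLE supplier (s_no   CHAR(5) PRIMARY KEY,    -- the designated primary key
                       name   VARCHAR(30),
                       tax_id CHAR(10) UNIQUE,        -- a candidate key
                       city   VARCHAR(20));

SELECT DISTINCT city FROM supplier;                   -- projection over {city}, duplicates removed
SELECT * FROM supplier WHERE city = 'Poznan';         -- selection of the qualifying rows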

Although the relational algebra is a simple formal language, it is not suit-
able for casual users of the database, especially those who are not educated in
mathematics and programming. As such, it is not a suitable practical query
language. A number of relational query languages have been designed and
implemented to serve as practical tools for casual users. Queries in such lan-
guages have a clear structure and meaning and are expressed in a way which
is much closer to the way one would ask such queries in ordinary English.
Moreover, complex queries can be expressed more easily in SQL than in relational
algebra. Today, the standard relational query language is SQL (Struc-
tured Query Language). SQL has many components dealing with different
aspects of managing data in the database, such as the definition and ma-
nipulation of data, interfacing with host programming languages (embedded
SQL), definition of constraints, and support for transactions.
The most fundamental concept of SQL is called the query block. Its basic
form is shown below.
SELECT (list of attributes)
FROM (list of relations)
WHERE (qualification expression).
The result of a query block execution is a relation whose structure and
content are determined by that block. Attributes of that relation are specified
in the list of attributes. The attributes listed are selected from the relations
in the list of relations. The first two clauses (SELECT and FROM) in the
query block are used to define the operation of projection. The qualification
expression in the WHERE clause is a logical expression. It contains attributes
of the relations listed in the FROM clause and it determines which tuples
of those relations qualify for the operation of projection. This means that
only the attributes of those tuples for which the qualification expression is
true will appear in the result of the query block. The WHERE clause thus
contains specification of the selection and the join operations.
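As an illustration (a sketch only; the attribute names SNO, PNO and CITY stand in for the supplier number, part number and supplier city of the suppliers-and-parts example used in Section 3), a query block that lists the parts supplied by suppliers located in Paris could be written as:

SELECT S.SNO, SP.PNO
FROM S, SP
WHERE S.SNO = SP.SNO AND S.CITY = 'Paris';

The SELECT and FROM clauses express the projection, while the WHERE clause contains both the join condition (S.SNO = SP.SNO) and the selection condition (S.CITY = 'Paris').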
Access, produced by Microsoft, is the most widespread DBMS for the
Microsoft Windows environment [ACP+99]. Access can be used in two ways:

• as an independent database manager on a personal computer;


• as an interface to work on data residing on other systems.

As an independent database manager, it suffers from the limitations of per-
sonal computer architecture. It offers limited support for transactions, with
rather simple and incomplete mechanisms for security, data protection and
concurrency control. On the other hand, it has a low cost and the applications
to which it is targeted do not typically require a sophisticated implementa-
tion of these services. Access applications are designed and run in a graphical
user interface.
Data resident in other databases can be read and written from Access
applications via the ODBC protocol. When Access is used as a client of a
relational server, it makes available its own interface features for the interac-
tion with the external system. In this context, Access can be seen as a tool
that allows the user to avoid writing SQL code, as it acquires schemas and
simple queries using a graphical representation that is easy to understand.
These inputs are translated into suitable SQL commands in a transparent
manner.
DB2 Universal Database belongs to a historic family of database man-
agement systems produced by IBM [ACP+99,Dat95]. The oldest member of
this family is SQL/DS, one of the first commercial systems based on the re-
lational model, made available by IBM at the beginning of the eighties. In its
turn, SQL/DS has its roots in System R. It was one of the first prototypes
of relational DBMSs developed, in the early seventies, in the IBM research
laboratories in San Jose. It was in the development environment of this pro-
totype that the SQL language was born. SQL soon became the standard for
all DBMSs based on the relational data model.
DB2 completely supports the relational model. Moreover, it offers some
object-oriented features and a rich set of advanced features, including:
• support for the management of non-traditional data types, such as texts,
images, sounds and video;
• support for the management of data types and functions defined by the
user;
• some extensions of SQL that include powerful On Line Analytical Pro-
cessing (OLAP) operators and special constructs to specify recursive
queries (see the sketch after this list);
• support for parallelism based on both "shared memory" configurations,
in which a database is managed by a symmetric multiprocessing (SMP)
machine, and "shared nothing" configurations, in which a database is
partitioned among different machines connected by a network.
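As a rough illustration of the kind of SQL extensions meant in the list above (written in generic SQL:1999-style syntax rather than DB2's exact dialect; the sales and part_structure tables are invented for the example), an OLAP-style aggregation and a recursive query might look as follows:

SELECT region, product, SUM(amount)
FROM sales
GROUP BY ROLLUP (region, product);        -- subtotals per region and a grand total

WITH RECURSIVE subparts (part, subpart) AS (
  SELECT part, subpart FROM part_structure WHERE part = 'P1'
  UNION ALL
  SELECT s.part, p.subpart
  FROM subparts s, part_structure p
  WHERE p.part = s.subpart
)
SELECT * FROM subparts;                   -- all direct and indirect subparts of part 'P1'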
The DB2 database server runs on Windows NT, OS/2 and several Unix-
based platforms. The client component is also available on Windows and
Macintosh environments for personal computers. Client and server can com-
municate on the basis of widespread communication protocol standards (TCP/IP,
NetBIOS, etc.). Moreover, the DB2 system can participate in heterogeneous and
distributed multi-database environments, using a protocol called Distributed
Relational Database Architecture (DRDA), adopted by many other database
management systems. Finally, DB2 provides support for the main interface
standards (such as ODBC, JDBC) and adheres to the SQL-2 standard.
Oracle [ACP+99,Nei94] is currently one of the main world producers
of software, and the range of products offered has as its foundation the
database management system Oracle Server, available for most types of com-
puter. Oracle is available on various platforms including PCs, local network
servers, workstations, mini-computers, mainframes and parallel supercom-
puters, which facilitates the integration among databases at various levels in
an organization. The functionality of an Oracle server can be enhanced by
various components, e.g.:

• Video Option: for management of multimedia data.


• Spatial Data Option: for the management of geographic and spatial data.
• ConText Option: for the management of unstructured text-type informa-
tion. This component adds to the server the functionality of an informa-
tion retrieval system.
• On Line Analytical Processing Option (OLAP): for increasing database
efficiency, when the database is used in tasks of data analysis and decision
support.
• Messaging Option: to use the database as a tool for the interchange of
messages between information system users.
• Web Applications Server: a proprietary HTTP server allowing access to
the database with a Web interface.
Version 8 of Oracle has introduced a few object-oriented features into the
relational engine, resulting in the so-called object-relational database system.
An object-relational system is compatible with previous relational applica-
tions.
The basic components of the object extension are the type definition ser-
vices, which improve significantly the domain definition services of relational
systems and are comparable to the type definition mechanisms of modern
programming languages. The SQL interpreter offers a create type command,
which permits the definition of object types. Each element of an object type
is characterized by an implicit identifier (the object id or OID).
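As a sketch of what such a definition may look like (Oracle-8-style syntax; the type address_t, its attributes and the customer table are invented for the illustration):

CREATE TYPE address_t AS OBJECT (
  street VARCHAR2(40),
  city   VARCHAR2(30),
  zip    VARCHAR2(10)
);

CREATE TABLE customer (
  name    VARCHAR2(40),
  address address_t         -- the user-defined type used as a column domain
);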
We stress the point that we have omitted a great amount of detail from
the description of hierarchical, network and relational systems.

6 Object-Oriented Database Systems


During the past decade, the nature of database applications has rapidly un-
dergone changes from simple applications to complex applications. Simple
applications retrieve a small number of flat records that contain numeric and
short symbolic data. Complex applications store and retrieve not only sim-
ple data but also complex nested data, compound data (e.g., sets, arrays,
structures), and multimedia data (e.g., images, audio, texts).
During the past several years, object-oriented database systems (OODBS)
and associated tools have been introduced to the market, partly in response
to the anticipated growth in the use of object-oriented programming lan-
guages and partly as an attempt to address some of the key deficiencies of
relational database systems that are due to the inherent restrictions in the
relational model of data.
An object-oriented database system [ACP+99,CBS98,EN99,Kim94] is a
system that satisfies the following two criteria: (1) it is a database management
system and (2) it is an object-oriented system. The first criterion implies
the following functionality: data persistence, secondary storage management,
authorization, concurrency control, data recovery after a system crash, and an ad
hoc query facility. The second criterion implies the support for: objects, en-
capsulation, types and classes, inheritance, overloading, and late binding. The
features that are also very important (but not mandatory) for an OODBS are
as follows: integrity constraints maintenance, views, support for data distri-
bution, query optimization, multiple inheritance, long transactions, versions
of data and versions of a database schema.
However, the basic DBMS functionality, as well as the additional functionality,
is only partially incorporated in existing commercial and prototype object-
oriented database systems. In spite of that, we will still call these systems
object-oriented database systems.

Several object-oriented prototypes as well as commercial database systems
have been built. The prototype systems include, among others: Exodus (Wiscon-
sin University) [Vos91], Ode (AT&T), Orion (MCC), Zeitgeist (Texas In-
struments). The commercial products include, among others: Itasca
(MCC), GemStone (Servio Corporation), ObjectStore (Object Design, Inc.),
Ontos (Ontologic), Objectivity/DB (Objectivity, Inc.), Poet (Poet Software),
Versant (Versant Corporation), and 02 (Ardent Software).
This section presents the following issues specific to OODBS: object per-
sistence, storage servers, object-oriented query language and query optimiza-
tion, design transactions, and versioning. One of the necessary features of an
object-oriented database system is the support for data persistence and man-
agement of these data. There are a few different ways of making the instances
of a class persistent: (1) designing a class as persistent, (2) attaching an ob-
ject to a schema element with a persistence feature, and (3) sending a message
to an object making it persistent. The first technique consists in designing a
class as a persistent one. After that every object of this class is automatically
persistent. Poet and ObjectStore, for example, use this technique. In order
to make a class persistent, a developer has to precompile it using a compiler
dedicated to a given OODBS. In the ODMG standard, in order to create a
so-called persistence-capable class, a designer has to make it a subclass of the
system class called Persistent_Object.
The second technique requires additional database schema elements that
will serve as persistent containers of objects. Such a schema element is called
the root of persistence, or a container. The root of persistence is attached to
a particular class and can contain any object whose type is compatible with that class.
An object added explicitly to the root of persistence becomes persistent.
The persistence is also attributed to an object if this object is referenced by
another persistent object; this feature is called persistence by reachability.
This policy has two main advantages: an object may be temporary or per-
sistent and the persistence is independent of the class definition. A similar
concept, called names, is used in 02. Every object that has been assigned a name
is persistent.
The third technique consists in sending a message persistent to an object.
Every object understands this message, which initiates its permanent storage
on secondary memory.
Storage servers can be classified by their unit of transfer and control.
Page servers and object servers are distinguished. Page servers manipulate
virtual memory pages where objects reside, but do not directly manipulate
objects themselves. The application that requests a given object receives the
whole page or pages, and it is the application that is responsible for finding
a required object on the received pages. The examples of page servers are
Exodus and GemStone.
Object servers, in contrast, manage individual objects or groups of ob-
jects. It is the server that is responsible for finding a required object and
sending it to an application. Some servers (e.g., ObServer, Zeitgeist) do not
interpret the types or classes to which an object belongs. As a consequence,
these servers cannot execute methods or access the properties of objects
they manage. Other servers (e.g., Orion, Itasca, 02) are able to interpret the
semantics of objects they manage, i.e. they are able to execute methods of
objects and access their values.
All of the prototype and commercial object-oriented databases provide ac-
cess to them via code written in an object-oriented programming language.
For example, Exodus uses C++ and its extension - the E language; Gem-
Stone uses OPAL, based on Smalltalk; Itasca and Orion support Common
Lisp extended with object-oriented constructs; Ontos, Poet, and ObjectStore
support C++; Ode uses an extension of C++, called O++; in 02 an ap-
plication can be written either in C++ or in the 02C language, an
extension of C.
With the support of object-oriented programming languages OODBSs
try to overcome another problem emerging while developing database appli-
cations: the impedance mismatch, which often arises between a database
and a program accessing this database. An impedance mismatch results from:

• the difference between a database manipulation language, e.g. SQL and
the language used to implement a database application, e.g. C++;
• the difference in type systems used by a database and a programming
language - a programming language is not able to represent database
structures, e.g. relations directly. Therefore, types and data have to be
converted from one type system to the other.

Although an application can be written in a procedural language, the use
of a query language for accessing objects is considered very important. Some
of the object-oriented database systems support a query and a data definition
language. Because of the success and popularity of relational SQL, these
object-oriented systems use a syntax similar to SQL. Object query and
data definition languages are commonly referred to as object query language
(OQL) or object SQL (OSQL). The standard for OQL has been defined in the
ODMG proposal. Among the commercial systems, GemStone, Itasca, Ontos and
02 support dialects of OQL.
In comparison to SQL queries, OQL queries provide new functionality:
path expressions, set attributes and qualifiers, reference variables, the use
of methods, and querying along an inheritance hierarchy. A relationship between
objects provides a means to traverse from one object to another following
that relationship. A dot "." notation is used to specify such
traversal. For example, the query below can be used to find the horsepower
of engines that use the injection unit with the symbol '1300':
select e.horse_power from Engines e
where e.inj_unit.unit_type = '1300';
Constructs such as
e.inj_unit.unit_type

are called path expressions. Path expressions express so-called implicit
(hidden) joins. Explicit joins are also possible in OQL.
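For comparison, in a purely relational schema the same request would require an explicit join. The sketch below is illustrative only: it assumes hypothetical tables Engines and InjectionUnits linked by a foreign key inj_unit_id, which are not part of the object schema above.

select e.horse_power
from Engines e, InjectionUnits u
where e.inj_unit_id = u.unit_id
  and u.unit_type = '1300';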
Predicates in an OQL query can be formed using set attributes and set
membership operator in. For example, the query below selects those profes-
sors who teach the course entitled 'Introduction to databases':
select p.name from Professors p
where 'Introduction to databases' in p.teaches.course_subject;
In an OQL query a path expression can be bound to a variable, called
a reference variable. This variable can then be used within the query.
Reference variables can be considered shorthand for path expressions.
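As an illustration, the earlier query on professors could be reformulated so that the collection-valued path expression p.teaches is bound to a reference variable t in the from clause; the schema is the same hypothetical one as before, and the exact syntax varies between OQL dialects.

select p.name
from Professors p, p.teaches t
where t.course_subject = 'Introduction to databases';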
An OQL query can use methods of two kinds, i.e. a predicate method and
a derived-attribute method. A predicate method returns a Boolean value
for each object it is invoked for, whereas a derived-attribute method is
used to compute the value of an attribute (or attributes) of an object and
return this value. Such a method can be invoked for each object returned by
a query, and a derived-attribute method can be used in a query just like an
attribute. For example, let us assume that class Material defines an attribute
melting_temperature measured in degrees Celsius. A method melt_tem_F
could be defined in the same class to convert the melting temperature from
Celsius to Fahrenheit. While querying Material, the method melt_tem_F can be
invoked to return the melting temperature of objects in Fahrenheit.
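For example, a query over a hypothetical extent Materials might then invoke the method as if it were an attribute; the attribute name and the method-call syntax shown here are illustrative and differ slightly between systems.

select m.name, m.melt_tem_F()
from Materials m
where m.melt_tem_F() > 2000;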
The set of instances of a subclass is a subset of the set of instances of its super-
class. For example, the instances of the Radio class are at the same time electronic
devices. The existence of an inheritance hierarchy allows a new kind of querying
technique to be used. While querying along an inheritance hierarchy one may
be interested in retrieving objects from some, but not all, classes in this
hierarchy. For example, one query rooted at ElectronicDevice may be issued
in order to retrieve the instances of the ElectronicDevice class and the Radio
class, but not from the TapeRecorder class. Whereas another query rooted
at ElectronicDevice may return the instances of ElectronicDevice as well as
the instances of all its subclasses.
With the support of OQL, object-oriented database systems have to pro-
vide optimization techniques for such queries. Query optimization in OODBS
is more difficult than in relational systems due to the following reasons:

• Different data types - the input type and the output type of an OQL
query may be different, which makes the design of an object algebra
difficult. Most of the proposed object algebras use separate sets of alge-
bra operators dedicated to individual types, e.g. object operators, tuple
operators, set operators, list operators. As a consequence, object algebras
and equivalence transformation rules are much more complicated than in
relational systems.
• Methods - they are written in a high-level programming language and
their code is usually hidden from a query optimizer in order to preserve
encapsulation. Moreover, estimating the cost of executing a method is
another serious problem.
• Queries along an inheritance hierarchy - the results of such queries are
collections of heterogeneous objects. Therefore, when a method is applied
to such a collection the late binding should be used to invoke appropriate
method for appropriate object. This implies the need for run-time query
optimization in addition to compile-time optimization.
The optimization of queries along the inheritance hierarchy requires new
types of index structures. Various indexes have been proposed for use in such
queries, for example, the class hierarchy index, the H-tree index, and the nested
inherited index.
Path expressions are one of the features of object-oriented queries. The
efficient processing and optimization of such expressions is possible by the use
of various kinds of indexes, e.g., nested index, path index, and multiindex.
One kind of application requiring the use of object-oriented databases is
Computer Aided Design. A design process involves a team of designers
co-operating over a long period of time, e.g. days or months. A database
system has to support the mechanism of a transaction with its atomicity,
consistency, isolation, and durability features. The long duration of design
processes, i.e. the long duration of transactions, means that traditional con-
currency control strategies are not appropriate. Therefore, the support for
long transactions is very important in object-oriented database systems. A
long transaction should be able to save its intermediate state. To this end, a
checkpointing mechanism can be used and nested transactions can be started.
A nested transaction may contain subtransactions that in turn may contain
their own subtransactions. Although support for long transactions in OODBSs
is highly important, work on it is still ongoing and it has not yet been fully
implemented in existing commercial OODB systems.
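As a rough analogy only, savepoints in standard relational SQL already allow a transaction to record intermediate states and roll back to them selectively; a long design transaction needs a comparable, but durable, checkpointing facility. The table and column names in the sketch below are hypothetical.

start transaction;
update Designs set status = 'draft' where design_id = 17;
savepoint first_revision;
update Designs set status = 'reviewed' where design_id = 17;
-- undo only the work done after the savepoint, keeping the earlier update
rollback to savepoint first_revision;
commit;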
The process of designing (in CAD, CASE, and CSCW systems) either a me-
chanical element or a piece of software, or of writing a document, is charac-
terized by the creation of alternative solutions as well as by the need to store
previous solutions. A database schema can also change its structure in order
to conform to the changing real world. For these reasons OODB systems have
to support: (1) the modification of a database schema, (2) the derivation, stor-
age, and management of dozens of alternative versions of designs (objects),
(3) the creation and management of different versions of the same class and
of the whole schema. The existing data models, prototypes, and commercial
products can be categorized by their support for:
• schema modifications,
• versioning objects without versioning classes, e.g. Exodus, ObjectStore,
Ode, Ontos, O2,
• versioning of subschemas, i.e. groups of classes connected by inheritance
and by various kinds of relationships, but without versioning objects,
• versioning the whole database with its schema, e.g. Orion and Itasca.

7 Federated, Mediated Database Systems and Data Warehouses

The most common approaches that allow distributed databases and other
data sources to be accessed in an integrated manner are as follows:
• a federated database system,
• a mediated system,
• a data warehouse system.
The federated and mediated approaches are called virtual approaches.

7.1 Federated Database System


A federated database system (FDBS) is composed of different databases,
called component databases. A FDBS allows component databases to be accessed
in an integrated manner. The component databases co-operate with each
other in order to provide answers to queries issued in a federated system. A
FDBS has no data of its own and queries are answered by accessing data
stored in component databases. Each component database can expose to the
integration process either the whole schema or only part of its schema.
A typical architecture of a FDBS consists of the following layers of
schemas and software: local schema, component schema, transforming proces-
sor, export schema, filtering processor, federated schema, constructing pro-
cessor, external schema. This architecture is presented in Figure 7.1.
Each of the component databases (denoted component db 1, component
db 2, and component db 3), which are integrated, uses its own data model
and schema (denoted local schema 1, local schema 2, and local schema 3). In
order to integrate these data sources they have to conform to a common data
model, used in the federated system. The common integrating data model
usually has more expressive power than the local data models. The most frequently
used common data models are relational or object-oriented.
The main task of the transforming processor is to transform the underlying
data sources to a common data model. Additionally, the transforming pro-
cessor is responsible for:

• maintaining mappings between local and component schema elements,
• translation of commands from a federated query language to a query
language of a component database,
• translation of data from a local to a common data format.

For example, if the component databases 1, 2, and 3 store data in the
XML, relational, and hierarchical data format, respectively, and the inte-
grated data model is relational, then a transforming processor must present
these three data sources as if they were relational.

Fig. 7.1. An example architecture of a federated database system

A transforming processor
is specific for each different data source. After the transformation, each local
schema is seen as the so-called component schema, expressed in a common
integrating data model.
The whole component schema need not be the subject of integration. A
given component database may expose only a portion of its schema and,
as a consequence, share only a subset of its data. Therefore, on top of each
of the component schemas an export schema is defined. An export schema is
implemented as a set of views.
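For instance, if the common data model is relational, a component database holding a hypothetical Employees table could expose only selected columns and rows through a view such as the one below and publish only that view in its export schema:

create view export_employees as
select emp_id, name, department
from Employees
where employment_status = 'active';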
Using the information about data visibility and access control specified in
an export schema, a filtering processor controls the set of operations that are
issued against a component schema. One or more federated schemas can be built
on top of export schemas. Each federated schema serves as an entry point to
the system for each specific group of global users.
A user's global query is issued against a federated database schema. A soft-
ware component, called the constructing processor, is responsible for:
• integrating different information sources by resolving inconsistencies and
conflicts between them;
• determining the set of data sources capable of answering a given query
formulated in terms of a federated schema;
• decomposing, optimizing, and transforming the query into local queries,
that is queries for each of the data sources;
• sending each local query to appropriate data source;
• receiving query results from data sources, translating, filtering, and merg-
ing these results to form a global result.

Several external schemas can be built on top of each federated schema
by each user of the system. The purposes of using an external schema are as
follows: (1) further customization of a federated schema, (2) its simplification
when the original federated schema is large and complicated, and finally, (3) the
introduction of an additional layer for access control.
A federated system has to store descriptions of the component databases
that are part of the federation. Information such as the network addresses
of component databases, the communication protocols they use, mappings
between local, component, export, federated, and external schemas, descrip-
tions of the data formats used by data sources, statistics and heuristics used by a
query optimizer, as well as data format transformation functions, is stored
in a federated data dictionary. This dictionary is used by the constructing
processor.
A federated database system may be designed as either loosely coupled or
tightly coupled. In a loosely coupled FDBS a user is responsible for creating
his or her own federated schema. The administrator of a federated system does not
have any control over a user's federated schema. Such a schema is created
as a view built on different export schemas. In order to be able to define
such a view, a user has to know which export schemas are available and
has to understand the structure of these export schemas. Multiple federated
schemas can exist at the same time, and they can be created and dropped
at any time. As each user has his or her own view over the federation, two or
more users may build their own views to access the same information from the
same local databases. These users will not be aware of the existence of the other
views. As a consequence, duplicate work may be done by them.
In a loosely coupled FDBS, local databases have a high degree of autonomy.
In such a system changes in source schemas can be performed more easily than in
a tightly coupled FDBS, as it is easier to define a new view on a changed
export schema than to create a global schema. But it may be difficult for a
user to detect that an export schema has changed.
Loosely coupled systems allow data to be queried. Updates of local
data through a federated schema, however, are not allowed because different users might
define different mappings between views and export schemas and might define
different update policies for the same data.
In a tightly coupled FDBS it is the federated system administrator who is
responsible for the creation and maintenance of one or more global federated
schemas. Such a global schema integrates all the available export schemas.
The idea behind building a global federated schema is to provide location
transparency. Users simply use the provided global schema without knowing
the location of data sources.
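In a relational setting the idea can be sketched, in a much simplified form, as a global view defined over the export schemas of two component databases; the names db1_customers and db2_customers are purely illustrative, and in a real FDBS the subqueries would be routed through the constructing processor rather than evaluated as an ordinary view.

create view global_customers as
select cust_id, name, 'component db 1' as origin from db1_customers
union all
select cust_id, name, 'component db 2' as origin from db2_customers;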
In a tightly coupled FDBS, local databases have less autonomy than in a
loosely coupled FDBS. During the creation of a global federated schema, a
FDBS administrator negotiates with local database administrators the struc-
ture of their export schemas. In a tightly coupled system changes in source
schemas are difficult to propagate into a federated global schema as any
change in a source and subsequently in an export schema results in the cre-
ation of a federated schema from scratch. Updates of local data through a
federated schema are at least partially supported.
An important issue while building a federated schema, either in a loosely
or tightly coupled FDBS, is the integration of various export schemas. Al-
though these schemas already use the same data model, they may differ in the
way a given real-world object is represented. At the level of export schemas,
structural and semantic heterogeneity usually exists. This heterogeneity
is caused by:
• different naming of the same real-world entity in different local databases
and, in a consequence, in different export schemas;
• different semantics of various real-world entities having the same name
in different local databases;
• different structures or modeling constructs used to represent the same
real-world entities in different local databases;
• different semantics of values stored in local databases in the context of
a different geographical region or organizational structure; for example, the
value 10000 of attribute salary may be understood differently in the con-
text of Poland (10000 PLN) and differently in the context of Great Britain
(10000 GBP).
A kind of translation is required between the language used to access
a federated schema and the languages of the local data sources. As data sources use dif-
ferent data models, they also support different languages, e.g. SQL, OQL.
Even within the same data model, e.g. relational, different dialects of the
SQL language are supported by various database systems. A language of
the integrated system is usually different than the native languages of local
databases. Therefore, a command expressed in the language of an integrated
system has to be translated into commands expressed in native languages of
local databases.
The advantages of the FDBS approach are as follows:
• a user of a FDBS queries data that are always up-to-date, as queries
operate directly on component databases, which are the subjects of inte-
gration,
• only a little additional storage for information is required in a FDBS,
• a user can query any data that are accessible via a federated schema.

However, this approach also has some disadvantages:
• the results of a query may arrive with a long delay caused by a slow
network, or a low response time of data sources,
• the decomposition and translation of a query as well as merging the
results of a query incur additional time overhead,
• queries coming from a federated system may interfere with queries exe-
cuted locally in component databases. In a consequence, federated queries
may slow down the execution of the local queries,
• some of the component databases may be temporarily unavailable, thus
making the query results incomplete or unavailable.

For these reasons the FDBS approach to data integration is appropriate
when:
• source data in component databases change very often,
• a user of an integrated system needs information that is always up
to date,
• a user can tolerate the delay in receiving query results.

7.2 Mediated System

In a mediated system, a software component, called a mediator, supports a
virtual view or collection of views that integrates several data sources. A
mediator does not store locally any of the integrated data. A user issues a
global query to the mediator. Since the mediator does not store any data,
data sources are queried in order to form the result of a global query. In the
architecture presented in Figure 7.2 two layers of software components are
used between a user query and data sources. These components are wrappers
and a mediator.
Data sources may use different data models. In such a case, the different
models are transformed into a common data model used by all wrappers and
the mediator. The translation of data models is performed by a wrapper. For
each data source a dedicated wrapper has to be implemented.
Users access data stored in various data sources by issuing queries to
the mediator. Upon receiving a query, the mediator sends the query to wrap-
pers. Each wrapper then translates the received query to the format that
is understood by a data source and next, sends the query to the source for
execution. After executing the query a data source sends the query results to
its wrapper. The wrapper translates them into the common data model, and
sends them to the mediator. The mediator integrates the data received from
wrappers into one global query result.

Fig. 7.2. The architecture of a mediated system

Mediator systems require complex wrappers because the wrapper must
be able to accept a variety of queries from the mediator and translate any of
them to the format accepted by a data source. A common way to design a
wrapper is to predict and classify into templates the possible queries that the
mediator can pass to a wrapper. A template is a query with parameters. The
mediator can pass the values of query parameters, and the wrapper executes
the query with the current values of the parameters. Templates are used for
the automatic generation of program code used to implement a wrapper. The
software that creates a wrapper is called a wrapper generator. The wrapper
generator creates, among other things, a mapping table that stores the various query
patterns contained in templates, and the corresponding specific queries for a
data source for which a wrapper is implemented.
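A template is thus a parameterized query. For a relational source, one entry in the mapping table might pair a query pattern such as the following, where :part_no is a placeholder filled in with a value supplied by the mediator at run time, with the corresponding native query of the source (all names here are illustrative):

select p.part_no, p.price
from Parts p
where p.part_no = :part_no;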

Similarly as in a federated database system, the advantages of a mediated
system are that a user queries data that are always up-to-date and that only
little additional storage for information is required in the system.

The disadvantages of the mediated approach concern the delay in getting
the result of a query, the interference of global queries with local queries
in data sources, and the temporary unavailability of some data sources, as in a
FDBS. Additionally, in a mediated system the set of possible queries that
can be translated by a wrapper and sent to a data source is limited by the
functionality of a wrapper, i.e. by the content of its mapping table.

7.3 Data Warehouse System

The third solution to data integration is to store in a central repository
the previously extracted and integrated data of interest coming from several
sources. This repository is called a data warehouse and looks to users like an
ordinary database. A data warehouse (DW) contains a collection of integrated
data stored in a centralized repository. These data are not updated in
this database. A data warehouse stores raw as well as aggregated data.
User queries are answered using the information stored in a data ware-
house, rather than directly accessing data sources.
A data warehouse architecture has various layers. Two basic architectures
(cf. Figures 7.3 and 7.4) are frequently used. Figure 7.3 presents a generic data
warehouse architecture.

Fig. 7.3. A generic data warehouse architecture

At the lowest layer of this architecture data sources DS1, DS2, and DS3
are located. These sources may contain heterogeneous information: struc-
tured data - stored in relational, object-relational, and object-oriented databases,
semistructured data - in the format of XML or SGML, or unstructured in-
formation. Data sources are usually distributed.

Similarly as in the mediated approach to data integration, a software com-
ponent, called a wrapper, is responsible for translating information from the format
used in a data source to a common format and data model of a data ware-
house. Wrappers applied in data warehousing may be simpler than those used
in mediation systems because the set of possible queries used to load a data
warehouse is known in advance, at the time the warehouse is designed.
The second software component, called a monitor, is responsible for de-
tecting changes made to source data. For each data source a specific wrap-
per/monitor component has to be provided. When a new data source is added
to a data warehousing system, or when the information of interest has changed
in a source, the new or modified information is sent to the integrator
module, which is responsible for installing the information in the warehouse. This
process may, for example, require filtering data, merging it with data coming
from other sources, or summarizing it.
At the highest layer of the DW architecture data marts (DM1, DM2,
DM3) may be located. A data mart is a small warehouse that contains a
subset of the data derived from the data in the main warehouse. Data in a
data mart are highly aggregated. The purposes of using a DM are as follows.
Firstly, to provide data of interest to a group of professionals. Secondly,
to provide the required data at the required level of aggregation in order
to allow different kinds of analysis and decision making, e.g. trend and
anomaly analysis, historical analysis, and long-term decisions. For example, one
data mart could be used by a sales department, providing information
about products sold by different shops in different periods. Another data mart
could be built for a marketing department, providing information about
the correlation between promotions and sales. Both of the data marts would
be built based on data in the warehouse. Data marts can be implemented as
materialized views, and can use a relational or multidimensional data model.
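For example, a sales data mart could be derived from a hypothetical warehouse fact table Sales roughly as follows; the materialized-view syntax shown is typical of, but not identical across, commercial systems.

create materialized view sales_by_shop_and_month as
select shop_id, sale_month, sum(amount) as total_sales
from Sales
group by shop_id, sale_month;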
In some cases an intermediate layer, called an operational data store
(ODS) is built between data sources and a data warehouse (Figure 7.4).
An operational data store is a subject-oriented set of data coming from one
or more data sources. While loading an ODS, data are extracted from data
sources, transformed to a common data model of an ODS, and preintegrated.
Then, the content of various operational data stores is again integrated and
loaded into a central data warehouse.
The main differences between a data warehouse and an operational data
store are as follows. Firstly, the content of an ODS changes more frequently
than the content of a data warehouse. Secondly, data in an ODS are near
current with respect to the corresponding data in data sources. Next, data
in an ODS can be updated but the updates do not propagate down to data
sources. And finally, the aggregation level of data in an ODS is very low.
The purpose of building an ODS is to separate long-lasting, time-consuming
processing, e.g. On-Line Analytical Processing (OLAP) queries, from the data sources.
Since OLAP queries operate on data in an ODS, they do not interfere with
the processing in data sources.

Fig. 7.4. A data warehouse architecture with operational data stores

Furthermore, an ODS may be designed and
tuned especially for a particular pattern of processing, whereas the underlying
data sources may be designed and tuned for another kind of processing, e.g.
OLTP. For tuning reasons one may need to change the structure of
source data in an ODS, e.g. merge some source information, project some
attributes, or introduce data redundancies. From the implementation point of
view, an ODS may be seen as a set of materialized views.
The advantages of a data warehousing approach are as follows:

• queries operate on a local centralized data repository, which reduces data
access time,
• queries need not be decomposed into different formats and their results
need not be integrated because data are stored in a uniform format in a
central repository,
• users execute queries in a data warehouse and their queries do not inter-
fere with queries issued in data sources,
• it is possible to query data that originally were not stored in a database,
provided that these data have been previously integrated into the system
and loaded into a warehouse,
• since a data warehouse usually makes available additional information
not stored directly in data sources (e.g. summaries, averages), users can
make use of this enriched information,
• a data warehouse is independent of the data sources - consequently, even
if some data sources are unavailable, the users of a data warehousing
system can still get the information that has already been loaded into the
warehouse.
The disadvantages of this approach are the following:
• because data from various sources are loaded into a warehouse, the warehouse
becomes obsolete when source information changes. In order to keep the
warehouse up to date, additional mechanisms must be implemented,
• the data warehouse administrator has to know in advance which sources of
data should be integrated, and which data should be extracted from these
sources and loaded into the warehouse,
• a user can query only those data that have been previously loaded into the
warehouse,
• since a warehouse stores locally all its data, additional storage space
is required. Moreover, the size of the storage space is very large as a
warehouse contains data from a number of data sources.

8 Conclusions

In this chapter, we have introduced and briefly discussed the basic definitions
and concepts of database systems, including a data model, a database man-
agement system, a transaction, a query language, etc. We have also explained
how databases are created and used. Then, we have briefly described the evo-
lution of database systems, starting from simple record-oriented navigational
database systems to relational and object-relational systems, and explored
the background, characteristics, advantages and disadvantages of the main
database models: hierarchical, network, relational, and object-oriented. Then,
we have addressed the issue of data integration in a heterogeneous comput-
ing environment. We have briefly presented and pointed out advantages and
disadvantages of the three basic approaches to data integration, namely, fed-
erated database systems, data warehousing systems, and mediated systems.

3. Data Modeling

Jeffrey Parsons

Faculty of Business Administration, Memorial University of Newfoundland,
St. John's, NF, Canada

1. Introduction ...................................................... 50
2. Early Concerns in Data Management .............................. 50
3. Abstraction in Data Modeling ..................................... 52
3.1 Traditional Data Models ...................................... 52
4. Semantic Data Models ............................................ 56
4.1 Specialization/Generalization.................................. 59
4.2 Composition .................................................. 60
4.3 Materialization................................................ 61
4.4 Encapsulation ................................................. 61
4.5 Emergent Themes in Data Modeling Abstractions ............. 62
5. Models of Reality and Perception ................................. 62
5.1 Ontology...................................................... 62
5.2 Cognition ..................................................... 63
5.3 Reconciling Models of Data with Models of Perceived Reality .. 65
6. Toward Cognition-Based Data Management....................... 66
6.1 Classification Issues ........................................... 67
6.2 Well-Defined Entity Types .................................... 67
6.3 Stable Entity Types ........................................... 68
6.4 Shared Entity Types .......................................... 69
7. A Cognitive Approach to Data Modeling.......................... 70
7.1 Other Applications of Cognition to Data Modeling ............ 72
8. Research Directions ............................................... 72

Abstract. Data modeling is an activity aimed at describing the structure of in-
formation to be stored in a database. Over the years, data modeling has evolved
from a focus on machine-oriented constructs to a focus on capturing the structure
of knowledge as perceived by users for whom the database is developed. The evo-
lution of data modeling has paralleled a shift in emphasis from technical issues
of efficiency in storage and retrieval to conceptual issues of capturing more of the
semantics of the data. This chapter traces the evolution of data modeling from this
perspective and poses some possible areas of future research.

1 Introduction
In the field of information technology, there has been a tremendous focus on
improvements in hardware, ranging from processing speed to primary and
secondary storage capacity to telecommunications bandwidth. In addition,
much has been made of advances in software, exemplified by generations
of programming languages with increasing levels of abstraction [Sha84]. In
the field of data management, there has been a similar, though perhaps less
widely recognized, degree of progress. Several reviews of the evolution of data
modeling have been written (e.g., [Bro84], [Nav92], [TL82]), focusing mainly
on the structuring of data in various models. In this chapter, a case is made
for viewing progress in data management in terms of the degree to which a
database can be viewed as a model of knowledge about some segment of the
real world.
Section 2 examines early concerns in data modeling. Section 3 outlines the
changes in focus during the evolution through flat file, hierarchical, network,
and relational approaches to organizing data. Section 4 discusses the increas-
ing level of abstraction demonstrated in research on semantic data models
and conceptual modeling. Section 5 introduces a framework for understand-
ing the ontological and conceptual foundation of data modeling by outlining
views of ontology and cognition as human activities of creating models of the
world. Section 6 describes what can be gained from a cognitive approach to
data modeling. Section 7 summarizes an information model based on cogni-
tive principles. Section 8 concludes by outlining some directions for future
research in data management.

2 Early Concerns in Data Management


A primary concern in the early days of data management was efficiency in
retrieving and processing data. Several factors associated with the technology
of the period made this focus mandatory. First, processing speeds of the most
powerful machines were quite slow compared to what is now available on
the desktop. In that environment, organizing data to minimize processing
was imperative. Second, locating and retrieving data from secondary storage
devices was slow by today's standards, even though progress in this area has
not been as dramatic as progress in CPU speed. Third, the unit cost of both
primary and secondary storage was high.
For all these reasons, early attention in data management focused on how
to organize data for efficient retrieval and processing. Little attention was
given to issues surrounding the semantics, or meaning, of data [Sow76]. The
field of data modeling has developed to allow the data structures used in data
management to better reflect the interpretations placed by humans on the
data.
The first attempts to structure data in a way to capture some application
semantics were, perhaps paradoxically, influenced by the sequential access
method that constituted the only way to access data on the dominant early
secondary storage medium - magnetic tape. Under a constraint of sequential
access, efficient applications are those that involve processing most or all of
the data in a file. In particular, since updates to files on magnetic tape re-
quire writing all the data in a file to a new tape after an update, updates to
only a few pieces of data at a time are extremely inefficient. Consequently,
batch processing emerged as the dominant data processing strategy in early
applications. In batch processing, a file of transaction data (e.g., all the trans-
actions for a day) is used to update a master file of application data at certain
intervals.
In conjunction with batch techniques, processing efficiency is maximized
by sorting both master and transaction data according to the same field, so
that the master file can be updated by reading the file sequentially only once.
This discussion of processing strategies in a sequential access world is rel-
evant to data modeling since it leads naturally to a particular approach to
organizing data. To illustrate, suppose that a master file contains accounts
payable balances for customers of an organization. Transactions consist of
purchases and payments by customers. The only way to support efficient
batch processing (i.e., processing the batch while reading the master and
transaction files only once) is to first organize the master file sorted by cus-
tomer identification data (e.g., customer number). The transaction file is
similarly sorted by customer identification data and all the transactions of
each customer for the batch are grouped and arranged in sequence on the
tape.
This example implies that the data organization method appropriate to
sequential access is a variable-length record-based structure. A record consists
of a collection of related fields. Records may vary in length since, for any batch
of transactions, there may be anything from zero to many transactions for a
single master record. In this structure, all the transactions pertaining to the
same master record are arranged in sequence, followed by all the transactions
for the next master record, and so on.
In this structure, the data contain little, if any, semantics. Instead, the
relevant knowledge about how to interpret the next byte of data is contained
in the program(s) which access that data. One negative consequence of this is
that, if the data structure is changed for any reason, all programs that access
that data need to be changed. This limitation was a major factor motivat-
ing the subsequent development of the data modeling field. Data modeling
involves embedding domain semantics in the structure of the data.
The evolution of data modeling was, in some respects, possible due to
the development of direct-access secondary storage devices (disks) to replace
sequential access devices. This allowed data management to be driven by
semantic issues, instead of the constraints of the technology of secondary
storage devices.

3 Abstraction in Data Modeling

The contents of a database represent the state of some real-world phenomena
at a point in time. The changes in a database over time reflect corresponding
changes in the real world phenomena it describes. In this way, a database can
be seen as a model of some portion of the real world. Moreover, a database
represents only some information about the phenomena of interest and there-
fore is an abstraction.
While the values contained in a database can be viewed as modeling the
state of a part of the world (the phenomena of interest), the structure of the
data can be seen as modeling the abstract structure of that portion of the
real world. Such a data structure is termed a data model.

3.1 Traditional Data Models

Data modeling has evolved through a number of distinct generations that
reflect increasing levels of abstraction. This chapter does not describe in detail
the structures and operations associated with these data models (for the
uninitiated reader, details can be found in sources such as [TL82]). Rather,
the focus is on the domain semantics and abstractions that can be expressed
in these models.
The first generation corresponds to the hierarchical data model. Unlike
subsequent models, the hierarchical model emerged from the underlying
sequential access methods associated with magnetic tape [TL82]. That is,
the sequential access method (SAM) and indexed sequential access method
(ISAM) data structures that implement the model preceded the recognition
of the hierarchical model as a way to abstractly describe the contents of a
database. The essence of the hierarchical model is the organization of data
into record types with fixed, ordered hierarchical links. The record types and
links constitute a tree that reflects a hierarchical organization of one-to-many
relationships. An example of such a tree is shown in Figure 3.1.
This representation effectively models hierarchical relationships among
records. The simple example in Figure 3.1 is interpreted to mean that each
department record is associated with one or more employee records and one
or more project records. Each employee is in turn associated with one or more
dependents and each project is associated with one or more workers. In this
database structure, the database is a collection of trees - each tree consisting
of a department and all associated employee, dependent, project, and worker
data.

Fig. 3.1. A hierarchical data model (Figures 3.1 through 4.1 are adapted from
examples in [EN89])

A hierarchical structure lends itself well to processing using a sequential
access method, as each tree can be stored in sequence in a file. However, it
imposes a hierarchical constraint that is not sufficient to reflect the multiple
and time-variant relationships that can exist between things in the applica-
tion domain [Ken78]. To illustrate, although the above hierarchical structure
is appropriate for updating department information as employees are hired or
fired and projects started, worked on, or completed, it does not easily support
finding information such as which projects a given worker works on. To model
such a relationship, a separate hierarchical structure would be needed (e.g.,
WORKER→PROJECT). Alternatively, virtual pointers can be used to avoid
some of the duplication that would result from implementing distinct hierar-
chical structures [EN89]. Moreover, answering typical managerial queries such
as "Which departments have projects located in Canada?" would in prac-
tice require replicating the data in yet another hierarchical structure (e.g.,
PROJECT→DEPARTMENT, where PROJECT records might be sorted by
project location). In other words, accommodating a wide variety of uses re-
quires developing a potentially large number of distinct hierarchies. Moreover,
since it would be difficult, if not impossible, to determine all possible uses
during database design, it might be necessary to add new hierarchical struc-
tures as new uses of the data are identified. In other words, the hierarchical
structure does not capture enough domain semantics to support a wide range
of database uses. Providing additional hierarchies increases the complexity of
the database, and may result in a high level of redundancy, with the associ-
ated insertion, deletion, and update anomalies that accompany poor database
design [EN89].
To combat the replication associated with multiple hierarchies or virtual
pointers needed to represent many-to-many relationships in a pure hierar-
chical database, the network data model (or CODASYL DBTG model) was
developed [COD71] [COD78]. The essence of the network model is that it
supports many-to-many relationships by maintaining a complex system of
pointers.
The structure of a network database can be understood by adapting the
database of Figure 3.1. Consider that a department can have many projects,
and a project is sponsored by one department. From the company's point of
view (and interest in managing project information), a network model repre-
sentation might reasonably focus on the directional (one-to-many) relation-
ship between a department and the projects it sponsors. In network parlance,
two record types would be used: DEPARTMENT and PROJECT, with DE-
PARTMENT being the owner type and PROJECT the member type. The
arc or linkage between the owner and member types is called a set type and
represents a relationship. This is depicted in Figure 3.2.

Fig. 3.2. A network data model

In addition to the complexity associated with navigating through a net-
work structure, representing many-to-many relationships in the network
model is indirect since it requires the introduction of intermediate record
types. To model the fact that an employee can be assigned to many projects
and a project may have many employees assigned to it, it is necessary to
introduce another record type, WORKS_ON, which would be a member type
in two set types linking to EMPLOYEE and PROJECT respectively. This is
also shown in Figure 3.2.
Though a generalization of the hierarchical model, the network model
suffers from a similar problem - relationships among records are fixed by the
data access paths defined by pointers linking records. This makes for very
efficient access and processing for certain types of applications. However, if
the relationships of interest to users change, the database structure cannot
represent those changes without massive database redesign and reorganiza-
tion.
The emergence of the relational data model [Cod70] marked a significant
departure from previous data models. The essential data structuring mech-
anism of the relational model is the relation, or table. A table defines the
structure (attributes) of a collection of tuples (rows). Unlike the hierarchical
and network models, the ordering of tuples in a relation is not important to
the user of a database. Additionally, each table can stand alone. There are no
explicit links among tables that represent relationships as they are handled in
the hierarchical and network models. Instead, relationships are represented
implicitly when the values of an attribute in one table match the values in a
second table for which the values of the attribute serve to uniquely identify
rows in that table. For instance, a WORKS_ON table can have an attribute
ESSN whose values at any time are a subset of the values of an attribute SSN,
which uniquely identifies rows in an EMPLOYEE table. SSN is said to be a
primary key of EMPLOYEE and ESSN a foreign key of WORKS_ON. This is
illustrated in Figure 3.3.

EMPLOYEE(FNAME, MINIT, LNAME, SSN, BDATE, ADDRESS, SEX, SALARY, SUPERSSN, DNO)
DEPARTMENT(DNAME, DNUMBER, MGRSSN, MGRSTARTDATE)
DEPT_LOCATIONS(DNUMBER, DLOCATION)
PROJECT(PNAME, PNUMBER, PLOCATION, DNUM)
WORKS_ON(ESSN, PNO, HOURS)
DEPENDENT(ESSN, DEPNAME, SEX, BDATE, RELATIONSHIP)

Fig. 3.3. A relational data model
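The keys of Figure 3.3 could be declared in SQL roughly as follows; only two of the tables are shown, the data types are illustrative, and the foreign key from PNO to PROJECT is omitted for brevity.

create table EMPLOYEE (
  SSN    char(9) primary key,
  FNAME  varchar(15),
  LNAME  varchar(15),
  DNO    integer
);

create table WORKS_ON (
  ESSN   char(9) references EMPLOYEE(SSN),
  PNO    integer,
  HOURS  decimal(4,1),
  primary key (ESSN, PNO)
);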



This structure gives the database designer a tremendous degree of flexi-
bility in designing a database. Essentially, it frees the designer from having
to predetermine the ways in which users might want to access data, or hav-
ing to predetermine the kinds of relationships among entities in which they
might be interested. Also, new relations can be added to an existing database
without having to reorganize or repopulate the database.
In addition to its flexibility, the relational model has generally been pro-
moted in terms of ease of use. Unlike the earlier models, the user need not
keep track of pointers or specify access paths to retrieve data. Instead, data
retrieval is made much simpler by a non-procedural query language such as
SQL.
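For example, using the relations of Figure 3.3, the names of the employees assigned to a given project (here, the illustrative project number 10) can be retrieved declaratively, without specifying any access path:

select e.FNAME, e.LNAME
from EMPLOYEE e, WORKS_ON w
where w.ESSN = e.SSN
  and w.PNO = 10;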
Both the flexibility and ease of use of the relational model can be viewed
as consequences of the model's higher level of abstraction. Since relationships
are not fixed in a relational data structure, the model can easily accommo-
date changes in the relationships of interest to users. In many applications,
such changes are common. Hence, the model allows representations to corre-
spond more closely and adapt more easily to how users view the underlying
application domain. Hierarchical and network models become awkward when
users want to model the relationships among data in various ways.
To summarize, the evolution from hierarchical through network to rela-
tional data models highlights a growing emphasis on using the structure of
data to capture relationships among kinds of entities of interest in a database
application. In the hierarchical model, one-to-many relationships are fixed
in the sequential organization of data files. In the network model, many-to-
many relationships can be supported by a complex system of fixed pointers
linking data in a file. In the relational model, fixed relationships are replaced
with variable logical relationships embedded in the notion of foreign keys.
This allows much more flexibility in recognizing that relationships can change
over time.
However, in all three of these models, the emphasis has been on data
processing issues, rather than issues of providing a rich representation of the
semantics of the data, or what the data say about the things represented in
the database. As Kent [Ken78] states, "most models describe data processing
activities, not human enterprises" (p. 96). He further goes on to state that the
hierarchical, network, and relational models are all based on the concept of
records, and that "record technology reflects our attempt to find efficient ways
to process data, [but] does not reflect the natural structure of information"
(p. 101).

4 Semantic Data Models

Developments in data modeling subsequent to the three traditional mod-
els described above have focused on enhancing the representation power of
models and moving away from machine-oriented concepts such as pointers
in the network model. Terms for such an extended model include semantic
data model, information model, and conceptual model. While these terms can
be argued to refer to different kinds of models (on a continuum from data-
oriented to knowledge-oriented), the emphasis in semantic data models and
information models has been on mechanisms to capture more domain knowl-
edge in the structure of data. Therefore, the terms are used interchangeably
in this chapter, and follow the original authors' usage as much as possible.
Although the evolution of data models shows a continual increase in the
degree of domain semantics captured in the conceptual schema, the adjective
"semantic" gained popular use in describing data models only in the early
1980s. The term may have been adapted to data modeling because the se-
mantic models borrowed ideas from research in knowledge representation on
semantic networks [TL82], [BMS84]. In fact, one of the stated motivations of
semantic data modeling was to develop representation constructs that corre-
spond more closely to how humans think about a problem domain [Che76],
[Bro84], [HM81]. Semantic models recognized that the relational model did
not easily or naturally allow the database designer to capture a great deal of
what users know about how the data can be interpreted in the context of the
subject matter of the database; that is, the relational model cannot express
much of the semantics of the data.
The first of the widely known semantic data models is generally acknowl-
edged to be the entity-relationship (ER) model [Che76]. The ER model in-
troduced the notion of entity type as a fundamental modeling construct. An
entity type expresses the similarity of a set of real world entities. The sim-
ilarity of entities belonging to a given entity type is characterized by the
attributes defined for an entity type (all entities of a given type share all the
attributes defining that type) and by the relationships which link different
entity types.
The ER model contains more abstract representation constructs than any
of the classical data models. The model is purely conceptual in that its main
constructs - entities, attributes, and relationships - imply no implementation
mechanism (such as pointers). Indeed, an ER design can easily be converted
to a design in any of the hierarchical, network, or relational models, although
some semantics may be lost in the conversion. However, without additional
information (specifically the semantics lost in the conversion), it is not pos-
sible to convert a design in any of the classical models to a semantically
equivalent ER design.
The additional semantics of the ER model can be seen clearly by exam-
ining a conversion from an ER representation (ER diagram) to a relational
structure. Figure 4.1 contains an ER diagram for the project database exam-
ple introduced earlier.

Fig. 4.1. An entity-relationship data model

Generally, an entity type can be converted to a relational table, with the
attributes of the entity type becoming attributes in the relation. However,
to avoid anomalies in a relational design, the table should be normalized or
decomposed (for a discussion of normalization see [EN89]). If an entity type
(such as EMPLOYEE) contains multivalued attributes (attributes that can
have more than one value, such as PhoneNum), several relational tables will
be needed to represent the entity type in a normalized format. Moreover,
the only linkage among these tables in the relational design will be a foreign
key designation (e.g., the primary key of the PERSON table will appear as a
foreign key in a PHONENUMBER table). Thus, tables are used to represent
some kinds of attributes, in addition to entities, and their domain semantics
is therefore somewhat ambiguous. What may appear to the user to be a single
entity type may be represented in a relational design as several tables with
foreign keys appearing in some tables.
The second stage in converting an ER representation into a relational form
involves representing relationships. The actual conversion procedure depends
on the connectivity of the relationship. Two kinds of connectivity are of pri-
mary interest. First, a relationship may be one-to-many, as in the relationship
[EMPLOYEE]-(0,N)-<worksfor>-(1,1)-[DEPARTMENT]. This means that
an employee works for exactly one (min 1, max 1) department, and a depart-
ment may have from zero to many (min 0, max N) employees working for
it. To represent this in a normalized relational structure, both EMPLOYEE
and DEPARTMENT entity types would become relations. The EMPLOYEE
relation would have, in addition to the attributes of the EMPLOYEE type,
an attribute such as DNum which is a foreign key whose values are drawn from
the valid values of the primary key of the DEPARTMENT table.

Note that this treatment is identical to that of multivalued attributes, so
that the semantic difference is lost in the relational representation. Also lost
in the relational schema is the fact that "worksfor" is an optional relation-
ship of DEPARTMENT - a department may or may not have any employees
working for it. This optionality can be seen only by examining the contents
of a relational database. That is, one would need to scan the EMPLOYEE
and DEPARTMENT tables to discover that not all primary key values from
DEPARTMENT appear as foreign key values of the DNum attribute of EM-
PLOYEE.
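For example, with the relational schema of Figure 3.3, the optionality only becomes visible through a query over the data, such as the following one, which lists the departments that currently have no employees:

select d.DNAME
from DEPARTMENT d
where not exists
  (select * from EMPLOYEE e where e.DNO = d.DNUMBER);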
The second type of relationship connectivity of interest in the conversion
is many-to-many relationships. For example,
[EMPLOYEE]-(0,N)-<workson>-(1,N)-[PROJECT]
means that an employee is assigned to one or more projects, while a project
may have anything from zero to many employees assigned to it. In this case,
a relational table is created to represent the relationship, and its primary
key contains the primary keys of the relations representing the related entity
types. Note that in the relational structure no distinction is made between
tables that represent relationships and tables that represent entity types (or
tables that represent multivalued attributes)2.
This examination shows that the ER model allows the explicit represen-
tation of certain application semantics that may be important to database
users, but which is not (explicitly) represented in a relational structure (or
other classical structure). However, the ER model itself is inadequate to re-
flect additional domain semantics. As a result, there have been numerous
additional developments in semantic data modeling (reflected in the devel-
opment of a number of additional semantic data models) in an attempt to
explicitly capture additional information about the domain in the conceptual
schema. Since there is often similarity and overlap in these models, instead
of discussing each independently, we proceed by discussing the most widely
examined semantic concepts that have been dealt with in these models and
make reference to the models that use these concepts.

4.1 Specialization/Generalization
Perhaps the most widely used construct in semantic data modeling is the
notion of a hierarchy of entity types or classes. Specialization/generalization
hierarchies or networks can be found in models such as [SS77], RM/T [Cod79],
2 The rationale for the distinction between entities and relationships is not always
clear in an ER model, as the designer has some latitude in deciding whether to
model something as an entity type or as a relationship [Ken78]. Hence, the in-
tended domain semantics is lost in the relational representation. In addition, there
is no explicit indication in the schema whether a relationship involves mandatory
or optional participation of the related entity types. As with one-to-many rela-
tionships, this semantics can only be revealed by examining the contents of the
database.
SHM [Bro81], SDM [HM81], DAPLEX [Shi81], Extended ER [TYF86],
and various object-oriented models (e.g., [Boo91], [EKW92], [RBP+91] and
[BRJ99]).
In a specialization/generalization hierarchy, some entity types are asso-
ciated with others along a path from more general to more specific. For
instance, PERSON may be a very general type whose attributes and rela-
tionships for some application are the relevant attributes and relationships
common to all people for that application. A subset of PERSONs may be
those persons employed by a company, EMPLOYEEs. Since all employees are
persons, they have all the attributes and participate in all the relationships
of persons. They may (or must, as in [PW97a]) have additional attributes
and/or relationships. An entity type that is more general than another is
variously called a supertype [AC085], superclass [HM81], or generalization
[SS77]. An entity type whose members are a subset of another is variously
called a subtype, subclass, or specialization.
Numerous variations on this idea have been proposed, including modeling
notation to indicate whether subtypes are disjoint or overlapping [TYF86],
and whether the subtypes cover the supertype (that is, whether all members
of the supertype belong to at least one subtype) [Teo90]. Collectively, these
approaches capture the semantic notion that, in addition to relationships be-
tween distinct entities of distinct types, entity types can be related according
to whether they are specializations or generalizations, recognizing that the
same entity may belong to two or more types in a type hierarchy.
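As a hedged illustration (the table and column names are assumptions, not
taken from any of the models cited above), one common relational rendering of
such a hierarchy, using the PERSON/EMPLOYEE example independently of the
earlier sketches, lets the subtype table share the supertype's key; note that
constraints such as disjointness or coverage of subtypes are not expressed by
this structure:

  CREATE TABLE Person (
      PersonID  INTEGER PRIMARY KEY,
      Name      VARCHAR(60)
  );

  CREATE TABLE Employee (
      PersonID  INTEGER PRIMARY KEY REFERENCES Person(PersonID),  -- same key as PERSON
      Salary    DECIMAL(10,2)       -- additional attribute of the subtype
  );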

4.2 Composition

The ER model has also been recognized to be deficient in representing
the semantics of composite entities. A composite entity is a whole that is
made up of several other entities (parts) and possesses some emergent at-
tributes or relationships [SS77], [Par96a]. Because of emergence, the associa-
tion between the components of a composite entity has a different meaning
than the notion of a relationship as expressed in the ER model. A con-
siderable amount of effort has been devoted to developing notation that
can capture the semantics of part-whole composition [Sto91]. In addition,
various refinements to the general notion of composition and components
have been proposed. These include component-object (e.g., Engine part-of
Car), feature-event (e.g., Demonstration part-of Presentation), member-
collection (e.g., Employee member-of Committee), portion-mass (e.g., Slice
part-of Pie), phase-activity (e.g., Billing part-of Consulting), place-area
(e.g., Reception part-of Office), and stuff-object (e.g., Steel part-of Car)
(examples adapted from [Sto93]).
These distinctions, though useful for codifying and clarifying the seman-
tics of association, appear to have limited application in data modeling. In
other words, with the exception of composition, the distinctions are not gen-
erally found in textbooks on data modeling or database design (e.g., [Ken78],
[TL82], [EN89], [Teo90]).

4.3 Materialization

Recently, another abstraction has been recognized in the data modeling liter-
ature - materialization [GS94]. The basis of materialization is the recognition
that some entities or things of interest in a domain have only a conceptual,
rather than physical, existence, but have a specific manifestation in a number
of physical entities. A classical example occurs in a video store between the
abstract concept of MOVIE and the concrete concept COPY. Each instance
of movie is an abstract entity which is manifested in one or more instances
of COPY, the latter reflecting individual copies of the movie in a store's
inventory.
Goldstein and Storey [GS94] demonstrate that the semantics of this ab-
straction cannot be captured by traditional abstraction mechanisms such
as specialization/generalization and composition (either alone or in combi-
nation). They further give evidence that materialization is a relatively com-
mon abstraction for organizing knowledge about conceptual entities and their
manifestations across a variety of applications. Attempts to express material-
ization through a generic modeling construct such as the relationship in the
ER model will result in a loss of semantics about the nature of the linkage
between the entities.
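A small sketch of the video store example (all names are illustrative) makes the
loss concrete: rendered with an ordinary foreign key, the schema is structurally
indistinguishable from any other one-to-many relationship, so the materialization
semantics disappears:

  CREATE TABLE Movie (
      MovieID INTEGER PRIMARY KEY,
      Title   VARCHAR(80)
  );

  CREATE TABLE Copy (
      CopyID  INTEGER PRIMARY KEY,
      MovieID INTEGER NOT NULL REFERENCES Movie(MovieID)  -- looks like any 1:N link
  );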

4.4 Encapsulation

The most recent development in incorporating semantics in information mod-
eling is object orientation. Object modeling has a strong intellectual lineage in
semantic data modeling. In fact, a number of object modeling techniques pro-
duce diagrams very similar to ER diagrams. For example, object diagrams in
OMT [RBP+91] and class diagrams in UML [BRJ99] are essentially semantic
data models in the tradition of the ER model. Object modeling, however, goes
beyond semantic data modeling to incorporate semantics associated with the
behavior of objects (entities).
Perhaps the most fundamental aspect of object-orientation is the encap-
sulation of structure (attributes) and behavior (methods) [PW97b]. By incor-
porating behavior into an object model of a domain, it is possible to capture
more of what users know about that domain: in particular, what they know
about what the entities in the domain can and cannot do. Such information
is very useful in designing a system in which the state of objects represented
in the system correctly mirrors the state of real world objects as the latter
change state over time [WW93].
4.5 Emergent Themes in Data Modeling Abstractions

The evolution of abstraction in data modeling shows an underlying theme
or pattern. In the earliest approaches to structuring data, data models were
shaped by the constraints of the secondary storage medium (sequential ac-
cess), giving rise to hierarchical fixed links among records. Later, as direct
access storage developed, data structuring relied on bi-directional pointers
to navigate through data in more than one way. In both these cases, data
abstractions were based on computing issues, rather than on user issues. The
relational model was noteworthy in separating data access (user) issues from
data storage (DBMS) issues.
Subsequent abstraction in data modeling focused explicitly on constructs
to capture users' knowledge about that part of the real world constituting the
subject matter of a database [Ken78], [TL82], [Bro84]. Various claims have
been made that semantic data models are more natural, and better reflect the
structure of the real world and/or the way users organize knowledge about
the real world (e.g., [Che76], [HM81], [HK87], [PM88]). However, semantic
data models pay little, if any, attention to applying to data modeling what
is known about the nature of things and the structure of knowledge. At best
they base data modeling constructs on naive models of reality or cognition.
The remainder of this paper describes some important ontological principles
and cognitive research and their implications for data modeling.

5 Models of Reality and Perception

5.1 Ontology

An information system can be viewed as a model of some part of reality
[Nij76], [Bub80]. Consequently, it is natural to turn to studies of the nature
of reality - or ontology - as a basis for developing useful information model-
ing constructs. However, the explicit application of ontological principles to
understanding and improving information systems development (including
data modeling) did not begin until the late 1980s. At that time, a program
of research by Wand and Weber (e.g., [WW88], [Wan89], [WW93], [WW95],
[WSW99]) began. That work has made significant strides in advancing our
formal understanding of many issues related to information systems develop-
ment, including several related to information modeling. The foundation of
that work is the formal ontology of Bunge [Bun77], [Bun79].
A variety of ontological concepts have direct relevance to information
modeling. In Bunge's ontology, the world is comprised of things. The con-
cept of entities and objects in data modeling corresponds to the notion of
thing [Wan89]. Things possess properties. Through properties, humans can
understand the distinctness of individual things, since no two things possess
exactly the same properties. In information modeling, attributes represent
properties, and the use of keys can be understood as a mechanism to cap-
ture the distinctness of individual things when the attributes in a model are
not otherwise sufficient to distinguish entities of a given type. Another key
ontological construct is composition, by which things can be composed of
other things. Ontology adds to our understanding of composition in informa-
tion modeling through the notion of emergent properties which, according to
Bunge's ontology, all composite things must possess. Generally, information
models have not required composite entities to possess emergent properties.
Ontology also uses the notion of functional schema to describe how a thing
may be modeled in terms of a set of property functions in a particular frame
of reference. This provides a way of describing how entities can be classified in
multiple ways according to different frames of reference, and suggests a need
to support multiple conceptual views in information modeling. Recently, on-
tological principles have also been applied to improve our understanding of
the semantics of relationships in data modeling [WSW99].
In addition to using ontological constructs to formalize and clarify certain
information modeling ideas, ontological thinking has been combined with
a cognitive perspective to develop criteria for assessing the quality of an
information model from a classification perspective [PW97a]. Furthermore,
the combined perspective has been used to prescribe necessary elements of an
information model if it is to serve as an adequate representation of perceptions
of some portion of reality [PW97b].
In sum, ontological principles constitute a vocabulary for talking about
or modeling the nature of the real world. Application of these principles
has led to the development of formal, representation-based interpretations
of information modeling constructs (e.g., [Wan89], [WSW99]), as well as the
development of principles to guide certain aspects of information modeling
[PW97a], [PW97b].

5.2 Cognition
Knowledge organization. Cognition is the study of human thinking. An
integral part of cognitive research is the investigation of mechanisms for
organizing knowledge. Much research in cognitive psychology has studied
the semantic structure of memory, proposing a variety of memory models
[Smi78], [RN88]. Several of these models deal with the nature of concepts and
the categorization of individual things as instances of concepts [SM81]. In
other words, developing concepts and categorizing things involves developing
a model of the world.
Two aspects of cognition are particularly relevant to modeling the world.
First, perception involves recognizing features that distinguish things in the
world from other things. Since there are potentially a very large number of
features that could serve to identify and distinguish things, perception in-
herently involves abstraction, or choosing to focus on certain features and
ignoring others. From a Darwinian perspective, the ability to abstract based
on features that enhance opportunities for survival (e.g., danger associated
with certain objects) would lead over time to evolution based on the devel-
opment of useful abstractions [Lak87].
The sheer volume of individual facts that would need to be committed
to memory in order to have a mental "database" of enough facts to deal
with the complexity of objects in the real world is overwhelming. It would be
impossible to remember what is important about all the objects committed
to memory in a simple feature-based memory model [Smi88]. Hence, from the
perspective of modeling the world, a second and equally important aspect of
cognition is concept formation and categorization.
Concept formation involves the identification of abstractions that describe
the similarity of several things. Such abstractions allow us to categorize a
thing perceived as an instance of a particular concept. Concepts fulfill two
roles in enabling humans to deal with the complexity of the world [Ros78].
First, they provide cognitive economy. This means that concepts "reduce the
cognitive burden associated with storing and organizing knowledge" [Par96a]
since the similarity of instances of a concept is established by their com-
mon status as instances of that concept. Second, concepts support inferences
about things. We categorize a perceived thing as an instance of a concept
based on observation of a (small) subset of its properties. It is then possi-
ble to infer from the structure of the concept additional properties that the
thing must possess. Both cognitive economy and inference contribute to the
survival value of concepts [Par96a].

Models of concepts. Models of concepts and categorization fall into three
camps [SM81]. The classical model views a concept as a set of necessary
and sufficient conditions that define membership of instances in a category.
Frequently, these conditions are modeled in terms of properties, features,
or relationships possessed by instances. Since, under this view, membership
conditions are precise, a thing either belongs to a category or not - there are
no degrees of membership.
The classical model has been widely criticized for its failure to account for
a number of empirically robust phenomena. In controlled experiments, partic-
ipants are consistently unable to articulate defining membership conditions
for a variety of concepts [Smi88]. By itself, failure to articulate defining con-
ditions does not mean that people do not use defining conditions. However,
additional evidence from experiments shows that, for many concepts, people
recognize degrees of membership, or typicality effects [Smi88]. This means
that people view some instances as more typical, or even better, examples of
a concept than others. For instance, a robin is seen as a more typical bird
than a penguin, while 3 is seen as a more typical odd number than 135647.
In combination, these results have led to the development of alternate
models of concepts. The so-called prototype view models a concept in terms
of a central member (the prototype), and dimensions along which less typical
members can diverge from the prototype [Smi88]. In some versions of this
view, the prototype does not have to be an actual thing, but may be idealized
[Ros78]. Numerous variations on the prototype approach have been proposed
[MS84], although the differences are not relevant to this discussion.
A third competing model that has received attention is the exemplar
approach [SM81]. Both the classical and prototype models view a concept as
an abstract characterization of the similarity of a set of instances. In contrast,
the exemplar view has no abstract notion of a concept. Instead, the members
of a class may be similar in different ways to other members of the class. In
fact, the similarity of any two members may not be direct, but only reflected
through a chain of similarity between pairs of intervening instances. This
view has gained popularity largely as a result of the work of George Lakoff
on the complexity of categorization and the differences in concept structures
between different cultures [Lak87].
In addition to these distinct models reflecting the presence or type of ex-
plicit abstractions in concepts, recent thinking on categorization has high-
lighted the degree to which, through human activity, concepts are "con-
structed" rather than "discovered". Historically, the classical view of con-
cepts as being well-defined was also associated with the notion that concepts
were fixed [Lak87]. Under the most extreme version of that view, concepts
were seen as having some objective existence outside of human observers,
and that the task of concept formation consisted largely of identifying these
preexisting objective differences.
That view has been largely replaced by a more subjective perspective in
which concept formation is seen as constructing abstractions that capture
useful differences among the things in the world [Lak87]. This evolution in
perspective helps to account for the observed fact that different people or
groups do not necessarily agree on a set of concepts by which to categorize
things in the world. Since what is useful differs among people and over time,
two consequences follow. First, different people may conceptualize the same
domain in different ways. Second, one's conceptualization of a domain may
change over time.

5.3 Reconciling Models of Data with Models of Perceived Reality
Data modeling emerged from data processing concerns, and the recognition
that early forms of data structures did not allow significant application se-
mantics to be captured in the data. Semantic data models attempt to capture
more semantics in the structure of data, rather than leaving those semantics
to be expressed and enforced in application programs.
At first glance, the constructs in semantic data models appear to parallel
those in ontological and cognitive models of the world. Most notably, an entity
type represents the similarity (in terms of shared attributes and relationships)
of a number of entities, and an entity represents a thing. Since an entity type
defines precisely the criteria for membership of entities in the type, data
modeling implicitly incorporates a classical view of well-defined concepts, or
a functional schema-based view of things. There is no provision in semantic
data models for degrees of membership of entities in a type. The ER model
does allow the idea of optional relationships to be defined for an entity type,
but does not treat any entities as being more or less typical instances of a
type. In addition, semantic data models do not allow an entity to be modeled
with attributes that are not attributes of the entity type to which it belongs.
Beyond this, data models do not easily accommodate varying views of
the data, or views that change over time. Semantic data models are used for
conceptual modeling of data requirements, and are typically translated to an
underlying representation, such as a relational one, for implementation. Dur-
ing conceptual modeling, multiple smaller data models may be developed for
different users in an organization. To prepare for database implementation,
these views must be combined into a global conceptual schema in a process
called view integration [NEL86]. However, the global schema is artificial in
that it does not correspond to any user's view of the domain [Par96a]. In
that sense, models, and databases developed from them, are not supportive
of a multiplicity of conceptualizations of a domain. Moreover, such models do
not easily accommodate changes in the conceptualization of a domain over
time. Indeed, many advocates of data modeling hold the view that while the
contents of a database change frequently, the schema is relatively stable over
time [CY91]. In this context, changing a database schema as a conceptual
model changes can be a very time consuming and expensive activity [LH90].
In sum, data models attempt to reflect the structure of the real world as
perceived by users. However, they do not draw their modeling constructs ex-
plicitly from ontological models of reality or cognitive models of how people
organize information about the things in their environment. In particular,
semantic models typically structure knowledge according to well-defined en-
tity types. When used as the basis for implementing a database, they do not
easily accommodate multiple or changing views. The next section considers
one line of research on the implications of basing a conceptual model on a
model of knowledge organization.

6 Toward Cognition-Based Data Management

The trend toward increasing abstraction in conceptual modeling is also seen
as a trend away from machine-oriented constructs toward user- or domain-
oriented constructs [Bro84]. To the extent that conceptual modeling draws
on knowledge constructs, they come mainly from knowledge representation
research in artificial intelligence [BMS84]. For example, in the conceptual
modeling language Telos [Myl91], the modeling approach draws heavily on
notions of semantic networks from artificial intelligence (e.g., [Qui68]). Such
developments have been valuable in providing conceptual modeling constructs
to formally represent an increasing range of domain semantics.
However, by focusing on mechanisms to represent application semantics,
there is a danger of ignoring the underlying cognitive origin of entity types
and related data modeling constructs. In this section, some implications of
refocusing attention in data modeling on the cognitive issues introduced in
Section 5.2 are explored.

6.1 Classification Issues

In data modeling, entity types are useful in capturing users' perceptions of
the important kinds of things about which information needs to be kept in a
database. In other words, entity types represent concepts, and entities repre-
sent things or instances of concepts. To the extent that semantic data mod-
eling constructs reflect a cognitive perspective on classification, it is clearly
a classical one. That is, entity or class types in a semantic model are pre-
cisely defined in terms of attributes and relationships, they are assumed to
be stable over time, and they are assumed to reflect a shared view among
multiple groups of users. In descriptions of methodologies for analyzing data
requirements, various authors advocate "identify the entity types" as the first
step [TYF86], [WWW90], [RBP+91].
This view of entity types as well-defined, stable, and widely shared is in-
consistent with the cognitive research on concepts and classification. Specif-
ically, concepts are often not well-defined, can change over time, and are
defined differently by different people depending on their interests and needs
[SM81]. This suggests that a classical model of concepts, as reflected in the
way entity types are defined in semantic data modeling, may be inadequate
for meeting the representation requirements in the development of databases.
To assess whether it matters that semantic models reflect the way people
conceptualize a domain (the presentation of such models often states that
it indeed does matter), it is useful to examine the degree to which semantic
models support the development of databases that meet users' needs. This
can be done by focusing on the three consequences of the classical approach
to concepts implicit in semantic models.

6.2 Well-Defined Entity Types

Semantic models require that entity types be well-defined. An essential step
in data modeling is defining the entity types of interest and the relation-
ships among them [TYF86j. Neither the ER model nor any of the subsequent
semantic models reviewed above allows concepts to be imprecisely specified
with respect to the data requirements of an application. Entity types are
neither specified by prototypes and measures of dispersion nor by collections
of exemplars.
The requirement for well-defined entity types is understandable given the
twin purposes of many large-scale production databases. Such databases first
must support transaction processing, which entails updating the state of a
database as events occur in the real world [WW88]. Many times, these events
affect all entities of a given type. For instance, if a company awards a five
percent raise to all its employees, it is important that the structure of the
database allows these employees to be identified. A well-defined EMPLOYEE
entity type provides a clear specification of which database records must be
updated in the transaction. Other times, transactions involve only one or a
few entities; nevertheless, these entities are typically specified in terms of the
type to which they belong (e.g., a customer, an employee, a product). Well-
defined types provide an unambiguous answer to the question of whether an
entity is a member of a specified type.
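For instance, the raise mentioned above can be stated as a single update
precisely because the EMPLOYEE type delimits the affected rows (a minimal
sketch assuming a Salary attribute):

  UPDATE Employee
  SET    Salary = Salary * 1.05;    -- five percent raise for all employees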
In addition to transaction processing, databases serve a second function
in supporting both planned and ad hoc queries about the state of real world
entities as represented in the database. In particular, a query implies that the
values of specific attributes of one or more entities of a particular type need
to be retrieved. If an entity type is not well-defined, it may not be possible to
either determine whether a specific entity of interest belongs to a type stated
in a query, or whether the specification of a type in a query will enable the
required attributes to be included in the query result.
Hence, the requirement of well-defined entity types seems well-suited to
the kinds of uses expected of traditional databases. The widespread success of
relational databases and SQL in support of transaction processing and query-
ing suggests there is substantial value to the areas of transaction processing
and querying in requiring that entity types be well-defined. Note, however,
that the requirement of well-defined entity types may not be well-suited to
applications for which only approximate answers are required. For example,
in data mining applications, success in finding useful patterns generally in-
volves identifying attribute structures that are less than perfectly correlated.

6.3 Stable Entity Types


A conceptual data model serves as the input to the development of a database
schema for implementation. Currently, most databases are implemented in a
database management system (DBMS) based on the relational data model.
In such a system, relations implement entity types. If the entity types change,
the entire relational database may need to be reorganized to accommodate
these changes. This is costly and difficult.
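The asymmetry can be sketched as follows (illustrative names, not a claim about
any particular DBMS): adding an attribute to an existing type is comparatively
cheap, whereas redefining the types themselves typically means creating new
tables and migrating the data:

  -- Adding an attribute to an existing entity type:
  ALTER TABLE Employee ADD COLUMN Phone VARCHAR(20);

  -- Splitting the type (EmpKind is a hypothetical attribute) requires new
  -- tables and explicit data migration:
  CREATE TABLE Contractor AS
      SELECT EmpNo, Name FROM Employee WHERE EmpKind = 'contract';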
In view of this, it is not surprising that semantic data models have paid
scant attention to issues of dynamic entity types and the consequences on
database implementation. More recently, in the area of object-oriented mod-
eling, schema evolution has been recognized as an area for which object
modeling techniques can offer advantages over traditional semantic modeling
[BCG+87]. The ability to modify schemas to reflect changes in object class
definitions is recognized to be important for some applications [BCG+87].
One of the goals of research in this area is to support schema changes with-
out requiring massive reorganization of the underlying database [LH90].
Nevertheless, the prevailing view in conceptual modeling is that the data
model schema is one of the most stable aspects of a database [CY91]. As a re-
sult, databases cannot easily accommodate changes to entity type definitions
if the underlying concepts they represent change in the users' perceptions of
the domain.

6.4 Shared Entity Types


A conceptual schema expresses a single view of the entity types important
in a domain. Methods for converting such a schema into a form suitable for
implementation (e.g., a set of relations in a relational database) apply only
to the conversion of a single model of a domain. Hence, database design
implicitly assumes there will be a view of the domain that is acceptable to, and
can be shared by, all users involved in database development.
Modern research on concepts is based on the observation that there can
be significant differences between individuals in their conceptualization of a
domain. Concepts are seen to be constructed based on usefulness, and what is
useful can vary among persons and over time. In many cases, shared concepts
are vital to communication, and their absence leads to miscommunication
[Lak87]. However, in situations where there is non-overlapping interest in
concepts for particular purposes, sharing may not be necessary.
In practice, the fact of different conceptualizations of a domain is recog-
nized in data modeling. To deal with the incompatibility between this fact
and the requirement that database design proceed from a single schema, a
significant amount of research has been done on developing methods to com-
bine multiple views of a domain prior to implementing a database. Under this
view integration approach, multiple views are obtained from different users
as needed. Since these often reflect different definitions of entity types and
different subsets of a domain, depending on users' interests, it is impossible
to proceed directly to defining the database logical schema. View (or schema)
integration consists of a set of techniques to combine so-called local views to
produce a single "global" schema which then serves as the basis for defining
the database [BLN86], [NEL86].
The essential problem with this approach is that the global schema does
not reflect the domain as viewed by any users. It is an artificial creation
that serves to support the implementation of a database, but does nothing to
improve the representation of domain semantics. In fact, it has been shown
that a global schema approach can impair communication between different
users having different views of the domain [Par96b].
In sum, by adopting an approach that assumes well-defined, stable, and
shared concepts, semantic data modeling approaches have some significant
deficits in representing knowledge relevant for database design, implemen-
tation, and effective use. The next section reviews research proposing an
approach that reflects cognitive issues and avoids the problems discussed.

7 A Cognitive Approach to Data Modeling

The view that data modeling should be intended to represent users' knowl-
edge of things in a domain has gained increasing attention in recent years
[HM81], [Nav92], [Par96a]. While this view has typically been presented only
informally and without building on cognitive foundations, one line of research
has looked formally at building semantic information modeling constructs by
explicitly drawing on what has been learned in recent years about catego-
rization and knowledge organization.
The essence of the MIMIC model [Par96a] is the separation of instances
and classes (entity types). That is, instances (representing things) are mod-
eled independently of any classes to which they might be assigned. The basis
for this separation derives from an ontology in which things exist and classes
are regarded as a view of things according to the properties they possess
[Bun77], [WW88].
In MIMIC, instances represent things in the real world. Instances are rec-
ognized to have three kinds of properties, according to the kinds of properties
distinguished in the cognitive literature. Each of these can be represented in,
and is formally defined in, the model. Structural properties describe the state
of an instance in terms of primitive values (e.g., name, height, salary). Rela-
tional properties describe associations among instances. Behavioral proper-
ties describe the constraints or laws that determine the allowable changes in
values of structural and relational properties.
Structural, relational, and behavioral properties are not different in pur-
pose from similar notions developed in semantic and object-oriented data
models. However, in MIMIC, properties are defined in terms of instances or
sets of instances that possess them. This contrasts with the common approach
in semantic models of defining attributes, relationships, and behavior in terms
of classes of entities or objects that possess them. In those models, classes
precede properties; in MIMIC, properties precede, and are independent of,
classes.
In MIMIC, classes are defined intensionally. A class is defined by a set of
structural and relational properties. In addition, since behavioral properties
are defined in terms of constraints on changes to structural and relational
properties, a class also implies the allowable behavior of its instances. The
membership of a class is dynamic, and consists at any given time of the set
of instances that possess all the properties that define the class. Instances
in the model can acquire and lose properties at any time, and therefore can
enter or leave classes without any explicit operation to add or remove them.
This contrasts with the insert and delete operations in relational databases
(typically supported using SQL) to add and remove rows in relations.
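One way to convey the flavor of this idea (emphatically not the MIMIC model
itself, only a plain-SQL approximation with assumed names) is to store instances
and their properties directly and to define a class as a view over the instances
that currently possess all of the class-defining properties:

  CREATE TABLE InstanceProperty (
      InstanceID INTEGER,
      Property   VARCHAR(40),
      Value      VARCHAR(100),
      PRIMARY KEY (InstanceID, Property)
  );

  -- A "class" as a view: the instances possessing both defining properties.
  CREATE VIEW EmployeeClass AS
      SELECT InstanceID
      FROM   InstanceProperty
      WHERE  Property IN ('name', 'salary')
      GROUP  BY InstanceID
      HAVING COUNT(DISTINCT Property) = 2;

Because membership is computed from the properties an instance currently
possesses, an instance enters or leaves the class simply by acquiring or losing
a property, without any explicit reclassification operation.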
The strongest contrast between MIMIC and traditional semantic models
can be seen in the area of classification. In all the models reviewed earlier,
there is no provision that entities or objects can be represented independent
of the type or class to which they belong. Such models can thus be said to
exhibit class dependence. Instances cannot be modeled except as members of
classes.
In MIMIC, however, instances (along with the properties they possess) can
and, in fact, must be modeled independent of any classification. A database
constructed according to an implementation of the MIMIC model would be
first and foremost a database of instances possessing properties. Hence, a
MIMIC representation can be said to exhibit class independence [PW97a].
Classification enters the MIMIC model as a way of facilitating access to
instances. Since users generally think of things in the domain of interest as
members of classes, it is neither natural nor cognitively reasonable to ex-
pect them to relate easily to unclassified instances. Therefore, a classification
structure in which classes are defined as sets of properties based on the con-
cepts relevant to users makes it convenient to both populate a database and
to retrieve information about instances from that database. However, unlike
databases based on traditional semantic models, the class structure does not
form the basis for structuring data in the implementation. Instead, structur-
ing is based on instances and their properties.
The treatment of classification in MIMIC points out another significant
departure from traditional semantic models. Since there is no classification
implied by the underlying instance/property orientation, multiple class struc-
tures can independently exist on top of an underlying instance collection.
Each of these structures can coexist, providing independent views of (por-
tions) of the instance base corresponding to the interests and needs of different
groups of users. Integration of these local schemas or views is not necessary
for developing the underlying database. Instead, the local views can be pre-
served to provide various "windows" to the data.
Under such an instance-based data model, several difficult issues in data
modeling are either non-existent or solved [PW00]. First, schema evolution is
merely a matter of redefining classes, or adding and dropping attributes and
relationships to class definitions. In contrast with conventional approaches,
instances do not need to be "moved" from one class to another as definitions
change. Second, view integration does not have to be performed. Each view
can stand alone and serve as the basis for accessing instances relevant to
specific users to whom the view is meaningful. Since view integration is a
time-consuming and difficult activity, this makes the database design process
easier.
If used as the foundation for a DBMS implementation, the model also
resolves several problems in databases that arise from the class-based models
currently used in commercial products [PW00]. First, database reorganization
is not needed when the database (class) schema evolves (cf. [LH90]). This can
save a large amount of effort and mitigate the need to maintain an outdated
schema in an evolving domain. Second, problems of
integrating distinct databases [BLN86] are reduced, since the integration is
needed only at the instance level, and not at the class level.

7.1 Other Applications of Cognition to Data Modeling


Recently, other efforts at bringing cognitive issues to bear on improving the
semantics of data models have been undertaken. For example, the ER and
semantic data models have limited ability to capture relationships among
entities. Ramesh and Browne [RB99] have observed that other kinds of rela-
tionships pervade our thinking about things in the world, and there exist no
explicit constructs in semantic data models to represent these kinds of relationships.
Specifically, people regularly make use of causal, motivational, and hierarchi-
cal relationships to understand the interaction of things in the real world.
Given the pervasiveness of these constructs in human thinking, an obvious
issue arises regarding the impact of the absence of appropriate modeling con-
structs on the ability to express these relationships in a semantic data model.
Ramesh and Browne studied people's ability to express causal relationships in
an ER model versus other (free-form) methods of modeling a stated problem
description. They found that people tend to omit such causal relationships
from a model if an appropriate modeling construct is not available, even if
the relationship is important to the problem at hand (and should affect the
data model). Such a finding suggests that we may need to further extend
semantic models to incorporate constructs to represent additional aspects of
knowledge organization that can affect the design and use of a database. Ad-
ditional work needs to be done, however, to determine the extent to which
such issues arise in database design and affect design decisions.

8 Research Directions
Recent areas of intensive research in the database field have not focused
on data modeling. However, there are reasons to believe that cognitive ap-
proaches to data modeling can inform other questions of interest in the field.
As outlined above, adopting a pluralistic view of classification, as in the
MIMIC model, promises to help deal with the complexity of combining infor-
mation from existing independent and heterogeneous data sources. As cor-
porations merge and business becomes more global, and as valuable new
databases are made available over the Internet, the need to combine existing
databases has never been greater. This has sparked a great interest in re-
search on database integration [SL90]. However, existing research has drawn
almost exclusively on a "class-based" data modeling paradigm such as that
of the ER and related semantic data models. As a result, methodologies for
developing global or federated database schemas to support interoperability
among databases face the known difficult problem of integrating classes across
databases. By recognizing that class definitions can be conceptually separated
from the actual database contents (instances), integration may need only be
performed at the instance level, leaving each independent database to manage
its own schema while instance-level matching is used to combine data about
things represented in the databases [PW00]. Additional research is needed to
explore the architectures that can support instance-based interoperation.
To date, research in data modeling has presumed the existence of well-
defined concepts. That assumption has proven to be reasonable for many
kinds of transaction databases that serve organizations today. Data about
employees, customers, and products needed to support day-to-day opera-
tions can be reasonably precisely specified. However, recently there has been
a great deal of interest in exploiting large transaction databases to iden-
tify patterns and relationships in the data that were not anticipated during
database design. This data mining may lead to the identification of classifi-
cations which, though they do not stand up to the rigorous requirements of
well-defined classical concepts, are nonetheless useful in describing similari-
ties of large numbers of entities or transactions. For this kind of application,
there may be much to be gained by examining in more depth the proto-
typical and exemplar-based forms of classification. Given the ability of these
models to better account for observed classification behavior by humans, it
would be valuable to develop data mining algorithms which seek to identify
prototype-based schemas over a database of instances to find potentially use-
ful but unanticipated relationships, or which seek to build clusters of related
instances which reflect less tangible patterns.
Another potential application of prototype-based classification in data-
base development involves the possibilities to develop query procedures which
return results that satisfy a query condition with a specified degree of cer-
tainty, or that satisfy a specified percentage of a query's conditions. To date,
the research done in this area has been on the subject of probabilistic SQL
[DS97]. Probabilistic SQL works on probabilistic relations, in which each tu-
ple has an attribute indicating the probability that the values of the other
attributes of the tuple are true. However, the definition of each table in terms
of attributes is fixed. In contrast, queries that work on instances located some
distance from a central tendency may open many new doors for information
retrieval. Such an approach may be useful when applied to search engines
operating on data distributed over the Internet. It may also help in data mining.
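As a purely illustrative sketch (ordinary SQL over an assumed table, not the
syntax proposed in [DS97]), one can picture a probabilistic relation as carrying
an explicit probability attribute per tuple, so that a query returns only answers
meeting a chosen certainty threshold:

  SELECT Name, Salary, Prob
  FROM   EmployeeProb               -- hypothetical probabilistic relation
  WHERE  Prob >= 0.8;               -- only answers with at least 80% certainty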
Another area of growing interest in database research involves broadening
the kinds of data stored in a database. In particular, there is a great deal
of interest in multimedia databases that may contain sounds, images, and
video. Two key questions in the development of such databases are how to
model multimedia objects and how to effectively and efficiently find these
objects through an appropriate query procedure (e.g., based on terms to index
these objects). Since multimedia processing is another fundamental aspect of
cognition, the application of research in cognition on image processing may
provide useful insights into the development of multimedia database models
and systems.
In addition, the rapid evolution of the Internet and its vast and growing
collection of diverse, unstructured and semi-structured information has given
rise to a tremendous need for mechanisms to organize this vast repository to
identify similarities among bodies of documents. Each of the three cognitive
models of classification discussed in this paper offers a potentially useful way
to infer useful classifications of documents on the World Wide Web. By treat-
ing documents as instances with specified properties (perhaps using XML
tags), it is possible to develop classes of documents as those that share speci-
fied properties. This can serve as the basis for search engine functioning. Also,
the instance-based model can be useful in developing a conceptual model of
a hypertext-based information repository. Similarly, prototype-based and ex-
emplar approaches to classification can inform research on organizing large
collections of hypertext materials. For example, these may help in develop-
ing measures of similarity among documents in the absence of well-defined
classifications.
Finally, additional research is needed to examine opportunities for devel-
oping dynamic information sources that adapt their presentation by infer-
ring user preferences from browsing behavior. Indicators such as time spent
visiting a document, combined with an indexing of documents on relevant
attributes can be useful in dynamically reordering documents to help users
find what they need more quickly. This currently cannot be handled well
with a pre-classified site where page classification and sequencing are prede-
termined. In addition to applying this principle to hypertext documents in
general, it has special applicability to on-line catalogs. These currently tend
to be organized according to product hierarchies. Since relevant categoriza-
tions can vary among individuals and over time, a fixed classification may
not be useful in helping a consumer find a desired item quickly. Moreover, if
a consumer cannot find an item quickly, s/he may move on and the retailer
may lose the opportunity to make a sale. The usefulness of instance-based
approaches in this domain needs to be studied.

References

[AC085] Albano, A., Cardelli, L., Orsini, R., Galileo: a strongly-typed inter-
active conceptual language, ACM Transactions on Database Systems
10(2), 1985, 230-260.
[BCG+87] Banerjee, J., Chou, H.-T., Garza, J., Woelk, D., Ballou, N., Kim, H.-
J., Data model issues for object-oriented applications, ACM Trans-
actions on Office Information Systems 5(1), 1987, 3-26.
[BLN86] Batini, C., Lenzerini, M., Navathe, S., A comparative analysis of
methodologies for database schema integration, ACM Computing
Surveys 18(4), 1986, 323-364.
[Boo91] Booch, G., Object-oriented design with applications, Benjamin/Cum-
mings, Redwood City, CA, 1991.
[BRJ99] Booch, G., Rumbaugh, J., Jacobson, I., The unified modeling language
user guide, Addison Wesley, Reading, MA, 1999.
[Bro81] Brodie, M.L., On modelling behavioural semantics of databases, Proc.
7th International Conference on Very Large Databases (VLDB'81),
Cannes, France, 1981, 32-42.
[Bro84] Brodie, M.L., On the development of data models, in [BMS84], 1984,
19-47.
[BMS84] Brodie, M.L., Mylopoulos, J., Schmidt, J.W., On conceptual mod-
elling: perspectives from artificial intelligence, databases, and pro-
gramming languages, Springer-Verlag, New York, 1984.
[Bub80] Bubenko, J., Information modeling in the context of information
systems development, Information Processing 1980, North-Holland,
1980, 395-411.
[Bun77] Bunge, M., Treatise on basic philosophy, vol. 3, Ontology I, The fur-
niture of the world, Reidel, Boston, MA, 1977.
[Bun79] Bunge, M., Treatise on basic philosophy, vol. 4, Ontology II, A world
of systems, Reidel, Boston, MA, 1979.
[Che76] Chen, P.P.-S., The Entity-Relationship model: toward a unified view
of data, ACM Transactions on Database Systems 1(1), 1976, 9-36.
[Cod70] Codd, E.F., A relational model of data for large shared data banks,
Communications of the ACM 13, 1970, 377-387.
[COD71] CODASYL Data Base Task Group Report, New York, ACM, 1971.
[COD78] CODASYL Data Description Language Journal of Development, Ma-
terial Data Management Branch, Department of Supply and Services,
Ottawa, Ontario, 1978.
[Cod79] Codd, E.F., Extending the database relational model to capture more
meaning, ACM Transactions on Database Systems 4(4), 1979, 397-
434.
[CY91] Coad, P., Yourdon, E., Object-oriented analysis, Prentice-Hall, En-
glewood Cliffs, NJ, 1991.
[DS97] Dey, D., Sarkar, S., Extending SQL support for uncertain data, Con-
ceptual Modeling - ER '97, Proc. 16th International Conference on
Conceptual Modeling, Springer-Verlag, New York, 1997, 102-112.
[EKW92] Embley, D., Kurtz, B., Woodfield, S., Object-oriented systems anal-
ysis: a model-driven approach, Prentice-Hall, Englewood Cliffs, NJ,
1992.
[EN89] Elmasri, R., Navathe, S.B., Fundamentals of database systems, Ben-
jamin/Cummings, Redwood City, CA, 1989.
[GS94] Goldstein, R.C., Storey, V.C., Materialization, IEEE Transactions on
Knowledge and Data Engineering 6(5), 1994, 835-842.
[HK87] Hull, R., King, R., Semantic database modeling: survey, applications,
and research issues, ACM Computing Surveys 19(3), 1987, 201-260.
[HM81] Hammer, M., McLeod, D., Database description with SDM: a seman-
tic database model, ACM Transactions on Database Systems 6(3),
1981, 351-386.
[Ken78] Kent, W., Data and reality: basic assumptions in data processing re-
considered, North-Holland, Amsterdam, 1978.
[Lak87] Lakoff, G., Women, fire, and dangerous things: what categories reveal
about the mind, University of Chicago Press, Chicago, IL, 1987.
[LH90] Lerner, B.S., Habermann, A.N., Beyond schema evolution to database
reorganization, Proc. Conference on Object-Oriented Programming
Systems, Languages, and Applications / European Conference on
Object-Oriented Programming (ECOOP/OOPSLA '90), 1990, 67-76.
[MS84] Medin, D.L., Smith, E.E., Concepts and concept formation, Annual
Review of Psychology 35, 1984, 113-138.
[Myl91] Mylopoulos, J., Conceptual Modeling and Telos, P. Loucopoulos, R.
Zicari (eds.), Conceptual modeling, databases, and CASE: an inte-
grated view of information systems development, McGraw-Hill, New
York, 1991.
[Nav92] Navathe, S.B., Evolution of data modeling for databases, Communi-
cations of the ACM 35(9), 1992, 112-123.
[NEL86] Navathe, S.B., Elmasri, R., Larson, J., Integrating user views in
database design, IEEE Computer, June 1986, 50-62.
[Nij76] Nijssen, G., A gross architecture for the next generation database
management systems, G. Nijssen (ed.), Modelling in Database Man-
agement Systems, North-Holland, 1976, 1-24.
[Par96a] Parsons, J., An information model based on classification theory,
Management Science 42(10), 1996, 1437-1453.
[Par96b] Parsons, J., An experimental investigation of local versus global
schemas in conceptual data modeling, Proc. 6th Workshop on Infor-
mation Technologies and Systems (WITS96), Cleveland, OH, 1996,
61-70.
[PW97a] Parsons, J., Wand, Y., Choosing classes in conceptual modeling, Com-
munications of the ACM 40(6), 1997, 63-69.
[PW97b] Parsons, J., Wand, Y., Using objects for systems analysis, Commu-
nications of the ACM 40(12), 1997, 104-110.
[PW00] Parsons, J., Wand, Y., Emancipating instances from the tyranny of
classes in information modeling, ACM Transactions on Database Sys-
tems 23(2), 2000, 228-268.
[PM88] Peckham, J., Maryanski, F., Semantic data models, ACM Computing
Surveys 20(3), 1988, 153-189.
[Qui68] Quillian, R., Semantic Memory, M. Minsky (ed.), Semantic Informa-
tion Processing, MIT Press, Cambridge, MA, 1968.
[RB99] Ramesh, V., Browne, G.J., Expressing causal relationships in concep-
tual database schemas, Journal of Systems and Software 45, 1999,
225-232.
[Ros78] Rosch, E., Principles of Categorization, E. Rosch, B. Lloyd (eds.),
Cognition and categorization, Erlbaum, Hillsdale, NJ, 1978, 27-48.
[RBP+91] Rumbaugh, J., Blaha, M., Premerlani, W., Eddy, F., Lorensen, W.,
Object-oriented modeling and design, Prentice-Hall, Englewood Cliffs,
NJ, 1991.
[RN88] Rumelhart, D., Norman, D., Representations in memory, Stevens'
Handbook of Experimental Psychology (vol. 2): Representations in
Memory, 1988, 511-587.
[Sha84] Shaw, M., The impact of modelling and abstraction concerns on mod-
ern programming languages, in [BMS84], 1984, 49-78.
[Shi81] Shipman, D.W., The functional data model and the data language
DAPLEX, ACM Transactions on Database Systems 6(1), 1981, 140-
173.
[SL90] Sheth, A., Larson, J., Federated database systems for managing dis-
tributed, heterogeneous, and autonomous databases, ACM Comput-
ing Surveys 22(3), 1990, 184-236.
[SM81] Smith, E.E., Medin, D.L., Categories and concepts, Harvard Univer-
sity Press, Cambridge, MA, 1981.
[Smi78] Smith, E.E., Theories of semantic memory, W.K. Estes (ed.), Hand-
book of Learning and Cognitive Processes, vol. 6, Erlbaum, Hillsdale,
NJ, 1978, 1-56.
[Smi88] Smith, E.E., Concepts and thoughts, R. Sternberg, E.E. Smith (eds.),
The Psychology of Human Thought, Cambridge University Press,
Cambridge, England, 1988.
[SS77] Smith, J.M., Smith, D.C.P., Database abstractions: aggregation and
generalization, ACM Transactions on Database Systems 2(2), 1977,
105-133.
[Sow76] Sowa, J.F., Conceptual graphs for a database interface, IBM Journal
of Research and Development 20(4), 1976.
[Sto91] Storey, V.C., Meronymic relationships, Journal of Database Admin-
istration 2(3), 1991, 22-35.
[Sto93] Storey, V.C., Understanding semantic relationships, VLDB Journal
2(4), 1993, 455-488.
[Teo90] Teorey, T.J., Database modeling and design: the entity relationship
approach, Morgan Kaufmann, 1990.
[TYF86] Teorey, T.J., Yang, D., Fry, J.P., A logical design methodology for
relational databases using the Extended Entity-Relationship Model,
ACM Computing Surveys 18(2), 1986, 197-222.
[TL82] Tsichritzis, D.C., Lochovsky, F.H., Data models, Prentice-Hall, En-
glewood Cliffs, NJ, 1982.
[Wan89] Wand, Y., A proposal for a formal model of objects, W. Kim, F.
Lochovsky (eds.), Object-Oriented Concepts, Databases, and Applica-
tions, Addison-Wesley, Reading, MA, 1989, 537-559.
[WW88] Wand, Y., Weber, R., An ontological analysis of some fundamental
information systems concepts, Proc. 9th International Conference on
Information Systems, Minneapolis, MN, 1988, 213-225.
[WW93] Wand, Y., Weber, R., On the ontological expressiveness of informa-
tion systems analysis and design grammars, Journal of Information
Systems, 1993, 217-237.
[WW95] Wand, Y., Weber, R., Towards a deep structure theory of information
systems, Journal of Information Systems, 1995, 203-223.
[WSW99] Wand, Y., Storey, V., Weber, R., An ontological analysis of the re-
lationship construct in conceptual modeling, ACM Transactions on
Database Systems 24(4), 1999, 494-528.
[WWW90] Wirfs-Brock, R., Wilkerson, B., Wiener, L., Designing object-oriented
software, Prentice-Hall, Englewood Cliffs, NJ, 1990.
4. Object-Oriented Database Systems

Alfons Kemper¹ and Guido Moerkotte²

1 Fakultät für Mathematik und Informatik, Universität Passau, Passau, Germany


2 Fakultät für Mathematik und Informatik, Universität Mannheim, Mannheim,
Germany

1. Introduction and Motivation ...................................... 80
1.1 Assessment of Relational Database Technology ................ 80
1.2 Advantages of Object-Oriented Modeling...................... 84
1.3 The ODMG Standard ......................................... 85
2. Object-Oriented Data Modeling................................... 85
2.1 The Running Example ........................................ 86
2.2 Properties of Objects .......................................... 87
2.3 Object Type Definition ........................................ 88
2.4 Modeling Behavior: Operations ................................ 95
2.5 Inheritance and Subtyping .................................... 96
2.6 Example Type Hierarchy ..................................... 100
2.7 Refinement of Operations and Dynamic (Late) Binding....... 101
2.8 Multiple Inheritance ......................................... 104
3. The Query Language OQL ....................................... 106
3.1 Basic Principles .............................................. 106
3.2 Simple Queries ............................................... 106
3.3 Undefined Queries ........................................... 108
3.4 Queries with Select-From-Where-Blocks ...................... 108
3.5 Nested Queries ............................................... 110
3.6 Grouping and Ordering ...................................... 112
3.7 Views........................................................ 113
3.8 Conversion ................................................... 114
3.9 Abbreviations................................................ 115
4. Physical Object Management .................................... 117
4.1 OID Mapping ................................................ 117
4.2 Pointer Joins ................................................. 120
4.3 Pointer Swizzling ............................................. 124
4.4 Clustering .................................................... 129
4.5 Large Object Management ................................... 133
5. Architecture of Client-Server-Systems ............................ 135
5.1 Query Versus Data Shipping................................. 136
5.2 Page Versus Object Server ................................... 137
6. Indexing ......................................................... 139
6.1 Access Support Relations: Indexing Path Expressions ......... 139
6.2 Function Materialization ..................................... 149
6.3 Indexing Over Type Hierarchies .............................. 155
7. Dealing with Set-Valued Attributes.............................. 160
7.1 Introduction ................................................. 160
7.2 Join Algorithms for Set-Valued Attributes.................... 161
7.3 Indexing Set-Valued Attributes............................... 163
8. Query Optimization .............................................. 164
8.1 Overview .................................................... 164
8.2 NFST ........................................................ 166
8.3 Rewrite I .................................................... 171
8.4 Query Optimization .......................................... 181
8.5 Rewrite II and Code Generation ............................. 186
9. Conclusion....................................................... 186

Abstract. This section introduces the reader to object-oriented databases. Af-
ter a brief motivation, in which we assess the disadvantages of relational database
technology and present the advantages of object-oriented technology, we introduce the
reader to object-oriented modeling. The main modeling constructs are discussed
and illustrated by examples. We then give an introduction to OQL, ODMG's query
language for object-oriented databases. Then come the technical issues like physical
object management, architectures of client-server systems, indexing, dealing with
set-valued attributes, and optimizing OQL.

1 Introduction and Motivation


1.1 Assessment of Relational Database Technology

Before introducing the object-oriented data model we will first assess the
shortcomings of the relational data model. In order to illustrate our discus-
sion, we will consider the following data modeling task: boundary representa-
tion of solid geometric objects. The conceptual schema of a simple boundary
representation of polyeder objects is graphically depicted in Figure 1.1.
The schema consists of four entity sets: Polyeder modeling the highest
level abstraction of a solid geometric object; Faces modeling the outer hull
of a Polyeder in the form of polygons; Edges, which represent the boundaries
of the polygons; and finally Vertices, which contain the metric information
in the form of X, Y, Z coordinates. The four entity sets are associated by
three N : M relationship types: Hull, Boundary, and StartEnd. We assume
that distinct Polyeders have distinct Faces which makes the relationship Hull
1 : N. In the ER-diagram we specify the cardinalities of these relationships
more precisely using the so-called (min,max)-notation. For example, every
edge of a polyeder bounds exactly two faces; every vertex is associated with
at least three edges; and every edge is bounded by two vertices.
Exploiting these cardinalities we can derive the concise relational database
schema shown in Figure 1.1b. The depicted database extension includes some
of the tuples representing a geometric object of type cuboid, which is iden-
tified within the Polyeder relation by "cubo#5". The relationship Boundary
and StartEnd are both represented by foreign keys in the relation Edges:

• The attributes F1 and F2 of relation Edges are foreign keys referring to


the two Faces that are bounded by the corresponding edge.
• The attributes V1 and V2 of relation Edges are foreign keys referring to
the two Vertices that bound the edge.

Furthermore, we represented the 1 : N-relationship Hull within the relation


Faces as a foreign key PolyID which refers to relation Polyeder.
The relational model of this application incurs several shortcomings:

Segmentation. The information belonging to one application object, e.g.,


the Polyeder named "cubo#5", is segmented and stored in several relations.
Reconstructing such an application object requires obtaining the information
- via join queries - from these different relations. Thus, a database user has
to know the intrinsic details of the schema in order to "work with" such a
well-defined application object.

Artificial keys. The key attribute values have to be chosen unique. It is


not sufficient to guarantee uniqueness within one complex application object,

[Figure 1.1(a), the entity-relationship schema over the entity sets Polyeder, Faces,
Edges, and Vertices, is not reproduced here. Figure 1.1(b) shows the derived
relational schema with an example extension:]

Polyeder:  PolyID    weight   material  ...
           cubo#5    25.765   iron
           tetra#7   37.985   glass

Faces:     FaceID    PolyID   surface
           f1        cubo#5   ...
           f2        cubo#5   ...
           ...       ...      ...
           f6        cubo#5   ...
           f7        tetra#7  ...

Edges:     EdgeID    F1   F2   V1   V2
           e1        f1   f4   v1   v4
           e2        f1   f2   v2   v3
           ...

Vertices:  VertexID  X     Y     Z
           v1        0.0   0.0   0.0
           v2        1.0   0.0   0.0
           ...
           v8        0.0   1.0   1.0

Fig. 1.1. Boundary representation of Polyeders: (a) entity-relationship schema and
(b) relational schema

because different application objects are mapped onto the same relational
schema. For example, the FaceID values have to be unique within the entire
relation Faces; i.e., no other faces may assume the identifier values "f1", ...,
"f6", which are assigned to the Faces belonging to the Polyeder "cubo#5".

The burden of guaranteeing uniqueness of the identifier attributes is placed


on the shoulders of the end user.

Lack of data abstraction. The relational model has only one very simple
structuring concept, the relation. In advanced application domains more het-
erogeneous structures occur. A complex object may be composed of a variety
of differently structured subobjects as, for example, a Polyeder in boundary
representation. A natural (and user-friendly) representation of such complex
objects demands more sophisticated abstraction mechanisms than the rela-
tional model offers. In particular, aggregation of different part-objects to form
a higher-level composite object and type hierarchies to support the concepts
of generalization and specialization should be integrated in the data model -
albeit they are not supported in the pure relational model.

Lacking object behavior. From an application programmer's point of


view, an external object, i.e., one that can be identified in the application
domain, consists of two dual dimensions:

1. The structural representation, which models the current state of the ap-
plication object.
2. The behavioral specification, which consists of an interface of operations
by which the object can be queried and modified.

While the first representational dimension of application-specific objects


can be mapped to a relational schema (with the aforementioned disadvan-
tages, though), the behavioral dimension of objects is entirely missing in
conventional relational databases.

External programming interface mismatch. The programming lan-


guage interface developed for the standard SQL language suffers from a severe
mismatch. It attempts to couple two systems that are based on entirely different
execution paradigms:

• General-purpose programming languages, such as C, C++, etc., execute


data in a mode that can be termed "record-at-a-time" .
• Relational database systems handle data in a set-oriented fashion.

Therefore, any embedded database operation that returns a set of tuples,


i.e., a relation, as a result causes problems because none of the widely used
programming languages possesses set-oriented operators.
In Figure 1.2 the aforementioned problems of the relational model are
graphically illustrated. The graphic highlights that the relational database
schema is not aware of the application-specific behavior that
is associated with the objects; e.g., the operation rotate can only be modeled

[Figure 1.2 shows applications A and B, each containing its own copy of the
application-specific operations and its own transformation procedure (TA and TB,
respectively), sitting on top of the relational database.]

Fig. 1.2. Visualizing the "impedance mismatch"

as part of the application program. This has the disadvantage that the data-
base system cannot serve as a repository for the operations. This makes the
sharing of application-specific operations among different applications, say,
applications A and B, difficult. In practice, one often experiences that the
same operations are coded multiple times by different application programmers, as
exemplified in the graphic.

In order to implement such application-specific operations it is typically


necessary to reconstruct the application objects in data structures of the
programming language. For this purpose tedious transformations have to be
coded that retrieve the information concerning one external object from the
base relations - where this information is segmented. Again, these transfor-
mations are very often replicated by different applications as exemplified by
the two transformation procedures TA and TB in the graphic. Also this trans-
formation process can be very time-consuming, since it typically involves a
join over many base relations - as in our boundary representation example.
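To make the transformation problem concrete, the following C++ sketch shows the kind of code a procedure such as TA has to contain for the boundary representation of Figure 1.1: the database delivers the result of the join over the base relations as flat records, and the application reassembles them - one record at a time - into nested in-memory structures. The struct layout, the JoinRow record, and the assemble function are illustrative assumptions of this sketch, not part of any standard API.

#include <string>
#include <vector>

// Target in-memory representation that a procedure like TA has to rebuild.
struct Edge     { std::string id, f1, f2, v1, v2; };
struct Face     { std::string id; std::vector<Edge> boundary; };
struct Polyeder { std::string id; double weight; std::string material; std::vector<Face> hull; };

// One flat record of the join of Faces and Edges, as a relational system would
// deliver it; the nesting has to be re-established by the application.
struct JoinRow  { std::string faceId, edgeId, f1, f2, v1, v2; };

// Record-at-a-time reassembly of the segmented information
// (rows are assumed to be sorted by FaceID).
Polyeder assemble(const std::string& polyId, double weight, const std::string& material,
                  const std::vector<JoinRow>& rows) {
    Polyeder p{polyId, weight, material, {}};
    for (const JoinRow& r : rows) {                       // one flat record at a time
        if (p.hull.empty() || p.hull.back().id != r.faceId)
            p.hull.push_back(Face{r.faceId, {}});         // a new face starts
        p.hull.back().boundary.push_back(Edge{r.edgeId, r.f1, r.f2, r.v1, r.v2});
    }
    return p;
}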

1.2 Advantages of Object-Oriented Modeling

The advantages of object-oriented modeling are sketched in Figure 1.3. In


object-oriented data models the structural and the behavioral components
of objects are represented in a uniform schema. Thus, the operations natu-
rally associated with application objects are an integral part of the database
schema. Thereby, the sharing of operations (i.e., software reuse) among dif-
ferent applications is directly supported by the data model.
By the principle of information hiding, an object, such as our cuboid, is
shielded from arbitrary manipulations. Only the predefined operations, such
as rotate, scale, etc., are applicable.
The segmentation of an application object is not visible in the object-
oriented representation, because the semantically rich operations are associ-
ated with an object and hide the internal structural representation from the
object's clients.
Since the programming language, in which the operations are coded, is
integrated in the object model, the transformation process - to transform
the objects into data structures of the application programming language -
that is needed in the relational model for coding computationally complex
operations becomes obsolete.

[Figure 1.3 shows applications A and B invoking operations such as
someCuboid→rotate('x', 10) and w := someCuboid→weight() directly on objects
of the object-oriented database, which also provides operations like scale,
volume, and translate.]

Fig. 1.3. Visualizing the advantages of object-oriented data modeling



1.3 The ODMG Standard


In order to define a uniform, widely accepted object-oriented data model
standard the Object Database Management Group - abbreviated ODMG -
was established [CBB+97]. ODMG is a joint standardization effort of several
vendors of object-oriented database products.
ODMG's goal was to define a portability standard which allows database
users to migrate from one object-oriented database product to another one
at relatively low re-development cost. Its goal was not to come up with
an interoperability standard which would allow uniform access to several
object-oriented database products in the same application.
The ODMG standardization currently supports three application lan-
guages, namely C++, Java and Smalltalk - as illustrated in Figure 1.4. The
ODMG data model shields the application from the vendor-specific object
model by providing a uniform interface - called language binding - within
these application programming languages. In addition, the ODMG standard
defines an SQL-like query language, called Object Query Language (OQL).

[Figure 1.4 shows C++, Java, and Smalltalk applications accessing the DBMS
through the common ODMG object model.]

Fig. 1.4. Integration of the ODMG object model

2 Object-Oriented Data Modeling


Our discussion of object-oriented data modeling concepts will be based on
the ODMG object model. Furthermore, the ODMG standard includes an
application programming language-independent Object Definition Language
(ODL) which we use to illustrate our examples.

2.1 The Running Example

Figure 2.1 shows a part of a university administration database applica-


tion. The graphical representation is based on Booch's notation [Boo94]. The
"clouds" represent object types (corresponding to entity sets), the labeled
edges represent binary relationships between the object types, the bold-face
arrows denote inheritance relationships and point from the subtype to the su-
pertype. The objects' attributes and their type constraints are listed within
the clouds.

[Figure 2.1 (in Booch's notation) is not reproduced here. It shows the object
types Students (StudentID: integer, Name: string, Semester: integer), Courses
(CourseNo: integer, Title: string, Duration: integer), Exams (ExDate: date,
Grade: number), Assistants (Expertise: string), Professors (Rank: string),
Employees (BirthDate: date, Name: string, SS#: integer), and Rooms (Size:
integer, RoomNo: integer); the labeled relationships among them (e.g., enrolled,
taking, contents, teaching, giving, works for, office); and bold-face inheritance
arrows pointing from Assistants and Professors to Employees.]

Fig. 2.1. An object-oriented model of (a part of) a university administration



2.2 Properties of Objects

An object consists of three components:

• Object Identity: Every object has an associated identifier that remains


invariant throughout its life time.
• Type: An object is obtained by instantiating an object type. The object
type determines the structural representation (i.e., attributes and rela-
tionships) and the behavior of the objects belonging to the type's extent.
• Value or State: The values of the attributes and the currently established
relationships to other objects constitute the current state of an object.

In order to illustrate these three components of an object let us concen-


trate on only three example objects of the University Administration example
(cf. Figure 2.2). For example, id1 denotes the (abstract) object identifier of
the object named "Knuth" which is of type Professors. Object identifiers are
used to refer to objects. For example, the identifier id1 is used in object id2
- or more precisely, in the object identified by id2 - to associate Knuth as
the teacher of the particular course. The objects' states are given within the
boxes. The values are not constrained to be atomic; for example, the object
named Knuth contains the set teaches which is denoted by braces { ... }.

id1 (Professors): SS#: 2137, Name: "Knuth", Rank: "full", residesIn: id9,
    givenExams: {...}, teaches: {id2, id3}
id2 (Courses): CourseNo: 5001, Title: "Foundations", Duration: 4, taughtBy: id1,
    enrollment: {...}, successors: {...}, predecessors: {...}
id3 (Courses): CourseNo: 4630, Title: "The Art", Duration: 4, taughtBy: id1,
    enrollment: {...}, successors: {...}, predecessors: {...}

Fig. 2.2. Some objects of the university administration example



2.3 Object Type Definition


There is some confusion concerning the terminology in the literature. We
will generally use the term object type. Some authors prefer the term class;
however, its use is ambiguous because sometimes the class is meant a type
specification and other times the set of all object that were instantiated from
a particular type. We prefer to clearly differentiate between the two concepts
by using object type to refer to the schema definition and extent (or extension)
to refer to the (dynamically changing) set of instances.

Attributes. Let us first define the attributes of objects of type Professors:


class Professors {
attribute long SS#;
attribute string Name;
attribute string Rank;
};
Attributes are defined by their domain and name. In this example, we
constrained ourselves to atomic value domains, i.e., long and string. How-
ever, in the ODMG model structured values are also possible. For example,
in the next subsection we will define a composite value type for the attribute
ExDate of Exams.

Relationships. We want to illustrate the relationship definitions by exam-


ples from our university administration.

1:1-relationships. Let us first concentrate on the following 1:1-relationship:

    Professors  1 ------------ 1  Rooms

In the ODMG model such a relationship can be represented "symmetrically"


in both object types, Professors and Rooms. In Professors we name this
relationship residesIn and in Rooms it is named occupiedBy. In both cases
the relationship assumes references to the corresponding objects as values:
residesIn is constrained to refer to a Rooms object and occupiedBy is con-
strained to refer to a Professors object.
Now we obtain the (still incomplete) object type definitions:

class Professors {
attribute long SS#;

relationship Rooms residesIn;


};

class Rooms {
attribute long RoomNo;
attribute short Size;

relationship Professors occupiedBy;


};

Thus we have defined the relationship office of Figure 2.1 in both "directions"
- from Professors via residesIn to Rooms as well as vice versa from Rooms
via occupiedBy to Professors. For a very small part of a university database
the example objects are shown in Figure 2.3.

id1 (Professors): SS#: 2137, Name: "Knuth", Rank: "full", residesIn: id9,
    givenExams: {...}, teaches: {...}
id9 (Rooms): RoomNo: 007, Size: 18, ..., occupiedBy: id1

Fig. 2.3. Example objects illustrating the symmetry of relationships

Unfortunately, this definition cannot guarantee the consistency of the re-


lationship. This is exemplified in Figure 2.4:

• Violation of Symmetry: Room id9 with RoomNo 007 appears to be still
occupied by Knuth who, however, has moved to a different room, i.e., the
one with identity id8.
• Violation of the 1:1-constraint: This inconsistency also violates the
1:1 functionality of the relationship office because, according to the oc-
cupiedBy values, two rooms - id9 and id8 - are occupied by Knuth.

In order to exclude these kinds of inconsistencies the inverse construct


was incorporated in the ODMG object model. The correct symmetric rela-
tionship definitions would look as follows:
class Professors {
attribute long SS#;

relationship Rooms residesIn inverse Rooms::occupiedBy;


};

class Rooms {
attribute long RoomNo;

id1 (Professors): SS#: 2137, Name: "Knuth", Rank: "full", residesIn: id8,
    givenExams: {...}, teaches: {...}
id9 (Rooms): RoomNo: 007, Size: 18, ..., occupiedBy: id1
id8 (Rooms): RoomNo: 4711, Size: 21, ..., occupiedBy: id1

Fig. 2.4. Inconsistent state of the relationship residesIn/occupiedBy

attribute short Size;

relationship Professors occupiedBy inverse Professors::residesIn;


};

This symmetric relationship definition in both participating object types


enforces the following integrity constraint: p ∈ Professors is referred to by
the occupiedBy relationship of r ∈ Rooms if and only if r is referenced by
relationship residesIn of object p. More concisely this is stated as follows:

    p = r.occupiedBy ⇔ r = p.residesIn


Of course, one could also choose to represent a relationship only unidirec-
tionally in one of the participating object types. This makes updates of
the relationship more efficient but has the disadvantage that the relationship
can be "traversed" only in the one direction. This may make the formulation
of queries (or application programs) more complex.
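Behind the scenes, the inverse clause obliges the system to keep both directions of a symmetric relationship aligned. The following C++-style fragment is only an illustration of this bookkeeping (plain pointers instead of the ODMG binding, and the moveTo helper is an assumption of this sketch); forgetting one of the update steps in moveTo is exactly what produces the inconsistent state of Figure 2.4.

struct Rooms;   // forward declaration

struct Professors {
    Rooms* residesIn = nullptr;
    void moveTo(Rooms* newRoom);
};

struct Rooms {
    Professors* occupiedBy = nullptr;
};

// Maintaining p = r.occupiedBy <=> r = p.residesIn requires updating both sides:
void Professors::moveTo(Rooms* newRoom) {
    if (residesIn) residesIn->occupiedBy = nullptr;      // detach the old room
    if (newRoom && newRoom->occupiedBy)                  // evict any previous occupant,
        newRoom->occupiedBy->residesIn = nullptr;        //   preserving the 1:1 constraint
    residesIn = newRoom;
    if (newRoom) newRoom->occupiedBy = this;             // re-establish the symmetry
}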

1:N-relationships. The relationship teaches is an example of such a one-to-many
relationship type:

    Professors  1 ------------ N  Courses

In an object model such a relationship is represented by a set-valued


relationship in the object type participating "many times" in the relationship
- here Professors:

class Professors {

relationship set (Courses) teaches inverse Courses::taughtBy;


};

class Courses {

relationship Professors taughtBy inverse Professors::teaches;


};

Again, we defined the relationship symmetrically in both object types and


used the inverse construct to enforce the corresponding integrity constraint.

N:M-relationships. The most general form of a relationship type is many-to-many
or N:M. An example is the following:

    Students  N ------------ M  Courses
Now, the relationship is represented by set-valued relationships in both
object types:

class Students {

relationship set (Courses) enrolled inverse Courses::enrollment;


};

class Courses {

relationship set (Students) enrollment inverse Students::enrolled;


};

Again, we used the inverse specification to enforce the integrity constraint:


a student s is enrolled in a course e if and only if s is contained in e's
enrollment. More concisely, this is stated as follows:

    e ∈ s.enrolled ⇔ s ∈ e.enrollment



Recursive N:M-relationships. A binary relationship may also be recur-


sive in the sense that the two participating object types are identical. An
example is the relationship prerequisite that associates Courses with other
Courses - one being the successor and the other being the predecessor. This
relationship happens to be many-to-many and is visualized below:

[Diagram: the recursive N:M relationship on Courses, with the roles successor (N)
and predecessor (M).]

The object-oriented representation of such a recursive relationship is basically


the same as for non-recursive relationships - except that both parts of the
symmetric relationship definition are incorporated into the same object type:

class Courses {

relationship set (Courses) predecessors inverse Courses::successors;


relationship set (Courses) successors inverse Courses::predecessors;
};

Ternary relationships. Non-binary, e.g., ternary relationships are best


modeled as separate object types. Let us illustrate this on the following ex-
ample:

[Diagram: a ternary relationship connecting Courses, Professors, and Students.]

An exam relates Professors, Students, and Courses. By defining an object
type Exams we can model this ternary relationship via three binary
relationships relating an exam with the professor giving the exam, with the
student the exam is given to and with the course being the subject of the
exam:
class Exams {
attribute struct ExDate
{ short Day; short Month; short Year; } ExDate;
attribute float Grade;
relationship Professors givenBy inverse Professors::givenExams;

relationship Students takenBy inverse Students::takenExams;


relationship Courses contents inverse Courses::examinedIn;
};
Let us restate the other object type definitions developed so far:
class Professors {
attribute long SS#;
attribute string Name;
attribute string Rank;
relationship Rooms residesIn inverse Rooms::occupiedBy;
relationship set (Courses) teaches inverse Courses::taughtBy;
relationship set (Exams) givenExams inverse Exams::givenBy;
};

class Courses {
attribute long CourseNo;
attribute string Title;
attribute short Duration;
relationship Professors taughtBy inverse Professors::teaches;
relationship set (Students) enrollment inverse Students::enrolled;
relationship set (Courses) successors inverse Courses::predecessors;
relationship set (Courses) predecessors inverse Courses::successors;
relationship set (Exams) examinedIn inverse Exams::contents;
};

class Students {

relationship set (Exams) takenExams inverse Exams::takenBy;


};
In Figure 2.5 we visualize these object types and their inter-relationships.
Here, the number of arrow heads indicates the functionality of the corre-
sponding relationship: a single-headed arrow denotes a single-valued relation-
ship and a multi-headed arrow represents a multi-valued relationship. The
double-sided arrows express the symmetry of the relationships, which are
included in both object types. Thus, an arrow with a single head on each side
represents a 1:1 relationship that is modeled symmetrically in both object
types, an arrow with a single head on one side and multiple heads on the other
denotes a 1:N relationship, and an arrow with multiple heads on both sides
represents an N:M relationship. The labels on the arrows correspond to the
relationship names in the object types.

Type properties: extents and keys. The extent constitutes the set of
all instances of a particular object type.¹ The extent of an object type can
¹ Further on, we will see that an extent also includes all instances of direct and
indirect subtypes of the object type.

[Figure 2.5 depicts the object types Rooms, Professors, Students, Exams, and
Courses and the symmetric relationships between them: residesIn/occupiedBy,
teaches/taughtBy, givenExams/givenBy, takenExams/takenBy, enrolled/enrollment,
examinedIn/contents, and predecessors/successors.]

Fig. 2.5. Graphical representation of the relationships

serve as an anchor for queries, such as "find all Professors whose Rank is
associate" .
The ODMG model allows one to specify that an extent is automatically main-
tained. Newly created objects are implicitly inserted into the corresponding
extent(s) and deleted objects are removed from the extent.
Furthermore, the ODMG model allows one to specify a set of attributes
as keys. The system automatically ensures the uniqueness of these keys
throughout all objects in the object type's extent.
Extents and keys are object type properties because they are globally
maintained (enforced) for all instances of the object type. In contrast, the
attributes and relationships specified in the type definition are instance prop-
erties because they are associated with every individual object.
Let us illustrate these two type properties on our example object type
Students:
class Students (extent AllStudents key StudentID) {
attribute long StudentID;
attribute string Name;

attribute short Semester;


relationship set (Courses) enrolled inverse Courses::enrollment;
relationship set (Exams) takenExams inverse Exams::takenBy;
};

2.4 Modeling Behavior: Operations


In preceding discussions we emphasized already that the behavioral model-
ing is an integral part of the object schema. The objects' behavior is speci-
fied via type-associated operations. These operations constitute the interface
which provides all the operations to create (instantiate, construct) an object,
delete (destruct) an object, query the object's state and modify the object's
state. The interface encapsulates the object's structural representation be-
cause clients of a particular object need to know only the applicable oper-
ations. The structural representation may be entirely hidden (information
hiding) from the clients.
The operations can be classified into three classes:
• Observers: Observers are functions that return information concerning
the internal state of the object instances to which they are applied. Ob-
servers leave the objects on which they are invoked - and hence the entire
database - invariant.
• Mutators: Mutators are operations that change the internal state of the
object instance on which they are invoked. An object type with at least
one mutator is called mutable, otherwise immutable.
• Constructors and Destructors: Constructors are used to create a new
instance of the respective object type. This is often called instantiation
of the type and the newly created object is called the instance. In contrast,
the destructor is invoked to delete an existing object.
On closer observation, there is a fundamental syntactic difference be-
tween the two operations: A constructor is invoked on a type to generate
a new object whereas a destructor is invoked on an object.
Because of its language independence, the ODL object type definition only
allows one to specify the signatures of the corresponding operations. The imple-
mentation of the operation has to be carried out within the corresponding
application programming language.
The operation's signature corresponds to the invocation pattern and spec-
ifies the following:
• the name of the operation;
• the number and the types of the parameters (if any);
• the type of a possibly returned result; otherwise void;
• an exception that is possibly raised by the operation execution.
Let us now "enrich" one of the example types of our university administration.
Two operations associated with the type Professors are defined as follows:

class Professors {
exception hasNotYetGivenAnyExams { };
exception alreadyFullProf { };

float howTough() raises (hasNotYetGivenAnyExams);


void promoted() raises (alreadyFullProf);
};
We have defined two operations - more precisely operation signatures -
associated with Professors:
• The observer howTough should be implemented such that it returns the
average grade (i.e., a float value) students receive in exams given by the
particular professor. The exception hasNotYetGivenAnyExams is raised
if the particular professor has not given any exams.
• The mutator promoted is used to change the Rank of the professor on
which it is invoked. This operation should be implemented such that an
assistant professor is promoted to associate professor and an associate
professor is promoted to full professor. If invoked on a full professor, the
operation raises the exception alreadyFullProf. A possible implementation
of both operations is sketched below.
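Since ODL fixes only the signatures, the operation bodies must be supplied in an application programming language. The following C++ fragment is one possible, simplified implementation; the data members Rank and examGrades used here merely stand in for the Rank attribute and the Grades of the givenExams relationship and are assumptions of this sketch, not part of the ODMG C++ binding.

#include <exception>
#include <numeric>
#include <string>
#include <vector>

class Professors {
public:
    struct hasNotYetGivenAnyExams : std::exception { };
    struct alreadyFullProf        : std::exception { };

    // Observer: average grade of all exams given by this professor.
    float howTough() const {
        if (examGrades.empty()) throw hasNotYetGivenAnyExams{};
        return std::accumulate(examGrades.begin(), examGrades.end(), 0.0f)
               / examGrades.size();
    }

    // Mutator: assistant -> associate -> full; a full professor cannot be promoted.
    void promoted() {
        if (Rank == "assistant")      Rank = "associate";
        else if (Rank == "associate") Rank = "full";
        else                          throw alreadyFullProf{};
    }

private:
    std::string        Rank = "assistant";
    std::vector<float> examGrades;   // stands in for the Grades of givenExams
};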
The object type for which the operation is defined is called the receiver
type - here Professors is the receiver type. Correspondingly, the object on
which the operation is invoked is called the receiver object and is constrained
to be an instance (direct or indirect - see below) of the receiver type.
The invocation syntax of the operations depends on the particular appli-
cation programming language. In C++, promoted is invoked on the professor
referenced by the variable myFavoriteProf as follows:
myFavoriteProf->promoted();

Within the declarative query language OQL an operation is invoked using


the "arrow" or the "dot" notation:
select p.howTough()
from p in AllProfessors
where p.Name = "Curie";
In this query, the average grade obtained in Curie's exams is determined
by invoking how Tough on the qualifying Professors instance. Note that this
query could possibly produce several values if more than one professor is
named Curie. In this query, the extent AllProfessors of object type Professors
is used as an "entry point" for finding the (presumably) one professor named
"Curie" .

2.5 Inheritance and Subtyping


Inheritance hierarchies are used to relate similar object types. The common
properties are modeled in the so-called supertype and the more specialized

properties are provided in the subtype. The subtype inherits all the properties
of all of its (direct and indirect) supertypes. Inheritance does not only cover
structural properties (attributes and relationships) but also the behavior.
Thereby, a subtype always has a superset of the supertype's properties. This
way object-oriented models can safely allow the so-called substitutability: A
subtype instance can always be substituted at places where a supertype in-
stance is expected. Substitutability is the key factor to achieve a high degree
of flexibility and expressive power in object models.

Terminology. We will illustrate the terminology associated with inheritance


and subtyping on the abstract type hierarchy shown in Figure 2.6. On the
left-hand side, three types called Type1, Type2, and Type3 are organized in
a simple inheritance hierarchy. Type1 is the direct supertype of Type2 and
the (indirect) supertype of Type3. From the opposite view point, Type3 is a
direct subtype of Type2 and an (indirect) subtype of Type1. The inheritance
is illustrated only on a single attribute associated with every type: Objects
of Type1 have the single attribute A, objects of Type2 have one inherited
attribute A and another attribute B associated directly with Type2. Finally,
objects of Type3 have three attributes: A inherited from the indirect super-
type Type1, B inherited from Type2, and C. On the right-hand side, a single
example object (instance) is sketched for each of these types.

[Figure 2.6 shows, on the left, the object types Type1, Type2, and Type3
connected by is-a arrows (Type3 is-a Type2 is-a Type1) and, on the right, one
example instance per type: id1 with attribute A, id2 with attributes A and B,
and id3 with attributes A, B, and C.]

Fig. 2.6. An abstract type hierarchy

Instances of a subtype are implicitly also members of the supertypes' ex-


tent. Thus, in our example the object id3 is an instance of Type3 and also
of Type2 and Type1. This so-called inclusion polymorphism is visualized in

Figure 2.7. The extents are named ExtType1, ExtType2, and ExtType3, re-
spectively. The different sizes of the elements of the particular extents were
chosen to visualize the inheritance: Objects of a subtype contain more infor-
mation/behavior than objects of a supertype.

[Figure 2.7 shows the extent ExtType3 contained in ExtType2, which in turn is
contained in ExtType1.]

Fig. 2.7. Illustration of subtyping

It is this inclusion of the subtype's extent in the supertype's extent that


provides the above-mentioned substitutability.
A subtype instance can be used wherever a supertype instance is re-
quired.
Even our very small example type hierarchy illustrates why substitutabil-
ity "works". A Type3 instance "knows more" than a Type2 instance because
it has all the properties a Type2 instance has and, additionally, it has the C
property. Also, a Type3 instance knows more than instances of its indirect
supertype Type1. Therefore, an application expecting a Typel or a Type2
instance can very well "get along" with a Type3 instance because the ap-
plication expects an instance that has a (true) subset of the properties the
Type3 instance provides.

Single and multiple inheritance. Depending on the number of direct


supertypes an object type can inherit from, two different approaches to in-
heritance are distinguished:

• single inheritance: Every object type has at most one direct supertype.

• multiple inheritance: An object type may have several direct supertypes


all of whose properties it inherits.

In either case - single or multiple inheritance - the directed super/subtype


graph has to be acyclic.
Our simple abstract type hierarchy of Figure 2.6 is, of course, an exam-
ple of single inheritance. In practice, however, the type hierarchy looks much
more complex - even under single inheritance. A more general abstract hier-
archy is shown in Figure 2.8. This type hierarchy is still constrained to single
inheritance. The entire type hierarchy has a single root, i.e., a most general
supertype called Object.²

[Figure 2.8 shows a larger type hierarchy under single inheritance: a tree of
object types (among them OTn) with the single root Object.]

Fig. 2.8. Abstract type hierarchy under single inheritance

A major advantage of single inheritance in contrast to multiple inheritance


is that there is always a unique path from any object type to the root of the

² Actually, in ODMG the root of the type hierarchy of all durable objects is called
d_Object.

type hierarchy. An object type inherits only along this one unique path. For
example, for object type OTn this unique path is:

In contrast, under multiple inheritance there could be different paths from


an object type to the root. This makes multiple inheritance more complex
to handle because an object type may possibly inherit conflicting properties
from these different inheritance paths.

2.6 Example Type Hierarchy

Let us now move from abstract examples to a more practical, though small
example type hierarchy within our university administration. Employees of
a university can be specialized to Professors and Lecturers. This yields the
type structure shown in Figure 2.9. In ODL these types are defined as follows:

class Employees (extent AllEmployees) {


attribute long SS#;
attribute string Name;
attribute date BirthDate;
short Age();
long Salary();
};
class Lecturers extends Employees (extent AllLecturers) {
attribute string Expertise;
};
class Professors extends Employees (extent AllProfessors) {
attribute string Rank;
relationship Rooms residesIn inverse Rooms::occupiedBy;
relationship set (Courses) teaches inverse Courses::taughtBy;
relationship set (Exams) givenExams inverse Exams::givenBy;
};

The super/subtype relationship between types is modeled by specifying


the supertype following the keyword extends. For illustration, the inherited
properties of an object type are shown within dashed ovals in Figure 2.9. The
properties directly associated with an object type are represented in solid
line ovals. Of course, in the ODL type definition only the latter properties
are defined within the corresponding type definition since the other ones are
implicit by inheritance.
For example, an object oProf of type Professors has a true superset of the
properties of a direct instance of type Employees, say oEmp. This, once again

[Figure 2.9 shows the types Employees, Lecturers, and Professors. The features
defined directly in a type are drawn in solid ovals; the features inherited from
Employees (SS#, Name, BirthDate, Age(), Salary()) appear in dotted ovals; the
refined operation Salary() is italicized.]

Fig. 2.9. Inheritance of object properties (dotted ovals contain inherited features,
italicized operations are refined)

illustrates why substitutability works: The Professors have all the "knowl-
edge" that Employees have and can therefore be safely substituted in any
context (operation argument, variable assignment, etc.) where Employees is
expected. Likewise, Lecturers are substitutable for Employees. This inclusion
polymorphism is highlighted in Figure 2.10 which shows that all Professors
and all Lecturers are also contained in the extent of Employees.

2.7 Refinement of Operations and Dynamic (Late) Binding


We emphasized already that, besides the structural representation (attributes
and relationships), also the supertypes' operations are inherited by the sub-
type. In many cases, the implementation of the inherited operation can re-
main the same as in the supertype. For example, the operation Age() that
Professors and Lecturers inherit from Employees should have the exact same

[Figure 2.10 shows the extent AllEmployees containing the extents AllLecturers
and AllProfessors as well as further direct Employees instances.]

Fig. 2.10. The three extents AllEmployees, AllLecturers and AllProfessors

implementation no matter whether it is invoked on an employee, a professor, or


a lecturer. It merely computes the age by "subtracting" the BirthDate from
the current date. That is, any invocation of Age() will execute the same code.
However, there are other inherited operations where one would like to
adapt the coding to the peculiarities of the subtype. In other words, the op-
eration should be refined or specialized in the subtype, thereby overwriting
the coding defined "higher up" in the type hierarchy. We will illustrate opera-
tion refinement on our example operation Salary(). According to the German
system the salary of university employees is computed as follows:

• (Regular) Employees are paid according to the standard formula


2000 + (Age() - 21) * 100
That is, they are paid a base salary of 2000 DM plus an "experience"
supplement of 100 DM for every year exceeding their 21st birthday.
• Lecturers receive a base salary of 2500 DM plus an experience supplement
of 125 DM, i.e.:
2500 + (Age() - 21) * 125
• Professors are paid an even higher base salary of 3000 DM and a yearly
experience supplement of 150 DM. That is, their salary is computed by
the following formula:
3000 + (Age() - 21) * 150

The refinement of operations has to be accounted for during program exe-


cution, i.e., by the run-time system. The substitutability of subtype instances
in place of supertype instances demands the so-called dynamic binding (or
late binding) of refined operations. This way it is ensured that the most spe-
cific implementation is bound (executed) depending on the exact type of the
receiver object.
Let us demonstrate dynamic binding on an example database state shown
in Figure 2.11 where the extent AllEmployees contains just three objects:
• object id1 is a direct Professors instance;
• object id11 is a direct Lecturers instance; and
• object id7 is a direct Employees instance.

AllEmployees: {id1, id11, id7}

id1 (Professors): SS#: 2137, Name: "Knuth", BirthDate: ...
id11 (Lecturers): SS#: 3002, Name: "Zuse", BirthDate: ...
id7 (Employees): SS#: 6001, Name: "Maier", BirthDate: ...

Fig. 2.11. An example extent of AllEmployees

Let us formulate an example query in OQL (cf. Section 3) that computes


the total monthly salary of all employees of the university:
select sum(a.Salary())
from a in AllEmployees
In this query, the Salary() operation is, in turn, invoked on all elements
in the extent AllEmployees. However, depending on the exact type of the
corresponding object in AllEmployees a different implementation of Salary()
is bound at run-time. Logically, this dynamic binding is carried out as follows:
First, the run-time system has to determine the type of the receiver object
(the object on which the operation is invoked). Then the type hierarchy is
searched for the most specific implementation of the operation by starting
at the receiver's type and proceeding towards the root of the type hierarchy
(cf. Figure 2.8) until the first implementation of the operation is found. It is

this coding that is bound and executed. This procedure implies that every
object (logically) knows its most specific type, i.e., the type from which it
was instantiated.
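In the C++ binding, this behavior corresponds to virtual member functions. The following fragment is only an illustration with plain C++ classes (not the actual ODMG binding) and a stubbed Age(); it shows how the three salary formulas given above are bound dynamically, mirroring the OQL query over AllEmployees:

#include <memory>
#include <vector>

class Employees {
public:
    virtual ~Employees() = default;
    short Age() const { return 30; }     // stub; really derived from BirthDate
    virtual long Salary() const { return 2000 + (Age() - 21) * 100; }
};

class Lecturers : public Employees {
public:
    long Salary() const override { return 2500 + (Age() - 21) * 125; }
};

class Professors : public Employees {
public:
    long Salary() const override { return 3000 + (Age() - 21) * 150; }
};

// Analogue of the OQL query: sum the salaries of all employees.
long totalSalary(const std::vector<std::unique_ptr<Employees>>& allEmployees) {
    long sum = 0;
    for (const auto& e : allEmployees)
        sum += e->Salary();              // the most specific Salary() is bound at run time
    return sum;
}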
For our example type hierarchy the determination of the most specific im-
plementation of Salary() is trivial because every type has its own specialized
implementation:
• For the object id1 the Professors-specific Salary()-computation is bound;
• for object id11 the implementation specialized for Lecturers is executed;
and
• for the object identified by id7 the most general implementation defined
in type Employees is dynamically bound.

2.8 Multiple Inheritance


So far, in the examples we constrained ourselves to single inheritance. Let
us briefly illustrate multiple inheritance on the example type TAs (teach-
ing assistants). TAs are Students as well as Employees. Therefore, a natural
modeling would be to make TAs a subtype of both of these types, Students
and Employees, as shown in Figure 2.12.

[Figure 2.12 shows TAs as a direct subtype of both Employees and Students.]

Fig. 2.12. An example of multiple inheritance

Objects of type TAs inherit properties from two direct supertypes:


• They inherit the attributes SS#, Name and the operations Salary() and
Age() from Employees and
• from Students they inherit the properties StudentID, Name, Semester, en-
rolled and takenExams.
Besides the inheritance, one has to consider substitutability as well. TAs
are substitutable for Employees as well as for Students.

Multiple inheritance has the disadvantage that there is no longer a unique


(inheritance) path from anyone object type to the root of the type hierarchy
as was the case under single inheritance (cf. Figure 2.8). This may lead to the
inheritance of conflicting properties. For example, a subtype may inherit an
identically named attribute from both its direct supertypes with conflicting
type constraints. Another example would be the inheritance of different
implementations of the same operation along the different inheritance paths.
In order to avoid such conflicts or ambiguities, the latest version of ODMG
rules out multiple inheritance but allows multiple subtyping by incorporating
the interface concept as found in the programming language Java [AG96].
An interface does not define any structural representation; it just specifies an
interface consisting of a set of operation signatures. Thus an interface cannot
be used to instantiate objects; it has to be implemented by an object type
in order to be useful. In ODMG an object type can inherit from (extend) a
single supertype only but it may (in addition) implement several interfaces
by defining all their operations. Ambiguities and conflicts are ruled out by
a drastic measure: An object type cannot implement interfaces that have
contradicting signatures for the same operation.
For our example of modeling teaching assistants one could choose to define
an interface EmployeeIF and then implement this interface in Employees and
TAs. These definitions would be specified in ODL as follows:

interface EmployeeIF {
short Age();
long Salary();
};
class Employees: EmployeeIF (extent AllEmployees) {
attribute long SS#;
attribute string Name;
attribute date BirthDate;
};
class TAs extends Students: EmployeeIF (extent AllTAs) {
attribute long SS#;
attribute date BirthDate;
attribute short WorkLoad;
};

Let us concentrate on the object type TAs: It inherits all the features
(properties and behavior) of type Students and, in addition, it implements
the interface EmployeeIF. This makes TAs substitutable in any context where
Students or EmployeeIF objects are expected. However, TAs cannot be sub-
stituted for Employees because the two types are unrelated - in our example
model.

3 The Query Language OQL

The query language OQL is part of the ODMG Standard [Cat94,CBB+97].


Although the standard itself is constantly evolving, it seems that the query
language OQL has reached a rather fixed state. The language described in this
section is based upon the latest edition of the ODMG Standard [CBB+97].

3.1 Basic Principles


OQL is a declarative query language whose design is based upon a few basic
principles. The main and most important design principle is the orthogo-
nality of building expressions. Several basic expressions like constants and
named objects are introduced. These basic expressions can be used to build
more complex expressions by applying functions. Contrary to SQL-92 [MS93],
building complex expressions is restricted only by typing rules and nothing
else: as long as the typing rules are obeyed, any expression forms a legal
query. Thereby, OQL relies on the ODMG Object Model [CBB+97].
Another intent of the designers is to make OQL as similar as possible to
SQL. Ideally, OQL should be an extension of SQL. However, this goal is -
due to the awkwardness of SQL - not easily reached.
OQL is not only an ad-hoc declarative query language but can also be em-
bedded into programming languages for which an ODMG binding is defined.
Using the same type system in the object base and in the programming lan-
guage enables this feature. For example, the result of a query can be readily
assigned to a variable, if typing rules are obeyed.
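As an illustration of this embedding, the following C++ fragment assigns a query result to a host-language variable. It is only a sketch: the types and the call follow the ODMG C++ binding (d_Bag, d_Ref, d_OQL_Query, d_oql_execute), but the header name, the persistent class Student, and the exact signatures are assumptions here and vary between products.

#include <odmg.h>   // assumed umbrella header; in practice product-specific

void collectFirstYearStudents() {
    // The result variable is declared in the same type system as the database schema.
    d_Bag< d_Ref<Student> > firstYears;
    d_OQL_Query query("select s from s in Student where s.year = 1");
    d_oql_execute(query, firstYears);    // run the OQL query, bind the result
    // firstYears can now be iterated like any other collection of the binding.
}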

3.2 Simple Queries

Every constant is already a query. Hence,

5
"Jeff"
are already perfect queries returning the values 5 and "Jeff", respectively. If
a named object Dean exists, then
Dean

is also a perfect query returning the object referenced by Dean.


The attributes of named objects are also directly accessible via queries:
Dean.spouse.name

retrieves the name of the spouse of the dean via a path expression.
The query

Dean.subordinates

retrieves the value of the set-valued attribute subordinates of the Dean. It


contains all the faculty staff of the department of the Dean.
Not only prefabricated objects can be retrieved from the database. It is
also possible to construct objects in a query. For example, the following query
creates a new Student:
Student(firstname: "Jeff", lastname: "Ullman",
major: "Computer Science")
The parameters within the parentheses allow one to initialize certain attributes
of the newly created object. The result of the above query is a new Student
object whose first name is "Jeff".
For the construction of tuple-structured values, the keyword struct is
used. For example, we can construct a tuple value consisting of the three
fields year, month, day by the following query:
struct(year: 1999, month: 12, day: 31)
More complex queries can be built by applying built-in or user-defined
functions:
5 + 5 - 10
Dean.age() + 500
The latter query invokes the method age on Dean which computes the Dean's
age from the attribute dateOfBirth. OQL allows one to skip parentheses for meth-
ods without parameters. Hence,
Dean.age + 500
is equivalent to the above query.
Besides the construction of simple values and tuple-structured values,
OQL allows for the construction of bags, sets, lists, and arrays:
bag(1,1,2,2,3)
set(1,2,3)
list(1,2,3)
array(1,2,3)
There exist some special built-in functions called aggregates. They are
min, max, count, sum, and avg. These functions are all unary functions
working on collections. Except for count which can be applied to any collec-
tion, the argument of the other aggregate functions must be a collection of
numbers. The aggregate functions have standard semantics. For example, the
function min computes the minimum of a collection of numbers, the function
avg computes the average of a collection of numbers. The query
count(Student)
returns the number of students contained in the extent Student.

3.3 Undefined Queries

Assume that the spouse attribute of the Dean is not defined, i.e. contains a
nil value. Then, the path expression

Dean.spouse.name

produces the special value UNDEFINED. In general, any property of the nil
object is UNDEFINED. Any comparison (e.g. with =, <) produces false if
at least one of the compared values is UNDEFINED. There exists a special
function is_undefined to check whether some value is undefined. By

is_undefined(Dean.spouse.age)

we could check whether the path expression returns a legal result. Applying
any function other than a comparison operator on an undefined value results
in a run-time error. Hence, the query

Dean.spouse.age + 5
will result in a run-time error.

3.4 Queries with Select-From-Where-Blocks

Like SQL, OQL provides a select-from-where-block. The following query re-


trieves all first year students:

select s
from Student s
where s.year = 1

The variable s is bound successively to all students contained in the extent


Student. For each student s, the predicate contained in the where clause
is checked. Those students whose year property returns 1 are selected. The
result of the query is a bag of students. The extent Student is called the
domain of the (iterator) variable s. We call a query or subquery consisting
of a select, from, and where clause a SFW-block.
In order to retrieve a set of students, the keyword distinct has to be applied:

select distinct s
from Student s
where s.year = 1

Whenever distinct is applied, possible duplicates are eliminated and the


result is a set. In the above case, the result is a set of students.
In OQL there exist three alternative possibilities to specify in the from
clause that a variable ranges over a collection:

select distinct s
from Student s
where s.year = 1

select distinct s
from s in Student
where s.year = 1

select distinct s
from Student as s
where s.year = 1

All these queries are equivalent.


As in SQL, the from clause can have more than one entry in order to join
collections. With the next query we find all professors and students having
the same name:
select struct(prof: p, stud: s)
from p in Professor, s in Student
where p.name = s.name
Note the usage of struct in the select clause. It is used to construct the result
records. Contrary to SQL, strict OQL does not allow multiple entries
in the select clause. In SQL, a result tuple is implicitly constructed. In OQL,
the result tuple has to be constructed explicitly (see also Sec. 3.9).
Quantifiers can be used independently of a SFW-block. They return a
boolean value. The first query asks whether there exists a professor earning
more than $100.000; the second query asks whether all students passed the
database course:
exists p in Professor: p.salary() > 100.000
for all s in Student: databaseCourse in s.passedCourses()
The special predicate in tests for membership and databaseCourse is assumed
to be a named object referencing the database course.
Nesting of SFW-blocks can occur in many places within quantifiers. As-
sume that we do not have a named object databaseCourse and that we want
to know whether all students named "Smith" passed the database course. We
can formulate this query as:
for all s in (select s
              from s in Student
              where s.name = "Smith"):
    for all c in (select c
                  from c in Course
                  where c.title = "database"):
        c in s.passedCourses()

Quantifiers can also occur within the where clause where they play the
role of a selection predicate. The following query retrieves all students that
passed the database course:

select s
from s in Student
where for all c in (select c
                     from c in Course
                     where c.title = "database"):
    c in s.passedCourses()

This query is a little awkward. Another possibility is to use the subset predi-
cate to verify that a set of database courses is a subset of the passed courses.
This saves the universal quantifier. In fact, universal quantifiers can be re-
placed by a subset predicate and vice versa. The alternative formulation of
the query is:

select s
from s in Student
where (select c
from c in Course
where c.title = "database")
<= s.passedCourses()

In OQL, <= denotes the subset predicate. Likewise, >= denotes the superset
predicate, = denotes set equality. The comparison operators < and > can be
used if we test for strict subsets or strict supersets.

3.5 Nested Queries

As can be seen from the above queries, SFW-blocks can occur nested within
SFW-blocks. In fact, in OQL SFW-blocks can be nested anywhere (in the
select, from, and where clause) as long as the typing rules are obeyed. The
following three queries demonstrate nesting in different places:

select struct(stud: s, courseNames: select c.name


from c in Course
where c in s.passedCourses())
from s in Student
where s.name = "Smith"

This query retrieves students named "Smith" and for each such student the
names of all courses passed. Note that a reference to the student s occurs
in the inner block nested in the select clause. It occurs in the so-called
correlation predicate c in s.passedCourses() that correlates a course with the
passed courses of a student.

Often it is necessary to select the best. For example, we would like to query
the best students. By definition, the best students are those with the highest
gpa. The following query retrieves the best students by applying nesting in
the where clause:

select s
from s in Student
where s.gpa = max (select s.gpa
from s in Student)
Additionally, the query demonstrates a typical application of an aggregate
function (max) and shows a block without a where clause. The where clause
is optional and can be omitted, not only in nested blocks.
The next query demonstrates nesting in the from clause. Nesting in the
from clause is a convenient means to restrict a variable's range:

select s
from s in (select s
           from s in Student
           where s.gpa > 10)
where s.supervisor = dean

The nested query retrieves all students whose gpa is greater than 10. From
these, those are selected whose supervisor is the dean. Obviously, the query
could be stated much simpler by applying the boolean connective and as
done in the next query:

select s
from s in Student
where s.supervisor = dean and s.gpa > 10
Besides and, the other boolean connectives or and not are available in OQL.
The usual boolean expressions can be built from base predicates and the
boolean connectives. They can be used stand-alone or as selection predicates
in the where clause.
Sometimes it is preferable to express a query with collection operations
instead of boolean connectives. The last query can equivalently be stated as
follows:

select s
from s in Student
where s.supervisor = dean
intersect
select s
from s in Student
where s.gpa > 10

where intersect denotes set intersection. The other supported collection op-
erations are union and except. The latter denotes set minus and is applied
in the following query:

select s
from s in Student
where s.gpa > 10
except
select s
from s in Student
where s.supervisor = dean
The query retrieves the good students not supervised by the dean.

3.6 Grouping and Ordering

Grouping in OQL looks a little more complex than in SQL. Consider the
query below. It is evaluated as follows. First, the from clause and the where
clause of the SFW-block involved in the query are evaluated. Typically, this
can be done by taking the cross product of the result of the expressions in the
from clause. Then, the selection predicate of the where clause is evaluated.
Second, this result is split into different partitions. For each partition, there
will be one tuple in the output of the query. All but one of the attributes cor-
respond to the properties used for grouping in the group by clause. The last
attribute is always called partition and is collection-valued. Each collection
contains the result elements from the second step that belong to the corre-
sponding partition. Third, unwanted groups can be eliminated by a predicate given
in the having clause.
Let us consider an example. We want to group students into good,
mediocre, and bad students according to their gpa. The following query does
exactly this:

select *
from s in Student
group by good:     s.gpa >= 10
         mediocre: s.gpa < 10 and s.gpa >= 5
         bad:      s.gpa < 5

In order to understand this query, it is useful to look at the result type. The
result type is:

set<struct(good: bool, mediocre: bool, bad: bool,
           partition: bag<struct(s: Student)>)>
The result contains three tuples whose values for the first three attributes
are:

1. struct(good: true, mediocre: false, bad: false, partition: ...)
2. struct(good: false, mediocre: true, bad: false, partition: ...)
3. struct(good: false, mediocre: false, bad: true, partition: ...)

This is due to the fact that for the partition attributes (good, mediocre, bad)
three possible value assignments exist: the expressions following the partition
attributes are boolean expressions such that for every student exactly one
predicate results in true. Of course, such a restriction does not exist in OQL
but queries of the above kind are rather typical.
For each combination of values for the partition attributes, the students
exhibiting this value combination are collected in the partition attribute.
Grouping all students by their advisor results in more than a single value
for the partition attribute:

select *
from s in Student
group by adv: s.advisor

Restricting the result tuples to those whose advised student group has a good
gpa is managed by applying a having predicate:

select *
from s in Student
group by adv: s.advisor
having avg(select p.s.gpa from p in partition) >= 10
Results - with or without grouping - can be ordered by applying the
order by clause. Assume that in the above query we would like to retrieve
the average gpa and order the result by the decreasing average gpa. This can
be done as follows:

select adv.name, avggpa: avg(select p.s.gpa from partition p)
from s in Student
group by adv: s.advisor
having avg(select p.s.gpa from partition p) >= 10
order by avggpa desc

The annex desc states that we want to order by descending average gpa.
Ordering by an increasing value is specified by asc which is also the default
if no order specifier is given. In general, a list of expressions can be used as
an order specification.

3.7 Views

OQL supports simple views and views with parameters that behave like func-
tions, often returning a collection. For example, if we are often interested in
good students, we might define a view GoodStudent:

define GoodStudent as
select s
from s in Student
where s.gpa >= 10
and refer to it in another query:
select s.name
from s in GoodStudent
where s.age = 25
Views are persistent. That is, they are stored permanently in the schema
until they are explicitly deleted:
delete definition GoodStudent
In OQL, views are not called views but they are called named queries. A
named query can take parameters. An example is a named query that re-
trieves good students where the measure of what's good and what's not is
given as a parameter:
define GoodStudent(goodGPA) as
select s
from s in Student
where s.gpa >= goodGPA
The syntax for referencing named queries with parameters is the same as the
syntax for function calls:
select s.name
from s in GoodStudent(10)
where s.age = 25
It is important to note that names for named queries cannot be overloaded.
Whenever the same name occurs for a named query, the old definition is
overwritten.

3.8 Conversion
OQL provides for a couple of conversions. A collection can be turned into a
single element by the element operator. If the argument of element contains
more than a single element, element raises an exception. For example,
element(select s from s in Student where s.name = "Smith")
results in an exception, if there is more than one student named "Smith";
otherwise the single student named "Smith" is returned.
Other conversion operators are concerned with the conversion between
different collection types:

• listtoset converts a list into a set and
• distinct converts a bag into a set.
The special conversion operator flatten flattens a collection of collections
into a collection by taking the union of the inner collections. In case of lists,
the union is the append operator. For example, the result of
flatten(list(list(1,3), list(2,4)))
is
list(1,3,2,4).

3.9 Abbreviations
OQL contains a couple of possible abbreviations - or syntactic sugar - that
make OQL look more like SQL. The first important construct allows one to
omit the explicit construction of tuples in the select clause. OQL allows for
queries with multiple entries in the select clause:
select p1, ..., pn
from ...
where ...

where the pi are projections of the form:
1. expressioni as identifieri
2. identifieri: expressioni
3. expressioni

Such a query is equivalent to:

select struct(identifier1: expression1, ..., identifiern: expressionn)
from ...
where ...
In the third case, an anonymous identifier is chosen by the system.
Let us consider an example query where we want to select the names and
ages of all good students:
select s.name, s.age
from Student s
where s.gpa > 10
This query does not look different from an SQL query. If we want to give
names to the projected expressions, we write:
select s.name as studentName, s.age as studentAge
from Student s
where s.gpa > 10

A second kind of syntactic sugar concerns aggregate functions. Instead of
writing:

aggr(select expression
     from ...
     where ...)

we can write the SQL-like query:

select aggr(expression)
from ...
where ...

for any aggregate function min, max, avg, and sum.

The query:

select count(*)
from ...
where ...

translates into:

count(select *
      from ...
      where ...)
The same abbreviations apply to SFW-blocks exhibiting a distinct.
SQL allows comparing a single value via a comparison operator (=, <,
<=, ... ) and a quantifier (some, all) with a whole set of elements. The same
applies to OQL. For example,

select s
from Student s
where s.gpa >= all (select s1.gpa
from Student s1)
is a perfect OQL query. It retrieves all students whose gpa is greater than or
equal to all gpas found for students. This query is equivalent to:

select s
from Student s
where for all s1 in Student:
s.gpa >= s1.gpa

For a comparison operator θ ∈ {=, <, >, <=, >=, !=}, the predicate

expression θ some setexpression

is equivalent to

exists x in setexpression: expression θ x

and the predicate

expression θ all setexpression

is equivalent to

for all x in setexpression: expression θ x.

4 Physical Object Management


Having introduced the object model and the query language OQL we will
now investigate a few issues related to the physical representation and the
"handling" of persistent objects.

4.1 OID Mapping


Object identity is a fundamental concept to enable object referencing in
object-oriented and object-relational database systems. Each object has a
unique object identifier (OID) that remains unchanged throughout the ob-
ject's life time. There are two basic implementation concepts for OIDs in
databases: physical OIDs and logical OIDs [KC86].

Physical object identifiers. Physical OIDs contain parts of the initial
permanent address of an object, e.g., the page identifier and the slot within
that page. Based on this information, an object can be directly accessed on a
that page. Based on this information, an object can be directly accessed on a
data page. This direct access facility is advantageous as long as the object is
in fact stored at that address. Updates to the database may require, however,
that objects are moved to other pages. In this case, a place holder (forward
reference) is stored at the original location of the object that contains the
new address of the object. When a moved object is referenced, two pages
are accessed: the original page containing the forward and the page actually
containing the object.
Physical OIDs are sketched in Figure 4.1. The objects identified by 4711:1
(the Course Foundations) and 4711:3 (Ethics) are still on their home page
and can be obtained by a single page access. However, the object identified
by 4711:2 was moved to a different location, i.e., to slot 3 on page 4812.
Therefore, page 4711 contains a forward to this new location. If the object is
moved to yet another page, the forward left behind on page 4711 is updated.
This way, the forward chains are always restricted to a single indirection.
Note that every object knows its (physical) object identifier which does not
change due to the migration to a different location.
With an increasing number of forwards, the performance of the DBMS grad-
ually degrades, at some point making reorganization inevitable [GKM96]. O2
[O2T94], ObjectStore [LLO+91], and (presumably) Illustra [Sto96, p. 57] are
examples of commercial systems using physical OIDs.
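The following small Python sketch (an illustrative addition, not from the original text; the function names are made up and the page contents merely mirror the example of Figure 4.1) shows how dereferencing a physical OID touches at most two pages: the home slot either stores the object itself or a forward entry pointing to the current location, and the forward is rewritten when the object moves again.

# Physical OIDs with forward references (illustrative sketch).
# A physical OID is a (page_no, slot) pair; a slot holds either
# ("object", data) or ("forward", (page_no, slot)) after a migration.

pages = {
    4711: {1: ("object", "Foundations"),
           2: ("forward", (4812, 3)),
           3: ("object", "Ethics")},
    4812: {3: ("object", "Mathematical Logic for CS")},
}

def deref_physical(poid):
    """Return the object for a physical OID, following at most one forward."""
    page_no, slot = poid
    kind, payload = pages[page_no][slot]           # first page access
    if kind == "forward":
        fwd_page, fwd_slot = payload
        kind, payload = pages[fwd_page][fwd_slot]  # second page access
    return payload

def move_object(poid, new_page, new_slot):
    """Move an object; the forward on its home page is (re)written so that
    chains never grow beyond a single indirection."""
    page_no, slot = poid
    kind, payload = pages[page_no][slot]
    if kind == "forward":                          # object has already migrated once
        old_page, old_slot = payload
        payload = pages[old_page][old_slot][1]
        del pages[old_page][old_slot]
    pages.setdefault(new_page, {})[new_slot] = ("object", payload)
    pages[page_no][slot] = ("forward", (new_page, new_slot))

print(deref_physical((4711, 2)))   # -> "Mathematical Logic for CS" (two page accesses)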
Fig. 4.1. Illustration of physical OIDs with forward pointer

Logical object identifiers. Logical OIDs do not contain the object ad-
dress and are thus location independent. To find an object by OID, however,
an additional mapping structure is required to map the logical OID to the
physical address of the object. If an object is moved to a different address,
only the entry in the mapping structure is updated. In the following, we
describe three data structures for the mapping. [EGK95] give details and a
performance comparison.

Mapping with a B+-tree. The logical OID serves as key to access the tree
entry containing the actual object address (cf. Figure 4.2a). In this graphic,
the letters represent the logical OIDs and the numbers denote the physical
address of the corresponding object (e.g., the object identified by a is stored
at address 6). Here, we use simplified addresses; in a real system the address
is composed of page identifier and location within that page - like physical
OIDs. For each lookup, the tree is traversed from the root. Alternatively, if
a large set of sorted logical OIDs needs to be mapped, a sequential scan of
the leaves is possible. Shore [CDF+94] and (presumably) Oracle8 [LMB97]
are systems employing B-trees for OID mapping.

Fig. 4.2. Mapping techniques: (a) B+-tree, (b) hash table, (c) direct mapping

Mapping with a hash table. The logical OID is used as key for a hash
table lookup to find the map entry carrying the actual object address (cf. Fig-
ure 4.2b). For example, Itasca [Ita93] and Versant [Ver97] implement OID
mapping via hash tables.

Direct mapping. The logical OID constitutes the address of the map entry
that in turn carries the object's address. In this respect, the logical OID can

be seen as an index into a vector containing the mapping information. Direct
mapping is immune to hash collisions and always requires only a single page
access (cf. Figure 4.2c). Furthermore, since the logical OIDs are not stored
access (cf. Figure 4.2c). Furthermore, since the logical OIDs are not stored
explicitly in the map, a higher storage density is achieved. Direct mapping
was used in the "old" CODASYL database systems and is currently used,
e.g., in BeSS [BP95].
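The three mapping variants can be contrasted with a short Python sketch (an illustrative addition; the class names are ad hoc, and the B+-tree is only approximated by a sorted key list with binary search). Moving an object only requires updating the map entry, never the references stored in other objects.

# Logical-OID mapping variants (illustrative sketch).
import bisect

class BTreeMap:                    # stand-in for a B+-tree: sorted keys + binary search
    def __init__(self, pairs):
        self.keys = sorted(k for k, _ in pairs)
        self.vals = dict(pairs)
    def lookup(self, oid):
        i = bisect.bisect_left(self.keys, oid)     # "tree traversal" from the root
        return self.vals[self.keys[i]]

class HashMap:                     # hash-table mapping, exact-match lookups only
    def __init__(self, pairs):
        self.table = dict(pairs)
    def lookup(self, oid):
        return self.table[oid]

class DirectMap:                   # the (numeric) OID itself indexes a vector of addresses
    def __init__(self, addresses):
        self.vector = addresses
    def lookup(self, oid):
        return self.vector[oid]

pairs = [("a", 6), ("b", 2), ("c", 3), ("d", 7), ("e", 5)]   # OID -> address, as in Fig. 4.2
print(BTreeMap(pairs).lookup("c"), HashMap(pairs).lookup("c"))   # 3 3
print(DirectMap([6, 2, 3, 7, 5]).lookup(2))                      # OID 2 is stored at address 3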

4.2 Pointer Joins


Inter-object references are one of the key concepts of object-relational and
object-oriented database systems. These references allow one to traverse directly
from one object to its associated (referenced) object(s). This is very efficient
for navigating within a limited context, in so-called "pointer chasing"
applications. However, in query processing a huge number of these inter-
object references has to be traversed for evaluating so-called functional joins
or pointer joins. Therefore, naive pointer chasing techniques usually cannot
yield competitive performance. Let us demonstrate this on a very simple
example schema consisting of merely two object types R and S. R-objects
refer to S-objects via a relationship Sref. Assume that the S-objects have an
attribute S_Attr that we want to join to the R-objects. The corresponding
query looks as follows:
select r, r.Sref.S_Attr            select c, c.taughtBy.Name
from R r                           from AllCourses c
On the right-hand side an example functional join based on our University
schema is additionally shown: For all courses the name of the instructor is
obtained in the query.
The naive pointer chasing algorithm is illustrated in Figure 4.3. It scans
(i.e., sequentially reads) R and "right away" traverses the reference of the
relationship Sref. For logical OIDs, first the Map is looked up to obtain the
address of the referenced S object which is then accessed (dereferenced). In
our illustration direct mapping is assumed for mapping logical OIDs of S -
denoted by letters - to addresses of S - denoted by one digit numbers. If the
combined size of the Map and S exceeds the memory capacity this algorithm
performs very poorly.
In a system employing physical OIDs the algorithm does not need to
perform the lookup in the Map. However, the access to the page the physical
OID is referring to may reveal that the object has moved to a different page.
In this case, the forward pointer has to be traversed in order to retrieve the
object. Again, the algorithm performs very poorly if the size of S exceeds the
memory capacity.
The "smarter" algorithms take measures to achieve locality in performing
the lookups. For performing the two functional joins with the Map and with
S, respectively, two techniques can be applied to achieve locality: partitioning
and sorting.
Fig. 4.3. Naive pointer join with logical OIDs

If partitioning is applied (as illustrated in Figure 4.4), the R-objects are
partitioned such that each partition refers to a memory-sized partition of
the Map. Partitioning is achieved by applying a partitioning function hM on
the Map. Partitioning is achieved by applying a partitioning function hM on
the Sref value of the R objects. In our example we assume that a two-way
partitioning is sufficient to obtain memory sized chunks of the Map.3 Upon
replacing the logical OIDs of Sref relationships by the addresses obtained
from the Map the objects are once again partitioned for the next functional
join with S. For this, the partitioning function hs is applied on the address of
the referenced S-object. Here, again, we assume that a two-way partitioning
is sufficient.
Instead of partitioning, one could also sort the flattened R tuples. For the
Map lookup the tuples are sorted on the Sref attribute and for the second
functional join they are sorted on the addresses of S objects.
For physical OIDs this algorithm performs analogously except that only
one partitioning/sorting step is needed because the lookup in the Map is not
needed.
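The effect of partitioning can be sketched in a few lines of Python (an illustrative addition; the partitioning functions are ad hoc, and the data mirrors the example of Figures 4.3 and 4.4). Each step processes one memory-sized chunk of the Map and of S at a time, which is where the locality comes from.

# Partitioned pointer (functional) join with logical OIDs (illustrative sketch).
R = [("r1", "b"), ("r2", "e"), ("r3", "c"), ("r4", "g"), ("r5", "i"),
     ("r6", "d"), ("r7", "a"), ("r8", "c"), ("r9", "h"), ("r10", "i")]
Map = {"a": 6, "b": 2, "c": 3, "d": 7, "e": 5, "f": 8, "g": 9, "h": 4, "i": 1}
S = {1: 17, 2: 11, 3: 19, 4: 13, 5: 18, 6: 12, 7: 10, 8: 14, 9: 15}   # address -> S_Attr

hM = lambda oid: 0 if oid <= "d" else 1      # 2-way partitioning of the Map
hS = lambda addr: 0 if addr <= 4 else 1      # 2-way partitioning of S

def partitioned_pointer_join(R, Map, S):
    # Step 1: partition R by hM(Sref) so each partition hits one chunk of the Map.
    r_parts = [[], []]
    for rid, sref in R:
        r_parts[hM(sref)].append((rid, sref))
    # Step 2: per partition, replace logical OIDs by addresses and re-partition by hS.
    rm_parts = [[], []]
    for part in r_parts:
        for rid, sref in part:
            addr = Map[sref]
            rm_parts[hS(addr)].append((rid, addr))
    # Step 3: per partition, dereference the S objects and emit the result tuples.
    result = []
    for part in rm_parts:
        for rid, addr in part:
            result.append((rid, S[addr]))
    return result

print(partitioned_pointer_join(R, Map, S))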
[BCK98] investigated pointer joins along set-valued relationships. In this
case, the R objects are assumed to contain a set-valued relationship SrefSet
that relates an R object with a set of S objects. Queries would then look as
follows:

3 If a hash table is used for implementing the Map the same hash function used for
the hash table has to be applied on the Sref value before applying the partitioning
function.
Fig. 4.4. Pointer join with logical OIDs using partitioning

select struct(Robj: r, S_AttrSet: (select s.S_Attr
                                   from s in r.SrefSet))
from R r

select struct(stud: s, courseTitles: (select c.Title
                                      from c in s.enrolled))
from s in AllStudents

Again, we show one generic query on our abstract schema and another one
based on the University schema. The second query combines student infor-
mation with the titles of the courses they are enrolled in.
[BCK98] proposed the partition/merge-algorithm for evaluating such
functional joins along set-valued relationships. It is an adaptation of the
above partition-algorithm. It flattens the R objects but it retains the group-
ing of the flattened R objects across an arbitrary number of functional joins.
This is achieved by interleaving partitioning and merging in order to retain
(very cheaply) the grouping after every intermediate partitioning step. This
is captured in the notation P(PM)* M. We will describe the basic idea of the
algorithm by way of an example.
Figure 4.5 shows a concrete example application of the P(PM)*M-
algorithm with two partitioning steps. The tables Ri, RMij, and RMSj are
labeled by a disk symbol to indicate that these temporary partitions are
stored on disk.
We start with the extent R containing two objects with logical OIDs r1
and r2 - for simplicity, any additional R attributes are omitted. The set-
valued relationship SrefSet contains sets of references (logical OIDs) to S.
The first processing step flattens these sets and partitions the stream of flat
objects N-way. In our example, the partitioning function hM is 2-way and
maps {a, ..., d} to partition R1 and {e, ..., i} to partition R2.

Fig. 4.5. An example application of the partition/merge-algorithm

The next
processing step starts with reading R1 from disk, maps the logical OIDs in
attribute Sref to object addresses using the portion M1 of the Map (note
that the Map is not explicitly partitioned) and in the same step partitions
the object streams K-way with partitioning function hs (in our example a
2-way partitioning was assumed and hs maps {1, ..., 4} to partition 1 and
{5, ..., 9} to partition 2). The resulting partitions RM1j (here 1 ≤ j ≤ 2) are
written to disk. Processing then continues with partition R2 whose objects
are partitioned into RM2j (1 ≤ j ≤ 2). The fine-grained partitioning into
the N * K (here 2 * 2) partitions is essential to preserve the order of the flat
R objects belonging to the same R object. The subsequent merge scans N
(here 2) of these partitions in parallel in order to re-merge the fine-grained
partitioning into the K partitions needed for the next functional join step.
Skipping the fine-grained partitioning into N * K partitions and, instead,
partitioning RM into the K partitions right away would not preserve the
ordering of the R objects. In detail, the third phase starts with merging
RM11 and RM21 and simultaneously dereferencing the S objects referred to. In
the example, [r1,2] is fetched from RM11 and the S object at address 2 is
dereferenced. The requested attribute value (S_Attr) of the S object - here 11
- is then written to partition RMS1 as object [r1,11]. After processing [r1,3]
from partition RM11, [r1,1] is retrieved from RM21 and the object address
1 is dereferenced, yielding an object [r1,17] in partition RMS1. Now that all
flattened objects belonging to r1 from RM11 and RM21 are processed, the
merge continues with r2. After the partitions RM11 and RM21 are processed,
RM12 and RM22 are merged in the same way to yield a single partition RMS2.
As a final step, the partitions RMS1 and RMS2 are merged to form the result
RMS. During this step, the flat objects [r,S_Attr] are nested (grouped) to
form set-valued attributes [r,{S_Attr}]. If aggregation of the nested S_Attr
values had been requested in the query, it would be carried out in this final
merge.
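The net effect of the algorithm can be illustrated with a much-simplified Python sketch (an illustrative addition, not the P(PM)*M implementation itself): the set-valued references are flattened while remembering the owning R object, the Map and S lookups are performed on the flat tuples, and a final grouping step rebuilds the set-valued result. The interleaved partition/merge steps that keep the intermediate partitions grouped on disk are deliberately omitted; the data mirrors the example of Figure 4.5.

# Simplified flatten / lookup / regroup sketch of a set-valued functional join.
from collections import defaultdict

R = [("r1", ["b", "e", "c", "g", "i"]),
     ("r2", ["a", "d", "c", "h", "i"])]
Map = {"a": 6, "b": 2, "c": 3, "d": 7, "e": 5, "f": 8, "g": 9, "h": 4, "i": 1}
S = {1: 17, 2: 11, 3: 19, 4: 13, 5: 18, 6: 12, 7: 10, 8: 14, 9: 15}

def set_valued_functional_join(R, Map, S):
    flat = [(rid, sref) for rid, refs in R for sref in refs]   # flatten SrefSet
    with_addr = [(rid, Map[sref]) for rid, sref in flat]       # Map lookup
    with_attr = [(rid, S[addr]) for rid, addr in with_addr]    # dereference S
    grouped = defaultdict(list)                                # final merge/grouping
    for rid, attr in with_attr:
        grouped[rid].append(attr)
    return dict(grouped)

print(set_valued_functional_join(R, Map, S))
# {'r1': [11, 18, 19, 15, 17], 'r2': [12, 10, 19, 13, 17]}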

4.3 Pointer Swizzling


In object-oriented database applications one often encounters application pro-
files where a limited set of objects (i.e., a focused context) is repeatedly ac-
cessed and manipulated. Repeatedly accessing these objects via their logical
or physical OIDs would incur a severe performance penalty - even if the
objects remain memory resident throughout the application.
[KK95] classify and describe different approaches to optimizing the ac-
cess to main memory resident persistent objects - techniques that are com-
monly referred to as "pointer swizzling". In order to speed up the access
along interobject references, the persistent pointers in the form of unique ob-
ject identifiers (OIDs) are transformed (swizzled) into main memory pointers
(addresses). Thus, pointer swizzling avoids the indirection of a table lookup
to localize a persistent object that is already resident in main memory.
We classify the pointer-swizzling techniques along three dimensions:

1. In place/copy
Here, we distinguish whether the objects in which pointers are swizzled
remain (in place) on the pages on which they are resident on secondary
storage or whether they are copied (copy) into a separate object buffer.
2. Eager/lazy
Along this dimension we differentiate between techniques that will swiz-
zle all pointers that are detected versus those swizzling techniques that
will only swizzle on demand, i.e., when the particular reference is deref-
erenced.
3. Direct/indirect
Under direct pointer swizzling, the swizzled attribute (reference) con-
tains a direct pointer to the referenced in-memory object. Under indirect
swizzling there exists one indirection; that is, the attribute contains a
pointer to a so-called descriptor, which then contains the pointer to the
referenced object.

The three dimensions are summarized in tabular form in Figure 4.6. In the
subsequent sections, we will discuss those three dimensions in a bit more
detail.

Classification of Pointer-Swizzling Techniques

In Place/Copy   Eager/Lazy   Direct/Indirect
in place        eager        direct
                             indirect
                lazy         direct
                             indirect
copy            eager        direct
                             indirect
                lazy         direct
                             indirect

Fig. 4.6. The three dimensions of pointer-swizzling techniques

Depending on whether it is possible to swizzle a pointer that refers to an
object that is not (yet) main memory resident, we distinguish direct from
indirect swizzling. Direct swizzling requires that the referenced object is res-
ident. A directly swizzled pointer contains the main memory address of the
object it references. The problem with direct swizzling is that in case an ob-
ject is displaced from the page or object buffer - i.e., is no longer resident -
all the directly swizzled pointers that reference the displaced object need to
be unswizzled. In order to unswizzle these pointers, they are registered in a

list called reverse reference list (RRL).4 Figure 4.7 illustrates the scenario of
direct swizzling.

Fig. 4.7. Direct swizzling and the RRL

Note that in case of eager direct swizzling, we are not allowed to simply
unswizzle the pointers, as eager swizzling guarantees that all pointers in the
buffer are swizzled - instead, we have to displace those pointers (i.e., their
"home objects"), too. This may result in a snowball effect - however, in this
presentation we will not investigate this effect in detail.
Maintaining the RRL can be very costly; especially in case the degree of
sharing of an object is very high. In our context, the degree of sharing can
be specialized to the fan-in of an object that is defined as the number of
swizzled pointers that refer to the object. Assume, for example, an attribute
of an object is assigned a new value. First, the RRL of the object the old value
of the attribute referenced needs to be updated. Then the attribute needs to
be registered in the RRL of the object it now references. Maintaining the
RRL in the sequence of an update operation is demonstrated in Figure 4.8,
in which an attribute, say, spouse, of the object Mary is updated due to
a divorce from John and subsequent remarriage to Jim. First, the reverse
reference to the object Mary is deleted from the RRL of the object John;
then a reverse reference is inserted into the RRL of the object Jim.
Indirect swizzling avoids this overhead of maintaining an RRL for every
resident object by permitting the swizzling of pointers that reference nonresident
objects. In order to realize indirect swizzling, a swizzled pointer materializes

4 In the RRL, the OID of the object and the identifier of the attribute in which
  the pointer appears are stored - we say that the context of the pointer is stored.

Fig. 4.8. Updating an object under direct swizzling (left: before the divorce; right:
after marrying again)

the address of a descriptor - i.e., a placeholder of the actual object. In case
the referenced object is main memory resident, the descriptor stores the main
memory address of the object; otherwise, the descriptor is marked invalid.
In case an object is displaced from the main memory buffer, the swizzled
pointers that reference the object need not be unswizzled - only the descrip-
tor is marked invalid. Figure 4.9 illustrates this (the dashed box marks the
descriptor invalid).5

Fig. 4.9. The scenario of indirect swizzling

5 In case of eager indirect swizzling, we need to provide a special pseudo-descriptor
  for NULL and dangling references. This pseudo-descriptor is always marked in-
  valid.
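The descriptor mechanism can be sketched in Python as follows (an illustrative addition; the class and function names are made up). A swizzled reference stores a pointer to a descriptor rather than to the object itself, so evicting an object only invalidates its descriptor and no reverse reference list is needed.

# Indirect swizzling with descriptors (illustrative sketch).
class Descriptor:
    def __init__(self, oid):
        self.oid = oid          # persistent identity, used to (re)load the object
        self.obj = None         # in-memory object, or None if not resident

descriptors = {}                # one descriptor per referenced OID

def load_from_disk(oid):
    return {"oid": oid, "data": "object " + oid}   # placeholder fault handler

def swizzle(oid):
    """Turn a persistent OID into an in-memory reference (a descriptor)."""
    return descriptors.setdefault(oid, Descriptor(oid))

def deref(descriptor):
    """Residency check: fault the object in if the descriptor is invalid."""
    if descriptor.obj is None:
        descriptor.obj = load_from_disk(descriptor.oid)
    return descriptor.obj

def evict(oid):
    """Displace an object from the buffer: only the descriptor is invalidated."""
    descriptors[oid].obj = None

child_ref = swizzle("id42")       # e.g., Father and Mother would both hold this
print(deref(child_ref)["data"])   # same descriptor for their shared Child object
evict("id42")
print(deref(child_ref)["data"])   # transparently re-faulted on the next dereference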

Exploiting virtual memory facilities. The above pointer swizzling ap-
proaches are commonly referred to as software swizzling because the resi-
dency checks are made by software routines. Another possible way to imple-
ment a persistent store is to exploit the virtual memory facilities supported
by hardware. In the hardware approach, all the references in memory are
implemented as virtual addresses. Thus, pointers to persistent objects are
dereferenced like pointers to transient data; i.e., no interpretation overhead
is necessary for residency checks or to determine the state of a reference.
Like object descriptors for indirect swizzling, virtual memory allows direct
pointers to be kept swizzled and virtual memory addresses referring to ob-
jects located in pages that are not resident in physical main memory. Virtual
memory provides access protection for pages which can be exploited in the
following way: when a reference referring to a non-resident object is derefer-
enced, an exception is signaled, and the persistent object system reads the
corresponding page into the main-memory buffer pool.
Wilson and Kakkad describe the swizzling of pointers and the mapping of
pages in virtual memory as a "wave front" [WH91,WK92]. At the beginning,
all the pages that are referred to by an entry pointer are mapped into virtual
memory and access-protected. These pages are not loaded nor is any physical
main memory allocated. Only when a page is accessed for the first time, it
is loaded and the access protection is released. This is achieved by a trap to
the object base system which is rather expensive. In their Texas persistent
store [SKW92], at the same time (at page-fault time), all pointers located
in the page are swizzled. In Texas, pointers are stored as physical OIDs in
persistent memory, i.e., they contain the page number and the offset of the
object they refer to. A persistent pointer referring to an object located in a
page that is already located in virtual memory is translated into a virtual
address by consulting Texas' page table that records the mapping of pages
to their virtual memory addresses. If the page is not registered in the page
table (i.e., no pointers referring to objects located in that page have been
encountered before), the page is mapped into virtual memory and access-
protected first. Figure 4.10 illustrates this principle. When a page is accessed
for the first time, it is swizzled (i.e., all the pointers in the page are swizzled)
and its access protection is released thereby moving the inner wave front
ahead. At the same time, when pointers are swizzled, the outer wave front is
moved ahead to map new pages into virtual memory.
ObjectStore, a commercial object-oriented database system, apparently
uses a virtual memory mapping approach in a similar way [LLO+91]. Devi-
ating from Texas, the unit of address mapping is the segment, a collection
of pages, rather than an individual page. A persistent pointer contains the
segment number and the offset of the object within the segment. When a
segment is accessed for the first time, the whole segment, i.e., all the pages of
the segment, is mapped into virtual memory. ObjectStore thus reduces some
of the computational overhead involved in mapping pages into virtual memory
since mapping a whole segment at once is cheaper than mapping every
page individually. On the other hand, more virtual memory is reserved by
segments or parts of segments that are never accessed. Pages, however, are
also loaded and swizzled incrementally by ObjectStore in a client/page-server
architecture.

Fig. 4.10. Wave front of swizzled and mapped pages in the Texas persistent store

4.4 Clustering

The clustering problem is the problem of placing objects onto pages such
that for a given application the number of page faults becomes minimal.
This problem is computationally very complex - in fact, the problem is NP-
hard. Hence, several heuristics to compute approximations of the optimal
placement have been developed. Here we will just discuss a single heuristic
that is based on graph partitioning.
Let us first motivate clustering by way of an example. Assume that in
many applications the three objects id1, id2, and id3 are used together. If they
are stored on separate pages, as exemplified in Figure 4.11, the application
induces three page faults. Assuming an average access time of 10 ms per page
Fig. 4.11. Placement of three related objects onto pages: unclustered (top) versus
clustered (bottom)

access, this fetch phase lasts 30 ms. The result after fetching these objects is
shown at the top of Figure 4.11. Since the involved objects are quite small,
they could all easily fit on a single page. If all the objects reside on a single
page, only one page access - taking approximately 10 ms - is needed to
fetch all three related objects into main memory. A factor of three is saved.
Obviously, the saving increases the more logically related objects fit onto a
single page. It is obvious that these three objects should have been placed
on the same page - as shown at the bottom of Figure 4.11.
Besides this obvious saving, there exists another less obvious advantage of
clustering several logically related objects onto a single page. We first observe
that all pages fetched into main memory occupy buffer space. Further, buffer
space is usually restricted. Hence, if too many pages are needed, some of them

must be stored back onto disk despite the fact that during the continuation of
the application certain objects they contain are again accessed. This results
in more page faults and, hence, in decreased performance. Less buffer space
is wasted if the percentage of objects on a page needed by an application is
high. Clustering of those objects that are accessed together in an application
increases this percentage and, hence, increases performance. From this point
of view, filling a page totally with objects always needed together is the best
clustering strategy possible.

Fig. 4.12. Referencing within the SIMPLE-example: schema references on the left
and the cluster graph on the right

The optimal clustering for the above example is very intuitive. To illus-
trate that this is not always the case consider the so-called SIMPLE-example
[TN91] exhibiting an interesting pitfall when following the above, intuitively
straightforward clustering strategy of filling pages maximally. There exist
objects with identifiers S, I, M, P, L, and E. They reference each other in
the way indicated on the left-hand side of Figure 4.12. The application we
consider is characterized by the following access pattern or reference string:

    (SI)99 (MP)99 (LE)99

where first object S is accessed, subsequently object I. Then 98 further ac-
cesses go from I to S and back, ending at I. From here, M is accessed. Again,
the application switches 99 times between M and P. Last, L is accessed
and 99 mutual accesses between L and E take place. Consider the case that
three objects fit onto a single page. Then consider the following placement of
objects onto pages - the brackets [...] indicate page boundaries:

• [S,I,M], [P,L,E]

is a reasonable clustering of the objects since



• The space occupied is minimized.
• The number of outgoing references, i.e., relationships between objects on
  different pages, is minimized.
Nevertheless, assuming a page buffer with a capacity of only one page, the
above application leads to 198 page faults. This can be seen as follows. The
application first accesses the object S. This leads to the first page fault.
Switching between S and I does not produce any further page fault. This
also holds for accessing M after (SI)99. Accessing P leads to the next page
fault where the page [S, I, M] is replaced by the page [P, L, E]. Switching
back to M again requires a page fault. The next page fault occurs when
accessing P again. Hence, executing (MP)99 after (SI)99 has been executed
leads to 2 * 99 - 1 page faults. After the last access to P, no further page fault
occurs while executing (LE)99. Hence, in summary 198 page faults occurred.
A much better placement is

• [S,I,-], [M,P,-], [L,E,-]

where "-" indicates unoccupied space. For this placement, the above appli-
cation induces only three page faults.
[TN91] were the first who viewed the clustering problem as a graph parti-
tioning problem. For this purpose, a so-called cluster graph is constructed by
monitoring applications. The cluster graph's nodes correspond to the objects
in the database and the weighted edges between any two objects correspond
to the number of consecutive accesses of these two objects. For our SIMPLE-
example the cluster graph is shown on the right-hand side of Figure 4.12.
Then the clustering problem consists of forming page-sized partitions of this
cluster graph such that the accumulated weight of the inter-partition (i.e.,
inter-page) edges becomes minimal. Unfortunately, this problem is computa-
tionally infeasible for large object bases because the partitioning problem is
NP-hard. Therefore, known heuristics that find a good but not necessarily
optimal partitioning (clustering) are needed.
Previously known partitioning heuristics, such as the Kernighan&Lin
heuristic [KL70], still had a very high running time. Therefore, in [GKK+93]
a new heuristic for graph partitioning, called Greedy Graph Partitioning
(GGP), was developed. Graph partitioning is strongly related to subset op-
timization problems for which greedy algorithms often find good solutions
very efficiently.
The GGP algorithm (see Figure 4.13) is based on a simple greedy heuris-
tic that was derived from Kruskal's algorithm for computing the minimum
weight spanning tree of a graph [Kru56].
First, all partitions are inhabited by a single object only, and all partitions
are inserted into the list PartList. For all objects o1, o2 connected by some
edge in the CG with weight w(o1,o2), a tuple (o1, o2, w(o1,o2)) is inserted into
the list EdgeList. All tuples of EdgeList are visited in the order of descending
weights. Let (o1, o2, w(o1,o2)) be the current tuple and let P1, P2 be the partitions

INPUT: The clustering graph CG;
OUTPUT: A list of partitions;

PartList := ( );
Assign each object to a new partition; insert all partitions into PartList;
Let EdgeList be a list of tuples of the form (o1, o2, w(o1,o2))
    where w(o1,o2) is the weight of the edge between o1 and o2;
Sort EdgeList by descending weights;
foreach (o1, o2, w(o1,o2)) in EdgeList do begin
    Let P1, P2 be the partitions containing objects o1, o2;
    if P1 ≠ P2 and the total size of all objects of P1 and P2
            is less than the page size then begin
        Move all objects from P2 to P1;
        Remove P2 from PartList;
    end if;
end foreach;

Fig. 4.13. The Greedy Graph Partitioning algorithm

to which the objects o1 and o2 are assigned. If P1 ≠ P2 and if the total size of
all objects assigned to P1 and P2 is less than the page size, the two partitions
are joined.6 Otherwise, the edge is merely discarded and the partitions
remain unchanged.
It is easy to see that the GGP-algorithm obtains the optimal clustering
for the SIMPLE-example consisting of three partially empty pages. It first as-
signs each of the six objects to a separate partition (page). Then it merges
the pages with the objects S and I, M and P, L and E, respectively - in
no particular sequence since there are ties with the weight 198. Having ob-
tained these three pages [S, I, -], [M, P, -], and [L, E, -] no further merging
is possible because the page limit was set at three.
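A Python rendering of the GGP heuristic of Figure 4.13, applied to the SIMPLE example, might look as follows (an illustrative re-implementation, not the authors' code). Objects are assumed to have unit size and the page capacity is given in objects; the weight 198 for the S-I, M-P, and L-E edges follows the ties mentioned above, while the two weight-1 edges for the single I-to-M and P-to-L transitions are inferred from the reference string.

# Greedy Graph Partitioning (GGP) sketch.
def greedy_graph_partitioning(objects, edges, page_size):
    """objects: iterable of object ids; edges: dict {(o1, o2): weight}."""
    partition_of = {o: {o} for o in objects}          # each object starts alone
    edge_list = sorted(edges.items(), key=lambda e: -e[1])
    for (o1, o2), _w in edge_list:                    # visit edges by descending weight
        p1, p2 = partition_of[o1], partition_of[o2]
        if p1 is not p2 and len(p1) + len(p2) <= page_size:
            p1 |= p2                                  # move all objects of P2 into P1
            for o in p2:
                partition_of[o] = p1
    seen, pages = set(), []                           # collect the distinct partitions
    for p in partition_of.values():
        if id(p) not in seen:
            seen.add(id(p))
            pages.append(sorted(p))
    return pages

simple_edges = {("S", "I"): 198, ("I", "M"): 1, ("M", "P"): 198,
                ("P", "L"): 1, ("L", "E"): 198}
print(greedy_graph_partitioning("SIMPLE", simple_edges, page_size=3))
# e.g. [['I', 'S'], ['M', 'P'], ['E', 'L']] - the three partially empty pages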

4.5 Large Object Management

So far we have only dealt with objects that fit into one page. However, in
advanced applications there are many "bulky" data types, e.g., multi-media
data like video, audio, images, etc., where this premise no longer holds. There-
fore, techniques for mapping large objects of any size - ranging from several
hundred kilobytes to gigabytes - are needed. It is, of course, not feasible
to simply map such large objects onto a chained list of pages. This naive ap-
proach would severely penalize reading an entire large object or a part in the
"middle" of a large object from the secondary memory. Therefore, smarter
techniques are needed that map large objects onto large chunks of consecutive
pages - called segments - while, at the same time, allowing dynamic growth
and shrinking of objects "in the middle". Also, the object structure has to

6 Partitions are represented as binary trees to accelerate the join operation.



provide for efficient access to random byte positions within the large object
- without having to read the entire part preceding the desired position.

The Exodus approach. In Exodus [CDR+86] large objects are mapped
onto segments of fixed size. Of course, merely chaining these segments would
not solve the problem of efficient random access to random "middle" parts
of the large object. Therefore, a B+-tree was chosen as a directory structure.
This is illustrated in Figure 4.14. The root constitutes the address (physical
OID) of the large object. For illustration, we assume that segments consist of
4 pages, each containing 100 bytes. Accessing, for example, a part of the large
object starting at byte position 1700 involves traversing the right-most path
of the B+-tree down to the segment that contains the last 230 bytes of the
object. Growing a large object in the middle is also efficiently supported by
inserting one or more new segment(s) and updating the corresponding path
of the B+-tree.

Fig. 4.14. B-tree representation of a large object in Exodus (segment byte counts:
360, 250, 300, 400, 300, 230)
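Positional access over such a directory can be sketched in Python (an illustrative addition; for brevity the B+-tree is flattened to a list of per-segment byte counts with the same lookup logic, accumulating counts until the target position is reached).

# Simplified positional lookup in an Exodus-style large object.
segment_bytes = [360, 250, 300, 400, 300, 230]   # counts as in the Figure 4.14 example

def locate(byte_pos, counts):
    """Return (segment index, offset within that segment) for a byte position."""
    skipped = 0
    for seg, size in enumerate(counts):
        if byte_pos < skipped + size:
            return seg, byte_pos - skipped
        skipped += size
    raise IndexError("byte position beyond end of object")

print(locate(1700, segment_bytes))   # -> (5, 90): inside the last, 230-byte segment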

The Starburst approach. The Exodus storage structure has the disadvan-
tage of fixed segment sizes. This may be a problem if very differently sized
objects need to be stored. Therefore, in Starburst [LL89] segments with a
fixed growth pattern were introduced. That is, a large object is created by
starting with a segment of a chosen size. Additionally allocated segments are
twice the size of their predecessor segment, except for the last segment, which
can have an arbitrary size in order to avoid storage waste. The segments are
chained by a so-called Descriptor - as illustrated in Figure 4.15. The De-
scriptor contains the number of segments (here 5), the size of the first (here
100) and the last segment (here 340), and the pointers to the segments.
This approach seems to favor sequential reads because the segments of
really large objects can be chosen accordingly large. On the other hand,
dynamic growth and shrinking in the middle is more complex than in the
Exodus approach.

Fig. 4.15. Representation of a large object in Starburst (segment sizes: 100, 200,
400, 800, 340)

The EOS approach. In EOS [Bil92] the Exodus and Starburst approaches
were combined such that variable sized segments are possible and a B+ -tree
is used as a directory in order to support dynamic growth and shrinking
efficiently. This is illustrated in Figure 4.16.

Fig. 4.16. Representation of a large object in EOS

[Bil92] also describes a buddy scheme for allocating the variable sized
segments.

5 Architecture of Client-Server-Systems

With the advent of powerful desktop computers in the early 1980s, client-
server-computing has become the predominant architecture of database ap-
plications. The database is installed on a powerful backend server while the
application programs are executed on the client computers. Here, we will
briefly survey the architectural design choices for client-server databases.

5.1 Query Versus Data Shipping

The client-server interaction can be classified according to the unit of in-
teraction between clients and database servers: Query Shipping versus Data
Shipping.
In a query shipping architecture - illustrated in Figure 5.1 (adapted from
[Fra96]) - the client sends a query to the server which processes the query
and sends the results back to the client.

Fig. 5.1. Query shipping client-server-architecture

Query shipping is the predominant architecture in today's relational da-
tabase systems. One of the advantages of query shipping lies in the fact that
only pre-selected data items are sent over the network between the server and
the client. A disadvantage is that the client computers are basically idle while
the server may be suffering a high workload when many clients are connected.
So, the server is in charge of performing the bulk of the data processing work.
In contrast, in a data shipping client-server-architecture as shown in Fig-
ure 5.2, the clients are doing the actual data processing. The server has the
role of a "smart disk" that sends clients the requested data and performs the
multi-user synchronization - in cooperation with the client. As can be seen
from the two illustrations, the client system is much more complex in a data
shipping architecture than under query shipping. The client has to buffer
data that was requested from the server, it has to do part of the transaction
management (in cooperation with the server), and it does the entire query
processing.

Data shipping is the predominant architecture in object-oriented database
systems. This was motivated by the envisioned profiles of object-oriented da-
tabase applications. Many of the applications operate on a focused context
(i.e., a relatively small set of objects) and perform complex and repeatedly
invoked operations on these data. In such a scenario the data shipping ar-
chitecture with its client side buffering works very well. On the other hand,
when evaluating queries that process large amounts of data but select only a
few result objects, a query shipping architecture is more appropriate. [FJK96]
discuss these issues and propose to combine the two approaches in a hybrid
architecture.

Fig. 5.2. Data shipping client-server-architecture

5.2 Page Versus Object Server

For a data shipping client-server-architecture there are two choices with re-
spect to the granularity of data items being shipped between the server and
the client(s): page versus object server. In a page server the client requests
entire pages (in the predetermined size of, e.g., 8KB). The effectiveness of
this architecture is dependent on a good clustering of objects onto pages.
This is a prerequisite for making good use of the resources: bandwidth of the
network and buffer space in the client.
In the object server architecture, the client requests individual objects
from the server. This way, only those objects that are actually needed in the
client are sent over the network and are placed in the client's buffer. That

is, the object server minimizes resource consumption as far as network band-
width and client buffer space are concerned. On the negative side, explicitly
requesting each individual object easily leads to a performance degradation
if many objects are accessed in the application.

[KK94] proposed a dual buffering scheme for a page server architecture
which combines the advantages of object and page server architectures. In
this architecture the client's buffer is segmented into two parts: a page and an
object segment. Clients request entire pages from the server; every incoming
page is put into the page segment of the buffer. Two approaches exist with
respect to referenced objects on that page: (1) under eager copying the object
is immediately (upon first access) copied into the object buffer segment and
(2) under lazy object copying objects remain on the page for as long as the
page is not evicted from the page buffer segment. If the page is actually
evicted, those objects that are deemed important are copied into the object
buffer segment.

Another dimension of the buffer management concerns relocation time.
This is the time at which an object copy - previously extracted from its home
page - is "given up" and, if necessary because of modification, transferred
back into its memory-resident home page. Under eager relocation the object
is immediately copied back onto its home as soon as the - previously evicted
- page returns to the page buffer segment because of an access to another
object on it. Under lazy relocation, the transfer of a modified object onto
its home page occurs only when the object is evicted from the object buffer
segment. In this case, the home page has to be brought in from the server
- unless the page was transferred back to the client in the mean time and
still resides in the page buffer segment. Figure 5.3 summarizes these control
strategies. More details on the design and the performance of a dual buffering
scheme can be found in [KK94].

The advantage of the dual buffer management is that well clustered pages
containing many objects relevant for the particular application are left intact
in the buffer. On the other hand, pages that contain only a few relevant ob-
jects are evicted from the buffer after these few objects have been extracted.
Under dual buffer management, the client's main memory buffer is effectively
utilized because only relevant objects occupy the precious buffer space. This
is achieved without incurring the high client-server interaction rate exhibited
by an object server. It is, of course, the buffer management's task to maintain
access statistics such that the two types of pages - those containing a high
portion of relevant objects and those containing only few relevant objects -
are detected.
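The lazy-copying behaviour of such a dual buffer can be sketched in Python (an illustrative addition; the class and method names are made up). Requested objects are read from pages held in the page segment; when a page is evicted, only the objects that were actually used on it are copied into the object segment, so poorly clustered pages do not waste buffer space.

# Dual buffering with lazy object copying (illustrative sketch).
class DualBuffer:
    def __init__(self, server_pages):
        self.server = server_pages      # page_no -> {oid: object}, standing in for the server
        self.page_buf = {}              # page segment of the client buffer
        self.obj_buf = {}               # object segment of the client buffer
        self.used = {}                  # page_no -> set of OIDs accessed on that page

    def access(self, page_no, oid):
        if oid in self.obj_buf:                          # already extracted earlier
            return self.obj_buf[oid]
        if page_no not in self.page_buf:                 # page fault: fetch the whole page
            self.page_buf[page_no] = dict(self.server[page_no])
            self.used[page_no] = set()
        self.used[page_no].add(oid)
        return self.page_buf[page_no][oid]

    def evict_page(self, page_no):
        """Lazy object copying: keep only the objects that were actually used."""
        for oid in self.used.pop(page_no, set()):
            self.obj_buf[oid] = self.page_buf[page_no][oid]
        del self.page_buf[page_no]

buf = DualBuffer({1: {"id1": "Knuth", "id5": "Turing", "id6": "Babbage"}})
buf.access(1, "id1")
buf.evict_page(1)                 # only id1 is copied into the object segment
print(buf.obj_buf)                # {'id1': 'Knuth'}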

                     eager relocation                     lazy relocation
lazy copying    • object copying on replacement     • object copying on replacement
                  of home page                        of home page
                • relocation on reloading of        • relocation on replacement
                  home page                           of object
eager copying   • object copying on identification  • object copying on identification
                • relocation on reloading of        • relocation on replacement
                  home page                           of object

Fig. 5.3. Classification scheme for dual buffering

6 Indexing

In this and the next section we will use the object base shown in Figure 6.1
for the illustration of new indexing techniques in object-oriented database
systems.

6.1 Access Support Relations: Indexing Path Expressions

In the context of associative search one of the most performance-critical op-
erations in relational databases is the join of two or more relations. A lot of
research effort has been spent on expediting the join, e.g., access structures
to support the join, the sort-merge join, and the hash-join algorithm were
developed. Recently, the binary join index structure [Val87] building on links
[Här78] was designed as another optimization method for this operation.
In object-oriented database systems with object references the join based
on matching attribute values plays a less predominant role. More important
are object accesses along reference chains leading from one object instance to
another. This kind of object traversal is also called functional join [CDV88]
or implicit join [Kim89].
In Section 4.3 we discussed techniques for evaluating such functional joins.
In this section we present a very general indexing structure, called Access Sup-
port Relations (ASRs), which is designed to support the functional join along
arbitrarily long attribute chains where the chain may even contain collection-
valued attributes. The ASRs make it possible to avoid the actual evaluation of
the functional joins by materializing frequently traversed reference chains.

id35 Students:   StudentID: 78634, Name: "Ada", Semester: 5,
                 enrolled: {...}, takenExams: {id21, id23}
id37 Students:   StudentID: 87364, Name: ..., Semester: 7,
                 enrolled: {...}, takenExams: {id22, id27}
id53 Students:   StudentID: 67843, Name: ..., Semester: 3,
                 enrolled: {...}, takenExams: {}

id21 Exams:      ExDate: ..., Grade: 2.0, givenBy: id1, takenBy: ..., contents: id2
id22 Exams:      ExDate: ..., Grade: 3.0, givenBy: id1, takenBy: ..., contents: id3
id23 Exams:      ExDate: ..., Grade: 2.0, givenBy: id6, takenBy: ..., contents: id2
id27 Exams:      ExDate: ..., Grade: 2.0, givenBy: id6, takenBy: ..., contents: id2

id1 Professors:  SS#: 2137, Name: "Knuth", Rank: "full", residesIn: id9,
                 givenExams: {...}, teaches: {id2, id3}
id5 Professors:  SS#: 2457, Name: "Turing", Rank: "full", residesIn: ...,
                 givenExams: {...}, teaches: {...}
id6 Professors:  SS#: 4567, Name: "Babbage", Rank: "assoc.", residesIn: ...,
                 givenExams: {...}, teaches: {...}

Fig. 6.1. Example object base with Students, Exams, and Professors

Auxiliary definitions. A path expression has the form

    o.A1. ... .An

where o is a tuple-structured object containing the attribute A1 and
o.A1. ... .Ai refers to an object or a set of objects, all of which have an
attribute Ai+1. The result of the path expression is the set Rn, which is
recursively defined as follows:

    R0 := {o}
    Ri := ∪_{v ∈ Ri-1} v.Ai        for 1 ≤ i ≤ n

Thus, Rn is a set of OIDs of objects of type tn or a set of atomic values of
type tn if tn is an atomic data type, such as int.

It is also possible that the path expression originates in a collection C of
tuple-structured objects, i.e., C.A1. ... .An. Then the definition of the set R0
has to be revised to: R0 := C.
Formally, a path expression or attribute chain is defined as follows:

Definition 1 (Path Expression). Let t0, ..., tn be (not necessarily dis-
tinct) types. A path expression on t0 is an expression t0.A1. ... .An iff for
each 1 ≤ i ≤ n one of the following conditions holds:

• The type ti-1 is defined as type ti-1 is [..., Ai : ti, ...], i.e., ti-1 is a
  tuple with an attribute Ai of type ti.7
• The type ti-1 is defined as type ti-1 is [..., Ai : {ti}, ...], i.e., the at-
  tribute Ai is set-structured. In this case we speak of a set occurrence at
  Ai in the path t0.A1. ... .An.
For simplicity of the presentation we assume that the involved types are
not defined as subtypes of some other type. This, of course, is generally
possible; it would only make the definition a bit more complex to read.
The second part of the definition is useful to support access paths through
sets.8 If it does not apply for a given path the path is called linear. A path
expression that contains at least one set-valued attribute is called set-valued.
Since an access path can be seen as a relation, we will use relation exten-
sions to represent materialized path expressions. The next definition maps a
given path expression to the underlying access support relation declaration.

Definition 2 (Access Support Relation (ASR)). Let t0, ..., tn be types and
t0.A1. ... .An be a path expression. Then the access support relation
[[t0.A1. ... .An]] is of arity n + 1 and has the following form:

    [[t0.A1. ... .An]] : [S0, ..., Sn]

The domain of the attribute Si is the set of identifiers (OIDs) of objects of
type ti for 0 ≤ i ≤ n. If tn is an atomic type then the domain of Sn is tn,
i.e., values are directly stored in the access support relation.

We distinguish several possibilities for the extension of such relations. To
define them for a path expression t0.A1. ... .An we need n temporary relations
[[t0.A1]], ..., [[tn-1.An]].

Definition 3 (Temporary Binary Relations). For each i (1 ≤ i ≤ n)
- that is, for each attribute in the path expression - we construct the tem-
porary binary relation [[ti-1.Ai]]. The relation [[ti-1.Ai]] contains the tuples
(id(oi-1), id(oi)) for every object oi-1 of type ti-1 and oi of type ti such that

7 This means that the attribute Ai can be associated with objects of type ti or any
  subtype thereof.
8 Note, however, that we do not permit powersets.

• oi-1.Ai = oi if Ai is a single-valued attribute.
• oi ∈ oi-1.Ai if Ai is a set-valued attribute.

If tn is an atomic type then id(on) corresponds to the value on-1.An. Note,
however, that only the last type tn in a path expression can possibly be an
atomic type.
Let us illustrate this on an example of our University schema:

    p ≡ Students.takenExams.givenBy.Name

The type constraints along the path are Students, Exams, Professors, and
string. When considering the update problem, it should be obvious that
strong typing - as enforced by the ODMG model - is vital to indexing over
path expressions. Therefore, models with a more relaxed typing paradigm
such as, e.g., GemStone, which is based on the dynamically typed Smalltalk,
have to impose user-specified and dynamically controlled type constraints on
attributes and/or paths that are indexed.
For the path expression specified above the temporary binary relations
have the following extensions:
[[Students.takenExams]]               [[Exams.givenBy]]
OIDStudents   OIDExams                OIDExams   OIDProfessors
id35          id21                    id21       id1
id35          id23                    id22       id1
id37          id22                    id23       id6
id37          id27                    id27       id6
...           ...                     ...        ...

[[Professors.Name]]
OIDProfessors   string
id1             "Knuth"
id5             "Turing"
id6             "Babbage"
...             ...

Extensions of access support relations. We now introduce different pos-
sible extensions of the ASR [[t0.A1. ... .An]]. We distinguish four extensions:

1. The canonical extension, denoted [[t0.A1. ... .An]]can, contains only infor-
   mation about complete paths, i.e., paths originating in t0 and leading (all
   the way) to tn. Therefore, it can only be used to evaluate queries that
   originate in an object of type t0 and "go all the way" to tn.
2. The left-complete extension [[t0.A1. ... .An]]left contains all paths origi-
   nating in t0 but not necessarily leading to tn, possibly ending in a
   NULL.
3. The right-complete extension [[t0.A1. ... .An]]right, analogously, contains
   paths leading to tn, but possibly originating in some object oj of type tj
   which is not referenced by any object of type tj-1 via the Aj attribute.
4. Finally, the full extension [[t0.A1. ... .An]]full contains all partial paths,
   even if they do not originate in t0 or do end in a NULL.

Definition 4 (Extensions). Let t><I (J><[, J><] , [XC ) denote the nat-
ural (outer, left outer, right outer) join on the last column of the first relation
and the first column of the second relation. Then the different extensions are
obtained as follows:

[[io.AI.··· .Anl] can := [[io.All] t><I ... t><I [[in-I.Anl]


[[io.AI .... .Anl] full := [[io.All] J><[ ... J><[ [[in-I.Anl]
[[io.A I.··· .Anl] left := C·· ([[to.AIl] J><] [[tl .A2l]) ... J><] [[in-I.Anl])
[[to.A I.··· .AnJ],-;9 ht := ([[io.All] [XC ... (rrin-2.An-ll] I><[ [[in-I.An]) .. -)

The full extension of [[Students.takenExams.givenBy.Name]]!UII looks as


follows:

[Students. takenExarns.givenBy.N arne]/,,!!


So : DID Students Sl : DID Exams S2 : DID Professors S3 : string
id35 id21 idl "Knuth"
id35 id23 id 6 "Babbage"
id37 id22 idl "Knuth"
id37 id27 id6 "Babbage"
- - ids "'luring"
., . . .. .. . . ..

This extension contains all paths and subpaths corresponding to the underly-
ing path expression. The first four tuples actually constitute complete paths
which would be present in the canonical extension as well; however the fifth
path would be omitted in the canonical extension. In the left-complete ex-
tension the only he first four tuples would be present, whereas the fifth tuple
would also be present in the the right-complete extension.
It should be obvious, that the full extension of an ASR contains more in-
formation than the left- or right-complete extensions which, in turn, contain
more information than the canonical extension. The right- and left-complete
extensions are incomparable. The next definition states under what condi-
tions an existing access support relation can be utilized to evaluate a path
expression that originates in an object (or a set of objects) of type s.
144 A. Kemper and G. Moerkotte

Definition 5 (Applicability).
An access support relation [[to.A 1 •••• .An)] X under extension X is applicable
for a path s.Ai .··· .Aj where s is a sUbtype9 of ti-l under the following
condition, depending on the extension X:

X=ft£ll 1\1:5i:5j:5n
. { X = left 1\ 1 = i :5 j :5 n
Applzcable([[to.A 1 •••• .Anl] x, S.Ai.· ... A j ) = X = right 1\ 1 :5 i :5 j = n
X = can 1\ 1 = i :5 j = n

Storage structure. The storage structure of access support relations is


borrowed from the binary join index proposal by Valduriez [Val87). Each
ASR is redundantly stored in two index structures: the first being keyed
on the left-most attribute and the second being keyed on the right-most
attribute. Suitable index structures are hash tables or B+ -trees. The hash
table is particularly suitable for keys consisting of OlD attributes since only
exact match queries have to be supported. On the other hand, B+ -trees are
advantageous for attributes that allow range queries, e.g., int and float values.
Note, that these attributes can only occur at the right-most column of an
ASR. The following discussion is solely based on B+ -trees; however, it can
easily be adapted to hash tables.
Graphically, the redundant storage scheme consisting of two
B+ -trees for each ASR is visualized for the canonical ASR
[[Students.takenExams.givenBy.Name)] can as follows:

[Students. takenExams.givenBy .Name] can


80: OIDStudents 81 : OIDEzams 82 : OlD Professors 83: string
id35 id21 id1 "Knuth"
id35 id23 id 6 "Babbage"
id 37 id22 id1 "Knuth"
id37 id27 id6 "Babbage"

We will call the left B+ -tree the "forward clustered" tree, and, analo-
gously, the right one the "backward clustered" tree. The left-hand B+ -tree
supports the evaluation of a forward query, e.g., retrieving the professor's
name who has examined the student identified by id35 • The left-hand B+ -tree
supports the evaluation of backward queries - with respect to the underlying
path expression. For our example, an entry point for finding the Students
who have taken exams from "Knuth" is provided by the backward clustered
B+-tree.
This storage scheme is also well suited for traversing paths from left-to-
right (forward) as well as from right-to-Ieft (backward) even if they span
9 Note, that every type is a subtype of itself.
4. Object-Oriented Database Systems 145

over several "connecting" access support relations. Again, let us graphically


visualize the situation:

[Students.takenExams.givenBY)can
80 : OlD Students 8 1 : OlD Exams 82 : OlD Professors
id35 id21 id1
id35 id23 id6
id37 id22 id1
~~ ~n ~

I [Professors.Name)can
80 : OlD Professors 81 : string
id6 "Babbage"
id 1 "Knuth"
id5 "'lUring"

The above example illustrates the virtues of the redundant storage model
for ASRs. The right B+ -tree of the ASR [[Professors. Name]] can directly
supports the lookup of those Professors whose Name is Knuth, i.e., the
one with OlD id l in our example. Then, the right B+ -tree of the ASR
[[Students.takenExams.givenBy]] can supports the traversal to the correspond-
ing Students to obtain the result {id35 , id37 }.
Thus, the backward traversal constitutes a "right-to-Ieft" semi-join across
ASRs:
r[Students.takenExams.givenByllUcan
IIso( U IX r[Projessors.Namel1
O'Sl=Knuth( U ~can
))

Analogously, the "forward clustered" B+ -tree supports the semi-join from left
to right, such that, for instance, the Names of Professors who have examined
student id35 can be retrieved efficiently. This corresponds to the "left-to-
right" semi-join across ASRs:
IIs3(O'so=id35 ([[Students.takenExams.givenByJ] ca) ><I nprojessors.Namel] can)

Join index hierarchies. Recently, [XH94] have adapted the ASR scheme
to a so-called join index hierarchy. Their key idea is to omit the intermediate
objects in the join index and merely store the OIDs of the start and the
target object. In addition, the number of possible paths between the start
and target object is counted and stored.
Their approach is still based on the binary access support relations
[[to.A l ] , ... , [tn-l.A n ]]. A join index covering the sub-path from ti to tj
- denoted JI(ti.Ai+l.··· .Aj ) is obtained from the binary ASRs as follows:
146 A. Kemper and G. Moerkotte

The !XI c operation is similar to a conventional natural join except that it


projects on the OIDs of the start and target objects and it properly derives
the counts of the number of different paths leading from start to target object
and eliminates duplicates with the same start and target by aggregating the
count values.lO

OIDtl OIDt" count


OIDtl OIDtk count OIDtk OIDt " count
id s7 id88 17
ids7 id67 3 id67 id88 3
ids9 id88 8
id s7 id78 2 id78 id 88 4
id s7 id 99 4
id s9 id78 2 id 78 id 99 2
id s9 id99 4

Let us briefly explain the derivation of the first tuple [ids7, id88 , 17]. There
are 3 paths connecting ids7 with id67 and 3 paths connecting id67 with id88.
Therefore, there are 3 * 3 = 9 different ways to traverse from ids7 to id88 via
id67' Likewise, there are 2 * 4 = 8 different ways to traverse from ids7 to id88
via id78. This amounts to 9 + 8 = 17 different paths between ids7 and id88.
For our university database the join index
JI(Students.takenExams.givenBy.Name) looks as follows:

IJI(Students.takenExams.givenBy.Name)I
OlDStudents string count
id 3S "Knuth" 1
id 3S "Babbage" 1
id 37 "Knuth" 1
id37 "Babbage" 1

Note that join indices are always ternary relations - no matter how long
a path expression they cover, because intermediate objects are omitted. In
our case, the join index does not contain less tuples than the canonical ASR
because none of the students in our example database has taken two (or
more) exams from the same professor.
Maintaining just one join index covering the entire path expression is
usually not sufficient because it allows to evaluate only those queries that
span the entire path. The other extreme is to materialize all the possible join
indices that cover anyone of the subpaths. This results in precomputing (and
maintaining) the so-called complete join index hierarchy. For the abstract
path expression to.Al.A2.A3.A4 the complete join index hierarchy is shown
in Figure 6.2.
The disadvantage of the complete join index hierarchy is that materializ-
ing all the possible join indices leads to high storage and high update costs.

10 For simplicity we assume that the binary ASRs are augmented with a count
attribute which is set to 1 in all tuples.
4. Object-Oriented Database Systems 147

length 4

length 3

length 2

length 1

Fig. 6.2. The complete join index hierarchy for a path of length 4

This is especially disadvantageous if some of the join indices are rarely/never


used for query processing.
Assuming that most queries traverse from to to t4 (or vice versa), from
t1 to t4, from t2 to t4, or from to to t2 then the partial join index hierarchy
shown in Figure 6.3 is most appropriate.

length 4

length 3

length 2

length 1

Fig. 6.3. A partial join index hierarchy for a path of length 4 .

Updating a join index hierarchy. Updates on the object base have to be


properly propagated to all materialized join indices that span the correspond-
ing type. Consider the partial join index hierarchy of Figure 6.3: An update of
attribute A2 of an object 01 E tt has to be propagated to the binary ASR (join
index) [[t1.A21] and to the three join indices JI(to.A 1.A2), JI(t1.A2.A3.A4),
and JI(tO.A1.A2.A3.A4)' Such updates are propagated step-wise from the
bottom to the top of the join index hierarchy. For this purpose we observe
the following rule. Let LlJI( tl.A I+1' ... . Ak) denote a set of tuples inserted into
join index JI(tl.AI+ 1. ... .Ak)' Then, the join index JI(tl.A I +1' ... . Ak . ... .Ap)
is properly updated by inserting the tuples Ll JI(tl.A I +1.··· .Ak) 1><l c
JI(tk.Ak+1.··· .Ap). An analogous rule holds if the tuples were inserted into
the join index spanning a suffix (instead of a prefix) of the encompassing
path; e.g., if tuples were inserted into JI(tk.Ak+1' ... .Ap). Inserting tuples
148 A. Kemper and G. Moerkotte

into a join index has to be done with care: If a tuple [id i , idj , n] exists in
the relation and another tuple [id i , idj , m] representing m additional paths
between idi and idj is inserted, the two should be combined to the one tuple
[id i , idj , (n + m)].
Let us illustrate this bottom-up update propagation on our partial join
index hierarchy of Figure 6.3. Inserting the additional tuple(s) Ll [[tl.A 2]] into
[t l .A2 ]] is propagated to the other join indices as follows:
1. JI(to.A I .A2) is updated by inserting the tuples
LlJI(to.A I .A2) := [[to.A I ]] IXI eLl [[tl.A 2]] .
2. JI(tl.A 2.A3.A4) is updated by inserting the tuples
Ll[[tl.A2] IXI cJI(t2.A3.A4)'
3. In updating JI(to.A I .A2.A3.A4) the set of new tuples LlJI(to.A I .A2)
for join index JI(tO.AI.A2) that was computed in step 1. is
reused. JI(to.A I .A2.A3.A4) is updated by inserting the tuples
LlJI(to.A I .A2) IXI cJI(t2.A3.A4)'

Literature. Access support relations were developed by the authors [KM90].


A more detailed description can be found in [KM92]. Access support relations
constitute a generalization of two relational techniques: the links developed
by Harder [Har78] and the binary join indices proposed by Valduriez [VaI87].
Rather than relating only two relations (or object types) our technique allows
to support access paths ranging over many types. The ASR scheme subsumes
and extends several previously proposed strategies for access optimization in
object bases. The index paths in GemStone [MS86] are restricted to chains
that contain only single-valued attributes and their representation is limited
to binary partitions of the access path. Similarly, the object-oriented access
techniques described for the Orion model [BK89] are contained as a special
case in our framework. [KD91] reports on an indexing technique for hierar-
chical object structures, i.e., nested relations, which is related to our access
support relations.
Our technique differs in three major aspects from the aforementioned
approaches:
• access support relations allow collection-valued attributes within the at-
tribute chain
• access support relations may be maintained in four different extensions.
The extension determines the amount of (reference) information that is
kept in the index structure.
• The paths over which ASRs are defined may be decomposed into parti-
tions (sub-paths) of arbitrary lengths. This allows the database designer
to choose the best extension and path partitioning according to the par-
ticular application characteristics.
Also the (separate) replication of object values as proposed for the Extra
object model [SC89] and for the POSTGRES model [SAH87,SeI88] are largely
subsumed by ASRs.
4. Object-Oriented Database Systems 149

The join index hierarchies were proposed by [XH94].

6.2 Function Materialization

Let us nOW discuss an optimization technique that is devised to expedite


the evaluation of queries containing function invocations, such the following
example query:

select s
from s in AllStudents
where s.gpaO > 3.0

The function materialization technique is rooted in view materialization,


which is a well-known optimization technique of relational database systems.
Here, we present the basics of a similar, yet more powerful optimization COn-
cept for object-oriented data models: function materialization. Exploiting the
object-oriented paradigm - namely classification, object identity, and encap-
sulation - facilitates a rather easy incorporation of function materialization
into (existing) object-oriented systems. Only those types (classes) whose in-
stances are involved in some materialization are appropriately modified and
recompiled - thus leaving the remainder of the object system invariant. Fur-
thermore, the exploitation of encapsulation (information hiding) and object
identity provides for additional performance tuning measures which drasti-
cally decrease the invalidation and rematerialization overhead incurred by
updates in the object base.

Storing materialized results. There are two obvious locations where ma-
terialized results could possibly be stored: in or near the argument objects of
the materialized function or in a separate data structure. Storing the results
near the argument objects meanS that the argument and the function result
are stored within the same page such that the access from the argument to
the appropriate result requires nO additional page access. In general, storing
results near the argument objects has several disadvantages:

• If the materialized function f : tI, ... , tn -7 tn+l has more than one
argument (n > 1) One of the argument types must be designated to hold
the materialized result. But this argument has to maintain the results of
all argument combinations - which, in general, won't fit On One page.
• Clustering of function results would be beneficial to support selective
queries on the results. But this is not possible if the location of the ma-
terialized results is determined by the location of the argument objects.

Therefore, [KKM94] chose to store materialized results in a separate data


structure disassociated from the argument objects. If several functions are
materialized which share all argument types, the results of these functions
150 A. Kemper and G. Moerkotte

may be stored within the same data structure. This provides for more effi-
ciency when evaluating queries that access results of several of these functions
and, further, avoids to store the arguments redundantly. These thoughts lead
to the following definition:
Definition 6 (Generalized Materialization Relation, GMR).
Let t1, ... , tn, t n+1, ... , tn+m be types and let h, ... , fm be side-effect free
functions with fj : h, ... , tn ---t tn+j for 1 :::; j :::; m. Then the generalized
materialization relation ((h, ... , fm)) for the functions h,.··, fm is of arity
n + 2 * m and has the following form:
((h,···, fm))
tn+m, Vm : boot]
Intuitively, the attributes 0 1 , ... ,On store the arguments (Le., values if the
argument type is atomic or references to objects if the argument type is
complex); the attributes h, ... , f m store the results or - if the result is of
complex type - references to the result objects of the invocations of the
functions iI, ... , fm; and the attributes V1, ... , Vm (standing for validity)
indicate whether the stored results are currently valid.
An extension of the GMR ((iI, ... , fm)) is consistent if a true validity
indicator implies that the associated materialized result is currently valid,
Le.:

In the remainder of this paper we consider only consistent GMR exten-


sions. However, consistency is only a minimal requirement on GMR exten-
sions. Further requirements like completeness and validity are introduced in
the next subsection where the retrieval of materialized results is discussed.
The above definition of consistency provides for some tuning measure with
respect to the invalidation and rematerialization of results. Upon an update
to a database object that invalidates a materialized function result we have
two choices:
1. immediate rematerialization: The invalidated function result is immedi-
ately recomputed as soon as the invalidation occurs.
2. lazy rematerialization: The invalidated function result is only marked as
being invalid by setting the corresponding Vi attribute to false. The rema-
terialization of invalidated results is carried out as soon as the load of the
object base management system falls below a predetermined threshold or
- at the latest - at the next time the function result is needed.
In this presentation we will discuss only the materialization of functions
having complex argument types. As can easily be seen it is not practical
to materialize a function for all values of an atomic argument type, e.g.,
float. Therefore, in [KKM94] we proposed restricted GMRs for materializing
functions for selected parameters only.
4. Object-Oriented Database Systems 151

An example of a GMR comprising only a single materialized function,


i.e., Students.gpa is shown below:

((Students.gpa))
01 : OlD Students gpa : float Vgpa : bool
i d35 2.0 true
id37 2.5 true
id53 - true
... ... . ..

Retrieval of materialized results. The GMR manager - which manages


all GMR extensions - has to facilitate a flexible retrieval interface in or-
der to support queries containing invocations of materialized functions. For
example, if all results of the materialized function f; are requested by some
query, e.g., to perform some aggregate operation on the results, all results cur-
rently being valid can be obtained from the GMR - invalid results have to be
(reo-)computed. These (reo-)computed results are also used by the GMR man-
ager to update the GMR. Further, if the GMR is not complete, i.e., it does
not contain an entry for each argument combination, the results of missing
argument combinations have to be computed as well. Missing GMR entries
whose results are computed during the evaluation of some query may be in-
serted into the GMR (for a discussion of incomplete GMR extensions see
below).
Note that invalidated or missing results need not necessarily be
(reo-)computed upon the evaluation of some query. For example, if any stu-
dent having a gpa greater than 3.0 is to be retrieved (e.g., to serve as a tutor),
and if such a student can be found by inspecting the (incomplete) GMR no
invalidated or missing results need be (reo-)computed.
However, if a GMR extension is complete, i.e., it contains one entry for
each argument combination, and the results of all functions occurring in the
respective query are valid, the query can be evaluated on the GMR without
having to (reo-)compute any result. Subsequently, we formalize the notions of
valid and complete GMR extensions.
Definition 7 (Valid Extension). A consistent extension of the GMR
((II, ... ,1m)) is called f;-valid iff
7r V; ((II,···, 1m)) = { true}

Definition 8 (Complete extension). A consistent extension of the GMR


((II, ... , 1m)) is a complete extension iff
7r Ob ... ,On ((II, ... ,1m)) = ext(tl) X ••• X ext(t n)
where ext(ti) denotes the extension of type ti, i.e., the set of all instances of
type ti.
152 A. Kemper and G. Moerkotte

Upon the creation of a new GMR the database administrator can choose
whether the GMR extension has to be complete or whether the extension
may be set up incrementally (starting with an empty GMR extension). In-
crementally set up GMR extensions can be used as a cache for function results
that were computed during the evaluation of queries. If the number of entries
is limited (due to space restrictions) specialized replacement strategies for the
GMR entries can be applied. Note that GMRs must be set up incrementally
if they contain at least one partial function.
It should now be obvious that the example query Q3 can be evaluated as

11 01 (a gpa>3.0 ((Students.gpa)))
as long as the GMR ((Students.gpa)) is gpa-valid and complete.

Storage representation of GMRs. The flexible retrieval operations on the


GMRs require appropriate index structures to avoid the exhaustive search of
GMR extensions. For that, well-known indexing techniques from relational
database technology can be utilized. The easiest way to support the flexible
and efficient access to any combination of GMR fields would be a single multi-
dimensional index structure, denoted MDS, over the fields 0 1 , •.• , On, h,
···,/m:

Here, the first n + m columns constitute the (n + m )-dimensional keys of


the multi-dimensional storage structure. The m validity bits VI, ... , Vm are
additional attributes of the records being stored in the MDS.
Instead of using multi-dimensional storage structures, such as the Grid-
File [NHS84], one could also utilize more conventional indexing schemes to
expedite the access on GMRs of higher arity. The index structures are chosen
according to the expected query mix, the number of argument fields in the
GMR, and the number of functions in the GMR. A good proposal for multi-
dimensional indexing based on regular B-trees is given in an early paper by
V. Lum [Lum70].

Invalidation and remateriaiization of function results. When the


modification of an object 0 is reported, the GMR manager must find all
materialized results that become invalid. This task is equivalent to deter-
mining all materialized functions I and all argument combinations 01, ... , On
such that the modified object 0 has been accessed during the materialization
of 1(01, . .. ,On). Note that in general references are maintained only unidirec-
tionally. In this case, there is no efficient way to determine from an object 0
the set of objects that reference 0 via a particular path. Therefore, the GMR
manager maintains reverse references from all objects that have been used
in some materialization to the appropriate argument objects in a relation
4. Object-Oriented Database Systems 153

called Reverse Reference Relation (RRR). The RRR contains tuples of the
following form:
[id{o), f, (id{Ol), ... ,id{on))]
Herein, id{o) is the identifier of an object 0 utilized during the materialization
of the result f{ol, ... , on). Note that 0 need not be one of the arguments
01, ... , On; it could be some object related to one of the arguments. Thus,
each tuple of the RRR constitutes a reference from an object 0 influencing a
materialized result to the tuple of the appropriate GMR in which the result
is stored. We call this a reverse reference as there exists a reference chain in
the opposite direction in the object base. l l
Definition 9 (Reverse Reference Relation). The Reverse Reference Re-
lation RRR is a set of tuples of the form
[0: OlD, F: Functionld, A: (OlD)]
For each tuple r E RRR the following condition holds: The object (with the
identifier) r.O has been accessed during the materialization of the function
r.F with the argument list r.A. Remember, that the angle brackets (... )
denote the list constructor.
The reverse references are inserted into the RRR during the materializa-
tion process. Therefore, each materialized function f and all functions in-
voked by f are modified - the modified versions are extended by statements
that inform the GMR manager about the set of accessed objects. During
a {re-)materialization of some result the modified versions of these functions
are invoked.
For our University object base a part of the RRR that controls the in-
validation of precomputed results in the GMR ((Students.gpa)) is shown in
Figure 6.4. Each time an object is updated in the object base, the RRR
is inspected to find out which materialized results have to be invalidated
(lazy rematerialization) or recomputed (immediate rematerialization). Ref-
erence [KKM94] describes ways to detect object updates by schema modifi-
cation and efficient algorithms for maintaining the RRR - which, of course,
changes under object base updates.

Strategies to reduce the invalidation overhead. The invalidation mech-


anism outlined so far is (still) rather unsophisticated and, therefore, induces
unnecessarily high update penalties upon object modifications. In [KKM94],
four complementary techniques to reduce the update penalty - consisting of
invalidation and rematerialization - by better exploiting the potential of the
object-oriented paradigm were developed. The techniques described there are
based on the following ideas:
11 This holds only if no global variables are used by the materialized function.
Otherwise, the RRR contains reverse references not only to the argument objects
but also to the accessed global variables.
154 A. Kemper and G. Moerkotte

RRR ((Students.gpa))
0 F A 01 : OlD Students gpa : float Vgpa : bool
id21 Students.gpa (id35) id35 2.0 true
id22 Students.gpa (id37) id37 2.5 true
id23 Students.gpa (id35) id53 - true
id27 Students.gpa (id37) ... ... . ..
.. . .. . . ..

Fig. 6.4. The data structures of the GMR manager

1. isolation of relevant object properties: Materialized results typically de-


pend on only a small fraction of the state of the objects visited in the
course of materialization. For example, the materialized gpa certainly
does not depend on the Semester and Name attributes of a Student.
2. reduction of RRR lookups: The unsophisticated version of the invalidation
process has to check the RRR each time any object 0 is being updated.
This leads to many unnecessary table lookups which can be avoided by
maintaining more information within the objects being involved in some
materialization - and thus restricting the lookup penalty to only these
objects.
3. exploitation of strict encapsulation: By strictly encapsulating the repre-
sentation of objects used by a materialized function, the number of up-
date operations that need be modified can be reduced significantly. Since
internal subobjects of a strictly encapsulated object cannot be updated
separately - without invoking an outer-level operation of the strictly en-
capsulated object - we can drastically reduce the number of invalidations
by triggering the invalidation only by the outer-level operation.
4. compensating updates: Instead of invoking the materialized function to
recompute an invalidated result, specialized compensating actions can
be invoked that use the old result and the parameters of the update
operation to recompute the new result in a more efficient way.

Literature. The function materialization discussed here is - in its basic


ideas - similar to materialization of views in the relational context. The
most important work is reported in [BCL89] and [BLT86]. Lindsay proposed
so-called relational snapshots [AL80] - which, however, are not guaranteed
consistent with the actual state of the database. The snapshots are only
periodically recomputed and, thereby, brought back into a consistent state.
Thus, a snapshot can only be used for certain applications, e.g., browsing,
that do not require a completely consistent information contents.
Further work in precomputing queries and database procedures was
done in the extended relational database project POSTGRES [SR86]. Here,
the so-called "QUEL as a Datatype" attributes are precomputed and
4. Object-Oriented Database Systems 155

cached in separate data structures. The control concepts are discussed in


[Han87 ,Han88,Jhi88,SeI88,SAH87 ,SJG+90j.
The function materialization concepts discussed in this chapter were de-
veloped by the authors and C. Kilger. The basic ideas are described in
[KKM91j; a more detailed discussion is given in [KKM90j.

6.3 Indexing Over Type Hierarchies

For the discussion of this section consider the type hierarchy shown in Fig-
ure 6.5. Based on this type hierarchy we can phrase the following three

Person

Emp Student

r
Manager

r
CEO
Fig. 6.5. Sample type hierarchy

queries, the meaning of which should be obvious:


Q1: select e Q2: select c Q3: select p
from Emp e from CEO c from Person p
where e.salary > 200000 where c.salary > 200000 where p.age > 60
In query Q1 we want to retrieve Emps (including Managers and CEOs)
whose salary exceeds 200000. In query Q2, on the other hand, we are only
interested in the CEOs with such a high salary; because of substitutability
the result of Q2 is a subset of the result obtained in query Q1. In query
Q3 all Persons whose age exceeds 60 are retrieved, i.e., including all Emps,
Managers, CEOs and Students.
The evaluation of such queries can (and should) be supported by indexing.
Indexing can be viewed as a special case of access support relations, except
that mostly the backward clustered B+ -tree is relevant and, therefore, the
forward clustered tree may be omitted. Consider, for example, the index
for Emp.salary which is represented as the ASR [[Emp.salaryj] as follows
- assuming that id4 , id5 , id6 , and id7 are Emp objects, id ll and id13 are
Managers and idn and id88 identify CEOs:
156 A. Kemper and G. Moerkotte

[Emp.salary]
80: OIDEmp 81 : int
id4 90000
id 5 100000
id7 100000
id n 150000 B+
ids 260000
id13 900000
id77 1500000
id88 2000000

Note, that this index, because of substitutability, implicitly includes all


Managers and all CEOs, for which the salary is "known". Therefore, the eval-
uation of query Q1 is very well supported since it involves a lookup in a single
B+ -tree only. However, query Q2 is not as well supported since it involves re-
trieving the OIDs from the index [[Emp.salary]] for which the salary attribute
exceeds 200000. The resulting set, however, contains Emps and Managers as
well. So, in a (costly) second phase, the CEOs have to be extracted from this
set.

Single type indexing. The idea of single type indexing is to incorporate


only the direct instances of a particular type in the index. Let us denote
the set of direct instances of a type T as T. Then, for our example we
could create three separate indexes, i.e.: [Emp.salary]], [[Manager. salary]] ,
and [CEO . salary]] .
These indexes have the following form:

[Emp.salary]
80 : OlD Emp 8 1 : int
id4 90000
id5 100000
id7 100000
ids 260000

[Manager .salary] [CEO.salary]


80: OlD Manager 8 1 : int 80 :OIDcEO 8 1 : int
id l l 150000 id77 1500000
id 13 900000 idss 2000000

Now, evaluating query Q2 is very well supported because it involves only


a lookup in the index [[ CEO.salary]].
4. Object-Oriented Database Systems 157

However, the evaluation of query Q1 now involves a lookup in three sep-


arate B+ -trees and unioning the results, Le.:

a 81>200000 [[EmP.salary]]
u a 81>200000 [Manager. salary]]
U a 81>200000 [[CEO.salary]]
This problem appears even more severe when considering query Q3 under
the assumption of separate single type indexing on the age attribute.

Type hierarchy indexing. Because of the above discussed disadvantages


of separate single type indexing [KKD89] propose the use of a type hierarchy
index. The type hierarchy index consists of a single B+ tree which comprises
all direct and indirect instances of the indexed type. This is basically what we
adhered to in the ASR definitions, as well; except that [KKD89] developed a
special structure of the leaf nodes which is sketched as follows:
entry 1 entry 2
II
1, 2, . . 0, •.• , ••• 1 j
"--v--"

This layout of the leaf nodes provides support for extracting the (OIDs
of) objects of a particular type by jumping to the corresponding offset which
is maintained in the key directory.
[LOL92] developed an indexing scheme, called H-trees, for combining type
hierarchy indexing with single type indexing. The basic idea consists of nest-
ing B+ -trees, Le., nesting the index tree of a subtype within the tree of the
super type. For our example this is graphically visualized in Figure 6.6, where
the three H-trees HEmp for direct Emp instances, HManager for direct Man-
ager instances, and H CEO for CEO instances are sketched.
The nesting is achieved by incorporating so-called L pointers which refer
from the supertype index to the subtype index tree. There are two essential
conditions for a valid H-tree nesting:

1. The range of the subtree referenced by an L-pointer must be contained


in the range of the referencing node of the supertype tree. In terms of
our example, the range of the subtree T2 must be contained in the range
of T1, and the range of T4 must be contained in the range of T 3 .
158 A. Kemper and G. Moerkotte

HOED

Fig. 6.6. Three nested H-trees

2. All leaf nodes of a subtype tree have to be covered by the supertype


tree. This means, that all leaf nodes have to be reachable by following
L-pointers emanating from the supertype tree.

A single type lookup on H-trees is carried out by searching in the corre-


sponding H-tree and simply ignoring the L-pointers. A type hierarchy lookup
is carried out by searching in the H -tree of the root type (over which the query
is stated) and traversing the L-pointers to subtype trees.
Of course, the maintenance of the H-trees imposes a severe overhead on
update operations; the exact penalty of which still has to be investigated
more thoroughly.

The CG tree. [KM94] pointed out the principal difference between a key
grouping index - such as the CH-tree - and a set-grouping index - such as
the H-tree. Figure 6.7 sketches the relative performance of these two indexing
schemes for exact match and range queries. The key grouping scheme has very
good performance (Le., low numbers of pages have to be read) for exact match
queries whereas the set grouping scheme degenerates if many sets (Le., many
levels of a type hierarchy) have to be processed. This is due to the fact that
basically every type extent is covered by a separate B-tree. On the other hand,
the key grouping scheme shows poor performance for range queries because
it has to process a large number of leaf pages. It cannot draw profit from
a restriction on the number of sets (type extents) that should be processed
because all sets' objects are intermixed on the leaf pages.
Observing this principal difference, [KM94] designed the so-called CG-tree
which combines the advantages of both schemes. The idea is to replace the
leaf pages of a B+ -tree by several linked lists, one for each set (type extent)
being indexed. This basic idea is illustrated in Figure 6.8 for two sets (type
extents) 81 and 82 only.
The linked lists of leaf pages are considered to be at level 1 of the tree.
Then, at level 2 of the tree particularly structured so-called directory pages
4. Object-Oriented Database Systems 159

- •••• Grouping by key

•••••.••.•. Grouping by set

h: Height of the tree


n: Number of indexed sets

n Number of n Number of
Exact match query queried sets Range query queried sets

Fig. 6.7. Grouping by key versus grouping by set

81-objects
82-objects

Fig.6.S. Basic idea of the CG-tree

are needed that reference the pages at level 1. The directory pages have the
following structure - for n indexed sets S1, ... , Sn:

The directory page contains m search keys Kl. ... , Km. Thereby, the m
ranges R 1 , •.• , Rm are defined. For each range, the directory contains n point-
ers to level 1 pages. The pointer ~.Sj refers to the page of set
srobjects whose keys are in the range ~, i.e., whose keys are in the in-
terval [Ki' Ki+1). If the set Si does not contain any such elements, ~.Sj is
null.
The higher-up nodes of the CG-tree are regular B+ -tree nodes - having
just one emanating node pointer per range.
The cardinality of the indexed sets and their distribution of key values
may be non-uniform. In our example, one can expect that higher salaries
are typically found for CEO objects whereas the lower salaries are usually
paid to "regular" Emp objects. To compensate for this skew in attribute
value distribution, leaf nodes may be shared by several neighbored directory
entries. Such a situation is shown in Figure 6.9. Assuming that an underflow
occurs in the leaf pages L1 and/or L2 of the tree shown on the left-hand side.
This underflow is compensated by merging the two leaves into a single leaf
160 A. Kemper and G. Moerkotte

page L12 - as shown on the right-hand side. This combined leaf page is now
referenced by two neighboring directory entries via the pointers Rl.Sl and
R2. S1.

Fig. 6.9. Sharing of leaf pages

Unfortunately, space limitations prohibit a more detailed description of


the CG-tree in this presentation; more details can be found in [KM94J.

7 Dealing with Set-Valued Attributes

7.1 Introduction

In queries, selection and join predicates may not only refer to single-valued at-
tributes but also to set-valued attributes. Predicates on set-valued attributes
can be used as selection predicates and join predicates. The next query is an
example of the former:

select s
from s in Student
where Requirements <= s.coursesPassed

The query retrieves students who have passed at least all courses contained
in the set of courses Requirements. This query can be evaluated efficiently by
using an index on Student.coursesPassed.
The following query contains a join predicate based on set-valued at-
tributes. It matches students with courses. The result is a pair of courses and
students such that the student passed all the courses which are a prerequisite
for the course:

select c, s
from c in Course, s in Student
where c.prerequisites <= s.coursesPassed
4. Object-Oriented Database Systems 161

No traditional join algorithm designed for fast join processing - like hash-
join or sort-merge join - is able to handle this query. The only possible
evaluation strategy is to use the slow nested-loop join where every course's
prerequisites is compared with every student's coursesPassed. Obviously, this
is quite expensive.
Both queries contained the subset predicate <= but other queries could
use set equality, the strict or non-strict superset predicate and other variants.
All these possible set predicates can be treated by the techniques introduced
in this section. They rely on signatures which will be discussed in the next
section. Both, the new join algorithms and the new index structures use
signatures as their essential ingredient. Another common variant is to test
two sets for a non-empty intersection. This case cannot be treated by any of
the methods in this section.

Superimposed coding and signatures. The join operators and the index
structures for set-valued attributes represent sets by their signature. When
applying the technique of superimposed coding, each element of a given set s
is mapped via a coding function to a bit field of length b - called signature
length - where exactly k < b bits are set. These bit fields of all elements in
the set are superimposed by a bitwise or operation to yield the final signature
denoted by sig(s).
The following property of signatures is essential. Given two sets sand t,
the implication
sBt ==> sig(s)Bsig(t)
holds for any comparison operator B E {=,~,;;;?} where sig(s) ~ sig(t) and
sig(s) ;;;? sig(t) are defined as
sig(s) ~ sig(t) := sig(s)&-sig(t) = 0
sig(s) ;;;? sig(t) := sig(t)&-sig(s) = 0
As in the programming language C, & denotes bitwise and and - denotes
bitwise complement.

7.2 Join Algorithms for Set-Valued Attributes


The execution time of the simple nested-loop join evaluation depends on the
algorithm used for set comparison. We expect a set comparison algorithm
based on sorting to be faster than an algorithm performing pairwise compar-
ison of the set elements. Yet a faster mechanism of comparing sets is based
on signatures. The signatures of the two sets to be compared are computed
and compared. Only if the signature test is passed successfully, the original
sets have to be compared.
Introducing signature-based set comparisons into the nested-loop join al-
ready gives a recognizable speed-up. But even better algorithms can be de-
signed [HM97]. The algorithm discussed here is based on the traditional hash
162 A. Kemper and G. Moerkotte

join. Other algorithms extend the traditional sort-merge algorithm or cannot


be derived from traditional join algorithms like the tree-join [HM96).
Let us discuss first the case where the join predicate is based on set
equality. For every object, we compute the signature of its set-valued join
attribute. Thereby, we set k to one because otherwise inefficient random
number generators have to be used which does not pay. The lowest d bits of
the signature are then used as the hash value of the corresponding object.
A hash table is build for the inner relation. Each bucket in the hash table
contains pairs consisting of the signature and the object. After building the
hash table, the algorithm proceeds by computing the signature for every
object of the outer relation. Again, the lowest d bits are taken as the hash
value and the corresponding bucket is retrieved. Then, the signatures of the
bucket entries are compared and if this test is passed successfully, the set-
valued attributes of both object are compared. If they are equal, the output
object is constructed.
For subset predicates, this procedure must be refined. The first step -
building the hash table - remains unchanged but the probing step of the
outer relation's objects is modified. For every object of the outer relation we
must find those objects of the inner relation whose set-valued attribute is a
superset of the set-valued attribute of the outer object. Hence, we must step
through all supersets, or if we switch the inner and the outer relation, we
must step through all the subsets of a given set-valued attribute. Therefore,
we generate all signatures coding for subsets. More specifically, we generate
only the lowest d bits of all signatures that could code for a subset of a given
signature. If for example 0110 are the lowest d bits of our original signature,
we perform a hash table lookup for the hash keys 0110, 0100, 0010, and 0000.
There exist fast algorithms for computing superset and subset signatures.
It employs only a few bit-operations for each sub-/superset signature to be
generated [VM96):

Subset Signature Generation Superset Signature Genera-


tion
s = a & -aj
while(s) { s =-a & - -aj
s = a & (s - a)j while(s) {
process(s)j s=-a&(s--a)j
} processes)j
}

The performance of the signature-based hash join algorithms depends on


the signature size and other tuning parameters. We do not discuss these issues
here but instead refer the reader to [HM97) for implementation details.
4. Object-Oriented Database Systems 163

7.3 Indexing Set-Valued Attributes


We now tackle the problem of how to index a set of objects on a set-valued at-
tribute. The simplest index structure is the sequential signature file [IK093].
It contains a sequence of pairs [sig(oi.A), re!(oi)] for every object 0i whose
set-valued attribute A is indexed. The second component re!(oi) of the pair
contains a reference to the object, e.g. a physical object identifier. Since a
sequential signature file is much smaller than the original set of objects to be
indexed and furthermore it is clustered, retrieving qualifying objects via the
sequential signature file already results in high performance improvements.
Further alternatives exist, if we index the entries of the sequential sig-
nature file. Therefore, traditional index structures like extendible hashing
[FNP+ 79] or R-trees [Gut84] can be used.

d=3
000
001 d' =3 (x) = 010
010
01 1
100 d' =3 (x) = 011
10 1
1 10
111

Fig. 7.1. Extendible signature hashing

An extendible signature hashing index is divided into two parts, the direc-
tory and the buckets [FNP+79]. A bucket contains pairs [Sig(Oi.A), re!(oi)].
The directory begins with a header holding the global depth d. Further, it
consists of 2d entries containing references to buckets (see Fig. 7.1). When
looking up a data item in the directory, the lowest d bits of its signature are
used if the predicate to be evaluated is based on set equality. If a ~-predicate
is employed, than the same mechanism as for the signature-based hash join
164 A. Kemper and G. Moerkotte

is used: all possible d bit endings of signatures for subsets are generated and
each one is looked up in the hash table. Insertions and deletions are treated
the same way as in the original proposal for extendible hashing [FNP+79].

[sig(o,A) I sig(~ A) I sig(03.A), J

'\' denotes bitwise or

Fig. 7.2. Russian Doll Tree

Another possibility to index the entries contained in a sequential signature


file is to apply an R-tree [Gut84]. This results in the S-Tree [Dep86] or Russian
Doll Tree (RDT) [HP94]. The leaf pages of the RDT contain again the pairs
[Sig(Oi.A), ref(oi)]. Internal nodes contain for every child an entry of the form
[sig, ref] where sig is a signature derived by superimposing all signatures
found in the child node referenced by ref (see Fig. 7.2). During lookup, a
child node has to be inspected, if the sig part matches the signature of the
query set.
More details on the index structures and a performance study can be
found in the literature. It should be noted both approaches to use an index
structure yield good performance only if the query set is about the same size
as the indexed object's set-valued attribute.

8 Query Optimization

8.1 Overview

A query optimizer for object bases includes optimization techniques from re-
lational query optimization (e.g. join ordering) as well as new optimization
techniques (e.g. type-based rewriting). We will concentrate on the new op-
timization problems and techniques. For traditional optimization techniques
see [vBii90,JK84].
4. Object-Oriented Database Systems 165

A query optimizer typically involves the phases shown in Fig. 8.1. Within
the first phase, syntactic analysis takes place. Syntactic analysis is divided
into two substeps: lexical analysis and parsing. During lexical analysis a token
stream is generated which is then during parsing translated into an abstract
syntax tree. The techniques involved here are standard compiler techniques,
and are not discussed here.

-- Syntactical
Analysis
r---o NFST r---- Rewrite I r----
Query
Optimization
,....--

Rewrite II
-- Code
Generation
---;;..

Fig. 8.1. Query optimizer architecture

The second phase (NFST) includes several substeps:


1. normalization,
2. factorization,
3. semantic analysis, and
4. translation into some intermediate representation.
The intermediate representation can be based on some object calculus, com-
prehensions, an object algebra, or some enhanced version of an object algebra.
We will use some enhanced version of an object algebra. The enhancement
consists of a representation for SFW-blocks. The query optimizer module
NFST is discussed in more detail in section 8.2.
The third phase includes query rewrite. Here, semantic information is used
to rewrite and simplify the query and to introduce indexes. A simple query
rewrite phase has also been introduced for query optimizers for extensible
database systems [PHH92j. We discuss this phase in section 8.3.
The fourth phase is the core of the optimizer and computes an optimal
(query execution) plan. It implements cost-based optimization by exploring a
subset of all plans equivalent to the original query - the search space. Within
this search space, the cheapest plan found is selected. At the core of query
optimization are several algorithms for fast enumeration of specific search
spaces. Query optimization is the subject of section 8.4.
166 A. Kemper and G. Moerkotte

The fifth phase is again a rewrite phase. Here, small cosmetic rewrites are
applied to the plan in order to prepare it for the code generation phase. The
phases Rewrite II and Code Generation are discussed in section 8.5.
Different implementations of query optimizers use different names for
these phases and sometimes permute the phases or steps within the phases.
For example, some optimizers perform the semantic analysis after the trans-
lation into the algebra. Since the implementation of query optimizers is quite
tricky, different architectures have been designed to facilitate the organiza-
tion of the optimization process. Among them are rule-based query optimiz-
ers, region-based query optimizers, blackboard-based query optimizers, and
query optimizers using object-oriented implementation techniques. However,
we do not go into the details of these architectural approaches but instead
discuss the main tasks and techniques of each phase.

8.2 NFST
The first two steps of this phase consist in normalization and factorization
of the expressions occurring in the query. During normalization we introduce
a new variable for every function or operator call in the original query and
bind these variables to the according expressions. All function applications
are gathered in a define clause appended to the SFW-block. Consider the
following example query:
select distinct s.name, s.age, s.supervisor.name, s.supervisor.age
from s in Student
where s.gpa > 8 and s.supervisor .age < 30

Here, the only (explicit) function calls are attribute accesses. For each at-
tribute access a new variable holding the result is introduced. Since some at-
tributes (e.g. s.age) are accessed multiple times, these are factorized. The def-
initions of the newly introduced variables are gathered in the define clause.
Hence, the result of the normalization and factorization steps is:
select distinct sn, sa, ssn, ssa
from s in Student
where sg > 8 and ssa< 30
define sn = s.name
sg = s.gpa
sa = s.age
ss = s.supervisor
ssn = ss.name
ssa = ss.age

The next step during the NFST-phase is the semantic analysis. It works
recursively through the query. First, for every identifier in the from clause
4. Object-Oriented Database Systems 167

the type is determined. For example, the identifier Student is looked up in


the schema and is determined to be of type set(Student}. AB a result, the
variable s ranging over Student has type Student. Then the type for s.name
is determined. A lookup in the schema determines that name is an attribute
of type string. Hence, the type for s.name and hence sn becomes Student.
Similar, we compute the types for the other attribute accesses. For instance,
we discover that the type of s.supervisor and ss is Professor, a class with
extension. Basically, the part of the semantic analysis dealing with attributes,
arithmetic operators, function calls and methods is very much the same as
in usual compilers for block- and object-oriented programming languages
[WM95]. The result type of a SFWD-block is always a collection. Dependent
on the distinct specification of the query, it is either a set or a bag. If an
order clause is specified, the result type is list. The element type is deter-
mined by the type of the entry (or entries) in the select clause. The typing
rules for queries containing a group by clause have already been discussed in
subsection 3.6.

PROJECT [sn, sa, ssn, ssa]

SELECT [sg >8 and ssa <30]

EXPAND [sn:s.name, sg:s.gpa, ss:s.supervisor]

EXPAND [ssn:ss.name, ssa:ss.age]

SCAN [s:student]
Fig. 8.2. Algebraic representation of a query
168 A. Kemper and G. Moerkotte

In the last step of the NFST-phase the query is translated into some in-
ternal representation. Several such representations have been proposed. They
are based on object calculi, comprehensions, an object algebra. Here, we fa-
vor a simplified version of the object algebra. We explain the translation
process into the algebra by means of two examples. The above query serves
as our first example. Its translation into the algebra can be found in Fig. 8.2.
The expression SCAN[s: Studentj scans the extent Student and produces tu-
ples with a single attribute s successively bound to the object-identifiers of
the Students. We assume that all algebraic operators work on sets of tuples
where the attribute values may be complex and not just atomic values like in
the relational model. The EXPAND operator expands the given input tuples
by new attribute values. For example, the bottom most EXPAND opera-
tor adds the three attributes sn, sg, and ss to its input tuples. The expand
operator comes in different flavors and is also called map or materialize op-
erator [BK90,BMG93,KMP92]. The SELECT operator selects those input
tuples that satisfy the given predicate. The PROJECT operator performs a
projection for a given set of attributes.
Besides the already stated algebraic operators, we also treat a SFWD-
block as an n-ary algebraic operator. The collection-valued entries in the
from clause are considered as the input arguments of this special operator.
However, in a typical runtime system of an object-oriented database manage-
ment system there is no direct evaluation possibility for such a block. Prior
to execution these blocks are translated into "regular" algebraic expressions.
We briefly describe this translation process.
In standard relational query processing multiple entries in the from clause
are translated into a cross product. This is not always possible in object-
oriented query processing. Consider the following query:

select distinct s
from s in Student, c in s.courses
where c.name = "Database"

which after normalization yields:

select distinct s
from s in Student, c in s.courses
where cn = "Database"
define cn = c.name

The evaluation of c in s.courses is dependent on s and cannot be evaluated


if no s is given. Hence, a cross product would not make much sense. To deal
with this situation, the d-join has been introduced [CM93]. It is a binary
operator that evaluates for every input tuple from its left input its right input
and flattens the result. Consider the algebraic expression given in Fig. 8.3. For
4. Object-Oriented Database Systems 169

every student s from its left input, the d-join computes the set s.courses. For
every course c in s.courses an output tuple containing the original student s
and a single course c is produced. If the evaluation of the right argument of
the d-join is not dependent on the left argument, the d-join is equivalent with
a cross product. The first optimization is to replace d-joins by cross products
whenever possible.

PROJECT [s]

SELECT [cn=" Database"]

EXPAND [cn:c.name]

SCAN [s:student] D-JOIN [c:s.courses]


Fig. 8.3. An algebraic operator tree with ad-join

Queries with a group by clause must be translated using the unary


grouping operator GROUP which we denote by r.
It is defined as
rg;()A;!(e) = {y.Ao [g: Glly E e,
G = f({xlx E e,x.ABy.A})}
where the subscript have the following semantics: (i) g is a new attribute
that will hold the elements of the group (ii) BA is the grouping criterion for
a sequence of comparison operators B and a sequence of attribute names A,
and (iii) the function f will be applied to each group after it has been formed.
We often use some abbreviations. If the comparison operator B is equal to
"=", we don't write it. If the function f is identity, we omit it. Hence, rg;A
abbreviates rg;=A;id.
Consider the general query pattern:
select e
from Xl in Xl, ... , Xn in Xn
170 A. Kemper and G. Moerkotte

where p
group by
al: el

Let E be the translation of:


select *
from Xl in Xl, ... , Xl in Xl
where p
Then the translation of the query pattern is

where Xai:ei is the EXPAND operator (also sometimes called materialize op-
erator). We often use Greek letters for algebraic operators to make plans
more compact. The operator Xe is similar to the EXPAND operator. It eval-
uates for every input element the expression e. The results are collected in
the output. The unary grouping operator will play another major role during
unnesting nested queries (cf. Sec. 8.3). The traditional nest operator [SS86]
is a special case of unary grouping. It is equivalent to rg;=A;id.
Let us consider a small example query:
select struct(age: s.age, gpa: s.gpa, cnt: count(partition))
from s in Student
group by a: s.age
g: s.gpa
After normalization and factorization, we have:
select struct(age: sa, gpa: sg, cnt: cp)
from s in Student
define sa: s.age
sg: s.gpa
group by a: sa
g: sg
define cp: count(partition)
We had to introduce a second define clause in order to normalize those
expressions in the select clause, that can only be computed after grouping.
This applies to all expressions referring to partition. We easily see that the
entries in the group by clause of the normalized and factorized query block
can be simplified to contain only the group variables (attributes) sa and sg.
Abbreviating SCAN[x:X] by X[x], the translation into the algebra yields

q ≡ χ_{struct(age:sa, gpa:sg, cnt:cp)}(Γ_{partition;=sa,sg;count}(E))

E ≡ χ_{sg:s.gpa}(χ_{sa:s.age}(Student[s]))

The expression q can be simplified to

Π_{age:sa, gpa:sg, cnt:cp}(Γ_{cp;=sa,sg;count}(E))

where we used a special case of the PROJECT (Π) operator which includes
renaming.

8.3 Rewrite I

The goal of this phase is to rewrite the query with rules that either allow for a
more efficient evaluation of the query or that facilitate later query optimiza-
tion. The prevailing example of the first case is unnesting. Nested queries
enforce a nested loops evaluation strategy and fix certain parts of the join
order. Unnesting typically leads to plans which are orders of magnitude faster
than the nested counterparts [PHH92]. Type-based rewriting is a technique
of the second kind. After its application, the optimizer can consider a larger
search space and, hence, will most probably find better plans. However, it should
be noted that it is not always clear which of the techniques discussed in this
section occur in the rewrite phase and which in the optimization phase of an
actual optimizer implementation. Some optimizers don't even have a rewrite
phase.
The rewrite phase includes traditional optimization techniques [JK84] in-
vented in the relational context as well as optimization techniques especially
tailored for queries against object bases. The traditional optimization tech-
niques used in the query rewriting phase include pushing of the boolean con-
nector not, simplifications, introduction of transitively implied equality pred-
icates, and the introduction of indexes. The latter point subsumes not only
the introduction of traditional index structures like B-trees [BM72,Com79]
but also of more advanced index structures like ASRs [KM90], join index
hierarchies [XH94] and GMRs [KKM91]. These plans are represented by re-
placing the SCAN on extents by an according INDEX-SCAN operator.

Type-based rewriting and pointer chasing elimination. The first
rewrite technique especially tailored for the object-oriented context is type-
based rewriting. Consider again our query:
select distinct sn, ssn, ssa
from s in Student
where sg > 8 and ssa < 30
define sn = s.name
sg = s.gpa
ss = s.supervisor
ssn = ss.name
ssa = ss.age

The algebraic expression in Fig. 8.2 implies a scan of all students and a
subsequent dereferencing of the supervisor attribute in order to access the
supervisors. If not all supervisors fit into main memory, this may result in
many page accesses. Further, if there exists an index on the supervisor's age,
and the selection condition ssa < 30 is highly selective, the index should
be applied in order to retrieve only those supervisors required for answering
the query. Type-based rewriting enables this kind of optimization. For any
expression of a certain type with an associated extent, the extent is introduced
in the from clause. For our query this results in:

select distinct sn, pn, pa
from s in Student, p in Professor
where sg > 8 and pa < 30 and ss = p
define sn = s.name
sg = s.gpa
ss = s.supervisor
pn = ss.name
pa = ss.age

As a side-effect, the attribute traversal from students via supervisor to pro-
fessor is replaced by a join. Now, join-ordering allows for several new plans
that could not be investigated otherwise. For example, we could exploit the
above mentioned index to retrieve the young professors and join them with
the students having a gpa greater than 8. The corresponding plan is given in
Fig. 8.4. Turning implicit joins or pointer chasing into explicit joins which
can be freely reordered is an original query optimization technique for object-
oriented queries.
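
As a toy illustration (ours, with invented data and helper names by_pointer_chasing and by_join), the two evaluation strategies can be contrasted as follows: pointer chasing dereferences the supervisor of every qualifying student, whereas the join form first restricts the professors, e.g., via an index on age, and only then matches the students.

# Pointer chasing: dereference s.supervisor for every student with gpa > 8.
def by_pointer_chasing(students, deref):
    return [(s["name"], deref(s["supervisor"])["name"], deref(s["supervisor"])["age"])
            for s in students
            if s["gpa"] > 8 and deref(s["supervisor"])["age"] < 30]

# Explicit join: select the young professors first, then join on the OID.
def by_join(students, professors):
    young = {p["oid"]: p for p in professors if p["age"] < 30}
    return [(s["name"], young[s["supervisor"]]["name"], young[s["supervisor"]]["age"])
            for s in students
            if s["gpa"] > 8 and s["supervisor"] in young]

professors = [{"oid": 1, "name": "Smith", "age": 29},
              {"oid": 2, "name": "Jones", "age": 45}]
students = [{"name": "Ann", "gpa": 9, "supervisor": 1},
            {"name": "Bob", "gpa": 9, "supervisor": 2}]
deref = {p["oid"]: p for p in professors}.get
print(by_pointer_chasing(students, deref))   # [('Ann', 'Smith', 29)]
print(by_join(students, professors))         # same result, but freely reorderable
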
Consider the query:

select distinct p
from p in Professor
where p.room.number = 209

Straightforward evaluation of this query would scan all professors. For every
professor, the room relationship would be traversed to find the room where
the professor resides. Last, the room's number would be retrieved and tested
to be 209. Using the inverse relationship, the query could as well be rewritten
to:

select distinct r.occupiedBy
from r in Room
where r.number = 209

PROJECT [sn, pn, pa]
  JOIN [ss=p]
    left:  SELECT [sg>8] (EXPAND [sg:s.gpa, ss:s.supervisor, sn:s.name] (Student [s]))
    right: SELECT [pa<30] (EXPAND [pa:p.age, pn:p.name] (Professor [p]))

Fig. 8.4. A join replacing pointer chasing

The evaluation of this query can be much more efficient, especially if there
exists an index on the room number. Rewriting queries by exploiting in-
verse relationships is another rewrite technique to be applied during Rewrite
Phase 1.

Unnesting. Another important topic for rewriting is unnesting of queries.
Experience with unnesting shows two things: (1) it is error prone [Day87],
[GL87,Kie84,Kim82,Mur89,Mur92] and (2) it can speed up query evalua-
tion several orders of magnitude [PHH92]. The first point motivates unnest-
ing at the algebraic level since correctness proofs are easier to perform
at this level [CM93,CM95a,Ste95]. In this section we only treat some
basic unnesting techniques. More unnesting techniques can be found in
[CM93,CM95a,PHH92,Ste95].
The simplest kind of unnesting occurs for uncorrelated subqueries. For
example, in the query:
select s
from s in Student
where s.age = max(select s.age
from s in Student)

The subquery for computing the maximum age of all students is not corre-
lated to the outer query. Uncorrelated subqueries behave like constant ex-
pressions in a query. Hence, unnesting can already take place during the
normalization step in the NFST phase. For the above query, the result is:
define ma = max( select sa
from s in Student
define sa = s.age)
select s
from s in Student
where sa = ma
define sa = s.age

The define preceding the SFWD-block is then evaluated prior to the block.
However, sometimes more efficient ways to evaluate a query are possible.
According to Kim's classification of nested queries [Kim82], there are the
following types of nested queries:
• Type A nested queries have a constant inner block returning single ele-
ments.
• Type N nested queries have a constant inner block returning sets.
• Type J nested queries have an inner block that is dependent on the outer
block and returns a set.
• Type JA nested queries have an inner block that is dependent on the
outer block and returns a single element.
A second dimension of the classification of nested queries in the object-
oriented context is their location: nested queries can occur in the select,
from, and where clause. We concentrate on unnesting of queries in the
where clause. Unnesting in the select clause is treated similarly. Unnesting
nested queries in the from clause can be performed by techniques given in
[CM95a,PHH92].
Type A nested queries can be unnested by moving them one block up
(like in the example). Sometimes, more efficient ways to unnest these queries
are possible. In the example the extent of Student has to be scanned twice.
This can be avoided by introducing the new algebraic operator MAX defined
as

MAX_f(e) := {x | x ∈ e, f(x) = max_{y∈e}(f(y))}

The MAX operator can be computed in a single pass over e.
Using MAX, the above query can be expressed in the algebra as

q ≡ MAX_{s.age}(Student[s])
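
A single-pass evaluation of this operator might look like the following sketch; max_op is our name, not an operator of any particular system.

# MAX_f(e): keep all elements whose f-value equals the maximum f-value,
# computed in one pass over e.
def max_op(e, f):
    best, best_val = [], None
    for x in e:
        v = f(x)
        if best_val is None or v > best_val:
            best, best_val = [x], v
        elif v == best_val:
            best.append(x)
    return best

students = [{"name": "Ann", "age": 25}, {"name": "Bob", "age": 27}, {"name": "Eve", "age": 27}]
print(max_op(students, lambda s: s["age"]))   # all students of maximal age
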

Type N nested queries can also be unnested by moving them one block
up, since they also do not depend on their surrounding block. Again, more ef-
ficient evaluation plans are sometimes possible. We distinguish three different
kinds of predicates occurring within the outer where clause:
1. f(x) in select ...
2. not (f(x) in select ...)
3. f(x) = (⊆, ⊇, ...) select ...
where x represents variables of the outer block, f a function (or subquery)
on these variables, and =, ⊆, ⊇, ... are set comparisons.
Subsequent equivalences will be subject to constraints. To express these
constraints we need some abbreviations. We denote by F the free vari-
ables/attributes of an algebraic expression and by A the attributes in the
result of an algebraic expression. Further, we use the standard shorthands
for SELECT (σ), JOIN (⋈), EXPAND (χ), left semi-join (⋉), left outer-
join (⟕), and left anti-join (▷).
1. Type N queries with an in operator can be transformed into a semi-join by
using the following equivalence inspired by relational type N unnesting:
σ_{A1 ∈ χ_{A2}(e2)}(e1) ≡ e1 ⋉_{A1=A2} e2    (1)
if Ai ⊆ A(ei), F(e2) ∩ A(e1) = ∅
The first condition is obvious, the second merely stipulates that expres-
sion e2 must be independent of expression e1.
2. Also inspired by the relational type N unnesting is the following equiv-
alence which turns a type N query with a negated in operator into an
anti-join:
σ_{A1 ∉ χ_{A2}(e2)}(e1) ≡ e1 ▷_{A1=A2} e2    (2)
if Ai ⊆ A(ei), F(e2) ∩ A(e1) = ∅
The third case does not have a counterpart in SQL. However, if we formu-
late the corresponding queries on a relational schema using the non-standard
SQL found in [Kim82], they would be of type D - resolved by a division.
Using standard SQL, they would require a double nesting using EXISTS
operations. Unnesting Type D queries using a relational division can only
handle very specific queries where the comparison predicate corresponds, in
our context, to a non-strict inclusion. Hence, the third case is typically treated
by moving the nested query to the outer block, so that it is evaluated only
once and then rely on fast set comparison operators.
The algebraic expression for query:
select p
from p in Professor
where p.residesIn in select r
from r in Room
where r.size > 30

is:

q ≡ σ_{pr ∈ χ_r(e2)}(e1)
e1 ≡ χ_{pr:p.residesIn}(Professor[p])
e2 ≡ σ_{rs>30}(χ_{rs:r.size}(Room[r]))

and Eq. 1 can be applied. The result is

q ≡ e1 ⋉_{pr=r} e2

where we reuse expressions e1 and e2 from above.
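
For illustration, a semi-join and an anti-join over an equality predicate can be sketched as below (semi_join and anti_join are our names); Eq. 1 maps the nested in predicate to the former, Eq. 2 maps the negated variant to the latter.

# Semi-join: keep left tuples that have at least one join partner on the right.
# Anti-join: keep left tuples that have no join partner on the right.
def semi_join(left, right, a1, a2):
    keys = {r[a2] for r in right}
    return [l for l in left if l[a1] in keys]

def anti_join(left, right, a1, a2):
    keys = {r[a2] for r in right}
    return [l for l in left if l[a1] not in keys]

profs = [{"p": "Smith", "pr": "R101"}, {"p": "Jones", "pr": "R007"}]
big_rooms = [{"r": "R101", "rs": 45}]           # rooms with size > 30
print(semi_join(profs, big_rooms, "pr", "r"))   # professors residing in a big room
print(anti_join(profs, big_rooms, "pr", "r"))   # professors not residing in one
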
Contrary to Kim's unnesting technique for the relational context, type J
and JA queries are treated by the same set of equivalences in the object-
oriented context. For queries featuring an in or not in in the where clause,
the equivalences for type N queries only need slight modifications:

1. σ_{A1 ∈ χ_{A2}(σ_p(e2))}(e1) ≡ e1 ⋉_{A1=A2 ∧ p} e2    (3)
if Ai ⊆ A(ei), F(p) ⊆ A(e1 ∪ e2), F(e2) ∩ A(e1) = ∅
This equivalence is similar to the one used for type N queries. It just takes
into account a predicate p relying on both e1 and e2 (second condition).
2. σ_{A1 ∉ χ_{A2}(σ_p(e2))}(e1) ≡ e1 ▷_{A1=A2} (e2 ⋉_p e1)    (4)
if Ai ⊆ A(ei), F(p) ⊆ A(e1 ∪ e2), F(e2) ∩ A(e1) = ∅
Type J not in queries cannot be translated directly using an anti-join
operation: a semi-join has to be performed first.
Let us consider an example for the second equivalence. We retrieve all
professors that do not reside in a room that belongs to their department:
select p
from p in Professor
where p.residesln not in select r
from r in Room
where p.dept = r.belongsTo
The algebraic expression

q ≡ σ_{pr ∉ χ_r(σ_{pd=rb}(e2))}(e1)
e1 ≡ χ_{pd:p.dept}(χ_{pr:p.residesIn}(Professor[p]))
e2 ≡ χ_{rb:r.belongsTo}(Room[r])

is equivalent to the query and matches the left-hand side of Eq. 4. Hence, it
can be transformed into

e1 ▷_{pr=r} (e2 ⋉_{pd=rb} e1).
The remaining cases require the use of a unary or a binary grouping op-
erator. The unary grouping operator was introduced in Section 8.2. The binary
grouping operator is defined as

e1 Γ_{g;A1θA2;f} e2 = {y ∘ [g:G] | y ∈ e1, G = f({x | x ∈ e2, y.A1 θ x.A2})}

It takes three arguments as subscripts: g is the name of a new attribute that
must not occur in e1 or e2. A1 θ A2 is a comparison between the two sequences
of attributes A1 and A2, where A1 are attributes of e1 and A2 are attributes
of e2. The last argument f is a function that is applied to each group after
grouping has been successfully applied.
We give the three most important equivalences for unnesting using the
grouping operators. The most general equivalence is

χ_{g:f(σ_{A1 θ A2}(e2))}(e1) ≡ e1 Γ_{g;A1θA2;f} e2    (5)
if Ai ⊆ A(ei), g ∉ A1 ∪ A2, F(e2) ∩ A(e1) = ∅

There exist two other equivalences which deal more efficiently, using simple
grouping, with two special cases. The equivalence

χ_{g:f(σ_{A1=A2}(e2))}(e1) ≡ Π_{Ā2}(e1 ⟕_{A1=A2}^{g=f(∅)} Γ_{g;A2;f}(e2))    (6)
if Ai ⊆ A(ei), F(e2) ∩ A(e1) = ∅,
A1 ∩ A2 = ∅, g ∉ A(e1) ∪ A(e2)

relies on the fact that the comparison of the correlation predicate is equality.
The superscript g = f(∅) is the default value given by the left outer join to
the attribute g when there is no element in the result of the group operation
which satisfies A1 = A2 for a given element of e1.
The equivalence

π_{A1,g}(χ_{g:f(σ_{A2 θ A1}(e2))}(e1)) ≡ π_{A1:A2,g}(Γ_{g;θA2;f}(e2))    (7)
if Ai ⊆ A(ei), F(e2) ∩ A(e1) = ∅,
g ∉ A(e1) ∪ A(e2),
e1 ≡ π_{A1:A2}(e2)

relies on the fact that there exists a common range over the variables of the
correlation predicate (third condition). The expression π_{A1:A2,g} renames the
attributes A2 and projects on the attribute g. We believe that the last two
cases are more common than the general case.
Let us consider an application of the last equivalence. The query:
select struct(stud: s1, cnt: select count(*)
from s2 in Student
where s2.gpa > s1.gpa)
from s1 in Student

retrieves for every student the number of better students. It translates into
the algebra as

q ≡ π_{stud:s1, cnt:c}(χ_{c:count(σ_{s2g>s1g}(e2))}(e1))
e1 ≡ χ_{s1g:s1.gpa}(Student[s1])
e2 ≡ χ_{s2g:s2.gpa}(Student[s2])

Applying Eq. 7 yields

π_{stud:s1, cnt:c}(π_{s1:s2}(Γ_{c;>s2.gpa;count}(e2)))

which can be simplified to

π_{stud:s2, cnt:c}(Γ_{c;>s2.gpa;count}(e2)).
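
A minimal sketch of the binary grouping operator, assuming dictionary tuples and our own name group_binary, reproduces the "number of better students" example:

import operator

# Sketch of e1 Γ_{g;A1 θ A2;f} e2: every tuple of e1 is extended by attribute g
# holding f applied to the set of e2 tuples related to it via θ.
def group_binary(e1, e2, g, a1, a2, theta, f):
    e2 = list(e2)
    return [{**y, g: f([x for x in e2 if theta(y[a1], x[a2])])} for y in e1]

students = [{"s": "Ann", "gpa": 3.2}, {"s": "Bob", "gpa": 3.9}]
# count, per student, the students with a strictly higher gpa
print(group_binary(students, students, "cnt", "gpa", "gpa", operator.lt, len))
# [{'s': 'Ann', 'gpa': 3.2, 'cnt': 1}, {'s': 'Bob', 'gpa': 3.9, 'cnt': 0}]
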

Quantifier treatment. Existential quantifiers are treated by rewriting the
query. A query of the form:
select
from
where exists x in X : p

can be rewritten to:


select
from
where exists(select x
from x in X
where p)

Here, exists denotes the test for non-emptiness, as in SQL. Unnesting now pro-
ceeds by the technique given in [PHH92].
The general template for a query with a universal quantifier is:
select el
from el in El
where for all e2 in select e2
from e2 in E2
where p: q
The predicate p is called the range predicate and q is called the quantifier predicate.
Both of them possibly refer to el and/or e2. This results in 16 different cases.
All but three of them are rather trivial [CKM+97a,CKM+97b]. The more
complex cases give rise to three classes:

1. p(e2), q(e1, e2)
The range predicate depends on e2 only and the quantifier predicate
depends on both e1 and e2.
2. p(e1, e2), q(e2)
The range predicate refers to e1 and e2, and the quantifier predicate refers
to e2.
3. p(e1, e2), q(e1, e2)
The range predicate and the quantifier predicate refer to e1 and e2.

Although there exist several alternative treatments of universal quantifiers
for these cases, those using the anti-join are typically the most efficient
[CKM+97a,CKM+97b]. Depending on the class, the query template can be
rewritten into one of the following algebraic expressions:

• Class 1: E1[e1] ▷_{¬q(e1,e2)} σ_{p(e2)}(E2[e2])
• Class 2: E1[e1] ▷_{p(e1,e2)} σ_{¬q(e2)}(E2[e2])
• Class 3: E1[e1] ▷_{p(e1,e2) ∧ ¬q(e1,e2)} E2[e2]
These algebraic expressions are equivalent to the original query template
only if the attributes e1 form a superkey for the elements in E1, i.e., E1[e1]
contains no duplicates. This constraint is trivially true if E1 is an extension
and e1 is bound to the key. If this constraint is not satisfied, other techniques
must be applied [CKM+97a,CKM+97b].
The students having passed all database courses can be retrieved by:

select s.name
from s in Student
where for all c in (select c
from c in Course
where c.name like "%database%"):
c in s.coursesPassed

This query is of Class 1 and translates into

χ_{s.name}(χ_{sc:s.coursesPassed}(Student[s])
  ▷_{c ∉ sc} σ_{cn like "%database%"}(χ_{cn:c.name}(Course[c])))


Departments with no full professor are retrieved by:

select d.name
from d in Department
where for all p in (select p
from p in Professor
where p.dept = d):
p.status != "full professor"

Belonging to Class 2, the query translates into

χ_{d.name}(Department[d]
  ▷_{pd=d} σ_{ps="full professor"}(χ_{pd:p.dept}(χ_{ps:p.status}(Professor[p]))))

Departments attracting all students in their city are specified by the
query:

select d.name
from d in Department
where for all s in (select s
from s in Student
where s.city = d.city):
s.dept = d

The algebraic expression

χ_{d.name}(χ_{dc:d.city}(Department[d])
  ▷_{dc=sc ∧ d≠sd} χ_{sc:s.city}(χ_{sd:s.dept}(Student[s])))

computes these departments.
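
The anti-join treatment of universal quantification can be illustrated with a small sketch (ours; forall_via_antijoin is our name, and the quantifier predicate compares department names instead of object identity): an element of E1 qualifies iff no element of E2 satisfies the range predicate while violating the quantifier predicate.

# Universal quantification via anti-join: keep e1-elements for which no
# e2-element satisfies the range predicate p while violating the quantifier
# predicate q.
def forall_via_antijoin(e1, e2, p, q):
    e2 = list(e2)
    return [x for x in e1 if not any(p(x, y) and not q(x, y) for y in e2)]

departments = [{"name": "CS", "city": "Passau"}, {"name": "EE", "city": "Passau"}]
students = [{"dept": "CS", "city": "Passau"}]
# departments attracting all students in their city (the Class 3 example)
print(forall_via_antijoin(departments, students,
                          p=lambda d, s: s["city"] == d["city"],
                          q=lambda d, s: s["dept"] == d["name"]))
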

Semantic query rewrite. Semantic query rewrite exploits knowledge (se-
mantic information) about the content of the object base. This knowledge is
typically specified by the user. We already saw one example of user-supplied
information: inverse relationships, which can be exploited for more efficient
query evaluation.
Another important piece of information is knowledge about keys. In con-
junction with type inference, this information can be used during query
rewrite to speed up query execution. A typical example is the following query:

select distinct *
from Professor pI, Professor p2
where p1.university.name = p2.university.name

By type inference, we can conclude that the expressions p1.university and
p2.university are of type University. If we further knew that the names of
universities were unique, that is, that the name is a candidate key for universities,
then the query could be simplified to:

select distinct *
from Professor pI, Professor p2
where p1.university = p2.university

Evaluating this query no longer necessitates accessing the universities to
retrieve their names.
Some systems consider even more general knowledge in the form of equiva-
lences holding over user-defined functions [AF95,Flo96]. These equivalences
are then used to rewrite the query. Thereby, alternatives are generated, all of
which are subsequently optimized.

8.4 Query Optimization

All query optimization is based on algebraic equivalences. Algebraic equiva-
lences allow us to express a query via different algebraic expressions. These
expressions are equivalent to the original query but can exhibit vastly differ-
ent costs. The standard algebraic equivalences for SELECT (σ) and JOIN
(⋈) include

σ_{p1 ∧ p2}(e) ≡ σ_{p1}(σ_{p2}(e))    (8)
σ_{p1}(σ_{p2}(e)) ≡ σ_{p2}(σ_{p1}(e))    (9)
σ_{p1}(e1 ⋈_{p2} e2) ≡ σ_{p1}(e1) ⋈_{p2} e2    (10)
e1 ⋈_{p1} e2 ≡ e2 ⋈_{p1} e1    (11)
(e1 ⋈_{p1} e2) ⋈_{p2} e3 ≡ e1 ⋈_{p1} (e2 ⋈_{p2} e3)    (12)
where e and ei are algebraic expressions and P and Pi are predicates. Some
of these algebraic equivalences always hold, some can only be used if certain
conditions are satisfied. These conditions require that the consumer/producer
relationship must not be disturbed by the equivalence. Take for example
Equivalence 10. The selection predicate p1 uses (consumes) certain attributes.
These attributes must all be available (produced by) expression e1. Other-
wise, pushing σ_{p1} inside the join is not valid.
Given the above set of equivalences, it becomes clear that joins and selec-
tions can be freely reordered as long as the consumer/producer relationship is
not disturbed. This fact is exploited by specialized dynamic programming al-
gorithms that generate optimal plans for (sub-) queries involving either joins
only [SAC+79], or joins and selections [CS97,SM98]. Traditionally, selections
are pushed and cross products avoided [SAC+79]. The rationale for pushing
selections is that they are typically cheap in the relational context and dimin-
ish the size of the inputs for subsequent joins. In the object-oriented context,
user-defined functions and predicates may occur which may exhibit a consid-
erable run time. As a consequence, pushing these expensive selections past a
join is not always the best thing to do [HS93]. Even for single extent queries,
a careful ordering of predicates can yield high performance gains [KMS92].
The reason for abandoning cross products is that this reduces the search
space. The rationale behind this is that plans with cross products are consid-
ered expensive since cross products are typically expensive. However, lately
it became apparent that some real-world queries involve small relations

[OL90]. For these queries, a plan containing a cross product of small rela-
tions is often superior to those plans without cross products. Hence, newer
dynamic programming algorithms consider cross products as well.
One such algorithm that generates plans with cross products, selections,
and joins is given in Figure 8.5. The algorithm is described in pseudo code. It
generates optimal bushy trees - that is, plans where both join partners can be
intermediate relations. Efficient implementation techniques for the algorithm
can be found in [SM98]. As input parameters, the algorithm takes a set of
relations R and a set of predicates P. The set of relations for which a selection
predicate exists is denoted by R_S. We identify relations and predicates that
apply to these relations. For all subsets M_k of the relations and subsets P_l
of the predicates, an optimal plan is constructed and entered into the table
T. The loops range over all M_k and P_l. Thereby, the set M_k is split into two
disjoint subsets L and L', and the set P_l is split into three parts (line 7).
The first part (V) contains those predicates that apply to relations in L only.
The second part (V') contains those predicates that apply to relations in L'
only. The third part (p) is a conjunction of all the join predicates connecting
relations in L and L' (line 8). Line 9 constructs a plan by joining the two
plans found for the pairs [L, V] and [L', V'] in the table T. If this plan has so
far the best costs, it is memorized in the table (lines 10-12). Last, different
possibilities of not pushing predicates in P_l are investigated (lines 15-19).
For queries against object-oriented databases, the third major operator
is the EXPAND operator (χ). The following equivalences show that the EX-
PAND operator is also freely reorderable with selections and joins:

χ_{a:e}(σ_{p1}(e1)) ≡ σ_{p1}(χ_{a:e}(e1))    (13)
χ_{a:e}(e1 ⋈_{p1} e2) ≡ χ_{a:e}(e1) ⋈_{p1} e2    (14)
Again, these equivalences only hold if the consumer/producer relationship is
not disturbed.

Class hierarchies. Another set of equivalences known from the relational
context involves the UNION operator (∪) and plays a vital role in dealing
with class/extent hierarchies. Consider the simple class hierarchy given in
Figure 8.6. Obviously, for the user, it must appear that the extent of Em-
ployee contains all Managers. However, the system has different alternatives
to implement extents. Most OBMSs organize an object base into areas or
volumes. Each area or volume is then further organized into several files. A
file is a logical grouping of objects not necessarily consisting of subsequent
physical pages on disk. Files don't share pages.
The simplest possible implementation to scan all objects belonging to a
certain extent is to perform an area scan and select those objects belonging
to the extent in question. Obviously, this is far too expensive. Therefore, some
more sophisticated possibilities to realize extents and scans over them are
needed. The different possible implementations can be classified along two

proc Optimal-Bushy-Tree(R, P)

1   for k = 1 to n do
2     for all k-subsets M_k of R do
3       for i = 0 to min(k, m) do
4         for all i-subsets P_l of M_k ∩ R_S do
5           best_cost_so_far = ∞;
6           for all subsets L of M_k with 0 < |L| < k do
7             L' = M_k \ L; V = P_l ∩ L; V' = P_l ∩ L';
8             p = ∧{p_{i,j} | p_{i,j} ∈ P, R_i ∈ L, R_j ∈ L'};  // p = true possible
9             T = (T[L, V] ⋈_p T[L', V']);
10            if Cost(T) < best_cost_so_far then
11              best_cost_so_far = Cost(T);
12              T[M_k, P_l] = T;
13            fi;
14          od;
15          for all R ∈ P_l do
16            T = σ_R(T[M_k, P_l \ {R}]);
17            if Cost(T) < best_cost_so_far then
18              best_cost_so_far = Cost(T);
19              T[M_k, P_l] = T;
20            fi;
21          od;
22        od;
23      od;
24    od;
25  od;
26  return T[R, R_S];

Fig. 8.5. A dynamic programming optimization algorithm

dimensions. The first dimension distinguishes between logical and physical
extents, the second distinguishes between strict and (non-strict) extents.

Logical vs. physical extents. An extent can be realized as a collection of ob-
ject identifiers. A scan over the extent is then implemented by a scan over
all the object identifiers contained in the collection. Subsequently, the object
identifiers are dereferenced to yield the objects themselves. This approach
leads to logical extents. Another possibility is to implement extent member-
ship by physical containment. The best alternative is to store all objects of
an extent in a file. This results in physical extents. A scan over a physical
extent is then implemented by a file scan.

Extents vs. strict extents. A strict extent contains the objects (or their OIDs)
of a class excluding those of its subclasses. A non-strict extent contains the
objects of a class and all objects of its subclasses.

Employee (name: string, salary: int, boss: Manager)
  |
Manager (boss: CEO)
  |
CEO

Fig. 8.6. A sample class hierarchy

Given a class C, any strict extent of a subclass C' of C is called a subextent
of C.
Obviously, the two classifications are orthogonal. Applying them both
results in the four possibilities presented graphically in Fig. 8.7. [CM95b]
strongly argues that strict extents are the method of choice. The reason is
that only in this way can the query optimizer exploit differences between extents.
For example, there might be an index on the age of Manager but not for
Employee. This difference can only be exploited for a query including a re-
striction on age if we have strict extents.
However, strict extents result in initial query plans including UNION
operators. Consider the query:
select e
from e in Employee
where e.salary> 100.000
The initial plan is
σ_{sa>100.000}(χ_{sa:x.salary}((Employee[x] ∪ Manager[x]) ∪ CEO[x]))
Hence, algebraic equivalences are needed to reorder UNION operators with
other algebraic operators. The most important equivalences are
e1 ∪ e2 ≡ e2 ∪ e1    (15)
e1 ∪ (e2 ∪ e3) ≡ (e1 ∪ e2) ∪ e3    (16)
σ_p(e1 ∪ e2) ≡ σ_p(e1) ∪ σ_p(e2)    (17)
χ_{a:e}(e1 ∪ e2) ≡ χ_{a:e}(e1) ∪ χ_{a:e}(e2)    (18)
(e1 ∪ e2) ⋈_p e3 ≡ (e1 ⋈_p e3) ∪ (e2 ⋈_p e3)    (19)
Equivalences containing the UNION operator sometimes involve tricky typing
constraints. These go beyond the current chapter and the reader is referred
to [MZD94].
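
A tiny sketch (ours, with invented data) of why Equivalence 17 matters for strict extents: the selection is pushed into each subextent scan, so every branch can use its own access path, e.g., an index that exists for one subextent only.

# Strict extents store only the direct instances of each class; a query against
# Employee scans the union of all subextents.  Pushing the selection into each
# branch (Equivalence 17) evaluates it per subextent.
employee = [{"name": "Peter", "salary": 20_000}]
manager  = [{"name": "Paul",  "salary": 100_000}]
ceo      = [{"name": "May",   "salary": 500_000}]

def select(pred, extent):
    return [t for t in extent if pred(t)]

pred = lambda t: t["salary"] > 100_000
# σ_p(Employee ∪ Manager ∪ CEO)  ≡  σ_p(Employee) ∪ σ_p(Manager) ∪ σ_p(CEO)
print(select(pred, employee) + select(pred, manager) + select(pred, ceo))
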

Logical, strict extents:
  Employee: {e1, e2, ...}   Manager: {m1, ...}   CEO: {c1, ...}
Logical, non-strict extents:
  Employee': {e1, e2, ..., m1, ..., c1, ...}   Manager': {m1, ..., c1, ...}   CEO': {c1, ...}
Physical, strict extents:
  Employee: {e1: [name: Peter, salary: 20.000, boss: m1],
             e2: [name: Mary, salary: 21.000, boss: m1], ...}
  Manager:  {m1: [name: Paul, salary: 100.000, boss: c1], ...}
  CEO:      {c1: [name: May, salary: 500.000, boss: c1], ...}
Physical, non-strict extents:
  Employee': {e1: [name: Peter, salary: 20.000, boss: m1],
              e2: [name: Mary, salary: 21.000, boss: m1],
              m1: [name: Paul, salary: 100.000, boss: c1],
              c1: [name: May, salary: 500.000, boss: c1], ...}
  Manager':  {m1: [name: Paul, salary: 100.000, boss: c1],
              c1: [name: May, salary: 500.000, boss: c1], ...}
  CEO':      {c1: [name: May, salary: 500.000, boss: c1], ...}

Fig. 8.7. Implementation of extents

Disjunction. All optimization techniques discussed so far neglect the prob-
lem of disjunctions. If a disjunction occurs in a query, then two traditional
solutions exist. The query is normalized into conjunctive normal form, or
into disjunctive normal form. Since the latter requires subtle duplicate han-
dling mechanisms, the former alternative is the method of choice. How-
ever, plans resulting from disjunctive normal form can be more efficient
[KMP+94,SPM+95].
Both methods have two main disadvantages. First, computing the normal
form consumes exponential time and space. Second, it cannot readily prevent
expensive predicates or function calls from being evaluated several times.
An alternative possibility to deal with queries containing disjunction is
the so-called bypass technique. It does not rely on any normal form but
instead introduces four new variants for selection and join. Besides the stan-
dard result, they produce a second output stream. The selection operator for
example produces one output stream (the positive stream) containing input
elements for which the selection predicate evaluates to true and one output
stream (the negative stream) containing input elements for which the selec-
tion predicate evaluates to false. For the join, several alternatives exist to
produce more than a single output. The most important one produces one
standard output stream containing the concatenation of joining tuples from
both input streams. The other output then contains those tuples from the left
186 A. Kemper and G. Moerkotte

input stream which do not have a join partner in the right stream. Generation
of bypass plans is beyond the scope of this chapter and the reader is referred
to the literature [KMP+94,SPM+95].
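
As an illustration of the idea only (not of any particular system's operators), a bypass selection producing a positive and a negative output stream can be sketched as follows; bypass_select is our name.

# Bypass selection: one operator, two output streams.  Elements satisfying the
# predicate go to the positive stream, the others to the negative stream, so an
# expensive predicate is evaluated at most once per element.
def bypass_select(inp, predicate):
    positive, negative = [], []
    for t in inp:
        (positive if predicate(t) else negative).append(t)
    return positive, negative

rows = [{"x": 1}, {"x": 5}, {"x": 9}]
pos, neg = bypass_select(rows, lambda t: t["x"] > 4)
print(pos, neg)   # [{'x': 5}, {'x': 9}] [{'x': 1}]
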

8.5 Rewrite II and Code Generation

We assume that the algebra is implemented in the runtime system by iterators
[Gra93]. Using this concept, next calls to iterators can be saved by small
rewritings taking place in phase Rewrite II. Typical equivalences applied
are

σ_p(σ_q(e)) ≡ σ_{p∧q}(e)
σ_p(e1 ⋈_q e2) ≡ e1 ⋈_{q∧p} e2

Similar equivalences exist for projections which are also pushed down dur-
ing the Rewrite II phase. Another major performance improvement can be
achieved by factorizing common algebraic subexpressions [CD92].
The code generation phase heavily depends on the runtime system. The
relatively fixed part is the translation of the algebraic operators within the
query evaluation plan. They are translated into according iterators. The flex-
ibility concerns the translation of the subscripts, for example the selection
predicates. Three alternatives exist. First, they can be translated directly into
machine code. This approach is rather efficient but makes code generation
dependent on the underlying hardware. The second alternative is to interpret
the expressions. This is easiest to implement and machine independent but
also less efficient. The third alternative is a compromise. The query evalua-
tion plan is translated into an interpreted code similar to machine code. The
generated code is then executed by a virtual machine. This guar-
antees hardware independence and a performance between the other two
alternatives.

9 Conclusion

We pointed out the advantages of object-oriented databases compared to
relational databases. The most important ones are more expressive modeling
constructs and no impedance mismatch. Likewise, we showed that many new
and interesting techniques exist to make efficient implementations of object-
oriented databases. Although object-oriented databases have not been very
successful commercially, the ideas live on in object-relational databases and
XML databases.

References

[AF95] Aberer, K., Fischer, G., Semantic query optimization for methods in
object-oriented database systems, Proc. IEEE Conf. on Data Engi-
neering, 1995, 70-79.
[AG96] Arnold, K., Gosling, J., The Java programming language, Addison-
Wesley, Reading, MA, USA, 1996.
[AL80] Adiba, M.E., Lindsay, B.G., Database snapshots, Proc. 6th Interna-
tional Conference on Very Large Data Bases (VLDB), 1980, 86-91.
[BCK98] Braumandl, R, Claussen, J., Kemper, A., Evaluating functional joins
along nested reference sets in object-relational and object-oriented
databases, Proc. 24th International Conference on Very Large Data
Bases (VLDB) , 1998, 110--122.
[BCL89] Blakeley, J.A., Coburn, N., Larson, p.-A., Updating derived relations:
detecting irrelevant and autonomously computable updates, ACM
Trans. on Database Systems 14(3), 1989, 369-400.
[Bil92] Biliris, A., The performance of three database storage structures for
managing large objects, Proc. ACM SIGMOD Conf. on Management
of Data, 1992, 276-285.
[BK89] Bertino, E., Kim, W., Indexing techniques for queries on nested ob-
jects, IEEE Trans. Knowledge and Data Engineering 1(2), 1989,
196-214.
[BK90] Beeri, C., Kornatzky, Y., Algebraic optimization of object-oriented
query languages, S. Abiteboul, P.C. Kanellakis (eds.), Lecture Notes
in Computer Science 470, 3rd International Conference on Database
Theory (ICDT'90), Springer-Verlag, Berlin, 1990, 72-88.
[BLT86] Blakeley, J.A., Larson, P.-A., Tompa, F.W., Efficiently updating
materialized views, Proc. ACM SIGMOD Conf. on Management of
Data, 1986, 61-71.
[BM72] Bayer, R, McCreight, E., Organization and maintenance of large
ordered indices, Acta Informatica 1(4), 1972,290--306.
[BMG93] Blakeley, J., McKenna, W., Graefe, G., Experiences building the
Open OODB query optimizer, Proc. ACM SIGMOD Conf. on Man-
agement of Data, 1993, 287-295.
[Bo094] Booch, G., Object-oriented analysis and design, Benjamin/Cum-
mings, Redwood City, CA, USA, 1994.
[BP95] Biliris, A., Panagos, E., A high performance configurable storage
manager, Proc. IEEE Conf. on Data Engineering, 1995, 35-43.
[Cat94] Cattell RG.G. (ed.), Object database standard, Morgan Kaufmann
Publishers, San Mateo, CA, USA, 1994.
[CBB+97] Cattell, R., Barry, D., Bartels, D., Berler, M., Eastman, J., Gamer-
man, S., Jordan, D., Springer, A., Strickland, H., Wade D., The
object database standard: ODMG 2.0, The Morgan Kaufmann Series
in Data Management Systems, Morgan Kaufmann Publishers, San
Mateo, CA, USA, 1997.
[CD92] Cluet, S., Delobel, C., A general framework for the optimization of
object-oriented queries, Proc. ACM SIGMOD Conf. on Management
of Data, 1992, 383-392.

[CDF+94] Carey, M.J., DeWitt, D.J., Franklin, M.J., Hall, N.E., McAuliffe,
M.L., Naughton, J.F., Schuh, D.T., Solomon, M.H., Tan, C.K., Tsa-
talos, O.G., White, S.J,. Zwilling, M.J., Shoring up persistent appli-
cations, Proc. ACM SIGMOD Conf. on Management of Data, 1994,
383-394.
[CDR+86] Carey, M., DeWitt, D., Richardson, J., Shekita, E., Object and file
management in the EXODUS extensible database system, Proc. 12th
International Conference on Very Large Data Bases (VLDB), 1986,
91-100.
[CDV88] Carey, M.J., DeWitt, D.J., Vandenberg, S.L., A data model and query
language for EXODUS, Proc. ACM SIGMOD Conf. on Management
of Data, 1988, 413-423.
[CKM+97a] Claussen, J., Kemper, A., Moerkotte, G., Peithner, K., Optimizing
queries with universal quantification in object-oriented and object-
relational databases, Proc. 23rd International Conference on Very
Large Data Bases (VLDB), 1997, 286-295.
[CKM+97b] Claussen, J., Kemper, A., Moerkotte, G., Peithner, K., Optimizing
queries with universal quantification in object-oriented and object-
relational databases, Technical Report MIP-9706, University of Pas-
sau, Fak. f. Mathematik u. Informatik, 1997.
[CM93] Cluet, S., Moerkotte, G., Nested queries in object bases, Proc. 4th In-
ternational Workshop on Database Programming Languages - Object
Models and Languages, 1993, 226--242.
[CM95a] Cluet, S., Moerkotte, G., Classification and optimization of nested
queries in object bases, Technical Report 95-6, RWTH Aachen, 1995.
[CM95b] Cluet, S., Moerkotte, G., Query optimization techniques exploiting
class hierarchies, Technical Report 95-7, RWTH Aachen, 1995.
[Com79] Comer, D., The ubiquitous B-tree, ACM Computing Surveys 11(2),
1979, 121-137.
[CS97] Chaudhuri, S., Shim, K., Optimization of queries with user-defined
predicates, Technical Report, Microsoft Research, Advanced Technol-
ogy Division, One Microsoft Way, Redmond, WA 98052, USA, 1997.
[Day87] Dayal, U., Of nests and trees: a unified approach to processing queries
that contain nested subqueries, aggregates, and quantifiers, Proc. 13th
International Conference on Very Large Data Bases (VLDB), 1987,
197-208.
[Dep86] Deppisch, U., S-tree: a dynamic balanced signature index for office
retrieval, Proc. 9th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval (SIGIR'86),
1986, 77-87.
[EGK95] Eickler, A., Gerlhof, C.A., Kossmann, D., A performance evaluation
of OID mapping techniques, Proc. 21st International Conference on
Very Large Data Bases (VLDB), 1995, 18-29.
[FJK96] Franklin, M.J., Jonsson, B., Kossmann, D., Performance tradeoffs
for client-server query processing, Proc. ACM SIGMOD Conf. on
Management of Data, 1996, 149--160.
[Flo96] Florescu, D., Espaces de recherche pour l'optimisation de requetes
objet (Search spaces for query optimization), PhD thesis, Universite
de Paris VI, 1996.

[FNP+79] Fagin, R., Nievergelt, J., Pippenger, N., Strong, H., Extendible hash-
ing - a fast access method for dynamic files, ACM Trans. on Database
Systems 4(3), 1979, 315-344.
[Fra96] Franklin, M., Client data caching: a foundation, Kluwer Academic
Press, 1996.
[GKK+93] Gerlhof, C.A., Kemper, A., Kilger, C., Moerkotte, G., Partition-
based clustering in object bases: from theory to practice, Lecture
Notes in Computer Science 730, Intl. Conf. on Foundations of Data
Organization and Algorithms (FODO), Springer-Verlag, Berlin, 1993,
301-316.
[GKM96] Gerlhof, C.A., Kemper, A., Moerkotte, G., On the cost of monitoring
and reorganization of object bases for clustering, ACM SIGMOD
Record 25(3), 1996, 28-33.
[GL87] Ganski, R.A., Long, H.K.T., Optimization of nested SQL queries
revisited, Proc. ACM SIGMOD Conf. on Management of Data, 1987,
22-33.
[Gra93] Graefe, G., Query evaluation techniques for large databases, ACM
Computing Surveys 25(2), 1993,73-170.
[Gut84] Guttman, A., R-trees: a dynamic index structure for spatial search-
ing, Proc. ACM SIGMOD Conf. on Management of Data, 1984,
47-57.
[Han87] Hanson, E., A performance analysis of view materialization strategies,
Proc. ACM SIGMOD Conf. on Management of Data, 1987, 440-453.
[Han88] Hanson, E., Processing queries against database procedures: a perfor-
mance analysis, Proc. 1988 ACM SIGMOD International Conference
on Management of Data, 1988, 295-302.
[Här78] Härder, T., Implementing a generalized access path structure for a
relational database system, ACM Trans. on Database Systems 3(3),
1978, 285-298.
[HM96] Helmer, S., Moerkotte, G., Evaluation of main memory join algo-
rithms for joins with set comparison join predicates, Technical Report
13/96, University of Mannheim, Mannheim, Germany, 1996.
[HM97] Helmer, S., Moerkotte, G., Evaluation of main memory join algo-
rithms for joins with set comparison join predicates, Proc. 23rd In-
ternational Conference on Very Large Data Bases (VLDB), 1997,
386-395.
[HP94] Hellerstein, J., Pfeffer, A., The RD-tree: an index structure for sets,
Technical Report 1252, University of Wisconsin, Madison, Wisconsin,
1994.
[HS93] Hellerstein, J.M., Stonebraker, M., Predicate migration: optimizing
queries with expensive predicates, Proc. ACM SIGMOD Conf. on
Management of Data, 1993,267-276.
[IK093] Ishikawa, Y., Kitagawa, H., Ohbo, N., Evaluation of signature files
as a set access facility in OODBMS, Proc. ACM SIGMOD Conf. on
Management of Data, 1993, 247-256.
[Ita93] Itasca Systems Inc., Technical summary for Release 2.2, Itasca Sys-
tems, Inc., USA, 1993.
[Jhi88] Jhingran, A., A performance study of query optimization algorithms
on a database system supporting procedures, Proc. 14th International
Conference on Very Large Data Bases (VLDB), 1988, 88-99.

[JK84] Jarke, M., Koch, J., Query optimization in database systems, ACM
Computing Surveys 16(2), 1984, 111-152.
[KC86] Khoshafian, S.N., Copeland, G.P., Object identity, Proc. ACM Conf.
on Object-Oriented Programming Systems and Languages (OOP-
SLA), 1986, 408-416.
[KD91] Keßler, U., Dadam, P., Auswertung komplexer Anfragen an hier-
archisch strukturierte Objekte mittels Pfadindexen, Proc. der GI-
Fachtagung Datenbanksysteme für Büro, Technik und Wissenschaft
(BTW), Informatik-Fachberichte No. 270, Springer-Verlag, 1991,
218-237.
[Kie84] Kiessling, W., SQL-like and Quel-like correlation queries with ag-
gregates revisited, ERL/UCB Memo 84/75, University of Berkeley,
1984.
[Kim82] Kim, W., On optimizing an SQL-like nested query, ACM Trans. on
Database Systems 7(3), 1982, 443-469.
[Kim89] Kim, W., A model of queries for object-oriented databases, Proc.
15th International Conference on Very Large Data Bases (VLDB),
1989, 423-432.
[KK94] Kemper, A., Kossmann, D., Dual-buffering strategies in object
bases, Proc. 20th International Conference on Very Large Data Bases
(VLDB), 1994,427-438.
[KK95] Kemper, A., Kossmann, D., Adaptable pointer swizzling strategies
in object bases: design, realization, and quantitative analysis, The
VLDB Journal 4(3), 1995, 519-566.
[KKD89] Kim, W., Kim, K.C., Dale, A., Indexing techniques for object-
oriented databases, W. Kim, F.H. Lochovsky (eds.), Object-Oriented
Concepts, Databases, and Applications, Addison-Wesley, 1989, 371-
394.
[KKM90] Kemper, A., Kilger, C., Moerkotte, G., Materialization of functions
in object bases: design, realization, and evaluation, Technical Re-
port 28/90, Fakultät für Informatik, Universität Karlsruhe, Karls-
ruhe, 1990.
[KKM91] Kemper, A., Kilger, C., Moerkotte, G., Function materialization in
object bases, Proc. ACM SIGMOD Conf. on Management of Data,
1991, 258-268.
[KKM94] Kemper, A., Kilger, C., Moerkotte, G., Function materialization in
object bases: design, implementation and assessment, IEEE Trans.
Knowledge and Data Engineering 6(4), 1994, 587-608.
[KL70] Kernighan, B., Lin, S., An efficient heuristic procedure for partition-
ing graphs, Bell System Technical Journal 49(2), 1970, 291-307.
[KM90] Kemper, A., Moerkotte, G., Access support in object bases, Proc.
ACM SIGMOD Conf. on Management of Data, 1990,364-374.
[KM92] Kemper, A., Moerkotte, G., Access support relations: an indexing
method for object bases, Information Systems 17(2), 1992, 117-146.
[KM94] Kilger, C., Moerkotte, G., Indexing multiple sets, Proc. 20th Interna-
tional Conference on Very Large Data Bases (VLDB), 1994, 180-191.
[KMP92] Kemper, A., Moerkotte, G., Peithner, K., Object-orientation axioma-
tised by dynamic logic, Technical Report #92-30, RWTH Aachen,
Germany, 1992.

[KMP+94] Kemper, A., Moerkotte, G., Peithner, K., Steinbrunn, M., Optimizing
disjunctive queries with expensive predicates, Proc. ACM SIGMOD
International Conference on Management of Data, 1994, 336-347.
[KMS92] Kemper, A., Moerkotte, G., Steinbrunn, M., Optimization of Boolean
expressions in object bases, Proc. 18th International Conference on
Very Large Data Bases (VLDB), 1992, 79-90.
[Kru56] Kruskal, J.B., On the shortest spanning subtree of a graph and the
travelling salesman problem, Proc. Amer. Math. Soc. 7, 1956, 48-50.
[LL89] Lehman, T.J., Lindsay, B.G., The Starburst long field manager, Proc.
15th International Conference on Very Large Data Bases (VLDB),
1989, 375-383.
[LLO+91] Lamb, C., Landis, G., Orenstein, J., Weinreb, D., The ObjectStore
database system, Communications of the ACM 34(10), 1991, 50-63.
[LMB97] Leverenz, L., Mateosian, R, Bobrowski, S., Oracle8 Server - concepts
manual, Oracle Corporation, Redwood Shores, CA, USA, 1997.
[LOL92] Low, C.C., Ooi, B.C., Lu, H., H-trees: a dynamic associative search
index for OODB, Proc. ACM SIGMOD Conf. on Management of
Data, 1992, 134-143.
[Lum70] Lum, V.Y., Multi-attribute retrieval with combined indexes, Com-
munications of the ACM 13, 1970, 660-665.
[MS86] Maier, D., Stein, J., Indexing in an object-oriented DBMS, K.R Dit-
trich, U. Dayal (eds.), Proc. IEEE Intl. Workshop on Object-Oriented
Database Systems, IEEE Computer Society Press, 1986, 171-182.
[MS93] Melton, J., Simon, A., Understanding the new SQL: a complete guide,
Morgan Kaufmann, San Mateo, California, 1993.
[Mur89] Muralikrishna, M., Optimization and dataflow algorithms for nested
tree queries, Proc. 15th International Conference on Very Large Data
Bases (VLDB), 1989, 77-85.
[Mur92] Muralikrishna, M., Improved unnesting algorithms for join aggregate
SQL queries, Proc. 18th International Conference on Very Large Data
Bases (VLDB), 1992, 91-102.
[MZD94] Mitchell, G., Zdonik, S., Dayal, U., Optimization of object-oriented
queries: problems and applications, A. Dogac, M.T. Ozsu, A.
Biliris, T. Sellis (eds.), Advances in Object-Oriented Database Sys-
tems, NATO ASI Series F: Computer and Systems Sciences, vol. 130,
Springer-Verlag, Berlin, 1994, 119-146.
[NHS84] Nievergelt, J., Hinterberger, H., Sevcik, K.C., The grid file: an adapt-
able, symmetric multikey file structure, ACM Trans. on Database
Systems 9(1), 1984, 38-71.
[O2T94] O2 Technology, Versailles Cedex, France, A technical overview of the
O2 system, 1994.
[OL90] Ono, K., Lohman, G.M., Measuring the complexity of join enumer-
ation in query optimization, Proc. 16th International Conference on
Very Large Data Bases (VLDB), 1990, 314-325.
[PHH92] Pirahesh, H., Hellerstein, J., Hasan, W., Extensible/rule-based query
rewrite optimization in Starburst, Proc. ACM SIGMOD Conf. on
Management of Data, 1992, 39-48.
[SAC+79] Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price,
T.G., Access path selection in a relational database management

system, Proc. ACM SIGMOD Conf. on Management of Data, 1979,


23-34.
[SAH87] Stonebraker, M., Anton, J., Hanson, E., Extending a database system
with procedures, ACM Trans. on Database Systems 12(3), 1987, 350-
376.
[SC89] Shekita, E., Carey, M. J., Performance enhancement through repli-
cation in an object-oriented DBMS, Proc. ACM SIGMOD Conf. on
Management of Data, 1989, 325-336.
[Sel88] Sellis, T.K., Intelligent caching and indexing techniques for relational
database systems, Information Systems 13(2), 1988, 175-186.
[SJG+90] Stonebraker, M., Jhingran, A., Goh, J., Potamianos, S., On rules,
procedures, caching and views in data base systems, Proc. ACM
SIGMOD Conf. on Management of Data, 1990, 281-290.
[SKW92] Singhal, V., Kakkad, S., Wilson, P., Texas: an efficient, portable per-
sistent store, A. Albano, R. Morrison (eds.), Persistent Object Sys-
tems, 5th Intl. Workshop on Persistent Object Systems, Workshops
in Computing, Springer-Verlag, New York, Berlin, 1992, 11-33.
[SM98] Scheufele, W., Moerkotte, G., Efficient dynamic programming al-
gorithms for ordering expensive joins and selections, H.-J. Schek, F.
Saltor, I. Ramos, G. Alonso (eds.), Advances in Database Technology,
Lecture Notes in Computer Science 1377, 6th International Confer-
ence on Extending Database Technology (EDBT'98), Springer-Verlag,
Berlin, 1998, 201-215.
[SPM+95] Steinbrunn, M., Peithner, K., Moerkotte, G., Kemper, A., Bypassing
joins in disjunctive queries, Proc. 21th International Conference on
Very Large Data Bases (VLDB), 1995, 228-238.
[SR86] Stonebraker, M., Rowe, L., The design of POSTGRES, Proc. ACM
SIGMOD Conf. on Management of Data, 1986, 340-355.
[SS86] Schek, H.-J., Scholl, M.H., The relational model with relation-valued
attributes, Information Systems 11(2), 1986, 137-147.
[Ste95] Steenhagen, H., Optimization of Object Query Languages, PhD the-
sis, University of Twente, 1995.
[Sto96] Stonebraker, M., Object-relational DBMSs: the next great wave, Mor-
gan Kaufmann Publishers, San Mateo, CA, USA, 1996.
[TN91] Tsangaris, M.M., Naughton, J.F., A stochastic approach for cluster-
ing in object bases, Proc. ACM SIGMOD Conf. on Management of
Data, 1991, 12-21.
[Val87] Valduriez, P., Join indices, ACM Trans. on Database Systems 12(2),
1987, 218-246.
[vBü90] von Bültzingslöwen, G., Optimierung von SQL-Anfragen für paral-
lele Bearbeitung (Optimization of SQL queries for parallel process-
ing), PhD thesis, University of Karlsruhe, 1990.
[Ver97] Versant Object Technology, Versant release 5, October 1997,
http://www.versant.com/.
[VM96] Vance, B., Maier, D., Rapid bushy join-order optimization with
Cartesian products, Proc. ACM SIGMOD Conf. on Management
of Data, 1996, 35-46.
[Wil91] Wilson, P., Pointer swizzling at page fault time: efficiently supporting
huge address spaces on standard hardware, ACM Computer Archi-
tecture News 19(4), 1991, 6-13.

[WK92] Wilson, P., Kakkad, S., Pointer swizzling at page fault time: efficiently
supporting huge address spaces on standard hardware, Proc. Int.
Workshop on Object Orientation in Operating Systems, Paris, IEEE
Press, 1992, 364-377.
[WM95] Wilhelm, R., Maurer, D., Compiler design, Addison-Wesley, 1995.
[XH94] Xie, Z., Han, J., Join index hierarchies for supporting efficient nav-
igations in object-oriented databases, Proc. 20th International Con-
ference on Very Large Data Bases (VLDB) , 1994, 522-533.
5. High Performance Parallel Database
Management Systems

Shahram Ghandeharizadeh¹, Shan Gao¹, Chris Gahagan², and Russ
Krauss²

¹ University of Southern California, Los Angeles, USA
² BMC Software Inc., Houston, USA

1. Introduction ..................................................... 195


2. Partitioning Strategies ........................................... 196
2.1 Multi-Attribute Partitioning ................................. 198
3. Join Using Inter-Operator Parallelism ............................ 201
3.1 Discussion ................................................... 202
4. ORE: a Framework for Data Migration ........................... 203
4.1 Three Steps of ORE .......................................... 206
4.2 Predict: Fragments to Migrate ............................... 207
4.3 Performance Evaluation of ORE .............................. 210
4.4 Homogeneous Configuration.................................. 212
4.5 Heterogeneous Configuration................................. 214
5. Conclusions and Future Research Directions ...................... 216

Abstract. Parallelism is the key to realizing high performance, scalable, fault tol-
erant database management systems. With the predicted future database sizes and
complexity of queries, the scalability of these systems to hundreds and thousands
of nodes is essential for satisfying the projected demand. This paper describes three
key components of a high performance parallel database management system. First,
data partitioning strategies that distribute the workload of a table across the avail-
able nodes while minimizing the overhead of parallelism. Second, algorithms for
parallel processing of a join operator. Third, ORE as a framework that controls
the placement of data to respond to changing workloads and evolving hardware
platforms.

1 Introduction
Database management systems (DBMS) have become an essential compo-
nent of many application domains, e.g., airline reservation, stock market
trading, etc. In the arena of high performance DBMS, parallel database sys-
tems have gained increased popularity. Example research prototypes include
Gamma [DGS+90], Bubba [BAC+90], XPRS [SKP+88], Volcano [Gra94b],
Omega [GCK+93], etc. Products from the industry include Tandem's Non-
Stop SQL [Tan88], NCR's DBC/1012 [Ter85], Oracle Parallel Server [Ora94],
IBM's DB2 parallel edition [BFG+95], etc. The hardware platform of these
machines is typically a multi-node platform, see Figure 1.1a, where each node
might be a computer with one or more disks, see Figure 1.1b. In these sys-
tems, several forms of parallelism can be utilized to improve the performance
of the system. First, parallelism can be applied by executing several queries or
transactions simultaneously. This form of parallelism is termed inter-query
parallelism. Second, inter-operator parallelism can be employed to execute
several operators in the same query concurrently. For example, multiple nodes
could execute two or more relational join operators of a complex bushy join
query in parallel. Finally, intra-operator parallelism can be applied to each
operator within a query. For example, multiple nodes can be employed to
execute a single relational selection operator. This chapter describes how a
system employs these alternative forms of parallelism.

Fig. 1.1. Hardware platform of a parallel DBMS: (a) a multi-node configuration;
(b) sample configuration of one node

The placement of data is important for alternative forms of parallelism.
This constitutes one focus of this book chapter. In Section 2, we describe
the tradeoffs associated with exploiting intra-operator parallelism to execute
the selection operator. This has a significant impact on the performance of a
complex query. For example, if the appropriate degree of intra-operator par-
allelism cannot be provided for the selection operator, the performance and

degree of intra-operator parallelism of the other operators in a complex query
plan may be severely limited. This is especially true for parallel systems that
utilize the concept of pipelining and data flow to execute complex queries,
i.e., systems such as Gamma [DGS+90], Bubba [BAC+90], Volcano [Gra94b],
etc.
Section 3 provides an overview of three alternative algorithms for parallel
processing of the join operator: sort-merge, Grace and Hybrid hash-join. We
refer the interested reader to [Gra93] for a comprehensive description of these
strategies and other alternatives to process this operator. In Section 4, we
describe ORE as a 3 step framework that controls the migration of fragments
across the nodes of a parallel DBMS. This section reports on an evaluation
of this technique for both a homogeneous and a heterogeneous platform. Brief
conclusions and future research directions are offered in Section 5.

2 Partitioning Strategies

Multiprocessor database machines utilize the concept of horizontal partition-
ing [RE78,LKB87] to distribute the tuples of each relation across multiple
nodes. The strategy used for partitioning a relation is independent of the
storage structure used at each site. The database administrator (DBA) for
such a system must consider a variety of alternative organizations for each
relation. Three popular partitioning strategies include: 1) round-robin, 2)
hash partitioning, and 3) range partitioning. The first strategy distributes
the tuples of a relation in a round-robin fashion among the nodes. In ef-
fect, this results in a completely random placement of tuples with a uniform
distribution of tuples across the nodes. In the hash partitioning strategy, a
randomizing function is applied to the partitioning attribute of each tuple
to select a home site for that tuple. In the last strategy, the DBA specifies
a range of key values for each partition or site. This strategy gives a greater
degree of control over the distribution of tuples across the sites.
Figures 2.1 and 2.2 show how hash and range partitioning disperse the
records of a Stock table across a 3 node configuration. Figure 2.1 shows a hash
function that consumes each record and maps it to one of the nodes by: 1)
converting its Symbol attribute value from a string into a 64 bit integer, and
2) computing the remainder of this number when divided by 3, the number
of nodes. This remainder is the node id and ranges between integer values
0, 1 and 2. Figure 2.2 shows range partitioning where those records with a
Symbol attribute value starting with letter 'A' to 'I' are assigned to node 0,
'J' to 'Q' are assigned to node 1, and 'R' to 'Z' are assigned to node 2. Each
piece of the table is termed a fragment. In Figures 2.1 and 2.2, each fragment
consists of two records. The "Symbol" attribute is termed the partitioning
attribute. With range and hash partitioning strategies, the system may direct
those queries that reference the partitioning attribute to a single node. For
example, the system can direct a query that retrieves the closing price of

Symbol   Opening   Closing   P/E
BMC      15.20     18.25     21.53
AXP      32.71     31.50     25.38
ORCL     15.25     15.00     34.09
MSFT     64.70     66.00     40.77
SNE      39.60     40.00     40.06
UPS      53.60     57.60     25.92

hash(Symbol) = (Symbol converted to a 64 bit integer) % 3

Fig. 2.1. Hash declustering of Stock relation using Symbol attribute

"BMC" to node 1 with range partitioning. (With hash, this query would
be directed to node 2.) This frees up the other two nodes to process other
queries.
When a transaction updates the partitioning attribute value of a record,
the system might migrate the record from one node to another in order to
preserve the integrity of the partitioning strategy. In the example of Fig-
ure 2.2, if the partitioning attribute (Symbol) value of a record changes from
"AXP" to "XAP", the system migrates this record from node 1 to node 3 to
preserve the integrity of range partitioning strategy.
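To make the three strategies concrete, the following minimal sketch shows how a tuple's home node could be selected under round-robin, hash, and range partitioning for the Stock example above. It is written in Java, the language used for the simulation model of Section 4.3; the string-to-integer hash and the class and method names are illustrative choices of ours, not the system's actual code.

// Illustrative sketch of round-robin, hash, and range partitioning for the
// Stock example of Figures 2.1 and 2.2; names and the hash are ours.
public class Partitioner {
    private final int nodes;          // number of nodes, e.g., 3
    private int rrNext = 0;           // round-robin cursor

    public Partitioner(int nodes) { this.nodes = nodes; }

    // Round-robin: ignores the tuple and cycles through the nodes.
    public int roundRobin() { return rrNext++ % nodes; }

    // Hash partitioning: convert the partitioning attribute (Symbol) to a
    // 64-bit integer and take the remainder modulo the number of nodes.
    public int hashPartition(String symbol) {
        long key = 0;
        for (char c : symbol.toCharArray()) key = key * 31 + c;   // string -> 64-bit int
        return (int) Math.floorMod(key, (long) nodes);
    }

    // Range partitioning: the DBA assigns a key range to each node,
    // here 'A'-'I' -> node 0, 'J'-'Q' -> node 1, 'R'-'Z' -> node 2.
    public int rangePartition(String symbol) {
        char first = Character.toUpperCase(symbol.charAt(0));
        if (first <= 'I') return 0;
        if (first <= 'Q') return 1;
        return 2;
    }

    public static void main(String[] args) {
        Partitioner p = new Partitioner(3);
        System.out.println("BMC -> node " + p.rangePartition("BMC"));   // node 0
        System.out.println("XAP -> node " + p.rangePartition("XAP"));   // node 2
        System.out.println("BMC -> node " + p.hashPartition("BMC"));    // depends on the hash
    }
}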
In [GD90], we quantified the performance tradeoff associated with range,
hash and round-robin partitioning strategies using the alternative indexing
mechanisms provided by the Gamma database machine. This study reveals
that for a shared-nothing multiprocessor database machine, no partitioning
strategy is superior under all circumstances. Rather, each partitioning strat-
egy outperforms the others for certain query types. The major reason for
this is that there exists a tradeoff between exploiting intra-query parallelism
by distributing the work performed by a query across multiple nodes and
the overhead associated with controlling the execution of a multisite query.
Localizing the execution of queries that require a minimal amount of resources
results in the best system response time and throughput, since the overhead
associated with controlling the execution of the query is either minimized or
eliminated. On the other hand, for queries requiring more resources, certain
tradeoffs are involved. In general, with access methods that result in the re-
trieval of only the relevant tuples from the disk, if the selectivity factor of the
query is very low, it is advantageous to localize the execution of the query
to a single node. While the hash partitioning strategy localizes the execution
of the exact match selection queries that reference the partitioning attribute,

the range partitioning strategy attempts to localize the execution of all query
types that reference the partitioning attribute regardless of their selectivity
factor. At the other end of the spectrum, the round-robin partitioning strat-
egy directs a query to all the nodes containing the fragments of the referenced
relation.

[Figure: the same Stock records assigned to three range partitions on Symbol: 'A'-'I', 'J'-'Q', and 'R'-'Z'.]

Fig. 2.2. Range declustering of Stock relation using Symbol attribute

For sequential scan queries, the best response time and throughput are
observed when the partitioning strategy constructs the smallest fragment size
on each node, the execution of each query is localized to a single node, and the
simultaneously executing queries are evenly dispersed across the nodes. The
system generally performs best when the query executes all by itself at a site
and performs a series of sequential disk requests. By localizing the execution
of the query to a single node, there is a higher probability of maintaining
the sequential nature of disk requests made by a query, free from interference
of the other concurrently executing queries. Thus, for the sequential scan
queries, the optimal partitioning strategy is the range partitioning strategy.

2.1 Multi-Attribute Partitioning


One may find many alternative multi-attribute partitioning strategies in
the literature. These can be categorized into 3 groups based on their ob-
jectives. The first strives to optimize the processing of the join operator.
Examples include strategies described in [0085,HL90]. (Section 3.1 details
these two techniques.) The second distributes data in a manner so that a
selection query performs approximately the same amount of work on each
node. Example strategies include Disk Modulo (DM) [DS82], Fieldwise XOR
(FX) [KP88], Error Correcting Codes (ECC) [FM89], Coordinate Modulo
Distribution (CMD) [LSR92], Hilbert Curve Allocation (HCAM) [FB93],
vector-based declustering [CR93], and Golden Ratio Sequences (GRS) [BSC00].
For a comparison of some of these strategies see [MS98].
The third group strives to localize the execution of a query that references
a partitioning attribute to as few nodes as possible. Example strategies in-
clude MAGIC [GD94,GO93] and Bubba's extended range declustering strat-
egy [BAC+90]. A comparison of these two techniques is detailed in [GD92],
demonstrating the superiority of MAGIC.
We describe the first group of multi-attribute declustering techniques in
Section 3.1 with a description of the hash-join algorithm. While at first glance
the remaining two groups might appear contradictory, they complement one
another because they are appropriate for different query classes. While the
2nd approach is appropriate for those queries that perform sufficient work at
each node to eclipse the overhead of parallelism, the 3rd is appropriate for
those that suffer from the overhead of coordinating multi-site queries. Ideally,
the latter query class should be directed to one node in order to minimize the
overhead of parallelism.
In the following, we provide a brief description of the MAGIC decluster-
ing strategy. MAGIC differs from the range and hash partitioning strategies
in two ways. First, relations are declustered into fragments using several at-
tributes instead of one. Thus, it can restrict the subset of nodes used to
execute selection operations on any of the partitioning attributes. Second,
the number of fragments and their assignment to nodes is determined from
the characteristics of the selection operations accessing the relation.

[Figure: a 6 x 6 grid directory on the Stock relation. The columns correspond to ranges of Symbol values (A-D through U-Z) and the rows to ranges of P/E values (0-10, 11-20, 21-30, 31-40, 41-50, 51-60); the 36 entries are numbered 1 through 36.]
Fig. 2.3. A two-dimensional declustering of Stock relation with MAGIC



In order to further motivate the MAGIC partitioning strategy, recall the Stock table of Figure 2.3. Assume one half of the accesses (termed query type A) to the Stock relation use an equality predicate on the Symbol attribute (e.g., select Stock.all from Stock where Stock.Symbol = "BMC") and the remaining queries (termed type B) use a range predicate on the P/E attribute (e.g., select Symbol from Stock where P/E > 10.03 and P/E < 10.09). Furthermore, assume that both queries retrieve only a few tuples. For this workload, the appropriate access methods are a hash index on the Symbol attribute and a B+ index on the P/E attribute of the Stock relation.
However, either attribute could be selected as the partitioning attribute be-
cause both queries have minimal resource requirements and, hence, should
be executed by only one or two nodes. Since the range and hash partitioning
strategies can decluster a relation only on a single attribute, both are forced
to direct either the type A or the type B queries to all the nodes, incurring
the overhead of using more nodes than absolutely necessary.
On the other hand, MAGIC declustering would construct a two dimensional directory on the Stock relation, as shown in Figure 2.3, in which each entry corresponds to a fragment - a disjoint subset of the tuples of the relation. The rows of the directory correspond to ranges of values for the P/E attribute, while the columns correspond to the intervals of the Symbol attribute value. The grid directory consists of 36 entries (i.e., fragments) and, assuming a system consisting of exactly 36 nodes, each fragment will be assigned to a different node (the details of how less contrived cases are handled are described in [GD94]). For example, tuples with Symbol attribute values ranging from letters A through D and P/E attribute values ranging from values 21 to 30 are assigned to node 13.

Next, contrast the execution of queries A and B when the Stock table is
hash partitioned on the Symbol attribute with when it is declustered using
MAGIC and the assignment presented in Figure 2.3. Query type A is an exact
match query on the Symbol attribute. The hash partitioning strategy local-
izes the execution of this query to a single node. The MAGIC declustering
strategy employs six nodes to execute this query because its selection pred-
icate maps to one column of the two dimensional directory. As an example,
consider the query that selects the record corresponding to BMC Software
(Stock.Symbol = "BMC"). The predicate of this query maps to the first col-
umn of the grid directory and nodes 1, 7, 13, 19, 25, and 31 are employed to
execute it.

Query type B is a range query on the P/E attribute. The hash partitioning strategy must direct this query to all 36 nodes because P/E is not the partitioning attribute. Again, MAGIC directs this query to six nodes since its predicate value maps to one row of the grid directory and the entries of each row have been assigned to six different nodes. If instead the Stock relation was range partitioned on the P/E attribute, a single node would have been

used to execute the second query; however, then the first query would have
been executed by all 36 nodes.
Consequently, the MAGIC partitioning strategy uses an average of six
nodes, while the range and hash partitioning strategies both use an average
of 18.5 nodes. Ideally, however, a single node should have been used for each
query since they both have minimal resource requirements. Approximating
the optimal number of nodes closely provides two important benefits. First,
the average response time of both queries is reduced because query initia-
tion overhead [CAB+88] is reduced. Second, using fewer nodes increases the
overall throughput of the system because the "freed" nodes can be used to
execute additional queries.
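The following sketch illustrates how such a grid directory restricts a selection to one row or one column of nodes. The 6 x 6 grid and its numbering from 1 to 36 follow Figure 2.3; the exact Symbol and P/E boundary values are illustrative where the figure's labels are not legible, as are the class and method names.

import java.util.ArrayList;
import java.util.List;

// Sketch of a MAGIC-style two-dimensional grid directory for the Stock
// relation: columns partition Symbol, rows partition P/E, and entry
// (row, col) holds the node assigned to that fragment.
public class MagicDirectory {
    private final int cols = 6, rows = 6;

    // Fragment (row, col) is assigned to node row*cols + col + 1 (nodes 1..36),
    // matching the numbering of Figure 2.3.
    private int node(int row, int col) { return row * cols + col + 1; }

    // Column index for a Symbol value; boundary values are illustrative.
    private int symbolColumn(String symbol) {
        char c = Character.toUpperCase(symbol.charAt(0));
        return Math.min((c - 'A') / 4, cols - 1);   // A-D, E-H, I-L, ..., U-Z
    }

    // Row index for a P/E value; rows cover 0-10, 11-20, ..., 51-60.
    private int peRow(double pe) {
        return Math.min(Math.max((int) ((pe - 1) / 10), 0), rows - 1);
    }

    // Exact-match predicate on Symbol -> one column -> six nodes.
    public List<Integer> nodesForSymbol(String symbol) {
        List<Integer> out = new ArrayList<>();
        int col = symbolColumn(symbol);
        for (int r = 0; r < rows; r++) out.add(node(r, col));
        return out;
    }

    // Range predicate on P/E (assumed to fall in one row) -> six nodes.
    public List<Integer> nodesForPe(double pe) {
        List<Integer> out = new ArrayList<>();
        int row = peRow(pe);
        for (int c = 0; c < cols; c++) out.add(node(row, c));
        return out;
    }

    public static void main(String[] args) {
        MagicDirectory d = new MagicDirectory();
        System.out.println(d.nodesForSymbol("BMC")); // [1, 7, 13, 19, 25, 31]
        System.out.println(d.nodesForPe(25.0));      // row 21-30 -> [13, 14, 15, 16, 17, 18]
    }
}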

3 Join Using Intra-Operator Parallelism

A common join operator is the equi-join operator, R.A = S.A. It concatenates
a tuple of R with those tuples of S that have matching values for attribute
A. This section describes sort-merge [DKO+84,Gra93,Gra94a], Grace hash-
join [NKT89] and Hybrid hash-join [DKO+84,Sha86] as ways to process this
operator in parallel. A common feature of these algorithms is their re-partitioning
of relations R and S using the joining attribute A. This divides the join oper-
ator into a collection of disjoint joins that can be processed in parallel and
independent of one another. Following a description of each algorithm, Sec-
tion 3.1 describes how these techniques compare with one another and the
role of multi-attribute partitioning strategies with these algorithms.

Sort-merge. A parallel version of sort-merge join is a straightforward extension of its single-node implementation. Its details are as follows. First, the smaller of the two joining relations, R, is hash partitioned using attribute A. Its tuples are stored in temporary files as they arrive at each node. Next, relation S is partitioned across the nodes using the same hash function applied to attribute A. The use of the same hash function guarantees that those tuples of R at node i may join only with those of S at the same node. In a final step, a local merge join operation is performed by each node, in parallel with other nodes. The results might be stored in a file or pipelined onto other operators that might consume the result of this join operator.
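The final, local step amounts to an ordinary merge join of the two sorted temporary files at each node. A minimal sketch, with an illustrative tuple layout and names of our own, is:

import java.util.ArrayList;
import java.util.List;

// Sketch of the local merge join performed at one node: both temporary files
// have been sorted on the join attribute A, and matching runs are concatenated.
public class LocalMergeJoin {
    record Tuple(int a, String payload) {}   // join attribute A plus other fields

    // rSorted and sSorted must be sorted ascending on attribute a.
    static List<String> mergeJoin(List<Tuple> rSorted, List<Tuple> sSorted) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < rSorted.size() && j < sSorted.size()) {
            int ra = rSorted.get(i).a(), sa = sSorted.get(j).a();
            if (ra < sa) i++;
            else if (ra > sa) j++;
            else {
                // Emit the cross product of the two runs with equal A values.
                int jStart = j;
                while (i < rSorted.size() && rSorted.get(i).a() == ra) {
                    for (j = jStart; j < sSorted.size() && sSorted.get(j).a() == ra; j++)
                        out.add(rSorted.get(i).payload() + "|" + sSorted.get(j).payload());
                    i++;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tuple> r = List.of(new Tuple(1, "r1"), new Tuple(2, "r2"), new Tuple(2, "r3"));
        List<Tuple> s = List.of(new Tuple(2, "s1"), new Tuple(4, "s2"));
        System.out.println(mergeJoin(r, s));   // [r2|s1, r3|s1]
    }
}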

Grace hash-join. The Grace hash-join algorithm [NKT89] works in three steps. In the first step, the algorithm hash partitions relation R into N buckets using its join attribute A. In the second step, it partitions relation S into N buckets using the same hash function. In the last step, the algorithm processes each bucket Bi of R and S to compute the joining tuples.
Ideally, N should be chosen so that each bucket is almost the same size as the available memory without exceeding it. To accomplish this

objective, the algorithm starts with a very large value for N. This reduces the
probability of a bucket exceeding the memory size. If the buckets are much
smaller than main memory, the algorithm combines several buckets into one
during the third phase to approximate the available memory.
This algorithm is different from sort-merge in one fundamental way: in its
last step, the tuples from bucket Bi of R are stored in a memory resident hash
table (using the join attribute, attribute A). The tuples from bucket Bi of S
are used to probe this hash table for matching tuples. Grace hash-join may use the
smaller table (say R) to determine the number of buckets: this calculation is
independent of the larger table (S).
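The last step at each node therefore reduces to a build phase over R's bucket and a probe phase over S's bucket. A minimal sketch, with an illustrative tuple layout and names of our own, is:

import java.util.*;

// Sketch of the third step of Grace hash-join at one node: bucket Bi of R
// builds an in-memory hash table on the join attribute A; bucket Bi of S
// probes it for matching tuples.
public class GraceJoinStep {
    record Tuple(int a, String payload) {}   // join attribute A plus other fields

    static List<String> joinBucket(List<Tuple> rBucket, List<Tuple> sBucket) {
        // Build phase: hash the (smaller) R bucket on attribute A.
        Map<Integer, List<Tuple>> hashTable = new HashMap<>();
        for (Tuple r : rBucket)
            hashTable.computeIfAbsent(r.a(), k -> new ArrayList<>()).add(r);

        // Probe phase: each S tuple looks up matching R tuples.
        List<String> result = new ArrayList<>();
        for (Tuple s : sBucket)
            for (Tuple r : hashTable.getOrDefault(s.a(), List.of()))
                result.add(r.payload() + "|" + s.payload());   // concatenated join result
        return result;
    }

    public static void main(String[] args) {
        List<Tuple> r = List.of(new Tuple(1, "r1"), new Tuple(2, "r2"));
        List<Tuple> s = List.of(new Tuple(2, "s1"), new Tuple(3, "s2"));
        System.out.println(joinBucket(r, s));   // [r2|s1]
    }
}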

Hybrid hash-join. The Hybrid hash-join also operates in three steps. Its main difference when compared with Grace hash-join is as follows. It retains the tuples of the first bucket of R in memory to build the memory resident hash table, while the remaining N-1 buckets are stored in temporary files. Relation S is partitioned using the same hash function. Again, the last N-1 buckets are stored in temporary files while the tuples in the first bucket are used to immediately probe the memory resident hash table for matching tuples.

3.1 Discussion

A comparison of sort-merge, Grace and Hybrid hash-join algorithms (along with other variants) is reported in [Bra84,DG85,SD89]. In general, Grace and Hybrid provide significant savings when compared with sort-merge. Hybrid outperforms Grace as long as the first bucket does not overflow the available memory. Assuming that the sizes of R and S are fixed, both Hybrid and sort-merge are sensitive to the available memory size. Grace hash-join is relatively insensitive to the amount of available memory because it performs bucket tuning in the first step. The performance of Hybrid improves when large amounts of memory are available. Sort-merge also benefits (in a step-wise manner as a function of available memory) because it can sort R and S with fewer iterations of reading and writing each table.
One may employ bit filters [Bab79,VG84] to improve the performance of these algorithms. The concept is simple. An array of bits is initialized to zero. During the partitioning phase of R, a hash function is applied to the join attribute A of each tuple and the appropriate bit is set to one. The fully constructed bit filter is then used when partitioning relation S. When reading a record of S, the same hash function is applied to the joining attribute of each tuple. If the corresponding bit is set then the tuple is transmitted for further processing. Otherwise, there is no possibility of that tuple joining and it can be eliminated from further consideration. This minimizes the network traffic and subsequent processing, e.g., with sort-merge, the eliminated tuples are not sorted, reducing the number of I/Os.
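A minimal sketch of such a bit filter follows; the filter size and the single hash function are illustrative choices (a production system might use several hash functions, in the style of a Bloom filter).

import java.util.BitSet;

// Sketch of a bit filter: bits are set from the join attribute of R during its
// partitioning phase; S tuples whose bit is not set cannot possibly join and
// are dropped before being transmitted.
public class BitFilter {
    private final BitSet bits;
    private final int size;

    public BitFilter(int size) { this.size = size; this.bits = new BitSet(size); }

    private int hash(long joinKey) {
        return (int) Math.floorMod(joinKey * 0x9E3779B97F4A7C15L, (long) size);
    }

    // Called for every R tuple while R is being partitioned.
    public void add(long joinKey) { bits.set(hash(joinKey)); }

    // Called for every S tuple: false means the tuple cannot possibly join.
    public boolean mightJoin(long joinKey) { return bits.get(hash(joinKey)); }

    public static void main(String[] args) {
        BitFilter f = new BitFilter(1 << 16);
        long[] rKeys = {10, 42, 77};
        for (long k : rKeys) f.add(k);
        System.out.println(f.mightJoin(42));   // true
        System.out.println(f.mightJoin(13));   // false (with high probability)
    }
}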

One may control partitioning of tables to enhance the performance of the join operator. For example, the DYOP technique [0085] distributes a data
file into a set of partitions (or buckets) by repeatedly subdividing the tuple
space of multiple attribute domains (in a fashion that is almost identical to
the grid file [NH84] algorithm). To execute a hash-join query efficiently, the
size of each partition is defined to equal the aggregate memory of the nodes
in the system. Since the DYOP structure preserves the order of tuples in
the attribute domain space, the bucket formation step of Grace hash-join
algorithm is eliminated and the join of relations R and S is accomplished
by reading each relation only once. Similarly, [HL90] also proposes the use
of a multi-attribute partitioning to minimize the impact of data distribution
during the construction of the hash table on the inner relation when executing
a parallel hash join. The basic idea is as follows. Assume a relation R that
is frequently joined with relations S and T. When R is joined with S, the
A attribute of R is used and, when R is joined with T, the Y attribute of
R is used as the joining attribute. By building a grid file on the A and Y
attributes of R which is then used to decluster the tuples of R, it is possible
to minimize how many tuples of R are redistributed when it is joined with
either S or T.

4 ORE: a Framework for Data Migration

While techniques such as MAGIC decluster a relation by analyzing its workload, this workload might evolve over time. Another challenge is the gradual
evolution of a homogeneous system to a heterogeneous one. This might hap-
pen for several reasons. First, disks fail and it might be more economical
to purchase newer disk models that are faster and cheaper than the origi-
nal models. Second, the application might grow over time (in terms of both
storage and bandwidth requirement) and demand additional nodes from the
underlying hardware. Once again, it might be more economical to extend the
existing configuration by purchasing newer hardware that is faster than the
original nodes.
With evolving workloads and environments, data must be re-organized
to respond to these changes. Ideally, the parallel DBMS should respond
to these changes and fine-tune the placement of data. This can be per-
formed at different granularities: 1) record level by repartitioning records
and controlling assignment of records to each node [LKO+00], and 2) frag-
ment level [SWZ98,VBW98,GGG+01] by either migrating fragments from
one node to another or breaking fragments into pieces and migrating some
of their pieces to different nodes. We focus on the latter approach in the rest of
this chapter.
In order to simplify discussion and without loss of generality, we assume
an environment consisting of K storage devices. In essence, each node of
Figure 1.1a is a storage device. Each storage device di has a fixed storage

capacity, C(di), an average bandwidth, BW(di), and a Mean Time To Failure, MTTF(di). With one or more applications that consume Btotal bandwidth during a fixed amount of time, ideally, each disk must contribute a bandwidth proportional to its BW(di):

    load(di) = Btotal x BW(di) / Σj=1..K BW(dj)    (1)

The bandwidth of a disk is a function of the block size (B) and its physical characteristics [GG97,BGM+94]: seek time, rotational latency, and transfer rate (tfr). It is defined as:

    BW(di) = (tfr x B) / (B + tfr x (seek time + rotational latency))    (2)

Given a fixed seek time and rotational latency, BW(di) approaches the disk transfer rate with larger block sizes.
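As an illustration of Equation 2, the following sketch computes BW(di) for a 40 MB/sec disk (the disk model used in Section 4.3) at several block sizes. The 8 ms seek time and 4 ms rotational latency are assumed values chosen for illustration, not figures from the study.

// Effective disk bandwidth as a function of block size (Equation 2):
// BW = (tfr * B) / (B + tfr * (seek + rotational latency)).
public class DiskBandwidth {
    static double bandwidth(double blockBytes, double tfrBytesPerSec,
                            double seekSec, double rotLatencySec) {
        return (tfrBytesPerSec * blockBytes)
                / (blockBytes + tfrBytesPerSec * (seekSec + rotLatencySec));
    }

    public static void main(String[] args) {
        double tfr = 40e6;   // 40 MB/sec transfer rate (the Section 4.3 disks)
        for (double block : new double[]{2 * 1024, 128 * 1024, 1 << 20, 16 << 20}) {
            // assumed 8 ms seek and 4 ms rotational latency
            System.out.printf("block %10.0f bytes -> %6.2f MB/sec%n",
                    block, bandwidth(block, tfr, 0.008, 0.004) / 1e6);
        }
    }
}

With 2 kilobyte blocks the effective bandwidth is a tiny fraction of the transfer rate, which is consistent with the observation in Section 4.5 that seek and rotational delays dominate small block transfers.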
There are F files stored on the underlying storage. The number of files might change over time, causing the value of F to change. A file fi might be partitioned into two or more fragments. Its number of fragments is independent of the number of storage devices, i.e., K. Fragments of a file may have different sizes. Fragment j of file fi is denoted as fi,j. In our assumed environment, two or more fragments of a file might be assigned to the same disk drive¹. Moreover, a file fi may specify a certain availability requirement from the underlying system. For example, it may specify that its mean-time-to-data-loss, MTTDL(fi), should exceed 200,000 hours, MTTDLmin(fi) = 200,000 hours.
We assume physical disk drives fail independently of one another. Each disk has a certain failure rate [ZG00,SS82,Gib92], termed λfailure. Its mean-time-to-failure (MTTF) is simply 1/λfailure. When a file (say fi) is partitioned into n fragments and assigned to n disks (say d1 to dn) then the data becomes unavailable in the presence of a single failure². Hence, its mean-time-to-data-loss is defined as follows [ZG00,SS82,Gib92]:

    MTTDL(fi) = 1 / ( Σj=1..n λfailure(dj) )    (3)

For example, if the MTTF of disk A and B is 1 million and 2 million hours,
respectively, then the MTTDL of a file with fragments scattered across these
two disks is 666,666 hours.
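A one-line computation of Equation 3 reproduces this example (the method names below are illustrative):

// MTTDL of a file striped across n disks (Equation 3): the failure rates add,
// so MTTDL = 1 / sum_j (1 / MTTF(d_j)).
public class Mttdl {
    static double mttdl(double... mttfHours) {
        double lambdaSum = 0.0;
        for (double mttf : mttfHours) lambdaSum += 1.0 / mttf;   // lambda_failure(d_j) = 1/MTTF
        return 1.0 / lambdaSum;
    }

    public static void main(String[] args) {
        // 1 million and 2 million hour disks -> approximately 666,667 hours
        System.out.printf("%.0f hours%n", mttdl(1_000_000, 2_000_000));
    }
}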

1 As compared with [SWZ98] that requires each fragment of a file to be assigned to a different disk drive.
2 There has been a significant amount of research on construction of parity data blocks and redundant data, see [ZG00] that focuses on this for heterogeneous disks. This topic is beyond the focus of this study. In this chapter, we control the placement without constructing redundant data.

We use the EVEREST [GIZ96,GIZ01] file system to approximate a contiguous layout of a file fragment on the disk drive. With EVEREST, the basic unit of allocation is a block, also termed a section of height 0. EVEREST combines these blocks in a tree-like fashion to form larger, contiguous sections. As illustrated in Figure 4.1, only sections of size(block) x B^i (for i ≥ 0) are valid, where the base B is a system configuration parameter. If a section consists of B^i blocks then i is said to be the height of the section. In general, B height i sections (physically adjacent) might be combined to construct a height i + 1 section.


Fig. 4.1. Physical division of disk space into blocks and the corresponding logical
view of the sections with an example base of B = 2

To illustrate, the disk in Figure 4.1 consists of 16 blocks. The system is configured with B = 2. Thus, the size of a section may vary from 1, 2, 4, 8, up to 16 blocks. In essence, a binary tree is imposed upon the sequence of blocks. The maximum height, given by³ N = ⌊logB(capacity / size(block))⌋, is 4. With this organization imposed upon the disk drive, sections of height i ≥ 0 cannot start at just any block number, but only at offsets that are multiples of B^i. This restriction ensures that any section, with the exception of the one at height N, has a total of B - 1 adjacent buddy sections of the same size at all times. With the base 2 organization of Figure 4.1, each block has one buddy.
A fragment might be represented as several sections. Each is termed a
chunk. The file system maintains the heat of each chunk at the granularity
of a fixed offset from its section height. For example, with a chunk of height
8, the system might maintain its heat at offset 2. With B equal to 2, this
means that the system maintains the heat of the four sections of height 6 that
constitute this chunk. This enables the reorganization algorithm to break a
3 To simplify the discussion, assume that the total number of blocks is a power of
B. The general case can be handled similarly and is described in [GIZ96,GIZ01].

fragment into many smaller pieces and disperse them amongst the available
disk drives.
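A small sketch of these structural rules, using the 16-block, B = 2 disk of Figure 4.1, follows; the class and method names are illustrative.

// Sketch of EVEREST's layout rules with base B: a section of height i spans
// B^i blocks and may start only at block offsets that are multiples of B^i.
public class EverestLayout {
    final int base;          // the system configuration parameter B
    final int totalBlocks;   // assumed to be a power of B, as in Figure 4.1

    EverestLayout(int base, int totalBlocks) { this.base = base; this.totalBlocks = totalBlocks; }

    int sectionBlocks(int height) { return (int) Math.pow(base, height); }   // B^i blocks

    // Maximum section height N for a disk whose block count is a power of B.
    int maxHeight() { return (int) Math.round(Math.log(totalBlocks) / Math.log(base)); }

    // A height-i section may start only at offsets that are multiples of B^i.
    boolean validStart(int height, int startBlock) { return startBlock % sectionBlocks(height) == 0; }

    public static void main(String[] args) {
        EverestLayout ev = new EverestLayout(2, 16);   // the 16-block, B = 2 disk of Figure 4.1
        System.out.println(ev.maxHeight());            // 4
        System.out.println(ev.validStart(2, 4));       // true: offset 4 is a multiple of 2^2
        System.out.println(ev.validStart(2, 6));       // false
    }
}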

4.1 Three Steps of ORE

Our framework consists of 3 logical steps: monitor, predict, and migrate. We partition time into fixed intervals, termed time slices. During monitor, we construct a profile of the load imposed by each file fragment per time slice. During predict, we compute what fragments to migrate from one disk to another in order to enhance system performance. Migrate changes the placement of candidate fragments. Below, we detail each of these steps.

Monitor constructs a profile of the load imposed on each disk drive and the average response time of each disk di. The load imposed on disk drive di is quantified as the bandwidth required from disk di. It is the total number of bytes retrieved from di during a time slice divided by the duration of the time slice. The average response time of di is the average response time of the requests it processes during the time interval.
This process produces three tables that are used by the other two steps:

• FragProfiler table maintains the average block request size, heat, and load
  imposed by each fragment fi,j per time slice,
• for each disk drive di per time slice, DiskProfiler table maintains the heat,
  load, standard deviation in system load, average response time, average
  queue length, and utilization of di,
• FragOvlp table maintains the OVERLAP between two fragments per
  time slice. The concept of OVERLAP is detailed in Section 4.2.

Predict determines what fragments to migrate to enhance response time. Section 4.2 describes several techniques that can be employed for this step. In Section 4.3, we quantify the tradeoff associated with these alternatives.

Migrate modifies the placement of data. We considered two algorithms for fragment migration. With the first, the fragment is locked in exclusive mode while it is migrated from dsrc to ddst. This simple algorithm prevents updates while the fragment is migrating. It is efficient and easy to implement. However, the data might appear to be unavailable during the reorganization process. Due to this limitation, we ignore this algorithm from further consideration.
The second supports concurrent updates by performing each update against two copies of the migrating fragment: (a) one on dsrc, termed primary, and (b) the other on ddst, termed secondary. The secondary copy is constructed from the primary copy of the fragment. All read requests are directed to the primary copy. All updates are performed against both the primary and secondary copy. The migration process is a background task that is performed based on

availability of bandwidth from dsrc. It assumes some buffer space for staging
data from the primary copy to facilitate construction of its secondary copy. This
buffer space might be provided as a component of the embedded device.
Depending on its size, the system might read and write units larger than
a block. Moreover, it might perform writes against ddst in the background
depending on the amount of free buffer space. Once the free space falls below
a certain threshold, the system might perform writes as foreground tasks that
compete with active user requests [AKN+97].

4.2 Predict: Fragments to Migrate

In this section, we describe two algorithms that strive to distribute the load
of an application evenly across the K disks. These are termed EVEN and
EVENC/B. As implied by their name, EVENC/B is a variant of EVEN. A
taxonomy of alternative techniques can be found in [GGG+01].

EVEN: Constrained by bandwidth. At the end of each time slice, EVEN computes the fair-share of system load for each disk drive. Next, it identifies the disk with (a) maximum positive load imbalance, termed dsrc, and (b) minimum negative load imbalance, termed ddst. (The concept of load imbalance is formalized in the next paragraph.) Amongst the fragments of dsrc, it chooses the one with a load closest to the minimum negative load of ddst. It migrates this fragment from dsrc to ddst. This process repeats until either there are no source and destination disks or a new time slice arrives.
The maximum positive load imbalance pertains to those disks with an imposed load greater than their fair share. For each such disk di, its δ+(di) = load(di) - Fairshare(di). The positive imbalance of di is defined as δ+(di) / Fairshare(di). EVEN identifies the disk with the highest such value as the source disk, dsrc, to migrate fragments from.
The minimum negative load imbalance corresponds to those disks with an imposed load less than their fair share. For each such disk di, its δ-(di) = load(di) - Fairshare(di). The negative imbalance of di is δ-(di) / Fairshare(di). The disk with the smallest negative imbalance⁴ is the destination disk, ddst, and EVEN migrates fragments to this disk.
EVEN defines XTRA as the difference between the current load of dsrc and its fair share, XTRA = load(dsrc) - Fairshare(dsrc). The difference between the fair share of ddst and its current load is termed LACKING, LACKING = Fairshare(ddst) - load(ddst). EVEN identifies fragments from dsrc with an imposed load approximately the same as LACKING. Next, it migrates these fragments to ddst.

4 Given two disks, d1 and d2 with negative imbalances of -0.5 and -2.0, respectively,
d2 has the minimum negative load imbalance.
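The following sketch illustrates one round of EVEN under the assumption that a disk's fair share of the observed load is proportional to its bandwidth, as suggested by Equation 1. The data structures, class names, and example loads are illustrative only.

import java.util.*;

// One round of EVEN: compute each disk's fair share, pick the most overloaded
// disk d_src and the most underloaded disk d_dst, and move the d_src fragment
// whose load is closest to LACKING.
public class EvenRound {
    record Disk(String id, double bw, Map<String, Double> fragLoad) {
        double load() { return fragLoad.values().stream().mapToDouble(Double::doubleValue).sum(); }
    }

    static void rebalanceOnce(List<Disk> disks) {
        double totalLoad = disks.stream().mapToDouble(Disk::load).sum();
        double totalBw = disks.stream().mapToDouble(Disk::bw).sum();

        Disk src = null, dst = null;
        double worstPos = 0, worstNeg = 0;
        for (Disk d : disks) {
            double fair = totalLoad * d.bw() / totalBw;     // Fairshare(d_i)
            double imbalance = (d.load() - fair) / fair;    // relative imbalance
            if (imbalance > worstPos) { worstPos = imbalance; src = d; }
            if (imbalance < worstNeg) { worstNeg = imbalance; dst = d; }
        }
        if (src == null || dst == null) return;             // already balanced

        double lacking = totalLoad * dst.bw() / totalBw - dst.load();   // LACKING of d_dst
        String best = null;
        double bestDiff = Double.MAX_VALUE;
        for (var e : src.fragLoad().entrySet()) {            // fragment with load closest to LACKING
            double diff = Math.abs(e.getValue() - lacking);
            if (diff < bestDiff) { bestDiff = diff; best = e.getKey(); }
        }
        if (best != null) {                                  // migrate it from d_src to d_dst
            dst.fragLoad().put(best, src.fragLoad().remove(best));
            System.out.println("migrate " + best + " from " + src.id() + " to " + dst.id());
        }
    }

    public static void main(String[] args) {
        Disk d0 = new Disk("d0", 40, new HashMap<>(Map.of("f1,1", 30.0, "f1,2", 10.0)));
        Disk d1 = new Disk("d1", 40, new HashMap<>(Map.of("f2,1", 5.0)));
        rebalanceOnce(List.of(d0, d1));   // moves f1,2: its 10 units are closest to LACKING = 17.5
    }
}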

EVENC/B: Constrained by bandwidth with Cost/Benefit Consideration. EVENC/B extends EVEN by quantifying the benefit and cost of each candidate migration from dsrc to ddst. The next paragraph describes how the system quantifies the cost and benefit of each candidate migration. EVENC/B sorts candidate migrations based on their net benefit, i.e., benefit - cost, scheduling the one that provides the greatest savings. After each migration, the cost of each candidate migration is re-computed (because it might have changed) and the list is resorted. Section 4.3 shows that this algorithm outperforms EVEN.
In the rest of this section, we describe how to quantify the benefit and cost of migrating a fragment fi,j from dsrc to ddst. Its unit of measurement is time, i.e., milliseconds. The cost of migrating a fragment is the total time spent by dsrc to read the fragment and ddst to write the fragment.
The benefit of migrating fi,j is measured in the context of previous time slices. ORE hypothesizes a virtual state where fi,j resides on ddst and measures the improvement in average response time. In essence, it estimates an answer to the following question: "What would be the average response time if fi,j resided on ddst?" By comparing this with the observed response time, we quantify the benefit of a migration. Of course, this number might be a negative value which implies no benefit to performing this migration. Note that this methodology assumes that the past access patterns are an indication of future access patterns.
We start by describing a methodology to estimate an answer to the hypothetical "what-if" question. Next, we formalize how to compute the benefit. Our methodology to estimate an answer to the "what-if" question is fairly accurate; its highest observed percentage of error is 23%. We realize this accuracy for two reasons: First, we assume the system is privy to all block references and the status of each storage device. Second, we maintain one additional piece of information, namely the degree of overlap between two fragments, termed OVERLAP(fi,j, fk,l). This information is maintained for each time slice and used to predict response time.
In order to define OVERLAP and describe our methodology, and without loss of generality, assume that we are answering the "what-if" question in the context of one time slice. To simplify the discussion further, assume that the environment consists of homogeneous disk drives. (This assumption is removed at the end of this section.) The average system response time, RTavg, is a function of the average response time observed by requests referencing each fragment. Assuming F files, each partitioned into at most G fragments, it is defined as:

    RTavg = ( Σi=1..F Σj=1..G RTavg(fi,j) ) / (F x G)    (4)

The average response time of a fragment, RTavg(fi,j), is the sum of its average
service time, Savg(fi,j), and wait time, Wavg(fi,j), of requests that reference

it:

    RTavg(fi,j) = Savg(fi,j) + Wavg(fi,j)    (5)

Savg(fi,j) is a function of the disk it resides on and the average requested block size. For each fragment, as detailed in the monitor step of Section 4.1, ORE maintains the average requested block size in the FragProfiler table. Thus, given a disk drive ddst and a fragment fi,j, ORE can estimate what Savg(fi,j) would be if fi,j resided on ddst (using the physical characteristics of ddst).
To compute Wavg, we note that each request has an arrival time, Tarvl, that can be registered by the embedded device. For each fragment fi,j residing on disk di, we maintain when the requests referencing fi,j will depart the system, termed Tdepart. Tdepart is estimated by analyzing the wait time in the queue of di. Upon the arrival of a request referencing fragment fk,l, we examine all those fragments with a non-negative Tdepart. For each, we set OVERLAP(fk,l, fi,j, Tarvl) to be the difference between Tarvl(fk,l) and Tdepart(fi,j): OVERLAP(fk,l, fi,j, Tarvl) = Max(0, Tdepart(fi,j) - Tarvl(fk,l)). For a time slice, OVERLAP(fk,l, fi,j) is the sum of those OVERLAP(fk,l, fi,j, Tarvl) whose Tarvl is during the time slice. In our implementation, we maintained OVERLAP(fk,l, fi,j) as an integer that is initialized to zero at the beginning of each time slice. Upon the arrival of a request referencing fk,l, we increment OVERLAP(fk,l, fi,j) by OVERLAP(fk,l, fi,j, Tarvl). This minimizes the amount of required memory.
OVERLAP(fk,l, fi,j) defines how long requests referencing fk,l wait in a queue because of requests that reference fi,j. Assuming that fi,j and fk,l are the only fragments assigned to disk di and the system processes #Req(fk,l) requests that reference fk,l, the average wait time for these requests is:

    Wavg(fk,l) = ( OVERLAP(fk,l, fi,j) + OVERLAP(fk,l, fk,l) ) / #Req(fk,l)    (6)

It is important to observe the following three details. First, self OVERLAP is also defined for a fragment fk,l, i.e., there exists a value for OVERLAP(fk,l, fk,l). This enables ORE to estimate how long requests that reference the same fragment wait for one another. Second, this paradigm is flexible enough to enable ORE to maintain OVERLAP(fk,l, fi,j) even when fk,l and fi,j reside on different disks. ORE uses this to estimate a response time for a hypothetical configuration where fi,j migrates to the disk containing fk,l. Third, ORE can estimate the response time of a disk drive for an arbitrary assignment of fragments to disks using Equation 4.
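The bookkeeping described above can be sketched as follows; the departure-time estimates, which in ORE come from analyzing the disk queues, are simply supplied as inputs here, and all names are illustrative.

import java.util.HashMap;
import java.util.Map;

// Sketch of OVERLAP bookkeeping for one time slice: when a request referencing
// fragment fkl arrives at time tArrival, every fragment with a recorded departure
// time contributes Max(0, tDepart - tArrival) to OVERLAP(fkl, fij).
public class OverlapTracker {
    private final Map<String, Double> depart = new HashMap<>();    // fragment -> Tdepart
    private final Map<String, Double> overlap = new HashMap<>();   // "fkl|fij" -> accumulated overlap

    // Record the estimated departure time of the requests referencing a fragment.
    public void setDeparture(String fragment, double tDepart) { depart.put(fragment, tDepart); }

    // Called on the arrival of a request referencing fkl.
    public void onArrival(String fkl, double tArrival) {
        for (var e : depart.entrySet()) {
            double delta = Math.max(0.0, e.getValue() - tArrival);
            overlap.merge(fkl + "|" + e.getKey(), delta, Double::sum);   // includes self OVERLAP
        }
    }

    public double overlap(String fkl, String fij) { return overlap.getOrDefault(fkl + "|" + fij, 0.0); }

    public static void main(String[] args) {
        OverlapTracker t = new OverlapTracker();
        t.setDeparture("f1,1", 120.0);        // requests for f1,1 depart at t = 120
        t.setDeparture("f2,3", 100.0);
        t.onArrival("f2,3", 90.0);            // a request for f2,3 arrives at t = 90
        System.out.println(t.overlap("f2,3", "f1,1"));   // 30.0
        System.out.println(t.overlap("f2,3", "f2,3"));   // 10.0 (self OVERLAP)
    }
}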
Based on Equation 5, there are two ways to enhance the response time observed by requests that reference a fragment. First, migrate the fragment to a faster disk for an improved service time, Savg. Second, migrate a fragment fi,j away from those disks whose resident fragments have a high OVERLAP with fi,j. Figure 4.2 shows the pseudo-code to estimate the benefit of migrating fi,j from dsrc to ddst. ORE may compute this for N previous time slices

1. Number of accesses processed by disk dsrc is Accesssrc.
2. Number of accesses processed by disk ddst is Accessdst.
3. Look-up the average response time of dsrc prior to migration, termed
   RTsrc,before.
4. Look-up the average response time of ddst prior to migration, termed
   RTdst,before.
5. Estimate the average response time of dsrc after migration, termed RTsrc,after.
6. Estimate the average response time of ddst after migration, termed RTdst,after.
7. Total response time savings of dsrc after migration is:
   Savingssrc = (Accesssrc,before x RTsrc,before) - (Accesssrc,after x RTsrc,after).
8. Total response time savings of ddst after migration is:
   Savingsdst = (Accessdst,before x RTdst,before) - (Accessdst,after x RTdst,after).
9. Benefit of migrating fi,j is Benefit(fi,j) = Savingssrc + Savingsdst.

Fig. 4.2. Pseudo-code to compute the benefit of a candidate migration

where N is an arbitrary number. The only requirement is that the embedded device must provide sufficient space to store all data pertaining to these intervals.
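A minimal Java transcription of the steps in Figure 4.2 follows, with savings computed as total response time before minus total response time after so that a positive benefit corresponds to an improvement. The post-migration access counts and response times are assumed to come from the what-if estimates described above, and the numbers in the example are invented for illustration.

// Sketch of the benefit computation of Figure 4.2: the saving on d_src plus the
// (typically negative) saving on d_dst.
public class MigrationBenefit {
    static double savings(long accessesBefore, double rtBefore,
                          long accessesAfter, double rtAfter) {
        // Saving = total response time before - total response time after.
        return accessesBefore * rtBefore - accessesAfter * rtAfter;
    }

    static double benefit(long srcAccBefore, double srcRtBefore, long srcAccAfter, double srcRtAfter,
                          long dstAccBefore, double dstRtBefore, long dstAccAfter, double dstRtAfter) {
        return savings(srcAccBefore, srcRtBefore, srcAccAfter, srcRtAfter)
             + savings(dstAccBefore, dstRtBefore, dstAccAfter, dstRtAfter);
    }

    public static void main(String[] args) {
        // Illustrative numbers: d_src sheds load and gets faster, d_dst absorbs it.
        double b = benefit(1000, 12.0, 700, 8.0,   // d_src: 12000 -> 5600 ms
                           400, 5.0, 700, 6.5);    // d_dst: 2000 -> 4550 ms
        System.out.println(b);                     // 3850.0 ms of net saving
    }
}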
The OVERLAP of two fragments is maintained in the FragOvlp table. Given G fragments, in the worst case scenario, the system maintains (G² + G)/2 integer values. For example, with 630 fragments (G=630) and records that are 348 bytes long, in the worst case scenario, the system would store 65 megabytes of data per time slice. In our experiments, the amount of required storage was significantly less than this, only 70 kilobytes per time slice. With the 80-20 rule, we expect this to hold true for almost all applications. In Section 5, we describe how ORE can employ a circular buffer to limit the size of trace data that it gathers from the system.

4.3 Performance Evaluation of ORE

We used a trace driven simulation study to quantify the performance of ORE. We analyzed two alternative environments: First, a homogeneous environment consisting of identical disk models. Second, a heterogeneous environment consisting of different disk models. For both environments, ORE provides significant performance enhancements. In the following, we start with a brief overview of the trace driven simulation model. Next, we present the obtained results for each environment and our observations.
The traces were gathered from a production Oracle database management system on an HP workstation configured with 4 gigabytes of memory and 5 terabytes of storage devices (283 raw devices). The database consisted of 70 tables and is 27 gigabytes in size. The traces were gathered from 4 pm, April 12 to 1 pm, April 23, 2001. They correspond to 23 million operations on the data blocks. The file reference pattern is skewed: approximately 83% of accesses reference 10% of the files. Moreover, accesses to the tables are bursty

as a function of time. This is demonstrated in Figure 4.3, where we plot the number of requests to the system as a function of time. In all experiments,
the duration of a time slice is 6 minutes, i.e., each tick on the x-axis of the
presented figures is 6 minutes long.


Fig. 4.3. Number of requests as a function of time

We used the Java programming language to implement our simulation model. It consists of 3 class definitions:
1. Disk: This class definition simulates a multi-zone disk drive with a com-
plete analytical model for computing seeks, rotational latency, and trans-
fer time. When a disk object is instantiated, it reads its system parameters
from a database management system. Hence, we can configure the model
with different disk models and different number of disks for each model.
A disk implements a simplified version of the EVEREST file system.
2. Client: The client generates requests for the different blocks by reading
the entries in the trace files.
3. Network Switch: This class definition implements a simplified SAN switch
that routes messages between the client and the disk drives. The file
manager is a component of this module. The file manager services each
request generated by a client. It controls and maintains the placement of
data across disk drives. Given a request for a block of a file, this module

locates the fragment referenced by the request and resolves which disk
contains the referenced data. It consults with the file system of the disk
drive to identify the appropriate cylinder and track that contains the
referenced block.

The file manager implements the 3-step re-organization algorithm of ORE (see Section 4).
We conducted experiments with both a large configuration consisting of
283 raw devices that corresponds to the physical system that produced the
traces and smaller configurations. The smaller configurations are faster to
simulate. The performance results presented in this chapter are based on one
such configuration consisting of 9 disk drives. We analyze two environments:
First, a homogeneous one consisting of nine 180 gigabyte disk drives with a
transfer rate of 40 megabytes per second (MB / sec). These disks were mod-
elled after the high density, Ultra160 SCSI/Fibre-Channel disks introduced
by Seagate in late 2000. Our second environment is a heterogeneous one con-
sisting of three different disk models: 1) three disk drives identical to those
used with the homogeneous environment, 2) three 60 gigabyte disk drives,
each with a transfer rate of 20 MB/sec, and 3) three 20 gigabyte disk drives,
each with a transfer rate of 4 MB/sec.

4.4 Homogeneous Configuration

4.4a. Starting with time slice 1 4.4b. Starting with time slice 200
Fig. 4.4. Cumulative average response time for the homogeneous environment

Figure 4.4 shows the performance of alternative predict techniques using the trace. The x-axis of this figure denotes time, i.e., different time slices. The y-axis is the cumulative average response time. It is computed as follows. For each time slice, we compute the total number of requests and the sum of all response times till the end of that time slice. The cumulative average response time is the ratio of these two numbers, i.e., total response time / total requests. If during a time slice, no requests are issued then the cumulative average response time remains constant. This explains the periodic flat portions.
In addition to EVEN and EVENC/B, these figures present the response
time for three other configurations. These correspond to:
• No-reorganization: this represents the base configuration that processes
requests without on-line reorganization.
• Optimal: this configuration assigns requests to the disks in a round-robin
manner, ignoring the placement of data and files referenced by each re-
quest. This configuration represents the theoretical lower bound on re-
sponse time that can be obtained from the 9 disk configuration.
• Heat-Based: This is an implementation of the re-organization algorithm
presented in [SWZ98]. Briefly, this algorithm monitors the heat [CAB+88]
of disks and migrates the fragment with highest temperature from the
hottest disk to the coldest one if: a) the heat of the target disk after
this migration does not exceed the heat of the source disk, and b) the
hottest disk does not have a queue of pending requests. The heat of a
fragment is defined as the sum of the number of block accesses to the
fragment per time unit, as computed using statistical observation during
some period of time. The temperature of a fragment is the ratio between
its heat and size. The heat of a disk is the sum of the heat of its assigned
fragments [CAB+88,KH93].
Figures 4.4a and b show the cumulative average response time starting with the 1st and 200th time slice, respectively. The former represents a cold start while the latter is a warm start after 20 hours of using the framework. In both cases, ORE is a significant improvement when compared with no-reorganization. (ORE refers to the framework consisting of the three possible algorithms: EVEN, EVENC/B, and Heat-Based.) The peaks in this figure
correspond to the bursty arrival of requests which result in the formation of
queues. Even though Optimal assigns requests to the nodes in a round-robin
manner, it also observes formation of queues because many requests arrive
in a short span of time.
We also analyzed the performance of alternative algorithms on a daily
basis. This was done as follows. We set the cumulative average response time
to zero at midnight on each day. When compared with the theoretical Opti-
mal, ORE is slower by an order of magnitude. Figure 4.5 shows how inferior
EVEN, EVENC/B and Heat-Based are when compared with Optimal. The
y-axis on this figure is the percentage difference between an algorithm (say
EVEN) and Optimal. A large percentage difference is undesirable because it
is further away from the ideal. We show two different days, corresponding to
the best and worst observed performance. During day 2, ORE is 50 to 300
percent slower than the theoretical Optimal. During day 6, ORE is at times
several orders of magnitude slower than Optimal.


4.5a Day 2 4.5b Day 6


Fig. 4.5. Percentage degradation relative to Optimal

4.5 Heterogeneous Configuration


We analyzed the performance of a heterogeneous configuration consisting of 9
disk drives. These disks correspond to 3 different disk models: 1) 180 gigabyte
disks with a 40 megabyte per second transfer rate, 2) 60 gigabyte disks with
a 20 megabyte per second transfer rate, and 3) 20 gigabyte disks with a 4
megabyte per second transfer rate. Our environment consisted of 3 disks for
each model.
The experimental setup differed from that of Section 4.4 in several ways.
First, we increased the block size to 128 kilobytes. With a 2 kilobyte block
size, the bandwidth of each disk is almost identical because the seek and
rotational delays constitute more than 99% of transfer time, see Equation 2.
Second, we do not have Optimal. With a heterogeneous configuration, the
faster disks can service requests faster and it is no longer optimal to assign
requests to disks in a round-robin manner. Similarly, we eliminated the Heat-
Based technique because its extension to a heterogeneous environment would
be similar to EVEN.
Figure 4.6 shows the cumulative average response time of the system
with EVEN, EVENC/B, and no-reorganization. These results demonstrate
the superiority of ORE as a re-organization framework. EVENC/B enhances
performance for several reasons. First, it migrates the fragments with a high
imposed load to the faster disks, processing a larger fraction of requests faster.
Thus, when a burst of requests is issued to these fragments, each request
spends less time in the queue. Second, it migrates the fragments that are
referenced together onto different disk drives in order to minimize the incurred
wait time (using the concept of OVERLAP).
We compared EVEN with EVENC/B on a day-to-day basis. This proce-
dure is identical to that of the homogeneous configuration where the cumu-


4.6a. Starting with time slice 1 4.6b. Starting with time slice 200
Fig. 4.6. Cumulative average response time for the heterogeneous environment

lative average response time is reset to zero at the beginning of each day, 12 am. Generally speaking, EVENC/B is superior to EVEN. In Figures 4.7a and b, we show the percentage degradation relative to EVENC/B observed for two different days, day 3 and 6. These correspond to the best and worst observed performance with EVEN. During day 3, EVEN provides a performance that is at times better than EVENC/B. During day 6, EVEN exhibits a performance degradation that is several orders of magnitude slower than EVENC/B. In this case, no re-organization outperforms EVEN.


4.7a. Day 3 4.7b. Day 6


Fig. 4.7. Percentage degradation relative to EVENC/B

5 Conclusions and Future Research Directions

This chapter provides an overview of techniques to realize a parallel, scalable, high performance database management system. We described the role of
alternative partitioning strategies to distribute the workload of a query across
multiple nodes. Next, we described the design of parallel sort-merge, Grace
and Hybrid hash join to process the join operator. Finally, we detailed ORE
as a three step framework that controls the placement of fragments in order
to respond to: a) changing workloads, and b) dynamic hardware platforms
that evolve over a period of time. We demonstrated the superiority of this
framework using a trace driven evaluation study.
Physical design of parallel database management systems is an active
area of research. One emerging complementary effort is in the area of Stor-
age Area Network (SAN) architectures that strive to minimize the cost of
managing storage. A SAN is a special-purpose network that interconnects
different data storage devices with servers. While there are many definitions
for a SAN, there is a general consensus that it provides access at the gran-
ularity of a block and is typically targeted toward database applications. A
SAN might include an embedded storage management software in support of
virtualization. This software includes a file system that separates storage of
a device from the physical device, i.e., physical data independence. Virtual-
ization is important because it enables a file to grow beyond the capacity of
one disk (or disk array). Such embedded file systems can benefit from ORE
and its 3 step framework [GGG+01].
Another important research direction is an online capacity planner that
is aware of an application's performance requirements, e.g., desired response
time guarantees at pre-specified throughputs. This component should detect
when a system is not meeting the desired requirements and suggest changes
to the hardware platform. With a SAN, this might be an integral component
of the embedded file system. Such a capacity planner empowers the human
operators to address performance limitations effectively.
Finally, we plan to extend ORE to incorporate availability [PGK88,ZG00]
techniques. These techniques construct redundant data in order to continue
operation in the presence of disk failures. For example, chain decluster-
ing [HD90,HD91,GM94] constructs a backup copy of a fragment assigned
to node 1 onto an adjacent node 2. The original fragments on node 1 are
termed primary while their backup copies on node 2 are termed secondary.
If node 1 fails, the system continues operation using secondary copies stored
on node 2. While ORE, see Section 4, controls the placement of data based
on the availability needs of a fragment, it does not consider the placement of
primary and secondary copies when migrating fragments from one node to
another. As a simple example, it can switch the role of primary and backup
copies to respond to workload changes.

Acknowledgments
We wish to thank Anouar Jamoussi and Sandra Knight of BMC Software
for collecting and providing traces used in this study. We also thank William
Wang, Sivakumar Sethuraman, and Dinakar Yanamandala of USC for assist-
ing with the implementation of our simulation model.

References
[AKN+97] Aref, W., Kamel, I., Niranjan, T., Ghandeharizadeh, S., Disk schedul-
ing for displaying and recording video in non-linear news editing sys-
tems, Proc. Multimedia Computing and Networking Conference, SPIE
Proceedings, vol. 3020, 1997, 1003-1013.
[Bab79] Babb, E., Implementing a relational database by means of specialized
hardware, ACM Transactions on Database Systems 4(1), 1979, 1-29.
[BAC+90] Boral, H., Alexander, W., Clay, L., Copeland, G., Danforth, S.,
Franklin, M., Hart, B., Smith, M., Valduriez, P., Prototyping Bubba,
a highly parallel database system, IEEE Transactions on Knowledge
and Data Engineering 2(1), 1990, 4-24.
[BFG+95] Baru, C.K., Fecteau, G., Goyal, A., Hsiao, H., Jhingran, A., Pad-
manabhan, S., Copeland, G.P., Wilson, W.G., DB2 Parallel Edition,
IBM Systems Journal 34(2), 1995, 292-322.
[BGM+94] Berson, S., Ghandeharizadeh, S., Muntz, R, Ju, X., Staggered strip-
ing in multimedia information systems, Proc. ACM Special Interest
Group on Management of Data, Minneapolis, Minnesota, SIGMOD
Record 23(2), 1994, 79-90.
[Bra84] Bratbergsengen, K., Hashing methods and relational algebra oper-
ations, Proc. Very Large Databases Conference, Singapore, Morgan
Kaufmann, 1984,323-333.
[BSC00] Bhatia, R., Sinha, R.K., Chen, C., Declustering using Golden Ratio
Sequences, Proc. 16th International Conference on Data Engineering,
San Diego, California, 2000, 271-280.
[CAB+88] Copeland, G., Alexander, W., Boughter, E., Keller, T., Data place-
ment in Bubba, Proc. ACM Special Interest Group on Management
of Data, Chicago, Illinois, SIGMOD Record 17(3), 1988, 99-108.
[CR93] Chen, L.-T., Rotem, D., Declustering objects for visualization, Proc.
Very Large Databases Conference, Dublin, Ireland, Morgan Kauf-
mann, 1993, 85-96.
[DG85] DeWitt, D.J., Gerber, R., Multiprocessor hash-based join algorithms,
Proc. Very Large Databases Conference, Stockholm, Sweden, Morgan
Kaufmann, 1985, 151-164.
[DGS+90] DeWitt, D., Ghandeharizadeh, S., Schneider, D., Bricker, A., Hsiao,
H., Rasmussen, R., The Gamma database machine project, IEEE
Transactions on Knowledge and Data Engineering 2(1), 1990, 44-62.
[DKO+84] DeWitt, D.J., Katz, R.H., Olken, F., Shapiro, L.D., Stonebraker,
M.R., Wood, D., Implementation techniques for main memory data-
base systems, ACM Special Interest Group on Management of Data
Record 14(2), 1984, 1-8.
218 S. Ghandeharizadeh, S. Gao, C. Gahagan, and R. Krauss

[DS82] Du, H.C., Sobolewski, J.S., Disk allocation for Cartesian product files
on multiple-disk systems, ACM Transactions on Database Systems
7(1), 1982, 82-101.
[FB93] Faloutsos, C., Bhagwat, P., Declustering using fractals, Proc. 2nd In-
ternational Conference on Parallel and Distributed Information Sys-
tems, 1993, 18-25.
[FM89] Faloutsos, C., Metaxas, D., Declustering using error correcting codes,
Proc. Symp. on Principles of Database Systems, 1989, 253-258.
[GCK+93] Ghandeharizadeh, S., Choi, V., Ker, C., Lin, K., Design and imple-
mentation of the Omega object-based system, Proc. 4th Australian
Database Conference, 1993, 198-209.
[GD90] Ghandeharizadeh, S., DeWitt, D., A multiuser performance analysis
of alternative declustering strategies, Proc. 6th IEEE Data Engineer-
ing Conference, 1990, 466-475.
[GD92] Ghanderharizadeh, S., DeWitt, D., A performance analysis of al-
ternative multi-attribute declustering strategies, Proc. ACM Special
Interest Group on Management of Data, San Diego, California, SIG-
MOD Record 21(2), 1992, 29-38.
[GD94] Ghandeharizadeh, S., DeWitt, D.J., MAGIC: a multiattribute declus-
tering mechanism for multiprocessor database machines, IEEE Trans-
actions on Parallel and Distributed Systems 5(5), 1994, 509-524.
[GG97] Gray, J., Graefe, G., The Five-Minute Rule ten years later, and other
computer storage rules of thumb, ACM Special Interest Group on
Management of Data Record 26(4), 1997, 63-68.
[GGG+01] Ghandeharizadeh, S., Gao, S., Gahagan, C., Krauss, R., An on-line
reorganization framework for embedded SAN file systems, Submitted
for publication, 2001.
[Gib92] Gibson, G., Redundant disk arrays: reliable, parallel secondary stor-
age, The MIT Press, 1992.
[GIZ96] Ghandeharizadeh, S., Ierardi, D., Zimmermann, R., An algorithm for
disk space management to minimize seeks, Information Processing
Letters 57, 1996, 75-81.
[GIZ01] Ghandeharizadeh, S., Ierardi, D., Zimmermann, R., Management of
space in hierarchical storage systems, M. Arbib, J. Grethe (eds.), A
Guide to Neuroinformatics, Academic Press, 2001.
[GM94] Golubchik, L., Muntz, R.R., Fault tolerance issues in data decluster-
ing for parallel database systems, Data Engineering Bulletin 17(3),
1994, 14-28.
[GO93] Gottemukkala, V., Omiecinski, E., The sensible sharing approach
to a scalable, high-performance database system, Technical Report
GIT-CC-93-24, Georgia Institute of Technology, 1993.
[Gra93] Graefe, G., Query evaluation techniques for large databases, ACM
Computing Surveys 25(2), 1993, 73-170.
[Gra94a] Graefe, G., Sort-merge-join: an idea whose time has passed? Proc.
IEEE Conf. on Data Engineering, 1994, 406-417.
[Gra94b] Graefe, G., Volcano - an extensible and parallel query evaluation sys-
tem, IEEE Transactions on Knowledge and Data Engineering 6(1),
1994, 120-135.
[HD90] Hsiao, H., DeWitt, D., Chained declustering: a new availability strat-
egy for multiprocessor database machines, Proc. 6th International
Data Engineering Conference, 1990, 456-465.
[HD91] Hsiao, H.-I., DeWitt, D., A performance study of three high avail-
ability data replication strategies, Proc. 1st International Conference
on Parallel and Distributed Information Systems, 1991, 18-28.
[HL90] Hua, K., Lee, C., An adaptive data placement scheme for parallel
database computer systems, Proc. Very Large Databases Conference,
Brisbane, Australia, Morgan Kaufmann, 1990, 493-506.
[KH93] Katz, R.H., Hong, W., The performance of disk arrays in shared-
memory database machines, Distributed and Parallel Databases 1(2),
1993, 167-198.
[KP88] Kim, M.H., Pramanik, S., Optimal file distribution for partial match
retrieval, Proc. ACM Special Interest Group on Management of Data,
Chicago, Illinois, SIGMOD Record 17(3), 1988, 173-182.
[LKB87] Livny, M., Khoshafian, S., Boral, H., Multi-disk management algo-
rithms, Proc. 1987 ACM SIGMETRICS Conference on Measurement
and Modeling of Computer Systems, 1987, 69-77.
[LKO+00] Lee, M.L., Kitsuregawa, M., Ooi, B.C., Tan, K., Mondal, A., To-
wards self-tuning data placement in parallel database systems, Proc.
ACM Special Interest Group on Management of Data, Dallas, Texas,
SIGMOD Record 29(2), 2000, 225-236.
[LSR92] Li, J., Srivastava, J., Rotem, D., CMD: a multidimensional decluster-
ing method for parallel data systems, Proc. 18th Very Large Databases
Conference, Vancouver, Canada, Morgan Kaufmann,
1992, 3-14.
[MS98] Moon, B., Saltz, J., Scalability analysis of declustering methods for
multidimensional range queries, IEEE Transactions on Knowledge
and Data Engineering, 10(2), 1998, 310-327.
[NH84] Nievergelt, J., Hinterberger, H., The grid file: an adaptive, symmetric
multikey file structure, ACM Transactions on Database Systems 9(1),
1984, 38-71.
[NKT89] Nakano, M., Kitsuregawa, M., Takagi, M., Query execution for large
relation on functional disk system, Proc. 5th International Conference
on Data Engineering, Los Angeles, 1989, 159-167.
[OO85] Ozkarahan, E., Ouksel, M., Dynamic and order preserving data par-
titioning for database machines, Proc. Very Large Databases Confer-
ence, Stockholm, Sweden, 1985, 358-368.
[Ora94] Oracle & Digital, Omcle Pamllel Server in Digital Environment,
Technical Report, Oracle Inc., 1994.
[PGK88] Patterson, D., Gibson, G., Katz, R., A case for Redundant Arrays
of Inexpensive Disks (RAID), Proc. ACM Special Interest Group on
Management of Data, Chicago, Illinois, SIGMOD Record 17(3), 1988,
109-116.
[RE78] Ries, D., Epstein, R., Evaluation of distribution criteria for dis-
tributed database systems, Technical Report UCB/ERL, Technical
Report M78/22, UC Berkeley, 1978.
[SD89] Schneider, D.A., DeWitt, D.J., A performance evaluation of four
parallel join algorithms in a shared-nothing multiprocessor environ-
ment, Proc. ACM Special Interest Group on Management of Data,
Portland, Oregon, SIGMOD Record 18(2), 1989, 110-121.
[Sha86] Shapiro, L., Join processing in database systems with large main
memories, ACM Transactions on Database Systems 11(3), 1986, 239-
264.
[SKP+88] Stonebraker, M., Katz, R., Patterson, D., Ousterhout, J., The de-
sign of XPRS, Proc. Very Large Databases Conference, Los Angeles,
California, Morgan Kaufmann, 1988, 318-330.
[SS82] Siewiorek, D.P., Swarz, R.S., The theory and practice of reliable sys-
tem design, Digital Press, 1982.
[SWZ98] Scheuermann, P., Weikum, G., Zabbak, P., Data partitioning and
load balancing in parallel disk systems, Very Large Databases Journal
7(1), 1998, 48-66.
[Tan88] Tandem Performance Group, A benchmark of NonStop SQL on the
debit credit transaction, Proc. ACM Special Interest Group on Man-
agement of Data, Chicago, Illinois, SIGMOD Record 17(3), 1988,
337-341.
[Ter85] Teradata Corp., DBC/1012 data base computer system manual, Doc-
ument No. C10-0001-02, Release 2.0, Teradata Corp., 1985.
[VG84] Valduriez, P., Gardarin, G., Join and semi-join algorithms for a multi-
processor database machine, ACM Transactions on Database Systems
9(1), 1984, 133-161.
[VBW98] Vingralek, R., Breitbart, Y., Weikum, G., Snowball: scalable stor-
age on networks of workstations with balanced load, Distributed and
Parallel Databases 6(2), 1998, 117-156.
[ZG00] Zimmermann, R., Ghandeharizadeh, S., HERA: heterogeneous exten-
sion of RAID, Proc. International Conference on Parallel and Dis-
tributed Processing Techniques and Applications, Las Vegas, Nevada,
2000, 103-113.
6. Advanced Database Systems

Gottfried Vossen

Dept. of Information Systems, University of Münster, Germany


and PROMATIS Corp., San Ramon, California, USA
1. Introduction ..................................................... 222
2. Preliminaries..................................................... 227
2.1 Basics from Relational Databases ............................ 227
2.2 Relational Algebra and the Calculi ........................... 229
2.3 Rule-Based Languages ....................................... 231
3. Data Models and Modeling for Complex Objects ................. 234
3.1 Complex Objects and Object-Orientation .................... 236
3.2 Object-Relational Databases................................. 239
3.3 Designing Databases with Objects and Rules................. 241
3.4 Semi-Structured Data and XML .............................. 243
4. Advanced Query Languages ...................................... 249
4.1 Object-Based Languages. Path Expressions................... 249
4.2 Querying Semi-Structured Data and XML Documents ........ 252
4.3 XQuery ...................................................... 255
4.4 Towards a Foundation of Procedural Data .................... 256
4.5 Meta-SQL ................................................... 260
5. Advanced Database Server Capabilities .......................... 262
5.1 RAID Architectures .......................................... 263
5.2 Temporal Support ........................................... 266
5.3 Spatial Data................................................. 268
5.4 Transactional Capabilities, Workflows, and Web Services ..... 270
6. Conclusions and Outlook ......................................... 274

Abstract. Database systems have evolved into a ubiquitous tool in computer ap-
plications over the past 35 years, and they offer comprehensive capabilities for stor-
ing, retrieving, querying, and processing data that allow them to interact efficiently
and appropriately with the information-system landscape found in present-day fed-
erated enterprise and Web-based environments. They are standard software on vir-
tually any computing platform, and they are increasingly used as an "embedded"
component in both large and small (software) systems (e.g., workflow management
systems, electronic commerce platforms, Web services, smart cards); they continue
to grow in importance as more and more data needs to get stored in a way that
supports efficient and application-oriented ways of processing. As the exploitation
of database technology increases, the capabilities and functionality of database sys-
tems need to keep pace. Advanced database systems try to meet the requirements
of present-day database applications by offering advanced functionality in terms of
data modeling, multimedia data type support, data integration capabilities, query
languages, system features, and interfaces to other worlds. This article surveys the
state-of-the-art in these areas.

1 Introduction
The practical need for efficient organization, creation, manipulation, and
maintenance of large collections of data, together with the recognition that
data about the real world, which is manipulated by application programs,
should be treated as an integrated resource independently of these programs,
has led to the development of database management. In a nutshell, a data-
base system consists of a piece of software, the database management system,
and some number of databases. Modern database systems are mostly client-
server systems where a database server is responding to requests coming
from clients; the latter could be end-users or applications (e.g., a browser on
a notebook, a query interface on a palmtop) or even application servers (e.g.,
a workflow management system, a Web server). Database systems have be-
come a fundamental tool in many applications over the past 35 years, ranging
from the original ones in administrative and business applications to more
recent ones in science and technology, and to current ones in electronic com-
merce and the World-Wide Web. They are now standard software on virtually
any computing platform; in particular relational database systems, that is,
database systems that are based on the relational model of data, are avail-
able on any type of machine, from laptops (or even personal digital assistants
and smartcards) to large-scale supercomputers. Moreover, database servers
and systems continue to grow in importance as more and more data needs to
get stored in a way that supports efficient and application-oriented ways of
processing.
Historically, database systems started out in the late 60s as simple data
stores that added a conceptual level to file systems and could hence provide
an early form of data independence; the field was soon taken over
by relational systems in the 70s. However, the quest for departing from pure
relational systems has been around for more than 20 years; indeed, techni-
cal areas such as CAD/CAM or CASE have early on demanded so-called
"non-standard" database systems that departed from simple data types such
as numbers and character strings; later, applications such as geography or
astronomy requested an integration of images, text, audio, and video data.
Nowadays, database systems are strategic tools that are integrated into the
enterprise-wide landscape of software and its information-related processes
and workflows, and to this end can provide features such as user-defined
types, standardized data exchange formats, object semantics, user-defined
functions, rules, powerful query tools, and sophisticated transactional tech-
niques. They support a variety of applications from simple data tables, to
complex data integration from multiple sources, to analysis by warehousing,
to business processes through a close interaction even with workflow man-
agement systems. This requires a number of properties, and we try to give a
glimpse of these properties and features in this article.
Commercial vendors have been picking up on these developments for a
variety of reasons:
1. The requirement of putting complex objects in a database is largely understood
   from a conceptual point of view, both from the perspective of a
data model and from that of a language. For many specific problems and
research topics, it is by now clear what can and what cannot realistically
be done.
2. Database systems can be interfaced well with a variety of external tools,
including graphical user interfaces (GUls), object-oriented programming
languages, and object-oriented interfaces between packaged components
such as the Common Object Request Broker Architecture (CORBA), see
[OHE97,Vos97], or the Distributed Component Object Model (DCOM).
Connections between database systems and Java have been investigated
in detail and are readily available for smooth object transitions between
databases and programs [ME00,Ric01].
3. There is a continuously growing set of applications that wants to put more
and more data into a database, for reasons such as declarative access,
transactional properties, efficient storage utilization, and fast retrieval. A
typical example is data that is pumped down to earth from a satellite or
data that is stored in a digital library. This often goes together with the
requirement of being able to analyze that data for statistical and other
purposes or to perform mining tasks on it.
4. Many new applications, in particular those that are Web-enabled, cannot
even work properly without database support, typically in some form of
direct connectivity, e.g., by accessing a database from a Web browser.
Examples of such applications include electronic banking and, more generally,
electronic commerce, or the emerging areas of Web services and
Web-based learning. In this context, the arrival of XML in the database
world has created the demand for being able to handle corresponding
documents in a database, and these documents are typically exchanged
over Internet and Web, or integrated from a variety of sources.

Comment 1 above in particular applies to object-oriented database systems
[KM94], which grew out of the desire to combine database system function-
ality with the programming paradigm of object-orientation. Note that Com-
ment 1 is not meant to imply that no research is needed any more in the
database area; it is only that system developers start making use of various
research achievements obtained over the past 15 years. In particular, data-
base research on complex objects as well as on putting structure and behavior
together has mostly concentrated on data of a single medium. More recently,
multimedia databases try to take care of the integration and maintenance of
multiple such data taken together. Comment 3 above refers to the emerging
areas of data warehousing and data mining, which separate data for analytical
purposes from operational data and then apply online analytical processing
(OLAP) and mining techniques to the warehouse [Gar98].
It is worth noting that a common environment in which database systems
are found today is a distributed one consisting of a wide variety of application

[Figure content: users and clients at the top issue requests to application
servers, which in turn access database servers and the underlying databases.]

Fig. 1.1. Federated system architecture

and data servers, as indicated in Figure 1.1. These servers will typically be
heterogeneous in that they use different products, different interfaces, differ-
ent data models and access languages. In addition, the servers can differ in
their degree of autonomy in the sense that some may focus on the workload
of a specific business process (e.g., stock exchange), while less autonomous
servers may be willing to interact with other servers. We are not elaborating
here on the technical problems that have to be solved when living in a fed-
erated system [WV02], but we will use Figure 1.1 as a motivation for data
integration challenges we will come across.
Advanced database systems try to meet their requirements by offering
advanced functionality in terms of data modeling and integration support,
query languages, and system features; we will survey these areas below. Data
modeling essentially refers to the question of how to build a high-quality
database schema for a given application (and to maintain it over time under
evolution requests). Data modeling is commonly perceived as a multi-step
process that tries to integrate static and dynamic aspects derived from data
and functional requirements. Recent achievements here include the possibility
to take integrity constraints into account, and to design triggers or even event-
condition-action (ECA) rules that capture active aspects. There are now even
methodologies that make it possible to design and model advanced applications
in terms of a unified framework which can then be declared and used in an
advanced database system.
The latter is due to the fact that a category of products is now com-
mon which is called object-relational (OR). Corresponding systems essen-
tially combine capabilities of object-oriented (OO) database systems with
improved SQL capabilities, but in particular stick to the relational view of
data in that two-dimensional tables are considered the appropriate means to
present (most) data on a computer screen. As do OO systems, OR systems
allow the definition of data of virtually any type, including text, audio data,
video data, images, pictures, and more recently XML; moreover they support the storage
and manipulation of such data, and are hence suited for the emerging area
of multimedia and Internet applications.
Query languages in advanced database systems need a closer integration
of data and programs than relational systems used to have. This is a kind
of dilemma, since database systems have in the first place always been an
attempt to separate data from the programs that access it. The paradigm of
object-orientation suggests that giving up this separation in a controlled way
is a good idea, and consequently advanced systems follow this suggestion, by
offering, for example, the possibility to encapsulate a specific data type with
functions specifically designed for it, or to inherit functions from one type to
another. As has been proved by the many proposals for rule-based languages,
there is even ground for bringing declarativeness and object-orientation to-
gether. However, encapsulation of data and programs in an 00 sense is only
one side of the coin. Another is the handling of procedural data that pops up
in a database system in a variety of ways. Examples include data dictionar-
ies, stored procedures, view-definition maintenance, or Web-log analysis. For
putting procedural data into a database and for handling it in a way that
is appropriate for programs, only a few proposals exist which have not yet
found their way into commercial systems.
The last aspect mentioned above, that of advanced system features, refers
to a wide collection of aspects which have previously been studied in isolation,
and which are tentatively brought together in an advanced database system.
They include the support of historical as well as temporal data, spatial data
as in geographic applications, multidimensional data and their storage struc-
tures, but also advanced transaction concepts going beyond standard ACID
transactions and serializability. System functionality in any such direction
typically goes along with advanced capabilities at a higher level of abstrac-
tion, e.g., the data model or the query language of the respective system, so
that there is rarely a need to cover it in isolation. On the other hand, ad-
vanced query optimization functionality or parallel architectures do go with-
out a conceptual counterpart, and remain largely invisible to the end-user.
The above separation of data modeling and integration, query languages,
and system features follows a traditional perspective for a database imple-
mentor, namely to view database server functionality as being organized in
four different layers as shown in Figure 1.2: The interface layer manages
the interfaces to the various classes of users, including a database adminis-
trator, casual users and application programmers. The language processing
layer has to process the various forms of requests that can be directed against

[Figure content: the Interface Layer (data model and query language, host
language interfaces, other interfaces); the Language Processing Layer (view
management, semantic integrity control, authorization, language compiler and
interpreter, query decomposition, query optimization, access plan generation);
the Transaction Management Layer (access plan execution, transaction
generation, concurrency control, recovery); and the Storage Management Layer
(physical data structure management, buffer management, disk accesses).]

Fig. 1.2. Functional layers of a database server

a database. A query is decomposed into elementary database operations; the
resulting sequence is then subject to optimization with the goal of avoiding
executions with poor performance. An executable query or program is passed
to the transaction management layer, which is in charge of controlling con-
current accesses to a shared database ("concurrency control") and at the
same time makes the system resilient against possible failures. Finally, the
storage management layer takes care of the physical data structures (files,
pages, indexes) as well as buffer management and performs disk accesses.
This view remains valid in advanced systems, but with many additions
and enhancements under the surface, the most relevant of which we will try
to touch upon in this chapter. For the purposes of this chapter, we assume the
reader to have a basic familiarity with databases and database management
systems in general, and with the relational model in particular, for example
along the lines of [EN00,RG00,GUW02,SKS02]. We will survey several tech-
nical preliminaries in Section 2 for the sake of completeness, but without too
much depth. We will then look into the three major areas just sketched in
more detail. Section 3 is about advanced data modeling and integration, Sec-
tion 4 on advanced language capabilities, and Section 5 on advanced system
functionality. Some conclusions and future prospects are given in Section 6.
We mention that various topics that could also be attributed to "ad-
vanced" systems and their capabilities will not be discussed in detail here;
these include (as mentioned) object-oriented database systems, parallel
and distributed database systems, data warehousing and data mining, and
Internet-oriented systems.

2 Preliminaries

In this section we briefly put various preliminaries together; further details
can be found in a variety of textbooks, including [EN00,RG00,Ull88,GUW02]
or [SKS02].

2.1 Basics from Relational Databases

The relational model [Cod70] is based on the mathematical notion of a relation
and organizes data in the form of tables. A table has attributes describing
properties of data objects (as headline) and tuples holding data values (as
other rows). A table is hence a set of tuples, where tuple components can
be identified by their associated attributes. This is restricted, for example,
from the point of view of types in programming languages, since it only
allows the application of a tuple constructor to attributes and given base
domains, followed by the application of a set constructor. On the other hand,
the simplicity of the relational model allows an elegant and in-depth formal
treatment [AHV95, Ull88, Ull89, Vos96].
Figure 2.1 shows a sample relational database describing computers with
their IP addresses, some of their users and HTTP documents whose structure
refers to other documents; finally, there is a log keeping track of which user
accessed what document in the context of a session. The example exhibits the
most important features of a relational database: Relations have attributes
which can take values from a given domain, e.g., of type integer, date, or
string. Each such value is atomic, i.e., not composed of other values. Each
relation has one or more attributes that identify its tuples uniquely and min-
imally; such attributes are called a key. For example, an IP-Address uniquely
identifies a computer, or a URL uniquely identifies a document. Moreover, the
source and target URL of a structure only form a key if considered together;
in other words, there is a multivalued relationship between the two (indeed,
a source document can refer to multiple targets, as shown by the example).
Finally, various relations can be glued together through global constraints,
such as inclusion dependencies; for example, a ClientIP occurring in the log
relation should of course be the IP-Address of an existing computer.
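To make these constraints concrete, the keys and the inclusion dependency just
mentioned could be declared in SQL roughly as follows. This is only a sketch:
the attribute lengths are invented, the hyphenated name IP-Address is written
with an underscore, and the Log column User is renamed LogUser to obtain legal
SQL identifiers; the key of Log is assumed here to be the combination of user
and sequence number.

create table Computer
( IP_Address  varchar(15) primary key,
  DomainName  varchar(50),
  OSType      varchar(20) );

create table Log
( LogUser   varchar(40),
  ClientIP  varchar(15) references Computer (IP_Address),
  No        integer,
  URL       varchar(50),
  primary key (LogUser, No) );

The references clause captures the inclusion dependency: every ClientIP stored
in Log must be the IP_Address of an existing Computer tuple.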
The schema of a relational database generally refers to a particular appli-
cation, and is obtained through a design process that often starts from a more
abstract conceptual view [BCN92,MR92]. Conceptual design is mostly accom-
plished by employing a semantic data model such as the Entity-Relationship
(ER) model [Che76,FV95a,Tha00], in which the world is described in terms
of entities and their relationships. Transformation rules then state how to
derive relational schemata from the entity and relationship types present in
such a diagram. A variety of such rules have been described in the literature,
and present-day systems often come with design aids into which these rules
have been programmed.

Computer  IP-Address       DomainName     OSType
          128.176.159.168  ariadne.um.de  Unix
          128.176.158.86   helios.um.de   Unix
          128.176.6.1      www.um.de      Unix

Document  URL                   Content  Created
          www.um.de/index.html  Text     1997-10-18
          www.um.de/db.html     Text     1997-10-19

Structure  SourceURL             TargetURL
           www.um.de/index.html  www.um.de/index.html
           www.um.de/index.html  www.um.de/db.html

User  EMailAddress          Name
      lechten@helios.um.de  Jens
      vossen@helios.um.de   Gottfried

Log  User         ClientIP         No  URL
     lechten@...  128.176.159.168  1   www.um.de/index.html
     lechten@...  128.176.159.168  2   www.um.de/db.html

Fig. 2.1. A sample relational database

Figure 2.2 shows an ER diagram for the relational database from Figure
2.1. As can be seen, some relations stem from entity types, while others are
derived from relationship types (and some optimizations have already been
applied). In this example, it is even easy to go from one representation to the
other in either direction. In many applications, doing forward engineering
is as important as being able to do reverse engineering, in which a given
database is "conceptualized" [FV95b], e.g., for the purpose of migrating to
another data model.
An important observation regarding the entries in a relational table or the
types of attributes is that there is no "built-in" restriction in the database
concept saying that data has to be numerical or to consist of simple character
strings only. Indeed, by taking a closer look at Figure 2.1 we see that the URL
of a document is essentially a path expression that represents a unique local
address; we can easily imagine the path to be even globally unique or being
computed via an expression that takes, for example, search parameters into
account or that itself has an inner structure. In other words, a data entry in
a table could as well be the description of a program, and by the same token
it could be of a completely different type: an image in gif or jpg format, an
mp3 music file, an avi video. We will see later what the impact of this will
be and how such unconventional types can be handled in a database that is
essentially relational.

[Diagram showing the entity types Computer, User, and Document and the
relationship types connecting them.]
Fig. 2.2. An ER diagram for the sample relational database

2.2 Relational Algebra and the Calculi


The operational part of the relational model consists of algebraic operations
on relations as already defined by [Cod70]. These operations provide the
formal semantics of a relational query language known as relational alge-
bra. Most fundamental are five operations on relations: projection, selection,
union, difference, natural join. The first two of these are unary, while the
others are binary operations. A number of additional operations like inter-
section, Cartesian product, semi-join and division can be defined in terms of
these.
A projection of a relation R over attribute set X onto a set Y ⊆ X of
attributes restricts all tuples of R onto Y, and gets rid of duplicate elements.
A selection of R with respect to a Boolean condition C selects all those tuples
from R satisfying C. For example, referring back to the sample database from
Figure 2.1,
π_IP-Address(Computer)
produces
IP-Address
128.176.159.168
128.176.158.86
128.176.6.1

while
σ_DomainName = 'www.um.de'(Computer)
yields
IP-Address DomainName OSType
128.176.6.1 www.um.de Unix
The three binary operations we introduce are as follows. Union as well as
difference are the usual set operations, applicable to relations that have the
same attributes and are thus "compatible". The natural join of relations R
and S combines the tuples of the operands into new ones according to equal
values for common attributes. For example, to compute address and name of
those users who have participated in a session, we can write

π_EMailAddress, Name(σ_EMailAddress = User(User ⋈ π_User(Log)))


which yields
EMailAddress Name
lechten@helios.um.de Jens
We are not defining relational algebra, or RA for short, formally here, but
mention that an RA expression can be represented as a tree, the parse tree
of its generation, in which each internal node is labelled by an operator and
each leaf by the name of a relation schema. The parse tree is often taken as
a starting point for query optimization; see [YM98,EN00,SKS02] for details.
RA is a procedural language because the user is expected to specify how
the result is to be obtained. In addition, RA is closed since its operators take
one or more relations as operands and produce a result relation that may
in turn be an operand to another operator. Finally, RA expressions can be
evaluated efficiently, since all allowed operations have polynomial time com-
plexity. It was already observed by [Cod70] that there exists a declarative
counterpart to RA, relational calculus (RC). This observation was based on
the insight that a relation schema R with n attributes and a relation symbol
R in logic with arity n are similar concepts. Thus, a set of relation schemata
{R1, ..., Rk} occurring in a database schema can be considered as a vocab-
ulary of relation symbols; additionally, variables are needed which can range
over tuples (tuple calculus, RTC) or, alternatively, domain elements (domain
calculus, RDC).
Both RC languages have a solid theoretical foundation since they are
based on first-order predicate logic. Semantics is given to formulas by inter-
preting them as assertions on a given database. For example, the following
RTC expression finds the IP addresses of all computers:
{t | (∃u) (Computer(u) ∧ u(IP-Address) = t)}
We mention that every RA expression can be translated in time polynomial
in its size into an equivalent safe RC expression and vice versa. Thus, a
measure of completeness for relational query languages is available. Indeed,
RA, RDC, and RTC are frequently called Codd-complete, and other languages
are termed "complete" if their expressive power coincides with that of, say,
RA. Neither RA nor RC has been implemented in pure form in a relational
system. However, several language implementations have directly been based
on them; this in particular applies to the standard relational language SQL
[DD97].
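To make the correspondence concrete, the algebra query given above for
retrieving address and name of users with a session could be written in SQL
roughly as follows; this is only a sketch against the tables of Figure 2.1,
with the name User (used both as a table and as a column of Log) quoted since
it collides with an SQL keyword:

select distinct u.EMailAddress, u.Name
from   "User" u, Log l
where  u.EMailAddress = l."User";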
We have mentioned relational algebra and its declarative counterparts for
several reasons here: First, a data model always comes with a structural and
an operational part; in the case of the relational model, these are relations
(tables) on the one hand and the algebraic operations (or the logical formulas)
on the other. Second, real-world languages (such as SQL) generally take their
semantics from algebras, for the simple reason that algebra expressions can be
optimized. Thus, run-time efficiency has something to do with proper formal
underpinnings. Third, relational algebra can be, and has been, generalized
in a variety of ways and to a number of settings, including nested relations,
complex objects, object-oriented data models, procedural data, to mention
just a few. So studying the basics of relational algebra is not a bad idea
when solid foundations for advanced studies are needed. We will see mild
indications of this later in the chapter.

2.3 Rule-Based Languages


A third paradigm for relational database query languages, besides algebra
and calculus, has for a long time been the exploitation of logic programming
[AV82,VK76,AHV95] in the context of (relational) databases. In the area
of programming languages, this paradigm was influenced by artificial intel-
ligence and automated theorem proving, and brought about the language
Prolog that emphasizes a declarative style of programming. The confluence
of logic programming and the area of databases has been driven by the iden-
tification of several common features: First, logic programming systems man-
age small "databases" which are single-user, kept in main memory and con-
sist of fact information and deduction rules. Database systems, on the other
hand, deal with large, shared data collections kept in secondary memory, and
support efficient update and retrieval, reliability and persistence. Second, a
"query" ("goal") in logic programming is answered through a chain of de-
ductions built to prove or disprove some initial statement. A database query
is processed by devising an efficient execution plan for extracting the desired
information from the database. Third, dependencies specify which database
states are considered correct, and a database system is expected to enforce
them. In logic programming, constraints are rules that are activated whenever
the "database" is modified.
Considering these commonalities and the fact that declarative languages
are natural query languages especially in the relational model, it seems rea-
sonable to combine logic programming with databases. The result, a logic-
oriented or deductive database, is capable of describing facts, rules, queries
and constraints uniformly as sets of formulas in first-order logic, so that the
notion of logical implication can be used as a basis for defining the semantics
of a declarative language. The most prominent example for the use of logic
programming in the context of databases is Datalog, see [CGT90,Ull88,Ull89]
or [GUW02,AHV95]. This language has a syntax similar to that of (pure)
Prolog, but, unlike Prolog, is set-oriented, insensitive to orderings of retrieval
predicates or of the data, and has neither function symbols nor special-
purpose predicates (for controlling the execution of a program). The basic
idea behind Datalog and its application in querying a relational database is
to define new (derived or intensional) relations in terms of the (extensional)
relations in a given database and the new relations themselves, and to do so
by providing a set of Horn clauses.
Compared to relational algebra or calculus, a Datalog program has the
obvious new feature that it allows recursion and is thus able to compute
queries like the transitive closure of a binary relation. However, Datalog itself
is not more powerful than relational algebra; instead, the two languages (or
the sets of queries they can compute) are incomparable w.r.t. set inclusion.
Instead of giving formal details of Datalog, which can be found, for exam-
ple, in [Ull88, Ull89,AHV95], let us take a look at a typical example. Consider
the following relations which we assume are given in a database ("exten-
sional" relations):

Parent  Name      Child
        John      Jeff
        Jeff      Margaret
        Margaret  Annie
        John      Anthony
        Anthony   Bill

Person  Name      Age  Sex
        Paul      7    male
        John      78   male
        Jeff      55   male
        Margaret  32   female
        Annie     4    female
        Anthony   58   male
        Bill      34   male

Now consider the query "who are the children of John?" In SQL, we write
this as
select Child from Parent where Name = 'John'
whereas in Datalog we simply write
?- Parent(John, X).
The answer, computed from the above two tables, will be X = {Jeff, Anthony}.
The important feature is to be able to define new relations intensionally,
i.e., through rules. To this end, let Father and Mother be intensional relations,
defined from the extensional (given) relations above through the following
rules:
Father(X,Y) :- Person(X,_,male), Parent(X,Y).
Mother(X,Y) :- Person(X,_,female), Parent(X,Y).
Implicitly, this defines the following relations:
Father   Child
John     Jeff
Jeff     Margaret
John     Anthony
Anthony  Bill

Mother    Child
Margaret  Annie

We can now use intensional relations like extensional ones. For example, the
query "who is the mother of Annie?" is written as
?- Mother(X, Annie).
The answer is obtained by evaluating the right-hand side of the rule defining
the intensional relation in question, which is
Mother(X,Y) :- Person(X,_,female), Parent(X,Y).
Variable Y occurring in this rule is unified (equated) with value 'Annie', while
variable X is unified with value 'Margaret,' such that the following is obtained:
Person(Margaret, 32, female), Parent(Margaret, Annie).
This immediately gives the answer X = {Margaret}.
Another important feature is the possibility to make use of recursion in
Datalog, i.e., to let the same predicate occur in the body and in the head of a
rule. Consider the following rules defining predecessors, siblings, and cousins:
Predecessor(X,Y) :- Parent(X,Y).
Predecessor(X,Y) :- Parent(X,Z), Predecessor(Z,Y).
Sibling(X,Y) :- Parent(Z,X), Parent(Z,Y), not(X=Y).
Cousin(X,Y) :- Parent(X1,X), Parent(Y1,Y), Sibling(X1,Y1).
Cousin(X,Y) :- Parent(X1,X), Parent(Y1,Y), Cousin(X1,Y1).

An evaluation of these rules relative to the state shown earlier will yield the
following intensional relations (where attribute names are again shown for
clarity, but are not part of the definition):
Predecessor  Successor
John         Jeff
Jeff         Margaret
Margaret     Annie
John         Anthony
Anthony      Bill
John         Margaret
Jeff         Annie
John         Bill
John         Annie

Cousin1   Cousin2
Margaret  Bill
It is interesting to note that, although Datalog is not available as the query
language of a commercial database system (and probably never will be), fea-
tures that have emerged from Datalog (and related languages) can meanwhile
be found in SQL, among them recursion. Indeed, most commercial SQL im-
plementations now have a feature called the WITH clause, which can be used
to define an intensional relation which is computed recursively.
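As an illustration, the Predecessor relation of the example above could be
computed with such a clause roughly as follows; this is a sketch in SQL:1999
style, and concrete systems differ in details such as whether the keyword
RECURSIVE is required:

with recursive Predecessor(Pred, Succ) as
  ( select Name, Child from Parent
    union
    select p.Name, pr.Succ
    from   Parent p, Predecessor pr
    where  p.Child = pr.Pred )
select * from Predecessor;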
We recommend [Via97] as an excellent survey of rule-based languages,
in particular Datalog variants, e.g., with (inflationary or non-inflationary)
fixpoint semantics, model-theoretic semantics, semi-positive, stratified, or
well-founded semantics. This paper also provides expressive power as well
as complexity results. Alternatively, [Liu99] may be consulted. We should
also mention that the arrival of rules in the context of databases has ren-
dered the active database system paradigm possible, where the idea is to
integrate possibilities of monitoring events and testing conditions into a da-
tabase, and to trigger appropriate actions when necessary; details can be
found in [WC96,CCWOO,PD99j. Commercial systems provide limited active
capabilities through constraints, assertions, or triggers. More advanced is the
exploitation of logic programming in databases in the context of information
extraction and wrapping from the Web; to this end, we refer the reader to
[BFG01a,BFG01b] for the Lixto project that has recently been commercial-
ized.
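To give a flavor of the trigger capabilities mentioned above, the following
sketch records every newly inserted log entry in a separate table; the audit
table AccessAudit and its column layout are invented for the example, and the
syntax follows the SQL:1999 trigger style:

create trigger AuditAccess
after insert on Log
referencing new row as n
for each row
  insert into AccessAudit values (n."User", n.URL, current_timestamp);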

3 Data Models and Modeling for Complex Objects

We now embark on a discussion of advanced data models and modeling, and
this discussion has two major streams: On the one hand, it shows how
to make use of the features provided by modern database systems, which go
considerably beyond simple modeling with entities and relationships; systems
surrounding a database, e.g., programming systems, operating systems, and
application software nowadays know, for example, the notion of an object,
so that a seamless integration is desirable. On the other hand, due to the
wide dissemination of access to the Web, it has become important to allow a
high flexibility w.r.t. data structuring and composition; to this end, we will
touch upon XML, the Extensible Markup Language that has been developed
by the World-Wide Web Consortium (W3C) (see http://www.w3.org/xml
for latest versions of relevant documents). XML has been penetrating the
database field considerably during recent years, as it is considered an answer
to the question of how to handle semi-structured data appropriately and with
database technology.
A natural extension of the basic relational model is to allow nested rela-
tions, i.e., relations whose attributes can take other relations as values; this
has originally been proposed in [Mak77]. Jäschke and Schek [JS82] were the
first to propose an algebra for nested relations. Many other people looked

[Diagram showing the entity types Computer, User, and Document, now modeled
with nested tuple (x) and set (*) constructors.]

Fig. 3.1. An ER diagram for the sample database using nested tuples (x)
and sets (*)

into the question of how to generalize the results obtained for the flat model
to the nested one, and proved theorems about language equivalence or com-
pleteness, expressive power and complexity. Of particular relevance to nested
algebras are the structure-manipulating operations nest and unnest.
Figure 3.1 shows an alternative representation of the information previ-
ously shown in Figure 2.2, in which some aspects are modeled more directly.
In particular, a document now has a set of URLs that can be reached from it,
as opposed to a recursive relationship structure needed earlier. In addition,
log entries now do not need a numbering any more, since the entries associ-
ated with a particular user are put in a nested relation, i.e., a set of tuples
of computer as well as document keys.
More generality and additional flexibility are achieved by allowing con-
structors (typically beyond tuple and set, i.e., including list, bag etc.) that
can be applied to (base or already derived) types in an orthogonal fashion.
Finally, if such "complex" types and their instantiations are combined with
type-specific behavior, we arrive at what is known as object-orientation; if we
put this in an appropriately extended relational database, we end up being
object-relational.
What we have just sketched represents approximately 20 years of research
and development in the data model field, which has happened in response to
an increasing need for capturing more complex data in terms of a data model,
and to augment relational representations beyond flat tables. Complex data
can range from highly-structured data representing CAD objects or struc-
tured documents, to images, pictures, text, and audio or video data. Clearly,
not only modeling capabilities have been extended, but also the correspond-
ing languages. In this section, we look at data models and modeling, and then
return to the language issue in the next section. Models and modeling on the
one hand refer to what can be expressed in terms of a data model, and on
the other to the question of what methodologies are available for casting an
application into such a model.

3.1 Complex Objects and Object-Orientation

The investigation and study of data models has followed at least three di-
rections over the years: The first focused on so-called complex objects, which
are typed objects recursively constructed from atomic objects using construc-
tors for tuples, sets, or other data structures (e.g., bags, lists, arrays). It was
soon recognized that complex structures alone are not sufficient. Indeed, an
increasing interest has recently been in modeling behavioral aspects of ob-
jects as well, and in encapsulating object structure and object behavior. This
has paved the way for including object-oriented features in databases, which
in turn has given rise to the other two directions: One focused on so-called
pure objects, in which basically everything in a database is considered as an
abstract object that has a unique identity. The schema of a database can
then be considered as a directed graph, whose nodes are class names, and
whose edges represent single- or multi-valued attributes. A database instance
becomes another directed graph, whose nodes represent objects, and whose
edges are references between these objects (i.e., attribute values). While such
a model is theoretically appealing, it appears too sophisticated for many real-
world applications; the third and currently most active direction therefore is
to distinguish between objects and their values, and to let only the former
have an identity. The exposition in this section will mostly center around this
latter direction, in particular since it nicely carries over to object-relational
structures as found in several present-day "universal server" systems.
Object-orientation has been recognized as an important new paradigm in
the area of programming languages ever since the arrival of the language Sim-
ula. It is roughly based on the following five fundamental principles [LV98]:

1. Each entity of the real world is modeled as an object which has an exis-
tence of its own, manifested in terms of a unique identifier (distinct from
its value).
2. Each object has encapsulated into it a structure and a behavior. The
   former is described in terms of attributes (instance variables), where
   attribute values, which together represent the state of the object, can be
   identifiers of other objects so that complex objects can be defined via
   aggregation. The latter consists of a set of methods, i.e., procedures that
   can be executed on the object.
3. The state of an object can be accessed or modified exclusively by sending
   messages to the object, which causes it to invoke a corresponding method.
4. Objects sharing the same structure and behavior are grouped into classes,
   where a class represents a "template" for a set of similar objects. Each
   object is an instance of some class.
5. A class can be defined as a specialization of one or more other classes. A
   class defined as a specialization forms a subclass and inherits both
   structure and behavior (i.e., attributes and methods) from its superclasses.

In the area of programming languages and also in software engineering, the
introduction of these principles marked a radical departure from the
conventional view that active functions operate on passive (data) objects.
Instead, the goal now is to have active objects react to messages from other
objects. It was recognized that this feature is crucial for achieving data
abstraction and information hiding, and for supporting modularity,
re-usability and extensibility in program designs. Considering the success of
the object-oriented paradigm in these areas, it is not surprising that a
merger of object-orientation and database technology got on researchers'
agendas. While initially there has been a lot of confusion about what an
object-oriented database system (OODBS) actually is, a working definition
that is still valid today has been established in [ABD+89]. For databases it
is crucial to distinguish schema and instances: A schema is developed during
a design process, and essentially tells which kinds of real-world entities
and their relationships should appear in the database, what reasonable
abstractions for them are, and how they are consequently structured using the
features of the data model at hand. The core aspects of an object data model
are summarized in Figure 3.2.

Fig. 3.2. The ingredients of an object data model

The (only) structuring mechanism is the class which describes both structure
and behavior for its instances, the objects. Structure is captured as a type
for a class, where a type is nothing but a description of a domain, i.e., a
set of values, and may or may not be named (in the former case, type names
distinct from class names and attribute names must be provided). Values
comprise the state of an object and can be as complex as the type system
allows (i.e., depending on the availability of base types and constructors
like tuple, set, bag, list, etc.). Behavior is manifested in a set of messages
associated with each class (its external interface), which are internally
implemented using methods that are executable on objects. Formally, messages
are specified by providing a signature, and by associating several signatures
with the same message name, the latter gets overloaded. Not shown in Figure
3.2 is the possibility to organize classes in an inheritance hierarchy; also
not shown is the fact that class attributes are allowed to reference other
classes, thereby forming an aggregation lattice.

Notice that Figure 3.2 easily carries over to a programming language scenario,
say, based on Java, where again classes are distinguished from objects, and
states from behavior. In fact, it is not even difficult to make a transition
from an ER model to a set of Java class definitions, and for this and other
reasons databases and programming languages have moved towards each other
over the years using the object-oriented paradigm as the appropriate meeting
platform. This is particularly true for SQL-based systems and Java; relevant
standards in this area include JDBC and SQLJ, see [ME00,Ric01].

A typical application scenario is that OO components from different vendors
and with distinct tasks need to run on the same system and, more importantly,
need to understand each other and even cooperate. In this situation, the
development of standards is appropriate, and for the OO world this was
discovered early on by the Object Management Group (OMG) (www.omg.org), now
one of the biggest consortia in the computer industry. Their major spin-off,
the Object Database Management Group (ODMG), has presented a standardization
proposal for OO databases which covers an object model, a query language, and
programming language bindings [CBB+00]. The bottom-line situation nowadays is
that every major OO model development essentially sticks to the ODMG
proposal.
In the context of advanced database systems, we mention that object-
orientation has also been combined with rule-based languages. Indeed, the
direction of deductive and object-oriented database (DOOD) languages has
been investigated for quite a while now, with a number of interesting results
that are, however, not directly commercially relevant. Notable approaches
include the Chimera language of IDEA [CM94], F-Logic [KLW95], and IQL
[AK98]; for surveys we refer to [AHV95,LV98].

3.2 Object-Relational Databases


As was already anticipated in [SS90], the presentation of data that is kept in
a database in the two-dimensional form of a table is convenient and appro-
priate even in cases where the data itself consists not just of atomic values.
Thus, even data that represents the state of an object such as an image or
a piece of text can be displayed in relational form. Conversely, vendors of
relational database systems are not abandoning their investments, but tend
to build OO features into their next system generations. The result is a com-
bination of OO and relational, termed object-relational or "OR" for short
[SB98,Cha98,Bro01,CZ01]. Moreover, the current version of the SQL stan-
dard, SQL:1999, comprises an object-relational data model (and brings along
advanced query features) [MS01]. The major OO characteristics of SQL:1999
can be summarized as follows:
• The type system of SQL is extended by user-defined types (UDTs), which
can be distinct or structured.
• User-defined functions (UDFs) can be attached to UDTs and then provide
type-specific behavior; these functions can be SQL expressions or written
in a foreign language.
• Types can be arranged in a specialization hierarchy which acknowledges
inheritance.
• A similar feature applies to relations: A table can be defined as a subtable
of one or more other tables and then inherits their attributes.
Distinct types are always based on predefined types and represent a spe-
cialized usage; corresponding operators need to be derived individually (or
remain undefined otherwise). For example,

CREATE DISTINCT TYPE Money AS Integer WITH COMPARISONS;


CREATE FUNCTION "+"(Money, Money) RETURNS Money
SOURCE "+"(Integer, Integer);
derives a new type Money from Integer and attaches to it a function "+" for
adding two values of type Money. Now if salary and bonus are two attributes
of type Money, the expression salary + bonus would be a valid one.
Each tuple in every table can be equipped with a unique row identifier,
which can be used either implicitly for identification purposes, or explicitly,
for example, as a foreign-key value. Technically, a row id is a specific data
type; its values can even be made visible (and then accessed as any other
attribute).
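The following sketch indicates how structured types, type and table
hierarchies, and row identifiers fit together in SQL:1999; the type and table
names are invented for the example, and actual systems vary in how much of
this syntax they support:

create type Person_t as ( name varchar(30) ) not final;
create type Student_t under Person_t as ( university varchar(40) ) not final;

create table Person of Person_t ( ref is oid system generated );
create table Student of Student_t under Person;

Here, oid is the self-referencing row identifier of the typed table Person,
and a query against Person will also see the tuples of its subtable Student
unless it is explicitly restricted to the supertable.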
The OR combination has become popular due to the fact that big vendors
of relational database management systems like Oracle, Informix, IBM and
others are promoting OR in their current generation of systems [Cha98,SB98].
For further illustration purposes, we consider a few examples referring
to the Informix Dynamic Server (IDS) system [BroOI], which features sup-
port for base type extensions, complex objects, inheritance, and more. IDS
provides for complex data and object-oriented concepts in an SQL setting
through unique record identifiers, user-defined types and corresponding op-
erators as well as functions and access methods, complex objects, inheritance
of data and functions, polymorphism, overloading, dynamic extensibility, ad-
hoc queries and active rules to ensure data integrity.
The following are valid type and table declarations in the IDS language;
we again refer to the application previously shown in Figures 2.1 and 2.2:
create row type text_type
( header varchar(20) not null,
contents varchar(250) not null);

create row type doc_type
( URL varchar(50) not null,
  Content text_type not null,
  Created date not null,
  TargetURL set(varchar(50)));

create row type User_type
( EMailAddress varchar(40) not null,
  Name varchar(30) not null);

create table Document of type doc_type
(primary key (URL));

create table User of type User_type
(primary key (EMailAddress));
These definitions first create three types describing the relevant features of
a text, a document, and a user, resp. Finally, a table named User is created
whose tuples are of type User_type. The base data types used in these def-
initions are the standard ones from SQL; however, as can be seen in type
doc_type, the set constructor (set) can also be used.
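To indicate how such nested structures are accessed, the following query
sketch retrieves the URL and the header component of documents that link to
a given target; the dot notation for row-type components follows the
declarations above, whereas the membership test on the set-valued attribute
is written in a generic IN style and would need to be adapted to the exact
collection syntax of the system at hand:

select d.URL, d.Content.header
from   Document d
where  'www.um.de/db.html' in d.TargetURL;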
We mention that most present-day systems support a simpler way of
defining specific data types and their behavior, which leaves almost all of the
work to the user or the application programmer. What they provide is a built-
in BLOB ("binary large object") type which, for example, can take multimedia
objects like medical images or audio messages. BLOBs can typically hold up
to 2 GB of binary data. Most of the time, they are not directly stored in
tables, but are represented by descriptors, and they can be loaded directly
from files. With BLOBs, the definition of more complex user-defined types
is more tricky, since the BLOB's structure is typically hidden in the data
structure of a corresponding program that reads or writes the BLOB, and
the same applies to the BLOB's behavior. Indeed, a BLOB becomes useful
only through attached functions that "decompose" the BLOB as desired.
On the positive side, BLOBs can be used for any kind of data that would
normally not fit into the structures or the types of a (relational) system.
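A minimal sketch of this style of modeling is the following table for medical
images; the table and column names are invented, and the size specification
follows the BLOB(n) convention found in several systems:

create table MedicalImage
( ImageID  integer primary key,
  Patient  varchar(40),
  Scan     blob(512M) );

The Scan column merely holds the raw bytes; any interpretation of the image
is left to attached functions or to the application program.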
We should point out that for object-relational database systems it is nowa-
days common to provide predefined class libraries for specific applications,
e.g., text, HTML pages, audio data, video data, graphics, images. Informix
calls them DataBlades, while IBM calls them Database Extenders and Oracle
Cartridges.

3.3 Designing Databases with Objects and Rules

The advanced modeling capabilities that are available in modern database
models or systems of course impose new challenges on design methodologies.
Traditionally, database design has been a reasonably well-understood pro-
cess that centers around some variation of the Entity-Relationship approach.
Moreover, relational databases have contributed the algorithmic approach of
normalization [MR92] that yields schemata with desirable properties. Today,
the picture is not that simple any more, and there are several reasons for
this:

1. Database design has to respect more aspects of an application than just
the structure of data items to be stored. Indeed, it is common to incorpo-
rate into a database design process also some form of functional design
[BCN92] that tries to capture dynamic aspects of working with a data-
base. Common approaches to functional design utilize methods that have
previously been developed in the area of software engineering, including
data-flow diagrams, state-charts, or Petri nets.
2. With the arrival of object-based data models, more semantics can be
captured within a design process, and more integrity constraints (e.g.,
triggers, event-condition-action rules) may become representable as part
of a database design.
3. Databases that center around an object model will often exist in a world
of other software that is itself object-oriented, and designed using object
techniques. Therefore, it makes sense to exploit object techniques, e.g.,
object-oriented analysis and design, UML, in the context of database
design as well.
4. Many applications nowadays attempt to integrate data modeling and da-
tabase design with more general contexts such as enterprise-wide informa-
tion modeling or process modeling, e.g., for capturing business processes
and their computerizable parts, workflows.

Fig. 3.3. The IDEA methodology (requirements are captured in conceptual models,
namely an application model, a dynamic model, and an object and operation model,
which are then mapped onto a deductive OODBMS)

In this situation, traditional design approaches for databases are no longer
sufficient, and new and more comprehensive approaches need to be devel-
oped. We mention one major effort in this direction, the IDEA methodology
[CF97], an object-oriented methodology which covers the three classical ac-
tivities of analysis, design, and implementation, and which additionally in-
cludes a prototyping intermediate phase. Figure 3.3 gives an overview of the
methodology.
Analysis is devoted to a collection and specification of requirements, using
conceptual models and graphical notations which are already well established,
including state-charts and ER diagrams. Design is the process of translating
requirements into documents that provide a precise specification of the ap-
plication at hand, and is conducted by mapping semi-formal specifications
into fully formal ones. Within IDEA, design is divided into schema design
and rule design, where the former refers to types, classes, relationships, and
operations, and the latter refers to deductive as well as to active rule de-
sign. Prototyping, next, is the process through which a design specification
is turned into an initial executable version of the system under design, for
being able to test compliance with the requirements. Finally, implementation
maps specifications into schemas, objects, and rules of real-world database
systems. To this end, IDEA covers a variety of systems, including Oracle,
DB2, IDS, Ode, or Validity; details are in [CF97].
It is clear that only comprehensive approaches such as IDEA will in the
long run be able to provide the design support that complex modern data-
base applications and their integration into applications require. On the other
hand, there seems to be another approach emerging that is quite complemen-
tary to what was just said: As interactions between multiple systems become
more and more a necessity, and the configuration and setup of interacting sys-
tems and system components becomes more frequent, more ad-hoc, and less
long-lasting, a new way of designing artifacts is currently underway. Indeed,
in the world of XML, to be discussed next, a-priori design is not always an
issue as pressing as it has been considered in the traditional database world;
more often, design is (if at all) done "on the fly" and eventually adapted to
whatever is needed later.

3.4 Semi-Structured Data and XML


Database systems and their applications today are under increasing pres-
sure to manage data that does not easily or naturally conform to a common
data model. Recall Figure 1.1, where multiple clients access multiple database
servers through a variety of application servers. An underlying scenario could
be that of a shop from which customers can buy over the Web: One database
server manages the product catalog in which clients browse as long as they
are undecided yet. Another server takes orders and converts them into bills
and shipments at the back office; a third server would be in charge of handling
payments, i.e., secure money transfers. In such a situation, data often comes
from external organizations or partners, and its structure is often only par-
tially known, and may change without notice. Other examples exhibiting the
same characteristics include data extracted from the Web for collection in a
digital library, data resulting from the integration of heterogeneous sources,
databases in molecular biology, or data in Web site management systems.
Typically such data includes both fragments with a well-known structure
and ones with an unknown or even non-existing structure, and hence has
been termed "semi-structured data".
In the second half of the 1990s, database researchers have started to ex-
tend data management techniques to semi-structured data. First, they have
addressed semi-structured data at the logical data level, and studied data

Fig. 3.4. A semi-structured movie database

models and query languages. At this level semi-structured data resembles
object-oriented data, but without a schema: some objects have missing at-
tributes, others may have multiple occurrences of the same attribute, the
same attribute may occur with different types in different objects, etc. Im-
portantly, in the semi-structured model there is no a-priori schema, and the
data is self-describing (i.e., the schema is embedded with the data). The
need to allow users to formulate ad-hoc queries on semi-structured data, in
the spirit of SQL, has motivated research on query languages for semi-
structured data. A sample semi-structured database with information about
movies is shown in Figure 3.4; it resembles the information that can be found
in the Internet Movie Database (see www.imdb.com). Notice that the cast of
a movie may consist of just actors, or of credit actors and (ordinary) actors;
TV shows are sequences of episodes, etc.
An important feature that distinguishes such languages from relational or object-
oriented query languages is that they allow the user to navigate objects with
only partially known structure; we will have to say more about this in the
section on query languages.
Formally, the semi-structured data model is a labelled, directed graph,
with the nodes corresponding to objects and the edges to attributes. Each
edge is labelled with an attribute drawn from some predetermined universe
of attribute names (strings), and leaf nodes are labelled with values, from
some predetermined universe of atomic values (e.g., int, string, multime-
dia objects, etc). Structured data is trivially represented that way. For ex-
ample, a record [A=5, B="abc", C=5.9] is represented by four objects: a
root object with three outgoing edges labelled A, B, and C to leaves having
the values 5, "abc", and 5.9, respectively. Sets are represented by choosing
all attributes the same; for example the set {3,6,9} will have a root object
with three edges all labelled member to objects with values 3, 6, 9. More
complex graphs may represent nested collections, shared objects, cyclic struc-
tures, etc. Note that there are no a-priori restrictions on the structure: objects
may have any combinations of attributes, even repeated ones, collections may
be heterogeneous, and attributes may have any type.
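A minimal sketch of this graph view, using plain Python dictionaries as a
stand-in for object identifiers, labelled edges, and atomic values, may help to
make the representation concrete; the identifiers o1, s1, etc. and the helper
leaves are invented for illustration:

# nodes are object identifiers, edge labels are attribute names, leaves hold values
record_graph = {
    "o1": [("A", "o2"), ("B", "o3"), ("C", "o4")],   # the record [A=5, B="abc", C=5.9]
    "o2": 5, "o3": "abc", "o4": 5.9,
}
set_graph = {
    "s1": [("member", "s2"), ("member", "s3"), ("member", "s4")],   # the set {3,6,9}
    "s2": 3, "s3": 6, "s4": 9,
}

def leaves(graph, node):
    """Collect the atomic values reachable from a node, ignoring edge labels."""
    entry = graph[node]
    if not isinstance(entry, list):
        return [entry]
    return [value for (_, child) in entry for value in leaves(graph, child)]

print(leaves(record_graph, "o1"))   # [5, 'abc', 5.9]
print(leaves(set_graph, "s1"))      # [3, 6, 9]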
The most popular model of semi-structured data is OEM (Object Ex-
change Model), originally developed for the Tsimmis data integration project
at Stanford University [CGH+94,Ull00]. The literature also contains a few
non-essential variations of this basic data model, e.g., labels can be placed on
nodes or on edges; see [ABS00] for a good account of the relevant literature.
At the physical level, semi-structured data depends on the application
at hand. For example, in applications like the integration of heterogeneous
sources, some external sources happen to be relational databases. Here the
mapping into the logical model of semi-structured data is easy; the hard
part is dealing with the fact that these sources often have limited access
capabilities to their data. Other sources, especially those on the Web, export
semi-structured data simply as text files. Each source has its own preferred
way of formatting the text file, and even that can change without notice;
writing wrappers to map such data into the logical level is a work-intensive
task and, to a serious extent, also an important research topic.
Research on semi-structured data has focused, among other issues, on
schema specification, schema extraction (from data), expressive power and
complexity of query languages, and optimizations. Several research proto-
types have been built and are publicly available [ABS00,FLM98,SV98]. On
the other hand, many discussions about models for and modeling of semi-
structured data have been ended by the arrival of XML, the Extensible
Markup Language, which represents an important linguistic framework for
describing data that is to be transported or exchanged without any regard
for layout. Essentially, XML is a meta-language, i.e., a language in which
other languages can be specified, yet it can readily be used for writing sim-
ple (and, of course, also complex) documents that describe (semi-structured
or structured) data. Although XML is already widely covered in books, e.g.,
[ABS00,Hoq00,CZ01], the "world" of XML is still changing at such a fast pace
that the interested reader had better check the Web for the latest information,
for example at www.w3.org/xml.
A simple example of an XML document describing bibliographic infor-
mation is shown in Figure 3.5. Here, a bibliography can comprise books as
well as articles, and each such element can have an inner structure captured
by nested elements. For example, a book can have an author plus (possibly)
additional authors, and besides that has a title, a publisher, and a (publica-
tion) year. Elements start with an opening and end with a closing tag in a
similar way this is done in other markup languages (e.g., LaTeX, HTML); if
every element has both an opening and a closing tag (and the latter at the
<bibliography>

<book>
<author> S. Abiteboul
<additional_author>
<name> R. Hull </name>
<name> V. Vianu </name>
</additional_author>
</author>
<title> Foundations of Databases </title>
<publisher> Addison-Wesley </publisher>
<year> 1995 </year>
</book>

<article>
<author> E.F. Codd </author>
<title>
A Relational Model of Data for Large Shared Data Banks
</title>
<journal> Communications of the ACM </journal>
<year> 1970 </year>
</article>

</bibliography>

Fig. 3.5. Sample XML document

same level of nesting as the former), the overall document is well-formed. It
is important to note that XML documents are ordered, i.e., if we exchange,
say, the title and journal elements in the article document shown in
Figure 3.5, we formally obtain a different document. Another observation on
XML documents, in particular on well-formed ones, is that their structure
can easily be represented as a tree. Indeed, the document from Figure 3.5
gives rise to the tree representation shown in Figure 3.6 (where, for space
reasons, minor straightforward abbreviations had to be applied to various
element names and their values).
Notice that an XML document like the one under discussion here is vastly
self-explanatory: The various tags (or markups) describe their respective con-
tent in a more or less detailed way, and at the same time they determine the
structure of an XML document. Note also that there is no layout informa-
tion contained in an XML document.
XML is much richer in modeling capabilities than indicated here; indeed,
we refer the reader to [ABS00,Hoq00,CZ01] for further details (on language
components such as attributes that can be associated with elements or ref-
erences that can occur inside a document, namespaces for making sure that

Fig. 3.6. Tree representation of the sample XML document

naming of elements is unique to an application, or linking that allows one to
link a document internally or externally).
From a database perspective, there are certain analogies between a rela-
tional database and an XML document. For example, in a database data is
described through attributes and modeled in a schema, while in the context
of XML data is described through tags and modeled in a DTD or a schema, as
we discuss next. While a document like the one shown in Figure 3.5 appears
somewhat arbitrary in structure at first glance, XML receives its full power
when a document's structure is fixed in advance, just like a database schema
is typically created before a corresponding database is populated. To this end,
XML offers the option of specifying a document type definition (DTD), which
is essentially a context-free grammar that allows regular expressions over the
non-terminals on right-hand sides of productions (a "regular" context-free
grammar). For example, Figure 3.7 shows a DTD for the sample document
from Figure 3.5. In this DTD, like in regular expressions, "|" stands for alter-
native, "+" means one or more occurrences, commas separate inner structure,
and "PCDATA" is the only terminal data type (meaning that XML documents
essentially contain text). Notice that DTDs themselves do not follow a strict
XML-type of syntax, since all declarations are essentially attributes of the
ELEMENT tag (which, strictly speaking, is not even a tag).
In order to create a language based on XML, typically for some specific
purpose such as data exchange, one basically has to fix a namespace and to
design a DTD (or to search the Web to find out whether a suitable DTD
already exists). Next, an initial language proposal is circulated within the
relevant community, which often results in fine-tuning; finally the language
is published and can then be used within that community. Many examples of
languages that have been specified in that way exist today; for example, news
<!ELEMENT bibliography (book | article)+>

<!ELEMENT article (author, title, journal, year)>

<!ELEMENT book (author, title, publisher, year)>

<!ELEMENT author (PCDATA, additional_author?)>

<!ELEMENT additional_author name+>

<!ELEMENT name PCDATA>

<!ELEMENT title PCDATA>

<!ELEMENT journal PCDATA>

<!ELEMENT year PCDATA>

<!ELEMENT publisher PCDATA>

Fig. 3.7. A DTD for the XML document containing book information

agencies employ NewsML for publishing news that radio stations or newspa-
per publishers can subscribe to. Other examples of XML-based languages are
SMIL (Synchronized Multimedia Integration Language), MathML (Mathe-
matical Markup Language), WML (Wireless Markup Language), or BSML
(Bioinformatic Sequence Markup Language). More recently, XML Schema
has been proposed as a way of adding more database-like features, in partic-
ular type information, to a conceptual language or data structure specifica-
tion. XML schemas offer additional data types and features like inheritance
(or "derivation"), so that semi-structured data as well as many other appli-
cations can be adequately supported.
The relevance of XML to databases stems from several facts: First, XML
appears as an appropriate way of handling semi-structured data, and from
a terminological point of view resembles many database concepts: structure
vs. contents, schemas, or typing. We will see in the next section that another
such concept is declarative querying, which is perceived as a good way to ex-
plore large collections of XML documents. Second, as many XML documents
will be generated automatically, since XML is in many applications consid-
ered as a reasonable format for exchanging data, there is a growing need of
storing XML documents in a database. To this end, database vendors are
picking up and offer extensions to their systems or native XML support. We
refer the reader to [Kos99,CFP00,SSB+01] for further information. Third,
XML is easily coupled with programming languages such as Java. Indeed,
an easy transition from XML to Java can be accomplished using the Doc-
ument Object Model (DOM), which allows the representation of an XML
document as a tree of node objects or elements specified in the Interface
Definition Language (IDL) for further processing in programs. Technically, a
DOM parser takes an XML document as input and transforms it into a tree-
structured object model with a base class named node that has attributes
such as nodeName, nodeType, and nodeValue, and that has links to other
classes like ownerDocument, nextSibling, firstChild, etc. All these classes
come with predefined methods that could be called from a Java program to
operate on the XML document; details can be found in [Hoq00]. The DOM
tree for our sample document from Figure 3.5 would roughly look as shown
in Figure 3.6.
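As a small illustration, the following sketch uses Python's xml.dom.minidom
instead of the Java/IDL binding discussed above; the node attributes it touches
(nodeName, firstChild, childNodes, nodeValue) correspond to the DOM concepts
just mentioned, and the document string is a fragment made up for the example:

from xml.dom.minidom import parseString

doc = parseString(
    "<bibliography><book><title>Foundations of Databases</title>"
    "<year>1995</year></book></bibliography>"
)
root = doc.documentElement
print(root.nodeName)                     # bibliography
book = root.firstChild
print(book.nodeName)                     # book
for child in book.childNodes:            # the title and year elements
    print(child.nodeName, child.firstChild.nodeValue)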
We should mention that all the major vendors of object-relational data-
base systems are providing XML extensions to their products. However, the
emphasis there is mainly on "publishing" (representing) relational data or
results of SQL queries in XML format [SSB+01,FST00], so that they can be
further processed using the standard XML tools. The other direction, where
large XML documents are decomposed and stored in tabular format, has also
been researched [STZ+99,DFS99] and is reaching commercial products, for
example in the form of introducing a new data type "XML".

4 Advanced Query Languages

The second major area we will survey for advanced database systems is the
wide field of query languages. In particular, we will look at three represen-
tative subareas here, object-based languages, rule-based languages, and pro-
cedural data; as we go along, we will also touch upon the issue of querying
XML documents.

4.1 Object-Based Languages. Path Expressions

Object-based languages, i.e., languages for working with object-based databa-
ses, need to meet a variety of requirements [LV98] regarding their universality,
descriptivity, expressive power, genericity, or extensibility, to name just a
few. In addition, they need to have a number of specific properties, which
we briefly sketch here. First, an object-based language must reflect the fact
that there are objects to be dealt with, which need to be distinguished from
values, so it must be possible to access objects as a whole, their values, or even
parts of their values. Since a query expressed in a language usually returns
derived data, which may even be input to another query, there is also the
representation problem for object-based languages, as it needs to be clarified
whether items in the result of a query are objects (with new identifiers) or
values, or whether the result, if it happens to be a set, forms a sub-instance of
an existing instance, or a subclass of a given class. Another specific feature is
the ability to navigate through a complex schema, which will be discussed in
greater detail below; navigation in object-based environments is reminiscent
of computing joins in relational databases, although objects allow a more
"conceptual" approach to navigation.
An important novel feature of an object-based language is the incorpora-
tion of methods or functions defined on classes or objects. A method call must
in principle be allowed in every place in a query where an attribute name is
allowed, so that (parameter-less) methods and attributes can essentially be
used interchangeably. Another problem we mention is that of update opera-
tions, which must be able to implicitly create new id's for new objects, and
associate them with existing classes appropriately.
As we mentioned, a major feature object-based languages need to possess
over relational query languages is the ability to navigate through a given
database or its schema. Let us spend a few words on this topic, which ex-
hibits a major portion of the flexibility that is now available in database
query languages. Navigation is usually provided by means of path expres-
sions [BK89,LV98], where a path expression over a given object schema is an
expression of the form x.A1.A2.....An, n >= 1, where x is a variable for an
object of class C1, A1 is an attribute of that class, and for 1 < i <= n, Ai
is an attribute of class Ci, Ci being the type of attribute Ai-1 in class Ci-1.
Consider the sample database schema shown in Figure 4.1, in which rect-
angles are classes with attributes, single arrows denote aggregation, and thick
lines denote inheritance (i.e., an employee is a person, and an automobile is a
vehicle), and a star indicates that the attribute in question is set-valued. For
example attribute age of class Person is of type Integer, while attribute res-
idence is of type Address, meaning that the value of residence in a particular
instance of class Person is an object from the current instance of Address.
The following are path expressions:

• x.manufacturer.headquarter.street
(the street of the headquarter of the manufacturer of vehicle x)
• x.president.familyMembers.ownedVehicles.color
(the colors of the vehicles owned by the family members of the president
of company x)

When path expressions are used, establishing a desired navigation path may
be complex, for example due to the requirement that correct typing must
be obeyed. However, by picking up ideas from universal relation interfaces
that were developed during the 1980s, path expressions can be simplified con-
siderably, as has been demonstrated in [VV93a]. We sketch the central idea
next. Consider a schema like the one shown in Figure 4.1 and the definition
of a path expression given above. This definition has several consequences,
which impose unnecessary limitations on the usage of path expressions in
queries: They have to be specified "in full", i.e., it is not allowed that a
sequence A1.A2.....An of attributes is interrupted at any point. For exam-
ple, if we ask for the cc value of an automobile x, we would have to use
Fig. 4.1. OODB schema for illustrating path expressions (classes: Vehicle with
model, manufacturer, and color; Automobile, a subclass of Vehicle, with
drivetrain, body, engine, transmission, hp, and cc; Person with name, age,
residence, and ownedVehicles*; Employee, a subclass of Person, with salary,
qualifications*, and familyMembers*; Company with name, headquarter,
divisions*, and president; Division with name, location, manager, and
employees; Address with street and city; a star marks set-valued attributes)

x.drivetrain.engine.cc. However, there is just one way to connect an automo-
bile to a cc value, so that an expression like x.cc suffices. The query asking
for all cc values of automobiles owned by employees would lead to a path
expression like [KKS92]

x.ownedVehicles[y].drivetrain.engine.cc

where variable x stands for an employee and variable y creates a link from
vehicles (owned by employees as persons) to class Automobile. It is easily seen
that this formulation is far from perfect. Indeed, the [y]-selector enforces the
requirement that the vehicle must actually be an automobile. This could
be performed automatically by the underlying system if class names were
allowed in the middle of a path expression, i.e., if we could write

x.ownedVehicles.Automobile.drivetrain.engine.cc

or even

Employee.ownedVehicles.Automobile.drivetrain.engine.cc

Since there is only one way to connect employees to cc values, the latter could
even be simplified to
Employee.cc
Implicitly, this assumes inheritance links to be bidirectional, and the same
can be applied to aggregation links. The approach is further developed in
[VV93a], where a formal treatment as well as more examples can be found.
It demonstrates on the one hand again the impact that previous studies in
the context of the relational model may have in more advanced models; at
the same time it shows the higher potential associated with object-based
languages.
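The following toy sketch illustrates the underlying idea: if the schema graph
admits a connection from a class to an attribute, an abbreviated expression
such as Employee.cc can be expanded automatically. The dictionary is a
hand-made fragment of the schema of Figure 4.1 (with inheritance treated as
just another edge), and the sketch only searches for some path; a real system
would additionally have to check uniqueness and typing as discussed in [VV93a]:

from collections import deque

schema = {
    "Employee": ["Person"],            # inheritance, treated here as an edge
    "Person": ["ownedVehicles"],
    "ownedVehicles": ["Vehicle"],
    "Vehicle": ["Automobile"],
    "Automobile": ["drivetrain"],
    "drivetrain": ["engine"],
    "engine": ["cc", "hp"],
    "cc": [], "hp": [],
}

def expand(start, target):
    """Breadth-first search for a path from start to target in the schema graph."""
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return ".".join(path)
        for nxt in schema.get(path[-1], []):
            queue.append(path + [nxt])
    return None

print(expand("Employee", "cc"))
# Employee.Person.ownedVehicles.Vehicle.Automobile.drivetrain.engine.cc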
Features such as path expressions show up in basically every object-based
database language. A quasi-standard is currently provided by the ODMG
proposal OQL, an acronym for Object Query Language [CBB+00]. In brief,
OQL relies on the ODMG object model, extends SQL with object features
such as path expressions, polymorphism, or late binding, and provides high-
level primitives to deal with sets of objects. It is a functional language whose
operators can be freely composed, but it is not computationally complete.
OQL is a pure query language without update operators, which can be in-
voked from a programming language for which an ODMG binding is defined.

4.2 Querying Semi-Structured Data and XML Documents


As mentioned in the previous section in the context of semi-structured data,
an important feature of query languages for semi-structured data is that they
allow the user to navigate objects with only partially known structure; this is
often done with regular path expressions. In slight modification of the OODB
schema shown in Figure 4.1, in an object-oriented system we would write a
sample path expression like
x.divisions[Trucks].manager.residence.street
to get the street address of the residence of the manager of the truck division
of company x. When the schema evolves or changes, the same expression may
need to be changed to
x.headquarter.divisions[Trucks].administration.manager.street.
In a system supporting semi-structured data one could write the path ex-
pression as
x.*.divisions[Trucks].*.manager.*.street

where wildcards * allow us to navigate the part of the data that has an
unknown or uncertain schema.
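A simplified sketch of such wildcard navigation over nested dictionaries looks
as follows; the data and the helper functions are invented for illustration, the
selector syntax [Trucks] is written as an ordinary step, and * here matches
exactly one intermediate attribute rather than an arbitrary regular path:

def follow(objects, step):
    """One navigation step: either a named attribute or a '*' wildcard."""
    result = []
    for obj in objects:
        if not isinstance(obj, dict):
            continue
        if step == "*":
            result.extend(obj.values())
        elif step in obj:
            result.append(obj[step])
    return result

def eval_path(root, path):
    objects = [root]
    for step in path.split("."):
        objects = follow(objects, step)
    return objects

company = {"divisions": {"Trucks": {"administration":
              {"manager": {"residence": {"street": "Main St"}}}}}}
print(eval_path(company, "divisions.Trucks.*.manager.*.street"))   # ['Main St']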
The study of semi-structured data and, somehow related, that of the
Web have raised several new theoretical questions. The regular expressions
found in query languages for semi-structured data are particular instances of
recursion. While most problems for recursive queries are undecidable (con-
tainment, equivalence, etc.), it turns out that they become decidable for the
restricted class of regular path expression queries. This has motivated re-
search on the optimization and containment of queries with regular path ex-
pressions. The Web and its browsing-style computation has generated other
kinds of questions: What is a good model of computation on the Web? What
kind of queries can we ever hope to answer? Computations can only traverse
links in forward direction, and, since the Web is ever growing, queries will
never be able to exhaust a search. Such questions have generated interesting
research on the limitations of query computations on the Web [AV00,ABS00],
and have also started to generate a conceptual understanding of the Web
[KRR+00].
When looking at the tree structure of (semi-structured) documents (e.g.,
the movie database previously shown in Figure 3.4) and in particular at that
of an XML document, it is obvious that a common way of retrieving infor-
mation will be to specify how the tree should be traversed (typically starting
from the root) and what conditions have to be satisfied by subtrees in order
to be considered a match (or relevant to the result). For example, consider the
sample XML document from Figure 3.5 once more, whose tree representation
was shown in Figure 3.6. It is easy to imagine that this bibliography is much
larger, with many books and many articles; we can also imagine that the tree
structure is deeper nested, for example in order to reflect more information
about a publisher, keyword sections for articles, or even links between the
various publications of a particular author. "Queries" to such a tree would
then amount to descriptions of how to search through the tree. For example,
looking for all titles of books or articles occurring in the bibliography could
be written as follows:

bibliography//title

This path specification essentially asks for a tree traversal that starts from the
root of the document in question (bibliography) and that descends from there
to (the values of) all title elements, which could be at arbitrary depths (the
latter is indicated by "//"). In order to express selection conditions (e.g., "all
titles from publications by author Ullman") additional language constructs
would be needed. We mention that all this is provided in a language called
XSL (XML Stylesheet Language) [Kay01], which is not exactly a language for
specifying style sheets, i.e., layout information, but which is mainly a language
for describing tree transformations (in its XSLT portion), i.e., procedures for
transforming input XML trees into output XML trees.
As an example, assume that we had represented the Person class (actually
just a single instance from that class) from Figure 4.1 as an XML document as
shown in Figure 4.2 (where it is assumed that the particular person considered
owns three vehicles). Then an XSL program returning all vehicles of the

Fig. 4.2. Tree representation of the Person class (a Person root with children
name, age, residence, and ownedVehicles, the last of which has three Vehicle
children)

person in question would look as follows:


<xsl:template match="/">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="Person//Vehicle">
  <result> <xsl:value-of select="."/> </result>
</xsl:template>
This sample program consists of two template rules, each shown as an
xsl:template tag, the first of which says "start at the root, and apply the
remaining rules to the document starting there". The second then looks for
Vehicle items under Person, and for each such item returns its value. Note
that XSL is apparently not a declarative language, although the tag-based
syntax suggests that an XSL program resembles an XML document.
We also mention that a way of specifying path expressions has already
been standardized for XML, which is XPath, the XML Path Language. In
brief, a location path in that language declaratively describes a set of nodes
in a given document; it consists of one or more locating steps, where each
step comprises an axis, a node test, and a step qualifier. Returning to Figure
4.2, a path whose evaluation is assumed to start at the root of this document
and which looks for the second vehicle owned by the respective person would
be written as follows:
/descendant::ownedVehicles/descendant::Vehicle[position()=2]
In this (simple) example, descendant is an axis, ownedVehicles and Vehicle
are node tests, and position()=2 is a step qualifier.
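For comparison, the XPath subset implemented by Python's ElementTree library
can express a similar navigation, although it accepts only a small part of the
axis syntax shown above; the document fragment below is made up for illustration:

import xml.etree.ElementTree as ET

person = ET.fromstring(
    "<Person><name>Mary</name><ownedVehicles>"
    "<Vehicle>sedan</Vehicle><Vehicle>truck</Vehicle><Vehicle>bike</Vehicle>"
    "</ownedVehicles></Person>"
)
second = person.find("./ownedVehicles/Vehicle[2]")   # the second Vehicle child
print(second.text)   # truck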

4.3 XQuery

The aforementioned XSL is a language for "querying" XML documents that
closely follows the syntactical spirit of XML, but it is not as easy to use as a
common database language (under the assumption that a collection of XML
documents is considered a "database"). For this reason, alternative propos-
als have been under discussion for quite a while [STZ+99,BC00]. Recently, a
convergence of this discussion has been reached through XQuery, the XML
Query Language (see www.w3.org/TR/xquery as well as [CRL+02]) that is
vastly based on an earlier proposal called Quilt [CRF01]. XQuery knows a
number of different expressions for specifying queries, including path expres-
sions (in style of XSL, see above), element constructors (to make sure that
query results can conform to XML syntax), expressions involving operators
and functions, conditional or quantified expressions, list constructors, ex-
pressions that test or modify data types, and FLWR (pronounced "flower")
expressions.
A FLWR expression is reminiscent of an SQL query expression and gen-
erally consists of (up to) four types of clauses: a FOR clause, a LET clause,
a WHERE clause, and a RETURN clause (in this order). The first part of a
FLWR expression serves to bind values to one or more variables, where values
to be bound can be represented by expressions (e.g., by path expressions).
A FOR clause is used whenever iteration is needed, as each expression in a
FOR clause returns a list of nodes (from the XML document to which the
query is applied). The result of the entire clause then is a list of tuples, each
of which contains a binding for each of the variables. A LET clause serves
local binding purposes as in functional languages such as Scheme. A WHERE
clause acts as a filter for the binding tuples generated by preceding FOR and
LET clauses; only those for which the given WHERE predicates are true are
used to invoke the RETURN clause. The latter, finally, generates the output
of the FLWR expression, which may be a node, an ordered forest of nodes,
or just a value.
As an example, consider our XML document from Figures 3.5 (text) and
3.6 (tree) once more. The following FLWR expression lists the titles of books
from the bibliography that were published by AWL in 1995:

FOR $b IN document("bibliography.xml")//book
WHERE $b/publisher = "AWL" AND $b/year = "1995"
RETURN $b/title

It is assumed here that the bibliography is a single document contained in a
file named bibliography.xml. The FOR clause states that the interest is in
books from this bibliography; the WHERE clause asks for AWL books from
1995. Finally, for all candidates found, the title is returned.
We mention that, as of Spring 2002, XQuery still has "working draft" sta-
tus whose latest revision was published in December 2001, and that it is hence
several steps away from becoming a W3C "recommendation". However, the
language core seems stable, and database system vendors are already looking
at ways to support XQuery. Moreover, the responsible committees have made
a serious effort to "surround" XQuery with a number of other documents:
query requirements, query use cases, a data model, a syntax and a formal se-
mantics; all of these documents can be found at www.w3c.org/XML/Query. A
vastly complete account of the currently studied interactions between XML
and databases, in particular database querying, is given by [Wil00].

4.4 Towards a Foundation of Procedural Data


We next look into the database functionality of stored procedures and pro-
cedural data, and review recent approaches of dealing with them in a for-
mally precise, yet powerful and appealing manner; as will be seen, these are
even not far from being practical. Stored procedures are a common feature
of present-day systems that come with a client-server architecture. In such
a setting, clients send requests for data to the database server, and they
do so in various ways. For example, a query server is able to accept SQL
queries from clients, evaluate them, and return the result to the client. In
a stored-procedure approach, the server stores precompiled procedures that
are executed when called by a client with the appropriate parameters. These
procedures are typically collections of SQL statements that are stored in the
database itself under a specific name, and retrieved from the database when
called. However, there is typically no functionality beyond that; in particular,
it is generally not possible to manipulate stored queries, for example, by SQL
expressions, or to generate them dynamically based on the current contents
of the database.
Although a feature widely used in modern systems, the study of proce-
dural data from a conceptual point of view is thus still in its infancy. Techni-
cally, the idea of handling and manipulating procedural data goes back to the
work of Stonebraker on Quel as a data type [SAH+84,SAH87]. It has many
applications beyond stored procedures, such as data dictionaries, used for
querying meta data or view-definition maintenance, event-condition-action
(ECA) rules in active databases, methods in object-oriented databases, and
more recently Web interfaces.
In [VVV96], we have introduced and studied RA, the reflective algebra,
which exploits (linguistic) reflection, i.e., the ability of a program to gener-
ate code that is integrated in its own execution, in the context of relational
databases, in an attempt to increase the expressive power of the relational
algebra. Technically, relational algebra expressions are stored in a specific pro-
gram relation which is separated from data relations. Importantly, a program
relation can be created dynamically, manipulated by standard operators from
relational algebra, and evaluated dynamically through an eval operator.
As a simple example, consider a parent-child relation R over a schema
{P, C}, and the query that wants to compute the set of all children and grand-
children of Fred. The following algebra program (which can easily be rewritten
in the traditional closed form and which is executed line by line) computes
the result in the variable that occurs in the last assignment (X9):

X1 := R;                X6 := X4 ⋈ X5;
X2 := (P : Fred);       X7 := πC'(X6);
X3 := X1 ⋈ X2;          X8 := X3 ∪ X7;
X4 := ρC/C'(X3);        X9 := πP(X8).
X5 := ρP/C'(X1);

The program relation storing this little program would have the following
contents:

sno  var  op  att-1  att-2  arg-1  arg-2  ret  const
  1  X1                     R
  2  X2                                   P    Fred
  3  X3   ⋈                 X1     X2
  4  X4   ρ   C      C'     X3
  5  X5   ρ   P      C'     X1
  6  X6   ⋈                 X4     X5
  7  X7   π   C'            X6
  8  X8   ∪                 X3     X7
  9  X9   π   P             X8

In general, program relations can be created dynamically based on the
database contents, and then be evaluated using eval. The evaluation of an
algebra program stored in such a program relation returns the result of eval-
uating that program on the underlying database.
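The following toy interpreter, which is far simpler than the reflective algebra
of [VVV96] and supports only selection and projection, may convey the flavor
of eval: queries stored as data are interpreted against the database at run
time. All names and the row layout are invented for illustration:

R = [{"P": "Fred", "C": "Ann"}, {"P": "Ann", "C": "Bob"}, {"P": "Sue", "C": "Tim"}]

# a "program relation": every row names an operator and its arguments
program = [
    {"var": "X1", "op": "select", "att": "P", "const": "Fred", "arg": "R"},
    {"var": "X2", "op": "project", "att": "C", "arg": "X1"},
]

def evaluate(prog, db):
    env = dict(db)                      # relation names known to the "database"
    for row in prog:                    # executed line by line, as in the text
        rel = env[row["arg"]]
        if row["op"] == "select":
            env[row["var"]] = [t for t in rel if t[row["att"]] == row["const"]]
        elif row["op"] == "project":
            env[row["var"]] = [{row["att"]: t[row["att"]]} for t in rel]
    return env[prog[-1]["var"]]         # result of the last assignment

print(evaluate(program, {"R": R}))      # [{'C': 'Ann'}]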
As shown in [VVV96], this approach allows one to formalize Stonebraker's
Quel language as well as other approaches recently made in this direction, and
RA can even express the PTIME queries (over ordered databases), thereby
increasing the expressive power of the relational algebra considerably. How-
ever, the price to pay is an untyped setting, since the eval operator could
be applied to an ill-formed program relation and then give an error (or an
empty result), and there is no way of avoiding such situations. Also, the lan-
guage is difficult to use, since algebra computations are essentially "assembly-
language" programs.
There have been various attempts to overcome these drawbacks and to
design "cleaner" languages for similar purposes: one that takes reflection to
the level of SQL, and one which is closer to the classical relational algebra.
The former language is proposed in [MV00a] and is called Reflective SQL
(RSQL); it essentially reapplies the RA formalisms and constructions at the
level of SQL queries. Thus, there is a program relation that can now take SQL
queries as contents, and there is an eval mechanism that can dynamically
evaluate SQL programs stored in a program relation. Since the resulting
language is still not too easy to use, a schema-independent version of it called
SISQL was put on top of it in [MV00b].
The second language that represents a work-around is the recently pro-
posed MA, the meta algebra [NVV+99]. In this language, expressions com-
prise names of object (data) and meta relations (the latter can hold queries
and data), the standard relational operators, and new operators to work on
query columns, which are called extract, rewrite, and eval. The impor-
tant issue of typing is taken care of in a conservative way: The type of an
ordinary relation is its width (i.e., number of attributes), and the type of a
relational algebra expression is the type (number of attributes) of the result
relation. Next, the type of a relation containing algebra expressions is defined
as the types of the columns containing ordinary data values or expressions of
a designated type, and finally the type of an expression of MA is again the
type of the result relation which may contain expressions.
As an example, let S, T, and U be binary relations, and consider the
following relation R whose first and third columns are data columns, and
whose second and fourth columns are expression columns of arity 4 and 2,
resp.:

a   σ1=2(S) × σ1=2(T)                 d   S
b   σ1=4(S × T) ∪ (S × S)             e   U
c   π1,2,3,4σ2=6σ4=5(S × (T × S))     f   S

An ordinary algebra operation such as projection can be applied as before.
The new operators, specifically designed to work on query columns, are (by
way of the same example) as shown below. They all keep the original con-
tents of their operands and attach their result to it; we here use additional
projections to keep results small.
   π1,3,4,5 extract2:4(R) extracts from column 2 all subexpressions of arity
4 (and then projects onto columns 1, 3, 4, and 5):

a   d   S   σ1=2(S) × σ1=2(T)
b   e   U   σ1=4(S × T) ∪ (S × S)
b   e   U   σ1=4(S × T)
b   e   U   S × S
b   e   U   S × T
c   f   S   π1,2,3,4σ2=6σ4=5(S × (T × S))
c   f   S   T × S

   π1,3,5 rewrite-one2:D4→T(R) rewrites (each) one occurrence of an expression
from column 4 in column 2 by T (and then projects onto columns 1, 3, and
5), where D4 is a variable for expressions in column 4 (which can hence be
instantiated by S or U):
a   d   σ1=2(T) × σ1=2(T)
c   f   π1,2,3,4σ2=6σ4=5(T × (T × S))
c   f   π1,2,3,4σ2=6σ4=5(S × (T × T))
Similarly, π1,3,5 rewrite-all2:D4→T(R) rewrites all occurrences of an expres-
sion from column 4 in column 2 (simultaneously) by T (and projects as
before):

a   d   σ1=2(T) × σ1=2(T)
b   e   σ1=4(S × T) ∪ (S × S)
c   f   π1,2,3,4σ2=6σ4=5(T × (T × T))
Think of rewrite as operating on the parse-tree representation of a query: It
takes a subtree of a parse tree and replaces one or all occurrences of the sub-
tree by another subtree. The last new operator, eval, takes a query column
and attaches the result of evaluating all the queries in that column to the
given relation.
General definitions of these operators appear in [NVV+99], and it is
shown there that extract, rewrite, eval are primitive operators, i.e.,
MA is non-redundant, and that MA is a conservative extension of the
relational algebra, i.e., it coincides with the relational algebra on ordinary
databases.
As a concrete application, consider a bookstore database which is queried
over the Internet. Let queries be algebra expressions. Imagine we want to
monitor the database usage by maintaining a meta relation Log of type [0, (4)],
containing pairs (u, q), where u is a username and q is a query u has posed.
Our focus thus is on queries of arity 4 returning 4-ary relations, e.g., sets of
book records. The query show the results of all queries posed by every user is
expressed as
   π1,3,4,5,6 eval2(Log)
Similarly, determine all queries that gave no result is

show the union of the results of the queries posed by Jones is

   π3,4,5,6 eval2 σ1='Jones'(Log)
More examples appear in [NVV+99].
Interestingly, for MA there is a "Codd theorem" in style of the one men-
tioned previously, as a meta calculus, restricted to safe expressions, can be
shown to be equivalent to MA. However, when compared to the reflective
algebra, there exists a limitation on the expressive power of MA, due to
its typed nature, since some computationally simple, well-typed queries are
not definable in MA. The intuitive reason is that the computation of such a
query requires untyped intermediate results.
We have mentioned these approaches to procedural data in the context
of advanced database query languages since they approach the problem of
putting procedural data on solid formal grounds, and they open up an entire
area of research on query languages which can handle query expressions as
appropriately as they can handle data. As we will indicate next, they can
even be incorporated into SQL through an appropriate exploitation of XML.

4.5 Meta-SQL

In this subsection, we briefly sketch a practical meta-querying system called
Meta-SQL [VVV02], where stored queries are represented as syntax trees
in XML format. This representation allows us to use XSLT for a (syntacti-
cal) manipulation of stored queries. Many syntactical meta-queries can then
directly be expressed simply by allowing XSLT function calls within SQL
expressions. We note that it would be easy to substitute XSLT by XQuery
in this approach.
We consider relational databases as before, except that in a table columns
can now be of type "XML". In any row of such a table, the attribute cor-
responding to a column of type XML holds an XML document. To query
databases containing XML in this way, it seems natural to extend SQL by
allowing calls to XSLT functions, in the same way as extensible database sys-
tems extend SQL with calls to external functions. However, in these systems,
external functions have to be precompiled and registered before they can be
used. In Meta-SQL, the programmer merely includes the source of the needed
XSLT functions and can then call them directly.
As an example, consider a simplified system catalog table called Views
which contains view definitions. There is a column name of type string, holding
the view name, and a column def of type XML, holding the syntax tree of
the SQL query defining the view, in XML format. For example, over a movies
database, suppose we have a view DirRatings defined as follows:
create view DirRatings as
select director, avg(rating) as avgrat
from Movies group by director
Then catalog table Views would have a row with the value for name equal to
'DirRatings', and the value for def equal to the following XML document:
<query>
<select>
<sel-item>
<column>director</column>
</sel-item>
<sel-item>
<aggregate>
<avg/>
<column-ref>
<column>rating</column>
</column-ref>
</aggregate>
<alias>avgrat</alias>
</sel-item>
</select>
<from>
<table-ref>
<table>Movies</table>
</table-ref>
</from>
<group-by>
<column-ref>
<column>director</column>
</column-ref>
</group-by>
</query>
Clearly, for writing such XML representations of SQL expressions in a uni-
form way, an XML DTD is needed, which can be derived, for example, from a
BNF syntax for SQL such as the grammar given by Date [DD97]. The derived
DTD is given in [VVV02].
Now consider the (meta-) query "which queries do the most joins?" which
is to be applied to our Views table. For simplicity, let us identify the number
of joins an SQL query performs with the number of table names occurring
in it. To express this meta-query in Meta-SQL, we write an auxiliary XSLT
function count_tables, followed by an obvious SQL query calling this func-
tion:
function count_tables returns number
begin
  <xsl:template match="/">
    <xsl:value-of select="count(//table)"/>
  </xsl:template>
end

select name from Views
where count_tables(def) = (select max(count_tables(def))
                           from Views)
The first line declares the XSLT function in Meta-SQL; between begin and
end arbitrary XSLT code is allowed. In general, Meta-SQL allows multiple
XSLT functions to be declared and called in the SQL query that follows the
function declarations.
The combination of SQL and XSLT just sketched provides a basic level
of expressive power, yet for more complex syntactical meta-queries SQL can
be enriched with XML variables which come in addition to SQL's standard
range variables. XML variables range over the subelements of an XML tree,
where the range can be narrowed by an XPath expression. XML variables thus
allow one to go from an XML document to a set of XML documents. Conversely,
we also add XML aggregation, which allows us to go from a set of XML
documents to a single one. SQL combined with XSLT and enriched with
XML variables and aggregation offers all the expressive power needed for
ad-hoc (syntactical) meta-querying. To allow for a form of semantical meta-
querying as well, it suffices to add again an evaluation function that takes
the syntax tree of some query as input and produces the table resulting from
executing the query as output.
The resulting language Meta-SQL is compatible with modern SQL im-
plementations offered by contemporary extensible database systems. Indeed,
these systems support calls to external functions from within SQL expres-
sions, which allows us to implement the XSLT calls. Furthermore, XML vari-
ables and the evaluation function can be implemented using set-valued exter-
nal functions. XML aggregation, finally, can be implemented as a user-defined
aggregate function.

To conclude this section, we mention that, starting with the seminal paper
on HiLog [CKW93], the concept of schema querying has received considerable
attention in the recent database literature. Clearly, schema querying is a spe-
cial kind of meta-querying. For instance, SchemaSQL [LSS01] augments SQL
with generic variables ranging over table names, rows, and column names. It
is not difficult to simulate SchemaSQL in Meta-SQL. Although the focus here
has been on meta-querying as opposed to general XML querying, it should
be understood that Meta-SQL, even without eval, can serve as a general
query language for databases containing XML documents in addition to or-
dinary relational data. Its closeness to standard SQL and object-relational
processing is a major advantage.

5 Advanced Database Server Capabilities

In this chapter we survey several system aspects of advanced database servers.
As before, the selection we present is on the one hand incomplete and driven
by personal taste; on the other, it trusts that features such as transaction
management, distributed database systems, and parallel database systems
are sufficiently treated elsewhere. We will here look into RAID architectures
and disk arrays that are now popular in applications that store and manage
huge amounts of data, temporal facilities of database systems, spatial data,
and specific features of advanced transactional capabilities.
5.1 RAID Architectures

Traditionally, databases have always been kept on magnetic disks, and since
disks are a relatively cheap storage medium these days, it has become common
to use a larger number of disks for storing the data of a database. If these
disks are uniform from a technical point of view, i.e., have the same access
times and storage capacities, it makes sense to organize data on these disks
in such a way that a speed-up in processing time is achieved. The common
way to do so is to distribute data over the available disks such that accesses
can be performed in parallel. However, there is a second aspect to be kept in
mind, that of protecting data against loss and corruption, so that it might
also be a good idea to keep a little redundancy among multiple available
disks.
The most successful form of disk-oriented data storage nowadays is the
disk array or the RAID architecture, which stands for Redundant Array of
Inexpensive Disks [CLG+94]. Its success is mainly due to the fact that it
allows for an adaptable balance between efficiency and safety. In its simplest
form, known as RAID-0, there is a number n of disks which are accessible
through a single disk controller. In such a setting data can be stored so
that parallel access becomes an option. The common way to do so is by
striping data items in a bit-oriented or a block-oriented fashion. With bit-
wise striping, each byte to be stored is spread over several disks, each of
which takes another bit. For example, if there are 8 disks, each can take one
of the bits in a byte; since all 8 bits can be read in parallel, access is 8 times
faster than with a single disk. Under block-wise striping, consecutive storage
blocks are distributed over consecutive disks, usually in a circular fashion.
More precisely, for n disks the ith block of a file or a data set is stored on
disk (i mod n) + 1, or on disk ((i + j - 1) mod n) + 1 if storing the blocks
starts with block 0 on disk j > 1. This is illustrated in Figure 5.1, where block
a0 is stored on disk 1, a1 on disk 2, etc.; for the b blocks storing starts from
disk 2.
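The block-to-disk mapping just described can be written down in a few lines;
the sketch below (with a hypothetical function name and, for illustration,
three disks) uses 1-based disk numbers as in the text:

def disk_for_block(i, n, start_disk=1):
    """Disk (numbered 1..n) holding the i-th block (i = 0, 1, 2, ...) of a file
    that is striped over n disks, with block 0 placed on disk start_disk."""
    return ((i + start_disk - 1) % n) + 1

# with three disks: the a-blocks start on disk 1, the b-blocks on disk 2
print([disk_for_block(i, 3) for i in range(6)])                 # [1, 2, 3, 1, 2, 3]
print([disk_for_block(i, 3, start_disk=2) for i in range(6)])   # [2, 3, 1, 2, 3, 1]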
Clearly, the approach just described increases throughput and shortens
access times through an exploitation of the available parallelism, but appar-
ently this scheme is sensitive to crashes of single or even multiple disks. There
are at least two work-arounds: replication of data, or keeping additional in-
formation through which data can be reconstructed in case of an error. RAID
levels higher than 0 can essentially be distinguished by the way they trade
space utilization for reliability.
A RAID-1 architecture uses mirror disks, so that only half of the available
disks can be used for storing data, and the other half is a copy of the first.
A disk and its mirror are together considered a logical disk. This principle is
illustrated in Figure 5.2, where the striping shown is again block-oriented as
in Figure 5.1. Apparently, RAID-1 is good for applications such as logging in
database systems, where high reliability is mandatory. The underlying idea
is that a disk and its mirror will rarely crash together. For reading data in a
Fig. 5.1. RAID-0 architecture with striping (blocks a0, a1, a2, ... and b0, b1,
b2, ... spread in a round-robin fashion over the disks behind one disk controller)

RAID-1 setting, reading the corresponding disk (or its mirror if the disk has
crashed) is apt; for writing, both disks need to be accessed.

Fig. 5.2. RAID-1 architecture with mirror disks and striping (each data disk is
paired with a mirror disk that holds an identical copy of its blocks)

Other RAID levels partially give up reliability for the sake of space uti-
lization. In particular, RAID-2 uses bit striping; bits distributed over various
disks are additionally encoded, so that data bits are augmented with code
bits. The techniques used are related to those used for other storage com-
ponents as well (hence the name "memory-style ECC") and are often based
on Hamming codes or on block codes, which, for example, encode 4-bit data
in 7-bit code words and are then able to locate and correct single-bit errors.
Thus, 4-bit striping would require seven disks, four of which would take data
bits, the others the additional code bits.
RAID-3 makes use of the observation that it is generally easy for a disk
controller to detect whether one of the attached disks has crashed. If the
goal is just to detect a disk crash, a single parity bit per byte or half-byte
suffices, which would be set according to odd or even parity. For bit striping,
individual bits would again be stored on separate disks, and an extra disk
stores all parity bits. This is illustrated in Figure 5.3 for four data disks.
When data is read in the case shown in Figure 5.3, bits are read from all four
disks; the parity disk is not needed (unless a disk has crashed). However,
when data is written, all five disks need to be written.

Fig. 5.3. RAID-3 architecture (four data disks and one parity disk behind a
single disk controller)
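The effect of the parity information can be sketched as follows: taking the
parity of the strips on the data disks bit position by bit position (i.e., XOR-ing
them) allows the contents of any single crashed data disk to be reconstructed
from the surviving disks and the parity disk. The values below are made up
for illustration:

from functools import reduce

data_disks = [0b1010, 0b0111, 0b1100, 0b0001]       # four data strips (RAID-3)
parity = reduce(lambda x, y: x ^ y, data_disks)     # stored on the parity disk

# suppose disk 2 crashes: XOR of the surviving strips and the parity recovers it
surviving = [d for i, d in enumerate(data_disks) if i != 2]
recovered = reduce(lambda x, y: x ^ y, surviving + [parity])
assert recovered == data_disks[2]
print(bin(recovered))   # 0b1100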

RAID-4 uses block-oriented striping with parity; one parity block per set
of blocks from the other disks is kept on a separate disk. Reading data is now
faster than with RAID-3; however, the parity block can become a bottleneck
when writing data. RAID-5 is also block-oriented, now with distributed par-
ity which tries to avoid that bottleneck. Data and parity are distributed over
all disks; for example, with five disks the parity block for the nth set of blocks
is written to disk (n mod 5) + 1, while the data blocks are stored on the four
other disks. Finally, RAID-6 stores additional information to make the disk
array robust against a simultaneous crash of multiple disks; this is called P
+ Q redundancy. Reed-Solomon codes are used to protect an array against
a parallel crash of two disks, using two additional disks for encoding. Table
5.1 summarizes the various RAID levels.
As the data that is stored in a file system or a database grows, disk arrays
are becoming more and more popular. A trend for the near future seems to
be to make disks more and more "intelligent", so that, for example, searching
can be directed by the disk controller instead of the database or the oper-
ating system. Clearly, disk arrays are particularly suited for data intensive
applications that have to deal with versioning, temporal data, spatial data,
or more generally multimedia data. On the other hand, a clever and efficient
Table 5.1. RAID Levels

Level   Technique
0       Nonredundant
1       Mirrored
2       Memory-Style ECC
3       Bit-Interleaved Parity
4       Block-Interleaved Parity
5       Block-Interleaved Distributed Parity
6       P+Q Redundancy

logical organization of the data in index structures is still crucial for achiev-
ing reasonable performance; see [GG98] for a survey on index structures, and
[Vit01] for one on external memory algorithms and data structures.

5.2 Temporal Support


The next system functionality we discuss derives from the observation that
traditional database systems store data that represents a snapshot of the
current situation in the outside world at a particular (normally the current)
point in time. If an update occurs, data is overwritten, and old values, now
considered no longer valid, are simply forgotten. There are many applications
nowadays where this view is too simplistic, and where a better system support
for time is needed. This, for example, applies to banks which need to keep
account activities and balances around for long periods of time.

Professor

Name      Rank
Mary      Full
Tom       Associate
Laura     Assistant
Bill      Full
Kathryn   Associate

Fig. 5.4. A sample relation for discussing time issues

As a simple example, we here consider the personnel database of an Amer-


ican university holding the relation shown in Figure 5.4. Clearly, such a re-
lation can answer queries like "what is Mary's rank" or "who is (currently)
an associate professor". However, the database is not capable of answering
queries like "what was Mary's rank two years ago", nor of recording facts like
"Laura will be promoted to the next higher rank in two months". A temporal database system
would be able to answer such queries, essentially by keeping several versions
of data items over time. Notice that SQL's data types date and time are not
enough for that purpose, since their proper use would imply that relevant
temporal queries are known at design time.

Since the latter is rarely the case, a temporal database provides system
support for time, and typically distinguishes several kinds of time:

1. Transaction time (or "registration time") denotes the time at which a


particular information is put in the database by an insert or an update
command. This is easily captured by two additional attributes, say, be-
gin and end, denoting the time interval from insertion to deletion of a
particular fact, like in the sample relation shown in Figure 5.5, where ∞
indicates that a tuple has not yet been deleted.

Name      Rank        Transaction Time
                      begin         end
Mary      Associate   25.08.1987    15.12.1992
Mary      Full        15.12.1992    ∞
Tom       Associate   07.12.1992    ∞
Mike      Assistant   10.01.1993    25.02.1984
Laura     Assistant   22.07.1995    ∞
Bill      Assistant   22.07.1985    23.11.1990
Bill      Associate   23.11.1990    22.03.1994
Bill      Full        22.03.1994    ∞
Kathryn   Associate   31.03.1995    ∞

Fig. 5.5. The same relation with transaction time

Transaction time can be used to answer queries like "what was Mary's rank
on 10.12.1992" (a plain-SQL emulation of such a query is sketched after
this list), but on the other hand it only records activities on the database,
not in the application. A tuple becomes valid as soon as it is stored in the
database.
2. Valid time tries to reflect the validity of a fact in the application at hand,
independent of the time at which this fact gets recorded in the database.
Our sample relation could now look as shown in Figure 5.6, where ∞ is
used to denote the fact that something is still valid. Notice that valid
time makes it possible to update data pro-actively, i.e., with an effect for
the future, but also retro-actively, i.e., with an effect for the past.
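Even without built-in temporal support, the intent of both kinds of time can be emulated manually in plain SQL. The following is a minimal sketch, assuming that the intervals of Figures 5.5 and 5.6 are kept in ordinary date columns (here called tt_begin/tt_end and valid_from/valid_to), that the open end ∞ is represented by a null value, and that the two variants of the relation are named Professor_TT and Professor_VT; all of these names are assumptions made for illustration only.

-- transaction time: what was Mary's rank on 10.12.1992?
select Rank
from Professor_TT
where Name = 'Mary'
  and tt_begin <= date '1992-12-10'
  and (tt_end > date '1992-12-10' or tt_end is null);

-- valid time: record pro-actively that Laura is promoted at a future date
update Professor_VT
  set valid_to = date '1995-10-01'
  where Name = 'Laura' and Rank = 'Assistant' and valid_to is null;
insert into Professor_VT (Name, Rank, valid_from, valid_to)
  values ('Laura', 'Associate', date '1995-10-01', null);

Such a manual emulation works, but it also shows why the relevant queries must be anticipated at design time, which is exactly the limitation a temporal database system removes.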

A temporal database is typically capable of combining valid time and trans-


action time, and in addition keeps user-defined time around for being able
to represent whatever a user wants beyond transaction or valid time.
Clearly, a temporal database needs specific language properties for han-
dling time as well as temporal data, an issue that is taken care of in the lan-
guage Temporal SQL (or TSQL for short) [EJS98,TCG+93,ZCF+97]. The
general syntax of a TSQL query has the form

select { select-list }
from { relations-list}

Name      Rank        Valid Time
                      from          to
Mary      Associate   01.09.1987    01.12.1992
Mary      Full        01.12.1992    ∞
Tom       Associate   05.12.1992    ∞
Mike      Assistant   01.01.1993    01.03.1984
Laura     Assistant   01.08.1995    ∞
Bill      Assistant   01.07.1985    31.12.1990
Bill      Associate   01.01.1991    31.03.1994
Bill      Full        31.03.1994    ∞
Kathryn   Associate   01.04.1995    ∞

Fig. 5.6. The sample relation with valid time

where { conditions}
when { time-clauses }

in which the when clause is new. In this clause, several temporal comparison
operators may be used, including before, after, during, overlap, follows, or
precedes, which refer to time intervals. As an example, the query asking for
Mary's rank at the time Tom arrived is written as

select X.Rank
from Professor X, Professor Y
where X.Name = 'Mary' and Y.Name = 'Tom'
when X.interval overlap Y.interval

As can be seen, the time interval stored in a relation can now be accessed
via the .interval extension of the relation name in question. As TSQL (and
more recently TSQL2) gets standardized, we will see temporal capabilities
emerge as ordinary capabilities of database systems in the near future.

5.3 Spatial Data

We next look at an advanced system functionality that has been of interest for
many years already, and that only recently opened the appropriate tracks for
formal research. Spatial data arises when spatial information has to be stored,
which is typically information in two or three dimensions. Examples include
maps, polygons, bodies, shapes, etc. A spatial database supports spatial data
as well as queries to such data, and provides suitable storage structures for
efficient storage and retrieval of spatial data. Applications include geographic
information systems, computer-aided design, cartography, medical imaging,
and more recently multimedia databases [SK98,Sub98,SJ96].
Data models for representing spatial data have several properties that
clearly distinguish them from classical data models:

1. They need to be capable of representing data from an n-dimensional


space, i.e., from a set of points which is infinite, but not even countably
infinite. In other words, the information to be represented is inherently
infinite, so that, similar to deductive databases, only intensional repre-
sentations can be used.
2. The intensional character of a spatial data model has an impact on generic
as well as on user-defined operations, as a corresponding language must
be closed under both types of operations. This is generally difficult due
to the fact that a variety of operations is typically needed.
3. The information to be represented generally does not enjoy the elegant
geometric properties of a structure created by humans, but expresses un-
symmetric phenomena from nature and their visualizations. This requires
specific algorithms for dealing with the information, which are based on
algebraic, geometric, and topological properties.

For illustrating some of the problems with representing spatial data, we briefly
consider the so-called raster-graphics model. In this model, spatial informa-
tion is intensionally represented in discretized form, namely as a finite set of
raster points which are equally distributed over the object in question; this is
reminiscent of a raster graphics screen which is an integer grid of pixels each
of which can be switched on or off (i.e., be set to 1 or 0). Infinity is captured
in this model by assuming that for each point p, infinitely many points in
the neighborhood of p have the same properties as p. Now this model can
exhibit anomalies which are due to the absence of the properties of Euclidean
geometry.
For example, a straight line is represented in the raster model by two
of its raster points. In case a line does not exactly touch two points, it is
assumed that points that are "close" to the line can be used to represent it.
The following situation, illustrated in Figure 5.7, is now possible: Straight
line g1 is represented by points A and B, g2 by A and C, and g3 by D and
E. Apparently, g2 and g3 have an intersection point, which, however, is not
a raster point. So following the raster philosophy, the point closest to the
intersection is chosen as its representative; in the example shown, this is F.
Now as an intersection point, F needs to be a point on line g2; on the other
hand, since it is also a point of g1, it is also an intersection point of g1 and
g2. Therefore, g1 and g2 have two intersections (the other is A), which is
impossible from a classical geometric point of view.
We mention that there are other models for representing spatial data.
Moreover, many data structures exist for storing such multi-dimensional data
[GG98]. Efficient algorithms are then needed for answering typical queries,
which may be exact or partial match queries or, more often, range queries.
Imagine, for example, the data to represent a map of some region of the world;
then a range query might ask for all objects having a non-empty intersection
with a given range, e.g., "all cities along the shores of a river, with a distance
of at most 50 miles from a given point" .
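In an SQL-like syntax such a range query might be sketched as follows; the relations City and River, their geometry columns, and the predicates touches and within_distance stand for a hypothetical spatial extension and are assumptions for illustration, not the interface of any particular system:

select c.Name
from City c, River r
where touches(c.Shape, r.Shape)                      -- city lies on the shore of the river
  and within_distance(c.Location, :given_point, 50); -- at most 50 miles from the given point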

[Figure: an integer grid of raster points with the labeled points A, C, D, E and the chosen intersection representative]

Fig. 5.7. Line intersections in the raster-graphics data model

Similar problems arise with image data, with complex graphics, and with
pictures, and the situation is technically made more complicated by the facts
that (i) often all these various types of data occur together, and (ii) pic-
tures may be silent or moving, i.e., video data. A major problem then is to
guarantee a continuous retrieval of a specific bandwidth for a certain period
of time, e.g., in applications such as on-demand video [EJH+97]. Another
problem area is given by image processing based on the contents of an image
database, which amounts to the task of not only retrieving images, but also
interpreting them or searching them for specific patterns. Finally, a combina-
tion of spatial data with temporal aspects has to deal with geometries that
change over time; if changes occur continuously, the data represents moving
objects, an area whose study has only just begun [GBE+00].

5.4 Transactional Capabilities, Workflows, and Web Services

The final system capability we look into here is transactions. Database sys-
tems allow shared data access to multiple users and simultaneously provide
fault tolerance. In the 1970s, the transaction concept [Gra78j emerged as a
tool to achieve both purposes. The basic idea (the "ACID principle") is to
consider a given program operating on a database as a logical unit (Atomic-
ity), to require that it leaves the Consistency of the database invariant, to
process it as if the database was at its exclusive disposal (Isolation), and to
make sure that program effects survive later failures (Durability). To put this
to work, two services need to be provided: Concurrency control brings along
synchronization protocols which allow an efficient and correct access of mul-
tiple transactions to a shared database; recovery provides protocols that can

react to failures automatically [WV02,BN97,Vos95,GUW02]. (Recent devel-


opments have even suggested to unify the two [AVA+94,SWY93,VYB+95].)
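To illustrate the ACID principle at the SQL level, consider the classical funds-transfer example; the table and column names used here are assumptions made for illustration only. Either both updates take effect and become durable, or, after a failure or an explicit rollback, neither of them does:

start transaction;
-- debit one account and credit another as a single logical unit
update Account set Balance = Balance - 100 where AccNo = 1;
update Account set Balance = Balance + 100 where AccNo = 2;
commit;
-- a rollback (or a failure before the commit) would undo both updates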
Even outside database systems, the transaction concept pops up in TP
monitors [BN97] and in CORBA, a standard middleware platform for dis-
tributed object computing [OHE97]. Moreover, it plays a central role in
present-day server federations, for example, in electronic commerce and shop-
ping, where product catalogs, order processing, and payments are typically
handled by individual servers that have to cooperate according to transac-
tional guarantees.
The transaction concept delivers ways to guarantee correct executions
over multiple, concurrent operations, and it does so both for simple, page-
level operations, e.g., reads and writes, and for complex, user- or application-
oriented operations, e.g., SQL updates, message invocations on objects, da-
tabase processing triggered from a Java program. It has turned out to be
both an abstraction mechanism and an implementation technique [Kor95].
However, various problems with flat ACID transaction models and tradi-
tional, single-server concurrency control mechanisms remain, including the
following:

• long transactions, e.g., in CAD applications, cause concurrency conflicts,


• application semantics, e.g., control-flow dependencies as in reservation
transactions, application-level parallelism, or alternative actions, are
vastly lost when restricting the attention to read and write operations,
since ACID transactions at the page level of abstraction recover data,
but not activities,
• collaboration and cooperation required in modern application scenarios
are not supported.

Consequently, the issues advanced transaction models and transactional con-


cepts try to capture include user control over transaction execution, ade-
quate modeling of complex activities, long-running and long-living activities,
open-ended activities, compensating operations, cooperation, interactiveness,
modular construction of database server software, e.g., layered architecture,
or system federations. There are two basic features advanced models can
bring along:

1. Further operational abstractions (beyond the traditional reads and writes


on pages), and
2. a departure from strict ACID.

For the former, many options are available, including providing more oper-
ations, providing higher-level operations, providing more execution control
within and between transactions, or providing more transaction structure.
Structure, in turn, can refer to parallelism inside a transaction, it can refer
to transactions inside other transactions, or it can even refer to transactions
plus other operations inside other transactions. In essence, the goal thus is to

enhance the expressive power of the transaction concept, and to do so in such


a way that not only complex or long-running activities, but also structured
collections of such activities can be modeled adequately.
The interested reader will find a thorough introduction to this subject
in [WV02], where the distinction between page-level and object-level concur-
rency control and recovery is consequently developed and studied. One fun-
damental idea for a departure from the page level of abstraction is to consider
higher-level operations and their implementation through page-level opera-
tions, and to take this procedure upwards along the hierarchy of layers found
in a typical database server (e.g., as shown in Figure 1.2). In other words,
transactions are allowed to contain other transactions as sub-transactions,
thereby giving transactions a tree structure whose leaves are elementary oper-
ations, but whose other nodes all represent transactions. If a sub-transaction
appears atomic to its parent, it can be reset without causing the parent to
abort, too. Furthermore, if sub-transactions are isolated from each other, they
can execute in parallel. Two prominent special cases of nested transactions
are closed ones, in which sub-transactions have to delay their commit until
the end of their root transaction, and open ones in which sub-transactions
are allowed to commit autonomously. If all leaves in a transaction tree are
of the same height, multilevel transactions result, in which the generation
of sub-transactions can be driven by the functional layers of the underly-
ing system. The theory of such "nested" transactions, initiated in [Mos85],
has been studied and developed intensively in recent years [Elm92,Wei91]. It
turns out that most of what has been developed for page-level transactions,
e.g., conflict-based serializability, two-phase locking protocols or redo-undo
recovery, can nicely be generalized to object models of transactions.
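At the SQL interface, savepoints give a rough flavour of this idea: a savepoint can be rolled back without aborting the enclosing transaction, much like resetting an atomic sub-transaction does not force its parent to abort. The following sketch (with assumed tables Orders and Stock) is only an analogy; genuine nested or multilevel transactions are a matter of the underlying transaction manager, not of the SQL text itself:

start transaction;                                 -- root transaction
insert into Orders values (4711, 'open');          -- work done by the root
savepoint sub1;                                    -- begin of a "sub-transaction"
update Stock set Qty = Qty - 1 where Item = 'A';
rollback to savepoint sub1;                        -- reset only the sub-transaction
update Stock set Qty = Qty - 1 where Item = 'B';   -- root continues with an alternative action
commit;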
Research over the past ten years has investigated extensions of the classi-
cal transaction domain to the dimensions of operation models, data models,
and system models; see, for example, [RS95]. The most recent results of these
investigations nicely demonstrate the desire of database developers to open
their systems up for enterprise-wide integration and collaboration with other
systems, as they include extensible transaction models [Elm92], customized
transaction management [GHM96,GHS95], or frameworks for the specifica-
tion of transaction models and their properties. For example, ACTA [CR94] is
a tool for synthesizing extensible models and can be used for specification of
and reasoning about transaction effects and interactions; TSME [GHK+94]
is a programmable system supporting implementation-independent specifi-
cation of application-specific transaction models and configuration of trans-
action management mechanisms to enforce such models [RS95]. Customized
transaction management ensures the correctness and reliability of distributed
applications which implement processes that access heterogeneous systems;
at the same time, it supports the functionality each particular application or
process requires. Ideally, it supports an extensible transaction model and
a management mechanism that are application-specific, user-defined, and

multi-system. Moreover, it copes with changes in the correctness and relia-


bility requirements of applications, and in the transactional capabilities local
systems provide.
Especially in this area, a convergence can be noted of transactional ca-
pabilities and process-orientation, which comes from the desire to exploit
transactional properties in the context of the automated parts of a business
process, i.e., in workflows. In a nutshell, workflows are activities involving
the coordinated execution of multiple tasks performed by different process-
ing entities, or procedures where documents, information or tasks are passed
between participants according to defined sets of rules to achieve, or con-
tribute to, an overall (business) goal [RS95]. Tasks represent work to be done,
and can be specified as textual descriptions, forms, messages, computer pro-
grams, etc. Processing entities can be humans or software systems, and can
typically assume different roles within the execution of a workflow. Work-
flow management [LR00] denotes the control and coordination of (multiple)
workflow executions; a workflow management system manages the scheduling
of tasks and verifies the constraints that are defined for transitions among
activities. Thus, workflow management aims at modeling, verifying, optimiz-
ing, and controlling the execution of processes [GHS95,RS95,LR00,AH02]; it
allows one to combine a data-oriented view of applications, which is the tradi-
tional one for an information system, with a process-oriented one in which
(human or machine) activities and their occurrences over time are modeled
and supported properly. The field has gained considerable interest recently.
Workflow execution requirements include the support for long-running ac-
tivities with or without user interaction, application-dependent correctness
criteria for executions of individual and concurrent workflows, adequate inte-
gration with other systems (e.g., file managers, database systems, which have
their own execution or correctness requirements), reliability and recoverabil-
ity w.r.t. data, or the option of compensating activities instead of undoing
them. It is these requirements that suggest to exploit the transaction concepts
in this context.
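To give a flavour of the bookkeeping a workflow management system performs, the following relational sketch records tasks, the transition rules between them, and the state of running workflow instances; the schema is purely an illustrative assumption, not the design of any particular product:

create table Task (TaskId integer primary key,
                   Name varchar(50), Performer varchar(30));
create table Transition (FromTask integer references Task,
                         ToTask integer references Task,
                         TransCondition varchar(100));   -- rule governing the transition
create table StepInstance (WorkflowId integer,
                           TaskId integer references Task,
                           State varchar(12));           -- e.g., 'ready', 'running', 'done', 'compensated'

Scheduling a task then amounts to inserting a StepInstance tuple in state 'ready' once the conditions of all incoming transitions are satisfied; the transactional questions discussed above concern how groups of such steps are allowed to commit or to be compensated.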
Advanced transaction models as well as customized transaction manage-
ment seem suited to meet the requirements imposed by workflow manage-
ment, since their characteristics concern issues such as transaction structure,
intra-transaction parallelism, inter-transaction execution dependencies, re-
laxed isolation requirements, restricted failure atomicity, controlled termina-
tion along the lines of [AVA+94,SWY93,VYB+95], or semantic correctness
criteria. Indeed, several proposals already go in this direction [BDS+93];
prototypical systems include Exotica [AAA+96] or METUflow [AAH+97].
However, traditional transactional techniques rarely suffice, since transac-
tion models typically provide a predefined set of properties which may or
may not be required by the semantics of a workflow. Also, processing en-
tities involved in workflow execution may not provide support for facilities
required by a specific transaction model. Thus, transactional workflows have

their specific properties, and research has only recently begun to study such
controversial issues as commit/abort vs. fail, compensation vs. undo, inter-
ruptability of long-running activities, coordination and collaboration even at
a transactional level, transactional vs. non-transactional tasks, decoupling
transactional properties (in particular atomicity and isolation) into appro-
priately small spheres, serializability vs. non-serializable (e.g., goal-correct)
executions [RMB+93,VV93b], and the distinction between local and global
correctness criteria [EV97].
To conclude our system considerations for databases, we briefly touch
upon an area that is of increasing interest these days, and that at the same
time uses database management systems as an "embedded" technology hardly
visible from the outside. Transactional workflows become particularly rele-
vant today in the context of the Internet, which is, among other things, a
platform for offering electronic services. While Internet services in the past
have widely relied on forms, present-day services are offered over the Web
and are more and more oriented towards an automated use of computers as
well as an automated exchange of documents between them. A Web service
aims at the provision of some kind of service that represents an interoperation
between multiple service providers. For example, one could think of a moving
service on the Web that combines a service for arranging the moving of fur-
niture with a service that orders a rental car and a service for changing the
address of the mover in various places. More common already are electronic
shopping services in which a catalog service is combined with a payment col-
lection service and a shipping service. In business-to-business scenarios, Web
services come in the form of marketplaces where buying, selling, and trading
within a certain community (e.g., the automotive industry) is automated.
Each service by itself typically relies on a database system.
From a conceptual viewpoint, each individual service could be perceived
as a workflow with its underlying transactional capabilities, so that the goal
becomes to integrate these workflows into a common one that can still provide
certain transactional guarantees. Thus, what was said above about advanced
transactions becomes readily applicable. On the other hand, there are more
conceptual problems to make Web services fly, including ways for uniform
communication so that services can talk to each other in a standardized way
(in particular beyond database system borders), or possibilities to describe,
publish, and find Web services conveniently and easily. A recent account of
the situation in this area is provided by [CGS01].

6 Conclusions and Outlook


After a long period of stability in the area of commercial products, the major
database system vendors have introduced a host of novel features into their
systems in recent years. Conceptually, the object-relational approach has re-
placed pure relational systems, and the current version of the SQL language

incorporates the object-relational approach, but also XML functionality to an
increasing degree. Moreover, universal servers based on object capabilities
promise to provide appropriate functionality and efficient access to data types
that range from integers and characters to multimedia types such as audio
and video. System-wise, the shift to component technology can also be seen
in database products, the idea being that different capabilities in a large sys-
tem could be supplied by top-of-the-line products suitably plugged together.
Finally, databases on the Internet and as part of intranets and extranets are
growing in importance. These developments make database systems an ad-
vanced technology that can indeed be exploited in strategic enterprise-wide
software landscapes; indeed, database systems are a solid foundation for the
data modeling, storage, and processing needs of current, but also of future
applications.
Clearly, there are new challenges on the horizon. Technically, database-
system functionality is already available in palmtop computers, and we can
expect it on smartcards soon [BBP+00]; we can also expect mobile clients
that connect to stationary servers while on the move. Another such challenge
is data streams [BBD+02], which form a new model of data processing in
which data no longer comes in discrete, persistent relations, but rather
arrives in large and continuous streams whose intensity and frequency even
vary over time. Smartcards as well as streams pose new problems for query
processing and optimization, or for transaction atomicity, to name just two
areas. The former also require a radical departure from traditional database
system architecture, which has been adding complexity for many years; future
systems will have to be easier to manage and administer, a vision that is
advocated in [CW00].
Application-wise, data integration, for example in the context of search
engines for the Web, will become even more relevant in the future, and one of
the directions databases are taking here is the support of semi-structured data
and XML [ABS00,Bun97,SV98,Wil00]. Closely related, text databases and,
more generally, databases that can store structured documents are gaining
the attention of the research community [NV97,NV98]; finally, digital libraries
and databases that form a basis for electronic commerce are arriving [Run00].
So the future looks bright, and the database field is not at all in danger of
losing its attractiveness.

References

[AAA+96] Alonso, G., Agrawal, D., El Abbadi, A., Kamath, M., Günthör, R.,
Mohan, C., Advanced transaction models in workflow contexts, Proc.
12th IEEE Int. Conf. on Data Engineering, 1996, 574-581.
[AAH+97] Arpinar, I.B., Arpinar, S., Halici, U., Dogac, A., Correctness of work-
flows in the presence of concurrency, Proc. 3rd Int. Workshop on Next
Generation Information Technology and Systems, Neve Ilan, Israel,
1997, 182-192.
[ABD+89] Atkinson, M., Bancilhon, F., DeWitt, D., Dittrich, K., Maier, D.,
Zdonik, S., The object-oriented database system manifesto, Proc. 1st
Int. Conf. on Deductive and Object-Oriented Databases, 1989,40-57.
[ABS00] Abiteboul, S., Buneman, P., Suciu, D., Data on the Web, Morgan
Kaufmann Publishers, San Francisco, CA, 2000.
[AH02] van der Aalst, W., van Hee, K., Workflow management - models,
methods, and systems, The MIT Press, Cambridge, MA, 2002.
[AHV95] Abiteboul, S., Hull, R., Vianu, V., Foundations of databases, Addison-
Wesley, Reading, MA, 1995.
[AK98] Abiteboul, S., Kanellakis, P.C., Object identity as a query language
primitive, Journal of the ACM 45, 1998, 798-842.
[AV82] Apt, K.R., Van Emden, M.H., Contributions to the theory of logic
programming, Journal of the ACM 29, 1982, 841-862.
[AV00] Abiteboul, S., Vianu, V., Queries and computation on the Web, The-
oretical Computer Science 239, 2000, 231-255.
[AVA+94] Alonso, G., Vingralek, R., Agrawal, D., Breitbart, Y., El Abbadi, A.,
Schek H.-J., Weikum, G., Unifying concurrency control and recovery
of transactions, Information Systems 19, 1994, 101-115.
[BBD+02] Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J., Models
and issues in data stream systems, Technical Report No. 19, Database
Research Group, Stanford University, 2002.
[BBP+00] Bobineau, C., Bouganim, L., Pucheral, P., Valduriez, P., PicoDBMS:
scaling down database techniques for the smartcard, Proc. 26th Int.
Conf. on Very Large Data Bases, 2000, 11-20.
[BC00] Bonifati, A., Ceri, S., Comparative analysis of five XML query lan-
guages, ACM SIGMOD Record 29(1), 2000, 68-79.
[BCN92] Batini, C., Ceri, S., Navathe, S.B., Conceptual database design - an
Entity-Relationship approach, Benjamin/Cummings, Redwood City,
CA,1992.
[BDS+93] Breitbart, Y., Deacon, A., Schek, H.-J., Sheth, A., Weikum, G.,
Merging application-centric and data-centric approaches to support
transaction-oriented multi-system workflows, ACM SIGMOD Record
22(3), 1993, 23-30.
[BFG01a] Baumgartner, R., Flesca, S., Gottlob, G., The Elog Web extraction
language, R. Nieuwenhuis, A. Voronkov (eds.), Logic for Program-
ming, Artificial Intelligence, and Reasoning, Lecture Notes in Arti-
ficial Intelligence 2250, 8th International Conference on Logic for
Programming, Artificial Intelligence, and Reasoning, Springer-Verlag,
2001, 548-560.
[BFG01b] Baumgartner, R., Flesca, S., Gottlob, G., Declarative information ex-
traction, Web crawling, and recursive wrapping with Lixto, T. Eiter,
W. Faber, M. Truszczynski (eds.), Logic Programming and Nonmono-
tonic Reasoning, Lecture Notes in Artificial Intelligence 2179, LP-
NMR 2001, 6th International Conference on Logic Programming and
Nonmonotonic Reasoning, Springer-Verlag, 2001, 21-41.
[BK89] Bertino, E., Kim, W., Indexing techniques for queries on nested ob-
jects, IEEE Trans. Knowledge and Data Engineering 1, 1989, 196-
214.

[BN97] Bernstein, P.A., Newcomer, E., Principles of transaction processing,
Morgan Kaufmann Publishers, San Francisco, CA, 1997.
[Bro01] Brown, P., Object-relational database development - a plumber's
guide, Prentice Hall, Upper Saddle River, NJ, 2001.
[Bun97] Buneman, P., Semistructured data, Proc. 16th ACM SIGACT-
SIGMOD-SIGART Symposium on Principles of Database Systems,
1997, 117-121.
[CBB+00] Cattell, R.G.G., Barry, D., Berler, M., Eastman, J., Jordan, D., Rus-
sell, C., Schadow, O., Stanienda, T., Velez, F. (eds.), The Object Data
Standard: ODMG 3.0, Morgan Kaufmann Publishers, San Francisco,
CA, 2000.
[CCW00] Ceri, S., Cochrane, R.J., Widom, J., Practical applications of trig-
gers and constraints: successes and lingering issues, Proc. 26th Int.
Conf. on Very Large Data Bases, 2000, 254-262.
[CF97] Ceri, S., Fraternali, P., Designing database applications with objects
and rules - the IDEA methodology, Addison-Wesley, Reading, MA,
1997.
[CFP00] Ceri, S., Fraternali, P., Paraboschi, S., XML: current developments
and future challenges for the database community, C. Zaniolo, P.C.
Lockemann, M.H. Scholl, T. Grust (eds.), Lecture Notes in Com-
puter Science 1777, 7th Int. Conf. on Extending Database Technology
(EDBT 2000), Springer-Verlag, Berlin, 2000, 3-17.
[CGH+94] Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakon-
stantinou, Y., Ullman, J.D., Widom, J., The TSIMMIS project: in-
tegration of heterogeneous information sources, Proc. 10th Meeting
of the Information Processing Society of Japan, Tokyo, Japan, 1994,
7-18.
[CGS01] Casati, F., Georgakopoulos, D., Shan, M.-C. (eds.), Technologies for
e-services, Lecture Notes in Computer Science 2193, 2nd Interna-
tional Workshop (TES 2001), Springer-Verlag, Berlin, 2001.
[CGT90] Ceri, S., Gottlob, G., Tanca, L., Logic programming and databases,
Springer-Verlag, Berlin, 1990.
[Cha98] Chamberlin, D., A complete guide to DB2 universal database, Morgan
Kaufmann Publishers, San Francisco, CA, 1998.
[Che76] Chen, P.P.S., The Entity-Relationship model: toward a unified view
of data, ACM Trans. Database Systems 1, 1976, 9-36.
[CKW93] Chen, W., Kifer, M., Warren, D.S., HiLog: A foundation for higher-
order logic programming, Journal of Logic Programming 15, 1993,
187-230.
[CLG+94] Chen, P.M., Lee, E.K., Gibson, G.A., Katz, R.H., Patterson, D.A.,
RAID: high-performance, reliable secondary storage, ACM Comput-
ing Surveys 26, 1994, 145-185.
[CM94] Ceri, S., Manthey, R., Chimera: a model and language for active
DOOD systems, Proc. 2nd International East/West Database Work-
shop, Springer-Verlag, Berlin, 1994, 3-16.
[Cod70] Codd, E.F., A relational model of data for large shared data banks,
Communications of the ACM 13, 1970, 377-387.
[CR94] Chrysanthis, P.K., Ramamritham, K., Synthesis of extended transac-
tion models using ACTA, ACM Trans. Database Systems 19, 1994,
450-491.

[CRF01] Chamberlin, D., Robie, J., Florescu, D., Quilt: an XML query lan-
guage for heterogeneous data sources, Proc. 3rd Int. Workshop on the
Web and Databases (WebDB 2000), in [SV01].
[CRL+02] Cagle, K., Russell, M., Lopez, N., Maharry, D., Saran, R., Early
Adopter XQuery, Wrox Press, 2002.
[CW00] Chaudhuri, S., Weikum, G., Rethinking database system architecture:
towards a self-tuning RISC-style database system, Proc. 26th Int.
Conf. on Very Large Data Bases, 2000, 1-10.
[CZ01] Chaudhri, A.B., Zicari, R., Succeeding with object databases - a prac-
tical look at today's implementations with Java and XML, John Wiley
& Sons, New York, 2001.
[DD97] Date, C.J., Darwen, H., A guide to the SQL standard, Addison-
Wesley, Reading, MA, 4th edition, 1997.
[DFS99] Deutsch, A., Fernandez, M.F., Suciu, D., Storing semistructured data
with STORED, Proc. ACM SIGMOD International Conference on
Management of Data, 1999, 431-442.
[EJH+97] Elmagarmid, A.K., Jiang H., Helal A.A., Joshi A., Ahmed M., Video
database systems - issues, products, and applications, Kluwer Aca-
demic Publishers, 1997.
[EJS98] Etzion, O., Jajodia, S., Sripada, S. (eds.), Temporal databases:
research and practice, Lecture Notes in Computer Science 1399,
Springer-Verlag, Berlin, 1998.
[Elm92] Elmagarmid, A.K. (ed.), Database transaction models for advanced
applications, Morgan Kaufmann Publishers, San Francisco, CA, 1992.
[EN00] Elmasri, R., Navathe, S.B., Fundamentals of database systems,
Addison-Wesley, Reading, MA, 3rd edition, 2000.
[EV97] Ebert, J., Vossen, G., I-serializability: generalized correctness for
transaction-based environments, Information Processing Letters 63,
1997, 221-227.
[FLM98] Florescu, D., Levy, A., Mendelzon, A., Database techniques for the
World-Wide Web: a survey, ACM SIGMOD Record 27(3), 1998,59-
75.
[FST00] Fernandez, M.F., Suciu, D., Tan, W.C., SilkRoute: trading between
relations and XML, Computer Networks 33, 2000, 723-745.
[FV95a] Fahrner, C., Vossen, G., A survey of database design transformations
based on the Entity-Relationship model, Data & Knowledge Engi-
neering 15, 1995, 213-250.
[FV95b] Fahrner, C., Vossen, G., Transforming relational database schemas
into object-oriented schemas according to ODMG-93, Lecture Notes
in Computer Science 1013, 4th Int. ConJ. on Deductive and Object-
Oriented Databases, Springer-Verlag, Berlin, 1995,429-446.
[Gar98] Gardner S.R., Building the data warehouse, Communications of the
ACM 41(9), 1998, 52-60.
[GBE+00] Güting, R.H., Böhlen, M.H., Erwig, M., Jensen, C.S., Lorentzos, N.A.,
Schneider, M., Vazirgiannis, M., A foundation for representing and
querying moving objects, ACM Transactions on Database Systems
25, 2000, 1-42.
[GG98] Gaede, V., Günther, O., Multidimensional access methods, ACM
Computing Surveys 30, 1998, 170-231.

[GHK+94] Georgakopoulos, D., Hornick, M., Krychniak, P., Manola, F., Specifi-
cation and management of extended transactions in a programmable
transaction environment, Proc. 10th IEEE Int. Conf. on Data Engi-
neering, 1994, 462-473.
[GHM96] Georgakopoulos, D., Hornick, M., Manola, F., Customizing transac-
tion models and mechanisms in a programmable environment sup-
porting reliable workflow automation, IEEE Trans. Knowledge and
Data Engineering 8, 1996, 630-649.
[GHS95] Georgakopoulos, D., Hornick, M., Sheth, A., An overview of workflow
management: from process modeling to workflow automation infras-
tructure, Distributed and Parallel Databases 3, 1995, 119-153.
[Gra78] Gray, J., Notes on data base operating systems, R. Bayer, M.R. Gra-
ham, G. Seegmüller (eds.), Operating systems - an advanced course,
Lecture Notes in Computer Science 60, Springer-Verlag, Berlin, 1978,
393-481.
[GUW02] Garcia-Molina, H., Ullman, J.D., Widom, J., Database systems: the
complete book, Prentice Hall, Upper Saddle River, NJ, 2002.
[Hoq00] Hoque, R., XML for real programmers, Morgan Kaufmann Publish-
ers, San Francisco, CA, 2000.
[JS82] Jäschke, G., Schek, H.J., Remarks on the algebra of non first nor-
mal form relations, Proc. 1st ACM SIGACT-SIGMOD Symposium
on Principles of Database Systems, 1982, 124-138.
[Kay01] Kay, M., XSLT programmer's reference, 2nd edition, Wrox Press,
2001.
[KKS92] Kifer, M., Kim, W., Sagiv, Y., Querying object-oriented databases,
Proc. ACM SIGMOD Int. Conf. on Management of Data, 1992, 393-
402.
[KLW95] Kifer, M., Lausen, G., Wu, J., Logical foundations of object-oriented
and frame-based languages, Journal of the ACM 42, 1995, 741-843.
[KM94] Kemper A., Moerkotte G., Object-oriented database management -
applications in engineering and computer science, Englewood-Cliffs,
NJ, Prentice-Hall, 1994.
[Kor95] Korth, H.F., The double life of the transaction abstraction: funda-
mental principle and evolving system concept, Proc. 21st Int. Conf.
on Very Large Data Bases, 1995, 2-6.
[Kos99] Kossmann, D. (ed.), Special issue on XML, Bulletin of the IEEE
Technical Committee on Data Engineering 22(3), 1999.
[KRR+00] Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins,
A., Upfal, E., The web as a graph, Proc. 19th ACM SIGMOD-
SIGACT-SIGART Symp. on Principles of Database Systems, 2000,
1-10.
[Liu99] Liu, M., Deductive database languages: problems and solutions, ACM
Computing Surveys 31, 1999, 27-62.
[LR00] Leymann, F., Roller, D., Production workflow - concepts and tech-
niques, Prentice Hall, Upper Saddle River, NJ, 2000.
[LSS01] Lakshmanan, L.V.S., Sadri, F., Subramanian, I.N., SchemaSQL: an
extension of SQL for multidatabase interoperability, ACM Transac-
tions on Database Systems 26, 2001, 476-519.
[LV98] Lausen, G., Vossen, G., Object-oriented databases: models and lan-
guages, Addison-Wesley, Harlow, UK, 1998.

[Mak77] Makinouchi, A., A consideration on normal form of not-necessarily-


normalized relation in the relational data model, Proc. 3rd Int. Conf.
on Very Large Data Bases, 1977, 447-453.
[ME00] Melton, J., Eisenberg, A., Understanding SQL and Java together - a
guide to SQLJ, JDBC, and related technologies, Morgan Kaufmann
Publishers, San Francisco, CA, 2000.
[Mos85] Moss, J.E.B., Nested transactions: an approach to reliable distributed
computing, The MIT Press, Boston, MA, 1985.
[MR92] Mannila, H., Räihä, K.J., The design of relational databases, Addison-
Wesley, Reading, MA, 1992.
[MS01] Melton, J., Simon, A., SQL:1999 - understanding relational language
components, Morgan Kaufmann Publishers, San Francisco, CA, 2001.
[MV00a] Masermann, U., Vossen, G., Design and implementation of a novel
approach to keyword searching in relational databases, J. Stuller,
J. Pokorny, B. Thalheim, Y. Masunaga (eds.), Current Issues in
Databases and Information Systems, Lecture Notes in Computer
Science 1884, East-European Conference on Advances in Databases
and Information Systems (ADBIS-DASFAA 2000), Springer-Verlag,
Berlin, 2000, 171-184.
[MV00b] Masermann, U., Vossen, G., SISQL: schema-independent database
querying (on and off the Web), Proc. 4th International Conference
on Database Engineering and Applications, IEEE Computer Society
Press, Los Alamitos, CA, 2000, 55-64.
[NV97] Neven, F., Van den Bussche, J., On implementing structured docu-
ment query facilities on top of a DOOD, F. Bry, R. Ramakrishnan,
K. Ramamohanarao (eds.), Deductive and Object-Oriented Databases,
Lecture Notes in Computer Science 1941, 5th Int. Conf. on Deduc-
tive and Object-Oriented Databases, Springer-Verlag, Berlin, 1997,
351-367.
[NV98] Neven, F., Van den Bussche, J., Expressiveness of structured docu-
ment query languages based on attribute grammars, Proc. 17th ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database
Systems, 1998, 11-17.
[NVV+99] Neven, F., Van den Bussche, J., Van Gucht, D., Vossen, G., Typed
query languages for databases containing queries, Information Sys-
tems 24, 1999, 569-595.
[OHE97] Orfali, R., Harkey, D., Edwards, J., Instant CORBA, John Wiley &
Sons, New York, 1997.
[PD99] Paton, N.W., Diaz, O., Active database systems, ACM Computing
Surveys 31, 1999,63-103.
[RG00] Ramakrishnan, R., Gehrke, J., Database management systems,
McGraw-Hill, Boston, MA, 2nd edition, 2000.
[Ric01] Riccardi, G., Principles of database systems with Internet and Java
applications, Addison-Wesley, Boston, MA, 2000.
[RMB+93] Rastogi, R., Mehrotra, S., Breitbart, Y., Korth, H.F., Silberschatz,
A., On correctness of non-serializable executions, Proc. 12th ACM
SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems,
1993, 97-108.

[RS95] Rusinkiewicz, M., Sheth, A., Specification and execution of transac-


tional workflows, W. Kim (ed.), Modern Database Systems, Addison-
Wesley, Reading, MA, 1995, 592-620.
[Run00] Rundensteiner, E.A. (ed.), Special issue on database technology in
e-commerce, Bulletin of the IEEE Technical Committee on Data En-
gineering 23(1), 2000.
[SAH+84] Stonebraker, M., Anderson, E., Hanson, E.N., Rubenstein, W.B.,
QUEL as a data type, Proc. ACM SIGMOD Int. Conf. on Man-
agement of Data, 1984, 208-214.
[SAH87] Stonebraker, M., Anton, J., Hanson, E., Extending a database system
with procedures, ACM Transactions on Database Systems 12, 1987,
350-376.
[SB98] Stonebraker, M., Brown, P., Object-relational DBMSs - the next great
wave, Morgan Kaufmann Publishers, San Francisco, CA, 2nd edition,
1998.
[SJ96] Subrahmanian, V.S., Jajodia, S. (eds.), Multimedia database systems
- issues and research directions, Springer-Verlag, Berlin, 1996.
[SK98] Sheth, A., Klas, W. (eds.), Multimedia data management - using
metadata to integrate and apply digital media, McGraw-Hill, Boston,
MA,1998.
[SKS02] Silberschatz, A., Korth, H.F., Sudarshan, S., Database system con-
cepts, 4th edition, McGraw-Hill Higher Education, Boston, MA, 2002.
[SS90] Scholl, M.H., Schek, H.J., A relational object model, S. Abiteboul,
P.C. Kanellakis (eds.), Lecture Notes in Computer Science 470, 3rd
Int. Conf. on Database Theory, Springer-Verlag, Berlin, 1990, 89-
105.
[SSB+01] Shanmugasundaram, J., Shekita, E., Barr, R., Carey, M., Lindsay,
B., Pirahesh, H., Reinwald, B., Efficiently publishing relational data
as XML documents, The VLDB Journal 10,2001, 133-154.
[STZ+99] Shanmugasundaram, J., Tufte, K., Zhang, C., He, G., DeWitt, D.,
Naughton, J., Relational databases for querying XML documents:
limitations and opportunities, Proc. 25th Int. Conf. on Very Large
Data Bases, 1999, 302-314.
[Sub98] Subrahmanian, V.S., Principles of multimedia database systems, Mor-
gan Kaufmann Publishers, San Francisco, CA, 1998.
[SV98] Suciu, D., Vossen, G. (eds.), Special issue on semistructured data,
Information Systems 23(8), 1998.
[SV01] Suciu, D., Vossen, G. (eds.), The World Wide Web and databases,
Lecture Notes in Computer Science 1997, 3rd International Work-
shop WebDB 2000, Dallas, TX, USA, May 2000, Selected Papers,
Springer-Verlag, Berlin, 2001.
[SWY93] Schek, H.J., Weikum, G., Ye, H., Towards a unified theory of con-
currency control and recovery, Proc. 12th ACM SIGACT-SIGMOD-
SIGART Symp. Principles of Database Systems, 1993, 300-311.
[TCG+93] Tansel, A.V., Clifford, J., Gadia, S., Jajodia, S., Segev, A., Snod-
grass, R., Temporal databases - theory, design, and implementation,
Benjamin/Cummings, Redwood City, CA, 1993.
[Tha00] Thalheim, B., Entity-Relationship modeling - foundations of database
technology, Springer-Verlag, Berlin, 2000.

[Ull88] Ullman, J.D., Principles of Database and Knowledge-Base Systems,
vol. I, Computer Science Press, Rockville, MD, 1988.
[Ull89] Ullman, J.D., Principles of database and knowledge-base systems, vol.
II, Computer Science Press, Rockville, MD, 1989.
[Ull00] Ullman, J.D., Information integration using logical views, Theoretical
Computer Science 239, 2000, 189-210.
[Via97] Vianu, V., Rule-based languages, Annals of Mathematics and Artifi-
cial Intelligence 19, 1997, 215-259.
[Vit01] Vitter, J.S., External memory algorithms and data structures: dealing
with massive data, ACM Computing Surveys 33, 2001, 209-271.
[VK76] Van Emden, M.H., Kowalski, R.A., The semantics of predicate logic
as a programming language, Journal of the ACM 23, 1976, 733-742.
[Vos95] Vossen, G., Database transaction models, J. van Leeuwen (ed.), Com-
puter Science Today - Recent Trends and Developments, Lecture
Notes in Computer Science 1000, Springer-Verlag, Berlin, 1995, 560-
574.
[Vos96] Vossen, G., Database theory: an introduction, A. Kent, J.G. Williams
(eds.), Encyclopedia of Computer Science and Technology, vol. 34,
Supplement 19, Marcel Dekker, New York, 1996, 85-127.
[Vos97] Vossen, G., The CORBA specification for cooperation in heteroge-
neous information systems, P. Kandzia, M. Klusch (eds.), Cooperative
Information Agents, Lecture Notes in Artificial Intelligence 1202, 1st
Int. Workshop on Cooperative Information Agents, Springer-Verlag,
Berlin, 1997, 101-115.
[VV93a] Van den Bussche, J., Vossen, G., An extension of path expressions to
simplify navigation in object-oriented queries, S. Ceri, K. Tanaka, S.
Tsur (eds.), Deductive and Object-Oriented Databases, Lecture Notes
in Computer Science 760, 3rd Int. Conf. on Deductive and Object-
Oriented Databases, Springer-Verlag, Berlin, 1993, 267-282.
[VV93b] Vianu, V., Vossen, G., Static and dynamic aspects of goal-oriented
concurrency control, Annals of Mathematics and Artificial Intelli-
gence 7, 1993, 257-287.
[VVV96] Van den Bussche, J., Van Gucht, D., Vossen, G., Reflective program-
ming in the relational algebra, Journal of Computer and System Sci-
ences 52, 1996, 537-549.
[VVV02] Van den Bussche, J., Vansummeren, S., Vossen, G., Towards practi-
cal meta-querying, Technical Report No. 05/02-1, Schriften zur Ange-
wandten Mathematik und Informatik, University of Münster, Febru-
ary 2002.
[VYB+95] Vingralek, R., Ye, H., Breitbart, Y., Schek, H.-J., Unified transaction
model for semantically rich operations, G. Gottlob, M.Y. Vardi (eds.),
Database Theory, Lecture Notes in Computer Science 893, 5th Int.
Conf. on Database Theory (ICDT'95), Springer-Verlag, Berlin, 1995,
148-161.
[WC96] Widom, J., Ceri, S., Active database systems - triggers and rules
for advanced database processing, Morgan Kaufmann Publishers, San
Francisco, CA, 1996.
[Wei91] Weikum, G., Principles and realization strategies of multilevel trans-
action management, ACM Trans. Database Systems 16, 1991, 132-
180.

[Wil00] Williams, K., Professional XML databases, Wrox Press, Birmingham,
UK, 2000.
[WV02] Weikum, G., Vossen, G., Transactional information systems: theory,
algorithms, and the practice of concurrency control and recovery,
Morgan Kaufmann Publishers, San Francisco, CA, 2002.
[YM98] Yu, C.T., Meng, W., Principles of database query processing for ad-
vanced applications, Morgan Kaufmann Publishers, San Francisco,
CA,1998.
[ZCF+97] Zaniolo, C., Ceri, S., Faloutsos, C., Snodgrass, R.T., Subrahmanian,
V.S., Zicari R., Advanced database systems, Morgan Kaufmann Pub-
lishers, San Francisco, CA, 1997.
7. Parallel and Distributed Multimedia
Database Systems

Odej Kao

Department of Computer Science, TU Clausthal, Clausthal-Zellerfeld, Germany

1. Introduction ..................................................... 286


2. Media Fundamentals.... .. .. ........... ............... .... .... ... 288
2.1 Images....................................................... 289
2.2 Video........................................................ 291
3. MPEG as an Example of Media Compression .................... 292
3.1 MPEG I ..................................................... 292
3.2 MPEG II, MPEG IV, and MPEG VII ........................ 297
4. Organisation and Retrieval of Multimedia Data .................. 298
5. Data Models for Multimedia Data ....................... . . . . . . . .. 304
5.1 Data Models for Images......... .... .... .......... . . ... .... .. 306
6. Multimedia Retrieval Sequence Using Images as an Example ..... 308
6.1 Querying Techniques.. ....... .... ........... .... .... ... .... .. 309
6.2 Sample Procedure for Information Extraction ................ 310
6.3 Metrics ...................................................... 314
6.4 Index Structures ............................................. 316
7. Requirements for Multimedia Applications ....................... 318
8. Parallel and Distributed Processing of Multimedia Data .......... 321
8.1 Distribution of Multimedia Data ............................. 323
8.2 Parallel Operations with Multimedia Data ................... 327
8.3 Parallel and Distributed Database Architectures and Systems 330
9. Parallel and Distributed Techniques for Multimedia Databases ... 337
9.1 Partitioning the Data Set .................................... 337
9.2 Applying Static Distribution Strategies on Multimedia Data.. 340
9.3 Content-Independent Distribution of Multimedia Data ....... 340
9.4 Content-Based Partitioning .................................. 341
9.5 Dynamic Distribution Strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 344
10. Case Study: CAIRO - Cluster Architecture for Image Retrieval
and Organisation ................................................ 348
10.1 User Interface ................................................ 349
10.2 Relational Database System and Index Structures............ 351
10.3 Features ..................................................... 351
10.4 CAIRO Architecture .......................................... 353
10.5 Partitioning the Image Set ................................... 355
10.6 Parallel Execution of the Retrieval Operations ................ 356
10.7 Update Manager.. .... ....... ............ ... . ... .... ....... .. 358
11. Conclusions ...................................................... 359

References ....................................................... 359

Abstract. This chapter presents an introduction to the area of parallel and dis-
tributed multimedia database systems. The first part describes the characteristics
of multimedia data and depicts the storage and annotation of such data in con-
ventional and in multimedia databases. The main aim is to explain the process of
multimedia retrieval by using images as an example. The related computational,
storage, and network requirements create an urgent need for the integration of
parallel and distributed computer architectures in modern multimedia information
systems. Different hardware and software aspects have to be examined, for example
the partitioning of multimedia data and the distribution over multiple nodes have
a decisive impact on the performance, efficiency, and the usability of such multime-
dia databases. Other distributed aspects such as streaming techniques, proxy and
client issues, security, etc. are only briefly mentioned and are not in the focus of this
chapter. The last section gives an overview over an existing cluster-based prototype
for image retrieval named CAIRO.

1 Introduction

Sensing and processing of multimedia information is one of the basic traits


of human beings: the audio-visual system registers and transports environ-
mental images and noises. This complex recording system, complemented by
the senses of touch, taste, and smell, enables perception and gives us data for
analysing and interpreting our surroundings. Imitating this perception, and
the simulation of the processing, was and still is, the leitmotif of multimedia
technology development. The goal is to find a representation for every type
of knowledge, which makes the reception and processing of information as
easy as possible.
The need to process a given information, to deliver, and to explain it to
a certain audience exists in nearly all areas of day to day life, commerce,
science, education, and entertainment. Nowadays the information is bound
to a computer, a web site, a PDA (Personal Digital Assistant) or to a similar
storage and computing device. Because of the evolution of the Internet, this
information is accessible from any point on the network map of the world.
This accessibility can be further increased by eliminating the need for a fixed
connection point, making it just as mobile as people are. The merging of
PDAs and mobile phones is a step in this direction. Standards such as WAP
(Wireless Application Protocol)1 already allow mobile information access on
the Internet, but fail because of the limited bandwidth possible. This draw-
back will be remedied by the introduction of new technologies, like UMTS
( Universal Mobile Telecommunications System)2.
Multimedia communications are already changing the way in which peo-
ple work, organise, and interact today. More and more platforms for working
together are being introduced in addition to the well-known communica-
tion and discussion forums. This form of cooperation is generally known as
CSCW (Computer-Supported Collaborative Work), combining different as-
pects of collaboration, such as video-conferencing, joint writing and drawing
areas and working simultaneously on all sorts of documents.
The educational system - it does not matter if this means university
classes, an apprenticeship or a sales-training - profits from the possibilities of
multimedia communication, too. The subject matter on hand can be prepared
in such a way, that it is much easier to understand, and above all, that it can
be worked on by the users without direct supervision of the teacher. These
systems are called Computer-based Teaching, Teleteaching, Courseware, etc.
The classical areas of entertainment are a wide field of multimedia usage.
The contents are adapted to fit individual needs, activated only on demand
and are accounted for separately. In the case of TV, often used keywords are
Video-on-Demand and Pay-per-View.

1 www.wapforum.org
2 www.umts-forum.org

The largest operational areas for multimedia applications are still the
mass information systems and marketing communications. The first group
incorporates information systems at heavily frequented, public areas, such
as railway stations, airports, etc. Furthermore, newspapers, magazines, and
books are published in digital form. A part of these is used for advertisements,
while another part co-exists with the traditional printed media. Product-
catalogues are found at many points of sales and enable a fast overview
of the products offered and their prices. Electronic stores, reservation, and
booking sites at terminals as well as on the Internet supplement these sys-
tems. Detailed outlines of multimedia applications are found, for example, in
[Fur99,SteOO,GJM97].
The development of digital technologies and applications allows the pro-
duction of huge amounts of multimedia data. The scope and spread of doc-
ument management systems, digital libraries, photo archives used by public
authorities, hospitals, corporations, etc., as well as satellite and surveillance
photos, grow day by day. Each year, with an increasing tendency, Petabytes
worth of multimedia data is produced. All this information has to be system-
atically collected, registered, stored, organised, and classified. Furthermore,
search procedures, methods to formulate queries, and ways to visualise the
results, have to be provided. For this purpose a large number of prototypes
and operational multimedia database management systems is available.
This chapter concerns mainly parallel and distributed aspects - hardware
architectures as well as data engineering - for such multimedia databases. It
is organised as follows: after the introduction of the basic properties of mul-
timedia objects together with the accompanying methods for compression,
content analysis and processing (Section 2, 3.1), the storage and management
of such data in traditional and multimedia database systems are discussed in
Section 4. In this context, existing data models, algorithms, and structures for mul-
timedia retrieval are presented and explained by considering image retrieval
as an example (Section 5, 6).
The analysis of the related storage, computational, and bandwidth re-
quirements in Section 7 shows, that powerful parallel and distributed archi-
tectures and database systems are necessary for the organisation of the huge
archives with multimedia data and the implementation of novel retrieval
approaches, for example an object-based similarity search. Therefore, the
properties and requirements of distributed multimedia applications, such as
Video-on-Demand servers and federated multimedia databases are described
in Section 8.
The parallel and distributed processing of multimedia data is depicted
in greater detail in the last part of the chapter by considering an image da-
tabase as an example. The main attention is given to the partitioning, the
distribution, and the processing of the multimedia data over the available
database nodes, as these methods have a major impact on the speedup and
the efficiency of the parallel and distributed multimedia databases. Section
9.1 gives an overview of some existing approaches for partitioning of im-
ages, whereas Section 9.5 explains the functionality of dynamic distribution
strategies. Section 10 closes this chapter with a case study of a cluster-based
prototype for image retrieval named CAIRO.

2 Media Fundamentals
The foundation of the entire multimedia construct are the media contained therein, which are called multimedia objects. An often-used classification divides these into:

• Time-invariant (discrete, static) media: these media consist solely of a sequence of individual elements or of a continuum without a time-based component. Elements of this class are text, graphics, and images.
• Time-variant (continuous, dynamic) media: the values of these media change with time, so that the validity of an element depends on the point of time at which it is regarded. This condition is satisfied by videos, audio sequences, and animations.

A multimedia system in the stricter sense is marked by the computer-controlled, integrated creation, manipulation, presentation, and communication of independent information that is encoded in at least one continuous (time-variant) and one discrete (time-invariant) medium [Ste00].
Of all media, text is the oldest method to represent information in a
computer. Next to character encoding, for which the ASCII-Code is usually
employed, different format features (colour, font, size, alignment, paragraph
properties, etc.) are stored. Many thoroughly analysed procedures are known for the processing and retrieval of text-based documents, so this medium will not be considered further in this chapter. Textual information that is embedded in images or videos can usually not be recognised as such and thus remains unused.
A graphic visualises one or more circumstances and contributes to an
easier understanding. Graphics differ from images in that they are composed of a group of well-defined primitives, like points, lines, ellipses, etc., which are stored in a graphics system. Each primitive has a number of attributes such as colour, line thickness, etc. assigned to it. The advantage of this type of information representation is lost when it is converted into an image. As with images, only a colour matrix remains, in which the elements no longer have an immediate semantic meaning.
Video, audio, and animation belong to the time-dependent media. Video
has the highest memory demand of these, but through the development of
compression standards such as MPEG, videos can be used in numerous areas.
An audio sequence can represent speech, sounds, and music. Different algo-
rithms are used to reduce memory requirements of audio files; the best-known
method is MP3. Animations are split into frame animations and cast animations. A
frame animation is a concatenation of individual graphics, whereas in a cast animation a single object in front of a static background is animated.
In the following, the fundamentals of images and video sequences are
presented. Devices used to record, digitise, and reproduce these media on
the other hand, are not regarded. These are usually vendor-specific and are
changing continuously, so that only a snapshot would be possible here.

2.1 Images

An image is a time-invariant medium, representing a visual state within a small
time frame. Digital images are generated by converting the continuous spatial
and value signals obtained from the optical systems into discrete signals. Only certain base points are considered during the reduction of the continuous spatial signals to their discrete form, for each of which afterwards one of the possible
colour values is determined. The result of this scanning process is a matrix
with colour values, which is used to represent and process the corresponding
image. An immediately displayable image with the dimensions $M \times N$, $M, N \in \mathbb{N}$, is defined by a function $b$ with

$$ b : D_M \times D_N \rightarrow \{0, 1, \ldots, G-1\}^n, $$

where $D_M = \{1, \ldots, M\}$, $D_N = \{1, \ldots, N\}$, $D_M, D_N \subset \mathbb{N}$, and $n = 1, 2, 3, \ldots$ is the number of colour channels used. The symbol $G$ denotes the number of grey levels possible. Usually an 8-bit resolution is used (corresponding to 256 levels of grey).
The distances between the selected base points, and the number of values
possible per point define the resolution of the image. The number of channels
n determines the image type: a halftone image maps a scalar value onto
each co-ordinate pair, thus n = 1. By using differently tinted filters, three
exposures are made during the recording process for a colour image, a red,
a green, and a blue exposure. Each element of a colour image is defined by
three values, the intensity of each colour channel. Images where n > 3 are
called multi-spectral images, and are used for example in satellite imagery.
The smallest component of an image is called a pixel (Picture Element):
it is defined by the spatial co-ordinates (i,j) and the intensity vector b(i,j),
and is noted by the triple (i,j, b(i,j)). Usually a short notation b(i,j) is used.
This atomic information unit of an image is not descriptive enough for
most applications, like a letter in a word. Only by grouping pixels in coherent
regions, does one obtain interpretable units, such as objects or parts of these,
people, sceneries, etc. This is why most image processing procedures are
based on analysing the relationships of pixels that lie close together, so called
neighbourhoods.
Images are stored in file formats such as BMP, GIF, TIFF, JPEG, PNG,
etc., which represent an image as an array of bits. Some of them include
methods for compression of the pixel matrix and thus for reduction of the
storage requirements. The compression technologies are usually divided into lossless and lossy methods. The latter accept a loss of information and the introduction of artefacts that can be ignored as unimportant when viewed in direct
comparison with the original. Lossy compression takes advantage of the sub-
tended viewing angle for the intended display, the perceptual characteristics
of human vision, the statistics of image populations, and the objectives of
the display [SM98]. Lossy compression is provided for example by the JPEG
image file format. In contrast, lossless data compression uses a method that
enables the original data to be reconstructed exactly. This kind of image
compression is supplied by file formats such as TIFF or BMP.

Paradigms of image processing. Image processing creates, displays, transforms, and evaluates images and image contents. During the creation,
the images are brought into a discrete, digital form. In the first level, image manipulation applies a number of algorithms that enhance the image quality.
This includes the correction of exposure and transfer errors (e.g. contrast,
noise suppression). Afterwards certain image features, such as edges, are em-
phasised. Image analysis extracts complex information from the image, which
relates to entire objects or image areas. Possible results are lists of objects
depicted along with their attributes. Image evaluation can produce global
statements about the scene shown, based on the extracted information.
Some standard operators will now be described. A detailed summary of
existing operators can be found, among others, in [Pra91, Jae91, PB99].

• Histograms show the distribution of grey or colour values in an image, and are usually displayed in a table, or as a graph.
• Edge detection finds borders between objects, different surfaces, etc. A
subsequent contour tracing returns information about the position and
length of the edges. The output consists of co-ordinate pairs, the so called
contour code.
• Textures are a main attribute of objects in the real world. These are pat-
terns, which are characterised by parameters, such as orientation, colour,
etc. Texture recognition is an important step for image segmentation and
classification.
• Segmentation is a subdivision process of an image into $n \in \mathbb{N}$ regions $R_1, R_2, \ldots, R_n$, which satisfy the following conditions:
  1. $\bigcup_{k=1}^{n} R_k = R$,
  2. $R_k$ is a contiguous region, $k = 1, \ldots, n$,
  3. $R_k \cap R_l = \emptyset$ for all $k, l$ with $k \neq l$,
  4. $P(R_k) = \mathrm{TRUE}$ for $k = 1, \ldots, n$, and
  5. $P(R_k \cup R_l) = \mathrm{FALSE}$ for all $k, l$ with $k \neq l$,
  where $P(\cdot)$ is a homogeneity criterion, for example similar colour or texture. The individual segments are used as a basis for the recognition, analysis, and interpretation of objects (a simple code sketch of this idea follows after this list).

• The classification of images can be approached in two different ways: bottom-up or top-down. The bottom-up approach divides the image into areas of similar composition using available segmentation methods. The similarity is derived by applying norms to features such as colour, form, or texture. These segments are then interpreted and objects are recognised. The top-down approach uses a model of the expected image content. It attempts to interpret the image content and recognise objects with so-called "hypothesis tests", obtaining a logical picture of the recognised objects, their attributes, and the relationships between them.
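The segmentation conditions above can be illustrated with a small code sketch (an illustrative toy example, not one of the algorithms referenced in this chapter): contiguous regions are grown over a grey-level image, and the homogeneity criterion P is simply that all grey values of a region stay within a fixed tolerance of the region's seed value. Condition 5 is not enforced by this simple procedure.

import numpy as np
from collections import deque

def segment(image, tol=16):
    # Grow contiguous regions R_1, R_2, ... over a 2D grey-level image.
    # P(R) is TRUE if every grey value in R differs from the value of the
    # region's seed pixel by at most tol.
    labels = np.zeros(image.shape, dtype=int)
    next_label = 0
    for seed in zip(*np.nonzero(labels == 0)):
        if labels[seed] != 0:          # already assigned to an earlier region
            continue
        next_label += 1
        seed_value = int(image[seed])
        labels[seed] = next_label
        queue = deque([seed])
        while queue:                   # breadth-first region growing
            i, j = queue.popleft()
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # 4-neighbourhood
                ni, nj = i + di, j + dj
                if (0 <= ni < image.shape[0] and 0 <= nj < image.shape[1]
                        and labels[ni, nj] == 0
                        and abs(int(image[ni, nj]) - seed_value) <= tol):
                    labels[ni, nj] = next_label
                    queue.append((ni, nj))
    return labels

if __name__ == "__main__":
    img = np.zeros((64, 64), dtype=np.uint8)
    img[16:48, 16:48] = 200            # a bright square on a dark background
    print("number of regions:", segment(img).max())   # expected: 2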

2.2 Video

A video consists of a sequence of individual images, which are called frames. The difference between two successive frames is minimal, not counting
changes in scenery or fast movement, and can thus be employed to visu-
alise a motion sequence. The number of frames displayed per second is the
most important factor for the continuity of motion. Due to the sluggishness
of the human visual system, a frame-rate of 15 frames per second is sufficient
to suppress the recognition of single frames, and the impression of movement
continuity begins. 30 frames per second and above are necessary for smooth
movements.
There are many operators for processing video sequences, which can be
divided into two main classes:

• Sequence-independent operators mainly correspond to image processing procedures and are applied to the individual frames of a sequence.
• Sequence-dependent operators, on the other hand, consider the chrono-
logical and content-based relationship between two succeeding frames.

The application of sequence-independent operations begins with partitioning the video into individual frames, i.e. the video sequence is transformed into a set of independent images. The selected operations are then performed on each element of this set, and the modified frames are re-inserted into the processed video sequence, closely observing the original chronological order of the frames in the sequence.
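This workflow can be expressed in a few lines. The sketch below assumes the video is already available as an ordered list of frames (NumPy arrays); the contrast-stretch operator is only a stand-in for an arbitrary sequence-independent image operation, and no real video decoder is involved.

import numpy as np

def contrast_stretch(frame):
    # Example of a sequence-independent operator: linear contrast stretching.
    lo, hi = int(frame.min()), int(frame.max())
    if hi == lo:
        return frame.copy()
    return ((frame.astype(np.float32) - lo) * 255.0 / (hi - lo)).astype(np.uint8)

def process_sequence(frames, operator):
    # Apply the operator to every frame and keep the original chronological order.
    return [operator(frame) for frame in frames]

if __name__ == "__main__":
    video = [np.random.randint(60, 180, (32, 32), dtype=np.uint8) for _ in range(10)]
    processed = process_sequence(video, contrast_stretch)
    print(len(processed), processed[0].shape, processed[0].dtype)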
The most important advantage of all image sequences is that now dynamic
systems can be represented, identified, and analysed. This purpose makes it
necessary to complete the information of the current frame with the context
of the preceding and succeeding frames. Conclusions about the state of the
depicted dynamic process can be made by analysing the changing or constant
image elements, as in earth observation, climatic surveys, simulation of crash
tests, etc. Methods for shot detection, motion assessment, object tracking,
scene detection, and video annotation are of central importance.

3 MPEG as an Example of Media Compression

Compression algorithms are a key component and an important enabling technique for the distributed organisation, processing, and presentation of
multimedia data. Without a reduction of the memory requirements of im-
ages, audio, and video sequences the available storage capacities and network
bandwidths are not sufficient for the realisation of distributed multimedia
applications. Therefore, the encoding, decoding, and the processing of com-
pressed data are major demands on multimedia systems.
The palette of available methods for efficient coding of multimedia content
is very large. Well-known compression methods such as JPEG, wavelet or
fractal compression reduce the storage size of images significantly. The best-
known coding method for audio sequences is MP3. Considering the huge
storage requirements, the efficient coding of video sequences is one of the main prerequisites for the suitable integration of videos into multimedia documents.
The principal workflow of data compression is presented in the following
sections by considering the MPEG I standard as an example.
The group of experts MPEG (Moving Pictures Experts Group) was founded in 1988 by the ISO (International Standards Organisation) [Swe97].
The main goal was the development of a standardised coding method for
video and audio sequences for Compact Disc. In the following years addi-
tional experts working in the area of audio, video, and systems expanded the
group. At the end of 1990 a syntax for coding of video with the accompanying
audio existed, providing nearly VHS quality at a data rate of 1.5 Mbit/s.
This standard was later released as MPEG 1.
The succeeding standard MPEG II enables a significant quality increase
of the compressed video sequences. The later MPEG methods focus on the
user interaction and the representation of meta-information. MPEG IV fol-
lows an object-oriented approach and offers a number of possibilities for the
realisation of different types of user interaction. Furthermore, the audio and video sequences can be compressed at small bit rates down to 64 kbit/s, which are necessary for applications such as video telephony.
The following section offers a brief introduction to the creation and processing of video sequences coded according to the MPEG I standard [172]. The main focus is thereby set on the compression of the video layer.

3.1 MPEG I

An MPEG I coded bit stream consists of multiple, separately coded data streams. The most important streams are related to the audio and video se-
quences. The different data streams are multiplexed packet-wise in the MPEG
stream. Each packet may have a variable length and contains data from ex-
actly one of the possible data types. In addition, supplemental information
about the synchronisation has to be stored in the final MPEG I bit stream.

The MPEG I bit stream comprises six layers arranged in a hierarchical fashion, as shown in Fig. 3.1.

Fig. 3.1. Hierarchy levels of a MPEG coded video stream

The video sequence is located at the top of the hierarchy. It contains
information about the frame dimensions, the aspect ratio, the quantisation
matrices used, buffer sizes, etc. Each sequence consists of a series of different
coded frames, which are usually combined in multiple groups of pictures
(GOP). The most often used frame types are:

Intra-Frames (I-Frames): these frames correspond to independent reference images, which can be coded directly. Such frames have the following important functions in the MPEG bit stream:
• Creation of an initial situation, that is, the decoding of an I-Frame can be performed without any further knowledge from other - preceding or succeeding - frames.
• Definition of starting points for random access to the video sequence.
If a certain part of the sequence is skipped, the playing continues
on the following I-Frame. Moreover, parsing the video stream and
solely playing the included I-Frames can simulate a fast-forward or a
fast-backward presentation of a video.
• Re-synchronisation of the decoder in the case of a transmission error.
• Reference images for the other frame types.
Predicted-Frames (P-Frames): frames of this type contain solely the dif-
ferences to the last preceding I- or P-Frame. Thus, they require knowledge
about the content of these frames to be decoded. They also serve as ref-
erence images for P- and for B-Frames. Due to the reduced information
to be stored, P-Frames have much lower memory requirements than the
I-Frames.
Bi-directional Predicted-Frames (B-Frames): the coding of B-Frames
necessitates the content information of preceding or succeeding I-Frames
or P-Frames, as B-Frames contain the difference information to one of
these frame types. The B-Frames cannot be used as reference images for
other frames.
Figure 3.2 depicts the different frame types and the relations between
them. D-Frames, which are not used for the prediction of other pictures, allow a simple fast-forward mode.


Fig. 3.2. Connection between the described MPEG frame types
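The prediction relations between the frame types can also be written down as a small rule sketch (illustrative only, display order, no real bit-stream parsing): each P-Frame refers to the preceding I- or P-Frame, each B-Frame to the surrounding I-/P-Frames, and I-Frames to nothing.

def reference_frames(gop):
    # For a GOP pattern in display order, e.g. "IBBPBBP", return for each
    # frame index the indices of the frames it is predicted from.
    anchors = [i for i, t in enumerate(gop) if t in "IP"]   # possible reference frames
    deps = {}
    for i, t in enumerate(gop):
        if t == "I":
            deps[i] = []                                    # independently decodable
        elif t == "P":
            deps[i] = [max(a for a in anchors if a < i)]    # last preceding I-/P-Frame
        else:  # "B"
            before = [a for a in anchors if a < i]
            after = [a for a in anchors if a > i]
            deps[i] = ([max(before)] if before else []) + ([min(after)] if after else [])
    return deps

if __name__ == "__main__":
    print(reference_frames("IBBPBBP"))   # {0: [], 1: [0, 3], 2: [0, 3], 3: [0], ...}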

The general structure of all frame types is identical, thus no further differentiation beyond the three classes mentioned is necessary. Each frame consists of an introductory part - the so-called header - and a body. The header contains information about time, coding, and the frame type. The frame body consists of at least one slice, which can be separated into macro blocks. Each of these blocks is composed of 16x16 pixels and can be further subdivided into 8x8 blocks.

Coding of the video stream. The MPEG I coding method for video
streams is based on six different processing levels, which are graphically de-
picted in Fig. 3.3.
Motion compensation is used in order to eliminate the multiple coding
of the redundant information in succeeding frames. Thus, it is necessary to
identify the spatial redundancy present in each frame of the video sequence.
This static information is subsequently supplemented by the changing parts
of the frame and transmitted.
Two translation vectors, also called motion vectors, describe the estimated
motion. These contain the number of pixels in x- and y-direction, which are
used for the offset calculation of the examined region in the next frame. The
combination of the offset values and of the co-ordinates of the region in the
reference image gives the new position of the region. In the case of MPEG I coding, not objects but similar 8x8 blocks are searched for in the neighbouring frames. The new position of these blocks can be interpolated with sub-pixel accuracy. Well-known methods are the 2D search, the logarithmic search, and the telescopic search [172].
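The idea of the exhaustive 2D block search can be sketched as follows; block size, search range, and the sum-of-absolute-differences criterion are illustrative choices, and the sub-pixel interpolation and the optimisations of real encoders are omitted.

import numpy as np

def motion_vector(ref, cur, top, left, block=8, search=7):
    # Exhaustive block matching: find the offset (dy, dx) within +/- search pixels
    # for which the reference-frame block best matches the current-frame block
    # at (top, left), using the sum of absolute differences (SAD).
    target = cur[top:top + block, left:left + block].astype(np.int32)
    best_sad, best_vec = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue
            sad = np.abs(ref[y:y + block, x:x + block].astype(np.int32) - target).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_vec = sad, (dy, dx)
    return best_vec

if __name__ == "__main__":
    ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
    cur = np.roll(ref, shift=(2, -3), axis=(0, 1))    # frame content shifted by (2, -3)
    print(motion_vector(ref, cur, top=24, left=24))   # expected: (-2, 3)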
The foundation for the MPEG compression is the two-dimensional Dis-
crete Cosine Transformation (DCT). The DCT is a lossless, reversible trans-
formation converting spatial amplitude data into spatial frequency data.

Fig. 3.3. Levels of the MPEG I coding process

For video compression, the DCT is applied to 8x8 blocks of luminance samples and to the corresponding blocks of colour difference samples. The coefficients
of the transformed blocks are described by considering a special notation: the
coefficient in the top left corner is marked as a DC component; all other com-
ponents are called AC components.
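The two-dimensional DCT of an 8x8 block can be computed as the matrix product C B C^T with the orthonormal DCT-II basis matrix C. The following sketch is a plain floating-point reference implementation for illustration; real encoders use optimised fixed-point variants.

import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis: C[u, x] = a(u) * cos((2x + 1) * u * pi / (2n)).
    c = np.zeros((n, n))
    for u in range(n):
        a = np.sqrt(1.0 / n) if u == 0 else np.sqrt(2.0 / n)
        for x in range(n):
            c[u, x] = a * np.cos((2 * x + 1) * u * np.pi / (2 * n))
    return c

def dct2(block):
    # Forward 2D DCT; coefficient [0, 0] is the DC component, the rest are AC.
    c = dct_matrix(block.shape[0])
    return c @ block @ c.T

def idct2(coeffs):
    # Inverse 2D DCT (the basis is orthonormal, so the inverse uses C^T).
    c = dct_matrix(coeffs.shape[0])
    return c.T @ coeffs @ c

if __name__ == "__main__":
    block = np.random.randint(0, 256, (8, 8)).astype(np.float64)
    coeffs = dct2(block)
    print("DC component:", round(float(coeffs[0, 0]), 2))
    print("max reconstruction error:", float(np.abs(idct2(coeffs) - block).max()))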
The quantisation represents - next to the sub-sampling of the input data in the first step - the lossy stage of the MPEG compression. For this purpose an 8x8 matrix Q[u, v] with 8-bit values is defined, which contains the
quantisation levels for the calculated 64 DCT coefficients. The MPEG norm
provides standard matrices, which are usually used for the video compression.
It is however allowed to modify the existing matrices or to apply user-defined
quantisation matrices in order to adapt and to improve the quality of the
video compression for certain applications. The modified matrices have to be
transmitted in the MPEG stream, so that a correct decoding of the adapted
stream is guaranteed.
The quantisation matrix considers the connection between the low and
the high frequencies in a frame. In the regions of high frequencies large quan-
tisation coefficients may be used, as this information is not visible and can
thus be eliminated. In contrast, the lower frequencies have a significant im-
pact on the visual impression, so these structures have to be preserved as
well as possible. Otherwise, disturbing artefacts such as pixel blocks may occur.
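The quantisation itself is an element-wise division with rounding, reversed at the decoder by multiplication. The matrix used below is only an illustrative placeholder that grows towards the high frequencies; the standard matrices defined by the MPEG norm are not reproduced here.

import numpy as np

def quantise(coeffs, q):
    # Quantisation of an 8x8 coefficient block: level[u, v] = round(F[u, v] / Q[u, v]).
    return np.rint(coeffs / q).astype(np.int32)

def dequantise(levels, q):
    # Approximate reconstruction at the decoder: F'[u, v] = level[u, v] * Q[u, v].
    return levels * q

if __name__ == "__main__":
    u, v = np.indices((8, 8))
    q = 8 + 4 * (u + v)                  # placeholder matrix: coarser for high frequencies
    coeffs = np.random.uniform(-200.0, 200.0, (8, 8))
    levels = quantise(coeffs, q)
    print("non-zero levels:", int(np.count_nonzero(levels)))
    print("max quantisation error:", float(np.abs(dequantise(levels, q) - coeffs).max()))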
Usually all values near zero are mapped onto zero at the end of this
processing step. The quantised coefficients are then re-ordered, so that long series of successive zero values are created. The scheme - the so-called Zig-Zag pattern - for the re-organisation of the AC components is illustrated in Fig. 3.4.

 1   2   6   7  15  16  28  29
 3   5   8  14  17  27  30  43
 4   9  13  18  26  31  42  44
10  12  19  25  32  41  45  54
11  20  24  33  40  46  53  55
21  23  34  39  47  52  56  61
22  35  38  48  51  57  60  62
36  37  49  50  58  59  63  64

(The entry 1 in the top left corner corresponds to the DC component; all remaining entries are AC components.)
Fig. 3.4. Zig-Zag pattern for the AC components

This process leads to a suitable starting position for the data reduction,
which is performed using the well-known methods RLC (Run Length Encod-
ing) and VLC (Variable Length Coding) [Huf52]. Thereby only the values
different from zero and the number of zero values between them are consid-
ered, i.e. pairs of the following form are generated:

Number of zero values | Value unequal to zero

These pairs serve as input data for the next processing level, where the
VLC is applied. VLC identifies common patterns in the data and uses fewer
bits to represent frequently occurring values. The coding of the DC components is realised using a differential coding approach: for each 8x8 block
the difference between the current DC component and the DC component
of the preceding 8 x 8 block is calculated and coded. The already run length
encoded AC components are subsequently represented by a VLC code: the
MPEG standard provides an exhaustive table with VLC codes for every pos-
sible value combination.
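The Zig-Zag re-ordering and the generation of the (zero run, value) pairs can be sketched in a few lines; the scan order is derived from the diagonal structure of the pattern in Fig. 3.4, and the subsequent VLC stage is omitted, so the sketch stops at the run-length pairs.

import numpy as np

def zigzag_order(n=8):
    # (row, column) scan order of the Zig-Zag pattern for an n x n block:
    # diagonals are visited in turn, alternating the direction on each diagonal.
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else -p[0]))

def run_length_pairs(levels, order):
    # Convert the quantised AC coefficients of one block into pairs of the form
    # (number of preceding zero values, non-zero value); the DC coefficient at
    # (0, 0) is coded separately and therefore skipped. Trailing zeros would be
    # signalled by an end-of-block code.
    pairs, run = [], 0
    for i, j in order[1:]:
        value = int(levels[i, j])
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs

if __name__ == "__main__":
    levels = np.zeros((8, 8), dtype=int)
    levels[0, 0], levels[0, 1], levels[1, 0], levels[2, 1] = 120, 5, -3, 2
    print(run_length_pairs(levels, zigzag_order()))   # [(0, 5), (0, -3), (5, 2)]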
A detailed description of the MPEG I compression process can be found
for example in the MPEG standard [172]. It also contains further information
about the bit representation of the introduced codes and other coding-related
attributes.

3.2 MPEG II, MPEG IV, and MPEG VII


The succeeding standard MPEG II was released in 1994. It mainly focuses on the integration and support of new multimedia technologies such as HDTV.
This requires a significant improvement of the quality of the compressed
videos, so that a data rate of 100 Mbit/s was realised. A disadvantage is
given by the increased computational effort necessary for the video coding
and presentation.
An important change of MPEG II as compared to the MPEG I standard is
the possibility of coding interlaced video sequences. Thereby the following distinction between so-called picture structures is made:
• Field pictures, and
• Frame pictures.
In field pictures, the two half images (fields) are stored separately, whereas in frame pictures they are first combined and then stored as a whole frame. Improvements
of the audio layer are achieved by supporting supplemental data rates for
mono, stereo, and surround sound.
Furthermore a number of different configurations are available, which en-
able an adaptation of the ratio between the data rate and the quality to
the requirements of the current multimedia application. Each configuration
is thereby defined by a combination of
• Profiles: Simple, Main, Main+, Next, and
• Levels: Low, Main, High 1440, High.
A detailed description of the characteristics of the available profiles and
levels as well as a general introduction to the MPEG II standard can be found
for example in [HPN97,818].
The next MPEG standard - MPEG IV - has been available since January 1999.
The central design requirement is the support of various interaction possi-
bilities, which exceed the available functions for the simple presentation of a
video sequence. The foundation for this standard is an efficient representation
of so called audio-visual objects (AVO), which can be seen, heard or oper-
ated during the presentation. Thus, an object-oriented approach is realised,
which offers new possibilities for coding and modification of the objects in
the sequence.
Depending on the properties of an individual AVO, the most efficient cod-
ing method for this object can be used, for example JPEG for images or MP3
for audio sequences. The result is a significant increase of the compression
rate. A scene consisting of multiple AVOs can be interactively modified at any
time: the user can for example add or remove objects, change the properties
such as size, colour, shape, texture, etc.
A complete reference for the MPEG IV standard is provided by [JTC99].
A short introduction of the essential MPEG IV properties can be found for
example in [DeP00].

The latest MPEG standard at the moment is called MPEG VII. It has been designed for communication purposes and offers new methods for the
description of the media content by using manually, semi-automatically or
automatically generated meta-information. The existing coding methods are
extended by an additional track, which includes all meta-information. This is
used to improve the retrieval and presentation properties, for the maintenance
of the data consistency, etc. Moreover, besides the descriptive elements, the
MPEG VII system parts focus on compression and filtering issues which are
a key element of the MPEG VII application in distributed systems. The
standard, however, does not provide any details about the methods for the
extraction or creation of the meta-information. This is also true for algorithms
for retrieval of the information.
An overview of the basic principles of the MPEG VII data model is given
in Section 5.

4 Organisation and Retrieval of Multimedia Data


The development of digital technologies and applications allowed the produc-
tion of huge amounts of multimedia data. I/O devices and the corresponding
software are presently used in commerce, science, and at home; the Internet
is an almost limitless source of multimedia documents. All this information
has to be systematically collected, registered, saved, organised, and classified,
similar to text documents. Furthermore, search procedures, methods to for-
mulate queries, and ways to visualise the results, have to be provided. This
task is currently being tended to by existing database management systems
(DBMS) with multimedia extensions. The basis for representing and mod-
elling multimedia data are so called BLOBs (Binary Large Objects), which
store images, video and audio sequences without any formatting and analysis
done by the system, i.e. the compressed or uncompressed media are saved in
the current form in the database. This block of data can be processed with
user-defined functions for content analysis.
In addition to these, a growing number of prototypes and operational database management systems is available, which in particular take the re-
quirements and the special cases of managing multimedia data into account.
Some well-known research systems are QBIC3 , MIT PHOTOBOOK 4 , STRUC-
TURED INFORMATION MANAGER 5, SURFIMAGE 6. The raw data is described
by a number of specific, characteristic values, so called meta-information,
which can be classified in the following way:
Technical information refers to the details of the recording, conversion,
and saving process, for example in which format and under what name
3 wwwqbic.almaden.ibm.com/
4 www-white.media.mit.edu/-tpminka/photobook/
5 www.simdb.com/
6 www-rocq.inria.fr/cgi-bin/imedia/surfimage.cgi
a medium is stored. Furthermore, the quality of the digitisation of an audio sequence, the compression type, the number of frames per second in a video sequence, and basic information about the composition of the media are important. The latter refers, for example, to the resolution, image type, number of colours used, etc. This information can generally be extracted from the media header, possibly after a partial decompression.
Extracted attributes are those which have been deduced by analysing the content of the media directly. They are usually called features and
colour of an image. Complex features and their weighted combinations,
attempt to describe the entire media syntax. As an example, entire scenes
of a video sequence can be accurately described by single, representative
frames, so called keyframes.
Knowledge-based information links the objects, people, scenarios, etc.
detected in the media to entities in the real world. For example a photo
shows President Bush and the German chancellor Schroeder in front of
the White House in Washington.
Further information - also called world-oriented information - encom-
passes information on the producer of the media, the date and location of
the action or shot, language used in the video clip, subjective evaluation
of the quality of an audio or video sequence, etc. Especially all manually
added keywords are in this group, which makes a primitive description
and characterisation of the content possible. The number of these key-
words depends on the complexity of the media content as well as the type
of target application, but it is usually - due to time and effort constraints
- limited to a couple of words.

As can be seen from this classification, technical and world-oriented information can be modelled and represented in traditional alphanumeric data-
base structures. Organising and searching can be done using existing data-
base functions. Because of this, most database management systems currently
available are supplied with multimedia extensions.
Many different approaches exist for the integration of multimedia objects
in a relational database. A simple method is based on directly including the
objects in a table of the relational database model, by representing them with
bit-fields of variable length (VARCHAR, BLOB, LONG RAW, ...).
This approach has very high memory requirements, making the relational
databases unusually large and thus increasing the time required for the da-
tabase operation. This is the reason why often only the technical and world-
oriented information is stored in database tables, being completed by a refer-
ence to the storage address of the media. The raw media data is then stored
in a separate file system.
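Both variants can be sketched with a few lines of Python and SQLite (chosen here purely for illustration; table names, column names, and paths are hypothetical): the first table stores the raw medium directly as a BLOB, the second stores only meta-information together with a reference to a file in a separate file system.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Variant 1: the raw media object is embedded in the table as a BLOB.
    CREATE TABLE media_blob (
        id INTEGER PRIMARY KEY, title TEXT, mime_type TEXT, raw BLOB);
    -- Variant 2: only meta-information plus a reference into the file system.
    CREATE TABLE media_ref (
        id INTEGER PRIMARY KEY, title TEXT, mime_type TEXT,
        width INTEGER, height INTEGER, path TEXT);
""")

fake_image = bytes(range(256))     # placeholder for real image data
conn.execute("INSERT INTO media_blob (title, mime_type, raw) VALUES (?, ?, ?)",
             ("sunset", "image/jpeg", fake_image))
conn.execute("INSERT INTO media_ref (title, mime_type, width, height, path) "
             "VALUES (?, ?, ?, ?, ?)",
             ("sunset", "image/jpeg", 1024, 768, "/archive/images/sunset.jpg"))
conn.commit()

# Queries refer to the alphanumeric meta-information; the raw data is fetched
# from the BLOB or from the referenced file only for the final hits.
for row in conn.execute("SELECT id, path FROM media_ref WHERE title = ?", ("sunset",)):
    print(row)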
A query in such database systems usually refers to the technical and
world-oriented information in a relational table. These are passed over, com-
pared, and combined into a final result. The corresponding raw data is then
determined and displayed. Discourses on data modelling, modified query lan-
guages, and further analysis are found, among others, in [KB96].
The advantages of object orientation, as opposed to relational database systems, result from the support of complex objects, i.e. the media can be
treated and processed as a unit. This includes modelling, meta-information
management, and storing the complex content. Extended relational and ob-
ject oriented data models add some concepts of object oriented data models
to the classical relational data model, thereby reducing the known drawbacks
of these approaches.
Neglecting the demand for content-based search procedures leads to the fact that a series of databases with multimedia content are falsely designated as multimedia databases. Some examples of such databases with media are given in the following enumeration [KB96]:

• CD-ROMs are called multimedia databases by some authors. The database consists of the saved media, such as text, images, video sequences, etc.; the retrieval is realised by means of full-text or index search.
• Multimedia thumbnail systems are often used in web sites containing a lot
of images. These previews link to an image or other multimedia object.
• Video on demand is a management system for videos with a simple index
for the searching mechanism. Usual keywords are the name of the movie,
genre, actors, etc.
• Document management and imaging systems are used to manage large
stocks of documents.
• CAD/CAM systems: the graphical primitives are stored in hierarchical
fashion and are tagged with different features.

All these examples reflect certain aspects of multimedia database systems and display the entire bandwidth and areas of operation of these applications.
Yet none of the examples satisfy the conditions for a multimedia database
system, especially when considering the content-based description and search
of the media.

Definition 1 (Multimedia Database). A multimedia database system consists of a high-performance database management system and a database
with a large storage capacity, which supports and manages, in addition to al-
phanumerical data types, multimedia objects in respect to storage, querying,
and searching [KB96].

The structure of a multimedia database system is complex and is understood as a union of the following fundamental technologies:

• Traditional database systems,
• Architectures for efficient and reliable input, output, storage, and pro-
cessing of multimedia data, and
• Information retrieval systems.

The entire functionality of a traditional database has to be introduced to a multimedia database. Starting from the fundamental tasks such as abstrac-
tion to the details of storage access and memory management, to transactions,
standardised querying and modelling languages: all these characteristics are
demanded from a multimedia database as well. Thus, the database manage-
ment system has to satisfy the following requirements:

Atomicity: transactions are command sequences, which are performed on all elements of the database. The atomicity characteristic demands that
either all commands of the sequence are executed and the results are
visible to the environment simultaneously, or none of the commands is
executed.
Consistency: the transactions that have been carried out may transform
the database from one consistent state into another. The operations may
be performed on database objects or features. If an error occurs during
the execution - for example a key condition is violated -, the entire
transaction is aborted and the database is brought back into the state it
was before the execution began.
Isolation: the execution of a transaction is isolated and independent of all
other transactions in the database. The transactions can only access "se-
cured" data in the database, which are part of the consistent state.
Durability: if the database signifies the successful execution of a transac-
tion, then all the produced effects have to survive any surfacing faults -
no matter if the faults are hardware or software based. This is especially
important, when the data is not saved on the storage device immediately,
but is temporarily kept in a cache.
Concurrency Control: modern database architectures accelerate the com-
mand execution by employing concurrency and parallelism. This requires
that the order of transactions has to be kept, so that the results are not
tainted, i.e. the results of operations have to correspond to those of the
sequential execution.
Recovery: a database management system has to ensure that errors will not
threaten the consistency of the data. The state right before the error
occurrence has to be reconstructed.
Versioning: this component ensures the ability to access earlier versions
(states) of a modified object.
Security: this aspect encompasses securing the data from prohibited access
by the commands executed, as well as securing the content from accesses
by unauthorised users.

The importance of these so-called ACID properties decreases in the case of multimedia databases, as the data access is usually limited to read-only and
no read/write conflicts have to be resolved. Moreover, during runtime the raw
data of the media is solely used for extraction of attributes and presentation,
thus an update is not needed. On the other hand the meta-information is stored in the database in the traditional way and is controlled by the mech-
anisms noted. Therefore, the ACID properties have to be considered when
designing a multimedia database.
The database architecture - in software as in hardware - is decisive for
the efficiency and thus for the usability of a multimedia database. Because
of their high storage and computing requirements, multimedia databases be-
long to those applications, which rapidly hit the limits of existing technology.
The widespread client/server architectures, as shown in Fig. 4.1, are - in their
usual form - not fit for multimedia database implementations, since the data-
transfer between the individual architectural elements, for example database
server, hierarchical storage and retrieval server, overloads the network struc-
ture and causes long delays in the processing elements (PEs). This impairs
the entire process, causing the response times to be unacceptable. Because
of this, special - for example parallel and distributed - database architec-
tures, which consider the specific requirements for processing and presenting
multimedia data, are necessary.

Fig. 4.1. Example of a traditional client/server architecture for multimedia applications, consisting of distributed I/O devices, retrieval, processing, and storage facilities (Web server, file server, retrieval server, hierarchical storage, jukebox)

A primary aspect of supporting modern computer architectures is the usage of hierarchical storage management (HSM), which integrates and uses
storage devices with different compromises between access times and capac-
ities. The goal of this organisation is to make the multimedia objects - for
example according to the access frequency - available
• On-line (Cache, RAM, Hard-disks),
• Near-line (CD-ROMs, DVDs, and other optical storage devices), or
• Off-line (tapes, ...).

Often used data sets are moved upwards in the hierarchy. In the mean-
time HSM is just a small part of the architectural characteristics demanded:
sufficient computing resources are necessary for a proper search in the mul-
timedia stock, high-speed networks are used to transfer the media to the
processing and to the presentation components. Distributing the data among
several nodes in a suitable way can not only increase the processing speed by exploiting parallelism, but also make valuable features possible. Further technical
problems are connected to the introduction of new architectures, such as
efficient backup systems, fault tolerance and thus providing of redundancy,
workload balancing, etc.
The main difference between traditional and multimedia databases is a
result of the complex media content: analysing, describing, and sorting of
the media, as well as deriving similarity statements, are orders of magnitude more difficult than the corresponding operations on alphanumerical data.
This requires the following aspects to be considered:

Support for multidimensional data types and queries: objects, scenes, as well as sequences are the most important information carriers
in the media and present a coherent union of primitive information
units, for example pixels in the case of images. Each of these elements
can be characterised by a series of features, starting with colour, shape,
and texture, to relationships with real entities. Furthermore, topological
descriptions that determine the relationships between elements and
relative to the media can be generated. This information is essential
for the specification of queries, for example Object 1 lies to the right of
Object 2 or Object 1 contains Object 2.
Interactive and iterative queries: conventional database querying inter-
faces are not always well-suited for multimedia databases, as many at-
tributes and traits are abstract and complex, being hard to understand,
interpret, and formulate for users without special expertise. This is why
audio-visual interfaces have to be integrated. The user can load an ex-
emplary image or sequence, choose among pre-defined standard media,
combine from other media, compose or sketch a sample media, and then
search for similar media in the database.
Relevance feedback: ad-hoc queries often require further adjustments. On
the one hand, the results of the previous query can be used as a starting
point for the next, i.e. the query is repeated and fine-tuned iteratively,
until the desired result is obtained. Another problem is that the users
cannot completely evaluate the quality and efficiency of the chosen op-
erators and sample instances. Applying the selection on the given set
of test media and the presentation of the results, gives the user a first
orientation about the results that are to be expected.

Automatic computation of the characteristics: a part of the descriptive characteristics, foremost world-oriented information, is input in the
database through a traditional, text-based interface. The largest part is
automatically extracted and stored in the database. This is the reason
why a multimedia database needs to offer a user friendly and easy way to
integrate extraction, analysis, and comparative procedures that are used
to automatically process the inserted media.
Multidimensional indexes and content-based indexing: indexes serve to accelerate data set accesses and are thus widespread. In contrast to traditional database systems, multimedia data are characterised by an entire set of features, so that multidimensional index structures are necessary. These also have to support range and partial-match queries in addition to exact queries (a small index sketch follows after this list).
Query optimisation: the execution order of independent elements has to
be planned so that crucial and time-intensive processes, such as data
transfers between individual system nodes, are minimised, reducing the
response time of the system. However, query optimisation in the context
of multimedia is - due to the large data blocks to be communicated and
processed - much more complicated than in relational DBMS. Detailed information is provided, for example, by STONEBRAKER [ST96].
Partitioning of the data: this aspect closely relates to the query optimi-
sation and refers to the data distribution among the individual storage
devices. Multiple strategies are possible, but most are a compromise be-
tween the time needed for a data-transfer and minimising latency of the
PEs.
Synchronisation: a multimedia presentation presumes the ability to syn-
chronise the media during replay. These mechanisms have to be supplied
by the multimedia database and the corresponding extensions and need to
be considered during the data processing. The synchronisation can mean
the spatial, chronological or content-based order of the media. The media
can be presented independently of each other, sequentially, or in parallel.
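As a small sketch of such a multidimensional index, the feature vectors extracted a priori can be organised in a k-d tree and queried for the nearest neighbours of a query vector; SciPy's KDTree is used here only as a readily available stand-in for the high-dimensional index structures mentioned above.

import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
features = rng.random((1000, 8))        # hypothetical 8-dimensional feature vectors
index = KDTree(features)                # multidimensional index built a priori

query = rng.random(8)                   # feature vector of the query medium
distances, ids = index.query(query, k=5)     # the 5 most similar stored media
print(list(zip(ids.tolist(), np.round(distances, 3).tolist())))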

The following section regards a selection of data models for multimedia data in general, and for images in particular.

5 Data Models for Multimedia Data


Data models allow the generation of abstractions for problem descriptions.
Generally, they encompass a set of concepts to define the database structure
(data types, relations, conditions) and a set of operations to configure queries,
modifications, and updates.
A definition of a general multimedia data model is proposed, for example, by MARCUS and SUBRAHMANIAN [MS96], calling it a media instance. This is a
theoretical, formal, and application independent approach with a high level
of abstraction.

Definition 2 (Media instance [MS96]). A media instance is a 7-tuple

$$ mi = (ST, fe, \lambda, \mathcal{R}, F, Var_1, Var_2), \qquad (1) $$

where $ST$ is a set of states, $fe$ is the set of all possible features, and $\lambda : ST \rightarrow \mathcal{P}(fe)$ is a function that maps a set of features to each state. The sets $\mathcal{R}$ and $F$ contain inter-state relations and feature-state relations, respectively. For each relation $r \in F$ it holds that $r \subseteq fe^i \times ST$, $i \geq 1$.

So the data structures are made up of an information trunk (raw data) that is represented by a storage mechanism, such as R-trees. The functions and relations belonging to the trunk concern different aspects, traits, and/or characteristics of a media instance.
Based on the notation of multimedia instances, a multimedia system is
defined as a set of such instances. The concatenation of the available instances
represents the state of the system at a given point in time t, so that the former
static and dynamic media can be linked this way. The resulting set is called
a media event. In a news-broadcast, such an event could be the combination
of the newscaster, who says a certain word, a background image, subtitles,
and different logos. A sequence of media events over a time interval T results
in a media presentation.
A database query can now be defined as a process generating one or more
media events from the stored media instances. The media synchronisation is
reduced to a constraint-solving problem on the individual media events.
These formal definitions of a universal data model for media can be used
as a first guideline for integrating the already existing databases for individual
media, such as images or videos, in a global multimedia database.
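To make the formalism a little more tangible, the following sketch represents a media instance as a small Python data structure and checks the feature-state relations against the condition r ⊆ fe^i × ST; it is only an illustration of Definition 2, not an implementation taken from [MS96].

from dataclasses import dataclass, field

@dataclass
class MediaInstance:
    states: set                                   # ST
    features: set                                 # fe
    feature_map: dict                             # lambda: state -> set of features
    state_relations: list = field(default_factory=list)          # R
    feature_state_relations: list = field(default_factory=list)  # F
    var1: dict = field(default_factory=dict)
    var2: dict = field(default_factory=dict)

    def check(self):
        # Every r in F must relate feature tuples to states: r subset of fe^i x ST.
        for r in self.feature_state_relations:
            for feature_tuple, state in r:
                assert state in self.states
                assert all(f in self.features for f in feature_tuple)

if __name__ == "__main__":
    mi = MediaInstance(
        states={"frame_1", "frame_2"},
        features={"red", "keyframe"},
        feature_map={"frame_1": {"red"}, "frame_2": {"red", "keyframe"}},
        feature_state_relations=[[(("keyframe",), "frame_2")]],
    )
    mi.check()
    print(mi.feature_map["frame_2"])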
An applied example for a multimedia data model is given by archiving
news feeds delivered by news agencies to TV stations. These get video record-
ings of current events, which are used as the basis for composing the news
show [FJK01]. The archiving and management of such newsfeeds has to con-
sider - next to the raw video data - the following information:

• Format specifications: In which format is a given video? This information is important for the presentation, but also for evaluating its quality. An
MPEG I video, for example, is not broadcastable, and may only be used
for preview purposes.
• Recording data: When and where was a story recorded, in what language,
quality, etc.
• Access rights: Which agency delivered the story, how often may it be
broadcasted, how often was it already broadcasted, etc.
• Classification: What is the category of the story (economy, politics,
sports, etc.), or importance of the story for archival purposes.
• Links to other related information, such as textual contributions and
photos in local and external archives, previous stories, etc.

• Keyframes: Single images, which contain the most important elements of the current scene.

A standardised inclusion of such information in a video stream is supposed to become possible with the introduction of MPEG VII. The methods for extracting and retrieving characteristic features are not considered in it.
The goal is rather to make descriptors and descriptor schemes available for an
efficient representation of the extracted features. The technology belonging
to this is detailed in [MPE98, DeP00] in the following manner:

• Data: audio-visual raw data, which are to be described by the meta-information.
• Features: extracted characteristics of the media, which emphasise certain
aspects, and form the basis for retrieval. Keywords, statistic colour values,
histograms, objects, topological information, etc. are examples for this.
• Descriptor: defines the syntax and semantics of an entity that repre-
sents the attributes of a certain feature. Such an entity can be an image
(keyframe), a two dimensional array (regions, objects), lists (contours),
or a composite data structure such as (real [3], real [3], real [3] )
for colour moments.
• Descriptor values.
• Description scheme (DS): such a scheme consists of descriptors and other
descriptor schemes, and defines the relation between the elements.

Features and other information described in this manner are added to the encoded audio-visual media. The type of compression, storage, transfer,
etc. are not relevant, since the meta-information is stored in an additional
track, as shown in Fig. 5.1. When the MPEG VII stream reaches the user, it
is decomposed into raw data and meta-information. The meta-information can
be subsequently retrieved, processed, and managed.

5.1 Data Models for Images

Content-Based Image Retrieval (CBIR) - first mentioned by KATO in 1992 [Kat92] - is based on the extraction and comparison of different image features. These can be grouped into the already introduced information classes:

Image data (raw data): pixel values of the colour matrix.
Technical information: resolution, number of colours, format, etc.
Information from the image analysis: extracted characteristic proper-
ties, objects, and regions, statistical values, topological data, etc.
Knowledge-based data: relations between image regions and real world
entities.
World-oriented data: acquisition time and date, location, photographer,
manually inserted keywords, etc.

[Figure: feature extraction and data models produce the MPEG VII description, which is encoded into the MPEG VII stream and decoded on the user side for filtering, querying, and presentation]

Fig. 5.1. Graphical description of the MPEG VII process

The data model of the QBIC (QUERY BY IMAGE CONTENT) system differentiates between the raw image data, technical and world-oriented information, and image objects. They are determined as areas of similar colour or texture, or within contours gained from automatic or semi-automatic procedures [ABF+95, FSN+95, NBE+93].
More complex data models are based on the MMIS (MANCHESTER MUL-
TIMEDIA INFORMATION SYSTEM) [G0I92] and VIMSYS (VISUAL INFORMA-
TION MANAGEMENT SYSTEM) [GWJ91]. The MMIS data model uses the assumption that a description of the image content is not possible with presently available tools [Gob97]. Because of this, a so-called incremental data model consisting of four layers [Gro94] was designed, so that the results of new analytical and recognition algorithms can be integrated. Also, the data model is
supposed to support different, and future applications for image databases.
At the time however, only the two bottom layers, the image data layer and the
layer of automatically and semi-automatically extracted primitive features,
are used. From this information, objects (third layer) are to be constructed,
and brought in relation with real world entities (fourth layer).
The VIMSYS data model differentiates between image-only data, world-
only data, and annotation data. The first group consists of image data, technical information, and all information gained from image processing. The latter can be determined by automatic or semi-automatic procedures. The world-
only data represents a media-independent knowledge base, which models the
relations and depicts the attributes from the real world entities. An example
for this is
Politicians are people ⇒ Politicians have names.
The description data is a set of all information about the image, that con-
sist of, for example, keywords, descriptors of identified objects, and derived
semantic information. Spatial, functional, and semantic relations re-create the dependencies between the data of different layers.
The AIR (ADAPTIVE IMAGE RETRIEVAL) data model [GRV96] divides
the information in three layers:

Physical level representation is the lowest level in the hierarchy and con-
tains the raw data of the image and its objects.
Logical level representation is above the physical layer and contains the
logical attributes of an image.
Semantic level representation makes it possible to model the different,
user-dependent layers onto the data, as well as synthesising the semantic
query features by derived logical and meta-features.

The main problem when using these models is that methods for extracting
the meta-information from a higher level of abstraction are not available.

6 Multimedia Retrieval Sequence Using Images as an Example
This section introduces the individual querying and retrieval steps in a mul-
timedia database, which are detailed by considering image retrieval as an
example.
The goal of multimedia retrieval is the selection of one or more media whose meta-information meets certain requirements, or which are similar to a given medium. Searching the meta-information is usually based on full-text search in the assigned keywords. Furthermore, content references, like colour distributions in an image or a sequence of notes in a melody, can be input. More complex, content-referring information, like wavelet coefficients, is usually too abstract. Its interpretation and understanding require specific expertise, which most users lack.
Because of this, most systems prefer using a query with an example media
item. This can be an image or an audio sequence that is subjectively similar
to the looked-for media. This medium is used as a starting point for the search and is processed in the same manner as the other media were when they were entered into the database. The content is then analysed with the selected procedures and the medium is mapped to a vector of (semi-)automatically extracted features. The raw data is no longer needed at this moment, so that all fur-
ther processing concentrates on analysing and comparing the representative
vectors.
Different types of features are presented in the literature, together with ex-
traction methods, similarity metrics and functions. Each feature emphasises
one or more aspects of the media. After the analytical phase, the media is
represented by an attribute vector. This vector can be directly compared to
the representatives of the other media in the database using existing met-
rics or functions. The result of this comparison is a similarity value for the
query and the analysed media. This process is repeated for all n media in
the database, resulting in a similarity ranking. The first k entries, k being
a user-defined constant, represent the k best hits, whose raw data are then
displayed. The comparison process can be accelerated by using index struc-
tures. These contain features extracted a priori, and are organised in such a
way, that the comparisons can be focused to a certain area around the query.
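The retrieval sequence just described can be condensed into a short sketch: a feature extractor maps each medium to a vector, the query vector is compared with all stored vectors using a metric, and the first k entries of the resulting ranking are returned. The grey-level histogram used as the extractor here is only a stand-in for the feature types discussed below; in practice the vectors of the stored media would be computed a priori and accessed via an index rather than recomputed per query.

import numpy as np

def extract_features(image, bins=16):
    # Stand-in feature extractor: normalised grey-level histogram.
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    return hist / hist.sum()

def retrieve(query_image, database, k=5):
    # Rank all stored media by the Euclidean distance of their feature vectors
    # to the query vector and return the identifiers of the k best hits.
    q = extract_features(query_image)
    ranking = sorted(database.items(),
                     key=lambda item: np.linalg.norm(extract_features(item[1]) - q))
    return [media_id for media_id, _ in ranking[:k]]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    db = {i: rng.integers(0, 256, (32, 32)) for i in range(100)}   # stored images
    print(retrieve(db[42], db, k=3))   # image 42 itself should rank first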
In the following, a selection of querying techniques, extraction methods,
index structures, and metrics are presented, using the example of image re-
trieval.

6.1 Querying Techniques

To specify a query in an image database, the following interfaces can be used:

Browsing: this user interface enables the creation of database queries without the need to precisely specify the search parameters and without detailed
knowledge of the data structures and mechanisms used. In the case of
multimedia databases, browsers are often used when no suitable start-
ing media is available. Beginning at a pre-defined data set, the user can
navigate in any desired direction, until a similar media sample is found.
Search with keywords: technical and world-oriented data are represented
by alphanumerical fields. These can be searched for a given keyword.
Choosing these keywords is extraordinarily difficult for abstract struc-
tures, such as textures.
Similarity search: the similarity search is based on comparing features
gained via extraction algorithms from the raw data. Most of these features
do not show an immediate reference to the image, making them highly
abstract for users without special knowledge. They cannot be given, but
have to be computed from a query media. The following interfaces are
available for specification of the sample image:
• Query By Pictorial Example: a complete example image is given. An
advantage of this approach is that all query-relevant information can
be gained in the same way from the query image as it was computed
from the images stored in the database. This guarantees comparabil-
ity.
• Query By Painting, Sketch Retrieval: this is a widespread query type
[ABF+95, KT00, RC00], in which the user sketches the looked-for im-
age with a few drawing tools. It is not necessary to do this correctly in
all aspects. Priorities can be set by giving the different feature types
certain weights.
• Selection from standards: lists of example instances, so called stan-
dards, can be offered for individual features. This is customary when
choosing a shape or texture that can be selected from an existing list
much faster than being described or constructed.

• Image montage: the image is composed of single parts [SSU94]. Features can be computed for all elements of an overlaid grid. A key region can be specified by selecting a set of grid elements.
• Visual Concepts and Visual Dictionaries: these are querying tech-
niques, which are used in the EL NINO [SJ00] system. A visual con-
cept is hereby understood as a set of images that are defined as equiv-
alent, or at least as very similar, by the user. These are then combined
in a concept with a heavier weight. On the other hand, the visual the-
saurus consists of a certain set of manually identified images, from
which starting images can be chosen for retrieval. For this, the user
gives a few keywords that are compared with the textual features of
the images in the visual thesaurus. The hits can then be used to draft
a query.

6.2 Sample Procedure for Information Extraction


The generally accepted similarity search in multimedia data is one of the
Grand Challenge Problems that are investigated by numerous research groups
around the world. Many different methods for feature extraction are devel-
oped and can be classified by various criteria. Based on the point in time in
which the features are extracted, the procedures can be divided into:

A priori feature extraction: in this case, only pre-defined features are al-
lowed in the query, so that the stored images are not processed. These fea-
tures were extracted during insertion in the database, and can be searched
in the form of an index tree.
Dynamic feature extraction: this is a more flexible approach, where the
user marks relevant elements in the sample image as querying parameters
for the similarity search. This could be, for example, a person's face, or
an object. Then, all media in the database are searched for this feature.
Combining the a priori and dynamic feature extraction: some stan-
dard features and the corresponding index trees are computed during
insertion. These are then completed with user-defined features during
the search.

Some well-known image features and extraction methods will now be pre-
sented.

Image features. Histogram-based methods for determining similarity be-


long to the oldest and most widespread procedures for image retrieval. The
core of such a procedure is composed of the following three steps:

Quantising the colour space: calculating a complete histogram is very


computation and storage intensive. This is why the first step is to di-
vide the colour space into a limited number - usually 256 - of partitions
$c_i$. The fundamental algorithms are adapted to the colour model used,
the current application, the characteristics of the given image class, etc.
The result is a set $C = \{c_1, c_2, \ldots, c_n\}$, where $n \in \mathbb{N}$ and
$c_i \cap c_j = \emptyset$ for all $i, j \in [1, n]$, $i \neq j$.
Computing a histogram: after the colour cells are determined, a his-
togram is computed for each image in the database. The colour value
of each pixel is converted into a reference colour with a given metric and
the counter of this colour cell is incremented.
Comparing histograms: this step is performed during the runtime of the
image retrieval and is used to determine the similarity of a query image
and a stored image. The histogram of the query image needs to be com-
puted first. Then, this histogram is compared with the histograms of all
stored images using a given metric. The results are sorted and yield a
ranking of the hits.
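
To make the three steps above concrete, the following minimal Python sketch builds a quantised colour histogram and compares two of them with a Euclidean metric. It is only an illustration under simplifying assumptions: the 8x8x8 quantisation, the resulting number of colour cells, and the Euclidean comparison are arbitrary choices, not the procedure of any particular system.

import numpy as np

def colour_histogram(image, bins_per_channel=8):
    """Quantise an RGB image (H x W x 3, values 0..255) into
    bins_per_channel**3 colour cells and return the normalised histogram."""
    step = 256 // bins_per_channel
    quantised = image // step                       # cell index per channel
    cells = (quantised[..., 0] * bins_per_channel ** 2
             + quantised[..., 1] * bins_per_channel
             + quantised[..., 2])                   # one cell index per pixel
    hist = np.bincount(cells.ravel(), minlength=bins_per_channel ** 3)
    return hist / hist.sum()                        # relative frequencies

def histogram_distance(h1, h2):
    """Euclidean distance of two histograms; smaller means more similar."""
    return float(np.sqrt(np.sum((h1 - h2) ** 2)))

# usage with two random test images
img_a = np.random.randint(0, 256, (64, 64, 3))
img_b = np.random.randint(0, 256, (64, 64, 3))
print(histogram_distance(colour_histogram(img_a), colour_histogram(img_b)))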

The method used by the CORE system [WNM+95] is based on a static
quantisation of the RGB colour space. The colour cells $\lambda_1, \lambda_2, \ldots, \lambda_n$, $n \in \mathbb{N}$,
are explicitly given and are determined by analysing all images in the
database. Each image $b$ is characterised by an $n$-dimensional vector $f_b$ with

$f_b = (\lambda_1^b, \lambda_2^b, \ldots, \lambda_n^b),$   (2)

where $\lambda_i^b$ is the relative frequency of the reference colour $\lambda_i$ in the image $b$.
The similarity $A(b_1, b_2)$ between two images $b_1$ and $b_2$ corresponds to the
Euclidean distance of the two colour vectors:

$A(b_1, b_2) = \| f_{b_1} - f_{b_2} \|_2 = \sqrt{\sum_{i=1}^{n} \big(\lambda_i^{b_1} - \lambda_i^{b_2}\big)^2}.$   (3)

The metric can be adapted to current demands by introducing weights.


These so-called push factors $w_i$ are determined experimentally to ensure that
the distance between two images grows as soon as a colour is found in only
one of the images:

(4)

Similar procedures are used among others by the QBIC [ABF+95] and
the ARBIRS [Gon98] system. Many systems combine the histogram informa-
tion with other features, thus increasing robustness and precision.
Calculation of statistical colour moments [SO95] is a further approach
for describing colour distributions. Usually the first moment, as well as the
second and third central moments are used, since these represent the average
intensity $E_i$, the standard deviation $\sigma_i$, and the skewness $s_i$ of each colour
channel. These are computed from the values $p_{ij}$ of the $i$-th colour channel at
the $j$-th pixel of an image $b$ with $N$ pixels as follows:

$E_i = \frac{1}{N} \sum_{j=1}^{N} p_{ij}$   (5)

$\sigma_i = \Big( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - E_i)^2 \Big)^{1/2}$   (6)

$s_i = \Big( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - E_i)^3 \Big)^{1/3}$   (7)

For comparing two images $b_1 = (E_i^{b_1}, \sigma_i^{b_1}, s_i^{b_1})$ and $b_2 = (E_i^{b_2}, \sigma_i^{b_2}, s_i^{b_2})$
with $r$ colour channels each, a weighted similarity function $L_f$ is introduced:

$L_f(b_1, b_2) = \sum_{i=1}^{r} \big( w_{i1} |E_i^{b_1} - E_i^{b_2}| + w_{i2} |\sigma_i^{b_1} - \sigma_i^{b_2}| + w_{i3} |s_i^{b_1} - s_i^{b_2}| \big).$   (8)

The weights $w_{i1}, w_{i2}, w_{i3} \ge 0$ are user-defined and serve to adapt the
similarity function $L_f$ to the current application.
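
The colour moments of Equations (5)-(7) and the similarity function $L_f$ of Equation (8) can be sketched in a few lines of Python (NumPy assumed; the equal default weights and the signed cube root used to handle a negative third moment are arbitrary choices of this illustration):

import numpy as np

def colour_moments(image):
    """Return (E, sigma, s) per colour channel of an H x W x C image,
    following Equations (5)-(7)."""
    pixels = image.reshape(-1, image.shape[-1]).astype(float)   # N x C
    E = pixels.mean(axis=0)
    sigma = (((pixels - E) ** 2).mean(axis=0)) ** 0.5
    third = ((pixels - E) ** 3).mean(axis=0)
    s = np.sign(third) * np.abs(third) ** (1.0 / 3.0)           # real cube root
    return E, sigma, s

def moment_similarity(m1, m2, w1=1.0, w2=1.0, w3=1.0):
    """Weighted similarity L_f of Equation (8); smaller is more similar."""
    E1, sig1, s1 = m1
    E2, sig2, s2 = m2
    return float(np.sum(w1 * np.abs(E1 - E2)
                        + w2 * np.abs(sig1 - sig2)
                        + w3 * np.abs(s1 - s2)))

img_a = np.random.randint(0, 256, (64, 64, 3))
img_b = np.random.randint(0, 256, (64, 64, 3))
print(moment_similarity(colour_moments(img_a), colour_moments(img_b)))
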
A description of contour or image segments can be assumed to be a mea-
surable and comparable image description, if contours are extracted from the
image. Again, many different methods exist for displaying these segments.
The QBIC System uses 18 different parameters, such as [NBE+93]:
• Area is the number of pixels within an enclosed region.
• Circularity is computed from the quotient of the squared circumference
and the area.
• Direction of the largest eigenvector of the covariance matrix.
• Eccentricity is the relation of the length of the smallest to the length of
the largest eigenvector.
• Algebraic moments are the eigenvalues of a set of pre-defined matrices.
MEHROTRA and GARY use polygon lines for describing contours [MG95].
The polygon nodes can be the nodes of a line strip, approximating the con-
tour, or nodes computed from the features of the contour, such as the points
of largest curvature. So a contour is displayed as a sequence of so called
interest points.
Another category is made up of the texture-based features. The most
often used texture characteristics are computed from the co-occurrence matrix.
ASENDORF and HERMES offer a survey of the different features [AH96].
The following are well-suited for classifications (Equations are taken from
[HKK+95]):
1. Angular second moment:

$f_1 = \sum_i \sum_j p(i, j)^2.$   (9)

2. Contrast:

(10)

3. Correlation:

$f_3 = \dfrac{\sum_i \sum_j (i \cdot j)\, p(i, j) - \mu_x \mu_y}{\sigma_x \sigma_y}.$   (11)

4. Variance:

$f_4 = \sum_i \sum_j (i - \mu)^2\, p(i, j).$   (12)

5. Entropy:

$f_5 = -\sum_i \sum_j p(i, j) \log\big(p(i, j)\big).$   (13)

Here, $p(i, j)$ is the entry $(i, j)$ of the normalised co-occurrence matrix, $N_g$
is the number of grey levels, and $\mu$, $\mu_x$, $\mu_y$ are the means and $\sigma_x$, $\sigma_y$ the
standard deviations of the marginal probabilities $p_x$ and $p_y$, respectively. To obtain
rotation invariance, the co-occurrence matrix of the image or image segment
needs to be pre-calculated for different directions.
The remainder of this section lists features that do not have an immediate,
obvious semantic meaning. These have the drawback that the result of a
query may not be comprehensible to a human viewer. An obvious form of
feature extraction is using the coefficients of the DCT to describe the image
content. This takes advantage of the fact that a lot of images are stored in the
JPEG format in the database, and the expensive DCT is already performed
during compression [SAM96].
Wavelet coefficients are another possibility to describe the image content.
All images are scaled to the same dimensions, for example 128x128, and are
processed by the wavelet transformation. The results are 16384 wavelet coeffi-
cients. A number n of coefficients, usually n = 64, are then selected, combined
in an attribute vector, and stored in the database. The same number of co-
efficients is also used for the query image or sketch, so that the similarity
of two images can be determined by computing an adapted difference of the
corresponding wavelet vectors. An exact description of the criteria used to
select the coefficients, as well as a comparison metric and weights, can be
found in [JFS95,WHH+99]. Figure 6.2 and Figure 6.3 in Section 6.4 show
examples for image retrieval with wavelet coefficients.
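
As an illustration only, the following Python sketch computes a crude multi-level Haar decomposition and keeps a 64-value attribute vector taken from the low-frequency corner; the image size of 128x128, the number of levels, and the coefficient selection are simplifying assumptions and do not reproduce the selection scheme of [JFS95,WHH+99].

import numpy as np

def haar2d(image, levels=4):
    """Simplified multi-level 2D Haar decomposition of a square greyscale
    image; at each level only the low-pass quadrant is decomposed further."""
    a = image.astype(float)
    size = a.shape[0]
    for _ in range(levels):
        block = a[:size, :size]
        # rows: averages into the left half, differences into the right half
        block = np.hstack([(block[:, 0::2] + block[:, 1::2]) / 2.0,
                           (block[:, 0::2] - block[:, 1::2]) / 2.0])
        # columns: averages into the top half, differences into the bottom half
        a[:size, :size] = np.vstack([(block[0::2, :] + block[1::2, :]) / 2.0,
                                     (block[0::2, :] - block[1::2, :]) / 2.0])
        size //= 2
    return a

def wavelet_vector(image, n_side=8):
    """Use the n_side x n_side low-frequency corner as attribute vector."""
    return haar2d(image)[:n_side, :n_side].ravel()

def wavelet_distance(v1, v2):
    """Simple, unweighted difference of two wavelet attribute vectors."""
    return float(np.sum(np.abs(v1 - v2)))

img_a = np.random.randint(0, 256, (128, 128))
img_b = np.random.randint(0, 256, (128, 128))
print(wavelet_distance(wavelet_vector(img_a), wavelet_vector(img_b)))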

Template matching. An example for the dynamic feature extraction is the


application of the template matching operator. This is a fundamental tech-
nique for detecting objects in images and video sequences. The user selects
a region of interest, for example an object. This is represented by a min-
imal bounding rectangle (MBR). The other elements of the image are not
considered, so that a detail search can be executed. The object is compared
to all possible sections in the target image or video. While doing so, cer-
tain features are combined in a characteristic value, which is a measure for
the similarity of the regions tested. A huge number of different comparison
operations, features, and selection possibilities exist.
The simplest form is based on combining the corresponding pixels of the
object with the area below it directly, for example subtracting the colour
values. The object is moved across the image pixel by pixel. The combination
operation is performed again at each new position, and the results are then
summed up. The resultant sum is interpreted as the measure of similarity for
both regions. This fundamental algorithm is shown graphically in Fig. 6.1.
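
A minimal Python sketch of this fundamental algorithm, using the sum of absolute differences over all positions of a greyscale template; the brute-force search and the random test data are illustrative simplifications.

import numpy as np

def template_match(image, template):
    """Slide the template over a greyscale image and return the position
    with the smallest sum of absolute differences (most similar region)."""
    ih, iw = image.shape
    th, tw = template.shape
    best_pos, best_score = None, np.inf
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            window = image[y:y + th, x:x + tw]
            score = np.abs(window.astype(float) - template).sum()
            if score < best_score:
                best_pos, best_score = (y, x), score
    return best_pos, best_score

img = np.random.randint(0, 256, (120, 160))
tpl = img[40:60, 70:100]                 # a region cut from the image itself
print(template_match(img, tpl))          # expected position: (40, 70)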

6.3 Metrics
The similarity of two images in the content-based retrieval process is deter-
mined by comparing the representing feature vectors. A set of metrics and
similarity functions was developed for this purpose. They can be classified as
follows [JMC95]:

• Distance-based measures,
• Set-theoretic-based measures, and
• Signal Detection Theory-based measures.

The distance-based methods work with n-dimensional vectors as image


representation and compute a multidimensional distance between the vectors.
The smaller the distance, the more similar the analysed images are. Metrics
based on both traditional and fuzzy logic serve as computational rules here. An
example is the MINKOWSKI r-METRIC

$d_r(x, y) = \Big[ \sum_{i=1}^{n} |x_i - y_i|^r \Big]^{1/r}, \quad r \ge 1,$   (14)
Fig.6.1. Example for template matching: (a) Manually selected region of interest
represented by MBR; (b) Search for the object in an unsuitable image; (c) The
object is found in a different environment

where $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ are arbitrary points in
an $n$-dimensional space. The fuzzy-logic-based MINKOWSKI r-METRIC re-
places the component subtraction by subtracting the corresponding member-
ship functions $\mu(x_i)$ and $\mu(y_i)$.
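
A direct implementation of Equation (14) is straightforward (NumPy assumed); $r = 2$ yields the Euclidean and $r = 1$ the Manhattan distance:

import numpy as np

def minkowski(x, y, r=2):
    """Minkowski r-metric of Equation (14)."""
    if r < 1:
        raise ValueError("r must be >= 1")
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** r) ** (1.0 / r))

print(minkowski([0, 0], [3, 4]))        # 5.0 (Euclidean)
print(minkowski([0, 0], [3, 4], r=1))   # 7.0 (Manhattan)
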
The set-theoretic measures are based on using the number of same or
different components of the feature vectors. Set operations, such as intersec-
tion, difference, and union, are applied here. A family of such functions was
proposed for example by TVERSKY [Tve77].
Let $a, b$ be images and $A, B$ the associated feature sets. The measure for
the similarity of both images $S(a, b)$ is computed using the following rule:

$S(a, b) = \theta f(A \cap B) - \alpha f(A - B) - \beta f(B - A), \quad \theta, \alpha, \beta \ge 0.$   (15)

The function f is usually used to determine the cardinality of the result set.
Similarity measures of the third category inspect not only the quality of the
features, but also their existence. Signal Detection Theory
- also called Decision Theory - provides measures for the special case where
feature components have binary values. Each image is assigned a vector with
binary values, so that comparisons can detect similarities. This makes the
following four cases possible:

• Hits, 1-1 (w): both images contain the feature,
• Misses, 0-0 (z): none of the images contains the feature,


• False alarms, 1-0 (x): the first image contains the feature, but not the
second, and
• Correct rejections, 0-1 (y): the opposite of false alarms.
For evaluating the similarity of two images, which are represented by their
binary feature vectors $a$ and $b$, the so-called JACCARD-COEFFICIENT is often
used:

$S(a, b) = \frac{w}{w + x + y}.$   (16)
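
For illustration, the coefficient can be computed directly from two binary feature vectors by counting the four cases defined above (the value returned for two all-zero vectors is an arbitrary convention of this sketch):

import numpy as np

def jaccard_similarity(a, b):
    """Jaccard coefficient of Equation (16):
    hits / (hits + false alarms + correct rejections); misses are ignored."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    w = np.sum(a & b)             # hits (1-1)
    x = np.sum(a & ~b)            # false alarms (1-0)
    y = np.sum(~a & b)            # correct rejections (0-1)
    return w / (w + x + y) if (w + x + y) else 1.0

print(jaccard_similarity([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))   # 0.5
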
This classification emphasises the advantages of vector-oriented similarity
measurements: the features can be computed automatically and can be used
to determine the nearest neighbour employing proven algorithms.

6.4 Index Structures


Selected features of an object, a file or other data structures are stored in
an index, offering accelerated access. The first step in a search
operation is to consult the index for the given feature, if it exists,
to determine the address of the data set. Then the data can be accessed
directly. This implies, that the construction method of an index is of utmost
importance for database and information system efficiency.
The index structures can be characterised using parameters, such as stor-
age requirements and position in memory, operations allowed, composition
of structure elements, dimension of the mapped data space, etc.
Single features and their values are generally not suited for describing and
explicitly identifying complex multimedia data. This is why multiple features
are considered and stored in the database, when such objects are saved. More-
over, not only the objects, which fulfil all required attributes, are important
during a multimedia search, but also those which reside in the immediate
neighbourhood. The selected features span a multidimensional space, so that
the object characteristics can be represented by a point in this space. The
search for similar objects is usually accomplished by placing a figure in this
n-dimensional space and selecting those objects that are enclosed in the fig-
ure. Hyper-spheres, hyper-cubes, and other figures are specified independent
of the weighting, the structure of the feature space, and the query type. The following
query types are differentiated for these multidimensional spaces:
Exact Match Query: seeks all objects that fulfil all attributes required by
the query.
Partial Match Query: only certain, selected attributes of the objects are
considered in this case. A keyword search in a database needs such a
query, since all document features, except for this keyword, are disre-
garded. The partial match query is often called sub-pattern matching.
Partial Range Query: this query type looks for objects, whose feature val-
ues are within given intervals. The space spanned by the intervals defines
the region in which two objects are regarded as similar, so the similarity
term can be introduced with this query.
Next Neighbour Query: this query selects a single object, which has the
smallest distance to the query object, regarding a similarity function. An
extension is realised by looking for the k nearest neighbours. This feature
is for example necessary for ranking pictures by their similarity.
All Pair Query: in a given set of objects, all pairs are selected, which suffice
a certain distance condition.

Data structures that are employed to support such a search, are called
multidimensional index structures. Well-known examples are k-d-trees and
their extensions, like k-d-B-trees [Rob81], grid files [Knu73], R- and R*-trees,
SS- and SR-trees [WJ96], TV-trees (telescopic-vector trees), VP-trees (vantage
point trees) [Chi94a,Chi94b] or VA files (Vector Approximation) [FTA+00,WSB98].

Image retrieval examples. Examples for the results received by a similar-


ity search, using the wavelet-based feature are shown in Fig. 6.2 and Fig. 6.3.
The query image or sketch is displayed in the top-most row. Query results
are found below it, sorted by the computed similarity.

Fig. 6.2. Result of a wavelet-based image retrieval with a query image

An important advantage of this approach is that it works very well with


query sketches and thus allows intuitive querying techniques.
Fig. 6.3. Result of a wavelet-based image retrieval with a query sketch

7 Requirements for Multimedia Applications


From the viewpoint of parallel and distributed systems, the main attributes
of multimedia data and applications are the immense storage and bandwidth
requirements, which are often combined with the demand for real time capa-
bility.
The storage requirements for multimedia data types surpass the storage
demands of conventional, text-based documents by several orders of magni-
tude. This storage requirement is generally described by case examples and
average values. In the case of time-invariant media, such as images, one usu-
ally assumes commonly used dimensions and colour depths, while for dynamic
media a storage volume per time unit is given. The data reduction by means of
compression algorithms is considered with a common factor of, for example,
1 : 10 in these assessments. The results are assembled in tables, such as Table
7.1 (cited from [KA97]).
An ASCII-coded page of text requires about 2 Kbytes of storage space.
The requirement $S_I$ of an image depends on the dimensions $M \times N$ as well
as on the colour depth and can be estimated as

$S_I = M \cdot N \cdot \sum_{i} w_i + C,$   (17)

where $w_i$ is the word length of the colour channel $i$ - usually 8 bits. The
constant $C \ll S_I$ represents additional, technical and format-specific details.
According to Equation (17), the storage space required for a page of text is
the same as for an uncompressed RGB image of the dimension 26x26. Images
Media                 Format       Data volume
------------------------------------------------------
Text                  ASCII        1 MB / 500 Pages
Image (B/W)           G3/4-Fax     32 MB / 500 Images
Colour Image          GIF, TIFF    1.6 GB / 500 Images
Colour Image          JPEG         0.2 GB / 500 Images
Audio (CD-Quality)    CD-DA        52.8 MB / 5 min
Video                 PAL          6.6 GB / 5 min
High Quality Video    HDTV         33 GB / 5 min
Speech                ADPCM        0.6 MB / 5 min
Speech                MPEG Audio   0.2 MB / 5 min

Table 7.1. Data volume examples for static and dynamic media

used in applications, like a PAL-Frame, have a resolution of 768x576 and


require about 1.26 Mbyte storage capacity, nearly the same as approximately
630 pages of text. Recording with a digital camera results in images with
the resolution 1280x960 and needs about 3.51 Mbytes. Medical images, such
as x-ray exposures, are generally represented by a digital 4000 x 4000 grey-
scale image, amounting up to 15.2 Mbytes of data. These storage demands
increase fast with audio and video sequences. Thus, a second of audio in CD
quality needs about 180 Kbytes, and a full-screen, full-motion video sequence
in colour of the same length needs about 112 Mbytes of storage space.
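
These figures can be checked with a few lines of Python, assuming - as in the estimate of Equation (17) - that the word lengths are given in bits and that the small format-specific constant C is ignored:

def image_bytes(width, height, bits_per_channel=(8, 8, 8)):
    """Uncompressed image size in bytes, i.e. width * height * colour depth."""
    return width * height * sum(bits_per_channel) / 8

print(image_bytes(26, 26) / 1024)        # ~2 KB, one page of ASCII text
print(image_bytes(768, 576) / 2 ** 20)   # ~1.27 MB, one PAL frame
print(image_bytes(1280, 960) / 2 ** 20)  # ~3.5 MB, digital camera image
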
Yet these figures are of limited significance, as the media are stored in com-
pressed form. Different compression algorithms can be used for this, such as
JPEG or wavelet-compression for images, MPEG for video or MP3 for au-
dio sequences (for example [Cla95]). The storage requirements are lowered to
about 30 Mbyte/s for video and about 8 Kbyte/s for audio [DN91].
The type and parameters of the compression used depend on the appli-
cation. Lossy compression methods, like JPEG, eliminate redundant infor-
mation that is not detectable by the visual system. Medical images, on the
other hand, need to preserve critical details, making this type of compression
useless. Video sequences contain a lot of redundant information in successive
frames, allowing a larger compression rate.
Usually, the extracted features, the knowledge determined from these and
cross-references to objects and people in the real world are stored in the data-
base next to the raw data. While knowledge and the references are represented
by text, the storage volume of the extracted features can easily be multiple
times as large as that of the raw data. Furthermore, index structures for re-
trieval are assembled from these elements, and stored along with the media.
The number, type, and composition of the meta-information can depend -
among many others - on the following factors:

• Type of the retrieval method realised: with dynamic feature extraction,


only raw data is saved, the meta-information is generated on demand. In
the case of an a priori extraction, the meta-information is stored along
with the raw data.
• Time complexity of the extraction procedure: expensive operations should
be executed - if possible - a priori.
• Application area.
• Type of the data structures for the feature representation: some features
can be represented by vectors of constant or variable length. Other fea-
tures, such as image segments, may need to reserve matrices of the same
dimensions as the original image.

The demands regarding transmission bandwidth are closely related to


storage demands. The main factors are maximum and average packet delay,
packet delay jitter and probability as well as on-time reliability. Static media,
such as text and images, are robust against variations of the transfer rate.
On the other hand, little to no information is redundant, making packet loss
intolerable. So data integrity needs to be maintained during the transmis-
sion. Continuous media contain a lot of redundancy, allowing a few packet
losses to be compensated by the human audio-visual system. But transfer
rate fluctuations and packet delays result in distracting artefacts.
Applications that process and present a steady, continuous data stream
are called CBR (Constant Bit Rate) applications. Examples are video-on-
demand applications: as soon as a transmission starts, it has to be carried
out until the sequence ends using the same parameter values. Opposed to
this, VBR (Variable Bit Rate) applications can change the parameters, for
example when modifying the compression rate, or when the data stream is
dynamically generated.
Another important attribute of multimedia systems is the demand for
synchronised presentation. The individual components need to be found and
replayed within given, short intervals. It is important to consider that the
media can be stored on different levels of the storage hierarchy, thus mak-
ing the access latency dependent on the properties of the included storage
devices, such as tape, disk bandwidth, buffer space, etc. Furthermore, the
search is made using different, media-independent methods, so that notice-
able differences in the retrieval time are possible.
These factors are decisive when the media has to be made available on
demand. A user is only willing to wait a limited amount of time for the presen-
tation to begin. Specialised applications, like a video server in a hotel, which
has a limited selection of for example 20 movies, can satisfy these demands
with current technology. Video servers with a large collection of movies, or
a server with different media types require, however, modern architectural
concepts: high computational and storage performance have to be combined
with wide, multiple transfer paths, efficient search algorithms, playback, and
data mining strategies.
8 Parallel and Distributed Processing of Multimedia


Data
The fast development of networks and connectivity in the past few years
makes an online access to complex and storage intensive multimedia data
possible. The basic architecture for such services consists of a server for stor-
ing and managing the data, a number of clients with the querying and manip-
ulation interfaces, which are connected to the server via a high performance
network. Clients can initiate queries on the multimedia data, or display the
contents of the media, individually or synchronised.
Such architecture needs the following performance attributes:
• Playing back continuous media can be controlled interactively, i.e. the
user can stop and continue the presentation; certain audio-visual se-
quences can be skipped or rewound. Furthermore, selected parts can be
played back with increased or decreased speed. Changing the playback
attributes, such as volume, balance, saturation, brightness, etc. is allowed
as well.
• Quality of Service (QoS) is the collective term for all demands on the
recording and the replaying procedures, which refer to generating and
maintaining a continuous data stream. The recording is usually made
by specialised hardware that guarantees a specified rate, but the decod-
ing process is performed by the CPU, which is only partially available.
The data needs to be transferred to the user using available network
resources. Deviations from the presentation requirements result in dis-
tracting artefacts. This is the reason why all involved components have
to be scaled correspondingly - network capacities and load, performance
of clients and server, compression rate of the media, etc. Furthermore,
it needs to be ensured that sufficient resources can be reserved for the
current application.
• Synchronising requirements set the order of contents, time, and space of
several independent media. An example is the news speaker on TV: the
voice replay needs to be synchronised with the lip-movement. Addition-
ally, images concerning the current event, and further specifications such
as date, location, name of the reporter, etc. appear in the background.
• Media retrieval: searching for a certain media is a complex and time-
intensive procedure, that can be performed using different techniques.
• Dynamically adapting to the current resources is necessary, since homo-
geneous user infrastructure cannot be assumed. The network communi-
cation is such an example: if the user is connected via a modem, a higher
compressed version of the media needs to be transferred, due to time and
cost constraints. The compression rate is thus increased, until the service
quality can be assured.
In addition, such a multimedia system needs to supply interfaces and
mechanisms for input, modification, and removal of media and documents.
Corresponding I/O devices and software tools have to be supported. These


include next to standard components, such as speakers and microphones,
digital photo and video cameras, scanners, etc., as well as image, sound and
video processing systems, database systems, and other applications. Obtain-
ing the entire product range is expensive, needs maintenance, and is often
not economical, if used at each station. This is why a centralised distribution
of the resources is preferred, forming a single media server [GS98]. These are
part of the entire system and are equipped with all hardware and software
components for input/output, and processing of a certain media type. Due
to the network interconnection, the resources are available for all stations,
which act as media clients in this case. The user queries are segmented and
the individual parts are forwarded to the media server in question. The an-
swers are compiled by the client and then transferred to the user. Standard
examples for media servers are the following:

• Image servers are generally dedicated image database computers. They


contain a large number of images on the local storage devices and are sup-
plied with enough performance capacities, so that a costly image retrieval
can be performed.
• Video servers are the analogue for video data. The storage capacities have
to be huge, so that usually an entire series of hierarchical storage servers
are connected. They must be able to record and present multiple video
sequences.
• Document servers manage conventional documents composed of texts,
graphics, and images. They have at their disposal high performance scan-
ners, tools for text and graphic recognition, and efficient database man-
agement systems.
• Database servers contain information that spans the entire network. They
are used to coordinate and synchronise the queries, as well as to manage
the system.
• Web servers allow network-wide access to the multimedia system. This
component is the only one visible to the user: it offers different pre-
sentation and querying interfaces and abstracts the complexity of the
multimedia system.

Next to this division of media-specific resources, the storage and process-


ing capacities can be placed at the disposal of all participants in the internal
network. The result is a "naturally" distributed system that conforms to
the client/server concept introduced in the 1980s. As already described, a
client can be an active member (media server or media client) or a station
for querying and presentation. In the latter case, user workstations access the
system over the Internet and present the multimedia documents. They must
have standard input and output components for the media, but do not need
specialised devices or software.
8.1 Distribution of Multimedia Data


The centralised organisation of a media server requires immense storage,
computation, and network resources. With a growing number of user queries
and data to organise, such a centralised system will quickly reach the borders
of its capabilities, so that the quality of service is no longer fully sustainable.
A possibility to solve this problem is offered by distributed or parallel
architectures. The data and the programs are spread over several nodes, so
that the processing is accelerated or the path to the user is shortened. The
type of data partitioning and the structure of the fundamental architecture
depend on the target application. Three different applications, parallel or dis-
tributed Video on Demand (VoD) servers, federated, multimedia databases,
and a parallel retrieval system will now be presented.

Parallel and distributed VoD server. VoD servers offer a number of


services, like movie presentation, video conferencing, distance learning, etc.
The actors in such a system are

• Users requesting the system service.


• Service providers, which take care of communication between users and
the content providers. Interface management, registering, and bookkeep-
ing, forwarding the user queries, resource allocation, etc. are among these.
• Content providers store the media and offer the presentation and retrieval
mechanisms.

The content provider thus has one or more video servers at its disposal.
A centralised solution is linked to high transfer costs and heavy signalling
network traffic [DeP00]. A replicated media distribution among several video
servers, independent of one another, significantly reduces the transfer costs,
so that the quality of service demanded can be obtained. The location of the
individual servers can be determined according to different criteria:

• Geographical distribution of the users,


• Access frequency from a certain region,
• Type of VoD services offered, and
• Availability of the video material, etc.

Figure 8.1 contains a graphical representation of the described architec-


ture. A global server archives the entire material and offers mechanisms for
media description and retrieval. The local server stores the replicated data of
the global VoD server. The drawback of this solution is a higher management
and storage effort. A compromise is achieved by the means of a combined
solution, which distributes a selection of movies often demanded, while a
central server stores all other media.
Distributing the video data across a number of servers or disk arrays
supports parallelism and increases data throughput. There are two striping
Fig.8.1. A distributed VoD architecture

policies, time striping and space striping. In the first case, a video is striped
in frame units across multiple servers. In contrast thereof, space striping is
a technique to divide a video stream into fixed-size units. These are easy to
manage and simplify the storage requirements.
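
A toy Python sketch of space striping with a round-robin placement; the unit size, the server names, and the in-memory byte string stand in for real storage servers and are purely illustrative. The reassembly step corresponds to the task of the proxy described next.

def space_stripe(video_bytes, unit_size, servers):
    """Space striping: split a video byte stream into fixed-size units and
    assign them round-robin to the given servers."""
    placement = []                        # list of (server, unit) pairs
    for i in range(0, len(video_bytes), unit_size):
        unit = video_bytes[i:i + unit_size]
        placement.append((servers[(i // unit_size) % len(servers)], unit))
    return placement

def reassemble(placement):
    """Merge the units back into a coherent stream (the proxy's job)."""
    return b"".join(unit for _, unit in placement)

video = bytes(range(256)) * 40            # dummy 10 KB "video"
layout = space_stripe(video, unit_size=1024, servers=["s0", "s1", "s2"])
assert reassemble(layout) == video
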
The re-ordering and the merging of the video fragments in a coherent
video stream are performed by a component called proxy. There are three
main directions for the realisation of a proxy:
• Proxy at Server: a proxy is directly assigned to each storage server. The
proxy analyses the requests, determines the storage position of the other
video fragments, and forwards these from the corresponding proxy. The
computational resources of the storage server are used for this aim and
for the video retrieval.
• Independent Proxy: the proxies and the storage server are connected via
a network, so that the proxy can directly address all servers, and request
the required fragments. This assumes a corresponding control logic for
the proxy and the network, as well as sufficient bandwidth.
• Proxy at Client: a proxy is assigned to each client, which then takes care of
communicating with the storage servers. The communication complexity
is reduced, as the video fragments are transferred directly to the client.
On the other hand, the demands on the client complexity are significantly
increased, as they will need to realise the proxy functionality.

Further information as well as architectural examples such as TIGER are


given among many others in [DeP00,CHC+00,Lee98,BFD97].

Federated multimedia database systems. A "natural" data distribution


exists whenever multiple independent servers containing different amounts of
data are combined to an entity. An example for this is a multimedia database
for movies, which consists of several local databases. Two possibilities exist:

• The local databases store different media and the corresponding meta-
information, such as an image database with portraits of actors, an im-
age database with keyframes from different movies, video servers with
digitised movies, conventional databases with bibliographical and movie
information.
• Every local database contains a subset of the movie material, for example
sorted by production country: a server contains all information mentioned
for American, another server for French movies, etc.

Yet a central interface exists in both cases, which enables accesses to the
data in all local databases. In the case of a heterogeneous, distributed data-
base system, different database systems are allowed on the same node. The
individual systems can be entirely integrated in the global system by transla-
tion schemes, or merely supply interfaces - so called gateways - to the local
data. The latter are comparable to meta-search engines on the Internet: the
keywords are entered - a syntactical transformation is assumed - in a num-
ber of search engines that will then analyse their databases in parallel. The
syntactical transformation mostly concerns the formulation of logical expres-
sions, for example Word1 AND Word2 is transformed into +Word1 +Word2.
The results are then combined in a final result and presented to the user.
Figure 8.2 shows an example for a heterogeneous, distributed multimedia
database system.
Completely integrated database systems - called multi database manage-
ment systems - are a connection of different local database systems with
already partially existing databases, by the means of conversion subsystems,
in a new, global system. A centralised interface can query all subsystems and
combine the results. Opposed to homogenous database systems, local data ac-
cess is still allowed: the users can continue to use "their" part of the database
as before, without having to resort to the global interfaces and mechanisms,
i.e. the membership in the global system is transparent for these users. The
functionality of this architecture is visualised in Fig. 8.3 [BG98]. An example

Fig. 8.2. Architecture of a federated multimedia database system

for a federated multimedia DBMS for medical research is found in [CKT+93]


and a conceptual framework in [Ber00].

Parallel multimedia database systems. Another reason to distribute the


data among several nodes is to speed up the database operations. Accessing
all parts of the database is still done through a central, uniform scheme. Two
additional layers support executing the operations:

• Fragmentation scheme and


• Allocation scheme.

The fragmentation layer divides global operations in segments, which can


be applied on different nodes. This requires certain restrictions to be consid-
ered, so that a useful unification of the partial results is possible.
The allocation layer maps fragments to individual nodes. This operation
is especially important, when no a priori distribution is given: mechanisms
for data replication and scheduling procedures modify the storage position
of selected data sets continuously, or in periodic intervals, and are used for
backing up data, as well as increasing the database systems performance. So
if multiple copies of a data set exist, an optimised allocation strategy can
significantly increase the data throughput. The local execution mechanisms
cannot be differentiated and correspond to those of a conventional database
system. The principal structure of homogenous, distributed database systems
is clarified in Fig. 8.4.
Fig. 8.3. Functionality of a federated multimedia database

Different strategies for data partitioning, scheduling procedures, and par-


allelising a multimedia database are presented in Section 8.3.

8.2 Parallel Operations with Multimedia Data

Parallel processing is often used in the field of multimedia, as it can give


large performance boosts, thus increasing the spectrum of practically usable
methods. Parallelism is assisted in multimedia databases, in that the data
is usually only read. Changing the raw data is for example necessary when
the media quality is to be improved, like noise-suppression in an image, or
converting the media into a more compact form - such as transformation of an
MPEG-2 video into the MPEG-4 format. But these operations are usually
performed when the data is inserted in the database, so that no modifica-
tion is necessary during runtime. This means that multiple transactions can
access the same media when reading without a time-consuming synchronisa-
tion. Realising the inter-transaction- and intra-transaction-parallelism is thus
simplified and accelerated.
Fig. 8.4. Function scheme of a homogenous, distributed multimedia database

The duration for processing multimedia varies between split seconds, as in


histogram computations, and several hours, such as object tracking in a video.
In view of these compute intensive operations, which are necessary for feature
extraction and comparison, it is important to exploit the parallelism of the
individual operators and the data parallelism. This is why these approaches
are now presented in greater detail, and are illustrated with examples from
image processing.

Segmenting the media. Parallelising a multimedia operator is usually done


by using the principle of data parallelism. The media is divided into sections,
the number of which is normally equal to the number of parallel working
nodes available. The media is segmented regardless of its content, i.e. the
media is transformed into blocks of equal size. The operator then processes
these blocks. In some cases the sections have to overlap, so that the border
regions can be processed correctly. The partial results are then combined
in the final media. The blocks are often concatenated, but in some cases
this process needs additional computations, when determining corresponding


elements for a histogram, for example. Figure 8.5 visualises this process by
considering an image subdivision as an example.

Fig. 8.5. Parallel processing by subdivision of an image in overlapping sections

The advantage of this approach is that the used operators do not need
to be adapted with a complicated process, i.e. a large part of the sequential
code can be used unchanged. Furthermore, the computation time depends
mostly on the number of elements to be processed, so that all nodes require
nearly the same processing time.
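
A small Python sketch of such a content-independent segmentation into overlapping horizontal strips; the strip count and the overlap of two rows are arbitrary example values.

import numpy as np

def split_with_overlap(image, n_parts, overlap):
    """Divide an image into n horizontal strips that overlap by 'overlap'
    rows, so border regions can be processed correctly by each worker."""
    h = image.shape[0]
    step = h // n_parts
    strips = []
    for k in range(n_parts):
        top = max(0, k * step - overlap)
        bottom = min(h, (k + 1) * step + overlap)
        strips.append(image[top:bottom])
    return strips

img = np.random.randint(0, 256, (100, 80))
parts = split_with_overlap(img, n_parts=4, overlap=2)
print([p.shape for p in parts])
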
On the other hand, this kind of parallelising cannot be employed for all
operators, as dependencies between the media elements need to be consid-
ered. The partitioning of the data can thus falsify the result. These
methods are not well-suited for architectures with distributed memory, since
the large transport costs for the media significantly reduce the performance
gain. The reason for this is the large amount of time the nodes spend wait-
ing for their data. In the worst case the parallel processing may take longer
than the sequential execution [GJK+00]. Therefore, this kind of parallelism
prefers architectures with shared memory. The protracted transfers between
nodes are not performed, simplifying the segmentation, synchronisation, and
composition.

Parallel execution on multiple nodes. This type of parallel processing


is based on partitioning the data across several nodes. Each node processes
a disjoint subset of the total data, utilising the natural parallelism of a
computer network. The partial results are then combined to the final result.
This approach is especially well-suited for architectures with distributed
memory, as it partially neutralises the drawbacks of time-intensive communi-
cations. Transferring the operators that are to be executed is not very costly
and is efficiently taken care of by existing broadcasting mechanisms. The I/O
subsystem bottleneck is reduced as well, by splitting the transfer costs among
a series of nodes.
Another advantage is that the operators do not have to be modified. The


existing sequential code is executed on all available nodes simultaneously, and
produces partial results for the local, disjoint set of data. A new, central
component is necessary for unifying the partial results. Analogous to combining
blocks into a medium, different methods are possible here as well: if all results
are comparable, it is usually sufficient to sort them. This solution has the
advantage that selected media do not have to be moved, only the results of
the content analysis. Otherwise the extracted features or the raw data need
to be compared to one another, significantly increasing the transfer costs and
the time needed for the final comparison.
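
When the partial results are comparable, the central unification step reduces to merging locally sorted rankings, as in the following Python sketch (the distances and media identifiers are invented for illustration):

import heapq

def merge_rankings(partial_rankings, k):
    """Central unification step: each node returns its local hits as sorted
    (distance, media_id) pairs; the global top-k is obtained by merging."""
    return heapq.nsmallest(k, heapq.merge(*partial_rankings))

node_a = [(0.10, "img_017"), (0.35, "img_003")]
node_b = [(0.05, "img_104"), (0.40, "img_220")]
node_c = [(0.22, "img_311")]
print(merge_rankings([node_a, node_b, node_c], k=3))
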
The drawbacks of such a parallelisation become clear, when only a subset
of the data needs to be processed. In this case, an uneven data distribution
among the nodes is possible, causing the individual processing times to vary
significantly. Furthermore, the workload is located on few - in the worst case
on one - node, reducing the parallelism. The effective usage of all nodes
requires a dynamic re-distribution of the media, from nodes with a heavy
workload to nodes with little, or no, workload. Yet this is linked to large
transportation costs, and maybe with long idle times, which reduces, or even
eliminates, the performance boost. This disadvantage is especially serious
when large images or videos need to be transferred.
An example for this type of parallel processing of multimedia data is
demonstrated in Section 10, with the parallel image database CAIRO.

Functional parallelism. The foundation for a functional parallelism is the


partitioning of the algorithm in simultaneously executable modules, which
are then assigned to the multiple nodes. Control points are inserted, if a data
exchange, or a synchronisation, have to take place.
Partitioning the algorithm is a complex problem, as the runtimes of the
individual modules, and thus the workload, have to be nearly equal. Other-
wise, long idle times occur, for example during a synchronous communica-
tion: a module has to wait until its counter-piece reaches the control point
as well, before communication can take place. The second condition regards
inter-module communication: a long transfer time, which is generally the case
with multimedia data, means that the modules are blocked during this time.
This is why this kind of parallel processing is seldom used in the case of
multimedia data.
The modules can be alternatively organised in a pipeline, so that the data
is processed in phases.

8.3 Parallel and Distributed Database Architectures


and Systems

Parallel architectures and parallel processing are significant components of


the computer technology in the 1990s, and it is to be expected that they will
have the same impact on the development of the computer technology during
the next 20 years, as the microprocessors had in the past 20 years [CSG99].
ALMASI and GOTTLIEB [AG89] define a parallel architecture as a collection
of communicating and cooperating processing elements (PEs), which are used
to solve large scale problems efficiently.
The way in which individual PEs communicate and cooperate with each
other depends on many factors: the type and the attributes of the connecting
structures, the chosen programming model, the problem class to be solved,
etc. The organisational and the functional connections of these and other
components result in a multitude of different architectures, processing con-
cepts, system software, and applications. From the database community's
point of view the parallel architectures are divided into three categories:
Shared everything architecture: multiple, equally constructed PEs ex-
ist in the system and everyone of them can take care of exactly the
same assignments as all other processors (symmetry). These are, for ex-
ample, memory accesses, controlling and managing the input and output
activities, reacting on interrupts, etc. The other elements are regarded
by the shared operating system and the applications as unique, despite
them being able to be composed of several replicated components, such
as hard disk arrays. The synchronisation and communication is usually
performed by implementation of shared variables and other regions in the
memory. Figure 8.6 shows the principal composition of a shared every-
thing architecture.

Fig. 8.6. Shared everything architecture

Shared everything systems are the main platform for parallel database sys-
tems, and most vendors offer parallel extensions for their relational
and object-oriented databases. Independent query parts are distributed
over a number of PEs.
Shared disk architecture: each processor has a local memory and access
to a shared storage device in this class. The data that is to be processed is
transferred from the hard disk to the local memory and it is processed
there. The results are then written back to the hard disk, thus being
available for other processors. Special procedures are necessary to retain
data consistency - analogous to the cache coherency problem: the current
data can already be in the cache of a processor, so that accesses per-
formed in the meantime may return stale data. The graphical display of this
architecture can be seen in Fig. 8.7.

Fig. 8.7. Shared disk architecture

Shared disk systems host databases distributed in local area networks.


They usually replace shared everything systems, which are not powerful
enough for the performance requirements of new database functions such
as data mining.
Shared nothing architecture: in architectures with distributed memory,
the nodes each combine one or more processors and storage devices and are
interconnected by a powerful network to form a parallel computer. Because of the
small distance between the nodes, high bandwidths can be implemented,
making the access times very small. Communication and synchronisation
between processes on different nodes is done by message passing. This
principle can be realised for identical, as well as different types of pro-
cessors. Shared nothing architectures are more and more realised by con-
necting workstations through a network. Examples for this are Beowulf
clusters [Pfi98], which consist of traditional PCs and a high performance
network. The schematic display of a shared nothing architecture is shown
in Fig. 8.8.
Shared nothing systems are usually used for databases distributed over
wide area networks. Each node has a separate copy of the database man-
agement system and its own file system. All operations are performed
with the local data, and the inter-node communication is usually based
on the client/server paradigm with conventional network techniques.
An additional analysis of the parallel database architectures, in particu-
lar shared nothing architectures, is provided by NORMAN ET AL. [NZT96].
They found a convergence of parallel database architectures towards a hybrid
Fig. 8.8. Shared nothing architecture

architecture, which mainly consists of cluster of symmetric multiprocessors


(SMPs), which are widespread examples of shared everything architecture.

Distributed database systems. The data is distributed among several


nodes within a computer network and combined in a logical entity, thus mak-
ing it look like a single database system from the outside. The only difference
noticeable to the user is a possible performance improvement of the system,
which results from utilising the parallelism in the computer network. A dis-
tributed database management system is therefore the system component,
which enables the presentation and processing of the distributed data, in a
way that is transparent to the user.
The data distribution orients itself according to numerous, sometimes
even conflicting, requirements and is significant for the system performance
and fault tolerance. The data required by an operation should - if possible -
all be on one node, the so called data locality. On the other hand, as many
operations as possible should be processed in parallel, i.e. the data should be
distributed evenly among all available nodes.
The transparent display of data requires mechanisms to translate the
syntactic and semantic characteristics of distributed data sets. The local data
is usually combined according to the following basic abstractions:

Aggregation: several local data sets are combined in a higher abstracted


object.
Generalising: similar data sets are merged in a generic object.
Restriction: the features shared by different data sets build a subset and
can be described by a new abstraction. Features, which are not contained
in all data sets, are ignored.

Conversion schemes are used to map the local data models onto the new,
global data models. The conversion can take place directly or via an abstract
meta-model. The most important goal is to entirely preserve the data features
that are transferred, and the operators, which can be applied on the data sets.
Further essential characteristics of a distributed database system concern


the optimisation of queries, controlling parallelism, reconstruction, integrity,
and security. All these aspects are already well-defined for conventional data-
base systems, and have been analysed in detail. The possibility of a geograph-
ically separated data distribution introduces new constraints and requires
additional communication and synchronisation mechanisms. The problems
grow more complex when heterogeneous computer architectures, long pro-
cessing times per data set - as in multimedia applications -, hardware and
software failures, etc. have to be considered.

Parallel database systems. Parallel processing in the field of database


systems has a long tradition, reaching back to the early 1960s. Similar to the
introduction of parallelism in the operating systems, the database systems
were extended by components enabling a configuration, where multiple users
could simultaneously work with the database. A database system is a shared
resource, so that the parallel execution of operations is linked to a series
of conditions. Their purpose is to synchronise and control the individual
operations, thus preventing the results being falsified.
The parallel concepts of database systems largely refer to optimising the
following features:
• Fault tolerance: the data is stored redundantly, so that a storage device
failure does not have an impact on the entire system.
• Performance: the processing of multiple queries in parallel makes an in-
crease in performance possible, and reduces the system latency.
These two aspects are now inspected more thoroughly.

Fault tolerance. The importance of fault-tolerant systems increases more


and more in the age of eBusiness. Numerous OLTP applications (On-Line
Transaction Processing) are used around the clock, sometimes for critical
applications. A non-responding web site means a significant financial loss and,
more importantly, a loss of image that can threaten the existence of an entire
business. Thus, central systems such as databases need to be designed so that
defects of individual components can be bridged.
One of these components are hard disks. Manufacturers state the mean
time between failures is longer than 500000 hours, which is more than 60
years. But these numbers are computed from the number of units sold and
the number of units, which have been returned as broken. The effective time
of operation and the access frequency are not considered [GM00], so these
numbers are only partially meaningful.
Transferring these statements to large systems with 100 hard disks means
that, under the premise of a constant rate of failure, a fault can be expected
every seven months. These failures are caught by storing the data redun-
dantly. Realising such a redundancy can be done in many ways, and depends
on the type of application, the volume of the data to be stored and its trans-
portation costs, the required access rate, etc.
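
The seven-month figure quoted above follows directly from the stated MTBF, under the simplifying assumption of independent disks with a constant failure rate:

$\frac{500\,000\ \text{h per disk}}{100\ \text{disks}} = 5\,000\ \text{h} \approx 208\ \text{days} \approx 7\ \text{months}.$
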
Often used solutions are RAID systems (Redundant Array of Inexpensive
/ Independent Disks): several hard disks are bundled and configured by a
controller. The data is distributed among all the hard disks. A parity block
takes care of the redundancy and needs to be brought up to date with each
write access.
Another approach is mirroring the hard disks: several independent copies
of the same data set are stored on different hard disks. In the simplest form,
the storage media exists twice, as seen in Fig. 8.9. Next to the required
doubling of the storage resources, this solution incurs a higher management
overhead, since each write or delete operation needs to be performed on every
copy. The existing redundancy boosts the workload balancing within the
system, as each read access can be re-directed to a mirrored, non-overloaded
unit.


Fig. 8.9. Example for a mirrored disk

A more efficient usage of the existing storage resources is achieved through


the principle of chained de-clustering [HD90]: half of each hard disk is repli-
cated on a neighbouring node. Another possibility for partitioning considers
the access statistics of the individual data sets and combines heavily and
weakly frequented data. The access rate is thus evenly distributed among all
storage nodes and offers the highest flexibility regarding workload balancing.
In case of a node failure, the data from the backup copy is distributed among
all other hard disks. In the worst case, the failure of two nodes can result
in information loss, for example when the primary and the backup copy of a
data set are destroyed. An example for this is seen in Fig. 8.10.
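
The placement and read-redirection idea can be sketched in a few lines of Python; this is a much simplified model that backs up whole partitions on the neighbouring disk and ignores the re-distribution of the read load over all remaining disks that the scheme actually provides.

def chained_declustering(n_disks):
    """Primary copy of partition i on disk i, backup on disk (i + 1) mod n."""
    return {i: {"primary": i, "backup": (i + 1) % n_disks}
            for i in range(n_disks)}

def read_target(partition, failed, layout):
    """Redirect reads to the backup disk if the primary has failed."""
    entry = layout[partition]
    if entry["primary"] in failed and entry["backup"] in failed:
        raise RuntimeError("data lost: primary and backup both failed")
    return entry["backup"] if entry["primary"] in failed else entry["primary"]

layout = chained_declustering(5)
print(read_target(3, failed={3}, layout=layout))   # served from disk 4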

Performance. As already stated, redundant storage increases the paral-


lelism within the database system and thus increases the performance and the
data throughput. Parallel processing can be applied in different ways: start-
ing with re-routing queries to nodes with a lighter workload, over segmenting
command sequences in parts independent in respect to the data locality, to
applying an operation onto independent, redundant subsets. Which type of
Fig. 8.10. Example for Chained De-clustering with five independent disks: (a)
Normal case; (b) Re-distribution after disk number 4 failed

parallel processing is used depends on the configuration of the architectural


elements, the current application, as well as the data partitioning across the
individual nodes. The following basic classes of parallel processing can be
identified [Reu99]:

• Inter-transaction parallelism,
• Parallelism of operations within a transaction,
• Parallel execution of individual database operations, and
• Accessing the stored data in parallel.

The inter-transaction parallelism is based on a concurrent or parallel exe-
cution of multiple transactions. Thereby, it has to be assured that each trans-
action works on current, consistent data and does not affect the result of a
competing transaction. This approach is especially efficient when short, small
instruction sequences are executed. The chance of a deadlock in read/write
accesses is high with long computation sequences and durations, so that the
performance boost is reduced by expensive recovery operations. Such long-running
operations are often found, for example, in data mining applications.
The parallel execution of individual database operations is usually based
on parallelising the fundamental operations, like file scans, or index genera-
tion. The parallel execution of primitive operators will of course utilise the
natural parallelism of a computer cluster with appropriate data partitioning.
Each PE is assigned the execution of the same operation, which is then -
without further synchronisation efforts - applied to the local partition. Af-
ter the execution sequence, all partial results are unified and combined in
a global result. Some examples for the parallel execution of basic database
operations, such as join or index creation, are given, among others, in
[Reu99,GM00,AW98].
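The scatter-gather scheme just described can be illustrated with a small sketch. The code below is not taken from any of the cited systems; it assumes the relation has already been partitioned across the processing elements and applies the same selection operator to every local partition in parallel before combining the partial results into a global result.

from multiprocessing import Pool

# Hypothetical local partitions of a relation, one per processing element (PE).
PARTITIONS = [
    [{"id": 1, "size": 120}, {"id": 2, "size": 340}],
    [{"id": 3, "size": 500}, {"id": 4, "size": 80}],
    [{"id": 5, "size": 260}],
]

def local_scan(partition):
    """The same primitive operator (here: a selection) is applied to each
    local partition without any synchronisation between the PEs."""
    return [row for row in partition if row["size"] > 100]

if __name__ == "__main__":
    with Pool(processes=len(PARTITIONS)) as pool:
        partial_results = pool.map(local_scan, PARTITIONS)
    # Unification phase: the partial results are combined into a global result.
    global_result = [row for part in partial_results for row in part]
    print(global_result)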

9 Parallel and Distributed Techniques for Multimedia Databases
This section addresses the realisation of distributed and parallel tech-
niques for multimedia data. The first part describes the content-independent
and content-dependent data partitioning across individual nodes. Then differ-
ent possibilities for executing database operations in parallel are introduced.
These are general methods, which are clarified with the example of an image
database.

9.1 Partitioning the Data Set


The data distribution across multiple nodes is the most significant factor for
the efficiency of database systems on parallel architectures in general and on
a shared nothing architecture in particular [WZ98]. Existing analyses refer
mainly to parallel, relational database systems, in which the relations to be
processed are divided into segments and are assigned to different PEs. The
demands on such data segmentation are increased in the case of multime-
dia data, as the time consuming data transfer between the individual nodes
affects the performance of the entire system.
The data distribution can be performed statically or dynamically. In the
case of a static distribution, the data is analysed and placed in pre-determined
categories depending on one or more features. A dynamic distribution takes
place during runtime and, for example, equalises the workload of individual
nodes.
A static data distribution can be generally split in three phases:

• De-clustering: the entire data set is divided into disjoint subsets, ac-
cording to a given attribute or a combination of attributes.
• Placement: the subsets of the first phase are distributed among the indi-
vidual nodes.
• Re-distribution: the partitioning and assignment process is repeated in
certain intervals, or on demand, to eliminate disturbances in the equilib-
rium, for example after a node was added.

The advantage of these static distributions is that expensive computations
and data transfers are not necessary during runtime, i.e. the performance
of the database system is not reduced by additional management overhead.
Furthermore, the partitioning can be manually optimised through administrator
interaction and adapted to given applications. On the other hand, short-term
reactions to variable workloads among the nodes are not possible.
A dynamic distribution of the data is done during runtime and considers
the current workload of the nodes, as well as the number and the structure
of pending queries. The entire process generally consists of the following sub-
processes:

• Bookkeeping of the queries performed,
• Monitoring the current workload of the nodes in the system, and
• Re-distribution of data from heavily loaded to idle nodes.

Idle times of individual PEs can be minimised by continuously re-distribu-
ting the data, which also increases the total throughput. These advantages
are offset by an increased management overhead.

Strategies for static data distribution. Different memory models for the
organisation of complex objects in conventional database systems - for exam-
ple relational database systems - have already been analysed. The direct memory
model stores the main and sub-objects together. This eases object accesses
and reduces the necessary I/O activity. On the other hand, the tables grow
disproportionately and executing database operations becomes inefficient.
In a normalised memory model, the objects and the corresponding at-
tributes are divided into tuple sets. These are then mapped to one or more
files. Two basic partitioning methods are possible:

• Vertical partitioning: the values of a given attribute are stored together
in a file. Such a file contains, for example, the wavelet coefficients of all images
in the database. Figure 9.1 shows this technique.

ObjectID      Histogram          Colour Moments   ...   Wavelet Coefficients
Beach0001     (0.03, ..., 0.05)  (273.45, ...)    ...   (54.17, ...)
BeachAthens   (0.01, ..., 0.08)  (125.37, ...)    ...   (98.65, ...)
...           ...                ...              ...   ...
              File 1             File 2           ...   File n

Fig. 9.1. Vertical partitioning of a relational database

• Horizontal partitioning: the grouping is done object-oriented, i.e. a file
contains the main objects and the corresponding attributes of a subset of
all data sets. An example for horizontal partitioning can be seen in Fig. 9.2.

It is also possible to use a combination of horizontal and vertical parti-
tioning: the so-called mixed partitioning. Vertically partitioned segments are
divided horizontally in a second step, and vice versa.
ObjectID      Histogram          Colour Moments   ...   Wavelet Coefficients
Beach0001     (0.03, ..., 0.05)  (273.45, ...)    ...   (54.17, ...)
BeachAthens   (0.01, ..., 0.08)  (125.37, ...)    ...   (98.65, ...)
...           ...                ...              ...   ...

Fig. 9.2. Horizontal partitioning of a relational database

The following basic strategies are available for the most often used hori-
zontal partitioning:

Range strategy: the data is divided into ranges, based on the value of an
attribute or a combination of attributes. A simple example for this is the
mapping of visitors of an office to one of the available counters based on
the first letter of their surname.
Hashing strategy: the attributes are transformed with a given hash
function and thus mapped onto the corresponding partition.
Round-Robin strategy: if n nodes are available, a data set is sent to node
k, with k < n, the next data set is sent to node (k + 1) mod n, and so forth.
After a certain runtime, an even distribution across all nodes is reached.

The initial distribution obtained with these strategies is changed by
adding new elements, so a periodic re-organisation of the partitions is neces-
sary.
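As an illustration, the three strategies can be sketched as simple placement functions; the attribute values, range boundaries, and number of nodes below are invented for the example and do not refer to a particular system.

from itertools import count

N_NODES = 4                              # assumed number of nodes

def range_partition(surname):
    """Range strategy: map a surname to a node by its first letter."""
    bounds = ["G", "M", "S"]             # illustrative range boundaries
    for node, bound in enumerate(bounds):
        if surname.upper() <= bound:
            return node
    return len(bounds)

def hash_partition(key):
    """Hashing strategy: a hash function maps the attribute onto a partition.
    (A stable hash would be used in practice; hash() suffices for the sketch.)"""
    return hash(key) % N_NODES

_counter = count()
def round_robin_partition(_key=None):
    """Round-Robin strategy: the i-th inserted data set goes to node i mod n."""
    return next(_counter) % N_NODES

for surname in ["Adams", "Miller", "Zhang", "Kao"]:
    print(surname, range_partition(surname),
          hash_partition(surname), round_robin_partition(surname))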
Assigning the partitions to the individual nodes takes place in the next
phase, in which time-consuming data transfers may be necessary. The criteria for the
assignment definition are usually based on the workload, which is to be min-
imised for a certain system component, such as the PEs, the I/O system,
or the network. The result of these assignment strategies is the following
classification:

• Network traffic-based strategies,
• Size-based strategies, and
• Access frequency-based strategies.

The goal of the network traffic-based strategies is the minimisation of the pro-
cessing delay generated by extensive network traffic. They are mainly
used in distributed systems running on a shared nothing platform. A heuris-
tic procedure, combined with a greedy algorithm, is suggested by APERS
[Ape88]: the fragments of the relation to be processed form the nodes of
a graph. The edges of this graph are given weights that correspond to the
transfer costs to neighbouring nodes. Each node pair is analysed according
to these costs and the pair with the highest costs is merged into one node.
This is repeated until the number of nodes in the graph equals the number
of nodes actually existing. Variations of this fundamental algorithm, with respect
to fragment allocation or the grouping of PEs, are examined, for example, in
[IEW92].
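A rough sketch of this greedy merging idea is given below; the fragment names and transfer costs are invented, and the sketch is only meant to illustrate the repeated merging of the most expensive pair until as many groups remain as there are physical nodes.

def greedy_merge(fragments, transfer_cost, n_physical_nodes):
    """Repeatedly merge the pair of fragment groups with the highest mutual
    transfer cost until only as many groups remain as there are physical
    nodes.  transfer_cost[(a, b)] holds the edge weight between fragments."""
    groups = [{f} for f in fragments]

    def cost(g1, g2):
        return sum(transfer_cost.get((a, b), 0) + transfer_cost.get((b, a), 0)
                   for a in g1 for b in g2)

    while len(groups) > n_physical_nodes:
        # Find the pair of groups with the highest communication cost ...
        i, j = max(((i, j) for i in range(len(groups))
                    for j in range(i + 1, len(groups))),
                   key=lambda ij: cost(groups[ij[0]], groups[ij[1]]))
        # ... and merge it into a single group (node).
        groups[i] |= groups.pop(j)
    return groups

costs = {("F1", "F2"): 10, ("F2", "F3"): 7, ("F1", "F3"): 1, ("F3", "F4"): 5}
print(greedy_merge(["F1", "F2", "F3", "F4"], costs, n_physical_nodes=2))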
Strategies based on the processing size are developed for systems that
support a fine-grained parallelism. The compute time of all participating PEs
is adjusted by transferring approximately the same volume of data to all nodes
for processing. According to HUA's strategy [HL90], the data set is divided
into a large number of heuristically determined cells that are combined in
a list. The first element is transferred to the node with the most free space
in its storage device and is then removed from the list. This procedure is
repeated until the list is empty.
The I/O system is a bottleneck, when large amounts of data are processed.
This is the reason why strategies in the third class reduce the frequency with
which the secondary memory is accessed and spread the accesses evenly across
all nodes. The BUBBA system applies such a strategy, defining the heat of a
fragment as the frequency with which it is accessed and its temperature as
the quotient of this frequency and the relation size. Heat is
the measure according to which the number of nodes needed to process the
relation is computed. The temperature determines if the relation is to be kept
in the main memory, or if it should be swapped in the secondary memory. The
relations are sorted according to their temperature and distributed among
the nodes with a greedy algorithm, so that every node has nearly the same
temperature.
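A possible reading of this heat/temperature placement is sketched below; the relation sizes and access frequencies are invented, and the balancing rule (assign the relation with the highest temperature to the currently coolest node) is a simplifying assumption rather than the exact BUBBA algorithm.

def place_by_heat(relations, n_nodes):
    """relations: list of (name, access_frequency, size).
    heat = access frequency, temperature = heat / size.
    The relations are sorted by temperature and assigned greedily so that
    every node ends up with roughly the same accumulated temperature."""
    nodes = [{"relations": [], "temperature": 0.0} for _ in range(n_nodes)]
    for name, heat, size in sorted(relations,
                                   key=lambda r: r[1] / r[2], reverse=True):
        temperature = heat / size
        coolest = min(nodes, key=lambda n: n["temperature"])
        coolest["relations"].append(name)
        coolest["temperature"] += temperature
    return nodes

rels = [("orders", 900, 3), ("items", 400, 2), ("logs", 100, 10), ("users", 300, 1)]
for i, node in enumerate(place_by_heat(rels, n_nodes=2)):
    print(i, node)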

9.2 Applying Static Distribution Strategies on Multimedia Data


The proper distribution of the data across the nodes of a database system is
the basic requirement for utilising the parallelism. This is especially impor-
tant for multimedia data - due to the high communication costs - and is
essential for the performance gain possible.
Content-dependent and storage-based partitioning methods are presented
in the following. In the ideal case, the media is divided into categories, which
are assigned semantic meanings. Thus, only relevant categories are regarded
during queries, so that the processing time necessary for a retrieval can be
reduced significantly.

9.3 Content-Independent Distribution of Multimedia Data


In a storage-oriented and content-independent distribution of multimedia
data across the database nodes, the medium to be inserted is assigned
to the partition with the least storage space used.
Let DVP_1, DVP_2, \ldots, DVP_n be the sums of the memory usage of all
media on the local hard disks of the database nodes 1, 2, \ldots, n, i.e.

    DVP_i = \sum_{j=1}^{a_i} sizeof(m_{ij}),   i = 1, 2, \ldots, n,          (18)

where the function sizeof(m_{ij}) returns the storage space required for the
medium m_{ij} and a_i represents the number of media at node i. The values
DVP_i are managed in a master list, which is updated whenever a new medium
is added.
Let a medium m_new with x = sizeof(m_new) be given that is to be inserted
in the database. The node k, k \in [1, n], with the least storage space used is
determined for this aim:

    DVP_k = \min_{1 \le i \le n} DVP_i.          (19)

The value DVP_k = DVP_k + sizeof(m_new) is then updated and the medium
m_new is transmitted to the node k for storage.
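Equations (18) and (19) amount to a small piece of bookkeeping; the following sketch assumes the per-node storage sums are kept in a master list, as described above, and is not tied to a concrete system.

class StorageBalancer:
    """Keeps the sums DVP_i of the storage used on every node and assigns
    each new medium to the node with the least storage space used."""

    def __init__(self, n_nodes):
        self.dvp = [0] * n_nodes          # master list of storage usage per node

    def insert(self, size_of_new_medium):
        k = min(range(len(self.dvp)), key=self.dvp.__getitem__)  # cf. Eq. (19)
        self.dvp[k] += size_of_new_medium                        # update DVP_k
        return k                          # node to which the medium is sent

balancer = StorageBalancer(n_nodes=3)
for size in [700, 300, 400, 650, 200]:    # sizes of incoming media (e.g. in KB)
    print("store on node", balancer.insert(size), "->", balancer.dvp)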
The advantages of storage-based and content-independent partitioning
are the straightforward realisation and management. By evenly distributing
the media across the individual nodes a similar processing time is reached, so
that no complex load-balancing algorithms have to be used. On the other hand,
all media must be searched in a query based on dynamic attributes, making it
impossible to reduce the volume searched and to minimise the response time.
The total duration of a retrieval is the sum of the runtime of the node with
the most data and the time necessary to combine all partial results.

9.4 Content-Based Partitioning

Content-based partitioning is understood as the division of the multimedia
data into disjoint classes based on a given attribute or a combination of
attributes. This special attribute has to be chosen in such a way that a non-
ambiguous class assignment can be made. Only then is it possible to reduce
the data volume that needs to be searched according to an attribute of the
query media, thus minimising computation and communication overhead.
Selected content-based partitioning methods are introduced in the following.

Partitioning according to keywords. A simple method to construct im-
age subsets is by using keywords. The role models for this approach are
Internet search engines that offer the user a set of starting points. These are
characterised by a simple description, such as Entertainment or Computer.
After a category is selected, a full-text search can be performed on all media
in this category.
The keywords are bound to a hierarchy that describes the main tendency of
a medium. This is typically the main focus of an image or the main plot of a
video, which are determined by the subjective perception of a viewer. Each
of these keywords defines a partition, and all media belonging to this cate-
gory are assigned to it. The characterisation can be accomplished manually,
automatically, or semi-automatically.
The manual assignment is done by having so-called media managers view
all media in stock and decide which keyword fits the current medium best. This
method is reliable and easy to implement, so that it is standard practice in the
commercial area, such as press agencies, television stations, photo agencies,
etc. The demands are higher in this setting, as not only one given keyword
but a whole series of keywords has to be entered. Not a mere grouping, but an
entire characterisation is aimed for, which is very time-consuming and linked to
high personnel costs.
Assigning the keywords to the media automatically requires general, work-
ing, and precise procedures for content analysis and for similarity search. The
automatically generated keywords are manually controlled and, if needed,
corrected in semi-automatic assignment.

Partitioning according to additional information. Media classification
according to additional information is closely analysed in numerous works
about digital libraries, as entire documents are stored in such systems. The
idea of this method is to generate a media description from the supplied
titles and subtitles, as well as from the rest of the information in the document.
SABLE ET AL. introduce in [SH99] an example for a text-based partitioning
method. Information retrieval metrics, such as the frequency of a term in the
text (term frequency, TF) and the inverse document frequency (IDF), i.e. the
inverse of the fraction of documents containing the term, are the fundamentals
of this method [Sal89]. They are defined as follows:

    IDF(word) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing } word} \right)          (20)

    TFIDF(word) = TF(word) \times IDF(word)

Each document and the two categories are represented by vectors con-
sisting of TFIDF values. The degree of the match is calculated with the scalar
product in the form

    score(doc, cat) = \sum_i TFIDF_{doc}[i] \times TFIDF_{cat}[i].          (21)


The text and the corresponding image are assigned to the category with
the highest match. Different restrictions and modifications of this general
principle are introduced in [SH99], which make match rates larger than 80%
possible.
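Equations (20) and (21) can be illustrated with a few lines of code; the toy documents and the two categories below are invented, and the sketch ignores the restrictions and modifications discussed in [SH99].

import math
from collections import Counter

def tfidf_vector(text, documents):
    """TFIDF(word) = TF(word) * IDF(word) with
    IDF(word) = log(total documents / documents containing word)."""
    tf = Counter(text.lower().split())
    n_docs = len(documents)
    vec = {}
    for word, freq in tf.items():
        df = sum(1 for d in documents if word in d.lower().split())
        idf = math.log(n_docs / df) if df else 0.0
        vec[word] = freq * idf
    return vec

def score(doc_vec, cat_vec):
    """Scalar product of the two TFIDF vectors, cf. Eq. (21)."""
    return sum(w * cat_vec.get(word, 0.0) for word, w in doc_vec.items())

documents = ["sandy beach with palm trees", "narrow streets of the old city",
             "beach volleyball at sunset"]
categories = {"landscape": "beach palm trees sunset",
              "city": "streets buildings city traffic"}

doc_vec = tfidf_vector("a person on the beach at sunset", documents)
best = max(categories,
           key=lambda c: score(doc_vec, tfidf_vector(categories[c], documents)))
print("assigned category:", best)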

Partitioning according to visual similarity. This type of partitioning
is mainly used for images. These are compared to given example images
and assigned to a class. Colour and texture attributes are fundamental.
Approaches using Self-Organising Maps and other soft computing methods
are also known.
SZUMMER and PICARD [SP98] describe an approach for dividing images
into two classes, indoor and outdoor images. VAILAYA, JAIN and ZHANG
suggest a method for dividing images into landscape and city photos [VJZ98]
based on extracting and comparing the following primitive attributes:

• Colour histogram: five partial histograms consider the spatial distribution
of the colours.
• Colour coherence vector: the pixels of a colour class are divided into
coherent and incoherent elements. A pixel is called coherent, when it is
part of a large region of a similar colour.
• DCT coefficients: the nine largest coefficients in the intensity field and the
four largest coefficients of both chrominance fields are used to compute
four central moments of second and third order.
• Edge direction histogram: the pixels are divided into 73 classes. The first
72 classes correspond to partitioning all edge slopes possible in steps of
5°. The last class is composed of all pixels that do not belong to an edge.
The values are normalised by dividing them by the total number of edge
pixels, or by the number of pixels in the image, respectively.
• Coherency vector for edge-direction: all edge pixels of a given direction
are divided into coherent and incoherent pixels.

The choice of these attributes is justified, since the artificial objects in a
city scene contain many horizontal and vertical edges, while large and evenly
coloured regions dominate in landscapes.
Examples whose attributes represent all elements of the class are chosen
manually for both categories. Membership of a query image is determined
by an adapted computation of the Euclidean distance with the extracted
attributes as parameters. The results of this computation d_i are normalised
for the k nearest neighbours onto the interval [0, 1], so that the probability p_j
of class j, j \in \{1, 2\}, is defined as follows:

                                                                    (22)

The query image is assigned to class j if p_j > 0.5. The hit ratio achieved
is between 66% and 93.9%, depending on the number of neighbours considered
and the attributes included.
Combined approaches that use textual as well as content-based at-
tributes are described, for example, in [OS95,SC97].
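Since the normalisation in Eq. (22) is not reproduced above, the following fragment shows only one plausible instantiation of the described decision rule: the distances d_i to the k nearest manually chosen examples are turned into weights that sum to one, and the query image is assigned to class j if p_j > 0.5. The feature vectors are invented.

import math

def knn_class_probability(query_vec, examples, k=5):
    """examples: list of (feature_vector, class_label) with labels 1 or 2.
    One plausible reading of the text: take the k nearest examples and use
    normalised distance-based weights as class probabilities."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest = sorted(examples, key=lambda e: dist(query_vec, e[0]))[:k]
    weights = [1.0 / (1e-9 + dist(query_vec, vec)) for vec, _ in nearest]
    total = sum(weights)
    p = {1: 0.0, 2: 0.0}
    for (vec, label), w in zip(nearest, weights):
        p[label] += w / total
    return p

examples = [((0.9, 0.1), 1), ((0.8, 0.2), 1), ((0.2, 0.9), 2),
            ((0.1, 0.8), 2), ((0.85, 0.15), 1)]
p = knn_class_probability((0.7, 0.3), examples, k=3)
print(p, "-> class", max(p, key=p.get) if max(p.values()) > 0.5 else "undecided")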

Partition placement. There are two fundamental approaches to assigning
the partitions to the nodes:

• 1/n of each partition is assigned to each of the n nodes. A query is
therefore - ideally - processed in equal parts on all nodes. Concurrent
queries are executed sequentially.
• Each node is assigned one or more complete partitions. A query requires
only the computing performance of selected nodes; the other nodes can
process competing queries at the same time. This approach is useful for
a large number of queries. With few queries, the computing cost is con-
centrated on a small number of nodes, while the other nodes are idle.

9.5 Dynamic Distribution Strategies


The initial distribution of the media across the nodes of a database serves as
a basis for applying dynamic distribution strategies. These are used during
the system runtime to minimise response times. The current workload of
individual nodes, the structure and number of queries, and other system
parameters are considered here.
It is necessary to integrate dynamic distribution strategies in a shared
nothing architecture, if at least one of the following conditions applies:

• By combining a priori and dynamic feature extraction, the number of
data sets to be searched is reduced, disturbing the even distribution.
• Heterogeneous platforms are used for the retrieval, for example nodes
with different performance capabilities.
• The system is also used for other applications, so that it is possible for
different workloads to exist on the nodes.

Each of these conditions changes the uniform behaviour of a homogeneous,
dedicated system and increases the functional complexity. In such a case, the
total processing time corresponds to the processing time of the node that,
for example, has the least processing power or needs to analyse the most
media.
Let a shared nothing architecture be given with m processing units, on
which a database system for multimedia retrieval is run. A query q is pro-
cessed in two steps s and d, where s considers a static and d a dynamic feature
extraction. The following combinations are then possible for the processing
of a set B:
1. q(B) = s(B),
2. q(B) = d(B),
3. q(B) = s ∘ d(B),
4. q(B) = d ∘ s(B), and
5. q(B) = d(B) {∪, ∩, ...} s(B).
Queries 1 and 2 consider only one feature type and therefore there
is no need for task scheduling and unification of partial results. Query
number 5 requires a parallel execution of s and d, thus there is still no need
for dynamic re-distribution, as all - initially distributed - data has to be
processed. The failure of a computing node and the migration of its tasks
to other nodes, as well as execution on heterogeneous architectures are not
considered.
The query types 3 and 4 are compositions of the s and d sequences. Query
number 3 performs a retrieval with dynamically extracted features in the first
stage. The results are then processed with the a priori extracted features in
order to determine the final ranking. From a scheduling perspective this is a
non-critical case, as approximately the same processing time is assumed for all
available nodes. This is a consequence of the initial equal-sized data distribution.
The second processing step considers only a priori extracted features and
corresponding operations, all executed on a single node.
Query number 4 represents a critical case. The retrieval with a priori
extracted features reduces the data set, which has to be considered during
the retrieval with dynamically extracted features. The equal data distribution
over the nodes is distorted; in the worst case all data is located on a single
node, thus no parallel processing can be done. Only this particular node
performs the retrieval operations, while the other nodes idle, resulting in
much longer system response times.
Let t_d(b_{i1}), t_d(b_{i2}), \ldots, t_d(b_{in_i}) be the processing times for the media
b_{i1}, \ldots, b_{in_i} stored on node i. Now, the following important time parameters
can be approximated:

• System response time t_r is the maximal processing time of all nodes:

    t_r = \max_{1 \le i \le m} \left\{ \sum_{j=1}^{n_i} t_d(b_{ij}) \right\}.          (23)

• Minimal processing time t_min equals the processing time of the node with
the smallest number of relevant media:

    t_min = \min_{1 \le i \le m} \left\{ \sum_{j=1}^{n_i} t_d(b_{ij}) \right\}.          (24)

During 0 ... t_min, all nodes are fully loaded, thus no media re-distribution
is necessary. After this period at least one node idles and media re-
distribution is necessary in order to avoid unused resources.
• Optimal processing time t_opt with

                                                                    (25)

In this case the idle times of all nodes are minimised, and the best possible
system response time is reached. The goal of the scheduling strategy is
to approximate this time as well as possible.

Figure 9.3 depicts the described time parameters and gives an example
for the differences between the processing times of the nodes.

Fig. 9.3. Graphic representation of the processing times t_i of the individual PEs P_i,
as well as the three points in time (t_min, t_opt, t_r) vital for using dynamic re-distribution
strategies

It follows that data sets of different sizes inevitably lead to varying pro-
cessing times of the individual nodes, and so a significant difference between
the response time t_r and the optimal execution time t_opt can result.
With regard to the total performance, the point in time of a dynamic re-
distribution is decisive. In the simplest case, such a strategy is activated as
soon as a node finishes processing the media assigned to it. A better
utilisation of the available resources is achieved when the current situation is
analysed during the processing time [0, t_min], as all nodes are busy processing
the media on their local storage devices. The special case t_min = 0 arises
when no media have to be processed on a certain node. Generally, t_min > 0
time units remain to analyse the current situation and to generate a re-
distribution plan.
But creating an a priori execution plan requires data on

• Execution time t_p of the current operator p, or of the combination of oper-
ators, as a function of the elements remaining to be processed, and
• Information on the number of elements to be processed per node.

The time t_p can be determined beforehand by running a series of trial runs
for all PEs in the system and storing the results in the database. Alternatively,
this time can be approximated from the current processing of the first data
sets. The number of elements to be processed is determined from the technical
information corresponding to the media.
The aim of a distribution strategy is to reduce the system latency t_r by
temporarily or permanently re-distributing the data among the nodes, so that
the processing time is as close to the ideal time t_opt as possible.
The heuristic Largest Task First (LTF) [KSD01] is a simple strategy to
dynamically re-distribute the media on a homogeneous platform and is char-
acterised by a low time complexity. The basic idea is to sort the media stored
on each node by decreasing processing times, so that the medium with the
largest processing time is worked on first. This pre-sorting ensures that as
little data as possible needs to be transferred through the network during a
re-distribution.
The processing can then be executed up to t_min. When the first node
starts to idle, the re-distribution is initialised and the media that are to be sent
to this node from overloaded nodes are determined. The first medium selected
is the smallest medium b_pq on the node with the maximum processing time
t_max. The difference of processing times t_max - t_min is then compared to the
processing time of this medium. If it is larger than t_d(b_pq), the medium is re-
directed to the node with the minimal processing time t_min. This is repeated
until no medium that fulfils the requirement exists on any node.
All media planned for re-distribution are then transferred to their target
nodes from each PE i at a node-specific time point t_i^comm.
Processing of the media remaining at the node, and of the (temporarily or
permanently) re-distributed media, is then resumed at each node.
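The re-distribution step of LTF can be sketched as a small loop; the following fragment is only an illustration of the idea under simplifying assumptions (processing times are known exactly, transfer costs are ignored, and the loop keeps moving the smallest medium from the most loaded to the least loaded node as long as this shortens the makespan). The values are invented.

def ltf_redistribution(node_loads):
    """node_loads: one list per node with the estimated processing times
    t_d(b_ij) of the media still to be processed there.  Returns a list of
    (source_node, target_node, processing_time) transfers."""
    # Pre-sort: media with the largest processing time are worked on first,
    # so that mostly small media remain for a possible re-distribution.
    loads = [sorted(times, reverse=True) for times in node_loads]
    transfers = []
    while True:
        totals = [sum(times) for times in loads]
        t_min, t_max = min(totals), max(totals)
        src, dst = totals.index(t_max), totals.index(t_min)
        if not loads[src]:
            break
        smallest = min(loads[src])      # smallest medium on the most loaded node
        if t_max - t_min <= smallest:   # moving it would not reduce the makespan
            break
        loads[src].remove(smallest)
        loads[dst].append(smallest)
        transfers.append((src, dst, smallest))
    return transfers

print(ltf_redistribution([[9, 7, 4, 2], [3, 1], [5, 2]]))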
The advantages of this strategy are, next to the low time complexity, the
simple implementation and management. On the other hand, concentrating
the communication onto the point in time tmin is not efficient, as the network
is overloaded, and the number of transfer collisions is increased. This creates
latencies that reduce the performance gained by parallelising the process. An
alternative approach, distributing the communication over a longer period of
time, is proposed in [DK01]. The LTF strategy also fails when the condition
of a linear connection between media size and processing time is no longer
satisfied. This is the case, when the processing time depends on the size and
the content of the current media, for example.
Better suited are those dynamic distribution strategies that permanently
analyse the workload distribution and execute the re-distribution in pairs.

10 Case Study: CAIRO - Cluster Architecture for Image Retrieval and Organisation

In closing this chapter, a prototype for a parallel image database realised on
a shared nothing architecture is introduced.
Image management systems are one of the most important components of
general multimedia databases and are used to organise, manage, and retrieve
different classes of images. A distinction is made between:

Pattern recognition systems working with a homogeneous and limited
set of images, for example pictures of work pieces on a production line,
finger prints in police files, etc. These are compared to a manually com-
piled set of patterns in order to check the quality of the work piece or
to identify a person. Thus, the goal is to find one particular image. The
nearly constant recording environment and the well-defined set of target
patterns enable the development of specialised image processing opera-
tors resulting in high recognition rates.
Image databases managing large, general sets of images. They allow
searches for a number of images that are similar to a given sample image
or which satisfy user defined conditions. The main focus is to restrict the
large image set in the database to a few suitable images. Subsequently
the obtained results can be used to refine the initial search.

Pattern recognition systems have been used for a long time. Specialised
medical information systems were developed to evaluate images, as well as
manage, organise, and retrieve patient data. A medical database for comput-
ing and comparing the geometry and density of organs was already developed
in 1980 [HSS80]. Similar improvements happened in the field of remote sens-
ing. A generalisation of the procedures used, as well as the extension of the
application areas, required a specification of so called pictorial information
systems in the 1980s. A significant functional requirement was the image
content analysis and content-based image retrieval.
The importance of image databases has risen enormously in recent years. One of
the reasons is the spread of digital technology and multimedia applications
producing Petabytes of pictorial material per year.
The application areas are numerous. Document libraries offer their multi-
media stock world-wide. This is also true for art galleries, museums, research
institutions, photo agencies for publishing houses, press agencies, civil services,
etc. managing many current and archived images. Document imaging systems
are tools that digitise and insert paper documents in a computer-based data-
base. Further areas are trademark databases, facial recognition, textile and
fashion design, etc. Systems are created in combination with applied image
processing, in which the image database is only part of a more complex sys-
tem. Medical information systems, for example, manage ultra sound images,
x-ray exposures, and other medical images.
CAIRO, the image database presented here, combines standard methods
for image description and retrieval with efficient processing on a cluster
architecture. The data is distributed among several nodes and then
processed in parallel. The components necessary for this are:
• User interfaces.
• Algorithms for feature extraction.
• Relational database system for storing a priori extracted image attributes.
• Index structures to speed up the retrieval.
• Mechanisms for the parallel execution of retrieval operations consisting
of
- Transaction manager: sets the order of the commands to be executed
and balances the workload across the cluster.
- Distribution manager: combines the algorithms to be used with the
identifiers of the sample and the target images and sends these to the
nodes.
- Processing manager: initiates and controls the feature extraction and
the comparison at the individual nodes.
- Result manager: collects the partial results and determines the global
hits.
• Update manager: takes care of inserting new images in the database, the
computation of the a priori defined features, and the updating of the
index structures.
The functionality of the individual components is described more closely in the
following.

10.1 User Interface


The graphical user interface offers various tools for formulating database
queries as well as visualising the resulting hits. The integrated query modules
are:
• Browsing: the user can browse the image stock, beginning from a starting
set, until a suitable image for similarity search is found. This interface is
further used to visualise the results.
• SQL Interface: this module is a representation of the SQL interface used
by the relational database system.
• Query by example image / sketch is one of the most often used query
forms in the case of similarity search. The user can load an image similar
to the one looked for, or can create a new one by sketching it. A canvas
that can be worked on with drawing tools (dots, lines, polygons, text,
etc.) is available for this purpose.

• Image montage: the query image is composed of several image segments.
At least two areas are necessary: images are loaded from the database
into the first area, where they are processed and selections are determined.
These are then inserted into the second area to compose the query image.
• Feature determination: the user is supplied a survey of the existing fea-
tures and can choose a selection, adjust the parameters, and test the effect
on a standard set of images (relevance feedback).

Figure 10.1 displays the interface7 for query by example image or sketch
with the corresponding browser.

Fig. 10.1. Graphical user interface: sketching tools and browser for the retrieval
results

7 Online demo: www.in.tu-clausthal.de/cairo/



10.2 Relational Database System and Index Structures


A relational database system manages the technical, world-oriented and a
part of the information extracted from the images, as well as the corre-
sponding algorithms for feature determination and comparison. Details on
the image size, number of pixels, format, etc. belong to the first group. The
information on the size is vital for the image partitioning and for the reali-
sation of the dynamic distribution strategies, as the approximate processing
time of an operator is estimated from it.
A part of the features extracted a priori - such as histograms and wavelet
coefficients - is modelled with conventional database structures and stored in
the database. Other a priori extracted features are stored as BLOBs, so that
only the final storage position is referred to in the database. This holds true
for the raw data as well. These are also stored in a downscaled dimension as
thumbnails and are used for the visualising of the query results.
In addition to the image information, the existing procedures are managed by
the database, too. It is noted which procedures are available for which image
types, whether the features are extracted dynamically or a priori, the designator
of the corresponding programs, and - if a linear dependency exists - the
average processing time per 1000 pixels. Further, each operator is assigned a
comparison metric that can transform the results of the analysis into an image
ranking.
To accelerate the evaluation of a priori extracted features, different index
structures - like VP trees for colour moments - are usable. However, these
remain invisible to the user.

10.3 Features
To describe the image content, as well as for conducting an image comparison,
CAIRO offers a set of algorithms for feature extraction and comparison. There
are histograms, colour moments, format attributes, texture characteristics,
wavelet-based approaches, etc. A part of these features is extracted a priori
and stored in the index structures.
One of CAIRO's specialties is the support of dynamic feature extraction.
In this case, the user can manually select a certain region and use it as a
starting point for the search. Other regions of the query image and the object
background are not regarded, so that a detail search can be performed. But
this method requires the analysis of all image sections in the database and
produces an enormous processing overhead. The different approaches and the
results that are to be expected are introduced in the following.

A priori feature extraction. The state-of-the-art approach for the creation
and retrieval of image databases is based on the extraction and comparison
of a priori defined features. These can be combined and weighted in different
ways resulting in advanced features, which represent the image content on a
higher abstraction level. The similarity degree of a query image and the target
images is determined by calculation of a distance between the corresponding
features.
An example for this approach is given by the a priori extraction of wavelet
coefficients. Let I = {I_1, ..., I_n} be a set of images to be inserted in a cata-
logue. The main feature for the content description is a vector with the largest
wavelet coefficients. Therefore the wavelet transformation is applied to all im-
ages in I, resulting in a set of vectors w_{I_1}, ..., w_{I_n} with w_{I_j} = (c_{j1}, ..., c_{j64}).
At query time the user creates a sample sketch or loads an image, which
is subsequently processed in the same manner as the images in the database.
The wavelet transformation is applied to this image Q too, and the wavelet
coefficients w_Q = (c_{Q1}, ..., c_{Q64}) are determined. Subsequently the distances
between the vector of the query image and the vectors of all images in the
database are calculated. Each of these results gives an indication about the
similarity of the compared images. The images with the smallest difference
are the most similar images and the corresponding raw data is sent to the
user interface for visualisation. The extraction algorithms for the wavelet
coefficients, which are applied to the sample image, as well as the similarity
functions are embedded in an SQL command sequence and executed using the
available mechanisms of the relational database system. Further algorithms
can also be included and invoked as a user-defined function.
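The retrieval step just described reduces to a nearest-neighbour search over the stored coefficient vectors. The sketch below assumes the (here heavily truncated) coefficient vectors are already extracted and uses a plain Euclidean distance, which stands in for whatever similarity function is configured in the database.

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_coefficients(query_coeffs, catalogue, k=3):
    """catalogue: mapping image id -> a priori extracted coefficient vector.
    Returns the k images with the smallest distance to the query vector."""
    ranked = sorted(catalogue.items(),
                    key=lambda item: euclidean(query_coeffs, item[1]))
    return ranked[:k]

catalogue = {
    "Beach0001":  [0.82, 0.11, 0.05, 0.02],   # truncated illustrative vectors
    "Beach0007":  [0.78, 0.15, 0.04, 0.03],
    "Forest1734": [0.10, 0.70, 0.15, 0.05],
    "Canyon3455": [0.20, 0.10, 0.60, 0.10],
}
print(rank_by_coefficients([0.80, 0.12, 0.05, 0.03], catalogue))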
However, with this approach only entire images are compared with each
other, thus global features such as dominant colours, shapes or textures de-
fine the similarity. For example a query with an image showing a person on
a beach results in a set of images with beach scenes or images with large yel-
low and blue coloured objects. Images containing the same person in other
environments - for example canyon or forest - are sorted at the end of the
ranking. Figure 10.2 shows an example of such a query image and the results
obtained with a priori extracted features.
Acceptable system response times are achieved, because no further process-
ing of the image raw data is necessary during the retrieval process, resulting
in an immense reduction of computing time. The straightforward integration into
existing database systems is a further advantage of this approach.
Extraction of simple features, however, results in a disadvantageous reduction of the
image content. Important details like objects, topological information, etc. are
not sufficiently considered in the retrieval process, thus a precise detail search
is not possible. Furthermore, it is not clear, whether the known, relatively
simple features can be correctly combined for the retrieval of all kinds of
images.

Dynamic feature extraction. Image retrieval with dynamically extracted
features - dynamic retrieval for short - is the process of analysis, extraction, and
description of any manually selected image elements, which are subsequently
compared to all image sections in the database.
Fig. 10.2. Query image and results retrieved with a priori extracted features
(ranking: 1. Beach0001, 2. Beach0007, ..., n-1. Forest1734, n. Canyon3455)

An example for this operation is given by the template matching approach
as described in Section 6.2. The region of interest is represented by a min-
imal bounding rectangle and subsequently correlated with all images in the
database. Distortions caused by rotation and deviations regarding the size,
colours, etc. are thereby considered. In contrast to the previous example in
Fig. 10.2 the person looked for is found in different environments. The im-
ages showing beach scenes without the person are not considered at all. The
retrieved images are depicted in Fig. 10.3.
The dynamic feature extraction increases the computational complexity
for the query processing significantly. CAIRO therefore exploits the natural
parallelism provided by shared nothing architectures using a cluster platform.
These parallel architectures have the advantage that each node has its own
I/O subsystem and the transfer effort is shared by a number of nodes. More-
over, the reasonable price per node enables the creation of large-scale systems.
Open problems concern workload balancing, synchronisation and data distri-
bution as well as general problems like missing single system image and the
large maintenance effort.

10.4 CAIRO Architecture

PFISTER [Pfi98] defines a cluster as a parallel or distributed system consisting
of a collection of interconnected stand-alone computers and used as a single,
unified computing resource. The best-known cluster platform is Beowulf,
a trivially reproducible multi-computer architecture built using commodity
software and hardware components. A master node controls the whole cluster
and serves files to the client nodes. It is also the cluster's console and gateway
to the outside world [SS99].

Fig. 10.3. Query image and results retrieved with dynamically extracted features
(ranking: 1. Forest1734, 2. Canyon3455, ..., n-1. Beach0001, n. Beach0007)
Clusters of symmetric multiprocessors - so called CLUMPs - combine
the advantages and disadvantages of two parallel paradigms: an easily pro-
grammable Symmetric Multiprocessing (SMP) model with the scalability and
data distribution over many nodes of the architectures with distributed mem-
ory. A number of well-constructed parallel image operators, which were de-
veloped and tested for the SMP model, are available. These can be used for
the image analysis on each node. The multiple nodes share the transfer effort
and eliminate the bottleneck between the memory and the I/O subsystem.
Disadvantages result from the time-consuming message passing communica-
tion, which is necessary for workload distribution and synchronisation. The
proposed image partitioning, however, minimises the communication between
the nodes and enables the use of the PEs to nearly full capacity. Based on
their functionality, the nodes are subdivided into three classes:

• Query stations host the web-based user interfaces for the access to the
database and visualisation of the retrieval results.
• Master node controls the cluster, receives the query requests and broad-
casts the algorithms, search parameters, the sample image, features, etc.
to the computing nodes. Furthermore, it acts as a redundant storage
server and contains all images in the database. It unifies the intermediate
results of the compute nodes and produces the final list with the k best
hits.
• Compute nodes perform the image processing and comparisons. Each
of these nodes contains a disjoint subset of the existing images and
executes all operations with the data stored on the local devices. The
computed intermediate results are sent to the master node.

Figure 10.4 shows a schematic of the cluster architecture.

Fig. 10.4. Schematic of the CAIRO cluster architecture (master node: cluster control,
a priori feature extraction, etc.; slave nodes: dynamic feature extraction)

10.5 Partitioning the Image Set


The distribution of the image set across the individual cluster nodes is decisive
for the retrieval efficiency. The requirements are as follows:

• Similar storage sizes of the partitions and thus an even distribution of
the images across the individual nodes,
• Computation reduction for the image retrieval, and
• Minimising the communication between the cluster nodes.

A partition can consist of multiple image classes, the elements of which dif-
fer significantly from other partitions. On the other hand, the images should
be characterisable by a shared feature, like landscape images or portraits.

The discussion of existing features for image classification in the pre-
vious section shows that a reliable, content-based partitioning of the images
into independent subsets is currently not realisable. This is especially the case
when a general image stock is used. An unsuitable assignment can lead to
some images becoming unfindable, since they are not even considered during
the corresponding queries.
This is the reason why the initial partitioning of the image set B uses
the content-independent, size-based strategy, which leads to a set of partitions
P = {P_1, P_2, ..., P_n} with the following characteristics:

    \forall P_i, P_j \subset B: P_i \cap P_j = \emptyset,   i, j = 1, \ldots, n,  i \neq j          (27)
    size(P_i) \approx size(P_j),   i, j = 1, \ldots, n.

The processing of a partition P_i = {b_{i1}, b_{i2}, \ldots, b_{in_i}} with an operator p
is executed per image, i.e.

    p(P_i) = \{ p(b_{i1}), p(b_{i2}), \ldots, p(b_{in_i}) \}.          (28)

The individual operations are independent of one another, so the order
of execution is irrelevant. This initial partitioning makes it possible for all
nodes to have uniform processing times, assuming a homogeneous, dedicated
execution platform, if a query needs to analyse all images in the database.
The management overhead depends on the operator used and the structure
of the partial results. This time is usually negligible compared to the image
processing times.

10.6 Parallel Execution of the Retrieval Operations


The distribution of the data across a number of nodes makes it possible
to parallelise the retrievals by executing the same operations on all nodes
while only considering the local image subset. The transaction, distribution,
computation, and result manager components are necessary to implement
this approach. They are based on the well-known parallel libraries PVM and
MPI [PVM,MPI], which are used for distributed and parallel computations in
a network of workstations.

Transaction manager. The functionality of the transaction manager en-
compasses the analysis of the transformations to be executed and the determination
of the order of the operations. In contrast to a conventional database management
system, the data is usually only read, so that no read and write conflicts need
to be resolved.
The order of operations should be set in a way that the time for the
processing and the presenting of the system response is minimised, and all
suitable images have been considered. The transaction manager is not in-
voked if only a priori or only dynamically extracted features exist. But a
query usually consists of a combination of a priori and dynamically extracted
features, so that three basic approaches are possible:
1. The a priori extracted features are evaluated in the first phase, and a list
of all potential hits is constructed. This list is forwarded - together with
the algorithms for the dynamic extraction of features - to the distribution
manager, which causes the procedures to be only applied on these images.
2. Inverting the order of operation of the first case (1) leads to the case
where the list of potential hits is determined according to the dynamically
extracted features, which is then further narrowed down by considering
a priori extracted features.
3. Both processing streams can initially be regarded as independent of each
other and be executed in parallel. The resulting intermediate lists are
transformed into a final hit list by a comparison process.
Each of these possibilities has certain advantages and disadvantages re-
garding speed of execution and precision. Combining a priori and dynamically
extracted features limits the set of images that have to be processed dy-
namically and enables the fastest system response time. On the other hand,
suitable images can be removed from the result set by imprecise comparisons
with the a priori extracted features, and are not considered anymore in the
second step. This disadvantage is eliminated in the other two approaches,
but the processing time necessary clearly grows, as every image needs to be
analysed for each query.
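The first approach, a priori filtering followed by dynamic refinement, can be sketched as a two-phase pipeline; the scoring functions below are mere placeholders for the actual extraction and comparison procedures, and the candidate limit is an invented parameter.

def retrieve(query, images, apriori_score, dynamic_score, candidates=100, k=10):
    """Approach 1: evaluate the a priori extracted features first, keep only the
    best candidates, then apply the expensive dynamic feature extraction to
    this reduced set and produce the final ranking (lower score = more similar)."""
    # Phase 1: cheap comparison on a priori extracted features.
    prefiltered = sorted(images, key=lambda img: apriori_score(query, img))[:candidates]
    # Phase 2: expensive, dynamically extracted features on the candidates only.
    return sorted(prefiltered, key=lambda img: dynamic_score(query, img))[:k]

# Placeholder scoring functions for demonstration only.
apriori = lambda q, img: abs(hash(img) % 97 - hash(q) % 97)
dynamic = lambda q, img: abs(hash(img[::-1]) % 97 - hash(q[::-1]) % 97)
print(retrieve("sample.jpg", [f"img{i}.jpg" for i in range(1000)],
               apriori, dynamic, candidates=50, k=5))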
The transaction manager also controls the module for dynamic re-distri-
bution of images across the nodes. If only a selection of images needs to be
processed, the list is handed to the scheduler, which returns a re-distribution
plan. This is the foundation from which the transaction manager creates the
execution lists for each node.

Distribution manager. This component receives a list of the extraction
and comparison algorithms to be executed, as well as a set of images, as input.
The MCP module (Master Computing Program) analyses these lists and
generates the program calls for the image analysis and comparison. They are
composed according to the PVM and MPI syntax and are sent, via the
communication routines of the active virtual machine, to all nodes
that store a part of the images to be analysed on their local storage devices.

Computing manager. The computing manager controls the execution of
the extraction algorithms with the local data. This process runs on each
cluster node and supervises the communication with the master node. As
soon as the program and parameter transfer is completed, the computation
is initialised by the SCP module (Slave Computing Program). The end of
an SCP process is indicated to the MCP by a signal. The result manager is
initialised when these signals have been received for all SCPs launched. A graphic
representation of this schedule can be seen in Fig. 10.5.
representation of this schedule can be seen in Fig. 10.5.

Relational
database

Fig. 10.5. Schedule for the parallel execution of the retrieval operations in a cluster
architecture

Result manager. The partitioning of the image data into disjoint sets
results in each node composing a ranking of hits that needs to be unified by
the result manager in the next step. All features have to be visible to this
component. A large communication overhead is generated if the raw data
needs to be compared as well, drastically reducing the advantages of the
parallelisation.
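The unification performed by the result manager essentially merges per-node rankings into a global top-k list; the sketch below assumes each compute node returns (distance, image id) pairs and that only these pairs, not the raw image data, have to be combined.

import heapq

def unify_partial_results(partial_rankings, k=5):
    """partial_rankings: one list of (distance, image_id) pairs per compute node.
    Returns the k globally best hits (smallest distances)."""
    merged = heapq.merge(*[sorted(r) for r in partial_rankings])
    return [image_id for _, image_id in heapq.nsmallest(k, merged)]

node_results = [
    [(0.12, "Beach0001"), (0.40, "Canyon3455")],
    [(0.15, "Beach0007"), (0.55, "Forest1734")],
    [(0.30, "CityAthens"), (0.33, "Beach0099")],
]
print(unify_partial_results(node_results, k=4))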

10.7 Update manager

This component realises the insertion of images into the database via a web-
based interface. First, the raw image data is transformed into a uniform format
and tagged with a unique identifier. All existing procedures for a priori
feature extraction are then applied to this image. Furthermore, the technical
and, if existent, world-oriented data is determined and extended by a set of
user-defined keywords. All information is composed in a given data structure
and stored in the relational database.

The next phase determines the cluster node, on whose hard disk the raw
image data is to be stored. In the case of an even data distribution, the image
data is sent to the node with the smallest data volume. If larger images are used,
it may be necessary to re-distribute the data to achieve a balanced storage load.
The exact image position is stored in the data structure, and the
image is sent to the corresponding node. Any existing index structures
are updated in the last phase.

11 Conclusions
The development of Internet technology enables online access to a
huge set of digital information, which is represented by different multimedia
objects such as images, audio and video sequences, etc. Thus, the Internet can
be considered as a general digital library offering a comprehensive knowledge
collection distributed over millions of independent nodes. Thereby, an urgent
need for the organisation, management, and retrieval of multimedia infor-
mation arises. Large memory, bandwidth, and computational requirements
of the multimedia data often surpass the capabilities of traditional database
systems and architectures. The performance bottlenecks can be avoided, for
example, by partitioning the data over multiple nodes and by creating a
configuration supporting parallel storage and processing.
The chapter gives an overview of the different techniques, and their in-
teroperability, necessary for the realisation of distributed multimedia database
systems. Thereby, existing data models, algorithms, and structures for mul-
timedia retrieval are presented and explained. The parallel and distributed
processing of multimedia data is depicted in greater detail by considering an
image database as an example. The main attention is given to the parti-
tioning and the distribution of the multimedia data over the available nodes,
as these methods have a major impact on the speedup and the efficiency
of the parallel and distributed multimedia databases. Moreover, different ap-
proaches for the parallel execution of retrieval operations for multimedia data
are shown. The chapter is closed by a case study of a cluster-based prototype
for image retrieval.

References
[172] ISO/IEC 11172-1, Information technology - coding of moving pic-
tures and associated audio for digital storage media at up to about
1,5 Mbit/s, Part 1-3: Systems, Video, Compliance testing, 1993.
[818] ISO/IEC 13818, Information technology - generic coding of moving
pictures and associated audio information, Part 1-3, 1995.
[ABF+95] Ashley, J., Barber, R., Flickner, M., Hafner, J., Lee, D., Niblack,
W., Petkovic, D., Automatic and semi-automatic methods for image
annotation and retrieval in QBIC, Proc. Storage and Retrieval for
Image and Video Databases III, 1995, 24-35.
[AG89] Almasi, G.S., Gottlieb, A., Highly parallel computing,
Benjamin/Cummings, Redwood City, CA, 1989.
[AH96] Asendorf, G., Hermes, T., On textures: an approach for a new abstract
description language, Proc. IS&T/SPIE's Symposium on Electronic
Imaging 96, 1996, 98-106.
[Ape88] Apers, P., Data allocation in distributed database systems, ACM
Transactions on Database Systems, 1988, 263-304.
[AW98] Abdelguerfi, M., Wong, K-F., Parallel database techniques, IEEE
Computer Society Press, 1998.
[Ber00] Berthold, H., A federated multimedia database system, Proc. VII.
Conference on Extending Database Technology (EDBT 2000), PhD
Workshop, 2000, 70-73.
[BFD97] Bolosky, W.J., Fitzgerald, R.P., Douceur, J.R., Distributed schedule
management in the Tiger video fileserver, Proc. 16th ACM Sympo-
sium on Operating Systems Principles, 1997, 212-223.
[BG98] Bell, D., Grimson, J., Distributed database systems, Addison Wesley,
1998.
[Blo95] The Bloor Research Group, Parallel database technology: an evalua-
tion and comparison of scalable systems, Bloor Research, 1995.
[CHC+00] Choi, S.-Y., Han, J.-H., Choi, H.-H., Yoo, K.-J., A striping technique
for extension of parallel VOD-servers, Proc. International Confer-
ence on Parallel and Distributed Processing Technique and Applica-
tion (PDPTA 2000), IEEE Society Press, 2000, 1331-1338.
[Chi94a] Chiueh, T., Content-based image indexing, Proc. 20th VLDB Con-
ference, 1994, 582-593.
[Chi94b] Chiueh, T., Content-based image indexing, Technical report ECSL
TR-7, Computer Science Department, State University of New York,
Stony Brook, 1994.
[CKT+93] Chakravarthy, S., Krishnaprasad, V., Tamizuddin, Z., Lambay, F., A
federated multi-media DBMS for medical research: architecture and
functionality, Technical Report UF-CIS-TR-93-006, Department of
Computer and Information Sciences, University of Florida, 1993.
[Cla95] Clarke, R.J., Digital compression of still images and video, Academic
Press, London, San Diego, 1995.
[CSG99] Culler, D.E., Pal Singh, J., Gupta, A., Parallel computer architecture:
a hardware/software approach, Morgan Kaufmann Publishers, 1999.
[DeP00] DePietro, G., Multimedia applications for parallel and distributed
systems, J. Blazewicz, K. Ecker, B. Plateau, D. Trystram (eds.),
Handbook on parallel and distributed processing, Springer-Verlag,
Berlin, 2000, 552-625.
[DK01] Drews, F., Kao, O., Randomised block size scheduling strategy
for cluster-based image databases, Proc. International Conference
on Parallel and Distributed Processing Techniques and Applications
(PDPTA 2001), 2001, 2116-2122.
[DN91] Davies, N.A., Nicol, J.R., Technological perspective on multimedia
computing, Computer Communications 14, 1991, 260-272.
[FJK01] Falkemeier, G., Joubert, G.R., Kao, O., Internet supported analy-
sis and presentation of MPEG compressed newsfeeds, International
Journal of Computers and Applications 23, 2001, 129-136.

[FSN+95] Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom,
B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Query by image and
video content: the QBIC system, IEEE Computer 28, 1995, 23-32.
[FTA+00] Ferhatosmanoglu, H., Tuncel, E., Agrawal, D., Abbadi, A.E., Vector
approximation based indexing for non-uniform high dimensional data
sets, Proc. 2000 ACM CIKM International Conference on Informa-
tion and Knowledge Management, 2000, 202-209.
[Fur99] Furth, B., Handbook of internet and multimedia systems and applica-
tions, CRC Press, 1999.
[GJK+00] Gaus, M., Joubert, G.R., Kao, O., Riedel, S., Stapel, S., Distributed
high-speed computing of multimedia data, E. D'Hollander, G.R. Jou-
bert, F.J. Peters, H.J. Sips (eds.), Parallel computing: fundamentals
and applications, Imperial College Press, 2000, 510-517.
[GJM97] Grosky, W.I., Jain, R., Mehrota, R., The handbook of multimedia
information management, Prentice Hall, 1997.
[GMOO] Golubchik, L., Muntz, R.R., Parallel database servers and multime-
dia object servers, J. Blazewicz, K Ecker, B. Plateau, D. Trystram
(eds.) , Handbook on parallel and distributed processing, Springer-
Verlag, Berlin, 2000, 364-409.
[Gob97] Goble, C., Image database prototypes, W.I. Grosky, R. Jain, R.
Mehrota (eds.), The handbook of multimedia information manage-
ment, Prentice Hall, 1997, 365-404.
[G0I92] Goble, C.A., O'Doherty, M.H., Ireton, P.J., The Manchester multi-
media information system, Proc. 3rd International Conference on Ex-
tending Database Technology, Springer-Verlag, Berlin, 1992,39-55.
[Gon98] Gong, Y., Intelligent image databases: towards advanced image re-
trieval, Kluwer Academic Publishers, 1998.
[Gro94] Grosky, W.I., Multimedia information systems, IEEE Multimedia 1,
1994, 12-24.
[GRV96] Gudivada, V.N., Raghavan, V.V., Vanapipat, K, A unified approach
to data modelling and retrieval for a class of image database appli-
cations, V.S. Subrahmanian, S. Jajodia (eds.), Multimedia database
systems: issues and research directions, Springer-Verlag, Berlin, 1996,
37-78.
[GS98] Griwodz, C., Steinmetz, R., Media servers, Technical Report TR-
KOM-19998-08, TU Darmstadt, 1998.
[GWJ91] Gupta, A., Weymouth, T., Jain, R., Semantic queries with pictures:
the VIMSYS model, Proc. 17th Conference Very Large Databases,
1991, 69-79.
[HD90] Hsiao, H.-I., DeWitt, D.J., A new availability strategy for multipro-
cessor database machines, Proc. International Conference on Data
Engineering (ICDE 1990), 1990,456-465.
[HKK+95] Hermes, T., Klauck, C., Kreyss, J., Zhang, J., Image retrieval
for information systems, Storage and retrieval for image and video
databases III 2420, SPIE, 1995, 394-405.
[HL90] Hua, K, Lee, C., An adaptive data placement scheme for parallel
database computer systems, Proc. 16th Conference on Very Large
Databases, 1990, 493-506.
[HPN97] Haskell, B.G., Puri, B.G., Netravali, A.N., Digital video: an introduc-
tion to MPEG-2, Chapman & Hall, New York, NY, 1997.
362 O. Kao

[HSS80] Huang, H.K, Shiu, M., Suarez, F.R., Anatomical cross-sectional ge-
ometry and density distribution database, S.K Chang, KS. Fu (eds.),
Pictorial information systems, Springer-Verlag, Berlin, 1980, 351-
367.
[Huf52] Huffman, D.A., A method for the construction of minimum redun-
dancy codes, Proc. Institute of Radio Engineers (IRE) 40, 1952,
1098-1101.
[IEW92] Ibiza-Espiga, M.B., Williams, M.H., Data placement strategy for a
parallel database system, Proc. Database and Expert Systems Appli-
cations, Springer-Verlag, Berlin, 1992, 48-54.
(Jae91] Jaehne, B., Digital image processing - concepts, algorithms and sci-
entific applications, Springer-Verlag, Berlin, 1991.
(JFS95] Jacobs, C.-E., Finkelstein, A., Salesin, D.-H., Fast multiresolution
image querying, Proc. ACM Siggraph 95, Springer-Verlag, 1995, 277-
286.
[JMC95] Jain, R., Murthy, S.N.J., Chen, P.L-J., Similarity measures for image
databases, Proc. Storage and Retrieval for Image and Video Databases
III 2420, 1995, 58-65.
(JTC99] ISO/IEC JTC1 / SC29 / WGll / N2725, MPEG-4 overview, 1999,
Web site: www.cselt.stet.it/mpeg/standards/mpeg-4/mpeg-4.htm.
[KA97] Klas, W., Aberer, K, Multimedia and its impact on database system
architectures, P.M.G. Apers, H.M. Blanken, M.A.W. Houtsma (eds.),
Multimedia Databases in Perspective, Springer-Verlag, Berlin, 1997,
31-62.
[Kat92] Kato, T., Database architecture for content-based image retrieval,
Proc. Storage and Retrieval for Image and Video Databases III 1662,
SPIE, 1992, 112-123.
(KB96] Khoshafian, S., Baker, A.B., Multimedia and imaging databases, Mor-
gan Kaufmann Publishers, 1996.
[Knu73] Knuth, D.E., The art of computer programming, Addison Wesley,
Reading, MA, 1973.
[KSD01] Kao, 0., Steinert, G., Drews, F., Scheduling aspects for image re-
trieval in cluster-based image databases, Proc. IEEE/ACM Inter-
national Symposium on Cluster Computing and the Grid (CCGrid
2001), IEEE Society Press, 2001, 329-336.
[KTOO] Kao, 0., La Tendresse, I., CLIMS - a system for image retrieval by
using colour and wavelet features, T. Yakhno (ed.), Advances in infor-
mation systems, Lecture Notes in Computer Science 1909, Springer-
Verlag, Berlin, 2000, 238-248.
[Lee98] Lee, J., Parallel video servers, IEEE Transactions on Multimedia 5,
1998, 20-28.
[LZ95] Liu, H.-C., Zick, G.L., Scene decomposition of mpeg compressed
video, A.A. Rodriguez, R.J. Safranek, E.J. Delp (eds.), Digital Video
Compression: Algorithms and Technologies, vol. 2419, SPIE - The
International Society for Optical Engineering Proceedings, 1995, 26-
37.
[MG95] Mehrotra, R., Gary, J.E., Similar-shape retrieval in shape data man-
agement, IEEE Computer 28, 1995, 57-62.
[MPE98] MPEG Requirement Group, MPEG7 requirements document, ISO /
MPEG N2462, 1998.
7. Parallel and Distributed Multimedia Database Systems 363

[MPI] Message Passing Interface (MPI) project,


http://www.mpi-forum.org/.
[MS96] Marcus, S., Subrahmanian, V.S., Towards a theory of multimedia
database systems, V.S. Subrahmanian, S. Jajodia (eds.) , Multime-
dia database systems: issues and research directions, Springer-Verlag,
Berlin, 1996, 1-36.
[NBE+93] Niblack, W., Barber, R., Equitz, W., Flickner, M., Glasman, E.,
Petkovic, D., Yanker, P., The QBIC project: querying images by con-
tent using color, texture and shape, Proc. Stomge and Retrieval for
Image and Video Databases I, 1993, 1-36.
[NZT96] Norman, M.G., Zurek, T., Thanisch, P., Much ado about shared-
nothing, SIGMOD Record 25, 1996, 16-21.
[OS95] Ogle, V.E., Stonebraker, M., Chabot: Retrieval from a relational da-
tabase of images, IEEE Computer Magazine 28, 1995, 40-48.
[PB99] Petrou, M., Bosdogianni, P., Image processing: the fundamentals,
John Wiley and Sons, 1999.
[Pfi98] Pfister, G.F., In search of clusters, Prentice Hall, 1998.
[Pra91] Pratt, W.K., Digital image processing, John Wiley and Sons, Inc.,
New York, 1991.
[PVM] Parallel Virtual Machine (PVM) project,
http://www.epm.ornl.gov /pvm/.
[RCOO] Rajendran, R.K., Chang, S.-F., Image retrieval with sketches and
compositions, Proc. IEEE International Conference on Multimedia,
IEEE Society Press, 2000, 717-721.
[Reu99] Reuter, A., Methods for parallel execution of complex database
queries, Journal of Pamllel Computing 25, 1999, 2177-2188.
[Rob81] Robinson, J.T., The K-D-B-Trees: a search structure for large multi-
dimensional dynamic indexes, Proc. 1981 ACM SIGMOD Conference
on Management of Data, ACM Press, 1981, 10-18.
[SaI89] Salton, G., Automatic text processing: the tmnsformation, analysis
and retrieval of information by computer, Addison-Wesley, Reading,
1989.
[SAM96] Shneier, M., Abdel-Mottaleb, M., Exploiting the JPEG compression
scheme for image retrieval, IEEE Transactions on Pattern Matching
and Machine Intelligence 18, 1996, 849-853.
[SC97] Smith, J.R., Chang, S.-F., Visually searching the Web for content,
IEEE Multimedia 4, 1997, 12-20.
[SD95] Shen, K., Delp, E., A fast algorithm for video parsing using mpeg
compressed sequences, Proc. International Conference on Image Pro-
cessing, IEEE Computer Society Press, 1995, 252-255.
[SH99] Sable, C.L., Hatzivasseiloglou, V., Text-based approaches for the cat-
egorization of images, S. Abiteboul, A.-M. Vercoustre (eds.), Research
and advanced technology for digital libmries, Lecture Notes in Com-
puter Science 1696 , Springer-Verlag, Berlin, 1999, 19-38.
[SJOO] Santini, S., Jain, R., Integrated browsing and querying for image
databases, IEEE Multimedia 7, 2000, 26-39.
[SM98] Society of Motion Picture and Television Engineers, Anno-
tated glossary of essential terms for electronic production,
http://www.smpte.org/, 1998.
364 O. Kao

[S095] Stricker, M., Orengo, M., Similarity of color images, Storage and re-
trieval for image and video databases III, 1995, 381-392.
[SP98] Szummer, M., Picard, R.W., Indoor-outdoor image classification,
IEEE Workshop on Content Based Access of Image and Video
Databases (CAVID-98), IEEE Society Press, 1998, 42-51.
[SS99] Savarese, D.F., Sterling, T., Beowulf, R. Buyya (ed.), High perfor-
mance cluster computing - architectures and systems, Prentice Hall,
1999, 625-645.
[SSU94] Sakamoto, H., Suzuki, H., Uemori, A., Flexible montage retrieval for
image data, Storage and Retrieval for Image and Video Databases II,
1994, 25-33.
[ST96] Stonebaker, M., Moore, D., Object-relational DBMSs - the next wave,
Morgan Kaufmann, 1996.
[SteOO] Steinmetz, R., Multimedia technology, Springer-Verlag, Berlin, 2000.
[Swe97] Sweet, W., Chiariglione and the birth of MPEG, IEEE Spectrum,
1997, 70-77.
[Tve77] Tversky, A., Features of similarity, Psychological Review 84, 1977,
327-352.
[VJZ98] Vailaya, A., Jain, A., Zhang, H.J., On image classification: city vs.
landscape, Proc. IEEE Workshop on Content-Based Access of Image
and Video Libraries, IEEE Computer Society Press, 1998, 3-8.
[WSB98] Weber, R., Schek, H., Blott, S., A quantitative analysis and per-
formance study for similarity-search methods in high-dimensional
spaces, Proc. International Conference on Very Large Data Bases,
1998, 194-205.
[WHH+99] Wen, X., Huffmire, T.D., Hu, H.H., Finkelstein, A., Wavelet-based
video indexing and querying, Journal of Multimedia Systems 7, 1999,
350-358.
[WJ96] White, D.A., Jain, R., Similarity indexing with the SS-tree, Proc.
12th International Conference on Data Engineering, IEEE Computer
Society Press, 1996, 516-523.
(WNM+95] Wu, J.K., Narasimhalu, A.D., Mehtre, B.M., CORE: a content-based
retrieval engine for multimedia information systems, ACM Multime-
dia Systems 3, 1995, 25-41.
[WZ98] Williams, M.H., Zhou, S., Data placement in parallel database sys-
tems, M. Abdelguerfi, K.-F. Wong (eds.), Parallel Database Tech-
niques, IEEE Computer Society Press, 1998, 203-219.
[YL95] Yeo, B.-L., Liu, B., Rapid scene analysis on compressed video, IEEE
Transactions on circuits and systems for video technology 5, 1995,
533-544.
8. Workflow Technology: the Support
for Collaboration

Dimitrios Georgakopoulos, Andrzej Cichocki, and Marek Rusinkiewicz

Telcordia Technologies, Austin, Texas, USA

1. Introduction ..................................................... 367


2. Application Scenario and Collaboration Requirements ............ 368
2.1 Dynamic Team, Workspace, and Process Creation............ 369
2.2 Coordination................................................. 369
2.3 Content and Application Sharing............................. 370
2.4 Awareness ................................................... 370
3. Commercial Technologies Addressing Collaboration Requirements 371
4. Evaluation of Current Workflow Management Technology........ 372
4.1 Workflow Management Technology........................... 372
4.2 Workflow Process Model ..................................... 373
4.3 Control Flow and Dataflow ................................... 376
4.4 Roles ........................................................ 377
4.5 Workflow Process Definition Tools .......................... 377
4.6 Analysis, Simulation, and Animation Tools................... 377
4.7 Workflow Monitoring and Tracking Tools ..................... 378
4.8 Basic WfMS Infrastructure: Architecture, GUIs, and APIs .... 378
4.9 Advanced WfMS Infrastructure: Distribution and Scalability. 379
4.10 Interoperability Among WfMSs and Heterogeneous Applications 380
4.11 Concurrency Control, Recovery, and Advanced Transactions ... 380
5. Research Problems, Related Work, and Directions................ 381
5.1 Management of Semi-Structured Processes ................... 381
5.2 Awareness ................................................... 382
5.3 Just in Time Supply Chain Management for Collaboration
Products ..................................................... 383
6. Summary ........................................................ 383

Abstract. Collaboration takes place whenever humans and/or computer applications work together to accomplish a common goal or compatible goals. For the last
two decades, many organizations and individuals have considered electronic col-
laboration of distributed teams to be the means of achieving higher productivity
and improving the quality of their products. The various collaboration technologies
introduced over the years managed to improve electronic communication, coordination, and awareness; however, comprehensive solutions that fully realize the
promises of electronic collaboration have remained an elusive goal.
In this chapter we will discuss one of the collaboration technologies: workflow
technology. We will present the main concepts and evaluate the existing commercial
products in the context of an advanced application that tests the strength of the state-
of-the-art solutions. We will conclude this chapter with a discussion of unsolved
problems and research directions.

1 Introduction

Collaboration takes place whenever humans and/or computer applications work together to accomplish a common goal or compatible goals. For the last
two decades, many organizations and individuals have considered electronic
collaboration of distributed teams the means of achieving higher productiv-
ity and improving the quality of their products. However, while the various
collaboration technologies introduced over the years managed to improve
electronic communication, coordination, and awareness, comprehensive solu-
tions that fully realize the promises of electronic collaboration have remained
an elusive goal.
Today, the space of collaboration solutions is fragmented, with features
supporting various aspects of collaboration distributed among workflow,
groupware, and content management tools. Unfortunately, as of now none of
these technologies provides a complete solution by itself. Furthermore, since
the majority of the provided tools are general purpose, it may be necessary
to develop application-specific tools and user interfaces.
The support offered by current technologies for scalability, and therefore
the size of distributed electronic teams, varies significantly. In particular,
many groupware tools that support joint activities are only appropriate for
small groups (i.e., teams of less than 10 people). On the other hand, tech-
nologies that scale well, i.e., systems for content and workflow management,
lack essential groupware tools.
Therefore, developing a collaboration solution that scales to an entire
organization and offers the appropriate collaboration tools will very likely
involve significant effort to integrate best-of-class components.
In this chapter, we present the workflow technology, and attempt to map
the requirements of advanced applications (such as intelligence gathering)
to the capabilities provided by it. In addition, we identify gaps (i.e., cur-
rently unsupported requirements) and propose areas that would benefit from
additional research.
This chapter is organized as follows: in Section 2 we present a hypothet-
ical scenario from the intelligence gathering domain. In addition we identify
essential requirements for distributed electronic team collaboration. In Sec-
tion 3 we introduce the technologies that address some of our collaboration
requirements. Section 4 presents an evaluation of commercial workflow man-
agement technology. The collaboration problems that require further research
and corresponding related work and research directions are discussed in Sec-
tion 5.

2 Application Scenario and Collaboration Requirements

To outline the key requirements for technologies supporting distributed electronic team collaboration, in this section we discuss a hypothetical collabo-
ration scenario involving such teams.
Consider a team of intelligence analysts responsible for studying various
aspects of a country, e.g., political, economic, or military. Such analysts typ-
ically participate in scheduled projects that have production milestones. In
addition, these analysts may be commissioned to participate in unscheduled
(or ad hoc) projects when a crisis occurs.
Analysts participating in scheduled projects follow the same information
gathering process each day:

• collect information from classified and unclassified information sources to find something noteworthy to report,
• call one or more meetings (if necessary) to consult with other members
in their team or other analysts and external experts,
• delegate collection for specific or most recent information if needed,
• create a draft product.

The product is either a single report released at one time, or report parts
released as the analysis unfolds. Frequently the product is electronically co-
authored across multiple business areas and different intelligence gathering
organizations that reside in different locations.
Next, the product undergoes multiple levels of electronic reviews before it
is electronically published. The product review process involves the following
electronic review activities:

• internal and external peer review,
• supervisor review,
• upper management review.

When a disaster occurs (e.g., a terrorist attack involving chemical or biological weapons), all emergency management, law enforcement, and intelli-
gence sources collaborate to respond to the event. In such an event the right
people would form an integrated crisis mitigation team consisting of members
that are often not in the same location and time zone. Collaboration in these
teams is distributed and electronic.
Crisis mitigation processes will typically also involve practiced responses
that are dynamically tailored for the situation.
We discuss the requirements of this scenario for collaboration technology
in the following sections.

2.1 Dynamic Team, Workspace, and Process Creation

Dynamic team creation (and change) is required in both scheduled and ad hoc projects. In particular, in scheduled projects dynamic team creation is
necessary to:

• create subteams, e.g., to perform a draft report review or to consult an analyst,
• add new team members as needed and release those whose expertise is no longer needed,
• reassign team members to new team roles and activities/tasks as the
project evolves.

In crisis situations dynamic team creation is a fundamental requirement, since crisis mitigation supervisors must be able to dynamically create task
force teams and assign individuals to them. For example, in a terrorist attack involving biological weapons a supervisor may need to form teams to gather
field samples, perform lab tests to determine the presence of an agent, and
identify the terrorist group involved. In an oil supply crisis, other teams with
different objectives and membership may be needed.
Teams create and manage information content dynamically. Workspaces
provide the tools for team communication, creating and sharing content,
recording discussion data, and sharing applications. Workspaces must be
created whenever a new project is created, and must change as team needs
change.
Dynamic process extension is required to introduce new activities after
a process has been started. For example, a team supervisor may decide to
extend and refine the product review process by including a group activity
that is performed by multiple team members who discuss and merge draft
reports from different analysts, and create a team draft report.

2.2 Coordination

To illustrate the coordination requirements of information gathering, consider two processes: the information gathering process and the product review process. These processes are semi-structured, since some activities are always
required (i.e., they are prescribed by the process to meet the project mile-
stones), while others are optional (i.e., the need for them depends on decisions
made by the analyst). In addition, prescribed and/or optional activities may
be performed by several members of the team at the same time (i.e., they are group
activities).
For example, the information gathering process needs to be done by a
group of analysts, dynamically selected by the supervisor (a group activity).
This process does not prescribe how team members conduct their work, be-
cause team members are expert analysts and they know best when to perform

an activity and how many times to repeat it (repeatable optional activities). However, there are also prescribed activities since the draft report product is
a scheduled project milestone that must be produced by a specific deadline.
The product review process may contain two group activities that are
performed by two (possibly identical) teams. The review subteam members
use an electronic workspace to access the shared draft report and provide their
comments to the analyst (e.g., by using a chat session or instant messaging).
The team supervisor initiates an activity that produces a team report draft
for the upper management review. This group activity may involve a different
subteam that holds an electronic meeting using videoconferencing to make
decisions while the draft is jointly edited.
This example illustrates the need for supporting optional, group, and pre-
scribed activities. In addition, it demonstrates the need for activity ordering
(control flow) and making the report documents available to each activity at
the right time (dataflow).

2.3 Content and Application Sharing

Distributed electronic teams require tools for finding, accessing, and main-
taining shared content such as documents, images, team calendars, and elec-
tronic bulletin boards. The analysts in our example may participate in several
teams and projects. To facilitate the management of shared content and of project and team information, and to provide the tools needed for the function of each project and team, the common information and tools must be organized in different workspaces. Furthermore, just like teams and projects, analysts and supervisors may create workspaces as needed to perform their functions.
The main advantage of providing team workspaces is that content and
tools for communication and content analysis and discussion are presented
in one virtual space. This reduces tool and content setup time for each team
activity, regardless of how the shared content is maintained.

2.4 Awareness

We define awareness as information that is highly relevant to a specific role and situation of an individual or a team participating in collaboration. Aware-
ness facilitates informed decision making on the part of team members and
supervisors that participate in the collaboration. Awareness facilitates ad hoc
coordination, which is required for creating new teams and workspaces, and
extending and refining the collaboration process. Because human attention
is a limited resource, and because applications cannot handle information
that is unrelated to their purpose and functionality, awareness information
must be digested into a useful form and delivered to exactly the roles or in-
dividuals that need it. If given too little or improperly targeted information,
participants will act inappropriately or be less effective. With too much in-
formation, participants must deal with an information overload that adds to
their work and masks important information.
Simple forms of awareness include determining who is present/active in an electronic team workspace, whether somebody else is editing a document, or whether a document you are editing has been read or edited since you
created its latest version. In addition to these simple forms of awareness,
electronic teams require awareness provisioning technology that supports the
following complementary types of awareness:

• Focused awareness permits participants to tap directly into activities and resources (e.g., context, directory of team members present in a workspace, etc.) that they are interested in, including activities that are performed by others (assuming that they are authorized to do so).
• Customized awareness determines what activity information is needed by
each process participant, and how to filter, combine, digest, and summa-
rize this information to match the participant's information requirements.
• Temporally constrained awareness determines when a process participant
needs specific (focused and/or customized) awareness.
• External awareness extends the above types of awareness to participants
and activities that belong to different organizations, as well as external
information sources that are typically outside of the team workspace(s).

3 Commercial Technologies Addressing Collaboration Requirements

Several of the requirements we identified in Section 2 are currently addressed
by commercially available technologies that have been specifically developed
to support various aspects of collaboration. These commercial technologies
include the following:
Workflow Management Systems (WfMSs) provide process-based coor-
dination and application integration in many application domains rang-
ing from insurance claim processing and loan approval to customer care in
telecommunications. WfMSs are either stand-alone systems, or they are em-
bedded in Enterprise Resource Planning (ERP) systems and e-business in-
frastructures providing Enterprise Integration Architectures (EIAs). WfMSs
have become a major industry, and currently WfMSs capture coordina-
tion and resource utilization rules in predefined/static process definitions
that consist of prescribed activities. WfMSs interpret these process def-
initions to automate procedural coordination of activities including data
and control flow, runtime assignment of activities to participants via or-
ganizational roles, and monitoring the status of instantiated processes.
For examples and for more information on commercial WfMSs see, e.g., [MQS01,Fil01,SAP01,Vit01,Tib01,Eas01,Hew97].

Groupware systems provide tools supporting Computer Supported Cooperative Work (CSCW). Commercial groupware tools ([Net01,Qui01,Gro01,Lot01,Sam01]) are currently used to perform video
and audio conferencing; to provide whiteboard, application, and screen
sharing; to keep group calendars, email, and text chat; to share content; and
to organize and perform group presentations over the internet. Groupware
tools assume ad hoc coordination. Therefore, people use groupware tools to
perform optional group activities. Such activities typically involve sharing
of information artifacts (e.g., presentation documents, applications, video streams, voice streams, etc.). Groupware systems provide tools that allow people to manage such shared resources (i.e., permit specific users to create
such artifacts, view them, manipulate them, check their status, etc.).
Content Management systems are used to provide and control con-
tent. In addition, they support the sharing of content between the
members of a team, multiple teams, or entire organizations and their
partners. Commercial content management systems (see, for example,
[Doc01,Vig01,Ope01,Bro01,Fil01]) currently provide scalable delivery and
sharing of documents and images, content personalization services, syndi-
cation services, content aggregation services, and collaboration services (e.g.,
group calendar and text chat). Content management services assume optional
activities.
In this chapter we will concentrate on the Workflow Management Tech-
nology only. Groupware and content management are important commercial
markets and research areas, however, they are beyond the scope of this work.

4 Evaluation of Current Workflow Management Technology

In the following sections we present the features and capabilities currently
supported by Workflow Management Systems (WfMSs). In particular, we
discuss the workflow process models, tools, and infrastructure provided by
WfMSs.

4.1 Workflow Management Technology

A workflow is an automated process, which means that the coordination and communication between activities is automated, but the activities themselves can be either automated by applications or performed by people who do
manual work and possibly use software tools. The definition of a workflow
process usually involves the specification of the following:

• activities,
• resources,
• dependencies.

A manual workflow activity requires the following resources: the role(s) of the users who are responsible for performing it, the activity tools, and activ-
ity content necessary for the user who is assigned this activity. Automatic
activities specify only the application(s) that execute them.
Workflow technology does not distinguish between shared resources (e.g.,
joint activity tools and shared content), and non-shared resources (e.g., a
user's private calendar tool and appointment data). However, this distinction
is particularly useful in describing tools and content provided by groupware,
and their relationship to workflow technology. Similarly, content management
systems may be viewed as resource management systems that maintain con-
tent resources. In addition, many of the tools that are provided by content
management systems are similar to tools provided by groupware. Since such
tools specialize in content manipulation, we refer to them as content (manip-
ulation) tools. Therefore, just like groupware activity tools, content tools are
activity resources in workflow technology.
In addition to activities and resources, the definition of workflow processes
involves the specification of dependencies between activities and between ac-
tivities and resources. The dependencies between activities are defined by
control flow and dataflow transitions. These define the ordering of activities
and the data communication from one activity to another. The dependen-
cies between activities and resources are resource utilization constraints, e.g.,
assigning a role, a tool, or content to an activity.
Dependencies are implemented by a workflow engine. The engine is con-
trolled by a computerized representation of the workflow processes (including
dataflow and control flow transitions). Each execution of a process is called
a workflow process instance. Users communicate with workflow engines using
worklists. Worklist tools provide an integrated user interface to all workflows
supported by the WfMS. To request work from a user, the engine places
an item in the worklist of this user. Participating users will pick up work
items and indicate the progress and completion status of the work using the
worklist interface.
Figure 4.1 depicts the reference architecture of a WfMS as it is defined
by the Workflow Management Coalition (WfMC) [WfM97].
The external applications in Figure 4.1 may be activity and content tools
provided by groupware and content management systems. The workflow ap-
plication data may be shared content. Some of the content data and status
produced by these external tools may be fed to the workflow engine to con-
trol the workflow process execution. WfMC refers to such data as workflow
relevant data.

4.2 Workflow Process Model


Workflow models typically support the following primitives:
• Activities: These are either basic (elementary) activities or processes
(composite activities).

Fig. 4.1. WfMS reference architecture as defined by WfMC (the workflow enactment service, its administration and control component, and the workflow applications it invokes, together with the organisation/role model data, workflow control data, and workflow relevant data that they refer to, manipulate, and update)

• Dependencies: These include control flow and dataflow transitions between activities, and resource assignments to each activity.
• Resources: These include roles (defined in an organizational structure),
workflow data (data referenced in transition conditions), and activity
content and tools (tools are not always captured by commercial WfMSs'
workflow models).
• Participants: These are WfMS users in a particular workflow who fill
roles and interact with the WfMS while performing activities.
To provide different levels of abstraction, workflow models (and the WfMSs
that provide them) typically support the nesting of workflow processes.
Higher levels of abstraction help in capturing the process as it relates to
the teams and the organizations that participate in carrying out the process.
Modeling at these higher levels is typically devoid of implementation details
and tools. The lower levels of abstraction are required to capture a variety of
details about the actual content, tools, and applications required to support
the implementation of workflow activities.
Workflow processes are specified in one of several workflow specification languages. In the following paragraphs we describe the primitives in the Workflow Process Definition Language (WPDL) defined by the WfMC [WfM97]. Although WPDL is currently incomplete, it is an attempt to define an industry-standard scripting language for representing workflow processes. We discuss WPDL because it supports a fairly common set of primitives (a small illustrative sketch of these primitives follows the list):

• Workflow Process Definition: describes the process itself, i.e., name and/or ID of process, etc. The workflow process definition optionally
contains a reference to an external organizational model.
• Workflow Process Activities: each activity is defined through four dimen-
sions, the who, the what, the how and the when:
- The activity is assigned to a role played by one or more workflow
participants (e.g., "Intelligence Analyst").
- The activity is assigned an application, which will be invoked during
runtime (e.g., "Word Processor" for writing a report).
- Activities are related to one another via transition conditions. Tran-
sition conditions are usually based on workflow data (e.g., an analyst
should write a report, and then send it for a review).
- Optionally an activity depends on a time frame (earliest begin, dead-
line, etc.).
• Workflow Participant Definition: describes the performer of an activity in
terms of a reference to an (external) organizational model. The definition
of such a performer does not necessarily refer to a single person, but
possibly to a function or any other organizational entity. That could be,
for example, a role of "Analyst" or "Manager".
• Transition Information: describes the navigation between different pro-
cess activities, which may involve sequential or parallel execution. Thus,
activities are connected to each other by transition information.
• Workflow Application Definition: defines one to n applications that are
assigned to an activity. These applications will be invoked during run time
by the WfMS. The workflow application definition reflects the interface
between the workflow engine and the application.
• Workflow Process Relevant Data: data used by a WfMS to determine
particular transition conditions that may affect the choice of the next
activity to be executed. Such data is potentially accessible to workflow
applications for operations on the data and thus may need to be trans-
ferred between activities.
• Organizational/Role Model Data: (possibly external) data that may be
referenced by the process definition data to identify relationships of hu-
man and automated resources associated with workflow activities. The
organizational/role model data may contain information about the iden-
tity of human and automated resources, organizational structure of re-
sources, and role information identifying the function that resources can
perform.
• Workflow Control Data: internal control data maintained by the work-
flow engine. They are used to identify the state of individual process or
activity instances. These data may not be accessible or interchangeable
outside of the workflow engine but some of the information content may
be provided in response to specific commands (e.g., process status, per-
formance metrics, etc.).
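
As an illustration of these primitives, the following sketch (in Python, with purely hypothetical names and fields that do not correspond to the actual WPDL syntax) shows how activities, roles, applications, and transition conditions for the report production process of Section 2 might be represented:

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Activity:
    name: str                               # the "what"
    role: str                               # the "who" (participant definition)
    application: Optional[str] = None       # the "how" (application definition)
    deadline_hours: Optional[int] = None    # the "when" (optional time frame)

@dataclass
class Transition:
    source: str
    target: str
    # transition condition evaluated against workflow relevant data
    condition: Callable[[Dict], bool] = lambda data: True

@dataclass
class ProcessDefinition:
    name: str
    activities: Dict[str, Activity] = field(default_factory=dict)
    transitions: List[Transition] = field(default_factory=list)

# A fragment of the report production process from Section 2.
process = ProcessDefinition(name="IntelligenceReportProduction")
process.activities["write_draft"] = Activity(
    name="write_draft", role="Intelligence Analyst", application="Word Processor")
process.activities["peer_review"] = Activity(
    name="peer_review", role="Analyst", deadline_hours=48)
process.transitions.append(
    Transition(source="write_draft", target="peer_review",
               condition=lambda data: data.get("draft_complete", False)))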

4.3 Control Flow and Dataflow


In WfMSs control flow primitives are called transitions. A control flow transi-
tion originates at exactly one source activity and points to exactly one target
activity, and implies that the source has to precede the target. Addition-
ally, a transition condition may be attached to a transition. When multiple
transitions point to the same target activity, this situation is called a JOIN in
[WfM98]. Transitions in a JOIN may be combined by a transition join policy
that is attached to the target activity. A join policy is a boolean condition on
the incoming transitions. Existing WfMSs and standards [WfM98] typically
support only pure AND (AND-join) or OR (OR-join) conditions. As
we discussed earlier, more complex constructs are necessary for modern col-
laboration processes. There is a need for optional and group activities, state
dependent control flow, etc. (see Section 2).
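
The following sketch (hypothetical, not taken from any particular WfMS) illustrates how a join policy attached to a target activity could be evaluated over the transition conditions of its incoming transitions:

from typing import Callable, Dict, List

Condition = Callable[[Dict], bool]

def join_ready(incoming: List[Condition], workflow_data: Dict, policy: str = "AND") -> bool:
    """Return True if the target activity may start under the given join policy."""
    results = [cond(workflow_data) for cond in incoming]
    if policy == "AND":     # all incoming transitions must be satisfied
        return all(results)
    if policy == "OR":      # at least one incoming transition suffices
        return any(results)
    raise ValueError("Unsupported join policy: " + policy)

# Example: a review activity that starts only when the draft is complete
# and the supervisor has released it for review.
incoming = [
    lambda d: d.get("draft_complete", False),
    lambda d: d.get("released_for_review", False),
]
print(join_ready(incoming, {"draft_complete": True, "released_for_review": False}))  # False
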
Some WfMSs support the flow of only process relevant data. Others sup-
port the flow of any data specified in the workflow process definition (i.e.,
independently of whether such data is referenced in a transition condition). In
practice, the ability to pass data among the participants is what determines
the effectiveness of a WfMS. Consider our intelligence gathering example:
it involves a significant number of documents referencing other information
sources and reports. Additional attachments such as articles and papers are smaller but can be numerous. Hence, the WfMS needs to move a
great deal of information around (often from country to country).
The typical support provided for dataflow is to ensure the existence of all
information objects before an activity is started, and to locate and retrieve
these objects. This typically requires no specific action on the part of the
user, who will experience that all activities on the worklist come with all
documents and information needed to do the work. Current WfMSs achieve
this by allowing the process designer to specify whether the WfMS should
provide dataflow by moving data references rather than the data itself. Some
WfMSs rely on specialized external systems to perform dataflow. They may,
for example, rely on an e-mail system and an imaging system for storing
and routing data involved in dataflow, including scanned or faxed images,
spreadsheets, graphics, voice, email, and multimedia objects. Alternatively,
they may use a CORBA Object Request Broker (ORB) [OMG97] to perform
dataflow by moving object references, or integrate imaging systems with the
workflow engine to handle the movement of scanned documents. However, such integration is often poor and the engine has minimal control over the flow of data, complicating the synchronization of dataflow with control flow.
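
The design choice between moving the data itself and moving only data references can be illustrated as follows; the content store and its naming scheme are assumptions made for the example:

# Hypothetical sketch of pass-by-value versus pass-by-reference dataflow.
content_store = {"report-17": "full draft report text ..."}

def deliver_by_value(worklist_item: dict, doc_id: str) -> None:
    # The engine copies the document into the work item: simple, but heavy
    # for large multimedia objects and hard to keep consistent.
    worklist_item["attachments"] = {doc_id: content_store[doc_id]}

def deliver_by_reference(worklist_item: dict, doc_id: str) -> None:
    # The engine passes only a reference; the activity tool fetches the
    # document from the external content or imaging system when needed.
    worklist_item["attachments"] = {doc_id: "contentstore://" + doc_id}

item = {"activity": "peer_review", "assigned_to": "alice"}
deliver_by_reference(item, "report-17")
print(item["attachments"])   # {'report-17': 'contentstore://report-17'}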

4.4 Roles
Commercial WfMSs' roles are global (i.e., organizational) and static (i.e.,
they must be fully defined before the execution of a process begins). Just like
WfMSs, groupware tools provide static activity roles (e.g., "meeting mod-
erator" and "attendee"). Role assignment in WfMSs and process-oriented
systems determines who is doing what activity within a process. The term
role assignment is typically used in WfMSs because process participants are
usually addressed only via roles. Role assignment in existing WfMSs is lim-
ited to a one-out-of-n semantics. This means that an activity in a process
specification corresponds to exactly one activity instance at runtime, and
this activity instance is performed by exactly one participant out of n eligible
participants that play the role(s) assigned to this activity. This traditional
role assignment is well suited to applications where a task must be distributed among a group of workers. However, in cases where a number of people have
to execute the same task, such as participating in the same meeting, or per-
forming concurrent analysis of the same intelligence data, the traditional role
assignment is not sufficient.
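
The difference between the traditional one-out-of-n assignment and the group assignment required by the scenario of Section 2 can be sketched as follows (the participant directory and role names are hypothetical):

import random
from typing import Dict, List

directory: Dict[str, List[str]] = {
    "Intelligence Analyst": ["alice", "bob", "carol"],
    "Supervisor": ["dave"],
}

def assign_one_out_of_n(role: str) -> List[str]:
    # Classic WfMS semantics: exactly one eligible participant performs the
    # single activity instance (chosen here at random for illustration).
    return [random.choice(directory[role])]

def assign_group(role: str) -> List[str]:
    # Group-activity semantics: every participant playing the role takes
    # part in the same activity, e.g., a joint review meeting.
    return list(directory[role])

print(assign_one_out_of_n("Intelligence Analyst"))   # e.g. ['bob']
print(assign_group("Intelligence Analyst"))          # ['alice', 'bob', 'carol']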

4.5 Workflow Process Definition Tools


Most WfMSs provide tools for graphical specification of workflow processes.
The available tools for workflow process design typically support the iconic
representation of activities. Definition of control flow between activities is
accomplished by:
• connecting the activity icons with specialized lines/arrows which specify
the activity precedence order, and
• composing the transition conditions, which must hold before the workflow
execution moves from one activity to another.
Dataflow between activities is typically defined by filling in dialog boxes
that specify the input and output data to and from each activity. In some
WfMSs dataflow definition involves using specialized lines/arrows to draw
dataflow paths between activities.

4.6 Analysis, Simulation, and Animation Tools


Most workflow products provide workflow process animation tools, but de-
pend on external Business Process Management Tools (BPMTs) for sim-
ulation and analysis ([Ids97,HoI97,Met97,IF97]). Such BPMTs provide the
following:

• business process definition tools to produce visual business process models by using one or more process modeling methodologies,
• analysis tools to measure performance and to facilitate process reengi-
neering or improvement efforts,
• simulation tools to determine the short term impact of a model and to
address practical concerns such as "bottlenecks",
• integration tools to export, translate, or share process definitions with
WfMSs.
The sophistication of analysis and simulation provided by BPMTs, as
well as the degree of integration and interoperability between BPMTs and
WfMSs have a direct impact on the ability to validate and evaluate workflow
processes.

4.7 Workflow Monitoring and Tracking Tools


Workflow monitoring tools can present different views of workflow process
execution. They illustrate which activity or activities are currently active,
by whom they are performed, the priorities, deadlines, duration, and depen-
dencies. Administrators can use such monitoring tools to compute statistics
such as activity completion times, workloads, and user performance, as well
as to generate reports and provide periodic summary of workflow process ex-
ecutions. Workflow monitoring tools are included in virtually all commercial
workflow execution systems.
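
A simple example of the kind of statistics such tools compute is sketched below; the log format is an illustrative assumption, not the format of any particular product:

from collections import defaultdict
from statistics import mean

# (process instance, activity, event, timestamp in hours)
log = [
    ("p1", "write_draft", "started",   0.0),
    ("p1", "write_draft", "completed", 6.5),
    ("p1", "peer_review", "started",   7.0),
    ("p1", "peer_review", "completed", 9.0),
    ("p2", "write_draft", "started",   1.0),
    ("p2", "write_draft", "completed", 5.0),
]

durations = defaultdict(list)
started = {}
for instance, activity, event, ts in log:
    if event == "started":
        started[(instance, activity)] = ts
    elif event == "completed":
        durations[activity].append(ts - started.pop((instance, activity)))

for activity, times in durations.items():
    print(activity, "mean completion time", round(mean(times), 1), "h over", len(times), "instances")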

4.8 Basic WfMS Infrastructure: Architecture, GUIs, and APIs


Many commercial WfMSs have loosely-coupled client-server architectures
that divide and distribute the WfMS functionality in components similar
to those illustrated in Figure 4.1. In such WfMSs, the engine is typically the
central component, and it is often referred to as the WfMS server. Process
definition data, workflow control data, workflow relevant data, and organi-
zation/role data are usually kept in a centralized database (or a set of such
databases) under the control of the WfMS engine, its (client) tools, and/or
the external applications invoked by the workflow process. Most WfMS en-
gines and tools take advantage of the data manipulation capabilities of a
commercial database management system (DBMS).
WfMSs typically offer proprietary GUIs and (client) tools for graphical
process specification, process monitoring, process invocation, and interac-
tion with human participants. However, the advent of the Web has made
many workflow product designers consider Web browsers as GUIs for WfMS
(client) tools. Using the Web as a front-end platform also allows for work-
flow processes that are geographically spread out. Since many users already
use Web browsers, there is no need to distribute client software, thus en-
abling a wider class of WfMS applications. Many WfMSs currently support
web-enabled tools for starting and monitoring workflow process instances ([Fil01,MQS01,Ues97,Act97]). Web-enabled client tools are becoming a de facto standard in current WfMSs.
Many state-of-the-art WfMSs have complete application programming in-
terfaces (APIs). This allows everything that can be done through the user
interface also to be done via an API. In addition, the API can be used to
introduce specialized user interfaces and tools designed to meet specific ap-
plication requirements.
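
The sketch below illustrates how a specialized, lightweight worklist client might be built on top of such an API; the endpoint paths and data fields are assumptions made for the example and do not correspond to the WfMC client APIs or to any particular product:

import json
import urllib.request

WFMS_URL = "http://wfms.example.org/api"   # assumed base URL of the WfMS API

def fetch_worklist(user: str) -> list:
    # Retrieve the work items currently assigned to a user.
    with urllib.request.urlopen(WFMS_URL + "/worklists/" + user) as resp:
        return json.load(resp)

def complete_item(item_id: str, result_data: dict) -> None:
    # Report completion status and result data back to the engine.
    body = json.dumps({"status": "completed", "data": result_data}).encode()
    req = urllib.request.Request(
        WFMS_URL + "/workitems/" + item_id, data=body, method="PUT",
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

# A custom analyst-facing tool could, for example, show only items whose
# deadline falls within the next 24 hours:
# urgent = [i for i in fetch_worklist("alice") if i.get("hours_to_deadline", 99) < 24]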

4.9 Advanced WfMS Infrastructure: Distribution and Scalability

State-of-the-art commercial WfMSs can currently support several hundred workflow instances per day. However, older WfMS technology offers limited (or no) engine scalability, distribution, and component redundancy for dealing with load balancing and engine failures.
Workflow vendors have recognized some of these limitations in earlier ver-
sions of their products, and they are currently introducing improvements to
address them. In particular, WfMSs from several vendors allow the use of mul-
tiple WfMS engines for supporting distributed workflow process execution. In
addition, vendors currently provide capacity planning tools that can estimate
the number of WfMS engines required to support the execution requirements
of a given process. However, in many of these WfMSs distributed workflow
process execution requires manual replication of the process definition in all
engines that may be involved in the process execution. This approach suf-
fers from potential configuration problems related to the consistency of process definitions residing in different engines.
Another serious limitation in the current approaches for distributed work-
flow execution is the lack of automatic load balancing. Workflow engine scal-
ability and component redundancy issues can be addressed by choosing an appropriate system architecture [GE95]:

• a server process per client. Such an architecture does not scale well be-
cause of the large number of connections in the system and the large
number of server processes running on the server machine.
• a process per server. The functionality of the applications is provided by
one multi-threaded server process. In this case the server process becomes
a bottleneck, and the server program packed with several applications
becomes hard to maintain, as faults cannot be easily isolated.
• the server functionality and data are partitioned, and there is a server
process for each partition. As long as the partitioning of the functionality
balances the load on the server processes, this architecture adequately
addresses the scalability problem. However, each client has to be aware
of the application partition and any change in the partitioning requires
considerable reorganization.
• a "three-ball" architecture. A router between the client and server pro-


cesses is used to manage a pool of servers. The router automatically bal-
ances the load among the servers for each application, spawns new server
processes to handle heavy load, and restarts failed server processes. This
system can be scaled up further by replicating the router process. In many
modern systems, the router is provided by a TP monitor. However, this
approach requires either that the incoming requests are not related to
each other, or that the router implements session management features.
In many cases, the scalability problem of workflow systems can be ad-
equately addressed by a simple partitioning of the instances (e.g., on a geographical basis), or by replacing certain heavyweight components of the product
(such as replacing a general purpose, but inefficient, worklist handler with a
custom lightweight one built using the API provided by the vendor).
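
The router-based ("three-ball") architecture described above can be sketched as follows; the round-robin policy and the engine interface are illustrative simplifications of what a TP monitor would provide:

import itertools
from typing import Callable, Dict, List

class Router:
    def __init__(self, engines: List[Callable[[Dict], Dict]]):
        self.engines = engines
        self._next = itertools.cycle(range(len(engines)))

    def dispatch(self, request: Dict) -> Dict:
        # Simple round-robin load balancing; a production router would also
        # spawn new engine processes under heavy load and restart failed ones.
        engine = self.engines[next(self._next)]
        return engine(request)

def make_engine(name: str) -> Callable[[Dict], Dict]:
    return lambda request: {"handled_by": name, "request": request}

router = Router([make_engine("engine-1"), make_engine("engine-2")])
print(router.dispatch({"process": "report_review", "instance": 42}))
print(router.dispatch({"process": "report_review", "instance": 43}))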

4.10 Interoperability Among WfMSs and Heterogeneous Applications

For workflow processes that access heterogeneous information systems, interoperability among heterogeneous systems and WfMSs is an important issue.
Currently, interoperability means that various interface standards on different levels are available. These include protocol standards (e.g., e-mail, TCP/IP), platform standards (e.g., Windows, Solaris, Linux), and object interface standards (e.g., OLE, COM, CORBA, Java RMI, EJBs). However,
interoperability at the workflow level requires additional technology and stan-
dards that exploit and extend current solutions, such as those developed by
the Object Management Group and the World Wide Web Consortium.
Because many types of errors and exceptions could arise in a distributed
heterogeneous computing environment, ensuring consistent error handling is
generally considered a difficult problem. The difficulty is enhanced by the
inherent complexity of business processes. Error prevention and handling is
one problem area where new breakthroughs are needed in order to deliver
genuinely robust workflow processes.

4.11 Concurrency Control, Recovery, and Advanced Transactions

Issues of concurrency control are well-understood in databases and transaction processing products. However, state-of-the-art WfMSs take different approaches to concurrency control, depending on perceived workflow process requirements. Current approaches (check-in/check-out, pass-by-reference/pass-
by-value, etc.) are rather primitive when compared to DBMS support for con-
currency. Some WfMSs allow multiple users/applications to retrieve the same
data object concurrently. However, if each user decides to update that data
object, new versions of the data item are created to be reconciled (merged) by
human intervention. The rationale for this approach is the assumption that
data object updates are rare. Thus, consistency can be handled by humans
who review the data object versions and decide which version to keep.
To support forward recovery, contemporary WfMSs utilize transaction
mechanisms provided by the DBMS that maintain the process relevant data.
In particular, such WfMSs issue database transactions to record workflow
process state changes in the DBMS. In the event of a failure and restart,
the WfMS accesses the DBMS(s) to determine the state of each interrupted
workflow instance, and attempts to continue executing workflow processes.
However, such forward recovery is usually limited to the internal components
of the WfMS.
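
The following sketch illustrates this style of forward recovery, with the engine recording each state change in a database transaction and reading back the state of interrupted instances on restart; the schema is an assumption made for the example:

import sqlite3

db = sqlite3.connect("wfms_state.db")
db.execute("""CREATE TABLE IF NOT EXISTS activity_state (
                  instance_id TEXT, activity TEXT, state TEXT,
                  PRIMARY KEY (instance_id, activity))""")

def record_state(instance_id: str, activity: str, state: str) -> None:
    # One database transaction per workflow state change.
    with db:
        db.execute("INSERT OR REPLACE INTO activity_state VALUES (?, ?, ?)",
                   (instance_id, activity, state))

def recover() -> list:
    # After a failure and restart, find activities that were left running.
    rows = db.execute(
        "SELECT instance_id, activity FROM activity_state WHERE state = 'running'")
    return rows.fetchall()

record_state("p1", "peer_review", "running")
print(recover())   # [('p1', 'peer_review')] -> resume or reassign these activities
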
Very few WfMSs currently offer support for automatic undoing of incom-
plete workflow instances. In such systems, the workflow designers may specify
the withdrawal of a specific instance from the system while it is running, pos-
sibly at various locations.
The workflow vendors and the research community are debating whether
it is possible to use database management system technology and transaction
processing technology, or the extended/relaxed transaction models [GHM96]
that have been developed to deal with the limitations of database transactions
in the workflow applications.

5 Research Problems, Related Work, and Directions

In the following sections, we discuss some of the open research problems in supporting team collaboration. Section 5.1 describes the problems and related research efforts aimed at supporting semi-structured processes. Awareness
research issues are introduced in Section 5.2. Supply chain management needs
are discussed in Section 5.3.

5.1 Management of Semi-Structured Processes

To address the requirements of advanced applications we discussed in Section 2, technology for electronic team collaboration must support semi-
structured processes that can be dynamically extended and refined, and may
consist of any combination of prescribed, optional and group activities. Also,
such infrastructure technology must allow adding new activities and subpro-
cesses, and dynamic creation of new roles and participants.
Dealing with dynamic aspects of processes is an emerging topic in the
academic workflow and process management research. An overview and tax-
onomy for this problem area are provided in [HSB98]. Existing work can be
separated into approaches that allow for the dynamic modification of run-
ning processes and approaches that support a less rigid and/or descriptive
workflow specification and therefore allow for more flexibility for the process
participants.

ADEPTflex [RD98], WASA [Wes98], Chautauqua [EM97], and WIDE [CCP+96] rely on a traditional predefined and static definition of workflow
types and provide explicit operations to dynamically change running work-
flow instances. These change operations allow adding/deleting activities and changing control and data flow within a running workflow instance, while
imposing constraints on the modifications to ensure structural integrity of
the process instance. While ad hoc process change may be a reasonable alter-
native for processes with up to a dozen activities, ad hoc change of a large process may introduce inefficiencies and/or permit results that are inconsis-
tent with the original process design objectives.
Collaboration Management Infrastructure (CMI) [GSC+00] supports late
binding of activities and roles in a running process. In particular, CMI pro-
vides placeholder activities. Such placeholders represent abstract activities,
indicating the need or opportunity to perform an activity without prescribing
its specific type, and specify the point in the process where the participants
assigned to them can decide the actual activity to perform. If there is no
existing activity type that can be selected, participants may create a new
activity type.
To allow role creation and role membership changes, CMI provides new
dynamic roles in addition to organizational roles. The dynamic roles are cre-
ated during the execution of a process, and are meaningful only in the scope
of this process.
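
The idea of a placeholder activity that is bound to a concrete activity type at runtime can be sketched as follows; this is an illustration of the concept, not the actual CMI implementation:

from typing import Dict, Optional

activity_types: Dict[str, str] = {
    "lab_test": "Perform lab test on a field sample",
    "expert_consult": "Consult an external expert",
}

class Placeholder:
    def __init__(self, description: str):
        self.description = description
        self.bound_type: Optional[str] = None

    def bind(self, type_name: str, new_description: str = "") -> None:
        # Participants choose an existing type or register a new one on the fly.
        if type_name not in activity_types:
            activity_types[type_name] = new_description or type_name
        self.bound_type = type_name

slot = Placeholder("Investigate the suspected agent")
slot.bind("lab_test")                    # late binding to an existing type
print(slot.bound_type, "->", activity_types[slot.bound_type])
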
The workflow research literature describes a few approaches for extend-
ing traditional workflow models to permit optional activities. These include
descriptive workflow specifications, such as the coo operator in [GPS99] and
flexible process types in Mobile [HHJ+99] that cover a range of predefined
process extensions [GPS99]. Coo is a high-level operator that can be used
to capture cooperating activities that work on shared data as proposed in
[GPS99]. Thus, it is limited to a specific application domain. CMI provides
an explicit optional dependency that can be attached to any activity.

5.2 Awareness
The term awareness has been used in many collaborative systems (not man-
aged by a process specification) primarily to cover information about one's
fellow collaborators and their actions [BGS+99,PS97,SC94]. However, usually
only raw information is provided. This limited form of awareness is sometimes
called telepresence [GGR96]. One motivation for telepresence is that it allows
users to readily determine who is available at remote locations, so that ad
hoc collaboration may be initiated [DB92].
Commercial WfMSs and WfMC's Reference Model [HoI94] currently pro-
vide standard monitoring APIs. However, unless WfMS users are willing to
develop specialized awareness applications that analyze process monitoring
logs, their awareness choices are limited to a few built-in options and process-
relevant events, usually delivered via e-mail or simple built-in tools. Elvin
is a general publish/subscribe framework [BK95] that could be considered event-based. However, no form of customized event processing other than fil-
tering is performed. None of these systems provides mechanisms to tailor the information to specific roles/classes of users, nor do they address the issue
of combining information from multiple sources.
CEDMOS [CBR99,BGS+99] provides focused, customized, temporally
constrained, and external awareness. It extends a general purpose event pro-
cessing system to allow awareness designers to associate any specific collec-
tion of activity or resource events with an awareness role.
Information Trackers are another technology that uses subscription
queries and data fusion and mining techniques to provide personalized aware-
ness. Information Trackers are currently being used successfully in areas such
as business intelligence, technology tracking, and patent search.
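
The following sketch illustrates the general idea of role-based awareness provisioning, matching activity and resource events against per-role subscriptions rather than broadcasting raw events; the subscription format is an assumption made for the example:

from typing import Callable, Dict, List

Event = Dict[str, object]

subscriptions: Dict[str, List[Callable[[Event], bool]]] = {
    # A supervisor only wants to hear about overdue review activities.
    "Supervisor": [lambda e: e.get("activity") == "peer_review" and e.get("overdue") is True],
    # An analyst wants to know about edits to documents in her workspace.
    "Intelligence Analyst": [lambda e: e.get("type") == "document_edited"],
}

def route(event: Event) -> List[str]:
    """Return the roles whose subscriptions match this event."""
    return [role for role, filters in subscriptions.items()
            if any(f(event) for f in filters)]

print(route({"type": "activity_status", "activity": "peer_review", "overdue": True}))
print(route({"type": "document_edited", "document": "draft-report-17"}))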

5.3 Just in Time Supply Chain Management for Collaboration Products

Collaborative applications, such as intelligence gathering projects, often involve several teams that perform activities and produce results (e.g., infor-
mation and reports) in parallel. Current workflow and groupware technology
cannot support efficient delivery of the products of such projects. In partic-
ular, WfMSs support only strict ordering of activities, alternative activities,
and unconstrained parallel activity execution, but not a constrained overlap
of parallel activities. Similarly, groupware relies on explicit peer-to-peer com-
munication to synchronize teamwork and product delivery. To demonstrate
the necessity of coordinating parallel activities, consider a just-in-time sup-
ply chain involving multiple teams from different organizations. This requires
providing information and report parts only when they are needed, in order
to minimize idle time, to reduce the handling of partial products, and to
eliminate associated costs. This places a high responsibility on the partici-
pating teams and their members to finish their products just in time, as
required by other teams and the project supervisors.
Technology supporting such electronic team collaboration must deal with
the issues of what, when and how to synchronize parallel team activities and
what to do if synchrony fails. The only work in this area that we are aware
of was done in the CMI project [GSC+00].

6 Summary

This chapter provides an overview of Workflow Management technology that
is a foundation of electronic collaboration of teams of people as well as auto-
mated systems. Such collaboration is a necessary element of advanced appli-
cations such as crisis management, complex intelligence gathering operations,
supply chain implementations and many others.
We have explored the requirements posed by these applications and
shown how they could be addressed by workflow technology. We have
presented the features of the state-of-the-art commercial products support-
ing collaboration technologies, and identified problems that could not be
adequately addressed by them.
Finally, we have presented current research in workflow and, more generally,
in collaboration technology, and pointed out the directions in which more
research effort is needed.

References
[Act97] Action Technologies, Action Workflow and Metro,
http://www.actiontech.com/. 1997.
[BGS+99] Baker, D., Georgakopoulos, D., Schuster, H., Cassandra, A., Cichocki,
A., Providing customized process and situation awareness in the col-
laboration management infrastructure, Proc. 4th IFCIS Conference
on Cooperative Information Systems (CoopIS'99) , Edinburgh, Scot-
land, 1999, 79-91.
[BK95] Bogia, D., Kaplan, S.M., Flexibility and control for dynamic work-
flows in the worlds environment, Proc. ACM Conf. on Organizational
Computing Systems, 1995, 148-159.
[Bro01] BroadVision: one-to-one publishing, http://www.broadvision.com/.
2001.
[CBR99] Cassandra, A.R., Baker, D., Rashid, M., CEDMOS Complex Event
Detection and Monitoring System, MCC Technical Report CEDMOS-
002-99, Microelectronics and Computer Technology Corporation,
1999.
[CCP+96] Casati, F., Ceri, S., Pernici, B., Pozzi, G., Workflow evolution, Proc.
15th Conf. on Conceptual Modeling (ER'96), 1996,438-455.
[DB92] Dourish, P., Bly S., Portholes: supporting awareness in a distributed
work group, Proc. Conference on Computer Human Interaction
(CHI'92), 1992, 541-547.
[DocOl] Documentum: Documentum CME, http://www.documentum.com/.
2001.
[EasOl] Eastman Software, http://www.eastmansoftware.com. 2001.
[EM97] Ellis, C., Maltzahn, C., The Chautauqua workflow system, Proc. 30th
Hawaii Int. Conf. on System Sciences, 1997, 427-437.
[FilOl] FileNet: Panagon and Workflow, http://www.filenet.com/. 2001.
[GE95] Gray, J., Edwards, J., Scale up with TP monitors, Byte, April, 1995,
123-128.
[GGR96] Gutwin, C., Greenberg, S., Roseman, M., Workspace awareness in
real-time distributed groupware: framework, widgets, and evaluation,
R. J. Sasse, A. Cunningham, R. Winder (eds.), People and Comput-
ers XI, Human Computer Interaction Conference (HCI'96), Springer-
Verlag, London, 281-298.
[GHM96] Georgakopoulos, D., Hornick, M., Manola, F., Customizing transac-
tion models and mechanisms in a programmable environment sup-
porting reliable workflow automation, IEEE Transactions on Knowledge
and Data Engineering 8(4), August 1996, 630-649.
[GPS99] Godart, C., Perrin, O., Skaf, H., COO: a workflow operator to improve
cooperation modeling in virtual processes, Proc. 9th Int. Workshop
on Research Issues on Data Engineering Information Technology for
Virtual Enterprises (RIDE-VE'99), 1999, 126-131.
[GroOl] Groove, http://www.groove.net/. 2001.
[GSC+00] Georgakopoulos, D., Schuster, H., Cichocki, A., Baker, D., Manag-
ing escalation of collaboration processes in crisis response situations,
Proc. 16th Int. Conference on Data Engineering (ICDE'2000), San
Diego, 2000, 45-56.
[Hew97] Hewlett Packard: AdminFlow, http://www.ice.hp.com. 1997.
[HHJ+99] Heinl, P., Horn, S., Jablonski, S., Neeb, J., Stein, K., Teschke, M., A
comprehensive approach to flexibility in workflow management sys-
tems, Proc. Int. Joint Conference on Work Activities Coordination
and Collaboration (WACC'99), San Francisco, 1999, 79-88.
[Hol94] Hollingsworth, D., Workflow reference model, Workflow Management
Coalition, Document Number TC00-1003, 1994.
[HoI97] Holosofx: workflow analyzer, http://www.holosofx.com. 1997.
[HSB98] Han, Y., Sheth, A., Bussler, C., A taxonomy of adaptive workflow
management, On-line Proc. Workshop of the 1998 ACM Confer-
ence on Computer Supported Cooperative Work (CSCW'98) "To-
wards Adaptive Workflow Systems", Seattle, 1998.
[Ids97] IDS-Scheer: Aris toolset, http://www.ids-scheer.de/. 1997.
[IF97] ICL/Fujitsu: ProcessWise, http://www.process.icl.net.co.uk/. 1997.
[Lot01] Lotus: Lotus Notes, http://www.lotus.com/home.nsf/welcome/notes.
2001.
[Met97] MetaSoftware: http://www.metasoftware.com/. 1997.
[MQSOl] IBM: MQSeries workflow,
http://www.ibm.com/software/ts/mqseries/workflow/, 2001.
[NetOl] Microsoft: NetMeeting,
http://www.microsoft.com/windows/NetMeeting/. 2001.
[OMG97] Object Management Group, http://www.omg.org/, 1997.
[Ope01] OpenMarket: Content Server, http://www.openmarket.com/. 2001.
[PS97] Pedersen, E.R., Sokoler, T., AROMA: Abstract Representation of
Presence Supporting Mutual Awareness, Proc. Conf. on Human Fac-
tors in Computing Systems (CHI'97), 1997, 51-58.
[QuiOl] Lotus: QuickPlace,
http://www.lotus.com/home.nsf/welcome/quickplace. 2001.
[RD98] Reichert, M., Dadam, P., ADEPTflex - supporting dynamic changes
of workflows without losing control, Journal of Intelligent Informa-
tion Systems (JIIS), Special Issue on Workflow Management Systems
10(2), 1998, 93-129.
[SamOl] Lotus: Sametime,
http://www.lotus.com/home.nsf/welcome/sametime. 2001.
[SAP01] SAP, http://www.sap.com/, 2001.
[SC94] Sohlenkamp, M., Chwelos, G., Integrating communication, coopera-
tion, and awareness: the DIVA virtual office environment, Proc. Conf.
on Computer Supported Cooperative Work (CSCW'94), 1994, 331-
343.
[TibOl] InConcert, http://www.tibco.com/products/in_concert/index.html.
2001.
[Ues97] UES: KI Shell, http://www.ues.com/. 1997.


[Vig01] Vignette, content management server, http://www.vignette.com/.
2001.
[Vit01] Vitria: http://www.vitria.com. 2001.
[Wes98] Weske, M., Flexible modeling and execution of workflow activities,
Proc. 31st Hawaii International Conference on System Sciences, Soft-
ware Technology Track, vol. VII, 1998, 713-722.
[WfM97] Workflow Management Coalition, http://www.wfmc.org, 1997.
[WfM98] Workflow Management Coalition: Interface 1: Process Definition In-
terchange Process Model, Document Number WfMC TC-1016-P, Ver-
sion 7.04, 1998.
9. Data Warehouses

Ulrich Dorndorf1 and Erwin Pesch 2

1 INFORM - Institut für Operations Research und Management GmbH, Aachen, Germany
2 University of Siegen, FB5 - Management Information Systems, Siegen, Germany

1. Introduction ..................................................... 389


2. Basics ........................................................... 389
2.1 Initial Situation and Previous Development .................. 389
2.2 Data and Information........................................ 391
2.3 Specific Features of Decision-Based Systems .................. 391
2.4 The Data Warehouse Idea.................................... 392
2.5 What is a Data Warehouse? .................................. 393
3. The Database of a Data Warehouse .............................. 394
3.1 Data Sources and Data Variety ............................... 394
3.2 Data Modelling.............................................. 397
3.3 Database Design ............................................. 401
4. The Data Warehouse Concept .................................... 404
4.1 Features of a Data Warehouse................................ 404
4.2 Data Warehouse Architecture ................................ 405
4.3 Design Alternatives of Data Warehouses ...................... 408
5. Data Analysis of a Data Warehouse .............................. 411
5.1 Evaluation Tools ............................................. 412
5.2 Data Mining ................................................. 413
5.3 Online Analytical Processing ..................................... 415
6. Building a Data Warehouse ...................................... 418
6.1 User Groups ................................................. 418
6.2 Data Warehouse Projects and Operations.................... 419
6.3 Products and Services ........................................ 421
7. Future Research Directions ....................................... 422
8. Conclusions ...................................................... 423

Abstract. Data warehouse systems have become a key component of the corpo-
rate information system architecture. Data warehouses are built in the interest of
business decision support and contain historical data obtained from a variety of
enterprise internal and external sources. By collecting and consolidating data that
was previously spread over several heterogeneous systems, data warehouses try to
provide a homogeneous information basis for enterprise planning and decision mak-
ing.
After an intuitive introduction to the concept of a data warehouse, the initial
situation starting from operational systems or decision support systems is described
in Section 2. Section 3 discusses the most important aspects of the database of a data
warehouse, including a global view on data sources and the data transformation
process, data classification and the fundamental modelling and design concepts for
a warehouse database. Section 4 deals with the data warehouse architecture and
reviews design alternatives such as local databases, data marts, operational data
stores and virtual data warehouses. Section 5 is devoted to data evaluation tools
with a focus on data mining systems and online analytical processing, a real time
access and analysis tool that allows multiple views into the same detailed data. The
chapter concludes with a discussion of concepts and procedures for building a data
warehouse as well as an outlook on future research directions.
1 Introduction
Enterprises must react appropriately and in time to rapidly changing envi-
ronmental conditions, recognize trends early and implement their own ideas
as quickly as possible in order to survive and strengthen their own position
in an environment of increasing competition. Globalization, mergers, orienta-
tion towards clients' needs in a competitive market, mechanization, and the
growing worldwide importance of the Internet determine this entrepreneurial
environment. In order to plan, decide and act properly, information is of the
utmost importance for an enterprise. It is essential that the right informa-
tion is available in the appropriate form, at the right time and at the right
place. New procedures are necessary to obtain and evaluate this informa-
tion. PC-based databases and spreadsheets for business analysis have the
drawback of leaving the data fragmented and oriented towards very specific
needs, usually limited to one or a few users. Decision support systems and
executive information systems, which can both be considered as predecessors
of data warehouses, are usually also tailored to specific requirements rather
than the overall business structure. Enormous advances in hardware and
software technologies have enabled the quick analysis of extensive business
information. Business globalization, explosion of Intranets and Internet based
applications, and business process re-engineering have increased the necessity
for a centralized management of data [Tan97,Hac99].
The much discussed and meanwhile well-known concept of the data ware-
house addresses the tasks mentioned above and can solve many of the prob-
lems that arise. A data warehouse is a database built to support information
access by business decision makers in functions such as pricing, purchasing,
human resources, manufacturing, etc. Data warehousing has quickly evolved
into the center of the corporate information system architecture. Typically,
a data warehouse is fed from one or more transaction databases. The data
needs to be extracted and restructured to support queries, summaries, and
analyses. Related technologies like Online Analytical Processing (OLAP) and
Data Mining supplement the concept. Integrating the data warehouse into
the corresponding application and management support system is indispens-
able in order to effectively and efficiently use the information, which is now
available at any time.

2 Basics
2.1 Initial Situation and Previous Development
The amount of internal as well as environmental enterprise related data is
continuously increasing. Despite this downright flood of data, there is a lack of
information relevant for decisions. The data are distributed and concealed in
various branches of the firm, where they are often related to special purposes,
and can also be found in countless sources outside the firm. In addition, the
data evaluations and reports are neither adequately topical nor sufficiently
edited.

An enterprise needs a variety of information as well as information sys-
tems for its various branches and activities: transaction and process oriented
systems for operational planning, and analysis and decision oriented systems
for the tactical and strategic decisions in order to meet the particular de-
mands and necessities at each level.

In the past, manifold concepts have been developed for using data already
at hand in the firm in order to support planning and decision making. Es-
pecially the endeavours regarding Management Information Systems (MIS)
must be mentioned, which already in the 1960s attempted to evaluate data
effectively. However, most ideas and developments have failed so
far for various reasons. In particular, the requirements and expectations were
often too high and could not be satisfied with the existing technology. Conse-
quently, the early enthusiasm rapidly changed into disappointment, and many
projects were swiftly declared failures and were terminated.

Technical progress - such as improved computer performance, shorter ac-
cess time and larger memory capacity, relational database technology, client-
server architectures, user-friendly software and interfaces in particular - as
well as decreasing prices for hardware and improved software engineering
tools have brought about new possibilities and ideas. The concept mean-
while known as data warehouse offers a solution for many of the prob-
lems mentioned above. First efforts in this direction were made by IBM
in the 1980s, which led to the term information warehouse strategy. How-
ever, William H. Inmon may be considered the proper father of the data
warehouse. He has coined the term and integrated various different attempts
and ideas that pointed in this direction, and he provides an insightful and
comprehensive overview of the technical aspects of building a data ware-
house [Inm96,Inm99,Inm00].

Several types of information systems that are related to the data ware-
house concept have been described in the literature. They have become known
under different names such as Decision Support System (DSS), Executive In-
formation System (EIS), Management Information System (MIS), or Man-
agement Support System (MSS) [GGC97,Sch96]. A data warehouse consti-
tutes not only a part but the basis of any of these information systems.
Sauter [Sau96], Marakas [Mar99], Mallach [MaI94] or Sprague and Watson
[SW96] present an overview of decision support systems. Turban [Tur98] gives
an overview of all types of decision support systems and shows how neural
networks, fuzzy logic, and expert systems can be used in a DSS. Humphreys
et al. [HBM+96] discuss a variety of issues in DSS implementation. Dhar and
Stein [DS97] describe various types of decision support tools.
2.2 Data and Information

Information has become one of the strategically most relevant success factors
of an enterprise because the quality of any strategic decision directly re-
flects the quality of its underlying information. Mucksch and Behme [MB97]
consider information to be the major bottleneck resource of an enterprise.
Management requires decision related and area specific information on mar-
kets, clients and competitors. The data must be relevant and of high qual-
ity with respect to precision, completeness, connectedness, access, flexibility,
time horizon, portability and reliability. As an immediate consequence, a
large amount of data does not necessarily imply a comprehensive set of infor-
mation [Dyc00,IZG97]; Huang et al. [HLW98] discuss how to define, measure,
analyze, and improve information quality.
Heterogeneous data on the operational level are derived from a variety of
different external or internal sources, each of which is bound to its particular
purpose. In order to provide these data as a basis for the enterprise's manage-
ment decisions and for post-decision monitoring of the effects of decisions, an
appropriate adaptation is unavoidable. This, however, is precisely the concept
of a data warehouse: flexible and quick access to relevant information and
knowledge from any database.

2.3 Specific Features of Decision-Based Systems

The early database discussion was dominated by the concept of a single uni-
fying, integrating system for all relevant enterprise decisions. The inappropri-
ateness of such a system results from different requirements on the operational
and strategic decision levels regarding relevant data, procedural support, user
interfaces, and maintenance.
Systems on the operational level mainly focus on processing the daily
events and are characterized by a huge amount of data that has to be pro-
cessed and updated in a timely manner. Hence the system's processing time becomes a
crucial factor. Utmost topicality should be assured whereas time aspects of
any decision relevant data are less important because data sources are up-
dated daily on the basis of short-term decisions. Since the environment is
stable in the short run, many operations become repetitious and can possi-
bly be automated.
On the strategic level, fast data processing and updating is less critical,
while different views corresponding to different levels of data aggregation in
various time horizons become more important. Time is a key element in a
data warehouse: it is important with respect to the data's history in order
to forecast future trends and developments. A continuous updating of the
data is not required, as a specific daily modification will not manipulate any
long-term tendency. Data access should be restricted to reading in order to
ensure that the database remains consistent.
Additionally there are differences in the user groups. Lower management
is usually responsible for decisions on the lower, operational level while the
upper and top management is responsible for the long-term strategic deci-
sions.
System usage on the short-term operational level is typically predictable
and almost equally distributed over time, whereas for systems on the strategic
level any prediction of resource usage is almost impossible [Inm96]. If there
is only one support system for the different levels and if this system is used
to capacity, an access from the strategic level for decision support can easily
lead to an unpredictable slowdown of response times or even to a failure
on the operational level [LL96]. Thus, any system has to assure that data
access for strategic level decision support does not influence the performance
of the systems supporting other decision levels. As an obvious consequence
two independent databases should exist.
In the past the development of decision support systems allowing the
user to perform computer generated analyses of data for the operational
decision level has attracted major attention. In recent years, decision
support for long-term planning has also become increasingly important. The
latter, however, is the typical application area for data warehouses, because,
as Groffmann mentions [Gro97a], a database separated from the operational
level applications allows an effective administration of the decision related
data as well as a fast and flexible access to them.
Relational databases, see, e.g., Bontempo and Saracco [BS96], are well
known for transactions on the operational level, but decision-based systems
have their own very special requirements that are not immediately satisfied
by relational database technology. A decision support system does not neces-
sarily require the use of a data warehouse as data source and a decision sup-
port system does not always support decisions but, e.g., their consequences.
The most popular decision support tools are spreadsheets which are not at
all connected to any automated data warehouse. Conversely, data warehouses
need not be used as decision support systems. Data warehousing and decision
support systems or tools do not necessarily coincide but they can complement
each other.

2.4 The Data Warehouse Idea


One of the biggest problems in any larger company is the existence of widely
distributed data generated by different programs for certain reasons and in
different ways. Instead of having a homogeneous data set, the generated data
is distributed over multiple systems and media. Altogether, a large number
of obstacles hinder immediate access to the right data at the right time.
Thus the required information, although it might be available, cannot be
retrieved for analysis, planning or decision support. Furthermore, due to the
rapid improvements in information and communication technologies a large
number of external data sources are also available and have to be explored.
Hence, powerful information retrieval systems are needed which are able to
retrieve all relevant, latest and appropriately worked up information at any
time in order to provide this information for the decision making process.
Data warehouses are an important step towards this goal.

2.5 What is a Data Warehouse?


A data warehouse is comparable to a store where the customers can freely
move around and have access to any item on the shelves. A data warehouse
is a center of information and its administration. Inmon [Inm96] considers a
data warehouse as a "subject oriented, integrated, nonvolatile and time vari-
ant collection of data in support of management's decisions". A data ware-
house is a database that provides and centralizes data from different sources.
The warehouse is oriented around the major subjects of the enterprise such
as customers, products, suppliers, etc. The warehouse is isolated from any
operational system. Operational applications are designed around processes
and functions, e.g., loans, savings, etc. Consequently they are concerned both
with database design and process design, and they contain data that satisfies
processing requirements. Kimball [Kim96] states that a data warehouse is "a
copy of transaction data specifically structured for query and analysis". The
warehouse generates a database which contains selected and aggregated in-
formation. A large variety of different data is collected, unified and updated.
The information is interactively accessible for management support in de-
cision making. The main outputs from data warehouse systems are reports,
either as non-formatted tabular listings or as formal reports, which may be
used for further analysis.
Definitions of and introductions to data warehouses are given by Mat-
tison [Mat96,Mat97], Adamson and Venerable [AV98], Garcia-Molina et al.
[GLW+99], Labio et al. [LZW+97], Meyer and Cannon [MC98], Bischoff and
Alexander [BA97], Inmon et al. [IRB+98], Singh [Sin97], Humphries et al.
[HHD99], Hammergren [Ham97a], Watson and Gray [WG97], Agosta [Ago99],
Sperley [Spe99], Goglin [Gog98], Franco [Fra98], Devlin [Dev97], Ponniah
[Pon01]. Barquin and Edelstein [BE96,BE97] discuss a variety of perspectives
on data warehousing. Jarke et al. [JLV+00] review the state of the art and
cover data integration, query optimization, update propagation, and multi-
dimensional aggregation. They offer a conceptual framework in which the
architecture and quality of a data warehouse project can be assessed. Kim-
ball et al. [KRR+98] discuss all phases of a data warehouse development.
Kimball and Merz [KM00] describe how to build a warehouse that is acces-
sible via web technology. Anahory and Murray [AM97] describe techniques
for developing a data warehouse; their book is also one of the few sources for
time estimates on data warehouse projects. Inmon and Hackathorn [IH94]
further elaborate the concepts for building a data warehouse. Silverston et
al. [SIG97] present examples for data warehouse models. Debevoise [Deb98]
and Giovinazzo [Gio00] discuss an object oriented approach to building a
data warehouse. Inmon et al. [IIS97] explain how data warehousing fits into
the corporate information system architecture. Kelly [Kel94] discusses how
data warehousing can enable better understanding of customer requirements
and permit organizations to respond to customers more quickly and flexibly.
Morse and Isaac [MI97] address the use of parallel systems technology for
data warehouses.
Inmon's understanding of a data warehouse has been generally accepted
although sometimes other concepts like information warehouse have been
introduced in order to focus on specific commercial products. Hackathorn
[Hac95] uses the term data warehousing in order to focus on the dynamic
aspects of data warehouses and to emphasize that the important aspect of
a warehouse is not data collection but data processing. The common aim
behind all concepts is to considerably improve the quality of information.

3 The Database of a Data Warehouse

3.1 Data Sources and Data Variety

In order to provide the right information for decision support, a database has
to be created. The database must be loaded with the relevant data, so that the
required information can be retrieved in the appropriate form. The process of
transforming data that has been collected from a variety of enterprise internal
and external sources and that is to be stored in the warehouse database is
outlined in the example in Figure 3.1.

[Figure 3.1: data from internal sources (e.g., marketing, finance, personnel) and external sources (e.g., online databases, media) is transformed and loaded into the data warehouse]

Fig. 3.1. Data generation for the data warehouse


Data sources. Data from several sources have to be transformed and inte-
grated into the data warehouse database.
According to Poe [Poe97], who gives an overview of data warehousing
that includes project management aspects, the largest amount of relevant
data is produced through enterprise internal operational systems. These data
are distributed over different areas of the enterprise. Acquisition, processing
and storage of these data are difficult because of frequent updates, which
occur not only once per year or month but daily or even several times a day
and usually affect a large amount of data. Enterprise areas with large data
amounts are controlling, distribution, marketing and personnel management.
Other data is collected from enterprise external sources like the media
or other kinds of publications, databases from other companies (possibly
competing ones) as far as available, and information from firms that collect
and sell data. Technological developments, new communication media and
in particular the Internet have led to a rapid increase of these external data
sources. These sources provide additional information for the evaluation of
the enterprise's own data on the competitive markets, for an early recognition
of the market evolution, and for the analysis of own weaknesses, strengths
and opportunities.
Another information source is meta-information, i.e., the information ob-
tained by processing information or data. It is the result of an examination
of data obtained from decision support based systems and takes the form of
tables, figures, reports, etc., which are of importance to different people in
different branches of the company. It can be very costly to extract this kind
of information whenever needed. Although the importance and relevance of
this information for future decisions is hard to predict, it may therefore be
preferable to store the meta-information instead of repeatedly generating it.

Data transformation. The data are available in a variety of formats and
names so that they first have to undergo a transformation process before
they can be introduced into a data warehouse which should serve as a homo-
geneous database. This transformation process is important as the collected
data is optimally adjusted to its original data source in its specific informa-
tion system and business area. Moreover, a large amount of data is redundant
as it is stored in different places, sometimes even under different descriptions.
Brackett [Bra96] uses the terms "data chaos" and "disparate data" in order
to denote non-comparable data, heterogeneous with respect to their kind and
quality, redundant and sometimes even imprecise. Hence, the challenge of the
transformation process is to integrate contents from various data sources into
one homogeneous source, the warehouse.
For example, a typical problem arises from the codes assigned to differ-
ent products for administration and marketing purposes. The codes contain
information about the sales area, the product key, the area size, etc. The
transformation process is responsible for decoding all kinds of available in-
formation in order to make it available for further usage.
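As a simple illustration of such a decoding step, the following sketch parses a composite product code into its components; the code layout (sales area, product key, area size) and all names are hypothetical and merely mirror the example above.

import re
from dataclasses import dataclass

@dataclass
class ProductCode:
    sales_area: str    # e.g., a two-letter sales region code
    product_key: str   # internal product identifier
    area_size: str     # size class of the sales area

def decode_product_code(code: str) -> ProductCode:
    # Assumed layout for this example only: "<area>-<key>-<size>", e.g. "BE-4711-L".
    match = re.fullmatch(r"([A-Z]{2})-(\d+)-([SML])", code)
    if match is None:
        raise ValueError(f"unknown code format: {code!r}")
    sales_area, product_key, area_size = match.groups()
    return ProductCode(sales_area, product_key, area_size)

print(decode_product_code("BE-4711-L"))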
Problems of redundancy occur frequently. A special problem is the neces-
sary propagation of redundantly kept data whenever an update is required.
Often the update is limited to a particular data source and its propaga-
tion is missing, so that data inconsistencies are unavoidable. Moreover, there
is the problem of putative and hidden data redundancy. Different items in
the various business branches of the enterprise may have the same descrip-
tion, or there may be different descriptions and names for identical items.
Both cases cannot be recognized immediately; however, they have to be
detected in the process of creating a homogeneous database in order to
avoid serious mistakes for the decision support. Warehouse consistency is-
sues are discussed in detail by Kawaguchi et al. [KLM +97] or Zhuge et al.
[ZGH+95,ZGW96,ZWG97,ZGW98].
The complexity and variety of enterprise external data sources compli-
cate data transformation and integration far more than the enterprise in-
ternal sources do. The number of different formats and sources is almost
unlimited, and the transformation and unification of the potential informa-
tion is critical with respect to time and cost criteria.

Classification of data. Data in a warehouse can be classified according to
various criteria.
A rough classification with respect to the data's origin merely distin-
guishes between external and internal data and meta-information. A more
refined classification considers the exact source and the kind of business data,
i.e., whether the information that can be deduced is of operational or strategic
relevance.
An important data class is meta-data, i.e., data about data. Meta-data
describe what kind of data is stored where, how it is encoded, how it is related
to other data, where it comes from, and how it is related to the business, i.e.,
it is a precise information about the kind, source, time, storage, etc., of some
data.
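A minimal sketch of such a meta-data record, with illustrative field names chosen only for this example, could look as follows.

from dataclasses import dataclass
from datetime import date

@dataclass
class MetaDataRecord:
    table_name: str        # where the data is stored
    source_system: str     # where the data comes from
    encoding: str          # how the data is encoded
    business_subject: str  # how the data is related to the business
    loaded_on: date        # when the data entered the warehouse

record = MetaDataRecord(
    table_name="sales_fact",
    source_system="order entry system",
    encoding="UTF-8, amounts in EUR",
    business_subject="product sales by region",
    loaded_on=date(2001, 7, 31),
)
print(record)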
Data in the warehouse is either atomic or aggregated. Furthermore, a clas-
sification with respect to the data's age is reasonable, as the data's history is
an important aspect for strategic decision support for forecasting and recog-
nizing trends; a frequent overwriting of past data is not acceptable. Another
question is whether the data structures are normalized or not.
Inmon splits data into primitive and derived data. Primitive data are
detailed, non-redundant, and used daily; they are static according to their
context and definition, and they have a high probability of being retrieved.
Derived data can be characterized by the opposite attributes.
Data can take the form of digits, text, tables, figures, pictures or videos.
A classification with respect to data types distinguishes simple data types stored in
two-dimensional tables as well as multi-dimensional types. Data structures
can be categorized into simple and complex structures and non-structured
data. Especially the latter is difficult to handle for further processing. The
different data structures have their own specific processing or storage require-
ments that must be considered in the design phase of a data warehouse.

3.2 Data Modelling

A data model for strategic decision support differs substantially from models
used on the operational decision level [AM97,Poe97]. The complex data sets
represent a multi-dimensional view of a multi-objective world. Thus the data
structures are multi-dimensional as well. Modelling of data in the data ware-
house means finding a mapping of concepts and terms arising in business
applications onto data structures used in the warehouse. Special attention
must be paid to those structures that are most relevant for the intended
analysis.
A dimensional business model [Poe97] splits the information into facts and
dimensions that can easily be described by key-codes and relations among the
objects. The goal of the model is to provide the information in a user-friendly
way and to simultaneously minimize the number of joins, i.e., the links among
tables in a relational model, required when retrieving and processing the
relevant information.

Facts and dimensions. Schemes on the basis of a relational database in-
corporate two basic data types: facts and dimensions; the data structure of
both is a table. Facts are the main objective of interest. Dimensions are re-
lated to attributes or characteristics of facts. Facts and dimensions should
be stored separately. It has been observed that about 70% of the data-
base volume of a typical data warehouse is occupied by measures of facts
[AM97]. In order to cope with the explosion of fact data, a normalization of
the fact entries is advisable, while dimensions need not always be normalized.
Figure 3.2 shows facts and dimensions in a simple example database scheme.
Fact tables contain relevant information like key-codes or any other quan-
titative information for retrieval and analysis purposes of business processes,
i.e., facts are numbers or appropriate transaction (operation) data. They de-
scribe physical transactions that have occurred at a certain time. Facts rarely
change and their tables are often extensive. Fact tables are hierarchical, i.e.,
columns of a fact table may be related to facts or to dimensions. Turnover
or sales tables of an enterprise may be considered as fact data.
Facts are described by means of dimension tables. They are smaller than
the fact tables they are associated to, and they provide different foci on the
fact tables in order to extract the right information, i.e., to narrow the search
in fact tables to a selection of only relevant information. Dimension entries
are often hierarchically linked values that can even reveal data set relations
unknown so far. A dimension table consists of columns for projection and
links to fact tables; the columns are used for hierarchies that provide a logical
grouping of dimensions, and for description or references in order to introduce
details.
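To make the separation of facts and dimensions concrete, the following sketch creates a small fact table and two of the dimension tables of the example in Figure 3.2; the column names are illustrative only, and SQLite syntax is assumed.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- dimension tables: a primary key plus descriptive columns
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE region  (region_id  INTEGER PRIMARY KEY, name TEXT);

-- fact table: the primary key is a combination of foreign keys into
-- the dimension tables; "sales" and "revenue" are data columns
CREATE TABLE sales_fact (
    product_id INTEGER REFERENCES product(product_id),
    region_id  INTEGER REFERENCES region(region_id),
    sales      INTEGER,
    revenue    REAL,
    PRIMARY KEY (product_id, region_id)
);
""")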
Dimensional data frequently undergo changes. Some changes are the result
of the warehouse development, as at the very beginning not all kind of queries
can be predicted. Thus dimensional tables are to be established in order to
allow an easy later extension and refinement. Product types, sales regions or
time horizons may be considered as business dimensions.
Holthuis [Hol97] differentiates between several types or groups of dimen-
sions which, once again, may be divided into subtypes or subgroups, etc. Busi-
ness dimension types may be standardized with respect to time or some other
kind of measure. They also may be defined individually and task-specifically.
Structural dimension types are hierarchical due to their vertical relations.
Their data may be aggregated in a hierarchical structure or they may consist
of different internal structures. Moreover, there are also categorical aspects
relevant for dimensions. For example, categorical dimension types are marital
status, salary, etc. Categories result from specific attributes of the informa-
tion objects and can be partitioned into several sub-categories.
As about 70% of a database volume are occupied by measures of facts,
queries are often separated into steps. In the first step, access to the dimension
tables restricts the data volume that has to be searched. In order to select
the desired information, SQL queries are limited to a number of predefined
or user-defined links of fact and dimension tables.
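The two-step access pattern can be sketched as follows: a first query touches only a small dimension table in order to collect the qualifying keys, and only then is the large fact table searched with these keys. The table and column names follow the hypothetical schema sketched above for Figure 3.2.

# Step 1: restrict the small dimension table first.
dimension_query = "SELECT product_id FROM product WHERE name LIKE '%bike%'"

# Step 2: scan the large fact table only for the qualifying keys.
fact_query_template = """
SELECT region_id, SUM(sales), SUM(revenue)
FROM sales_fact
WHERE product_id IN ({keys})
GROUP BY region_id
"""

def build_fact_query(product_ids):
    # Insert the keys found in step 1 into the fact table query.
    keys = ", ".join(str(pid) for pid in product_ids)
    return fact_query_template.format(keys=keys)

print(dimension_query)
print(build_fact_query([1, 2, 3]))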
A huge data set may be partitioned with respect to its fact or dimension
data tables into smaller tables. However, a large number of tables is not
desirable and it has been suggested that the number should not exceed 500
[AM97]. Horizontal partitioning of dimension tables should be considered if
their size reaches the size of fact tables in order to reduce the time of a query.
Partitioning is discussed in detail in Section 3.3 below.

Database schemes. Several database schemes have been used for data
warehouse databases. In analogy to the structure and links between elements
of the de-normalized fact and dimension tables, their names are star scheme,
starflake scheme or snowflake scheme. These schemes have their particular
functional properties and can be visualized in different ways. The schemes
consist of facts and dimensions documented by means of tables, i.e., the
schemes basically consist of tables and differ in their structural design [GG97].
Kimball [Kim96] gives an excellent explanation of dimensional modelling (star
schemes) including examples of dimensional models applicable to different
types of business problems.
Dimensions are represented in fact tables by means of foreign key entries.
A detailed description can be found in the inter-linked dimension tables.
According to their key, the columns of the tables are called the primary
or the foreign key columns. The primary key of a dimension table usually
[Figure 3.2: a central fact table with data columns "sales" and "revenue" is linked to the dimension tables "product" (e.g., mountain-bike, trekking-bike, race-bike), "region" (e.g., Berlin), "time-horizon" (e.g., July to October) and "color" (e.g., yellow, red, green, blue)]

Fig. 3.2. An example of a star scheme: the primary key of the fact table consists
of foreign keys into the dimension tables "product", "region", "time-horizon" and
"color"; "sales" and "revenue" are data columns of the fact table

consists of a single column, and it is introduced as a foreign key in the fact


tables. The primary key of a fact table consists of a combination of foreign
keys. Columns of fact tables without keys are data columns.
In a star scheme a fact table defines the center of a star while the dimen-
sion tables create the star branches. Each dimension has its own table and
has only links to facts, not to other dimension tables. Figure 3.2 shows an
example of a star scheme.
An alternative design is a star scheme consisting of several fact tables,
called multiple fact scheme [Poe97]. Fact data that are described by means
of different dimensions may be distributed over several fact tables, while fact
data characterized by the same dimensions should be kept together within
one table. A multiple fact scheme can be used whenever the facts have no
common relationship, their update periods differ or a better performance of
the overall system is desired. It is common practice that multiple fact tables
contain data of different aggregation levels, whenever a non-aggregated fact
table would be extremely large. An example of a multiple fact scheme is
shown in Figure 3.3.
An n:n-relation between dimensions can also be introduced in a star
scheme. The resulting table is called associative and the dimension relations
are incorporated into a separate fact table. An outboard table is a dimension
[Figure 3.3: two fact tables share dimension tables such as "product", "region", "time-horizon" and "supplier"; one fact table records sales values, the other plan and actual cost values per facility]

Fig. 3.3. An example of a multiple fact scheme

table that contains foreign keys as primary keys to other dimension tables.
The latter dimension tables, called outrigger tables or secondary dimension
tables, are used in order to specify a primary dimension through this sec-
ondary dimension. Usually this kind of foreign key only exists in fact tables
in which an appropriate combination of the keys defines a primary key.
In the multiple star scheme, fact tables may, besides their foreign keys to
dimension tables, contain primary keys without any link to a dimension table
but to columns of some fact tables. This happens if the keys linked to the
dimension tables do not sufficiently specify the fact table. The primary key
characterizing the fact table may be any combination of foreign or primary
keys.
The star scheme has a simple structure and well-defined tables and links.
Updates can easily be handled by a user familiar with any kind of database
design. The system response time for a query is short. One of the main
disadvantages is the simplicity of the links among the tables. Dependencies
or any other kind of dimensional relations cannot be introduced easily without
additional overhead. As a result, the data warehouse in action may suffer from
a lack of performance. To overcome these deficiencies additional tools are
required which, ideally, provide a higher flexibility in data modelling through
a higher level of abstraction between the user and the physical database.
Consequently, the snowflake scheme and the starflake scheme, which combines
the star and snowflake schemes, have been suggested.
As the name implies, in the snowflake scheme facts and dimensions are
organized like a snowflake. Very large dimension tables may be normalized
in third normal form while the fact tables usually are not normalized. Each di-
mension may have its own sub-dimension, etc. The resulting complex struc-
ture has the shape of a snowflake. The snowflake scheme is attractive as it
usually achieves a performance improvement in case of large dimension ta-
bles with many detailed data. The starflake scheme is a combination of the
star and the snowflake schemes with respect to structure and functionality.
Starflake schemes allow dimension overlapping, i.e., a repetitious occurrence
of snowflake dimension tables or star dimension tables is possible. Overlap-
ping should be carefully designed. The overlapping design of starflake schemes
allows a high retrieval performance without any a-priori knowledge of the fu-
ture access patterns.
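As a small illustration of the snowflake idea, the product dimension of the earlier star scheme could be normalized into a main dimension table and a sub-dimension table; table and column names are again hypothetical, and SQLite syntax is assumed.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- sub-dimension table: product categories are kept in their own table
CREATE TABLE product_category (
    category_id INTEGER PRIMARY KEY,
    name        TEXT
);

-- normalized product dimension: instead of repeating the category name
-- in every row, the table references the sub-dimension
CREATE TABLE product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES product_category(category_id)
);
""")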

3.3 Database Design

The database design strongly influences the performance of data warehouse
retrieval operations. The most important design choices to be made concern
granularity, data aggregation, partitioning, and de-normalization as well as
the aforementioned different kinds of data modelling by means of the star,
starflake and snowflake schemes.
There are ways of modelling data that usually speed up querying and re-
porting and may not be appropriate for transaction processing or even may
slow down transaction processing. An example is bitmapped indexing, which
is a family of indexing algorithms that optimize the query performance in rela-
tional database management systems by maximizing the search capability of
the index per unit of memory and per CPU instruction [ONe94]. Bitmapped
indices eliminate all table scans in query processing. There are also techniques
that may speed up transaction processing but slow down query and report
processing. What is needed is an environment for formulating and processing
queries and generating reports that does not require too much knowledge of
the technical aspects of database technologies.
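The principle behind bitmapped indexing can be illustrated with a toy model: each distinct value of a low-cardinality column is represented by one bitmap over the row positions, and a query combines bitmaps with bitwise operations instead of scanning the table. The sketch below is only an illustration, not the implementation used by any particular database system.

# Rows of a small fact table; only the low-cardinality columns are shown.
rows = [
    {"color": "red",   "region": "Berlin"},
    {"color": "blue",  "region": "Bavaria"},
    {"color": "red",   "region": "Bavaria"},
    {"color": "green", "region": "Berlin"},
]

def build_bitmaps(rows, column):
    # One integer bitmap per distinct value; bit i is set if row i matches.
    bitmaps = {}
    for i, row in enumerate(rows):
        value = row[column]
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)
    return bitmaps

color_index = build_bitmaps(rows, "color")
region_index = build_bitmaps(rows, "region")

# Query: color = 'red' AND region = 'Bavaria', answered by a bitwise AND.
hits = color_index["red"] & region_index["Bavaria"]
matching_rows = [i for i in range(len(rows)) if hits & (1 << i)]
print(matching_rows)   # -> [2]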

Granularity and aggregation. Inmon [Inm96] or Poe [Poe97] consider
granularity to be the most important means for data structuring in a data
warehouse, leading to an increased data processing efficiency. Granularity di-
rectly reflects the level of aggregation of data in the warehouse, in other words
it can be considered as a measure of the degree of detail. High granularity
corresponds to a low degree of detail while a low data aggregation allows
access to detailed data. With respect to the data relevance, the level of ag-
gregation may be different. Usually granularity, i.e., aggregation, increases
with the age of the data. While on the top level the most recent data might
be provided in detailed, weakly aggregated form, older data become more
and more highly aggregated. Of course, any specific design depends on
the required information and is a matter of the decision making process. Ag-
gregation of internal data, which is frequently introduced or transformed in
the database, is quite simple, while data from external sources, due to their
inherent heterogeneity, might cause problems. Aggregation either is a mat-
ter of the data transfer during the integration and transformation process of
data into the warehouse, or it may be shifted to a later time step where the
aggregation is completely integrated in the database. In the latter approach
some particular trigger mechanisms can be used that have to be fired by the
data management functions.
Granularity leads to a more efficient processing of data. Obviously, ag-
gregation increases the speed of data retrieval and reduces the data volume
that has to be searched in any retrieval. In addition granularity reduces the
memory requirements which, however, is only relevant in cases of an extensive
usage of quite costly online data accesses. This high speed memory should
be limited to frequently used current data and its aggregation, which is re-
quired for decision support, while data from the past may be moved to the
slower memory where it stays with the operational daily data of no decision
importance.
Granularity defines the data volume as well as the extent of queries that
can be answered. There is a trade-off between the amount of data and the
details of possible queries. A high level of aggregation and a corresponding
low degree of detail reduces the amount of data and its resource usage, but
cannot satisfy the demand for a high flexibility of data analysis. Multilevel
data aggregation can help to overcome this conflict. There is no aggregation
of the current information which, however, will be aggregated at a later time
for direct access in the data warehouse; the detailed data are still available
and can be retrieved whenever necessary. It is common practice to aggregate
daily data by the end of the week, and to aggregate weekly data by the end of
the month, etc. [Inm96,Bis94]. It is even more common that a collective view
of data is taken. Multi-level granularity typically achieves that about 95% of
all queries are served fast and immediately, while only 5% of all queries need
the data that has been moved out to the archives [MHR96].
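A minimal sketch of such multi-level aggregation, assuming a plain list of daily sales records with purely illustrative values, rolls the detailed daily data up to weekly and monthly totals.

from collections import defaultdict
from datetime import date

# Detailed daily records: (day, sales amount).
daily_sales = [
    (date(2001, 7, 30), 120.0),
    (date(2001, 7, 31),  80.0),
    (date(2001, 8,  1), 200.0),
]

def aggregate(records, key):
    # Roll detailed records up to a coarser level of granularity.
    totals = defaultdict(float)
    for day, amount in records:
        totals[key(day)] += amount
    return dict(totals)

def week_key(day):
    iso = day.isocalendar()
    return (iso[0], iso[1])          # ISO year and week number

weekly = aggregate(daily_sales, key=week_key)
monthly = aggregate(daily_sales, key=lambda d: (d.year, d.month))
print(weekly)
print(monthly)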

Partitioning. Partitioning means splitting a set of logically related items
that define a unit into smaller pieces. Within a data warehouse this can lead
to a partitioning of the database as well as to a partitioning of the hardware.
The focus of hardware partitioning is an optimized performance of hardware,
input/output and CPU. Partitioning the database means splitting the data
into smaller, independent and non-redundant parts. Partitioning is always
closely connected to some partitioning criteria which can be extracted from
the data. For instance there might be enterprise related data, geographical
data, organizational units or time related criteria, or any combination of
these. A flexible access to decision-relevant information as one of the most
important goals of data-warehousing implies that partitioning is particularly
a tool to structure current detailed data into easily manageable pieces.
Anahory and Murray [AM97] differentiate between horizontal and verti-
cal partitioning. Horizontal partitioning splits data into parts covering equal
time horizon lengths. Non-equal time horizon lengths might be advantageous
whenever the frequency of data access is known in advance. More frequently
accessed data, e.g., the most recent information, should be contained in
smaller parts so that it can easily be kept online-accessible. Horizontal par-
titioning may also split data with respect to some other criteria, e.g., prod-
ucts, regions, or subsidiary enterprises, etc. This kind of partitioning should
be independent of time. Irrespective of the dimension, Anahory and Mur-
ray recommend using the round-robin method for horizontal partitioning,
i.e., whenever a certain threshold is reached, the current data partition is
stored away in order to free the online memory for new data partitions.
The vertical partitioning of data is closely related to the table representa-
tion of the data. Hence, columns or a set of columns may define a partition.
Moreover, enterprise functions may also be considered as a kind of vertical
partition. Vertical partitioning avoids an extensive memory usage because
less frequently used columns are separated from the partition.
Partitioning has several advantages; in particular, a smaller data volume
increases the flexibility of data management, as the administration of large
data tables is reduced to smaller and manageable ones. Data can more eas-
ily be restructured, indexed or reorganized; data monitoring and checking
are also easier. In addition, partitioning facilitates a regular data backup
and allows a faster data recovery. Finally, partitioning increases the system's
performance because a small data volume can be searched more quickly.
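A very small sketch of horizontal, time-based partitioning: incoming records are routed to one partition per calendar month, and a query only reads the partitions that overlap the requested time horizon. The partition naming scheme and the record layout are purely illustrative.

from collections import defaultdict
from datetime import date

partitions = defaultdict(list)       # partition name -> list of records

def partition_name(day: date) -> str:
    # Horizontal partitioning criterion: one partition per calendar month.
    return f"sales_{day.year}_{day.month:02d}"

def insert(day: date, record: dict) -> None:
    partitions[partition_name(day)].append(record)

def months_between(first_day: date, last_day: date):
    year, month = first_day.year, first_day.month
    while (year, month) <= (last_day.year, last_day.month):
        yield year, month
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

def query(first_day: date, last_day: date):
    # Only the partitions overlapping the requested time horizon are read.
    result = []
    for year, month in months_between(first_day, last_day):
        result.extend(partitions.get(f"sales_{year}_{month:02d}", []))
    return result

insert(date(2001, 7, 30), {"sales": 3, "revenue": 120.0})
insert(date(2001, 8,  1), {"sales": 5, "revenue": 200.0})
print(query(date(2001, 7, 1), date(2001, 7, 31)))   # reads only the July partition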

De-normalization. De-normalization is another option for structuring a
database. Its starting point is a relational data model in third
normal form. While normalization assures data consistency, de-normalization
also preserves consistency and increases the performance through the combi-
nation of data tables. The main intention is to reduce the number of inter-
nal database retrieval operations in order to reduce the system's response
time. De-normalization increases redundancy and therefore requires addi-
tional memory.
The star scheme is the most popular technique of de-normalization. Data
are always transferred in blocks whenever there is a database access. Closely
related objects are linked together. Data with a higher probability of access
are linked in small tables in order to achieve an increased query efficiency. Any
kind of structured data access, e.g., a certain access probability, data access
sequences, etc., can be reflected by means of linked tables of data blocks in
order to minimize the number of required queries. Data redundancy might be
quite efficient for data whose use is widely spread and rather stable. This is
even more important if costly calculations of data are the only way to avoid
redundancy.

Updates. After loading the data warehouse with the decision relevant infor-
mation, the data have to be updated on a regular basis, i.e., current external
or internal data have to be stored in the warehouse. This procedure, called
warehouse loading, is supposed to be executed either in well-defined time
steps or whenever there is a need for new information. The level of topicality
of the warehouse data depends on the enterprise-specific requirements. For
instance, financial data typically need a daily update. Data updates on a
regular basis within a certain time interval can be shifted to the night or to
the weekend in order to avoid unnecessary machine breakdowns or lengthy
query response times. Time marks are used to indicate the changes of data
over time. Monitoring mechanisms register changes of data.
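An incremental load based on such time marks can be sketched as follows: the warehouse remembers the time mark of the last load and, at each run, copies only those source rows that have changed since then. The data structures and values are illustrative only.

from datetime import datetime

# Source rows carry a time mark recording their last change.
source_rows = [
    {"id": 1, "revenue": 120.0, "changed_at": datetime(2001, 7, 30, 10, 0)},
    {"id": 2, "revenue": 200.0, "changed_at": datetime(2001, 8, 1, 9, 30)},
]

warehouse_rows = []
last_load = datetime(2001, 7, 31, 0, 0)    # time mark of the previous load

def load_warehouse(now: datetime) -> None:
    # Insert only the rows changed since the last load; nothing is overwritten.
    global last_load
    for row in source_rows:
        if row["changed_at"] > last_load:
            warehouse_rows.append(dict(row))
    last_load = now

load_warehouse(datetime(2001, 8, 2, 1, 0))  # e.g., a nightly load
print(warehouse_rows)                       # only the row changed in August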

4 The Data Warehouse Concept

4.1 Features of a Data Warehouse


Inmon's definition of a data warehouse as a "subject-oriented, integrated,
non-volatile and time variant collection of data in support of management
decisions" summarizes the most important features of a data warehouse.
Subject orientation means that the data is oriented towards the enterprise
subjects, such as products, customers, or locations. This stands in contrast
to systems on the operational level which are mainly oriented towards the
functions within the enterprise activities [Gro97a].
As a result of integration there should be a unified, homogeneous data
basis. Data collected from different sources and systems usually exists in
different formats under non-unified notation. Integration means fitting these
heterogeneous data together into a unified representation.
Non-volatility ensures that the collection of data in a warehouse is never
changed unless failures require a correction. Hence any access to decision-
based information is limited to data reading, and writing caused by updates
of the topical information is restricted to insertions.
Besides insertions, management systems for operational planning frequently
also allow overwriting of data. Data updates are typical of those systems but
the maintenance of a data warehouse through overwriting is generally not ac-
ceptable. Non-volatility also implies that any calculations can be reproduced
at any time [Gro97a].
Time variance is another concept that clearly distinguishes a warehouse
from systems on the operational planning level. While the latter consider a
limited, short period, e.g., 2-3 months, in which the topical data is collected,
memorized and processed, the data warehouse is constructed for decision
support over a long time horizon (up to 10 years). Thus, information remains
unchanged over time, and time is a key element of any extracted information.
Groffmann [Gro97a] adds a fifth feature: redundancy.

4.2 Data Warehouse Architecture

The description of a data warehouse can be process oriented. The process
oriented view, which is also referred to as data warehousing, obtains the data
from the analysis and description of the functions or procedures arising from
the enterprise activities. Another description can be the one of a fictitious
observer who differentiates between a number of levels, e.g., an input and an
output level and a data administration level. A further possible view of a data
warehouse might be to consider the data as the center of a warehouse. Any
function is defined by its effect on the processed data, e.g., data generation,
data extraction, data recovery, etc. Closely related is the view based on the
data flow in a warehouse. Data processing operations are inflow, upflow,
downflow, outflow or metaflow. A warehouse may be divided into two larger
parts: the data mobilization, which is responsible for data collection and
preparation, and the information discovery, a part responsible for generation
and analysis of information.
In what follows we will emphasize a process oriented view of data ware-
houses. Processes in a warehouse are described with respect to their flow of
data. Basically, we are interested in the extraction and insertion of data, data
updating and converting, recovering and information retrieval management.
The architecture of a warehouse can then be described by its processes, in
other words, it can be considered from a system manager's point of view.

• The insertion manager is responsible for the transfer of external or inter-
nal data into the warehouse. The manager's task is the extraction of data
from the external or internal data source and insertion of the extracted
data into the warehouse.
• The function of the warehouse manager is limited to the administration of
data and information and includes tasks such as the aggregation of data,
consistency checking, de-normalization, operating updates, data projec-
tion, data transformations between different schemes like star, snowflake
or starflake, etc.
• The retrieval manager operates the user interface, i.e., the manager han-
dles the incoming queries and outgoing decision support. The retrieval
manager is responsible for optimal resource planning and efficient re-
sponse times. The manager uses query profiles in order to satisfy the
users' demand [AM97]. Reports and queries can require a variable and
much greater range of limited server resources than transaction process-
ing. Reporting and querying functions that are run on a server which is
also used for transaction processing can create managing problems and in-
crease response times. Managing the limited resources in order to achieve
a high probability for reasonably small transaction processing response
times is a very complex task.

In general these managers are independent and automated.

The database. The enterprise-wide database is the most important compo-
nent of a warehouse; it is separated from operational systems and it contains
information for any kind of assessment or analysis that can be initiated for
decision support.
Transformation programs select, extract, transform and insert data into
the database. They create the basis for an effective decision support through
the selection of sources and their data. Transformation programs are the
only interface to the data sources. Their tasks are the selection of relevant
data and the transformation of these data into subject-oriented, non-volatile
and time-variant structures providing the basis for information generation.
Among these transformations are the data mapping of source data to their
destination, and data scheduling, i.e., the time-based planning of data trans-
fers. The transformation of data consists not only of the integration of
various data types, the generation of links between data, and the balancing of
differences in the data's topicality, but also includes the filtering of incon-
sistencies or fuzziness.
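
To make these transformation tasks more concrete, the following sketch shows a minimal extract-transform step in Python. All field names and formats (ord_no, ord_date, the German-style number format, etc.) are invented for illustration only; a real transformation program would be driven by the rules recorded in the meta-database.

```python
from datetime import datetime

def transform_source_rows(src_rows):
    """Map, harmonize and filter source records before loading them.

    Each source row is assumed to be a dict such as
    {"ord_no": "4711", "ord_date": "31.12.2002", "amount": "1.234,50", "cust": "C42"}.
    """
    warehouse_rows = []
    for row in src_rows:
        # Filter obviously inconsistent records (missing keys or empty values).
        if not row.get("ord_no") or not row.get("cust"):
            continue
        # Harmonize heterogeneous source formats into the unified representation.
        order_date = datetime.strptime(row["ord_date"], "%d.%m.%Y").date()
        amount = float(row["amount"].replace(".", "").replace(",", "."))
        # Map the source fields onto the subject-oriented target structure.
        warehouse_rows.append({
            "order_id": int(row["ord_no"]),
            "customer_id": row["cust"],
            "order_date": order_date.isoformat(),
            "amount": round(amount, 2),
        })
    return warehouse_rows
```
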
The insertion manager handles the first loading of a warehouse and the
updating of modified or supplemented data on a regular basis. These contin-
uous changes of information are also called monitoring [Sch96]. Monitoring
can be initiated whenever changes have been recognized, in certain time in-
tervals, or whenever some additional information is needed. An immediate
update of a relational database is achieved by means of a trigger mechanism,
which recognizes changes and transfers them to a converter. Thus, a trig-
ger becomes active, for instance, if a table of the database changes. Updates
on a regular basis within predefined time steps may be obtained by using
snapshot-refresh mechanisms or a simple comparison of the data from the
source to the data in the warehouse.
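
Trigger mechanisms are specific to the source database system; the snapshot comparison mentioned above can be sketched independently of any particular product. The following fragment is only an illustration under the assumption that both snapshots are dictionaries keyed by the source table's primary key.

```python
def detect_changes(old_snapshot, new_snapshot):
    """Compare two snapshots of a source table, both keyed by primary key.

    Returns the new rows and the modified rows that the insertion manager
    would have to propagate; deletions are ignored because the warehouse
    is non-volatile.
    """
    inserts = {key: row for key, row in new_snapshot.items()
               if key not in old_snapshot}
    updates = {key: row for key, row in new_snapshot.items()
               if key in old_snapshot and old_snapshot[key] != row}
    return inserts, updates

old = {1: ("C42", 100.0), 2: ("C43", 250.0)}
new = {1: ("C42", 100.0), 2: ("C43", 300.0), 3: ("C44", 80.0)}
print(detect_changes(old, new))   # row 2 was modified, row 3 is new
```
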
After converting the source data into the format of the warehouse data,
another program, called integrator, integrates the data into the warehouse.
The integration part also provides some standard predefined queries in or-
der to guarantee a better performance. This process is accompanied by a
necessary balancing of data and the removal of inconsistencies and failures.
Information about the data source, format, structure or any specific trans-
formations is put into the meta-database system of the data warehouse.

The meta-database. The meta-database, sometimes also called business
data directory or warehouse repository, is the backbone of a data warehouse.
Meta-data are data needed for the effective use and administration of data.
They provide information about data such as the data's structure, origin, age,
storage location within the warehouse, access conditions and possible evalua-
tions. Meta-data also contain information about the data model, aggregation
level, analyses, and reports. Inmon [Inm96] states that "for a variety of rea-
sons meta-data become even more important in the data warehouse than in
the classical operational environment". Meta-data is information about data,
their structure, and their flow and use, that enables the user to easily find
the required decision relevant information. Meta-data can be considered as
a kind of compass through the data volume and they provide the user with
helpful transparency.
The hierarchical structure of data exists not only on two levels but there
are also data on meta-data, sometimes called meta-meta-data or corpo-
rate meta-data [Inm96,Bra96]. Thus, meta-data are manifold; they contain
subject-relevant data, e.g., economic data, as well as technical administration
data.
Queries on a meta-database are usually not pre-defined but are user spe-
cific. It is important to realise that a meta-database can only provide the
desired user flexibility if the meta-data terminology can be understood by
the user. The different functions of a data warehouse require their individual
meta-data, e.g., the identification of the data source, the transformation, the
data insertion, and the data administration, retrieval and evaluation. Hence,
the administrative function of a meta-database might be considered as a basis
of all functions of a warehouse.
Meta-data can be classified as local and global data. While local data are
only accessible to some users, global data are available for all decision makers.
Poe [Poe97] divides meta-data into those that lead from an operational man-
agement information system to the data warehouse and those which guide the
user to the required decision support tools of the warehouse. In other words,
meta-data can be classified as operational and decision-support meta-data.
He argues that the quality of a warehouse heavily depends on the quality
of its meta-data. Their basic function is the support of the user, who finally
decides on the acceptance of the overall warehouse concept.
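
The kind of information a meta-database keeps about a single warehouse table can be illustrated with a small record structure; the attribute names below are hypothetical and simply mirror the properties listed above (origin, age, storage location, aggregation level, access conditions).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MetaDataEntry:
    """Descriptive meta-data for one table or view of the warehouse."""
    table_name: str
    source_system: str                # origin of the data
    load_date: str                    # age / topicality of the data
    storage_location: str             # where the data resides in the warehouse
    aggregation_level: str            # e.g. "detail", "monthly", "yearly"
    access_groups: List[str] = field(default_factory=list)   # access conditions

entry = MetaDataEntry(
    table_name="sales_fact",
    source_system="order_entry",
    load_date="2002-12-31",
    storage_location="dwh.sales_fact",
    aggregation_level="detail",
    access_groups=["controlling", "marketing"],
)
```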

Archiving. Another software part of a warehouse puts data into archives
and operates the backups in order to allow necessary re-installation after
data losses or system or program errors. A backup should at least include the
most detailed data level; backups of all aggregation levels might accelerate
a re-installation. Besides serving for backups, archives contain data which
are with a high probability not used any longer. Archives guarantee that the
active part of the warehouse performs quite efficiently even if the amount of
regularly inserted data increases rapidly. Detailed data that were previously
available online are moved to cheaper offline memory, such as optical disks or
sometimes magnetic tapes, while the data's aggregated information remains accessible
online. The archive keeps the size of necessary online memory limited. In order
to guarantee that simple standard or ad-hoc queries can be responded to in
a reasonable time, an archive memory also provides the necessary effective
access procedures.
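
A rough sketch of such an archiving step is given below: detail rows older than a configurable horizon are moved from the active store into an archive while everything else stays online. The row layout and the ten-year horizon are assumptions made only for this example.

```python
from datetime import date, timedelta

ARCHIVE_AFTER_YEARS = 10   # assumed retention period for online detail data

def archive_old_rows(active_rows, archive, today=None):
    """Move detail rows older than the horizon from the active store to the archive."""
    today = today or date.today()
    cutoff = today - timedelta(days=365 * ARCHIVE_AFTER_YEARS)   # approximate cutoff
    still_active = []
    for row in active_rows:
        if row["order_date"] < cutoff:     # row["order_date"] is a datetime.date
            archive.append(row)            # goes to the cheaper offline medium
        else:
            still_active.append(row)       # stays in the active part of the warehouse
    return still_active
```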

4.3 Design Alternatives of Data Warehouses

One of the most desirable aspects of a data warehouse is probably to establish
a system which is specific to the enterprise's individual needs. The different
design alternatives on the basis of different hardware and software speci-
fications range from completely standardized to individually fit solutions.
Flexibility is not only limited to the environment of the database but also to
its different components. The data warehouse structure heavily depends on
the organizational structure of the enterprise and on its current and future
infrastructure. There are centralized as well as scattered solutions. Among
the basic underlying aspects is the influence of the technical infrastructure
and the qualification and experience of the people using the system.
Possible technical data warehouse environments are the classical main-
frame system or the client server architecture. A centralized data warehouse
fits best where the operational data processing systems are centralized. There
is a central creation and update of the data model of the warehouse. All enter-
prise divisions can have easy data access, and the supervision of data access
remains simple. The data warehouse project provides a central solution at
its beginning which can be distributed to different platforms later in order
to increase flexibility, the availability of information, independence and per-
formance [Bis94]. A non-central solution is usually realized as a client server
system. A non-central structure demands larger data administration efforts
and a more complex data model. There are two possibilities of its organi-
zation: the distributed databases may be either individual solutions or they
may be supplemented by a central warehouse. The first possibility allows
all departments of the enterprise to access all locally relevant data. How-
ever, any global data access to another database, for some enterprise-wide
decision support, without a local connection heavily influences the system's
performance. A central data warehouse with common enterprise-wide rele-
vant data may compensate for the performance disadvantages, this, however,
at additional costs for creations, updates and support of the required data.
The local databases contain data at different aggregation levels in order to
answer the queries at all levels.
Among the distributed data warehouse concepts are those particular ones
that have become known as data mart or information factory or the creation
of a virtual data warehouse. An on-line analytical processing database can
be useful for complex, multi-dimensional analyses.

Data marts. Local databases, the so-called data marts, are databases lim-
ited to some enterprise departments such as marketing, controlling, etc. In-
mon [Inm96] considers data marts as departmental level databases. They are
built and adjusted to the specific departmental requirements. Data marts
contain all components and functions of a data warehouse; however, they are
limited to a particular purpose or environment.
The data is usually extracted from the data warehouse and further de-
normalized and indexed to support intense usage by the targeted customers.
Data marts never provide insight into global enterprise information but only
consider the relevant aspects of their particular field of application. Data
marts serve specific user groups. As data marts consider only subsets of the
whole set of data and information, the amount of processed data and the
database are naturally smaller than the corresponding sets of the overall
data warehouse. This advantage is frequently used for local data redundancy,
where data on customers, products, regions or time intervals, etc., are inte-
grated as several copies. In order to provide a reasonable data marting, the
data should be kept separately as long as it reflects the functional or natural
separation of the organization.
Data marts can also be created by decomposing a complete data ware-
house. Inversely, an enterprise-wide database can also be created by com-
posing departmental level data marts. Data marts may be organized as in-
dependently functioning warehouses with data access to their own sources.
Alternatively, the data access may be realized through a central data ware-
house. For consistency purposes the latter is preferable. Semantically there
is no difference between the data model of a data mart or data warehouse.
The data mart design should be in analogy to the design of the database
and should always use the data-inherent structure and clustering if this does
not clash with the access tools. Anahory and Murray [AM97] recommend the
snowflake scheme, integrating possibly different data-types or meta-data on
certain aggregation levels. The data updating of the data marts can be sim-
plified if the technologies are identical and if a data mart only consists of a
subpart of the central data warehouse. Kirchner [Kir97] reports on updating
problems when different data marts are supposed to be updated simultane-
ously.
There are various reasons for using data marts. If there are particular
areas that have to provide frequent access to its data, a local copy of the
data may be useful. Data marts provide the possibility to accelerate the
queries because the amount of data that has to be searched is smaller than in
the global warehouse. The implementation of data marts provides the chance
to structure and partition data, e.g., in a way that the access tools require.
Simultaneously arriving queries in a data warehouse might create problems
which can be avoided by de-coupling them into query clusters that each
address only one data mart. Finally, data marts more easily guarantee
necessary data protection against uncontrolled access by a complete physical
separation of all data. Generally speaking, data marts lead to performance
improvements such as shorter response times and an increased clarity. The
realization of data marts is easier and faster than the development of a global
warehouse concept. Organization of data in the form of data marts is very
useful whenever some data need a very frequent access or whenever user tools
require specific data structures. Mucksch and Behme [MB97] report that data
marts can serve up to 80% of all queries while storing only 20% of the data
of the complete warehouse. In order to achieve a consistent database and an
acceptable performance, any data warehouse should be supplemented by not
more than five data marts [AM97].
Hackney [Hac97] and Simon [Sim98] give guides to understanding and
implementing data marts.

Operational data store. Inmon and Hackathorn [IH94] consider an opera-
tional data store (ODS) as a hybrid data warehouse because an ODS transfers
the concept and effects of a data warehouse down to the operational deci-
sion area of an enterprise. Although the main goal of operational systems is
the rapid processing and updating of transaction-related data, there is still
a need for decision support which is not appropriately provided within the
current systems. This is the field of an ODS, i.e., to provide the basis for
an operational decision-based enterprise management. Obviously, the data
of an ODS are more accurate and more frequently updated than in a data
warehouse. The evaluation and analysis of the data is more accurate because
current detail data is used. Aggregation of data on different levels is limited
to the warehouse. Thus, an ODS can be used to bridge the data transfers
within a warehouse. The data amount of an ODS is rather small compared
to that of a data warehouse, because evaluations and decisions are related to
short-term periods.

Virtual data warehouse. Whenever the decision-making process needs
recent or detailed data, the time horizon or degree of detail of the data ware-
house may be insufficient and an operational data store is required. If the data
store's topicality of information is also unsatisfactory then another modifi-
cation of a warehouse may be used, the virtual data warehouse [MB97]. As
its name suggests, the virtual warehouse is no warehouse in the conventional
sense but is only a concept describing a physical memory of meta-data. In
a virtual warehouse, data is collected from an operational environment in
order to create a logical data warehouse. The concept enables the user to
combine different database systems and different platforms without creating
a new physical database. A virtual data warehouse may thus pave the way
for a first enterprise-wide database. Unfortunately, the operational systems'
workload complicates the establishment of a virtual data warehouse so that
the implementation and technical requirements are quite high.

Web warehousing. Data warehouse solutions which use the world wide
web are called web warehousing [Mat99,Sin98]. The world wide web provides
a huge source of information for the enterprise as well as an easy and fast
data distribution and communication medium. Information collection and
integration into the data warehouse is also called web farming. The internet
is used for data access to external data while enterprise-internal data and
information distribution and access is supported by intranets. Nets for data
and information exchange between cooperating enterprises, and from and
into their intranets are called extranets.

Database systems. There are different database technologies that are ap-
plicable to a data warehouse. They must have the ability to process huge
amounts of data arising from a large variety of different, detailed, aggregated
or historical enterprise information. Relational database systems have been
successfully applied in operational systems and provide a good solution to
data warehouse concepts as well. Relational databases have the advantage
of parallelism and familiarity. Alternatively, other technologies for decision
support have been applied. For instance there are multi-dimensional data-
base management systems that have been developed for the processing of
multi-dimensional data structures in online analytical processing (OLAP).
These database systems process data with respect to their dimensions. In
order to guarantee efficient OLAP queries, they use multi-dimensional in-
dices [GHR+97]. Moreover, there are hybrid database systems that combine
relational as well as multi-dimensional elements in order to process large
data volumes and to provide possibilities for multi-dimensional data analysis
[Sch96].

5 Data Analysis of a Data Warehouse

The data warehouse concept has proved useful for the support of enterprise
planning and decision making through the generation, evaluation and analysis
of relevant data and information. The variety of applications for analysing
and evaluating data stored in a warehouse is as large as the variety of different
environmental and internal problems and tasks.
Many software tools have been integrated into a data warehouse system.
Middleware and gateways allow data to be extracted from different systems. Trans-
formation tools are needed for the correction and modification of data. Other
tools have proved useful for the creation of meta-data. Finally, a large num-
ber of tools are available for retrieval, assessment and analysis purposes. The
following sections first discuss general data evaluation tools and then review
two important data analysis technologies: data mining and online analytical
processing.

5.1 Evaluation Tools

A large number of evaluation tools enable the user to use the data warehouse
easily and effectively [Sch96,Poe97]. It is debatable whether the evaluation
tools of the front-end area do necessarily belong to the set of data warehouse
components. However, they are indispensable for a sensible use of the data
warehouse concept, and the effort for selecting and integrating them must
not be underestimated; the selection of the tools should be done in cooper-
ation with the user. Ryan [Rya00] discusses the evaluation and selection of
evaluation tools.
The manifold user tools for information retrieval in a warehouse can be
classified according to different criteria. The spectrum of tools ranges from
those for simple queries and report functions to the complex tools necessary
for the multi-dimensional analysis of data. One can differentiate between
ad-hoc reporting tools, data analysis tools, EIS-tools and business process
engineering tools as well as navigation elements, which in particular are im-
plemented in all tools.
Query processing techniques are an essential element of many evaluation
tools. There may be ad-hoc as well as standard queries. The knowledge of
frequently required queries can help to prepare and provide a standardized
form in the warehouse in order to accelerate the response time and to increase
the user interface quality. Documents may be memorized by means of some
forms but, additionally, scheduling and retrieval procedures that are neces-
sary for frequent repetitions of assessments should be provided to the user. In
contrast to standard queries, the kind and frequencies of the ad-hoc queries
are difficult to predict and prepare in advance. Data warehouse queries are
sometimes split into three groups: those providing only information, those
that allow a subsequent analysis of information and data, and finally causal
queries. Warehouse query processing aspects have, e.g., been discussed by
Cui and Widom [CW00], Cui et al. [CWW99], O'Neil and Quass [OQ97],
and Gupta et al. [GHQ95].
An important feature of a useful tool is that it allows a comprehensive
warehouse usage without a deeper knowledge of database systems. This is
achieved through a graphic interface which provides either a direct or an
indirect (via an additional level of abstraction) data access. The interme-
diate level of abstraction enables the user to assign his own specific names
to the data or tables. The graphic tool support allows a simple handling
of queries without a detailed knowledge of the SQL language. The results
are finally transformed into data masks or data tables, which are frequently
connected with report generators or various kinds of graphic presentation
systems [Sch96]. Hence, the system supports the user in generating any kind
of business-related reference numbers without requiring specific knowledge of
the underlying system.
Report generators allow an individual report design. Statistical methods
supplement the data warehouse and provide tools ranging from a simple prob-
ability analysis up to various procedures for trend, correlation or regression
analysis or hypothesis tests. Executive information systems (EIS) provide a
structured access to predefined reports containing highly aggregated busi-
ness information. They support the executives' decision making through the
generation of online information based on prepared analyses. An EIS can be
considered an extended decision support system [WHR97]. Spreadsheet sys-
tems integrate various result presentation methods, among others there are
functions, diagrams, charts, and different kinds of three-dimensional forms.

5.2 Data Mining


Information in a data warehouse is frequently hidden because of the huge
amount of data, which, moreover, is continuously increasing as historical
data have to be kept available for a long time. In order to effectively use the
available information automatic tools are required which enable the user to
detect interesting and unknown relations between data. A system collecting
such tools is called a data mining system.
Since data mining provides the potential for an analysis of large, complex
and diffuse data, it perfectly supplements the data warehouse concept. It
is advantageous to mine data from multiple sources to discover as many
interrelationships as possible; data warehouses contain clean and consistent
data from various sources as a prerequisite for mining. The results of data
mining are only useful if there is some way to further investigate the uncovered
patterns; data warehouses provide the capability to go back to the data source
in order to ask new, specific questions.
A data mining system offers efficient methods for data clustering or fil-
tering with respect to significant patterns. In contrast to various procedures
which automatically discover interrelations among data, data mining provides
the user with a tool that allows an individual, interactive analysis. The user
formulates hypotheses and queries which are processed while inspecting data.
Data mining methods are used to detect trends, structures, dependencies,
etc., in large data volumes in order to obtain new information and generate
new knowledge. The major difference to traditional methods of data analy-
sis is that, instead of verifying or rejecting existing hypotheses, data mining
reveals implicitly existing and still hidden information and makes this infor-
mation explicit. Highly automated systems are special-purpose oriented and
their functionality is limited quite narrowly, while user-oriented systems with
a low automatism have an increased flexibility and greater range of use.
Data mining procedures and implementations heavily depend on appli-
cation specific requirements. However, there is a basic common structure
shared by all data mining systems. Through the database interface the se-
lected data become available in the system, the knowledge basis consists of
the problem specific rules, and the focussing area contains the data for the
analysis. The processing, analysis and assessment of data are the next steps.
The important final step is the presentation of results to the user; a graphical
presentation might be integrated into data mining tools or be left for addi-
tional presentation programs. The reliability of the derived results might be
questionable and must therefore be verified by means of statistical evalua-
tions. Data mining systems incorporate mathematical, statistical, empirical
and knowledge-based approaches.
Incomplete databases as well as databases containing a minimal amount
of relevant data limit a successful application of data mining tools. Moreover,
they can lead to false evaluations. To a certain degree, defective or false data
can be detected, filtered and continued to be processed by some data mining
tools. This kind of data cleaning, called scrubbing, is, of course, only possible
to a certain level of destruction and heavily depends on the data quality
and data redundancy. The importance of scrubbing is due to the fact that
data warehouse systems prove most successful when the user can focus on
using the data that are in the warehouse, without having to wonder about
its credibility or consistency.
Data mining has successfully been applied in various business areas, such
as banking, insurance, finance, telecommunication, medicine, or public health
services. As a typical example, the shopping behaviour of customers in a su-
permarket has been examined in order to draw conclusions for the market's
presentation of its products. Type and number of all products in the cus-
tomer's shopping basket have been recorded in order to draw conclusions
with respect to the customer's behaviour. For instance, it might be the case
that customers buying coffee frequently also buy milk, or customers buying
wine frequently also buy cheese. A typical correlation between diaper and
beer has been detected in US supermarkets: men buying diapers tend to buy
beer for themselves too. Conclusions of this kind could lead to an appropriate
location and presentation of the market's products and could even influence
the product mix. In addition, the information is important for estimating the
influence of withdrawing a product from the mix on the sales figures of
other products.
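
The shopping-basket example can be made concrete with a small support and confidence computation. The baskets below are invented and the snippet is only a toy illustration of the idea behind association analysis, not a complete mining algorithm.

```python
from collections import Counter
from itertools import combinations

baskets = [
    {"coffee", "milk", "bread"},
    {"coffee", "milk"},
    {"wine", "cheese"},
    {"diapers", "beer"},
    {"coffee", "bread"},
]

item_counts, pair_counts = Counter(), Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(baskets)
for (a, b), count in pair_counts.items():
    support = count / n                   # how often the pair occurs at all
    confidence = count / item_counts[a]   # P(b in basket | a in basket)
    if support >= 0.2 and confidence >= 0.5:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```
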
Comprehensive overviews on data mining and related tools are provided
by Han and Kamber [HK02], Groth [Gro97b,Gro99], Fayyad et al. [FPS+95],
Cabena [Cab97], Berry and Linoff [BL00], Bigus [Big96], Weiss and Indurkhya
[WI97], Adriaans and Zantinge [AZ96], Westphal and Blaxton [WB98],
Anand [Ana00], Mena [Men99], and Lusti [Lus02]. Data preparation for data
mining is discussed by Pyle [Pyl98].
The following subsections review some commonly used methods for data
mining.

Descriptive statistical methods. Descriptive statistical methods using
probability distributions, correlation analysis, variance analysis, etc., are
helpful in testing and verifying hypotheses which can be generated using
the data mining system. The idea is to define a rule that allows new objects
to be assigned to the appropriate classes.
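
One very simple way to express such a rule is a nearest-centroid assignment: a new object is placed into the class whose mean attribute vector is closest to it. The sketch below is only a minimal illustration of this idea.

```python
def centroid(vectors):
    """Component-wise mean of equally long attribute vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(new_object, classes):
    """Assign new_object to the class with the nearest centroid (squared Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    centroids = {label: centroid(members) for label, members in classes.items()}
    return min(centroids, key=lambda label: dist2(new_object, centroids[label]))

classes = {"low_value": [[1.0, 2.0], [2.0, 1.0]],
           "high_value": [[8.0, 9.0], [9.0, 8.0]]}
print(classify([7.5, 8.0], classes))   # -> "high_value"
```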

Knowledge based methods. There are further methods which are applied
for pattern recognition, e.g., inductive learning, genetic algorithms or neural
networks. Additionally, "if-then" analysis has been found to be useful.

Cluster analysis. Cluster analysis groups data with respect to their at-
tributes so that data in a group are as homogeneous as possible. Basically
there are two ways of clustering: hierarchical clustering and partitioning.
A way of hierarchical clustering is to start off with the two most homoge-
neous elements in order to create the first group of more than one element.
The process continues until a sufficiently small number of groups has been
reached. Other methods of clustering pursue the opposite direction. Groups
are continuously split until a certain level of homogeneity is reached. Hierar-
chical clustering always creates hierarchy trees.
Partitioning groups the data without going through a hierarchical cluster-
ing process. One can think of the objects represented by the data as vertices
of an edge-weighted graph; each positive or negative weight represents some
measure of similarity or dissimilarity, respectively, of the object pair defining
an edge. A clustering of the objects into groups is a partition of the graph's
vertex set into non-overlapping subsets. The set of edges connecting vertices
of different subsets is called a cut. In order to find groups as homogeneous as
possible, positive edges should appear within groups and negative edges in
the cut. Hence, a best clustering is one with a minimal cut weight. Cut min-
imization subject to some additional constraints arises in many applications,
and the literature covers a large number of disciplines, as demonstrated by
the remarkable variety in the reference section of [DJ80].
In general there are two steps to be performed during the clustering pro-
cess. Firstly, some measure of similarity between distinct objects must be
derived and secondly the objects must be clustered into groups according to
these similarities (clique partitioning) [DP94,GW89].
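
The partitioning view can be illustrated with a small greedy heuristic that merges groups as long as the merger adds positive weight inside the groups, i.e., removes positive edges from the cut. This is only a sketch of the principle, not the algorithms referenced above, and the weight matrix is invented.

```python
def greedy_partitioning(n, weights):
    """Greedily cluster objects 0..n-1 given a symmetric similarity matrix.

    weights[i][j] > 0 expresses similarity, < 0 dissimilarity.  Groups are
    merged as long as merging increases the total weight within the groups,
    which is the same as reducing the weight of the cut.
    """
    clusters = [{i} for i in range(n)]
    while True:
        best_gain, best_pair = 0.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                gain = sum(weights[i][j] for i in clusters[a] for j in clusters[b])
                if gain > best_gain:
                    best_gain, best_pair = gain, (a, b)
        if best_pair is None:            # no merge improves the objective
            return clusters
        a, b = best_pair
        clusters[a] |= clusters[b]
        del clusters[b]

# Objects 0/1 and 2/3 are mutually similar, the two groups are dissimilar.
w = [[ 0,  3, -2, -2],
     [ 3,  0, -2, -2],
     [-2, -2,  0,  4],
     [-2, -2,  4,  0]]
print(greedy_partitioning(4, w))         # -> [{0, 1}, {2, 3}]
```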

5.3 Online Analytical Processing

Online analytical processing (OLAP) [Thi97,BS97] is a basic warehouse el-
ement that involves real time access in order to analyze and process multi-
dimensional data such as order information. OLAP tools allow a data pro-
jection from all different perspectives. The term OLAP was coined by
E.F. Codd in 1993. OLAP supplements online transaction processing (OLTP)
which is used in operational systems for processing huge data amounts accord-
ing to predefined transactions. Codd has established 12 rules or quality crite-
ria for an OLAP system: (1) a multi-dimensional conceptual perspective, in-
cluding full support for hierarchies and multiple hierarchies, (2) transparency,
(3) easy access, (4) reasonable response times for reporting, (5) client-server
architecture, (6) parity of dimensions, (7) dynamical administration of low
density matrices, (8) multi-user capabilities, (9) unlimited operations across
dimensions, (10) intuitive data manipulation, (11) flexible reporting, and,
finally, (12) an unlimited number of dimensions and aggregation levels.
Various discussions have led to a modification and extension of Codd's
rules. It is generally agreed that the main purpose of an OLAP system is
the "fast analysis of shared multi-dimensional information" (FASMI). The
word "fast" suggests that the access time of any query using OLAP tools is
constrained to a few seconds. The range of OLAP tools should encompass
statistical analysis and business-related logical aspects which are desired by
the user, such as times series or case studies. All of these are reflected in the
word "analysis". The word "shared" indicates multi-user capabilities. A read-
only access creates no difficulties while a read-write access requires a careful
and limited assignment of the access rights. "Multi-dimensional information"
denotes the ability to process and provide multi-dimensional data irrespective
of the data volume and data sources.
Thomsen [Tho97] gives a guide to implementing systems with OLAP tech-
nology.

Multi-dimensional analysis. Business data usually have multiple dimen-
sions, and the data model must thus be multi-dimensional as well. A simple
example of three dimensions is the time (day, week, year), the product and the
enterprise department. Every dimension corresponds to an axis in the multi-
dimensional space. This leads to a hypercube or multi-dimensional matrix;
its efficient implementation is discussed in Harinarayan et al. [HRU96]. Di-
mensions are often hierarchically structured, e.g., the dimension time horizon
can be structured as year, month, week, day, etc. The interior of the hyper-
cube describes the position of the data or information with respect to their
dimensions. Obviously, the low density of the matrix of information
requires efficient access methods in order to guarantee a high performance
of the access tools. Therefore, one of the important aspects is the possibility
of projecting the hypercube to a lower number of dimensions which allow
alternative sights onto the described data.
Different operations can be applied, making use of the multi-dimensional
matrix representation with a hierarchical dimension structure. The opera-
tions allow an easy access to the data of the hypercube. Among the most
common operations are slicing, dicing, rotating, rolling up, drilling down,
and pivoting. Slicing means considering one particular slice of the cube by
fixing values of some of the data dimensions. Dicing is the reduction of the
hypercube to a smaller hypercube as a sub-cube of the original one. It lim-
its the consideration of data only within certain dimension areas. Rotating
means considering the data in the matrix from different perspectives. Rolling
up describes moving to upper aggregation levels of the data. It provides a
more general view onto the data. Drilling down means the opposite operation
of splitting aggregated data into more detailed data. Pivoting is a special case
of rotating two dimensions. It exchanges two dimensions and therefore allows
the data to be considered from the opposite perspective. Other dimensions are cut
out after a pivoting operation.
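
Assuming the data are available as a flat table, these operations can be illustrated with the pandas library; the column names and figures are invented, and the snippet only shows how slicing, dicing, rolling up and pivoting translate into elementary operations on such a table.

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2001, 2001, 2002, 2002, 2002],
    "product": ["coffee", "wine", "coffee", "wine", "coffee"],
    "region":  ["north", "north", "south", "south", "north"],
    "amount":  [100, 80, 120, 90, 110],
})

slice_2002 = sales[sales["year"] == 2002]                    # slicing: fix one dimension
dice = sales[(sales["year"] == 2002) &
             (sales["region"] == "south")]                   # dicing: a sub-cube
roll_up = sales.groupby("product")["amount"].sum()           # rolling up over year and region
pivot = sales.pivot_table(index="product", columns="year",
                          values="amount", aggfunc="sum")    # pivoting two dimensions

print(roll_up)
print(pivot)
```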

OLAP database and architecture. A few years ago the aforementioned
technology and operations required their own OLAP databases. However, the
situation has changed. Relational OLAP databases, so-called ROLAP, allow
an arbitrary number of dimensions to be represented by means of two-dimensional
tables [GGC97,CG98]. In order to improve the system's response time al-
ternative solutions have been created, called the multi-dimensional OLAP
or MOLAP. Multiple dimensional databases are useful for representation
and processing of data from a multi-dimensional hypercube. A MOLAP sys-
tem represents a database alternative to a relational database whenever the
amount of data is reasonable and data analysis procedures are deduced from
the hypercube operations. A hybrid OLAP system combines relational and
multi-dimensional aspects. The multi-dimensional representation is limited
to highly aggregated data while the relationally stored detail data are still
accessible. Hybrid systems allow a flexible handling of large data sets. Thus,
with respect to the underlying database management system, three variants
exist of the OLAP concept, i.e., the multi-dimensional, the relational and the
hybrid one.
The OLAP architecture consists of a data and a user level which leads to
a differentiation between OLAP-servers and OLAP-clients. The servers are
used for the multi-dimensional and appropriate data view, they define the
basis for the user tools. The multiple dimensions can be realized in two ways:
either in a physical or in a virtual multi-dimensional database. In the virtual
variant, the relational technology is still kept, but for different projections a
level of various transformations is necessary to create the multi-dimensional
structures from relational tables. If the level of transformations is arranged
on the users' machines, it is called a fat-client. However, because of the data
balancing and adjustment problems, a fat-server (thin-client) architecture is
frequently implemented. Its advantage is the possible application of specific
OLAP solutions with a standardized interface to relational technologies. The
OLAP engine on the server has access to the relational system through the
standard interface in order to perform the required data transformations. The
clients only have a presentation task. Modularization and parallelism allow an
easy modification of the system. For the ROLAP applications the modelling
techniques described above such as star, snowflake or starflake schemes are
used in order to keep the transformations simple and the response times low.
The alternative to the virtual solutions is the physical multi-dimensional da-
tabase management system, where the user's view and the realized structures
coincide, but where the low density matrices might lead to non-acceptable
speed slow-downs.
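
How a relational system can carry the dimensions of the hypercube is illustrated below with a small star scheme built with Python's standard sqlite3 module; the table and column names are invented for the example, and a roll-up along the time hierarchy is expressed as an ordinary join with GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One dimension table per axis of the hypercube.
    CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_region  (region_id INTEGER PRIMARY KEY, name TEXT, country TEXT);

    -- The fact table references the dimensions and holds the measures.
    CREATE TABLE sales_fact (
        time_id    INTEGER REFERENCES dim_time(time_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        region_id  INTEGER REFERENCES dim_region(region_id),
        amount     REAL
    );
""")

roll_up = """
    SELECT p.category, t.year, SUM(f.amount) AS total
    FROM sales_fact f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_time    t ON t.time_id    = f.time_id
    GROUP BY p.category, t.year
"""
for row in conn.execute(roll_up):
    print(row)
```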

Front end tools. Equally important for a successful application of the
OLAP concept are appropriate front-end tools which allow an easy naviga-
tion through the data. Some of these tools are OLAP-server specific, others
are generally useful, e.g., spreadsheets. The representation and evaluation of
data are limited to two dimensions while all other dimensions are fixed. Cha-
moni and Gluchowski [CG98] classify the front-end tools in an OLAP concept.
Some tools are standardized but inflexible; other tools allow the modification
of the data model or an extension of the standard software in order to avoid
a modification of the user interface and the user's system environment. Spe-
cific administration tools for multi-dimensional structures can help to avoid
changes in the software engineering and programming user interface. For in-
stance, the HTML language for WWW applications has been extended by
some multi-dimensional commands which are particularly useful for naviga-
tion and less for analysis purposes. Another example is the integration of
business products into enterprise-wide intranets.
Whatever kinds of tools are in use, they need some navigation and visual-
ization features in order to guarantee a clear presentation of the information.
For navigation purposes slicing, dicing, rolling up and drilling down as well
as sorting and selection functions belong to the standard repertoire of oper-
ations. The presentation of information is supposed to be achieved by means
of tables and figures.

6 Building a Data Warehouse

6.1 User Groups

As the first step of a data warehouse project a precise definition of the goals
is indispensable. In general a survey of the needs of the various user groups
is necessary in order to generate the knowledge about the required informa-
tion and data; one of the most difficult problems is to specify the manage-
ment's information needs for the future. When the warehouse is developed
this knowledge is very incomplete and undergoes continuous modifications.
The user of a data warehouse may be characterized with respect to the
management hierarchy within the enterprise. Another classification might be
the users' experience with the data warehouse concept. Poe [Poe97] differen-
tiates the novice or casual user without any or with a very limited computer
science experience. This kind of user needs frequent support and a simple
user interface. The business analyst is a regular user group having a basic
knowledge of the daily requests of information. They are able to use the
system on the basis of the predefined navigation and reporting tools with-
out further special support. The power users are able to specify their own
individual environment by parameters and macro definitions. They are suf-
ficiently qualified to generate individual reports and analysis independently
of the provided support tools. The application developer is the most skillful
user who is responsible for the warehouse environment and the availability
of tools.
Another differentiation of user groups can be achieved if the users' demand
on the warehouse is considered. A frequent and regular use of the warehouse
requires a completely different quality of decision relevant information from
the warehouse than an occasional usage. The design of the user interface has
to observe, however, the needs of the weakest group of occasional users, in
order to avoid their total exclusion from the possible use of the warehouse.
Quality, contents and current or future demands on a warehouse have to
reflect the aspect of usage frequency.
A further user group differentiation arises from the functional differentia-
tion of an enterprise into, e.g., product management, marketing, distribution,
accounting and controlling, finance, etc. For any of these business functions
a standard warehouse can be supplemented with additional, specific tools
and applications or a specific warehouse can be designed. Dyer and Forman
[DF95] discuss how to build systems for marketing analysis. Mentzl and Lud-
wig [ML97] report on the use of a warehouse as a marketing database in order
to improve client care or to quickly recognize trends. The marketing de-
partment might also need an access to geographic information systems for
the generation of client relevant data.
Many users have developed their own databases that meet their needs.
These users may be skeptical whether the new data warehouse can do as
good a job in supporting their reporting needs as their own solutions. The
users possibly feel threatened by the prospect of automation. Users may pre-
fer their own data marts for a variety of reasons. They may want to put
their data on different hardware platforms, or they may not want to work
with other groups on resolving data definition issues. One functional area of
the enterprise may not want another functional area to see or to have access
to their data, e.g., because of concerns about misinterpretations or misun-
derstandings. Besides, disagreements about the correctness of data added or
processed might arise.

6.2 Data Warehouse Projects and Operations


Building a data warehouse is very time-consuming and therefore very expen-
sive. The high costs of a warehouse project are caused by planning and design,
hardware, software, implementation, and the training; in addition there are
the subsequent costs for the continuous use. There is the risk that the desired
goals cannot be achieved and that the warehouse usage remains limited. A
cost estimation for building a data warehouse is extremely difficult because
any solution is highly dependent on the enterprise specifics. Hence, substan-
tial time and effort is being devoted to evaluating data warehousing software
and hardware, but standard solutions are not available.
Organizations undertaking warehousing efforts almost continuously dis-
cover data problems. The process of extracting, cleaning, transforming, and
loading data takes the majority of the time in initial data warehouse devel-
opment; estimates of the average effort for these steps are as high as 80% of
the total time spent for building a warehouse. A very common problem is
that data must be stored which are not kept in any transaction processing
system, and that the data warehouse developer faces the problem of building a
system dedicated to generating the missing information.

On the one hand, many strategic applications of data warehousing have
a short life time and force organizations to develop an inelegant system quickly. On the
other hand, it takes time for an organization to detect how it can change its
business practices to get a substantial return on its warehouse investment.
Thus, the learning curve may be too long for some companies because it
takes a long time to gain experience with the usual problems which arise at
different phases of the data warehousing process.

Prototyping may help to keep the time and costs of a warehouse develop-
ment under control. The warehouse is first constructed for a small, limited
and well-defined business area and later extended to the whole enterprise.
A prototype allows results and the quality of the warehouse characteristics
to be presented quickly, which is quite important in order to gain user
acceptance as early as possible. Additionally, modifications and
corrections of the concepts and goals can be recognized early enough to allow
an appropriate restructuring. Prototyping is a central part of rapid applica-
tion development (RAD) and joint application design (JAD) methodologies.
Consultants are assigned to work directly with the clients and a continuous
collaboration, mentoring, and supervision ensures the desired outcome. The
traditional software development cycle follows a rigid sequence of steps with
a formal sign-off at the completion of each. A complete detailed requirements
analysis is done to capture the system requirements at the very beginning. A
specification step has to be signed-off before the development phase starts.
But the design steps frequently reveal technical infeasibilities or extremely ex-
pensive implementations unknown at the requirements' definition step. RAD
is a methodology for compressing the analysis, design, implementation and
test phases into a series of short, iterative development cycles. The advantage
gained is that iterations allow a self-correction of the complex efforts by small
refinements and improvements. Small teams working in short development it-
erations increase the speed, communication, management, etc. An important
aspect of the iterative improvement steps is that each iteration cycle delivers
a fully functional sub-version of the final system. JAD [WS95,Wet91] centers
around structured workshop sessions. JAD meetings bring together the users
and the builders of the system in order to avoid any delay between ques-
tions and answers. The key people involved are present, so the situation
does not arise that, just when everyone is finally in agreement, one discovers
that even more people should have been consulted because their needs require
something entirely different.

Besides the costs for the warehouse installation one should not underes-
timate maintenance and support costs as well as the personnel costs for the
system's useful application. Large and complex warehouses may take on a
life of their own. Maintaining the warehouse can quickly become a very expensive
task. The more successful the warehouse is with the users, the more main-
tenance it may require. Possibly the enterprise has to introduce new tech-
nologies for the hard- or software. When a data warehouse has been built
questions arise such as: Who should administer the database? Who has re-
sponsibilities for data quality monitoring? Who makes the final decision over
the correctness of data? Who has access to what data? Inmon et al. [IWG97],
Yang and Widom [YW00], Labio et al. [LYG99], Huyn [Huy96,Huy97], Quass
and Widom [QW97], Quass et al. [QGM+96], and Mumick et al. [MQM97]
discuss maintenance issues in data warehouses.

6.3 Products and Services

The Data Warehousing Institute estimates that over 3000 companies offer
data warehouse products and services.

Consulting services. The spectrum of consulting activities ranges from
general data warehousing services, data acquisition, tool selection and im-
plementation to project management [MA00]. There are so many options
that finding the right consultant for the right project at the right time can
be a project itself. The Data Warehousing Institute has collected the most
common mistakes that clients are making when selecting a data warehousing
consultant and has derived the following rules: (1) hire a consultant with the
skills and courage to challenge you; (2) blend analytical and intuitive deci-
sion making into the selection process; (3) use small, trial service packages as
a means to overcome reluctance to use "outsiders"; (4) create a process for
early and frequent feedback; don't bail out too quickly; (5) blend resources;
(6) don't expect miracles; take responsibilities and set realistic expectations;
(7) involve employees from the start to avoid losing commitment; (8) a good
consultant is no substitute for a good leader; bad management leads to bad
consulting; (9) make sure who you see is who you get; (10) personal integrity
on behalf of both parties is ultimately the only way to ensure that promises
are fulfilled.

Products. Numerous data warehouse products and companies are now on
the market, and many companies offer products that fit into multiple cat-
egories. The following list, which is only a collection of some products and
companies, places each company and product into one or two major cate-
gories. One may search the company web sites for information about related
products and products in additional categories.

Among the relational database vendors we name: IBM, Informix, Mi-
crosoft, Oracle, Sybase, SAP. For specialized data warehouses and OLAP
the reader should consult: Hyperion, Oracle Express, Red Brick Systems,
Sagent Technology Inc., The SAS Data Warehouse [Wel98] and the CRM
Methodology for Customer Relationship Management, WhiteLight Systems,
WebOLAP and ShowCase STRATEGY as a solution for data warehouse and
data mart construction on the AS/400 [Kel97a]. Query and data analysis
tools are, e.g., the multi-dimensional data visualization, analysis, and report-
ing software Dimensional Insight, Esperant from the Software AG, Forest &
Trees from Platinum Technology Inc., GQL from Andyne. Moreover, there
is S-Plus, the tool for statistical data analysis and data visualization, and
StatServer for pattern and trend analysis in corporate databases. Among
the data warehouse middleware products we mention Torrent Systems' Or-
chestrate which is a highly scalable framework for business intelligence and
client relationship management applications. MetaSuite from Minerva Soft-
Care provides tools with integrated meta-data management. Constellar is a
data transformation and movement software. IDS Inde and IDS Integration
Server create "yellow pages" for a company's own corporate data. Finally,
Applied Data Resource Management allows the generation of industry-specific busi-
ness models, Syncsort can speed up a data warehouse, and Verity can be
applied in data cleaning and mining.
Hashmi [Has00] and Kaiser [Kai98] discuss SAP information warehousing.
Venerable and Adamson [VA98] discuss data models appropriate for different
types of business problems. For further reading see Whitehorn and Whitehorn
[WW99a], and Sanchez [San98]. For an explanation of the fundamentals and use of
the Microsoft products see [Mic99], Peterson et al. [PPD99], Thomsen et
al. [TSC99], Brosius [Bro99], Craig et al. [CVB99], Corey et al. [CAA+99],
Youness [You00], Ramalho [Ram00]. Data warehousing with Oracle has been
discussed by Yazdani and Wong [YW97], Dodge and Gorman [DG00], Reed
[Ree00], Corey et al. [CAA+98], Hillson and Hobbs [HH99], and Burleson
[Bur97]. The latter discusses implementation, troubleshooting, and performance
enhancement techniques. Corey and Abbey [CA96] review how to use Oracle
technology for data warehousing. Hammergren [Ham97b] gives an overview
on Sybase data warehousing on the Internet.

7 Future Research Directions

Data quality is an important issue in data warehouses. Extensions of the
existing data quality framework are desirable that are able to capture the
specific features of the warehouse. It is necessary to define metrics for data
quality measurement and data reliability. Better methods for making data
consistent, for identifying values that represent the same object differently
and detecting implausible values are needed. Facilities are needed to maintain
information about how the data was obtained.

The complexity of data warehouses is continuously increasing. There has
been insufficient investigation of how far this complexity influences the warehouse
life cycle. The first step towards this goal is a precise definition and
characterization of what constitutes a data warehouse's complexity. This immediately
leads to the question of how to efficiently interconnect different warehouses
without increasing their complexity.
There will be new or extended data warehouse applications in areas such
as earth observation systems, electronic commerce, health care information
systems, digital publishing, etc. In earth observation systems, information
gathered by a collection of satellites has to be integrated with existing data
in order to serve the information needs so that even children will be able
to access, e.g., simulations of the weather. Electronic commerce involves a
very large number of participants interacting over a network, browsing in
catalogs, purchasing goods, or supplying product information. Health care
information systems must provide many different kinds of information, e.g.,
medical information about a patient, that is widely spread over several med-
ical offices. Information about the diagnosis and therapy as well as drugs
and medicine has to be provided; access control and confidentiality becomes
increasingly important. Digital publishing requires the organization of and
access to overwhelming amounts of data and information.

8 Conclusions

Designing, implementing and running a data warehouse involves opportuni-
ties and risks. The use of an enterprise-wide, consistent database is important
for a better and more sensible decision making, a better closeness to the cus-
tomers and an improved business intelligence. However, as Inmon mentions,
the real importance and benefits of a warehouse with their information and
data handling possibilities are difficult to predict a priori [Inm96]. The diffi-
culty lies in how to assess the importance and benefits of a specific information
over time: What is the information's contribution to a certain decision and
what is the information's importance for the enterprise when this particular
decision has been made? An information may be useful and of interest, but it
is difficult to measure its importance for the future of the enterprise. There-
fore there is a need for measurable assessment criteria of the warehouse with
respect to the enterprise-wide development.
One of the most important qualitative aspects of an enterprise-wide, con-
sistent database is that it provides a homogeneous information basis for plan-
ning and decision making. Data, which might be useless if spread over sev-
eral systems and heterogeneous sources in manifold, non-compatible varieties
and structures, are to be collected and processed in order to provide deci-
sion relevant information. Only the creation of this information basis allows
a comprehensive department and enterprise-wide data analysis. Another im-
portant feature of a data warehouse is the data presentation over a long term
in which, for instance, products, production, processes and the environment,
such as customers and markets, can be observed. This leads to a faster, more
efficient and more effective decision making process in the enterprise.
Despite the obvious advantages of a data warehouse solution one should
not overlook the risks. Data warehousing systems store historical data gen-
erated from transaction processing systems. For enterprises competing in a
rapidly extending and dynamical market these data may be only a small part
of the data available to manage an enterprise efficiently. Furthermore, care
must be taken to ensure that data warehousing does not lead to an adminis-
trative and organizational overhead for generating even simple reports and
thus complicates business processes. Data warehousing imposes new respon-
sibilities and tasks and it requires changes that a firm must be comfortable
with.

References
[Ago99] Agosta, L., The essential guide to data warehousing: aligning technol-
ogy with business imperatives, Prentice-Hall, 1999.
[AM97] Anahory, S., Murray, D., Data warehousing in the real world,
Addison-Wesley, 1997.
[Ana00] Anand, S., Foundations of data mining, Addison-Wesley, 2000.
[AV98] Adamson, C., Venerable, M., Data warehouse design solutions, John
Wiley & Sons, 1998.
[AZ96] Adriaans, P., Zantinge, D., Data mining, Addison-Wesley, 1996.
[BA97] Bischoff, J., Alexander, T. (eds.), Data warehouse: practical advice
from the experts, Prentice-Hall, 1997.
[BE96] Barquin, R., Edelstein, H. (eds.), Planning and designing the data
warehouse, Prentice-Hall, 1996.
[BE97] Barquin, R., Edelstein, H. (eds.) , Building, using and managing the
data warehouse, Prentice-Hall, 1997.
[Big96] Bigus, J.P., Data mining with neural networks, McGraw-Hill, 1996.
[Bis94] Bischoff, J., Achieving warehouse success, Database Programming &
Design 7, 1994, 27-33.
[BL00] Berry, M., Linoff, G., Mastering data mining, John Wiley & Sons,
2000.
[Bra96] Brackett, M.H., The data warehouse challenge - taming data chaos,
John Wiley & Sons, 1996.
[Bro99] Brosius, G., Microsoft OLAP services, Addison-Wesley, 1999.
[BS96] Bontempo, C.J., Saracco, C., Database management: principles and
products, Prentice-Hall, 1996.
[BS97] Berson, A., Smith, S.J., Data warehousing, data mining, and OLAP,
McGraw-Hill, 1997.
[Bur97] Burleson, D., High performance Oracle data warehousing, Coriolis
Group, 1997.
[CA96] Corey, M., Abbey, M., Oracle data warehousing, McGraw-Hill, 1996.
[CAA+98] Corey, M., Abbey, M., Abramson, I., Taub, B., Oracle8 data warehousing, McGraw-Hill, 1998.
[CAA+99] Corey, M., Abbey, M., Abramson, I., Venkitachalam, R., Barnes, L.,
Taub, B., SQL Server 7 data warehousing, McGraw-Hill, 1999.
[Cab97] Cabena, P., Discovering data mining: from concept to implementation, Prentice-Hall, 1997.
[CG98] Chamoni, P., Gluchowski, P. (eds.), Analytische Informationssysteme, Springer, Berlin, 1998.
[CVB99] Craig, R.S., Vivona, J.A., Bercovitch, D., Microsoft data warehousing:
building distributed decision support systems, John Wiley & Sons,
1999.
[CW00] Cui, Y., Widom, J., Lineage tracing in a data warehousing system,
Proc. 16th International Conference on Data Engineering, 2000, 683-
684.
[CWW99] Cui, Y., Widom, J., Wiener, J.L., Tracing the lineage of view data
in a data warehousing environment, Technical Report, Stanford Uni-
versity, 1999.
[Deb98] Debevoise, T., The data warehouse method, Prentice-Hall, 1998.
[Dev97] Devlin, B., Data warehouse: from architecture to implementation,
Addison-Wesley, 1997.
[DF95] Dyer, R., Forman, E., An analytic approach to marketing decisions,
Prentice-Hall, 1995.
[DG00] Dodge, G., Gorman, T., Essential Oracle8i data warehousing, John
Wiley & Sons, 2000.
[DJ80] Dubes, R., Jain, A.K., Clustering methodologies in exploratory data
analysis, Advances in Computers 19, 1980, 113-228.
[DP94] Dorndorf, U., Pesch, E., Fast clustering algorithms, ORSA Journal
on Computing 6, 1994, 141-153.
[DS97] Dhar, V., Stein, R., Intelligent decision support methods: the science
of knowledge work, Prentice-Hall, 1997.
[Dyc00] Dyche, J., e-Data: turning data into information with data warehous-
ing, Addison-Wesley, 2000.
[FPS+95] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R.,
Advances in knowledge discovery and data mining, MIT Press, 1995.
[Fra98] Franco, J.M., Le datawarehouse, Eyrolles, 1998.
[GG97] Gabriel, R., Gluchowski, P., Semantische Modellierungstechniken für multidimensionale Datenstrukturen, HMD, Theorie und Praxis der Wirtschaftsinformatik 34, 1997, 18-37.
[GGC97] Gluchowski, P., Gabriel, R., Chamoni, P., Management Support Systeme, Computergestützte Informationssysteme für Führungskräfte und Entscheidungsträger, Springer-Verlag, Berlin, 1997.
[GHQ95] Gupta, A., Harinarayan, V., Quass, D., Aggregate-query processing in data warehousing environments, Proc. 21st Conf. on Very Large Data Bases (VLDB), 1995, 358-369.
[GHR+97] Gupta, H., Harinarayan, V., Rajaraman, A., Ullman, J., Index selection for OLAP, Proc. International Conference on Data Engineering, 1997, 208-219.
[Gio00] Giovinazzo, W., Object-oriented data warehouse design, Prentice-
Hall, 2000.
[GLW+99] Garcia-Molina, H., Labio, W.J., Wiener, J.L., Zhuge, Y., Distributed
and parallel computing issues in data warehousing, Proc. ACM Prin-
ciples of Distributed Computing Conference, 1999, 7-10.
[Gog98] Goglin, J.-F., La construction du datawarehouse, Editions Hermes, 1998.
[Gro97a] Groffmann, H.-D., Das Data Warehouse Konzept, HMD, Theorie und Praxis der Wirtschaftsinformatik 34, 1997, 8-17.
[Gro97b] Groth, R., Data mining: a hands on approach for business professionals, Prentice-Hall, 1997.
[Gro99] Groth, R., Data mining: building competitive advantage, Prentice-Hall, 1999.
[GW89] Grötschel, M., Wakabayashi, Y., A cutting-plane algorithm for a clustering problem, Mathematical Programming 45, 1989, 59-96.
[Hac95] Hackathorn, R.D., Data warehousing energizes your enterprise, Datamation 41, 1995, 38-45.
[Hac97] Hackney, D., Understanding and implementing successful data marts, Addison-Wesley, 1997.
[Hac99] Hackathorn, R.D., Web farming for the data warehouse, Morgan Kaufmann, 1999.
[Ham97a] Hammergren, T.C., Data warehousing: building the corporate knowledgebase, The Coriolis Group, 1997.
[Ham97b] Hammergren, T.C., Official Sybase data warehousing on the internet, The Coriolis Group, 1997.
[Has00] Hashmi, N., Business information warehouse for SAP, Prima Publishing, 2000.
[HBM+96] Humphreys, P., Bannon, L., Migliarese, P., Pomerol, J.-C., McCosh, A., Implementing systems for supporting management decisions, Chapman & Hall, 1996.
[HH99] Hillson, S., Hobbs, L., Oracle8i data warehousing, Digital Press, 1999.
[HHD99] Humphries, M.W., Hawkins, M.W., Dy, M.C., Data warehousing: architecture and implementation, Prentice-Hall, 1999.
[HK02] Han, J., Kamber, M., Data mining - concepts and techniques, Morgan Kaufmann, 2001.
[HLW98] Huang, K.-T., Lee, Y.W., Wang, R.Y., Quality information and knowledge, Prentice-Hall, 1998.
[Hol97] Holthuis, J., Multidimensionale Datenstrukturen - Modellierung, Strukturkomponenten, Implementierungsaspekte, H. Mucksch, W. Behme (eds.), Das Data Warehouse-Konzept, Gabler, 1997, 137-186.
[HRU96] Harinarayan, V., Rajaraman, A., Ullman, J., Implementing data cubes efficiently, Proc. ACM SIGMOD Conference, 1996, 205-216.
[Huy96] Huyn, N., Efficient view self-maintenance, Proc. ACM Workshop on Materialized Views: Techniques and Applications, 1996, 17-25.
[Huy97] Huyn, N., Multiple-view self-maintenance in data warehousing environments, Proc. 23rd Conf. on Very Large Data Bases (VLDB), 1997, 26-35.
[IH94] Inmon, W.H., Hackathorn, R.D., Using the data warehouse, John Wiley & Sons, 1994.
[IIS97] Inmon, W.H., Imhoff, C., Sousa, R., Corporate information factory, John Wiley & Sons, 1997.
[Inm96] Inmon, W.H., Building the data warehouse, 3rd edition, John Wiley & Sons, 2002.
[Inm99] Inmon, W.H., Building the operational data store, John Wiley & Sons, 1999.
[Inm00] Inmon, W.H., Exploration warehousing, John Wiley & Sons, 2000.
[IRB+98] Inmon, W.H., Rudin, K., Buss, C.K., Sousa, R., Data warehouse per-
formance, John Wiley & Sons, 1998.
[IWG97] Inmon, W.H., Welch, J.D., Glassey, K., Managing the data warehouse,
John Wiley & Sons, 1997.
[IZG97] Inmon, W.H., Zachman, J., Geiger, J., Data stores, data warehousing,
and the Zachman framework, McGraw-Hill, 1997.
[JLV+00] Jarke, M., Lenzerini, M., Vassiliou, Y., Vassiliadis, P., Fundamentals
of data warehouses, 2nd edition, Springer-Verlag, 2000.
[Kai98] Kaiser, B.-D., Corporate information with SAP-EIS, Morgan Kauf-
mann, 1998.
[Kel94] Kelly, S., Data warehousing: the route to mass customization, John
Wiley & Sons, 1994.
[Kel97a] Kelly, B.W., AS/400 data warehousing: the complete implementation
guide, Midrange Computing, 1997.
[Kel97b] Kelly, S., Data warehousing in action, John Wiley & Sons, 1997.
[Kim96] Kimball, R., The data warehouse toolkit, John Wiley & Sons, 1996.
[Kir97] Kirchner, J., Transformationsprogramme und Extraktionsprozesse
entscheidungsrelevanter Basisdaten, H. Mucksch, W. Behme (eds.),
Das Data Warehouse-Konzept, Gabler, 1997, 237-266.
[KLM+97] Kawaguchi, A., Lieuwen, D., Mumick, I., Quass, D., Ross, K., Con-
currency control theory for deferred materialized views, Proc. Inter-
national Conference on Database Theory, 1997, 306-320.
[KM00] Kimball, R., Merz, R., The data webhouse toolkit: building the web-
enabled data warehouse, John Wiley & Sons, 2000.
[KRR+98] Kimball, R., Reeves, L., Ross, M., Thornwaite, W., The data ware-
house lifecycle toolkit: tools and techniques for designing, developing
and deploying data marts and data warehouses, John Wiley & Sons,
1998.
[LL96] Laudon, K.C., Laudon, J.P., Management information systems, orga-
nization and technology, 4th edition, Prentice-Hall, New Jersey 1996.
[Lus02] Lusti, M., Data warehousing und Data Mining, 2nd edition, Springer-
Verlag, 2002.
[LYG99] Labio, W.J., Yerneni, R., Garcia-Molina, H., Shrinking the warehouse
update window, Proc. ACM SIGMOD Conference, 1999, 383-394.
[LZW+97] Labio, W.J., Zhuge, Y., Wiener, J.L., Gupta, H., Garcia-Molina, H.,
Widom, J., The WHIPS prototype for data warehouse creation and
maintenance, Proc. ACM SIGMOD Conference, 1997, 557-559.
[MA00] Moss, L., Adelman, S., Data warehouse project management,
Addison-Wesley, 2000.
[Mal94] Mallach, E., Understanding decision support systems and expert sys-
tems, McGraw-Hill, 1994.
[Mar99] Marakas, G., Decision support systems in the 21st century, Prentice-
Hall, 1999.
[Mat96] Mattison, R., Data warehousing: strategies, tools and techniques,
McGraw-Hill, 1996.
[Mat97] Mattison, R., Data warehousing and data mining for telecommunica-
tions, Artech House, 1997.
[Mat99] Mattison, R., Web warehousing and knowledge management,
McGraw-Hill, 1999.
[MB97] Mucksch, H., Behme, W. (eds.), Das Data Warehouse-Konzept, 2nd edition, Gabler, 1997.
[MC98] Meyer, D., Cannon, C., Building a better data warehouse, Prentice-
Hall, 1998.
[Men99] Mena, J., Data mining your website, Digital Press, 1999.
[MHR96] Mucksch, H., Holthuis, J., Reiser, M., Das Data Warehouse-Konzept - ein Überblick, Wirtschaftsinformatik 38, 1996, 421-433.
[MI97] Morse, S., Isaac, D., Parallel systems in the data warehouse, Prentice-
Hall, 1997.
[Mic99] Microsoft Press, Microsoft SQL Server 7.0 data warehousing training
kit, 1999.
[ML97] Mentzl, R., Ludwig, C., Das Data Warehouse als Bestandteil eines
Database Marketing-Systems, H. Mucksch, W. Behme (eds.), Das
Data Warehouse-Konzept, Gabler, 1997, 469-484.
[MQM97] Mumick, I., Quass, D., Mumick, B., Maintenance of data cubes and
summary tables in a warehouse, Proc. ACM SIGMOD Conference,
1997, 100-111.
[ONe94] O'Neil, P., Database: principles, programming, performance, Morgan
Kaufmann, 1994.
[OQ97] O'Neil, P., Quass, D., Improved query performance with variant in-
dexes, Proc. ACM SIGMOD Conference, 1997, 38-49.
[Poe97] Poe, V., Building a data warehouse for decision support, Prentice-
Hall, 1997.
[Pon01] Ponniah, P., Data warehousing fundamentals, John Wiley & Sons, 2001.
[PPD99] Peterson, T., Pinkelman, J., Darroch, R., Microsoft OLAP unleashed,
SAMS, 1999.
[Pyl98] Pyle, D., Data preparation for data mining, Morgan Kaufmann, 1998.
[QGM+96] Quass, D., Gupta, A., Mumick, I., Widom, J., Making views self-
maintainable for data warehousing, Proc. Conference on Parallel and
Distributed Information Systems, 1996, 158-169.
[QW97] Quass, D., Widom, J., On-line warehouse view maintenance for batch
updates, Proc. ACM SIGMOD Conference, 1997, 393-404.
[Ram00] Ramalho, J., Data warehousing with MS SQL 7.0, Wordware, 2000.
[Ree00] Reed, D., Managing the Oracle data warehouse, Prentice-Hall, 2000.
[Rya00] Ryan, C., Evaluating and selecting data warehousing tools, Prentice-Hall, 2000.
[San98] Sanchez, A., Data warehousing with Informix: best practices,
Prentice-Hall, 1998.
[Sau96] Sauter, V.L., Decision support systems, John Wiley & Sons, 1996.
[Sch96] Schreier, U., Verarbeitungsprinzipien in Data-Warehousing-Sy-
stemen, HMD, Theorie und Praxis der Wirtschaftsinformatik 33,
1996, 78-93.
[SIG97] Silverston, L., Inmon, W.H., Graziano, K., The data model resource book: a library of logical data models and data warehouse designs, John Wiley & Sons, 1997.
[Sim98] Simon, A.R., 90 days to the data mart, John Wiley & Sons, 1998.
[Sin97] Singh, H.S., Data warehousing: concepts, technology, and applica-
tions, Prentice-Hall, 1997.
[Sin98] Singh, H.S., Interactive data warehousing via the web, Prentice-Hall,
1998.
[Spe99] Sperley, E., The enterprise data warehouse, vol. 1, Planning, building
and implementation, Prentice-Hall, 1999.
[SW96] Sprague, R.H., Watson, H., Decision support for management,
Prentice-Hall, 1996.
[Tan97] Tanler, R., The intranet data warehouse: tools and techniques for connecting data warehouses to intranets, John Wiley & Sons, 1997.
[Thi97] Thierauf, R.J., On-line analytical processing systems for business,
Quorum Books, 1997.
[Tho97] Thomsen, E., OLAP solutions: building multidimensional information
systems, John Wiley & Sons, 1997.
[TSC99] Thomsen, E., Spofford, G., Chase, D., Microsoft OLAP solutions,
John Wiley & Sons, 1999.
[Tur98] Turban, E., Decision support systems and expert systems, Prentice-
Hall, 1998.
[VA98] Venerable, M., Adamson, C., Data warehouse design solutions, John
Wiley & Sons, 1998.
[WB98] Westphal, C., Blaxton, T., Data mining solutions: methods and tools
for solving real-world problems, John Wiley & Sons, 1998.
[Wel98] Welbrock, P.R., Strategic data warehousing principles using SAS soft-
ware, SAS Institute, 1998.
[Wet91] Wetherbe, J.C., Executive information requirements: getting it right,
MIS Quarterly, 1991.
[WG97] Watson, H., Gray, P., Decision support in the data warehouse,
Prentice-Hall, 1997.
[WHR97] Watson, H.J., Houdeshel, G., Rainer, R.K., Building executive infor-
mation systems and other decision support applications, John Wiley
& Sons, 1997.
[WI97] Weiss, S.M., Indurkhya, N., Predictive data mining: a practical guide,
Morgan Kaufmann, 1997.
[WS95] Wood, J., Silver, D., Joint application development, 2nd edition, John
Wiley & Sons, 1995.
[WW99a] Whitehorn, M., Whitehorn, M., Business intelligence: the IBM solu-
tion, Springer, 1999.
[WW99b] Whitehorn, M., Whitehorn, M., SQL server: data warehousing and
OLAP, Springer-Verlag, 1999.
[You00] Youness, S., Professional data warehousing with SQL Server 7.0 and
OLAP services, Wrox, 2000.
[YW97] Yazdani, S., Wong, S., Data warehousing with Oracle: an administra-
tor's handbook, Prentice-Hall, 1997.
[YW00] Yang, J., Widom, J., Making temporal views self-maintainable for
data warehousing, Proc. 7th International Conference on Extending
Database Technology, 2000, 395-412.
[ZGH+95] Zhuge, Y., Garcia-Molina, H., Hammer, J., Widom, J., View mainte-
nance in a warehousing environment, Proc. ACM SIGMOD Confer-
ence, 1995, 316-327.
[ZGW96] Zhuge, Y., Garcia-Molina, H., Wiener, J.L., The strobe algorithms
for multi-source warehouse consistency, Proc. Conference on Parallel
and Distributed Information Systems, 1996, 146-157.
[ZGW98] Zhuge, Y., Garcia-Molina, H., Wiener, J.L., Consistency algorithms for multi-source warehouse view maintenance, Journal of Distributed and Parallel Databases 6, 1998, 7-40.
[ZWG97] Zhuge, Y., Wiener, J.L., Garcia-Molina, H., Multiple view consistency for data warehousing, Proc. International Conference on Data Engineering, 1997, 289-300.
10. Mobile Computing

Omran Bukhres 1, Evaggelia Pitoura 2, and Arkady Zaslavsky 3

1 Computer Science Department, Purdue University, Indianapolis, U.S.A.


2 Computer Science Department, University of Ioannina, Ioannina, Greece
3 School of Computer Science and Software Engineering, Monash University,
Melbourne, Australia

1. Introduction ..... 433
   1.1 The Challenges of Mobile Computing ..... 433
   1.2 Chapter Outline ..... 437
2. Mobile Computing Infrastructure ..... 437
   2.1 Mobile Computing Architecture ..... 437
   2.2 Taxonomy of Wireless Technologies ..... 439
   2.3 Existing Wireless Technologies ..... 441
3. Mobile Computing Software Architectures and Models ..... 444
   3.1 Adaptivity and Application Awareness ..... 445
   3.2 Multi-Tier Client/Server Models ..... 447
   3.3 Mobile Agents ..... 452
   3.4 Taxonomy ..... 453
4. Disconnected Operation ..... 454
   4.1 Overview ..... 455
   4.2 File Systems ..... 458
   4.3 Database Management Systems ..... 460
5. Weak Connectivity ..... 462
   5.1 File Systems ..... 462
   5.2 Database Systems ..... 465
6. Data Delivery by Broadcast ..... 468
   6.1 Hybrid Delivery ..... 469
   6.2 Organization of Broadcast Data ..... 470
   6.3 Client Caching in Broadcast Delivery ..... 473
   6.4 Cache Invalidation by Broadcast ..... 474
   6.5 Consistency Control in Broadcast Systems ..... 475
7. Mobile Computing Resources and Pointers ..... 476
8. Conclusions ..... 479

Abstract. Mobile computing has emerged as a convergence of wireless communications and computer technologies. Mobile computing systems can be viewed as
a specialized class of distributed systems where some nodes may disengage from
joint distributed operations, move freely in the physical space and re-connect to a
possibly different segment of a computer network at a later stage in order to re-
sume suspended activities. Migrating applications, mobile distributed objects and
agents are also frequently associated with mobile computing. Mobile computing
platforms offer new opportunities at the system software and application levels
and pose many research challenges. This chapter addresses data management is-
sues in mobile computing environments. It analyzes the past and present of mobile
computing, wireless networks, mobile computing devices, architectures for mobile
computing, and advanced applications for mobile computing platforms. It covers
extensively weak connectivity and disconnections in distributed systems as well as
broadcast delivery. The chapter also lists available (at the time of writing) online
mobile computing resources.
1 Introduction

Mobile computing is associated with mobility of hardware, data and software in computer applications. Mobile computing has become possible with
the convergence of mobile communications and computer technologies, which
include mobile phones, personal digital assistants (PDAs), handheld and
portable computers, wireless local area networks (WLAN), wireless wide area
networks and wireless ATMs. The increasing miniaturization of virtually all
system components is making mobile computing a reality [AK93,FZ94].
Wireless networking has greatly enhanced the use of portable comput-
ers. It allows users versatile communication with other people, immediate
notification about important events and convenient access to up-to-date in-
formation, yet with much more flexibility than with cellular phones or pagers.
It also enables continuous access to the services and resources of stationary
computer networks. Wireless networking promises to do for portable com-
puters what traditional networks have done for desktop personal computers.
Networks enable stand-alone personal computers to participate in distributed
systems that allow users anywhere on the network to access shared resources.
With access to a wireless network, mobile users can download news or elec-
tronic documents, query a remote database, send or receive electronic mail,
or even be involved in a real-time video-conference with other users. How-
ever, we have to distinguish between wireless networks and mobile distributed
computing systems. For instance, point-to-point wireless connection between
workstations does not make this system mobile. Using a portable computer
in-flight makes it mobile, but neither wireless nor part of a distributed system.
Mobile computing has emerged and evolved along with wireless commu-
nications, be it radio or infrared communications. Digital wireless communi-
cations were tried as early as 1901 by the Italian physicist Guglielmo Mar-
coni and around the same time by the Russian scientist Professor Alexander
Popov. First attempts to use radio in computer networks were undertaken in
the early 70s during the ALOHA project at the University of Hawaii. Wire-
less computer networks received a powerful boost with the development of
cellular and mobile communication systems in the 80s and then in the early
90s.

1.1 The Challenges of Mobile Computing

The technical challenges that mobile computing must resolve are hardly triv-
ial. Many challenges in developing software and hardware for mobile comput-
ing systems are quite different from those involved in the design of today's
stationary, or fixed network systems [FZ94]. Also the implications of host
mobility on distributed computations are quite significant. Mobility brings
about a new style of computing. It affects both fixed and wireless networks.
On the fixed network, mobile users can establish a connection from different
locations. Wireless connection enables virtually unrestricted mobility and connectivity from any location within radio coverage.
Mobile computing can be viewed from a number of perspectives, as il-
lustrated in Figure 1.1. These views can be mapped onto respective require-
ments, expectations and terminology.

Fig. 1.1. Multiple views on mobile computing (the figure shows four perspectives: systems support, user applications, telecommunications engineering, and networking)

Mobile user location becomes a dynamically changing piece of data. In


this case, the user updates this information, while many others may access
it to find out where the mobile user resides. In the mobile environment, the
location of a user can be regarded as a data item whose value changes with
every move. Establishing a connection requires knowledge of the location of
the party we want to establish a connection with. This implies that locating
a person is the same as reading the location data of that person. Such read
operations may involve an extensive search across the whole network as well
as a database look up. Writing the location may involve updating the location
of the user in the local database as well as in other replicas of this data item
[IB94,PF98].
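To make this data management view of location concrete, the following small Python sketch (the class and method names are hypothetical and not taken from any particular system) models a location directory in which every move is a write that must also be propagated to replicas, while establishing a connection is a read of the location data item.

# Illustrative sketch: mobile-user location managed as replicated data.
# All class and method names are hypothetical.

class LocationDirectory:
    """Stores the current cell of each mobile user, with simple replication."""

    def __init__(self, replicas=None):
        self.entries = {}               # user id -> current cell
        self.replicas = replicas or []  # other LocationDirectory instances

    def write_location(self, user, cell):
        """Called on every move: update the local copy and all replicas."""
        self.entries[user] = cell
        for replica in self.replicas:
            replica.entries[user] = cell   # eager propagation for simplicity

    def read_location(self, user):
        """Called to establish a connection: a read of the location data item."""
        return self.entries.get(user)      # may require a wider search in practice

# Example: user "alice" moves into cell B; both copies now reflect the move.
home = LocationDirectory()
mirror = LocationDirectory()
home.replicas.append(mirror)
home.write_location("alice", "cell-B")
print(home.read_location("alice"), mirror.read_location("alice"))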
One important characteristic of mobile computers is that they have severe power restrictions. A battery represents the largest single source of weight in a portable computer. While reducing battery weight is important, a small battery can undermine the value of portability by causing users to recharge frequently, carry spare batteries, or use their mobile computers only minimally. Minimizing power consumption can improve portability by re-
ducing battery weight and lengthening the life of a charge. Power can be
conserved not only by the design of energy-efficient software, but also by effi-
cient operation [DKL+94,ZZR+98]. Power management software can power
down individual components when they are idle, for example, spinning down
the internal disk or turning off screen lighting. Applications may have to con-
serve power by reducing the amount of computations, communication, and
memory, and by performing their periodic operations infrequently to mini-
mize the start-up overhead. Database applications may use energy efficient
query processing algorithms. Another characteristic of mobile computing is
that the cost of communication is asymmetric between the mobile host and
the stationary host. Since radio modem transmission normally requires about
10 times as much power as the reception operation, power can be saved by substituting a reception operation for a transmission one. For example, a
mobile support station (MSS) might periodically broadcast information that
otherwise would have to be explicitly requested by the mobile host. This
way, mobile computers can obtain this information without wasting power to
transmit a request.
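A rough calculation illustrates the asymmetry. Assuming, as stated above, that a transmission costs about ten times as much energy as a reception (the unit costs in the sketch below are purely illustrative assumptions), obtaining an item from a periodic broadcast avoids the expensive uplink transmission:

# Illustrative energy comparison: explicit request vs. listening to a broadcast.
# The unit costs are assumptions chosen only to reflect the ~10:1 ratio in the text.
RECEIVE_COST = 1.0    # energy units per message received
TRANSMIT_COST = 10.0  # energy units per message transmitted (~10x reception)

# Pull: the mobile host transmits a request and receives the reply.
pull_cost = TRANSMIT_COST + RECEIVE_COST   # 11.0 units

# Push: the mobile host only receives the periodically broadcast item.
push_cost = RECEIVE_COST                   # 1.0 unit

print(f"on-demand request: {pull_cost} units, broadcast: {push_cost} units")
# Under these assumptions the broadcast saves roughly 10 of every 11 units per item.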
Mobile computing is also characterized by frequent disconnections and
the possible dozing of mobile computers. The main distinction between a
disconnection and a failure is its elective nature. In traditional distributed
systems, the loss of connectivity is considered to be a failure and leads to
network partitioning and other emergency procedures. Disconnections in mo-
bile computing, on the other hand, should be treated as planned activities,
which can be anticipated and prepared for. There may be various degrees
of disconnection ranging from a complete disconnection to a partial or weak
disconnection, e.g., a terminal is weakly connected to the rest of the network
via a low bandwidth radio channel. One reason for disconnections is cost, as it is expensive to maintain an idle wireless communication link. Another is that there may be no networking capabilities at the current location. In addition, for some technologies, such as cellular modems, there is a high start-up charge for each communication session [BBI+93,SKM+93]. Moreover, the increasing scale of distributed systems
will result in more frequent disconnections. Disconnections are undesirable
because they may impede computation.
Security and privacy is another major concern in mobile computing. Since
mobile computers appear and disappear on various networks, prevention of
impersonation of one machine by another is problematic. When a mobile com-
puter is taken away from its local environment, the data it sends and receives
are subject to possible theft and unauthorized copying. A network that al-
lows visiting mobile computers to connect cannot perform the type of packet
filtering now used as a security mechanism, since certain foreign packets will
be legitimate packets destined for the visiting mobile host. The administrator
of the foreign environment has security concerns as well. These concerns are
much greater than the current mode of mobile computing in which a user in a
foreign environment is logged into a local guest account from which the user
may have a communication session (e.g., telnet protocol) to his/her home en-
vironment. In the nomadic computing paradigm, a guest machine may harm
its host/server - either accidentally or maliciously [Aso94]. The possibility of
such harm is much greater than that likely caused by the typical user of a
guest account on a fixed network.
Another major issue is establishing a connection when a mobile host has
no prior knowledge about the targeted network [NSZ97]. The point of entry
in a network is through the physical medium or interface to the access point.
The choices of physical medium include radio, infrared, wire/coaxial cable
and optical means. Furthermore, a mobile host needs to communicate using
one of the host network's protocols for meaningful exchange of information
to occur. In addition, networks may have established security schemes. In
order to join the targeted network, information about the "code of behavior"
is normally provided to the incoming member of the community. This ar-
rangement, characteristic of legacy computing systems, works well in a static
environment. This approach does not apply to mobile hosts, which migrate
within and across networks. It is important to note that the complexity of
connectivity depends on the variety of choices presented to the node. For
example at the signal level, there are several choices regarding the medium,
access method and encoding. Also, once a protocol is known, there are several
ways it can be used by the upper layers. A mobile host to start communi-
cating with a network needs to "speak the same language" as the targeted
network. The situation can be likened to visiting an unknown country where
one has no prior knowledge of the language, customs, or behavior but some-
how hopes to communicate and ask for directions, food or any other services.
Such a paradigm can be called "the ET (extraterrestrial) effect" [NSZ97]. A
mobile computer that intends to establish a connection in a foreign computer
network is viewed as an outsider and may have no prior knowledge of how
to instigate communications. This is a situation that will arise over and over
again as people demand computing anywhere without geographic barriers
such as those partially achieved in GSM technology.
Wireless data networks are a natural extension and enhancement to exist-
ing wireline computer networks and services. Wireless data networks support
mobile users who may require remote access to their base computer networks.
Wireless data services and systems represent a rapidly growing and increas-
ingly important segment of the telecommunications industry. It is easy to
notice that current computer applications follow the rapid advancements in
the telecommunications industry. Eventually, information systems will be in-
fluenced by the rapid evolution of the wireless segment of this industry. Since
mobility affects many assumptions upon which today's distributed systems
are based, such systems will have to move to where tomorrow's technology
can support them. Wireless data technology is foreseen to be a main infras-
tructure platform for future applications, which are naturally distributed,
dynamic and require much flexibility and mobility. In mobile computing sys-
tems, the underlying network infrastructure is somewhat different from tra-
ditional distributed systems. Designers of mobile information systems have
much less control over wireless networks since not only the communication
media is provided by telecommunications providers, but also base stations
and servers are part of a proprietary wireless network. For example, location
of base stations is considered commercial information and is unavailable to
application developers.

1.2 Chapter Outline

The remainder of this chapter is structured as follows. Section 2 focuses on


mobile computing infrastructures and enabling wireless technologies. In Sec-
tion 3, software models for building distributed systems for mobile comput-
ing are presented including appropriate multi-tier architectures and mobile
agents. Sections 4 and 5 cover disconnections and weak connectivity, respec-
tively, with an emphasis on their treatment in file and database management
systems. Broadcast-based delivery is presented in Section 6. Section 7 lists
a number of mobile-computing resources most of which are available online,
while Section 8 concludes the chapter.

2 Mobile Computing Infrastructure

The overall architecture of the wireless system represents a massively distributed system that uses the concept of physical or virtual cells. Such cells
can be characterized by physical boundaries or the underlying communication
technology (e.g., frequency). While on the move, the mobile host crosses many
cells and connects to different segments of the communication/computer net-
work. If properly configured, the mobile host may use different technologies
and dynamically adapt to changing bandwidth, quality of service (QoS), user
application demands and physical environments.

2.1 Mobile Computing Architecture


The components of the mobile computing enabling infrastructure are illus-
trated in Figure 2.1. The architectural model consists of two distinct sets of
entities: mobile hosts and fixed hosts. Some of the fixed hosts, called mo-
bile support stations (MSS) [IB94] or home base nodes (HBN) [YZ94] have
a wireless interface to communicate with mobile hosts. The mobile host can
connect to any other fixed host where it can register as a visitor. This fixed
node is called the visitor base node (VBN). The VBN routes all transactions,
messages and communication calls to and from the mobile host to its appro-
priate HBN. The segment of a larger computer network or a geographical
area controlled by a corresponding HBN is called its zone of influence. Fixed
hosts and communication links between them constitute the static or fixed
network, and can be considered to be the reliable part of the infrastructure.
Thus, the general architecture for the network with mobile hosts is a two
tier structure consisting of a potentially high-performance and reliable fixed
network with mobile support stations and a large number of mobile hosts,
which are roaming within and across multiple heterogeneous networks and
are connected by slow and often unreliable wireless links.
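The routing role of the base nodes can be sketched as follows. The Python fragment below is a deliberately simplified, hypothetical model (the class names are not taken from any real system): a mobile host registers with the visitor base node of the zone it has entered, and the VBN forwards traffic for that host to its home base node.

# Simplified sketch of HBN/VBN message routing for a roaming mobile host.
# Class names and behaviour are illustrative assumptions, not a real protocol.

class HomeBaseNode:
    def __init__(self, name):
        self.name = name
        self.mailbox = {}                  # mobile host id -> pending messages

    def deliver(self, host_id, message):
        self.mailbox.setdefault(host_id, []).append(message)

class VisitorBaseNode:
    def __init__(self, name):
        self.name = name
        self.visitors = {}                 # mobile host id -> its HomeBaseNode

    def register(self, host_id, home_base):
        """The mobile host registers as a visitor in this base node's zone."""
        self.visitors[host_id] = home_base

    def route(self, host_id, message):
        """Forward traffic for a visiting host to its home base node."""
        self.visitors[host_id].deliver(host_id, message)

hbn = HomeBaseNode("HBN-1")
vbn = VisitorBaseNode("VBN-7")
vbn.register("mobile-42", hbn)
vbn.route("mobile-42", "incoming call")
print(hbn.mailbox["mobile-42"])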

Fig. 2.1. Heterogeneous mobile computing environment (the figure shows a fixed/wireline network with home base nodes, a mobile GSM connection, a wireless LAN, e.g., Aironet, WaveLAN, Xircom, and cellular data networks, e.g., CDPD, DataTac)

Wireless networks use radio waves or pulsing infrared light to commu-


nicate. Stationary transceivers link the wireless hosts to the wired network
infrastructure. Wireless communications can be affected by the surrounding
physical environment which interferes with the wireless signal, blocking sig-
nal paths and introducing noise and echoes. Wireless communications are
characterized by limited bandwidth, high error rates, and frequent spurious
connections/disconnections. These factors increase communication latency.
This is a result of re-transmissions, re-transmission time-out delays, error
control processing, and short disconnections. Quality of service (QoS) may
be hard to maintain while the mobile host moves across multiple heteroge-
neous wireless networks. Mobility can also cause wireless connections to be
lost or degraded. Users may travel beyond the coverage of a wireless network
or enter areas of high interference. Unlike typical wired networks, the number
of devices in a wireless cell varies dynamically, and large concentrations of
mobile users, say, at conventions, hotels and public events, may overload
network capacity.

Table 2.1. Wireless communication groups

Cordless Telephones: low-mobility, low-power, two-way wireless voice
communications, with low mobility applying both to the range and to the
user's speed.
Cellular Mobile Radio Systems: high-mobility, wide-ranging, two-way wireless
communications, with high mobility applying to vehicular speeds and to
widespread regional to national coverage.
Wide-Area Wireless Data Systems (Mobile Data Systems): high-mobility,
wide-ranging, low-data-rate digital communications to both vehicles and
pedestrians.
High-Speed Wireless Local-Area Networks (WLANs): low-mobility,
high-data-rate data communications within a confined region, e.g., a campus
or a large building; the IEEE 802.11 standard is an attempt to put some order
into this area.
Paging/Messaging Systems: one-way messaging over wide areas.
Satellite-Based Mobile Systems: two-way (or one-way) limited-quality voice
and/or very limited data or messaging to very wide-ranging vehicles (or fixed
locations).

2.2 Taxonomy of Wireless Technologies

Wireless communication services can be grouped into relatively distinct groups [Inc95,EZ97]. The grouping is done with respect to the scale of mo-
bility and communication modes and is summarized in Table 2.1.
Wireless communication technologies can be broadly classified as either
connection-oriented or circuit-switched (e.g., GSM) and as connectionless or
packet-oriented (e.g., radio packet-based). We outline the differences between
these two types of networks below.
To be able to use data and facsimile capabilities over a circuit-switched
handset, a mobile host needs to be connected to a data terminal equipment
(DTE) by a circuit-switched data card. These cards or interfaces differ ac-
cording to the circuit-switched network. They are usually manufactured by
the circuit-switched network providers and third party developers.
Circuit-switched networks are best suited for off-line remote data applica-
tions. A major advantage of using circuit-switched wireless data is that most
of the wireline products would work on wireless circuit-switched data service
cards without major modifications. Circuit-switched type of connection can
be used for transfer of large volumes of data (> 20 Kbytes) and for short
connect time batch transactions. A major disadvantage of circuit switched
networks is the high cost when connecting for a long time or on a regular
basis.
In terms of cost-efficiency, packet switched networks offer an alternative to
circuit switching for data transmission. The burst nature of data traffic leads
to an inefficient utilization of the pre-allocated bandwidth under the wireless
circuit switching technology. Wireless packet switching on the other hand,
allocates transmission bandwidth dynamically, hence allowing an efficient
sharing of the transmission bandwidth among many active users. Although
it is possible to send wireless data streams over dedicated channels by using
circuit-switched cellular networks, such methods are too expensive for most
types of data communications [Inc95]. Packet data networks are well suited
for short data transmissions where the overhead of setting up a circuit is not
warranted for the transmission of data bursts lasting only seconds or less.
In packet switching, data is sent in limited-size blocks called packets. The
information at the sending end is divided into a number of packets and trans-
mitted over the network to the destination, where it is reassembled into its
intended representation. The data is broken into packets of a certain size, for
example, 240 or 512 bytes. Each packet includes the origin and destination
address, allowing multiple users to share a single channel or transmission
path (a small sketch of such a packet structure follows the lists below). Packet
switches use this information to send the packet to the next appropriate
transmission link. The actual route is not specified and does not matter; it
can change in the middle of the process to accommodate a varying network
load. Main advantages of packet-switched networks include [Inc95,Inc96]:

• connectionless networks are designed for data,
• charges are generally based not on connect time but on the volume of data sent,
• there is no call set-up time, which makes applications faster.

Disadvantages of packet-switched networks include:

• existing systems need to be integrated,
• wireless modems are dedicated and can only be used for specific purposes,
• shared bandwidth can result in slow perceived response and execution times if the application is poorly designed.
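The following sketch illustrates the packetization just described; the field names and the chosen packet size are illustrative assumptions rather than the format of any particular network.

# Illustrative packetization: split a message into fixed-size blocks,
# each carrying origin and destination addresses (field names are assumptions).

def packetize(payload: bytes, origin: str, destination: str, size: int = 240):
    """Split payload into packets of at most `size` bytes plus addressing info."""
    packets = []
    for seq, offset in enumerate(range(0, len(payload), size)):
        packets.append({
            "origin": origin,
            "destination": destination,
            "seq": seq,                          # needed to reassemble in order
            "data": payload[offset:offset + size],
        })
    return packets

def reassemble(packets):
    """Reassemble the payload at the destination, whatever route each packet took."""
    return b"".join(p["data"] for p in sorted(packets, key=lambda p: p["seq"]))

message = b"x" * 1000
pkts = packetize(message, origin="mobile-42", destination="server-1")
assert reassemble(pkts) == message
print(len(pkts), "packets of up to 240 bytes each")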

Wireless network technologies will continue to offer more services and


greater flexibility at lower costs. Users will be able to choose from a wide range
of technologies and mix and match wireless with wireline communications in
an effort to meet their needs in the most cost-effective manner.
2.3 Existing Wireless Technologies

Public wireless data networks are provided to the public by service providers
that offer telecommunications services in general. Private networks, used by fleet operators and support services such as emergency services, are also based on these types of networks. These networks use the existing infrastructure of base stations, network control centers, and switches to transmit data. Enterprise systems and third-party service providers can connect host data systems to the wireless networks via wireline communications.
Public packet-switched wireless data networks are more economical to
operate than similar circuit-switched networks. They allow many devices to
share a small number of communication channels. Charges are based on the
amount of data transmitted, not on the connection time. Transmission speeds
vary from 4800 bps to 19.2 Kbps. However, the actual transmission time and
throughput is determined by the network load and overhead and cannot be
precisely specified. Two widely used packet data networks worldwide are
Motorola's DataTac [Inc95] and Ericsson's Mobitex [Inc96].
Cellular digital packet data (CDPD) is another packet-based technology
that transmits data packets over existing analogue cellular networks. It is
ideally suited for established voice cellular analogue network operators who
wish to add wireless data to their existing services. CDPD has the same
in-building coverage as the current voice cellular analogue networks. CDPD
transmits over channels not in use for voice calls, making efficient use of
capacity that would otherwise be wasted. It always relinquishes a channel
when needed for voice. Packet-switched communication is optimized for the
burst like transmission of data. The fact that many CDPD users have the
same channel optimizes the use of scarce radio frequencies. Packet-switched
network resources are only used when data is actually being sent or received.
Depending on the application, CDPD allows for as many as 1,000 users per channel with a bit rate of 19,200 bps [Inc95].
Among circuit-switched networks there are two standards for digital net-
works: Code Division Multiple Access (CDMA) and Time Division Mul-
tiple Access (TDMA), which includes GSM [Bro95]. CDMA, International Standard-95 (IS-95), was adopted as a standard in 1992. A CDMA system is a spread spectrum system: the total occupied RF bandwidth is much larger than that of the information signal. All users share the same range
of radio spectrum and different digital code sequences are used to differenti-
ate between subscriber conversations. Ericsson, the leading TDMA producer,
claims that the CDMA technology is too complex and still years from being
ready for commercial use [Bro95]. PDC is the Japanese digital standard based
on TDMA. PDC is mainly used in Japan. In a TDMA system, a portion of
the frequency spectrum is designed as a carrier and then divided into time
slots. One conversation at a time is assigned a time slot (channel). The chan-
nel is occupied until the call is finished or handed by the system to another
channel (roaming).
Table 2.2. Comparison of digital networks

                         PDC              D-AMPS     CDMA (IS-95)               GSM
Frequency band           800 MHz          800 MHz    800 MHz                    900 MHz
Upbanded                 1.5 GHz          1.9 GHz                               1.8/1.9 GHz
Access method            TDMA             TDMA       Spread spectrum            TDMA
Channel spacing          25 kHz           30 kHz     1.25 MHz                   200 kHz
Frequency reuse factor   4                7          1                          2
Handoff                  Hard             Hard       Hard (no muting),          Hard
                         (audio muting)              Soft (site-site),
                                                     Softer (sector-sector)
International roaming    No               Future     Future                     Yes

The Global System for Mobile communication (GSM) is a digital cellu-


lar telephone system used widely in Europe, South East Asia, Middle East
and Australia. Voice communication is the most basic and important service
provided by GSM. GSM supports circuit-switched data services. Connections
are made to fixed hosts through the telephone network by converting the dig-
ital data to analogue modem tones. GSM allows for encryption, which is not
available with analogue systems. GSM uses the spectrum efficiently because
it can form smaller cells than analogue systems can. Automatic international
roaming between GSM networks is one of the major attractions of the sys-
tem. GSM services are available in two forms: Short Message Service and
Bearer Services. Bearer services provide circuit switched data or facsimile
connections with speeds up to 9600 bps. The short message service (SMS)
is a value-added service that allows users to send short alphanumeric mes-
sages of up to 160 7-bit characters or 140 Octets. SMS allows two-way paging
between GSM handsets or a message center.
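The 160-character limit follows from the 140-octet message payload: 160 characters of 7 bits each occupy 160 x 7 = 1120 bits, i.e., exactly 140 octets. The sketch below checks this arithmetic and shows the bit-packing idea in a simplified form (it is not the exact GSM 7-bit alphabet or septet-packing procedure).

# Why 160 characters fit in 140 octets: 160 * 7 bits = 1120 bits = 140 * 8 bits.
chars, bits_per_char, payload_octets = 160, 7, 140
assert chars * bits_per_char == payload_octets * 8     # 1120 == 1120

def pack_7bit(text: str) -> bytes:
    """Simplified 7-bit packing sketch (not the exact GSM alphabet or encoding)."""
    bitstream = 0
    for i, ch in enumerate(text):
        bitstream |= (ord(ch) & 0x7F) << (7 * i)        # 7 bits per character
    nbytes = (7 * len(text) + 7) // 8
    return bitstream.to_bytes(nbytes, "little")

packed = pack_7bit("A" * 160)
print(len(packed), "octets")                            # -> 140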
Analogue Advanced Mobile Phone Service (AMPS) is a technology for
communicating data over the analogue cellular voice network that is cur-
rently offered by many cellular service providers. Users can send and receive
data at transmission rates up to 14.4 Kbps. Using a laptop computer con-
nected to a wireless modem through the communication port, the user dials
up a connection, much like using a wireline modem. Analogue cellular data
connections are session-based similar to wireline modem connections. Once a
session is established, users pay for the connection time even when no data is being transmitted, for example, when the user is browsing a directory or file or reading an email message. D-AMPS is the digital version of AMPS. Table 2.2 summarizes the available digital networks worldwide.
Paging is a one-way wireless data service that, in addition to beeping
or providing numeric-only information, can also deliver short alphanumeric
messages to pagers. One-way simple paging allows the broadcast of unacknowledged (unconfirmed) data to one or more recipients.
Wireless local area networks (WLANs) allow roaming in limited areas,
typically a department, building, or campus, while maintaining a wireless
connection to the area's wired network, usually an Ethernet backbone. Wire-
less LANs provide the fastest data rates of the wireless networks, generally
between 1 and 12 Mbps. Wireless LANs might be preferable to their wired
counterparts for situations in which wiring is difficult or impractical, or where
some degree of mobility is needed.
Mobile satellite network services fill the gap in situations in which pro-
viding radio coverage with cellular like terrestrial wireless networks is either
not economically viable, such as in remote sparsely-populated areas, or is
physically impractical such as over large bodies of water. Satellite networks
provide global coverage with some trade-offs compared to land-based systems.
More transmitter power is required, the mobile devices are somewhat bulkier,
there is less total channel capacity, and the cost for comparable services is
typically greater. There are three types of satellite based systems available
worldwide [Bro95] :

• LEO satellite systems (Low Earth Orbit) orbit the earth at 500-1000 km.
They require 66 satellites served by 200 ground stations and complete an
orbit in 90-100 minutes. This approach was taken by Motorola's Iridium
project, which was recently discontinued for reasons of economic efficiency.
• GEO satellites (Geostationary) orbit the earth at 36,000 km. They take
23 hours and 56 minutes to complete an orbit, and three GEO satellites
are required for global coverage. This technology uses car-mounted or
pluggable handsets rather than genuine lightweight portables.
• MEO satellite systems (Medium Earth Orbit) orbit the earth at 10,400 km
and consist of 12 satellites. Such a system has been proposed by ICO Global
Communications (formerly Inmarsat-P); it will be capable of 45,000
simultaneous calls and was planned to be in full operation by 2000. A
12-satellite system was planned to be launched by TRW in the US.

There are three main standards in the cordless technology: Digital Euro-
pean Cordless Telephony (DECT), Telepoint (or CT-2), and Personal Handy
Phone System (PHS) [Rap96]. Cordless telecommunications systems are suit-
able mostly in high density business environments. Cordless telecommunica-
tions are of central importance to suppliers of the following two markets:
PABX suppliers will have their lucrative next generation market, and LAN
suppliers would open a new market for wireless LANs.
DECT is suitable for large installations and CT-2 for smaller operations.
Telepoint (or CT-2) was pioneered in the UK back in 1989-1990. CT-2 was subsequently developed further by telecom operators in Hong Kong, France and Canada, which produced successful CT-2 service systems.
Digital European (Enhanced) Cordless Telephony (DECT) is the pan-European specification for short range cordless telephones, wireless PABX, public access services, and local loop replacement. Equipment based on DECT can be
used in offices, homes, factories and in public places. DECT is a TDMA/TDD standard which has all the interference and echo problems of GSM, but with DECT the frequency band is divided into 12 full-duplex channels, with each channel carrying 32 Kbps voice. Ericsson is the main promoter of DECT.
Personal Handy Phone System (PHS) is a Japanese cordless standard. It
can be described as a cross between the cordless phone technology, CT-2 for
example, and a mobile cellular phone. PHS offers two-way connection at a
fraction of the costs of cellular phones. PHS is a single channel TDMAjTDD
system which provides 77 carriers in the band between 1895-1918 MHz, with
four two way channels per carrier. There are 148 working channels as 6 chan-
nels are used for control purposes. PHS is designed from the start to be
Telepoint(CT-2)jPCS as well as wireless PABX.

3 Mobile Computing Software Architectures and Models

The mobile computing environment is constrained in many ways. These se-


vere restrictions have a great impact on the design and structure of mobile
computing applications [BGZ+96,MBZ+97,MBM96] and motivate the devel-
opment of new computing models [PS98]. The restrictions induced by mobile
computing that affect software architectures can be categorized [Sat96a,FZ94,IB94,AK93] into those induced by (a) mobility, (b) wireless com-
munications and (c) portable devices.
The consequences of mobility are numerous. First it results in systems
whose configuration is no longer static. The center of activity, the topology,
the system load, and the notion of locality, all change dynamically. Then, mo-
bility introduces the need for specialized techniques for managing the location
of moving objects. Finally, it causes various forms of heterogeneity. Wireless
networks are more expensive, offer less bandwidth, and are less reliable than
wireline networks. Consequently, connectivity is weak and often intermittent.
Mobile elements must be light and small to be easily carried around. Such
considerations, in conjunction with a given cost and level of technology, will keep mobile elements poorer in resources than static elements, including memory, screen size and disk capacity. Mobile elements must rely for their operation on the finite energy provided by batteries. Even with advances in battery technology, this concern will not cease to exist. Furthermore, mobile elements are more easily accidentally damaged, stolen, or lost. Thus, they are less secure and reliable than static elements.
The first issue to be addressed in deriving mobile computing models is
what type of functionality should be assigned to mobile hosts. Mobile units
are still characterized as unreliable and prone to hard failures, i.e., theft,
loss or accidental damage, and resource-poor relative to static hosts. These reasons justify treating the mobile units as dumb terminals running just a user interface. The InfoPad [NSA+96] and ParcTab [SAG+93] projects employ such a dumb terminal approach and off-load all functionality from
the mobile unit to the fixed network. On the other hand, slow and unreliable
networks argue for putting additional functionality at the mobile hosts to
lessen their dependency on remote servers. Although there is no consensus
yet on the specific role mobile hosts will play in distributed computation, the
above contradictory considerations lead to models that provide for a flexible
adjustment of the functionality assigned to mobile hosts.
The remainder of this section is organized as follows. In Section 3.1, we
present two specific characteristics that mobile computing models must pos-
sess, namely support for adaptivity and application awareness; in Sections 3.2 and 3.3, we discuss in detail several emerging client-server and mobile-agent
based mobile computing models, and in Section 3.4, we summarize.

3.1 Adaptivity and Application Awareness

The mobile environment is a dynamically changing one. Connectivity conditions vary from total disconnections to full connectivity. In addition, the
resources available to mobile computers are not static either, for instance
a "docked" mobile computer may have access to a larger display or mem-
ory. Furthermore, the location of mobile elements changes and so does the
network configuration and the center of computational activity. Thus, a mo-
bile system is presented with resources of varying number and quality. Con-
sequently, a desired property of software systems for mobile computing is
their ability to adapt to the constantly changing environmental conditions
[Kat94,Sat96b,JTK97,FZ94].
But how can adaptivity be captured and realized? A possible answer is
by varying the partition of duties between the mobile and static elements of
a distributed computation. For instance, during disconnection, a mobile host
may work autonomously, while during periods of strong connectivity, the host
will depend heavily on the fixed network sparing its scarce local resources.
Another way to realize adaptivity is by varying the quality of data available
at the mobile host based on the current connectivity. One way to quantify
quality is the notion of fidelity introduced in [NPS95]. Fidelity is defined as
the degree to which a copy of data presented for use at a site matches the
reference copy at the server. Fidelity has many dimensions. One universal
dimension is consistency. Other dimensions depend on the type of data in
question. For example, video data has at least the additional dimensions of
frame rate and image quality. The form of adaptation depends not only on the
type of data but also on the application. Take for example colored images. In
cases of weak connectivity, a web browser may sacrifice both color and pixel
resolution of an image when used for web surfing. However, a viewer used in
a medical application cannot tolerate losing the detail of an image used for
medical diagnosis.
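A minimal sketch of such application-dependent fidelity adaptation is given below. The bandwidth thresholds and fidelity levels are invented for illustration only; the point is that the same drop in bandwidth leads a web browser to accept degraded images while a diagnostic viewer does not.

# Illustrative fidelity policy: degrade image quality with bandwidth,
# except for applications (e.g., medical diagnosis) that cannot tolerate it.
# Thresholds and fidelity levels are assumptions made for this sketch.

def choose_image_fidelity(bandwidth_kbps: float, application: str) -> dict:
    if application == "medical-viewer":
        # Diagnostic detail must be preserved regardless of connectivity.
        return {"color": True, "resolution": "full"}
    if bandwidth_kbps >= 500:
        return {"color": True, "resolution": "full"}
    if bandwidth_kbps >= 50:
        return {"color": True, "resolution": "half"}
    return {"color": False, "resolution": "quarter"}   # weak connectivity

print(choose_image_fidelity(20, "web-browser"))    # grayscale, quarter resolution
print(choose_image_fidelity(20, "medical-viewer")) # full fidelity, fetch may be slower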
An issue germane to system design is where support for mobility and adaptivity should be placed. Should applications be aware of their environ-
ment? Strategies range between two extremes [Sat96a,NPS95]. At one ex-
treme, adaptivity is solely the responsibility of the underlying system and is
performed transparently from applications. In this case, existing applications
continue to work unchanged. However, since there is no single best way to
serve applications with diverse needs, this approach may be inadequate or
even make performance worse than providing no support for adaptivity at
all. For example, consider the following application-transparent way to pro-
vide operation during disconnection. Before disconnection, the most recently used files are preloaded into the mobile host's cache. Upon re-connection, file and directory updates are automatically integrated at the server and any conflicting operations are aborted. This method performs poorly if the application does not exploit any temporal locality in file accesses or if most conflicts are semantically acceptable and can be effectively resolved, for example, in a
calendar application by reconciling conflicting entries. Often, completely hid-
ing mobility from applications is not attainable. For instance, during periods
of long disconnections, applications may be unable to access critical data.
At the other extreme, adaptation is left entirely to individual applica-
tions. No support is provided by the operating system. This approach lacks
a focal point to resolve the potentially incompatible resource demands of dif-
ferent applications or to enforce limits on the usage of resources. In addition,
applications must be written anew. Writing such applications becomes very
complicated.
Application-aware [SNK+95] support for mobility lies in between han-
dling adaptivity solely by applications and solely by the operating system. In
this approach, the operating system co-operates with the application in vari-
ous ways. Support for application awareness places additional requirements to
mobile systems [ARS97]. First, a mechanism is required to monitor the level
and quality of resources and inform applications about any relevant changes
in their environment. Then, applications must be agile [NSN+97,ARS97],
that is able to receive events in an asynchronous manner and react appro-
priately. Finally, there is a need for a central point for managing resources
and authorizing any application-initiated request for their use. Environmental
changes include changes of the location of the mobile unit and the availability
of resources such as bandwidth, memory or battery power.
Informing the application of a location update or a change in the availabil-
ity of a resource involves addressing a number of issues. To name just a few:
how does the system monitor the environment, which environmental changes
are detectable by the system and which by the application, how and when are
any changes detected by the system conveyed to the application. In [WB97],
changes in the environment are modeled as asynchronous events which are
delivered to the application. Events may be detected either within the kernel
or at the user-level. The detection of an event is decoupled from its delivery so
that only relevant events are delivered. In Odyssey [NS95,SNK+95,NPS95],
the application negotiates and registers a window of tolerance with the system
for a particular resource. If availability of that resource rises above or falls
below the limits set in the tolerance window, Odyssey notifies the applica-
tion. Once notified, it is the application's responsibility to adapt its behavior
accordingly.
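
To make the negotiation concrete, the following minimal sketch mimics a
tolerance-window mechanism in the spirit of Odyssey; the class and method
names (ResourceMonitor, register_window, report) are our own illustration,
not Odyssey's actual interface.

    # Sketch of application-aware adaptation via tolerance windows.
    # Names are illustrative, not Odyssey's API.

    class ResourceMonitor:
        def __init__(self):
            # resource name -> (low, high, callback)
            self.windows = {}

        def register_window(self, resource, low, high, callback):
            """The application negotiates a window of tolerance for a resource."""
            self.windows[resource] = (low, high, callback)

        def report(self, resource, availability):
            """Called by the system when it measures resource availability."""
            if resource not in self.windows:
                return
            low, high, callback = self.windows[resource]
            if availability < low or availability > high:
                # Availability left the tolerance window: notify the application,
                # which is then responsible for adapting its behavior.
                callback(resource, availability)

    # Example: a video player lowers its fidelity when bandwidth drops.
    def on_bandwidth_change(resource, value):
        print(f"{resource} now {value} kbit/s: switching fidelity level")

    monitor = ResourceMonitor()
    monitor.register_window("bandwidth", low=64, high=10_000,
                            callback=on_bandwidth_change)
    monitor.report("bandwidth", 32)   # below the window -> application notified
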
Nevertheless, handling mobility spans multiple levels. Take for example
a mobile application that is built on top of a database management sys-
tem that is in turn built on top of an operating system that uses a specific
communication mechanism. At what level should mobility be handled?

3.2 Multi-Tier Client/Server Models

With client/server computing in its simplest form, an application component
executing in one computing system, called the client, requests a service from
an application component executing in another computing system, called
the server. In a wireless setting, the mobile host acts as the client requesting
services from servers located at the fixed network (Figure 3.1(a)). Sometimes,
functionality and data are distributed across multiple servers at different fixed
hosts that may have to communicate with each other to satisfy the client's
request. Frequently, a server is replicated at different sites of the fixed network
to increase availability, performance, and scalability. The traditional client-
server model is based on assumptions that are no longer justifiable in mobile
computing, such as: static clients, reliable and low-latency communications,
and relatively resource-rich and reliable clients.
When there are multiple interconnected servers each covering a different
geographical area, a more flexible treatment of client mobility is possible. In
this case, the client can be attached to the server located closest to it. Mul-
tiple server architectures may form the basis for delivery of a wide range of
personal information services and applications to mobile users including per-
sonalized financial and stock market information, electronic magazines, news
clipping services, travel information, as well as mobile shopping, banking,
sales inventory, and file access [JK94]. The process of transferring the service
of a client from one server to another is termed service handoff [JK94]. The
mapping of a client to a server can be done either transparently from the ap-
plication or the application may be aware of the mapping. In [TD91,KS92],
the mapping of clients to servers is completely transparent to the applica-
tion and is taken care of by an underlying coherence control scheme running
among the servers. In contrast, the approach taken in Bayou [TTP+95] in-
volves the application. In this case, during the same session, the client may
access any copy located at any server.
Next, we describe extensions [PS98] of the traditional client/server model
with components responsible for implementing appropriate optimizations for
disconnected operation and weak connectivity as well as for mobility.

The client/agent/server model. A popular extension to the traditional
client-server model is a three-tier or client/agent/server model [BBI+93],
[FGB+96,Ora97,ZD97,TSS+96], that uses messaging and queuing infrastruc-
ture for communications from the mobile client to the agent and from the
agent to the server (Figure 3.1(b)). Agents are used in a variety of forms
and roles in this model. In one of these forms, the agent acts as a complete
surrogate or proxy of the mobile host on the fixed network. In this case, any
communication to and from the mobile host goes through its agent. The sur-
rogate role of an agent can be generalized by having an agent acting as a
surrogate of multiple mobile hosts [FGB+96]. Another role for the agent is
to provide mobile-aware access to specific services or applications, e.g., web
browsing [HSL98] or database access [Ora97]. In this case, any client's re-
quest and server's reply associated with the specific service is communicated
through this service-specific agent. A mobile host may be associated with as
many agents as the services it needs access to.

Fig. 3.1. Client-Server based models: (a) traditional client/server model, (b)
client/agent/server model

This three-tier architecture somewhat alleviates the impact of the limited
bandwidth and the poor reliability of the wireless link by continuously main-
taining the client's presence on the fixed network via the agent. Furthermore,
agents split the interaction between mobile clients and fixed servers in two
parts, one between the client and the agent, and one between the agent and
the server. Different protocols can be used for each part of the interaction
and each part of the interaction may be executed independently of the other.
Between its surrogate and service-specific roles, various functions may be
undertaken by an agent. Agent functionality includes support for messaging
and queuing for communication between the mobile client and the server. The
agent can use various optimizations for weak connectivity. It can manipulate
the data prior to their transmission to the client [ZD97,FGB+96,TSS+96],
by changing their transmission order so that the most important information
is transferred first, by performing data specific lossy compression that tailors
content to the specific constraints of the client, or by batching together mul-
tiple replies. The agent can also assume a more active role [Ora97,TSS+96],
for instance, it can notify the client appropriately, when application-specific
predefined events occur. To reduce the computation burden on the mobile
client, the agent might be made responsible for starting/stopping specific
functions at the mobile unit or for executing client specific services. For ex-
ample, a complex client request can be managed by the agent with only the
final result transmitted to the client. Located on the fixed network, the agent
has access to high bandwidth links and large computational resources that
it can use for its client's benefit.
To deal with disconnections, a mobile client can submit its requests to
the agent and wait to retrieve the results when connection is re-established.
In the meantime, any requests to the disconnected client can be queued at
the agent to be transferred upon re-connection. The agent can be used in a
similar way to preserve battery life.
The exact position of the agent at the fixed network depends on its role.
Placing the agent at the fringe of the fixed network, i.e., at the base station,
has some advantages especially when the agent acts as the surrogate of the
mobile hosts under its coverage [ZD97,BBI+93]: it is easier to gather infor-
mation for the wireless link characteristics; a special link level protocol can
be used between the mobile host and the agent; and personalized information
about the mobile hosts is available locally. On the other hand, the agent may
need to move along with its mobile host, or the current base station may not
be trustworthy. In the case of service-specific agents, it makes sense to place
them either closer to the majority of their clients or closer to the server.
To accommodate the change in the system configuration induced by client
mobility, there may be a need to move the agents at the fixed network. Again,
relocating the agent depends on the role of the agent. If the agent is service-
specific, a client's request for this service and the associated server's reply is
transmitted through the agent. Moving the agent closer to the client does not
necessarily reduce communication since it may increase the cost of the agent's
interaction with the server especially when the agent serves multiple clients.
When the agent acts as a surrogate of a client, any message to and from the
client passes through the client's agent. In this case, moving the agent along
with the client seems justifiable. Additional support is now needed to manage
information regarding the location of the mobile surrogate. A mobile motion
prediction algorithm to predict the future location of a mobile user according
to the user's movement history is proposed in [LMJ96]. A new proxy is then
pre-assigned at the new location before the mobile user moves in.
While the client/agent/server model offers a number of advantages, it
fails to sustain the current computation at the mobile client during periods
of disconnection. Furthermore, although the server notices no changes, the
model still requires changes to the client code for the development of the
client/agent interaction, rendering the execution and maintenance of legacy
applications problematic. Finally, the agent can directly optimize only data
transmission over the wireless link from the fixed network to the mobile client
and not vice versa.

The pair of agents model. To address the shortcomings of the
client/agent/server model, [SP97,HSL98] propose the deployment of a client-
side agent that will run at the end-user mobile device along with the agent of
the client/agent/server model that runs within the wireline network (Figure
3.2(a)). The client-side agent intercepts client's requests and together with
the server-side agent performs optimizations to reduce data transmission over
the wireless link, improve data availability and sustain uninterrupted the mo-
bile computation. From the point of view of the client, the client-side agent
appears as the local server proxy that is co-resident with the client.
The model provides a clear distinction and separation of responsibili-
ties between the client and the server agents. The communication protocol
between the two agents can facilitate highly effective data reduction and
protocol optimization without limiting the functionality or interoperability
of the client. The model offers flexibility in handling disconnections. For in-
stance, a local cache may be maintained at the client-side agent. The cache
can be used to satisfy the client's requirements for data during disconnec-
tions. Cache misses may be queued by the client-side agent to be served
upon re-connection. Similarly, requests to the client can be queued at the
server-side agent and transferred to the client upon re-connection. Weak con-
nectivity can also be handled in a variety of ways. For example, relocating
computation from the client-side agent to the server-side agent or vice versa
can minimize the effect of weak connectivity. Background prefetching to the
client-side agent can reduce communication during weak connectivity.
The model is more appropriate for heavy-weight clients with sufficient
computational power and secondary storage. The main weakness of the model
is that every application requires development work both at the server and at
the client site. However, there is no need to develop a pair of agents for every
instance of an application. Instead, since the functionality and optimizations

Fig. 3.2. Client-Server based models: (a) pair of agents model, (b) peer-to-peer
model

performed by the agent pair is generic enough, it is only required to develop
a different pair of agents per application type, e.g., file, database, or web
application. For example, prefetching documents at the cache of the client-
side agent follows the same principles independently of the specific type of
web-application.
The idea of employing proxy pairs has lately been gaining some atten-
tion [ZD97,FGB+96,MB96,MB97]. Extensions to RPC [JTK97,BB97,AD93]
can be viewed in the context of this model. In asynchronous queued RPC
[JTK97], when an application issues an RPC, the RPC is stored in a local
stable log at a client-side agent and control is immediately returned to the
application. When the mobile client is connected, the log is drained in the
background and any queued RPCs are forwarded to the server by the client-
side agent. Queuing RPCs leaves space for performing various optimizations
on the log. For instance, the Rover toolkit [JTK97] reorders logged requests
based on consistency requirements and application-specified operation prior-
ities. Delivering any replies from the server to the client at the mobile host
may require multiple retries [JTK97,BB97]. Specifically, if a mobile host is
disconnected between issuing the request and receiving the reply, a server-side
agent may periodically attempt to contact the mobile host and deliver the
reply. A pair-of-agents approach is also employed by WebExpress [HSL98],
an IBM system for optimizing web browsing in a wireless environment. In
this system the client-side agent is called client-side intercept (CSI), while
the server-side agent is called server-side intercept (SSI).
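
As a rough illustration of asynchronous queued RPC, the sketch below logs
RPCs at a client-side agent and drains the log on re-connection, reordering
by an application-specified priority; the names and the in-memory log are
ours and stand in for the stable log and interfaces of systems such as Rover.

    # Sketch of asynchronous queued RPC at a client-side agent.
    # Illustrative only; not the Rover toolkit's interface.
    import collections

    class QueuedRPCAgent:
        def __init__(self):
            self.log = collections.deque()   # in-memory stand-in for the stable log
            self.connected = False

        def invoke(self, operation, args, priority=0):
            """Log the RPC and return control to the application immediately."""
            self.log.append({"op": operation, "args": args, "priority": priority})

        def drain(self, send_to_server):
            """On (re)connection, forward queued RPCs, highest priority first."""
            if not self.connected:
                return
            for rpc in sorted(self.log, key=lambda r: -r["priority"]):
                send_to_server(rpc["op"], rpc["args"])
            self.log.clear()

    agent = QueuedRPCAgent()
    agent.invoke("append_note", {"text": "buy milk"}, priority=1)
    agent.invoke("sync_mail", {}, priority=5)
    agent.connected = True
    agent.drain(lambda op, args: print("forwarding", op, args))
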
Peer-to-peer models. Considering the mobile host as a client is inade-
quate for certain applications. For example, consider the case of two partners
performing co-operative work on some data using their portable computers
[RPG+96]. If applications running at mobile hosts are considered clients, the
partners cannot trade their updates directly. Instead, each mobile host has
to connect with the server machine to be informed of each other's actions.
This may incur excessive communication costs, when the server is located far
away from the clients. Even worse, in cases of physical disconnection of the
clients from the server, there is no way that the two clients can interact with
each other, even when a communication path connecting them is available.
Ideally, in such applications, each site must have the full functionality of both
a client and a server.
In this case, mobile hosts are equal partners in distributed computations.
This kind of model is only appropriate for heavy-weight mobile hosts. Discon-
nections have the additional negative effect of making the server unavailable
to clients requesting its services. Thus, to deal with disconnections and weak
connectivity, a server-side intercept agent must be placed on the mobile host
as well (Figure 3.2(b)). The server-side agent at the mobile host may possess
special features to take into account the fact that the server is running on a
mobile host. For instance, a server at the mobile host cannot be continuously
active because, to conserve power, the mobile host may be switched-off or op-
erating in the doze mode. A mechanism to automatically start applications
on demand [AD93] is useful in such cases.

3.3 Mobile Agents


Besides the functional components of a mobile application, the organization
of data is also important. Data may be organized as a collection of objects.
Objects become the unit of information exchange among mobile and static
hosts. Objects encapsulate not only pure data but also information necessary
for their manipulation, such as operations for accessing them. This feature
makes object-based models very flexible. For instance, objects can encapsu-
late procedures for conflict resolution. Such an object organization can be
built on top of an existing database or file organization, by defining, for
example, an object that consists of a set of files and file operations. The cen-
tral structure in the Rover toolkit [JTK97], is a relocatable dynamic object
(RDO). Mobile clients and fixed servers interchange RDOs. Similarly, in the
Pro-motion infrastructure [WC97], the unit of caching and replication is a
special object called compact.
Incorporating active computations with objects and making them mo-
bile leads to mobile agent models. Mobile agents are processes dispatched
from a source computer to accomplish a specified task [CGH+95,Whi96].
Each mobile agent is a computation along with its own data and execution
state. After its submission, the mobile agent proceeds autonomously and in-
dependently of the sending client. When the agent reaches a server, it is
delivered to an agent execution environment. Then, if the agent possesses
necessary authentication credentials, its executable parts are started. To ac-
complish its task, the mobile agent can transport itself to another server,
spawn new agents, or interact with other agents. Upon completion, the mo-
bile agent delivers the results to the sending client or to another server. The
support that mobile agents provide for intermittent connectivity, slow net-
works, and light-weight devices makes their use in mobile computing very
attractive [CGH+95,PB95a,PSP99].
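
The following sketch conveys the general shape of a mobile agent, bundling a
task with its data and a coarse execution state; it is a toy illustration of
the model, not the API of any particular agent platform, which would add
serialization, migration over the network, and authentication.

    # Sketch of a mobile agent: a computation bundled with its data and state.

    class MobileAgent:
        def __init__(self, task, data):
            self.task = task          # the delegated task (a callable here)
            self.data = data          # the agent's own data
            self.state = "created"    # coarse execution state
            self.result = None

        def execute(self, server):
            """Run inside a server's agent execution environment."""
            self.state = "running"
            self.result = self.task(server, self.data)
            self.state = "done"

        def migrate(self, next_server):
            """Transport itself (code + data + state) to another server."""
            next_server.receive(self)

    class Server:
        def __init__(self, name):
            self.name = name
        def receive(self, agent):
            # A real system would first check the agent's credentials.
            agent.execute(self)

    agent = MobileAgent(task=lambda srv, data: f"looked up {data} at {srv.name}",
                        data="flight fares")
    Server("fixed-host-1").receive(agent)
    print(agent.result)
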
To support disconnected operation, during a brief connection, a
mobile client submits an agent to the fixed network. The agent proceeds in-
dependently to accomplish the delegated task. When the task is completed,
the agent waits till re-connection to submit the result to the mobile client.
Conversely, a mobile agent may be loaded from the fixed network onto a lap-
top before disconnection. The agent acts as a surrogate for the application
allowing interaction with the user even during disconnections. Weak connec-
tivity is also supported by the model since the overall communication traffic
through the wireless link is reduced from a possibly large number of messages
to the submission of a single agent and then of its result. In addition, by let-
ting mobile hosts submit agents, the burden of computation is shifted from
the resource-poor mobile hosts to the fixed network. Mobility is inherent in
the model. Mobile agents migrate not only to find the required resources but
also to follow their mobile clients. Finally, mobile agents provide the flexibil-
ity to load functionality to and from a mobile host depending on bandwidth
and other available resources.
The mobile agent computational paradigm is not orthogonal to the client/
server model and its extensions. The agent of the client/agent/server model
may be implemented as a mobile agent that moves on the fixed network
following its associated client. Mobile agents can be used in conjunction with
agents located at the fixed network. Let's call the agents at the fixed network
proxies for clarity. In this scenario, a client submits a general mobile agent to
its proxy. The proxy refines and extends the mobile agent before launching
it to servers on the network. When the mobile agent finishes its task, it first
communicates its results to the proxy. The proxy filters out any unnecessary
information and transmits to the mobile client only the relevant data. Such
an approach entails enhancing the proxies with capabilities to process mobile
agents. Building on this approach, a proxy may be programmable, that is,
extended with the ability to execute mobile agents submitted to it by clients
or servers. Such an approach is in accordance with current research on active
networks [TSS+96].

3.4 Taxonomy
The agents, placed between the mobile client and the fixed server, alleviate
both the constraints of the wireless link, by performing various communica-
tion optimizations, and of any resource constraints, by undertaking part of

Table 3.1. Types of agents

Surrogate of the Mobile Host: all functionality of the host is off-loaded to
  the agent.
Filter: implements the mobile-specific part of a protocol, e.g., an MPEG-I
  filter that discards MPEG frames, or an RPC filter that implements
  asynchronous queueing RPC.
Service-specific: implements the mobile-specific part of a service/application,
  e.g., an Exmh-agent, a web-agent, or a DB-system agent.
Programmable: understands and can process code or mobile agents sent by
  clients or servers.

the functionality of resource-poor mobile clients. But, at what level do agents
function? Multiple agents handling mobility at different levels may be inserted
on the path between the mobile client and the fixed server. Such agents can
co-operate in various ways. Agents at lower layers may convey information to
agents at higher layers and vice versa. For instance, a transport-layer agent
that queues RPC replies can co-operate with the application-layer agent to
delete unwanted messages from the queue of pending requests or to reorder
the queue based on user-defined priorities.
Another approach is to provide agents, called filters, that operate on
protocols [ZD97] rather than at the application or operating system level.
Such agents may include, for example, an MPEG-agent that discards MPEG
frames or a TCP-agent that optimizes TCP. Since there are fewer protocols
than applications, less development work is required. Applications may control
the agents by turning them on and off. Table 3.1 [PS98] summarizes some of
the roles an agent may play in providing support for mobile computing.

4 Disconnected Operation

Since network disconnections are common in mobile wireless computing,
methods for sustaining the computation at the mobile host uninterrupted
when such a disconnection occurs are central. We discuss how these meth-
ods can be realized in file systems and database management systems. Similar
techniques are also applicable to web-based and workflow applications [PS98].
Fig. 4.1. States of operation

4.1 Overview
Disconnections can be categorized in various ways. First, disconnections may
be voluntary, e.g., when the user deliberately avoids network access to re-
duce cost, power consumption, or bandwidth use, or forced, e.g., when the
portable enters a region where there is no network coverage. Then, discon-
nections may be predictable or sudden. For example, voluntary disconnection
are predictable. Other predictable disconnections include those that can be
detected by changes in the signal strength, by predicting the battery lifetime,
or by utilizing knowledge of the bandwidth distribution. Finally, disconnec-
tions can be categorized based on their duration. Very short disconnections,
such as those resulting from handoffs, can be masked by the hardware or
low-level software. Other disconnections may either be handled at various
levels, e.g., by the file system or an application, or may be made visible to
the user. Since disconnections are very common, supporting disconnected op-
eration, that is allowing the mobile unit to operate even when disconnected,
is a central design goal in mobile computing.
The idea underlying the support for disconnected operation is simple.
When a network disconnection is anticipated, data items and computation
are moved to the mobile client to allow its autonomous operation during dis-
connection. Preloading data to survive a forthcoming disconnection is called
hoarding. Disconnected operation can be described as a transition between
three states [KS92] (Figure 4.1).
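
A minimal rendering of these three states as a transition table might look as
follows; the event names are ours and simply mirror the description above.

    # Sketch of the three states of disconnected operation (Figure 4.1).

    TRANSITIONS = {
        ("hoarding", "disconnect"): "disconnected",
        ("disconnected", "reconnect"): "reintegration",
        ("reintegration", "done"): "hoarding",
    }

    def next_state(state, event):
        return TRANSITIONS.get((state, event), state)

    state = "hoarding"                          # preload data while connected
    state = next_state(state, "disconnect")     # operate on local data only
    state = next_state(state, "reconnect")      # replay the local log at the server
    state = next_state(state, "done")           # back to hoarding
    print(state)                                # -> hoarding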

Data hoarding. Prior to disconnection, the mobile host is in the data hoard-
ing state. In this state, data items are preloaded into the mobile unit. The
items may be simply relocated from the fixed host to the mobile unit. How-
ever, by doing so, these data items become inaccessible to other sites. Alter-
natively, data items may be replicated or cached at the mobile unit. The type
of data objects transfered to a mobile host depends on the application and
the underlying data model. For instance, in cases of file systems, the data
may be files, directories or volumes; in cases of database management sys-
tems, the data may be relations or views; in cases of workflow management
systems, the data may be workflow tasks, and in cases of web-based systems,
HTML documents. In the case of object models, data objects (e.g., files) may
carry with them additional information such as a set of allowable operations
or a characterization of their fidelity. In cases of mobile agent-based models,
objects may carry along active parts to be executed at the mobile client. For
foreseeable disconnections, data hoarding may be performed just before the
disconnection. To sustain less predictable disconnections, hoarding needs to
be deployed on a regular basis, e.g., periodically.
A critical issue in this state is how to anticipate the future needs of the
mobile unit for data. One approach is to allow users to explicitly specify
which data items to hoard. Another approach is to use the past history of
data accesses to predict future needs for data. Which data to hoard also
depends on the application for which the system will be used. For instance,
depending on the intended use of a file system, system files of a text processor
or a compiler may be preloaded. An issue that adds to the complexity of
hoarding is that some sites may need to operate on data items concurrently
with the disconnected site. Taking the probability of conflicting operations
into consideration when deciding which items to hoard may improve the
effectiveness of disconnected operation.

Disconnected operation. Upon disconnection, the mobile unit enters the
disconnected state. While disconnected, the mobile unit can only use local
data. Requests for data that are not locally available cannot be serviced.
Such pending requests may be inserted in an appropriate queue to be serviced
upon re-connection. Applications with unsatisfied requests for data can ei-
ther suspend their execution or continue working on some other job. There
are two approaches regarding updates of shared data during disconnection.
In the pessimistic approach, updates are performed only at one site using
locking or some form of check-in/check-out. In the optimistic approach, up-
dates are allowed at more than one site with the possible danger of conflicting
operations.
Updates at the mobile unit are logged in client's stable storage. An im-
portant issue is what information to keep in the log. The type of information
affects the effectiveness of reintegration of updates upon re-connection as well
as the effectiveness of log optimizations. Optimizing the log by keeping its
size small is important for at least two reasons: (a) to save local memory at
the mobile client, and (b) to reduce the time for update propagation and rein-
tegration during re-connection. Optimization operations on the log can be
performed either (a) during disconnected operation, incrementally each time
a new operation is inserted in the log, or (b) as a preprocessing step before
propagating or applying the log upon re-connection.

Table 4.1. Issues in disconnected operation

Data Hoarding
  Unit of hoarding: depends on the system (e.g., a file or a database fragment)
  Which items to hoard: specified explicitly by the user; induced implicitly
    from the history of past operations; depends on the application for which
    the system is used
  When to perform hoarding: prior to disconnection; on a regular basis

Disconnection
  Requests for data not locally available: raise an exception/error; queue the
    requests for future service
  What to log: data values; timestamps; operations
  When to optimize the log: incrementally; prior to integration
  How to optimize the log: depends on the system

Reintegration
  How to integrate: re-execute an operational log; use application semantics
  How to resolve conflicts: automatic resolution; provide tools to assist
    the user

Reintegration. Upon re-connection, the mobile host enters the reintegration
state. In this state, updates performed at the mobile host are reintegrated
with updates performed at other sites. Update reintegration is usually per-
formed by re-executing the log at the fixed host. Whether the operations
performed at the disconnected sites are accepted depends on the concurrency
semantics adopted by the particular system. Such correctness semantics vary
from enforcing transaction serializability to just resolving concurrent updates
of the same object.

Table 4.1 [PS98] summarizes some of the issues regarding each of the three
states. The complexity of operation in each state depends on the type of the
distributed system and the dependencies among the data operated on. In the
following, we will discuss disconnected operation in distributed file systems
and database management systems.
4.2 File Systems

Most proposals for file system support for disconnected operation are based
on extending cache management to take into account disconnections. Files
are preloaded at the mobile client's cache to be used during disconnection.
Caching to support disconnected operation is different from caching during
normal operation in many respects. First, cache misses cannot be served.
Then, updates at a disconnected client cannot be immediately propagated to
its server. Similarly, a server cannot notify a disconnected client for updates
at other clients. Thus, any updates must be integrated upon re-connection.

Data hoarding. Hoarding is the process of preloading data into the cache
in anticipation of a disconnection, so that the client can continue its op-
eration while disconnected. Hoarding is similar to prefetching used in file
and database systems to improve performance. However, there are impor-
tant differences between hoarding and prefetching. Prefetching is an ongoing
process that transfers to the cache soon-to-be-needed files during periods of
low network traffic. Since prefetching is continuously performed, in contrast
to hoarding, keeping its overhead low is important. Furthermore, hoarding
is more critical than prefetching, since during disconnections, a cache miss
cannot be serviced. Thus, hoarding tends to overestimate the client's need
for data. On the other hand, since the cache at the mobile client is a scarce
resource, excessive estimations cannot be satisfied. An important parameter
is the unit of hoarding, ranging from a disk block, to a file, to groups of files or
directories. Another issue is when to initiate hoarding. The Coda file system
[KS92] runs a process called hoard walk periodically to ensure that critical
files are in the mobile user's cache.
The decision on which files to cache can be either (a) assisted by instruc-
tions explicitly given by the user, or (b) taken automatically by the system by
utilizing implicit information, which is most often based on the past history
of file references. Coda [KS92] combines both approaches in deciding which
data to hoard. Data are prefetched using priorities based on a combination
of recent reference history and user-defined hoard files. A tree-based method
is suggested in [TLA+95] that processes the history of file references to build
an execution tree. The nodes of the tree represent the programs and data files
referenced. An edge exists from parent node A to child node B, when either
program A calls program B, or program A uses file B. A GUI is used to assist
the user in deploying this tracing facility to determine which files to hoard.
Besides clarity of presentation to users, the advantage of this approach is that
it helps differentiate between the files accessed during multiple executions of
the same program. Seer [Kue94] is a predictive caching scheme based on the
user's past behavior. Files are automatically prefetched based on a measure
called semantic distance that quantifies how closely related they are. The
measure chosen is the local reference distance from a file A to a file B. This
distance can be informally defined as the number of file references separating
two adjacent references to A and B in the history of past file references.
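
The reference-distance idea can be sketched as follows: for each reference to
file A, count how many references separate it from the next reference to file
B. This is a simplification of the measure used by Seer, intended only to
illustrate the notion.

    # Sketch of the reference-distance idea behind Seer's semantic distance.
    # A simplification of the measure described in [Kue94].

    def reference_distances(history, a, b):
        """Distances from each reference to `a` to the next reference to `b`."""
        distances = []
        for i, f in enumerate(history):
            if f != a:
                continue
            for j in range(i + 1, len(history)):
                if history[j] == b:
                    distances.append(j - i)
                    break
        return distances

    history = ["A", "C", "B", "A", "B", "D", "A"]
    print(reference_distances(history, "A", "B"))   # [2, 1]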

Disconnected operation. While disconnected, the mobile unit uses solely
data available at its cache. Cache misses are treated as errors and raise ex-
ceptions. Applications can either block on a cache miss or continue working.
If the client is allowed to perform updates, these updates are logged locally.
Issues include determining what type of information to keep in the log and
deriving techniques for optimizing the log.
In Coda, a replay log is kept that records all corresponding system call
arguments as well as the version state of all objects referenced by the call
[KS92]. Two optimizations are performed before a new record is appended to
the log. First, any operation which overwrites the effect of earlier operations
may cancel the corresponding log records. Second, an inverse operation (e.g.,
rmdir) cancels both the inverting and inverted (e.g., mkdir) log records.
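
The two cancellation rules can be sketched as an incremental append operation
on the log; the operation names and rule tables below are illustrative, not
Coda's exact replay-log format.

    # Sketch of incremental replay-log optimization in the spirit of Coda:
    # (1) a later overwriting operation cancels the earlier record, and
    # (2) an inverse operation cancels both the inverted and inverting records.

    OVERWRITES = {"store"}                  # a later store overwrites an earlier one
    INVERSES = {("rmdir", "mkdir"), ("unlink", "create")}

    def append(log, op, target):
        if op in OVERWRITES:
            # Rule 1: drop an earlier record that this operation overwrites.
            log[:] = [r for r in log if not (r[0] == op and r[1] == target)]
        for i, (earlier_op, earlier_target) in enumerate(log):
            if (op, earlier_op) in INVERSES and earlier_target == target:
                del log[i]          # Rule 2: cancel both records ...
                return              # ... so neither needs to be replayed.
        log.append((op, target))

    log = []
    append(log, "store", "/tmp/report")
    append(log, "store", "/tmp/report")    # overwrites the first store
    append(log, "mkdir", "/tmp/scratch")
    append(log, "rmdir", "/tmp/scratch")   # cancels the mkdir and itself
    print(log)                             # [('store', '/tmp/report')]
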
The approach taken in the Little Work project [HH94] suggests applying
rule-based techniques used in compiler peephole optimizers. Such an off-the-
shelf optimizer is used as the basis for performing log optimization. In contrast
to log optimization in Coda, optimization is carried out at a preprocessing
step before reintegrating the log at re-connection. There are two types of
rules: replacement and ordering rules. Replacement rules remove adjacent re-
dundant operations, e.g., a create followed by a move. Ordering rules reorder
adjacent operations so that further replacement rules can be applied. Up-
dates performed at a disconnected site may conflict with operations at other
sites. Thus, updates of data in the cache are considered tentative.

Reintegration. Upon re-connection, any cache updates are incorporated
in the server using the log. In Coda, the replay log is executed as a single
transaction. All objects referenced in the log are locked. Different strategies
are used for handling concurrent updates on files and on directories. This is
because, in contrast to files, there is enough semantic knowledge for directo-
ries to attempt transparent resolution of conflicts [KS93,HH93]. For instance,
directory resolution fails only if a newly created name collides with an exist-
ing name, if an object updated at the client or the server has been deleted by
the other, or if directory attributes have been modified at the server and the
client [KS93]. In Coda, file resolution is based on application-specific resolvers
(ASRs) per file [KS95]. An ASR is a program that encapsulates the knowl-
edge needed for file resolution and is invoked at the client when divergence
among copies is detected. Rules are used to select the appropriate ASR. The
ASR's mutations are performed locally on the client's cache and written back
to the server atomically after the ASR completes. The execution of an ASR is
guaranteed transaction semantics. If no ASR is found or the ASR execution
fails, an error code indicating a conflict is returned. For cases of unresolved
conflicts, a manual repair tool is run on the client.
In the case of file systems, the only conflicts detected are write/write
conflicts because they produce divergent copies. Read/write conflicts are not
considered. Such conflicts occur, for instance, when the value of a file read
by a disconnected client is not the most recent one, because the file has been
updated at the server after the client's disconnection. Extensions to provide
such semantics are discussed in the following section.

4.3 Database Management Systems

As in file systems, to support disconnected operation, data items are
preloaded in mobile clients, prior to a forthcoming disconnection.

Data hoarding. There are many problems that remain open regarding
hoarding in databases. First, what is the granularity of hoarding? The gran-
ularity of hoarding in relational database systems can range from tuples, to
set of tuples, to whole relations. Analogously, in object-oriented database
systems, the granularity can be at the object, set of objects or class (ex-
tension) level. A logical approach would be to hoard by issuing queries; i.e.,
by prefetching the data objects that constitute the answer to a given query.
This, in a sense, corresponds to loading on the mobile unit materialized views.
Then, operation during disconnection is supported by posing queries against
these views.
Another issue is how to decide which data to hoard. In terms of views,
this translates to: how to identify the views to materialize, or how to specify
the hoarding queries that define the views. Then, users may explicitly specify
their preferences by issuing hoarding queries. Alternatively, the users' past
behavior may be used by the system as an indication of the users' future
needs. In such a case, the system automatically hoards the set of most com-
monly used or last referenced items along with items related to the set. Using
the history of past references to deduce dependencies among database items is
harder than identifying dependencies among files. Furthermore, issues related
to integrity and completeness must also be taken into account.
To decide which data to hoard, [GKL+94] proposes (a) allowing users to
assist hoarding by specifying their preferences using an object-oriented query
to describe hoarding profiles, and (b) maintaining a history of references by
using a tracing tool that records queries as well as objects. To efficiently
handle hoarding queries from mobile clients, [BP97] proposes an extended
database organization. Under the proposed organization, the database de-
signer can specify a set of hoard keys along with the primary and secondary
key for each relation. Hoard keys are supposed to capture typical access pat-
terns of mobile clients. Each hoard key partitions the relation into a set of
disjoint logical horizontal fragments. Hoard fragments constitute the hoard
granularity, i.e., clients can hoard and reintegrate within the scope of these
fragments.
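
A toy illustration of hoard-key partitioning is sketched below: the hoard key
splits a relation into disjoint horizontal fragments, and a client hoards whole
fragments. The relation and key are invented for the example.

    # Sketch of hoard-key partitioning into disjoint horizontal fragments.
    from collections import defaultdict

    orders = [   # a toy relation
        {"order_id": 1, "region": "north", "amount": 120},
        {"order_id": 2, "region": "south", "amount": 80},
        {"order_id": 3, "region": "north", "amount": 45},
    ]

    def fragment(relation, hoard_key):
        fragments = defaultdict(list)
        for row in relation:
            fragments[row[hoard_key]].append(row)
        return dict(fragments)

    fragments = fragment(orders, hoard_key="region")
    # A client covering the north hoards just that fragment:
    hoarded = fragments["north"]
    print(len(hoarded))   # 2
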
Disconnected operation. Consistent operation during disconnected opera-
tion has been extensively addressed in the context of network partitioning. In
this context, a network failure partitions the sites of a distributed database
system into disconnected clusters. Various approaches have been proposed
and are excellently reviewed in [DGS85]. In general, such approaches can be
classified along two orthogonal dimensions. The first concerns the trade-off
between consistency and availability. The other dimension concerns the level
of semantic knowledge used in determining correctness. Most of these ap-
proaches should be readily applicable to disconnected operation in mobile
computing, and a study to evaluate their effectiveness would be most valuable.
Revisiting the network partition problem for mobile computing requires
taking into consideration a number of new issues. Network partition is usu-
ally considered in conjunction with peer-to-peer models where transactions
executed in any partition are of equal importance. In mobile computing, how-
ever, transactions at the mobile host are most often considered second-class. A
common trend in mobile computing is to tentatively commit transactions
executed at the disconnected mobile unit and make their results visible to
subsequent transactions in the same unit. Another issue is the frequency of
disconnections. Network partitions correspond to failure behavior, whereas
disconnections in mobile computing are common. The fact that disconnec-
tions are frequent justifies building systems around them. Lastly, most dis-
connections in mobile computing can be considered foreseeable.

Reintegration. Upon re-connection, a certification process takes place, dur-
ing which the execution of any tentatively committed transaction is validated
against an application or system defined correctness criterion. If the criterion
is met, the transaction is committed. Otherwise, the execution of the trans-
action must be aborted, reconciled or compensated. Such actions may have
cascaded effects on other tentatively committed transactions that have seen
the transaction's results.

Case studies. Isolation-only transactions (IOTs) are proposed in [LS94],
[LS95] to provide support for transactions in file systems. An IOT is a se-
quence of file access operations. Transaction execution is performed entirely
on the client and no partial result is visible on the servers. A transaction
T is called a first-class transaction if it does not have any partitioned file
access, i.e., the client machine maintains a connection for every file it has
accessed. Otherwise, T is called a second-class transaction. Whereas the re-
sult of a first-class transaction is immediately committed to the servers, a
second-class transaction remains in the pending state till connectivity is re-
stored. The result of a second-class transaction is held within the client's local
cache and is visible only to subsequent accesses on the same client. Second-
class transactions are guaranteed to be locally serializable among themselves.
A first-class transaction is guaranteed to be serializable with all transactions
that were previously resolved or committed at the server. Upon re-connection,
a second-class transaction T is validated against predefined serialization cri-
teria.
In the two-tier replication schema proposed in [GHN+96], replicated data
have two versions at mobile nodes: master and tentative versions. A master
version records the most recent value received while the site was connected.
A tentative version records local updates. There are two types of transactions
analogous to second- and first-class IOTs: tentative and base transactions.
A tentative transaction works on local tentative data and produces tentative
data. A base transaction works only on master data and produces master data.
Base transactions involve only connected sites. Upon re-connection, tentative
transactions are reprocessed as base transactions. If they fail to meet some
application-specific acceptance criteria, they are aborted and a message is
returned to the mobile node.

5 Weak Connectivity
Weak connectivity is the connectivity provided by slow or expensive networks.
In addition, in such networks connectivity is often lost for short periods of
time. Weak connectivity sets various limitations that are not present when
connectivity is normal and thus instigates revisions of various system proto-
cols. An additional characteristic of weak connectivity in mobile computing
is its variation in strength. Connectivity in mobile computing varies in cost,
provided bandwidth and reliability. Many proposals for handling weak con-
nectivity take this characteristic into consideration and provide support for
operation that adapts to the current degree of connectivity. In such systems,
disconnected operation is just the form of operation in the extreme case of
total lack of connectivity. The aim of most proposals for weak connectivity
is prudent use of bandwidth. Often, fidelity is traded off for a reduction in
communication cost.

5.1 File Systems


In file systems, weak connectivity is dealt with by appropriately revising those
operations whose deployment involves the network.

Overview. In terms of caching, approaches to weak connectivity are cen-
tered around the following three topics that affect bandwidth consumption:

• the handling of cache misses,
• the frequency of propagation to the server of updates performed at the
client's cache, and
• cache updates.
Similar considerations are applicable in the case in which the replicated
files at the weakly connected site are not cached copies but peers, that is, they
are treated equivalently to the copies at the fixed network. Analogously to
caching, issues include: (a) the handling of requests for data items for which
no local replicas are available, (b) the propagation of updates from the local
site to the fixed network, and (c) the currency of the value of local replicas.
There are a number of design choices for handling these issues. We will discuss
them in the context of caching, but the discussion is directly applicable to
replication in general.
Servicing a cache miss may incur very long delays in slow networks or
excessive costs in expensive ones. Thus, cache misses should be serviced se-
lectively based on how critical the required item is and on the current con-
nectivity.
Determining when to propagate cache updates and integrate them at the
server is also an interplay among various factors. Aggressive reintegration re-
duces the effectiveness of log optimizations, because records are propagated
to the server early. Thus, they have less opportunity to be eliminated at
the client. For instance, short-lived temporary files are usually eliminated if
they stay in the log long enough. Early reintegration can also affect the re-
sponse times of other traffic especially in slow networks. On the other hand,
it achieves consistent cache management, timely propagation of updates and
reduces the probability of conflicting operations. Furthermore, early rein-
tegration keeps the size of the log in the client's memory short, thus saving
precious space. In addition, lazy reintegration may overflow the client's cache,
since a cached data item that has been updated cannot be discarded before being
committed at the server.
Regarding the currency of cached items, notifying the client each time an
item is changed at the server may be too expensive in terms of bandwidth.
Postponing the notification results in cache items having obsolete values and
affects the value returned by read operations. Another possibility is to update
cache items on demand, that is, each time a client issues a read operation on
an item. Alternatively, a read operation may explicitly contact the server to
attain the most recent value.
Besides normal, disconnected, and weak connectivity operation, [HH95a]
suggests having one more mode of operation, called fetch-only operation.
While weak connectivity requires continuous network availability such as
that provided by PCS systems or low-speed wireless networks, fetch-only op-
eration does not impose the requirement of continuous network connectivity.
Fetch-only operation is attractive when the network has an associated charge
for connect time, e.g., over a cellular phone or ISDN. In the fetch-only mode,
cache updates are deferred and no cache consistency protocol is used. The
network is only used to satisfy cache misses.
One last issue is how to notify the client that maintains a cache copy of
a data item, that this data item has been updated at the server, when con-
nectivity is intermittent. In such cases, the client cannot rely on the server
sending such notifications. Thus, upon re-connection, the client must validate
its cache against the server's data. Cache invalidation may impose substan-
tial overheads on slow networks. To remedy this problem, [MS94] suggests
increasing the granularity at which cache coherence is maintained. In par-
ticular, each server maintains version stamps for volumes, i.e., sets of files,
in addition to stamps on individual objects. When an object is updated, the
server increments the version stamp of the object and that of its containing
volume. Upon reintegration, the client presents volume stamps for validation.
If a volume stamp is still valid, so is every object cached from that volume.
So, in this case there is no need to check the validity of each file individually.
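
The volume-stamp idea can be sketched as follows; the data structures are
invented for illustration, and a real system would fall back to per-object
validation only for volumes whose stamps no longer match.

    # Sketch of cache validation at volume granularity: if the volume stamp
    # still matches, every cached object from that volume is still valid.

    server_volume_stamp = {"vol1": 7, "vol2": 3}            # maintained at the server
    client_cache = {
        "vol1": {"stamp": 7, "files": ["a.txt", "b.txt"]},  # cached at the client
        "vol2": {"stamp": 2, "files": ["c.txt"]},
    }

    def validate(client_cache, server_volume_stamp):
        stale_files = []
        for vol, entry in client_cache.items():
            if entry["stamp"] == server_volume_stamp[vol]:
                continue                        # whole volume still valid
            stale_files.extend(entry["files"])  # fall back to per-file checks
        return stale_files

    print(validate(client_cache, server_volume_stamp))   # ['c.txt']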

Case studies. In the Coda [MES95] file system, cache misses are serviced
selectively. In particular, a file is fetched only if the service time for the
cache miss, which depends among other factors on bandwidth, is below the user's
patience threshold for this file, e.g., the time the user is willing to wait for
getting the file. Reintegration of updates to the servers is done through trickle
reintegration. Trickle reintegration is an ongoing background process that
propagates updates to servers asynchronously. To maintain the benefits of
log optimization while ensuring a reasonably prompt update propagation,
a technique called aging is used. A record is not eligible for reintegration
until it spends a minimal amount of time, called aging window, in the log.
Transferring the replay log in one chunk may saturate a slow network for an
extended period. To avoid this problem, the reintegration chunk size is made
adaptive, thus bounding the duration of communication degradation. If a file
is very large, it is transferred as a series of fragments, each smaller than the
currently acceptable chunk size.
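
A simplified sketch of trickle reintegration is given below: only records older
than the aging window are shipped, and large records are fragmented so that a
single transfer cannot monopolize a slow link. The constants and record format
are made up for illustration.

    # Sketch of trickle reintegration with an aging window and adaptive chunks.

    AGING_WINDOW = 600        # seconds a record must age before reintegration

    def eligible(log, now):
        return [r for r in log if now - r["logged_at"] >= AGING_WINDOW]

    def chunk_size(bandwidth_kbps, max_degradation_secs=10):
        # Bound how long a single transfer may occupy the slow link.
        return bandwidth_kbps * 1000 // 8 * max_degradation_secs   # bytes

    log = [{"op": "store f", "size": 40_000, "logged_at": 0},
           {"op": "store g", "size": 900_000, "logged_at": 550}]
    now = 700
    for record in eligible(log, now):
        limit = chunk_size(bandwidth_kbps=64)
        if record["size"] > limit:
            print(record["op"], "sent as", -(-record["size"] // limit), "fragments")
        else:
            print(record["op"], "sent whole")
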
In the Little Work project [HH95b], update propagation is performed in
the background. To avoid interference of the replay traffic with other network
traffic, the priority queuing in the network driver is augmented. Three levels of
queuing are used: interactive traffic, other network traffic, and replay traffic.
A number of tickets are assigned to each queue according to the level of
service deemed appropriate. When it is time to transmit a packet, a drawing
is held. The packet in the queue holding the winning ticket is transmitted.
File updates at the servers are propagated to the client immediately through
callbacks. Thus a client opening a file is guaranteed to see the most recently
stored data. Directory updates are tricky to handle, thus only the locally
updated directory is used by mobile clients. Cache misses are always serviced.
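
The ticket-based drawing can be sketched as follows; the queue names and
ticket counts are illustrative, and a real driver would operate on actual
packet buffers.

    # Sketch of ticket-based priority queuing for replay vs. other traffic.
    import random

    queues = {
        "interactive": {"tickets": 6, "packets": ["telnet keystroke"]},
        "other":       {"tickets": 3, "packets": ["nfs read"]},
        "replay":      {"tickets": 1, "packets": ["log record 42"]},
    }

    def draw(queues):
        lottery = [name for name, q in queues.items()
                   for _ in range(q["tickets"]) if q["packets"]]
        if not lottery:
            return None
        winner = random.choice(lottery)
        return queues[winner]["packets"].pop(0)

    # Interactive traffic usually wins, but replay traffic still gets a share.
    print(draw(queues))
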
In the variable-consistency approach [TD91,TD92], a client/server archi-
tecture with replicated servers that follow a primary-secondary schema is
used mainly to avoid global communication, but also works well with weak
connectivity. The client communicates with the primary server only. The pri-
mary makes periodic pickups from the clients it is servicing and propagates
updates back to the secondaries asynchronously. Once some number N of
secondaries have acknowledged receipt of an update, the primary informs the
client that the associated cached update has been successfully propagated
and can be discarded. The traditional read interface is split into strict and
loose reads. Loose read returns the value of the cache copy, if such a copy
exists. Otherwise, loose read returns the value of the copy at the primary,
or any secondary, whichever it finds. In contrast, the strict read call returns
the most consistent value by contacting the necessary number of servers and
clients to guarantee retrieving the most up-to-date copy. If strict read and
write are used exclusively, the system provides one-copy Unix semantics, i.e.,
reads return the value stored by the last write.
Ficus and its descendant Rumor are examples of file systems following a
peer-to-peer architecture [GHM+90,HPG+92,RPG+96]. There is no distinc-
tion between copies at the mobile and copies at the fixed host; all sites store
peer copies of the files they replicate. Updates are applied to any single copy.
The Ficus file system is organized as a directed acyclic graph of volumes. A
volume is a logical collection of files that are managed collectively. Files within
a volume typically share replication characteristics such as replica location
and the number of replicas. A pair-wise reconciliation algorithm is executed
periodically and concurrently with respect to normal file activity. The state
of the local replicated volume is compared to that of a single remote replica
of the volume to determine which files must have updates propagated. The
procedure continues till updates are propagated to all sites storing replicas
of the volume.

5.2 Database Systems


Approaches to handling weak connectivity in database management systems
aim at minimizing communication and surviving short disconnections, simi-
larly to file systems. However, due to the complicated dependencies among
database items, the problem is a complex one.

Overview. The mobile host can play many roles in a distributed database
setting. For example, it may simply submit operations to be executed on a
server or an agent at the fixed network [JBE95,YZ94,DHB97,M097,BMM98].
In this case, it may either submit to the fixed server operations of a transac-
tion one at a time sequentially or the whole transaction as one atomic unit
[JBE95]. In [YZ94], the second approach is taken. Each mobile client submits
a transaction to a coordinating agent. Once the transaction has been submit-
ted, the coordinating agent schedules and coordinates its execution on behalf
of the mobile client. A different approach to the role of the mobile host is to
allow local database processing at the mobile host. Such an approach is nec-
essary to allow autonomous operation during disconnection but complicates
data management and may cause unacceptable communication overheads.
Concurrency control in the case of distributed transactions that involve
both mobile and fixed hosts is complicated. For transactions that access data
at both mobile and stationary hosts, accessing the wireless link imposes large
overheads. Take for instance, the case of a pessimistic concurrency control
protocol that requires transactions to acquire locks at multiple sites. In this
case, transactions may block if they request locks at sites that get discon-
nected or if they request locks held by transactions at disconnected sites. On
the other hand, techniques such as timestamps may lead to a large number
of transactions being aborted because operations may be overly delayed in
slow networks.
To avoid delays imposed by the deployment of slow wireless links, open-
nested transaction models are more appropriate [Chr93]. According to these
models, a mobile transaction that involves both stationary and mobile hosts
is not treated as one atomic unit but rather as a set of relatively independent
component transactions, some of which run solely at the mobile host. Compo-
nent transactions can commit without waiting for the commitment of other
component transactions. In particular, as in the disconnected case, transac-
tions that run solely at the mobile host are only tentatively committed at
the mobile host and their results are visible by subsequent local transactions.
These transactions are certified at the fixed hosts, i.e., checked for correctness,
at a later time. Fixed hosts can broadcast to mobile hosts information about
other committed transactions prior to the certification event, as suggested
in [Bar97]. This information can be used to reduce the number of aborted
transactions.

Case studies. Transactions that run solely at the mobile host are called
weak in [PB95b,Pit96,PB99], while the rest are called strict. A distinction
is drawn between weak copies and strict copies. In contrast to strict copies,
weak copies are only tentatively committed and hold possibly obsolete values.
Weak transactions update weak copies, while strict transactions access strict
copies. Weak copies are integrated with strict copies either when connectiv-
ity improves or when an application-defined limit to the allowable deviation
among weak and strict copies is passed. Before reconciliation, the result of
a weak transaction is visible only to weak transactions at the same site.
Applications at weakly connected sites may choose to issue strict transactions
when they require strict consistency. Strict transactions are slower than weak
transactions since they involve the wireless link but guarantee permanence
of updates and currency of reads. During disconnection, applications can use
only weak transactions. In this case, weak transactions have similar semantics
with second-class IOTs [LS95] and tentative transactions [GHN+96]. Adapt-
ability is achieved by adjusting the number of strict transactions and the
degree of divergence among copies based on the current connectivity.
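
One way to picture this adaptation is sketched below: weak updates accumulate
locally and are reconciled either when connectivity becomes strong or when an
application-defined divergence limit is exceeded. The thresholds and class are
our own illustration, not the cited protocol.

    # Sketch of adapting between weak and strict operation.

    DIVERGENCE_LIMIT = 5      # max tentatively committed weak updates
    GOOD_BANDWIDTH = 256      # kbit/s treated as "strong" connectivity

    class WeakCopyManager:
        def __init__(self):
            self.pending_weak_updates = []

        def weak_write(self, item, value, bandwidth):
            # Weak transactions update weak copies and only tentatively commit.
            self.pending_weak_updates.append((item, value))
            if (bandwidth >= GOOD_BANDWIDTH
                    or len(self.pending_weak_updates) > DIVERGENCE_LIMIT):
                self.reconcile()

        def reconcile(self):
            # Ship tentative updates to the strict copies at the fixed network.
            print("reconciling", len(self.pending_weak_updates), "weak updates")
            self.pending_weak_updates.clear()

    mgr = WeakCopyManager()
    for i in range(7):
        mgr.weak_write("x", i, bandwidth=32)   # weak connectivity throughout
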
The approach taken in Bayou [TDP+94,TTP+95,DPS+94] does not sup-
port transactions. Bayou is built on a peer-to-peer architecture with a number
of replicated servers weakly connected to each other. In this schema, a user
application can read-any and write-any available copy. Writes are propagated
to other servers during pair-wise contacts called anti-entropy sessions. When
a write is accepted by a Bayou server, it is initially deemed tentative. As in
two-tier replication [GHN+96], each server maintains two views of the data-
base: a copy that only reflects committed data and another full copy that also
reflects the tentative writes currently known to the server. Eventually, each
write is committed using a primary-commit scheme. That is, one server des-
ignated as the primary takes responsibility for committing updates. Because
servers may receive writes from users and other servers in different orders,
servers may need to undo the effects of some previous tentative execution of
a write operation and re-apply it. The Bayou system provides dependency
checks for automatic conflict detection and merge procedures for resolution.
Instead of transactions, Bayou supports sessions. A session is an abstraction
for a sequence of read and write operations performed during the execution of
an application. Session guarantees are enforced to avoid inconsistencies when
accessing copies at different servers; for example, a session guarantee may be
that read operations reflect previous writes or that writes are propagated
after writes that logically precede them. Different degrees of connectivity are
supported by individually selectable session guarantees, choices of commit-
ted or tentative data, and by placing an age parameter on reads. Arbitrary
disconnections among Bayou's servers are also supported since Bayou relies
only on pair-wise communication. Thus, groups of servers may be discon-
nected from the rest of the system yet remain connected to each other.
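To make the two-view organization concrete, the following Python sketch (our own simplification, not Bayou's actual interface) keeps a committed copy and derives the full copy by replaying the tentative writes known to a server; when the primary dictates the commit order, the remaining tentative writes are simply re-applied on top of the new committed state, which is one way to realize the undo/re-apply behaviour described above.

    # Simplified sketch of a Bayou-like server view (illustrative names only).
    class ReplicaView:
        def __init__(self):
            self.committed = {}         # copy reflecting only committed writes
            self.tentative_log = []     # writes known to the server but not yet committed

        def accept_write(self, key, value):
            # Any write received from a user or another server is tentative at first.
            self.tentative_log.append((key, value))

        def commit(self, writes_in_primary_order):
            # The primary decides the final order; committed writes leave the log.
            for write in writes_in_primary_order:
                key, value = write
                self.committed[key] = value
                if write in self.tentative_log:
                    self.tentative_log.remove(write)

        def full_copy(self):
            # Committed snapshot plus tentative writes; re-deriving the full copy
            # stands in for undoing and re-applying tentative executions.
            view = dict(self.committed)
            for key, value in self.tentative_log:
                view[key] = value
            return view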
So far, we have made no assumptions about the type of data or applications.
Semantics-based approaches, in contrast, exploit the semantics of data in order
to split large or complex objects into smaller fragments, so that operations at
each fragment can proceed relatively independently of operations at other
fragments [KJ95,WC95]. For instance, in site escrow methods [Nei86,KB92],
the total number of instances of a given item is partitioned across a number
of sites. A transaction runs at only one site and successfully completes if
the number of instances it requires does not exceed the number of instances
available in escrow at that site. When more instances are required at a site,
a redistribution protocol can be executed to reassign escrows. Thus, transac-
tions at a mobile unit can run independently without employing the wireless
link. Escrow methods are appropriate for sales and inventory applications
[KJ95].
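As a rough illustration of the escrow idea, the sketch below tracks the instances of a single item partitioned across sites; the class and the redistribution step are our own simplifications, not the actual protocols of [Nei86,KB92].

    # Illustrative site-escrow check for one partitioned item (e.g., stock on hand).
    class EscrowSite:
        def __init__(self, quota):
            self.quota = quota                  # instances held in escrow at this site

        def try_consume(self, amount):
            # Purely local decision: no wireless link is needed while the quota suffices.
            if amount <= self.quota:
                self.quota -= amount
                return True
            return False                        # would trigger a redistribution request

    def redistribute(donor, receiver, amount):
        # Simplified redistribution: move part of the donor's escrow to the receiver.
        if donor.try_consume(amount):
            receiver.quota += amount

    fixed, mobile = EscrowSite(70), EscrowSite(30)   # 100 instances split across two sites
    assert mobile.try_consume(10)                    # runs entirely at the mobile unit
    if not mobile.try_consume(50):                   # not enough in local escrow
        redistribute(fixed, mobile, 40)              # reassign escrow, then retry locally
        assert mobile.try_consume(50)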
In the fragmentation approach [WC95], a master copy of a large object
residing at the fixed network is split into smaller physical fragments which are
logically removed from the copy and loaded at the mobile host. The physical
fragments transferred to the mobile host are accessible only by transactions on
the mobile host, while the remaining part of the master copy remains readily
accessible. A type-specific merge procedure is executed to re-assemble the
fragments into a single copy. Examples of fragmentable objects include stacks
and sets. More flexibility is attained, if objects encapsulate not only pure data
but also information necessary for their manipulation, such as procedures for
conflict resolution. Such an object organization can be built on top of an
existing database or file organization, by defining, for example, an object that
consists of a set of files and file operations. Such object-based approaches are
followed in the Rover toolkit and the Pro-motion infrastructure [WC97].
The basic unit in Rover [JTK97] is a relocatable dynamic object (RDO).
Clients import copies of RDOs in their local caches. To decide which RDOs
to hoard, Rover allows each application to provide a prioritized list of ob-
jects to be prefetched. Rover provides flexibility in the choice of mechanism
for concurrency control. However, it directly supports a primary-copy tenta-
tive update optimistic consistency control, similarly to most of the systems we
have studied so far. Each RDO has a home server that maintains the primary
canonical copy. Clients import secondary copies of RDOs in their local caches.
When a client modifies a locally cached copy, the cached copy is marked ten-
tatively committed. Clients log method invocations rather than only new
data values. The client log is lazily propagated to the server, where the op-
erations are applied to the canonical copies. In the meantime, clients may
choose to use tentatively committed RDOs. The server detects any update
conflicts and uses type-specific information in resolving them. The results of
reconciliation override the tentative data stored at the clients.
In the Pro-motion infrastructure [WC97], the unit of caching and repli-
cation is a compact. When a wireless client needs data, it sends a request
to the database server. The server sends a compact as a reply. A compact
is an object that encapsulates the cached data, operations for accessing the
cached data, state information (such as the number of accesses to the object),
consistency rules that must be followed to guarantee consistency, and obliga-
tions (such as deadlines). Compacts provide flexibility in choosing consistency
methods from simple check-in/check-out pessimistic schemes to complex opti-
mistic criteria. If the database server lacks compact management capabilities,
a compact manager acts as a front-end to the database server.
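The description of a compact suggests a bundle of data, operations, state, rules, and obligations; the following sketch is only our reading of that description (the field names are illustrative, not Pro-motion's actual interface).

    from dataclasses import dataclass, field
    from typing import Any, Callable, Dict, List

    @dataclass
    class Compact:
        # Illustrative stand-in for a Pro-motion compact; field names are ours.
        data: Dict[str, Any]                                     # the cached data
        operations: Dict[str, Callable]                          # methods for accessing it
        state: Dict[str, Any] = field(default_factory=dict)      # e.g., number of accesses
        consistency_rules: List[str] = field(default_factory=list)
        obligations: List[str] = field(default_factory=list)     # e.g., deadlines

        def invoke(self, op_name, *args):
            # All access goes through the compact's own operations, so consistency
            # rules can be enforced locally at the mobile client.
            self.state["accesses"] = self.state.get("accesses", 0) + 1
            return self.operations[op_name](self.data, *args)

    compact = Compact(data={"x": 1}, operations={"read": lambda d, k: d[k]})
    print(compact.invoke("read", "x"))    # 1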

6 Data Delivery by Broadcast

In traditional client/server systems, data are delivered on demand. A client
explicitly requests data items from the server. When a data request is received
at a server, the server locates the information of interest and returns it to the
client. This form of data delivery is called pull-based. In wireless computing,
the stationary server machines are provided with a relatively high-bandwidth
channel which supports broadcast delivery to mobile clients in their cell. This
facility provides the infrastructure for a new form of data delivery called push-
based delivery. In push-based data delivery, the server repetitively broadcasts
data to a client population without a specific request. Clients monitor the
broadcast and retrieve the data items they need as they arrive.
Besides wireless communications, push-based delivery is important for
a wide range of applications that involve dissemination of information to a
large number of clients. Dissemination-based applications include information
feeds such as stock quotes and sports tickers, electronic newsletters, mailing
lists, traffic and weather information systems, and cable TV. An important
application of dissemination-based systems is information dissemination on
the Internet that has gained considerable attention (e.g., [BC96,YG95]). Re-
cently, many commercial products have been developed that provide wireless
dissemination of Internet-available information. For instance, the AirMedia's
Live Internet broadcast network [Air] wirelessly broadcasts customized news
and information to subscribers equipped with a receiver antenna connected
to their personal computer. Similarly, Hughes Network Systems' DirectPC
[Sys97] network downloads content from the Internet directly from the web
servers to a satellite network and then to the subscribers' personal computer.
The idea of broadcast data delivery is not new. Early work has been
conducted in the area of Teletext and Videotex systems [AW85,Won88]. Pre-
vious work also includes the Datacycle project [BGH+92] at Bellcore and
the Boston Community Information System (BCIS) [Gif90]. In Datacycle, a
database circulates on a high bandwidth network (140 Mbps). Users query
the database by filtering relevant information via a special massively paral-
lel transceiver. BCIS broadcasts news and information over an FM channel
to clients with personal computers equipped with radio receivers. Recently,
broadcast has received attention in wireless systems because of the physical
support for broadcast in both satellite and cellular networks.

6.1 Hybrid Delivery


Push-based data delivery is suitable in cases in which information is trans-
mitted to a large number of clients with overlapping interests. In this case,
the server saves several messages that in pull-based systems would have to be
sent individually. In addition, the server is prevented from being overwhelmed
by multiple client requests. Push-based delivery is scalable since performance
does not depend on the number of clients listening to the broadcast. Pull-
based delivery, on the other hand, cannot scale beyond the capacity of the
server or the network. One of the limitations of broadcast delivery is that
access is only sequential; clients need to wait till the required data appear
on the channel. Thus, access latency degrades with the volume of data being
broadcast, that is, with the size of the database. In pull-based data delivery,
clients play a more active role and can explicitly request data from the server.
Push and pull based delivery can be combined by providing clients with an
uplink channel, also called backchannel, to send messages to the server.
An important issue in such a hybrid delivery mechanism is whether the
same channel from the server to the clients is used for both broadcast delivery
and for the transmission of replies to on demand requests. In this case, poli-
cies are needed for efficiently sharing the channel among the various delivery
mechanisms. Clients can use the backchannel in various ways. The backchan-
nel can be utilized by the clients to provide feedback and profile information
to the server. Clients can also use the backchannel to directly request time-
critical data. The backchannel is used in [AFZ97] along with caching at the
clients to allow clients to pull pages that are not available in their local cache
and are expected to appear in the broadcast after a threshold number of
items.
One approach in hybrid delivery is, instead of broadcasting all data items
in the database, to broadcast an appropriately selected subset of the items
and provide the rest on demand. Determining which subset of the database
to broadcast is a complicated task since the decision depends on many factors
including the clients' access patterns and the server's capacity to service re-
quests. Broadcasting the most popular data is the approach taken in [SRB97],
where the broadcast medium is used as an air-cache for storing frequently
requested data. A technique is presented that continuously adjusts the broad-
cast content to match the hot-spot of the database. The hot-spot is calculated
by observing the broadcast misses indicated by explicit requests for data not
on the broadcast. These requests provide the server with tangible statistics on
the actual data demand. Partitioning the database into two groups: a "publi-
cation group" that is broadcast and an "on demand" group is also suggested
in [IV94]. The same medium is used for both the broadcast channel and the
backchannel. In this approach, the criterion for partitioning the database is
minimizing the backchannel requests while keeping the response time below
a predefined upper limit.
Another approach is to broadcast pages on demand. In this approach, the
server chooses the next item to be broadcast on every broadcast tick based
on the requests for data it has received. Various strategies have been stud-
ied [Won88] such as broadcasting the pages in the order they are requested
(FCFS), or broadcasting the page with the maximum number of pending
requests. A parameterized algorithm for large-scale data broadcast that is
based only on the current queue of pending requests is proposed in [AF98].
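The two simple on-demand strategies just mentioned can be sketched as follows; the request queue and its handling are our own simplification, not a published algorithm.

    from collections import Counter, deque

    # Illustrative on-demand broadcast scheduling over a queue of pending requests.
    def next_item_fcfs(pending):
        # Broadcast pages in the order in which they were first requested.
        return pending.popleft() if pending else None

    def next_item_most_requested(pending):
        # Broadcast the page with the maximum number of pending requests;
        # a single broadcast satisfies all pending requests for that page.
        if not pending:
            return None
        page, _ = Counter(pending).most_common(1)[0]
        remaining = [p for p in pending if p != page]
        pending.clear()
        pending.extend(remaining)
        return page

    requests = deque(["A", "B", "A", "C", "A", "B"])
    print(next_item_most_requested(requests))   # 'A' (three pending requests)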
Mobility of users is also critical in determining the set of broadcast items.
Cells may differ in their type of communication infrastructure and thus in
their capacity to service requests. Furthermore, as users move between cells,
the distribution of requests for specific data at each cell changes. Two vari-
ations of an adaptive algorithm that takes into account mobility of users
between cells of a cellular architecture are proposed in [DCK+97]. The
algorithms statistically select data to be broadcast based both on user profiles
and registration in each cell.

6.2 Organization of Broadcast Data


Clients are interested in accessing specific items from the broadcast. The
access time is the average time elapsed from the moment a client expresses
its interest in an item by submitting a query to the receipt of the item on the
broadcast channel. The tuning time is the amount of time spent listening to
the broadcast channel. Listening to the broadcast channel requires the client
to be in the active mode and consumes power. The broadcast data should be
organized so that the access and tuning time are minimized.
The simplest way to organize the transmission of broadcast data is a flat
organization. In a flat organization, given an indication of the data items
desired by each client listening to the broadcast, the server simply takes the
union of the required items and broadcasts the resulting set cyclically. More
sophisticated organizations include broadcast disks and indexing.
In many applications, the broadcast must accommodate changes. At least
three different types of changes are possible [AFZ95]. First, the content of the
broadcast can change in terms of including new items and removing existing
ones. Second, the organization of the broadcast data can be modified, for
instance by changing the order by which the items are broadcast or the
frequency of transmission of a specific item. Finally, if the broadcast data are
allowed to be updated, the values of data on the broadcast change.

Broadcast disks. The basic idea of broadcast disks is to broadcast data
items that are most likely to be of interest to a larger part of the client com-
munity more frequently than others. Let us consider data being broadcast
with the same frequency as belonging to the same disk. Then, in a sense,
multiple disks of different sizes and speeds are superimposed on the broad-
cast medium [AFZ95,AAF+95]. An example demonstrating these points is
shown in Figure 6.1 [AFZ95,AAF+95]. The figure shows three different orga-
nizations of broadcast items of equal length. Figure 6.1(a) is a flat broadcast,
while in Figure 6.1(b) and (c) the data item A is broadcast twice as often
as items Band C. Specifically, (b) is a skewed (random) broadcast, in which
subsequent broadcasts of A are potentially clustered together, whereas (c) is
regular since there is no variance in the interarrival time of each item. The
performance characteristics of (c) are the same as if A was stored on a disk
that is spinning twice as fast as the disk containing B and C. Thus, (c) can
be seen as a multidisk broadcast. It can be shown [AAF+95] that, in terms of
the expected delay, the multidisk broadcast (c) always performs better than
the skewed one (b). The parameters that shape the multidisk broadcast are:
first, the number of disks, which determines the number of different frequencies
with which items will be broadcast; and then, for each disk, the number of
items and the relative frequency of broadcast.
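A small sketch of how such a multidisk program could be flattened into a regular broadcast, loosely following the construction of [AAF+95]; items are assumed to be of equal length and the chunking is simplified (math.lcm requires Python 3.9 or later).

    from math import lcm

    # Flatten broadcast disks into one periodic schedule (simplified [AAF+95]-style).
    def broadcast_program(disks, freqs):
        # disks: list of item lists; freqs: relative broadcast frequency per disk.
        max_chunks = lcm(*freqs)
        chunked = []
        for items, f in zip(disks, freqs):
            n = max_chunks // f                          # chunks for this disk
            size = -(-len(items) // n)                   # ceiling division
            chunked.append([items[i * size:(i + 1) * size] for i in range(n)])
        schedule = []
        for minor_cycle in range(max_chunks):
            for disk in chunked:
                schedule.extend(disk[minor_cycle % len(disk)])
        return schedule

    # Item A broadcast twice as often as B and C, as in the multidisk case (c) above.
    print(broadcast_program([["A"], ["B", "C"]], [2, 1]))   # ['A', 'B', 'A', 'C']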

Indexing. Clients may be interested in fetching from the broadcast individ-
ual data items identified by some key. If a form of a directory indicating when
a specific data item appears in the broadcast is provided along with the data
items, then each client need only tune in to the channel selectively to download
the required data [IVB94a,IVB94b]. Thus, most of the time clients will re-
main in doze mode and thus save energy. The objective is to develop methods
for allocating catalog data together with data items on the broadcast channel
so that both access and tuning time are optimized. As an example, consider
[Fig. 6.1. Broadcast disks: (a) flat, (b) skewed, and (c) multidisk broadcast organizations]

the case of a flat broadcast where no catalog information is provided. This
method provides the best access time with a very large tuning time. For a
broadcast of size Data data items, the average access time is Data/2. On the
other hand, the average tuning time equals Data/2, which is the worst case
value.
The catalog may have the form of an index to the broadcast data items.
In (1, m) indexing [IVB94a], the whole index is broadcast following every
fraction (1/m) of the broadcast data items. All items have an offset to the
beginning of the next index item. To access a record, a client tunes into the
current item on the channel, and using the offset determines the next nearest
index item. Then, it goes to doze mode and tunes in when the index is broad-
cast. From the index, the client determines the required data item, and tunes
in again when the data item is broadcast. In the (1, m) allocation method, the
index is broadcast m times during each period of the broadcast. Distributed
indexing [IVB94a] improves over this method by only partially replicating
the index. In particular, instead of replicating the whole index, each index
segment describes only data items which immediately follow. Finally, in flex-
ible indexing [IVB94b], the broadcast is divided into p data segments. The
items of the broadcast are assumed to be sorted. The first item in each data
segment is preceded by a control index. The control index consists of a binary
control index and a local index. The binary control index is used to determine
the data segment where the key is located by performing a binary search. The
local index is then used to locate the specific item inside the segment.
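To make the (1, m) layout concrete, the sketch below interleaves a copy of the full index before every 1/m fraction of the data and walks through the client's tuning steps; it is a simplification in which the whole index fits in one bucket and doze periods are implied rather than simulated.

    # Sketch of a (1, m) broadcast layout [IVB94a] and selective tuning (simplified).
    def build_broadcast(keys, m):
        index = {k: None for k in keys}                  # key -> position, filled below
        per_segment = -(-len(keys) // m)                 # ceiling division
        schedule = []
        for s in range(m):
            schedule.append(("INDEX", index))            # the index precedes each segment
            schedule.extend(("DATA", k) for k in keys[s * per_segment:(s + 1) * per_segment])
        for pos, (kind, payload) in enumerate(schedule):
            if kind == "DATA":
                index[payload] = pos
        return schedule

    def tune_for(schedule, start, key):
        pos, n = start, len(schedule)
        while schedule[pos % n][0] != "INDEX":           # find the next index bucket;
            pos += 1                                     # with real offsets the client dozes here
        target = schedule[pos % n][1][key]               # read the index, then doze again
        while pos % n != target:
            pos += 1
        return schedule[pos % n], 2                      # buckets actually listened to: index + data

    bcast = build_broadcast(["k1", "k2", "k3", "k4"], m=2)
    print(tune_for(bcast, start=3, key="k4"))            # (('DATA', 'k4'), 2)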
Instead of broadcasting a separate directory, if hashing-based techniques
[IVB94b] are used, only the hashing parameters need to be broadcast along
with each data item. The hashing parameters may include the hashing func-
tion h and in case of collisions an indication of where in the broadcast the
overflow items are located. If h is a perfect hashing function, then a client
requesting item K tunes in, reads h, computes h(K), and goes to doze mode
waiting for bucket h(K) .
6.3 Client Caching in Broadcast Delivery

Caching can be deployed along with broadcast in dissemination-based sys-
tems. Clients may cache data items to lessen their dependency on the server's
choice of broadcast priority. Since this choice is often based on an average over
a large client population with diverse needs, it may not be optimal for a spe-
cific client. Furthermore, the specific client's access distribution may change
over time. In any case, caching data items from the broadcast reduces the
expected delay for accessing them. Employing caching in broadcast-based
systems requires revising traditional cache management protocols such as
those for cache replacement and prefetching. To be in accordance with cache
related terminology, we use the term page and data item interchangeably,
assuming that the granularity of a cache is a (broadcast) item.

Replacement policies. In traditional cache management systems, clients
cache their hottest data mostly to improve the cache hit ratio. In general,
in such systems, the cost of obtaining a page on a cache miss is considered
constant and thus is not accounted for during page replacement. However, in
broadcast systems, the cost of servicing a miss on a page depends on when the
requested page will appear next on the broadcast. This creates the need for
cost-based page replacement [AFZ95,AAF+95], where the cost of obtaining a
page on a cache miss must be taken into consideration in page replacement
decisions. In particular in dissemination systems with a broadcast disk or-
ganization, clients should store those pages for which the local probability
of access is significantly greater than the page's frequency of broadcast. A
simple cost-based replacement strategy is the P Inverse X method (PIX)
[AAF+95], which replaces the cache-resident page having the lowest ratio be-
tween its probability of access (P) and its frequency of broadcast (X).
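A minimal sketch of PIX-style victim selection follows; the access probabilities and broadcast frequencies are assumed to be known to the client (in practice they would have to be estimated).

    # Illustrative PIX victim selection: evict the cached page with the lowest
    # ratio of access probability (P) to broadcast frequency (X).
    def pix_victim(cached_pages, access_prob, broadcast_freq):
        return min(cached_pages, key=lambda p: access_prob[p] / broadcast_freq[p])

    cached = ["p1", "p2", "p3"]
    P = {"p1": 0.05, "p2": 0.30, "p3": 0.20}   # local probability of access
    X = {"p1": 1,    "p2": 4,    "p3": 2}      # broadcasts per major cycle
    print(pix_victim(cached, P, X))            # 'p1' has the lowest P/X ratio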

Prefetching. Clients prefetch pages into their cache in anticipation of fu-
ture accesses. In traditional distributed systems, prefetching puts additional
load on the server and the network since the pages to be prefetched need to
be transmitted to the client. However, in dissemination-based systems only
the client's local resources are impacted, since the items to be prefetched are
already on the broadcast. Using prefetching instead of page replacement can
reduce the cost of a miss as illustrated in [AFZ96b]. This possibility is ex-
ploited by the PT [AFZ96b,AAF+95] prefetching heuristic. PT is a dynamic
policy that performs a calculation for each page that arrives on the broad-
cast to decide whether that page is more valuable than some other page that
is currently in the cache. If so, the cached page is replaced with the page
currently on the broadcast.
Another approach to prefetching is proposed in [Amm87] in the context
of Teletext broadcast delivery systems. In this approach, control information
is stored along with each broadcast page. The control information for a page
is a linked list of pages that are most likely to be requested next by the client.
When a request for a page p is satisfied, the client enters a phase during which
it prefetches the D most likely referenced items associated with p, where D
is the cache size in pages. This phase terminates either when D pages are
prefetched or when the client submits a new request.

6.4 Cache Invalidation by Broadcast


The server in a client/server system can use broadcast to inform its clients
of updates of items in their cache. A server can use broadcast to invalidate
the cache of its clients either asynchronously or synchronously [BI94]. In
asynchronous methods, the server broadcasts an invalidation report for a
given item as soon as its value is changed. In synchronous methods, the
server periodically broadcasts an invalidation report. A client has to listen
to the report first to decide whether its cache is valid or not. Thus, each client
is confident of the validity of its cache only as of the last invalidation report.
That adds some latency to query processing, since to answer a query, a client
has to wait for the next invalidation report. Cache invalidation protocols
are also distinguished based on whether or not the server maintains any
information about its clients, the contents of their caches, and when they were
last validated. Servers that hold such information are called stateful, while
servers that do not are called stateless [BI94].
Invalidation reports vary in the type of information they convey to the
clients. For instance, the reports may contain the values of the items that
have been updated, or just their identity and the timestamp of their last
update. The reports can provide information for individual items or aggregate
information for sets of items. The aggregate information must be such that
if a client concludes that its cache is valid, this is in fact the case. However,
false alarms, where a client mistakenly considers its cache as invalid, may be
tolerated.
Three synchronous strategies for stateless servers are proposed in [BI94].
In the broadcasting timestamps strategy (T S), the invalidation report con-
tains the timestamp of the latest change for items that have had updates in
the last w seconds. In the amnestic terminals strategy (AT), the server only
broadcasts the identifiers of the items that changed since the last invalidation
report. In the signatures strategy, signatures are broadcast. A signature is a
checksum computed over the value of a number of items by applying data
compression techniques similar to those used for file comparison. Each of
these strategies is shown to be effective for different types of clients depend-
ing on the time the clients spend in doze mode. An asynchronous method
based on bit sequences is proposed in [JBE+95,BJ95]. In this method, the
invalidation report is organized as a set of bit sequences with an associated
set of timestamps. Each bit in the sequence represents a data item in the
database. A bit "I" indicates that the corresponding item has been updated
since the time specified by the associated timestamp, while a "0" indicates
that the item has not changed. The set of bit sequences is organized in a
hierarchical structure.
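As one concrete reading of the broadcasting-timestamps (TS) strategy, the sketch below drops cached items that the latest report shows were updated after they were cached, and drops the whole cache if the client has missed reports for longer than the window w; the data structures are our own simplification of [BI94].

    # Client-side processing of a TS invalidation report (simplified).
    def process_ts_report(cache, report, last_report_time, now, w):
        # cache: item -> (value, timestamp of the cached version)
        # report: item -> timestamp of its latest update within the window w
        if now - last_report_time > w:
            return {}                            # slept too long: discard the entire cache
        return {item: (value, ts)
                for item, (value, ts) in cache.items()
                if item not in report or report[item] <= ts}

    cache = {"x": (10, 100), "y": (7, 95)}
    report = {"x": 120}                          # x was updated at time 120 > 100
    print(process_ts_report(cache, report, last_report_time=90, now=130, w=60))
    # -> {'y': (7, 95)}: x is dropped, y is still considered valid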
A client may miss cache invalidation reports, because of disconnections
or doze mode operation. Synchronous methods surpass asynchronous ones in that
clients need only periodically tune in to read the invalidation report instead of
continuously listening to the channel. However, if the client remains inactive
longer than the period of the broadcast, the entire cache must be discarded,
unless special checking is deployed. In simple checking, the client sends the
identities of all cached objects along with their timestamps to the server for
validation. This requires a lot of uplink bandwidth as well as battery energy.
Alternatively, the client can send group identifiers and timestamps, and the
validity can be checked at the group level. This is similar to volume checking
in the Coda file system. Checking at the group level reduces the uplink re-
quirements. On the other hand, a single object update invalidates the whole
group. As a result the amount of cached items retained may significantly
reduce by discarding possibly valid items of the group. To remedy this situa-
tion, in GCORE [WYC96j, the server identifies for each group a hot update
set and excludes it from the group when checking the group's validity.

6.5 Consistency Control in Broadcast Systems


When the values of broadcast items are updated, there is a need for consis-
tency control protocols. Such protocols vary depending on various parame-
ters. First, protocols depend on the assumptions made about data delivery,
for example, on whether there is a backchannel for on demand data delivery,
as well as on whether data items are cached at clients, and if so, on whether
clients can perform updates. Consistency control protocols also depend on
the data consistency model in use. In traditional database systems, consis-
tency is based on serializability that ensures that operations are performed in
the context of atomic, consistent, isolated, and durable transactions. Because
dissemination-based information systems are only now beginning to emerge,
appropriate data consistency models in such applications have not yet been
extensively studied.
Preserving the consistency of clients' read-only transactions in the pres-
ence of updates is discussed in [Pit98b,Pit98a]. To this end, control infor-
mation is broadcast along with data that enables the validation of read-only
transactions at the clients. Various methods are presented that vary in the
complexity and volume of control information, including transmitting inval-
idation reports, multiple versions per item, and serializability information.
Caching at the client is also supported to decrease query latency. The perfor-
mance of the methods is evaluated and compared through both qualitative
arguments and simulation results. In all the methods proposed, consistency
is preserved without contacting the server and thus the methods are scalable;
i.e., their performance is independent of the number of clients. This property
makes the methods appropriate for highly populated service areas.
A number of cache consistency models are reasonable for broadcast-
based delivery. For example, when clients do not cache data, the server
always broadcasts the most recent values, and there is no backchannel for
on-demand data delivery, the model that arises naturally is the latest value
model [AFZ96a]. In this model, clients read the most recent value of a data
item. This model is weaker than serializability because there is no notion of
transactions, i.e., operations are not grouped into atomic units. When clients
cache data but are not allowed to perform any updates, an appropriate con-
sistency model is quasi caching [ABG90]. In this model, although the value of
the cached data may not be the most recent one, this value is guaranteed to be
within an allowable deviation as specified through per-client coherency condi-
tions. Quasi caching is a reasonable choice in the case of long disconnections
and/or weak connectivity. A weaker alternative to serializability that sup-
ports transactions in dissemination-based systems is proposed in [SNP+97].
In this model, read-only transactions read consistent data without having
to contact the server. However, to ensure correctness, control information is
required to be broadcast along with the data.
The broadcast facility can be exploited in various traditional algorithms
for concurrency control. Using the broadcast facility in optimistic concurrency
control protocols to invalidate some of the client's transactions is suggested
in [Bar97]. In optimistic concurrency control, the transaction scheduler at
the server checks at commit time whether the execution that includes the
client's transaction to be committed is serializable or not. If it is, it accepts
the transaction; otherwise it aborts it. In the proposed enhancement of the
protocol, the server periodically broadcasts to its clients a certification report
(CR) that includes the readset and writeset of active transactions that have
declared their intention to commit to the server during the previous period
and have successfully been certified. The mobile client uses this information
to abort those of its transactions whose readsets and writesets intersect
with the current CR. Thus, part of the verification is performed at the mobile
client.
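A sketch of the client-side part of this scheme: a transaction running at the mobile host is aborted locally if its readset or writeset intersects the writesets (or its writeset the readsets) of the transactions listed in the current certification report. The conflict test below is our reading of the intersection check, not the exact protocol of [Bar97].

    # Illustrative client-side check of a local transaction against a CR.
    def conflicts_with_cr(txn, certification_report):
        # txn and each certified entry carry a readset and a writeset of item ids.
        for certified in certification_report:
            if (txn["readset"] & certified["writeset"]
                    or txn["writeset"] & certified["writeset"]
                    or txn["writeset"] & certified["readset"]):
                return True                      # cannot serialize: abort at the mobile host
        return False

    cr = [{"readset": {"a"}, "writeset": {"b"}}]
    local_txn = {"readset": {"b", "c"}, "writeset": {"d"}}
    print(conflicts_with_cr(local_txn, cr))      # True: it read 'b', which was written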

7 Mobile Computing Resources and Pointers

Books that focus on data management for mobile and wireless computing
include [PS98] and [IK95], which is an edited collection of papers covering a
variety of aspects in mobile computing.
There are various extensive on-line bibliographies on mobile computing
that include links to numerous research projects, reports, commercial prod-
ucts and other mobile-related resources [Ali,Mob].
There is a major annual conference solely devoted to mobile computing,
the ACM/IEEE International Conference on Mobile Computing and Net-
working. Many database, operating systems, networking and theory confer-
ences have included mobile computing in their topics of interest, and several
related papers now appear in the proceedings of these conferences. Recent
conferences that have addressed mobile computing include:

• 16th, 17th, 18th International Conferences on Distributed Computing Systems;
• 30th Hawaii International Conference on System Sciences;
• 24th International Conference on Very Large Databases (VLDB'98);
• 1st (1995, USA), 2nd (1996, USA), 3rd (1997, Hungary) ACM/IEEE Annual International Conference on Mobile Computing and Networking;
• IEEE International Conference on Computer Communications and Networks;
• International Conferences on Data Engineering;
• International workshop "Mobility in Databases and Distributed Systems" at DEXA'98 and many others.

Table 7.1 gives a condensed summary of research groups world-wide (current
at the time of writing), which are looking into various aspects of mobile
computing.

Table 7.1. Mobile computing research groups

Who: DATAMAN, T. Imielinski, B. Badrinath, Rutgers University, NJ, U.S.A., http://www.cs.rutgers.edu/dataman/
What: data management in mobile computing, distributed algorithms and services, data broadcasting, indirect protocols, data replication, wireless networks, location management, software architectures

Who: INFOPAD, EECS Dept., R. Katz, B. Brodersen, University of California, Berkeley, U.S.A., http://infopad.eecs.berkeley.edu/
What: InfoPad terminal design, low-power units, mobile multimedia networks, design tools, applications & user interfaces

Who: LITTLE WORK, CITI (Centre for Information Technology Integration), Univ. of Michigan, Ann Arbor, U.S.A., http://www.citi.umich.edu/mobile.html
What: requirements for mobile computers, log optimisation, communications, consistency in mobile file systems, disconnected operation for AFS

Who: Mobile Communications Research Group, EE Dept., CSER (Centre for Satellite Engineering), University of Surrey, U.K., http://www.ee.surrey.ac.uk/EE/CSER/
What: satellite and personal communications, universal mobile telecom systems (UMTS)

Who: Mobile Computing Lab, D. Duchamp, Columbia University, U.S.A., http://www.mcl.cs.columbia.edu/
What: disconnected operation, dynamic load balancing, efficient use of limited bandwidth, dynamic service location, mobility-aware applications

Who: Telecommunication Systems Laboratory, KTH, Teleinformatik, Stockholm, Sweden, http://www.it.kth.se/labs/ts
What: mobile communications, handover, multicast routing, mobile applications

Who: Mobile Computing Group, CSE Dept., B. Bershad, J. Zahorjan, Univ. of Washington, Seattle, U.S.A., http://www.cs.washington.edu/research/mobicomp/mobile.html
What: Mobisaic, a WWW browser for a mobile and wireless computing environment; Wit, system infrastructure for mobile handheld computing; coping with resource variability

Who: Distributed Multimedia Research Group, G. Blair, N. Davies, Lancaster Univ., U.K., http://www.comp.lancs.ac.uk/computing/research/mpg/
What: multimedia support for mobile computing, middleware for mobile computing, mobility-aware applications

Who: Active Badge, Olivetti, U.K., http://www.cam.orl.co.uk/ab.html
What: location management, mobility-aware applications

Who: Solaris Mobile IP, http://playground.sun.com/pub/mobile-ip/
What: Mobile IP for the Solaris OS

Who: IETF Mobile IP Working Group, Internet Engineering Task Force, U.S.A., http://www.ietf.org/html.charters/mobileip-charter.html
What: mobile networks, protocols, OS for mobile computing

Who: Mobile Computing - Shoshin Lab, Dept. of Comp. Science, Univ. of Waterloo, Canada, http://ccnga.uwaterloo.ca/mobile/
What: quality of service and mobility management, traffic modelling, security, signal quality

Who: Mobile Computing Research, B. Bhargava, O. Bukhres, A. Joshi, A. Elmagarmid, Dept. of Comp. Sciences & School of Elec. Engineering, Purdue University, U.S.A., http://www.cs.purdue.edu/research/cse/mobile
What: CrossPoint & Sciencepad projects, data management in mobile computing, high-speed ATM/broadband integrated networks, mobile environments in telemedicine, mobile IP, performance, caching in mobile computing, mobile WWW

Who: Mobile databases & communications, J. Yu, Comp. Sci., Australian National University, http://cs.anu.edu.au/research.html
What: mobile databases, mobile communications, mobile IP, TCP/IP performance

Who: Wireless Networks, D. Skellern, N. Weste, School of MPCE, Macquarie University, Sydney, http://www.mpce.mq.edu.au/elec/networks/wireless/
What: high-performance wireless LANs, antennas, mobile IP, protocols, wireless networks

Who: DPMC (Distributed, Parallel & Mobile Computing), School of Comp. Sci. and Software Eng., Monash Univ., Australia, http://www.ct.monash.edu.au/DPMC/
What: data management for mobile computing, wireless networks, interoperability, mobile agents & objects, caching, adaptive protocols

Who: Coda file system, M. Satyanarayanan, CS, Carnegie Mellon Univ., U.S.A., http://www.cs.cmu.edu/afs/cs.cmu.edu/project/coda/Web/coda.html
What: disconnected operations, mobile file system, caching, replication, mobility management

Who: CMU Monarch project, D. Johnson, School of Computer Science, Carnegie Mellon University, U.S.A., http://www.monarch.cs.cmu.edu
What: mobile networks, mobile IP, protocols for wireless networks, mobile computing architecture

Who: The File Mobility Group, Comp. Science, Univ. of California, Los Angeles, U.S.A., http://ficus-www.cs.ucla.edu/ficus/
What: mobile file systems, replication, mobile computing environments

Who: Multimedia Wireless LAN Group, Univ. of Massachusetts, Amherst, U.S.A., http://www.ecs.umass.edu/ece/wireless/
What: quality of service, wireless LANs, multimedia applications, protocols

Who: Rover Mobile Application Toolkit, F. Kaashoek, Comp. Science Lab, MIT, U.S.A., http://www.pdos.lcs.mit.edu/rover/
What: dynamic mobile objects, mobile applications, queued remote procedure calls, mobile computing environments

Who: MosquitoNet project, M. Baker, Comp. Science, Stanford Univ., U.S.A., http://mosquitonet.Stanford.EDU/mosquitonet.html
What: mobile computing, wireless networks, connectivity management

Who: Pleiades project, J. Widom, Comp. Science, Stanford Univ., U.S.A., http://www-db.stanford.edu/-jan/HiDB.html
What: personal communication systems, replication, location management, mobile databases

Who: Wireless LAN Alliance (WLANA), major LAN vendors, http://www.wlana.com
What: wireless LANs, protocols, handoff, IEEE 802.11

8 Conclusions

Wireless communications permit users carrying portable computers to re-
tain their network connection even when mobile. The resulting computing
paradigm is often called mobile computing. In conjunction with the exist-
ing computing infrastructure, mobile computing adds a new dimension to
distributed computation, that of universal access to information anytime and
anyplace. This dimension enables a whole new class of applications. However,
the realization of these applications presupposes that a number of challenges
regarding data management are met. In this chapter, we have surveyed these
challenges along with various proposals for addressing them. Many technical
problems still remain to be resolved regarding mobility support in large scale
complex heterogeneous distributed systems.

References

[AAF+95] Acharya, S., Alonso, R., Franklin, M.J., Zdonik, S., Broadcast
disks: data management for asymmetric communications environ-
ments, Proc. ACM SIGMOD Intl. Conference on Management of
Data (SIGMOD 95), 1995, 199-210. Reprinted in T. Imielinski, H.
Korth (eds.), Mobile Computing, Kluwer Academic Publishers, 1996,
331-361.
[ABG90] Alonso, R., Barbara, D., Garcia-Molina, H., Data caching issues in
an information retrieval system, ACM Transactions on Database Sys-
tems 15(3), 1990, 359-384.
[AD93] Athan, A., Duchamp, D., Agent-mediated message passing for con-
strained environments, Proc. USENIX Symposium on Mobile and
Location-Independent Computing, Cambridge, Massachusetts, 1993,
103-107.
[AF98] Aksoy, D., Franklin, M.J., Scheduling for large-scale on-demand
data broadcasting, Proc. Conference on Computer Communications
(IEEE INFOCOM '98), 1998, 651-659.
[AFZ95] Acharya, S., Franklin, M.J., Zdonik, S., Dissemination-based data
delivery using broadcast disks, IEEE Personal Communications 2(6),
1995, 50-60.
[AFZ96a] Acharya, S., Franklin, M.J., Zdonik, S., Disseminating updates on
broadcast disks, Proc. 22nd International Conference on Very Large
Data Bases (VLDB 96), 1996,354-365.
[AFZ96b] Acharya, S., Franklin, M.J., Zdonik, S., Prefetching from a broad-
cast disk, Proc. 12th International Conference on Data Engineering
(ICDE 96), 1996, 276-285.
[AFZ97] Acharya, S., Franklin, M., Zdonik, S., Balancing push and pull for
data broadcast, Proc. ACM Sigmod Conference, 1997, 183-194.
[Air] Air Media, AirMedia Live, www.airmedia.com.
[AK93] Alonso, R., Korth, H.F., Database system issues in nomadic com-
puting, Proc. 1993 SIGMOD Conference, Washington, D.C., 1993,
388-392.
[Ali] Aline Baggio's bookmarks on mobile computing,
http://www-sor.inria.frraline/mobile/mobile.html.
[Amm87] Ammar, M.H., Response time in a Teletext system: an individual
user's perspective, IEEE Transactions on Communications 35(11),
1987, 1159-1170.
[ARS97] Acharya, A., Ranganathan, M., Saltz, J., Sumatra: a language for
resource-aware mobile programs, J. Vitek, C. Tschudin (eds.), Mobile
Object Systems, Lecture Notes in Computer Science 1222, Springer-
Verlag, Berlin, 1997, 111-130.
[Aso94] Asokan, N., Anonymity in mobile computing environment, IEEE
Workshop on Mobile Computing Systems and Applications, 1994,
200-204,
http://snapple.cs.washington.edu:600/library/mcsa94/asokan.ps.
[AW85] Ammar, M.H., Wong, J.W., The design of Teletext broadcast cycles,
Performance Evaluation 5(4), 1985, 235-242.
[Bar97] Barbara, D., Certification reports: supporting transactions in wire-
less systems, Proc. IEEE International Conference on Distributed
Computing Systems, 1997,466-473.
[BB97] Bakre, A., Badrinath, B., Implementation and performance evalua-
tion of indirect TCP, IEEE Transactions on Computers 46(3), 1997,
260-278.
[BBH93] Badrinath, B.R., Bakre, A., Imielinski, T., Marantz, R., Handling
mobile clients: a case for indirect interaction, Proc. 4th Workshop on
Workstation Operating Systems, Aigen, Austria, 1993, 91-97.
[BC96] Bestavros, A., Cunha, C., Server-initiated document dissemination
for the WWW, IEEE Data Engineering Bulletin 19(3), 1996, 3-11.
[BGH+92] Bowen, T., Gopal, G., Herman, G., Hickey, T., Lee, K., Mansfield, W.,
Raitz, J., Weinrib, A., The Datacycle architecture, Communications
of the ACM 35(12), 1992, 71-81.
[BGZ+96] Bukhres, O., Goh, H., Zhang, P., Elkhammas, E., Mobile computing
architecture for heterogeneous medical databases, Proc. 9th Inter-
national Conference on Parallel and Distributed Computing Systems,
1996.
[BI94] Barbara, D., Imielinski, T., Sleepers and workaholics: caching strate-
gies in mobile environments, Proc. ACM SIGMOD Intl. Conference
on Management of Data (SIGMOD 94), 1994, 1-12.
[BJ95] Bukhres, O.A., Jing, J., Performance analysis of adaptive caching
algorithms in mobile environments, Information Sciences, An Inter-
national Journal 95(2), 1995, 1-29.
[BMM98] Bukhres, 0., Mossman, M., Morton, S., Mobile medical database
approach for battlefield environments, The Australian Journal on
Computer Science 30(2), 1998, 87-95
[BP97] Badrinath, B.R., Phatak, S., Database server organization for han-
dling mobile clients, Technical Report DCS-342, Department of Com-
puter Science, Rutgers University, 1997.
[Bro95] Brodsky, I., The revolution in personal telecommunications, Artech
House Publishers, Boston, London, 1995.
[CGH+95] Chess, D., Grosof, B., Harrison, C., Levine, D., Parris, C., Tsudik,
G., Itinerant agents for mobile computing, IEEE Personal Commu-
nications 2(5), 1995, 34-49.
[Chr93] Chrysanthis, P.K., Transaction processing in mobile computing en-
vironment, Proc. IEEE Workshop on Advances in Parallel and Dis-
tributed Systems, Princeton, New Jersey, 1993, 77-83.
[DCK+97] Datta, A., Celik, A., Kim, J., VanderMeer, D., Kumar, V., Adaptive
broadcast protocols to support efficient and energy conserving re-
trieval from databases in mobile computing environments, Proc. 19th
IEEE International Conference on Data Engineering, 1997, 124-133.
[DGS85] Davidson, S.B., Garcia-Molina, H., Skeen, D., Consistency in parti-
tioned networks, ACM Computing Surveys 17(3), 1985, 341-370.
[DHB97] Dunham, M., Helal, A., Balakrishnan, S., A mobile transaction model
that captures both the data and movement behavior, ACM/Baltzer
Journal on Special Topics on Mobile Networks, 1997, 149-162.
[DKL+94] Douglis, F., Kaashoek, F., Li, K., Caceres, R., Marsh, B., Tauber,
J.A., Storage alternatives for mobile computers, Proc. 1st Symp. on
Operating Systems Design and Implementation, Monterey, California,
USA, 1994, 25-37.
[DPS+94] Demers, A., Petersen, K., Spreitzer, M., Terry, D., Theimer, M.,
Welch, B., The Bayou architecture: support for data sharing among
mobile users, Proc. IEEE Workshop on Mobile Computing Systems
and Applications, Santa Cruz, CA, 1994, 2-7.
[EZ97] Elwazer, M., Zaslavsky, A., Infrastructure support for mobile in-
formation systems in Australia, Proc. Pacific-Asia Conference on
Information Systems (PACIS'97), Brisbane, QLD, Australia, 1997,
33-43.
[FGB+96] Fox, A., Gribble, S.D., Brewer, E.A., Amir, E., Adapting to net-
work and client variability via on-demand dynamic distillation, Proc.
International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS- VII), Cambridge, MA,
1996, 160-170.
[FZ94] Forman, G.H., Zahorjan, J., The challenges of mobile computing,
IEEE Computer 27(6), 1994, 38-47.
[GHM+90] Guy, R.G., Heidemann, J.S., Mak, W., Page, T.W.J., Popek, G.J.,
Rothmeier, D., Implementation of the Ficus replicated file system,
Proc. Summer 1990 USENIX Conference, 1990,63-71.
[GHN+96] Gray, J., Helland, P., Neil, P.O., Shasha, D., The dangers of repli-
cation and a solution, Proc. ACM SIGMOD Conference, Montreal,
Canada, 1996, 173-182.
[Gif90] Gifford, D., Polychannel systems for mass digital communication,
Communications of the ACM 33(2), 1990, 141-151.
[GKL+94] Gruber, R., Kaashoek, F., Liskov, B., Shrira, L., Disconnected opera-
tion in the Thor object-oriented database system, Proc. IEEE Work-
shop on Mobile Computing Systems and Applications, Santa Cruz,
CA, 1994,51-56.
[HH93] Huston, L.B., Honeyman, P., Disconnected operation for AFS, Proc.
USENIX Symposium on Mobile and Location-Independent Comput-
ing, Cambridge, Massachusetts, 1993, 1-10.
[HH94] Huston, L., Honeyman, P., Peephole log optimization, Proc. IEEE
Workshop on Mobile Computing Systems and Applications, Santa
Cruz, CA, 1994, http://citeseer.nj.nec.com/huston94peephole.html.
[HH95a] Honeyman, P., Huston, L.B., Communication and consistency in mo-
bile file systems, IEEE Personal Communications 2(6), 1995, 44-48.
[HH95b] Huston, L.B., Honeyman, P., Partially connected operation, Com-
puting Systems 4(8), 1995, 365-379.
[HPG+92] Heidemann, J., Page, T.W., Guy, R.G., Popek, G.J., Primarily dis-
connected operation: experience with Ficus, Proc. 2nd Workshop on
the Management of Replicated Data, 1992, 2-5.
[HSL98] Housel, B.C., Samaras, G., Lindquist, D.B., WebExpress: a
client/intercept based system for optimizing Web browsing in a wire-
less environment, ACM/Baltzer Mobile Networking and Applications
(MONET) 3(4), Special Issue on Mobile Networking on the Internet,
1998, 419-431. Also, University of Cyprus, CS-TR 96-18, 1996.
[IB94] Imielinski, T., Badrinath, B.R, Wireless mobile computing: chal-
lenges in data management, Communications of the ACM 37(10),
1994, 18-28.
[IK95] Imielinski, T., Korth, H. (eds.), Mobile computing, Kluwer Academic
Publishers, 1995.
[Inc95] Inc, M., Wireless data communications: an overview,
http://www.mot.com/wdg/.
[Inc96] Inc, E., GSM: the future,
http://www.ericsson.se/systems/gsm/future.htm.
[IV94] Imielinski, T., Viswanathan, S., Adaptive wireless information sys-
tems, Proc. SIG Data Base Systems Conference, Japan, 1994, 19-41.
[IVB94a] Imielinski, T., Viswanathan, S., Badrinath, B.R., Energy efficient
indexing on air, Proc. ACM SIGMOD Intl. Conference on Manage-
ment of Data (SIGMOD 94), 1994, 25-36.
[IVB94b] Imielinski, T., Viswanathan, S., Badrinath, B.R., Power efficient fil-
tering of data on air, Proc. 4th International Conference on Extending
Database Technology, 1994, 245-258.
[JBE95] Jing, J., Bukhres, 0., Elmagarmid, A., Distributed lock management
for mobile transactions, Proc. 15th IEEE International Conference
on Distributed Computing Systems, 1995, 118-125.
[JBE+95] Jing, J., Bukhres, O., Elmagarmid, A.K., Alonso, R., Bit-sequences:
a new cache invalidation method in mobile environments, Technical
Report CSD-TR-94-074, Revised May 95, Department of Computer
Sciences, Purdue University, 1995.
[JK94] Jain, R., Krishnakumar, N., Network support for personal informa-
tion services for PCS users, Proc. IEEE Conference on Networks for
Personal Communications, 1994, 1-7.
[JTK97] Joseph, A.D., Tauber, J.A., Kaashoek, M.F., Mobile computing with
the Rover toolkit, IEEE Transactions on Computers 46(3), 1997,
337-352.
[Kat94] Katz, R.H., Adaptation and mobility in wireless information systems,
IEEE Personal Communications 1, 1994, 6-17.
[KB92] Krishnakumar, N., Bernstein, A., High throughput escrow algorithms
for replicated databases, Proc. 18th VLDB Conference, 1992, 175-
186.
[KJ95] Krishnakumar, N., Jain, R., Mobility support for sales and inven-
tory applications, T. Imielinski, H. Korth (eds.), Mobile Computing,
Kluwer Academic Publishers, 1995, 571-594.
[KS92] Kistler, J.J., Satyanarayanan, M., Disconnected operation in the
Coda file system, ACM Transactions on Computer Systems 10(1),
1992, 213-225.
[KS93] Kumar, P., Satyanarayanan, M., Log-based directory resolution in
the coda file system, Proc. 2nd International Conference on Parallel
and Distributed Information Systems, San Diego, CA, 1993, 202-213.
[KS95] Kumar, P., Satyanarayanan, M., Flexible and safe resolution of file
conflicts, Proc. Winter 1995 USENIX Conference, New Orleans, LA,
1995, 95-106.
[Kue94] Kuenning, G.H., The design of the Seer predictive caching system,
Proc. IEEE Workshop on Mobile Computing Systems and Applica-
tions, Santa Cruz, CA, 1994, 37-43,
ftp://ftp.cs.ucla.edu/pub/ficus/mcsa94.ps.gz.
[LMJ96] Liu, G.Y., Marlevi, A., Maguire Jr., G.Q., A mobile virtual-
distributed system architecture for supporting wireless mobile com-
puting and communications, ACM Journal on Wireless Networks 2,
1996, 77-86.
[LS94] Lu, Q., Satyanarayanan, M., Isolation-only transactions for mobile
computing, Operating Systems Review, 1994, 81-87.
[LS95] Lu, Q., Satyanarayanan, M., Improving data consistency in mobile
computing using isolation-only transactions, Proc. 5th Workshop on
Hot Topics in Operating Systems, Orcas Island, Washington, 1995,
124-128, http://citeseer .nj .nec.com/lu95improving.html.
[MB96] Morton, S., Bukhres, 0., Mobile transaction recovery in distributed
medical databases, Proc. 8th International Conference on Parallel
and Distributed Computing and Systems, 1996.
[MB97] Morton, S., Bukhres, 0., Utilizing mobile computing in the Wishard
Memorial Hospital Ambulatory Service, Proc. 12th ACM Symposium
on Applied Computing (ACM SAC'97), 1997,287-294.
[MBM96] Morton, S., Bukhres, 0., Mossman, M., Mobile computing architec-
ture for a battlefield environment, Proc. International Symposium
on Cooperative Database Systems for Advanced Applications, 1996,
130-139.
[MBZ+97] Morton, S., Bukhres, 0., Zhang, P., Vanderdijs, E., Platt, J., Moss-
man, M., A proposed architecture for a mobile computing environ-
ment, Proc. 5th Euromicro Workshop on Parallel and Distributed
Processing, 1997.
[MES95] Mummert, L.B., Ebling, M.R., Satyanarayanan, M., Exploiting weak
connectivity for mobile file access, Proc. 15th ACM Symposium on
Operating Systems Principles, 1995, 143-155.
[MO97] Morton, S., Bukhres, O., Mobile computing in military ambulatory
care, Proc. 10th IEEE Symposium on Computer-Based Medical Sys-
tems (CBMS'97), 1997, 58-65.
[Mob] Mobile and wireless computing site,
http://mosquitonet.Stanford.EDU /mobile/.
[MS94] Mummert, L., Satyanarayanan, M., Large granularity cache coher-
ence for intermittent connectivity, Proc. Summer 1994 USENIX Con-
ference, Boston, MA, 1994, 279-289.
[Nei86] Neil, P.O., The escrow transactional method, ACM Transactions on
Database Systems 11(4), 1986,405-430.
[NPS95] Noble, B.D., Price, M., Satyanarayanan, M., A programming interface
for application-aware adaptation in mobile computing, Computing
Systems 8(4), 1995, 345-363.
[NS95] Noble, B., Satyanarayanan, M., A research status report on adapta-
tion for mobile data access, Sigmod Record 24(4), 1995, 10-15.
[NSA+96] Narayanaswamy, S., Seshan, S., Amir, E., Brewer, E., Brodersen,
R.W., Burghardt, F., Burstein, A., Yuan-Chi Chang, Fox, A., Gilbert,
J.M., Han, R., Katz, R.H., Long, A.C., Messerschmitt, D.G., Rabaey,
J.M., A low-power, lightweight unit to provide ubiquitous information
access application and network support for InfoPad, IEEE Personal
Communications Magazine, 1996, 4-17.
[NSN+97] Noble, B.D., Satyanarayanan, M., Narayanan, D., Tilton, J.E., Flinn,
J., Walker, K.R., Agile application-aware adaptation for mobility,
Proc. 16th ACM Symposium on Operating System Principles, 1997,
276-287.
[NSZ97] Nzama, M., Stanski, P., Zaslavsky, A., Philosophy of mobile comput-
ing in heterogeneous distributed environment: ET effect in computing
world, A. Zaslavsky, B. Srinivasan (eds.), Proc. 2nd Australian W/S
on Mobile Computing, Databases and Applications, Monash Univer-
sity, Melbourne, Australia, 1997, 37-45.
[Ora97] Oracle, Oracle mobile agents technical product summary,
www.oracle.com/products/networking/mobile_agents/html/.
[PB95a] Pitoura, E., Bhargava, B., A framework for providing consistent and
recoverable agent-based access to heterogeneous mobile databases,
ACM SIGMOD Record 24(3), 1995, 44-49.
[PB95b] Pitoura, E., Bhargava, B., Maintaining consistency of data in mobile
distributed environments, Proc. 15th IEEE International Conference
on Distributed Computing Systems, 1995, 404-413.
[PB99] Pitoura, E., Bhargava, B., Data Consistency in intermittently con-
nected distributed systems, IEEE Transactions on Knowledge and
Data Engineering 11(6), 1999, 896-915.
[PF98] Pitoura, E., Fudos, I., An efficient hierarchical scheme for locating
highly mobile users, Proc. 6th ACM International Conference on
Information and Knowledge Management (CIKM98), 1998, 218-225.
[Pit96] Pitoura, E., A replication schema to support weak connectivity in
mobile information systems, Proc. 7th International Conference on
Database and Expert Systems Applications (DEXA96), Lecture Notes
in Computer Science 1194, Springer Verlag, September 1996, 510-
520.
[Pit98a] Pitoura, E., Scalable invalidation-only processing of queries in broad-
cast push-based delivery, Proc. Mobile Data Access Workshop, in co-
operation with the 17th International Conference on Conceptual Mod-
eling (ER'98), Lecture Notes in Computer Science, Springer Verlag,
1998, 230-241.
[Pit98b] Pitoura, E., Supporting read-only transactions in wireless broadcast-
ing, Proc. DEXA98 International Workshop on Mobility in Databases
and Distributed Systems, IEEE Computer Society, 1998, 428-433.
[PS98] Pitoura, E., Samaras, G., Data management for mobile computing,
Kluwer Academic Publishers, ISBN 0-7923-8053-3, 1998.
[PSP99] Papastavrou, S., Samaras, G., Pitoura, E., Mobile Agents for WWW
Distributed Database Access, Proc. 15th International Conference on
Data Engineering (ICDE99) , 1999, 228-237.
[Rap96] Rappaport, T.S., Wireless communications: principles and practice,
IEEE Press - Prentice Hall, 1996.
[RPG+96] Reiher, P., Popek, J., Gunter, M., Salomone, J., Ratner, D.,
Peer-to-peer reconciliation based replication for mobile comput-
ers, Proc. European Conference on Object Oriented Program-
ming, 2nd Workshop on Mobility and Replication, http://ficus-
www.cs.ucla.edu/ficus-members/reiher/papers/ecoop.ps. 1996.
[SAG+93] Schilit, B.N., Adams, N., Gold, R., Tso, M., Want, R., The ParcTab
mobile computing system, Proc. 4th IEEE Workshop on Workstation
Operating Systems (WWOS-IV), 1993, 34-39.
[Sat96a] Satyanarayanan, M., Fundamental challenges in mobile computing,
Proc. 15th ACM Symposium on Principles of Distributed Computing,
Philadelphia, PA, 1996, 1-7.
[Sat96b] Satyanarayanan, M., Accessing information on demand at any lo-
cation. Mobile information access, IEEE Personal Communications
3(1), 1996, 26-33.
[SKM+93] Satyanarayanan, M., Kistler, J.J., Mummert, L.B., Ebling, M.R., Ku-
mar, P., Lu, Q., Experience with disconnected operation in a mobile
computing environment, Proc. 1993 USENIX Symposium on Mobile
and Location-Independent Computing, Cambridge, MA, 1993, 11-28.
[SNK+95] Satyanarayanan, M., Noble, B., Kumar, P., Price, M., Application-
aware adaptation for mobile computing, Operating System Review
29(1), 1995, 52-55.
[SNP+97] Shanmugasundaram, J., Nithrakasyap, A., Padhye, J., Sivasankaran,
R., Xiong, M., Ramamritham, K., Transaction processing in broad-
cast disk environments, S. Jajodia, L. Kerschberg (eds.) , Advanced
Transaction Models and Architectures, Kluwer, 1997.
[SP97] Samaras, G., Pitsillides, A., Client/Intercept: a computational model
for wireless environments, Proc. 4th International Conference on
Telecommunications (ICT'97), Melbourne, Australia, 1997, 1205-
1210.
[SRB97] Stathatos, K., Roussopoulos, N., Baras, J.S., Adaptive data broadcast
in hybrid networks, Proc. 23rd VLDB Conference, 1997, 326-335.
[Sys97] Systems, H.N., DirectPC homepage, www.direcpc.com.
[TD91] Tait, C.D., Duchamp, D., Service interface and replica management
algorithm for mobile file system clients, Proc. 1st International Con-
ference on Parallel and Distributed Information Systems, 1991, 190-
197.
[TD92] Tait, C.D., Duchamp, D., An efficient variable-consistency replicated
file service, Proc. USENIX File Systems Workshop, 1992, 111-126.
[TDP+94) Terry, D., Demers, A., Petersen, K., Spreitzer, M., Theimer, M.,
Welch, B., Session guarantees for weakly consistent replicated data,
Proc. International Conference on Parallel and Distributed Informa-
tion Systems, 1994, 14{}-149.
[TLA+95) Tait, C., Lei, H., Acharya, S., Chang, H., Intelligent file hoarding
for mobile computers, Proc. 1st ACM International Conference on
Mobile Cmputing and Networking (Mobicom'g5), Berkeley, 1995, 119-
125.
[TSS+96) Tennenhouse, D.L., Smith, J.M., Sincoskie, W.D., Minden, G.J., A
survey of active network research, IEEE Communication Magazine
35(1), 1996, 8o-B6.
[TTP+95) Terry, D.B., Theimer, M.M., Petersen, K., Demers, A.J., Spreitzer,
M.J., Hauser, C.H., Managing update conflicts in Bayou, a weakly
connected replicated storage system, Pmc. 15th ACM Symposium on
Operating Systems Principles, 1995, 172-183.
(WB97) Welling, G., Badrinath, B.R., A framework for environment aware
mobile applications, Proc. IEEE International Conference on Dis-
tributed Computing Systems, 1997, 384-391.
(WC95) Walborn, G., Chrysanthis, P.K., Supporting semantics-based trans-
action processing in mobile database applications, Proc. 14th Sym-
posium on Reliable Distributed Systems, 1995.
(WC97) Walborn, G., Chrysanthis, P.K., PRO-MOTION: support for mo-
bile database access, Personal Technologies Journal 1(3), Springer-
Verlag, 1997, 171-181.
(Whi96) White, J.E., Mobile agents, General Magic White Paper,
www.genmagic.com/agents.
(Won88) Wong, J. Broadcast delivery, Proc. IEEE 76(12), 1988, 1566-1577.
[WYC96) Wu, K.-L., Yu, P.S., Chen, M.-S., Energy-efficient caching for wire-
less mobile computing, Pmc. 12th International Conference on Data
Engineering (ICDE 96), 1996,336-343.
(YG95) Van, T., Garcia-Molina, H., SIFT - a tool for wide-area information
dissemination, Pmc. 1995 USENIX Technical Conference, 1995, 177-
186.
(YZ94) Yeo, L.H., Zaslavsky, A., Submission of transactions from mobile
workstations in a cooperative multidatabase processing environment,
Proc. 14th International Conference on Distributed Computing Sys-
tems, Poznan, Poland, 1994,372-279.
[ZD97) Zenel, B., Duchamp, D., General purpose proxies: solved and unsolved
problems, Proc. Hot Topics in Operating Systems (Hot-OS VI), 1997,
87-92.
[ZZR+98) Zhou, X.D., Zaslavsky, A., Rasheed, A., Price, R., Efficient object-
oriented query optimisation in mobile computing environment, Aus-
tralian Computer Journal 30, 1998,65-76.
11. Data Mining

Tadeusz Morzy and Maciej Zakrzewicz

Institute of Computing Science, Poznan University of Technology, Poznan, Poland

1. Introduction
2. Mining Associations
   2.1 Mining Association Rules: Formal Problem Definition
   2.2 Basic Algorithm for Association Rules Discovery
   2.3 Quantitative Association Rules
   2.4 Generalized and Multiple-Level Association Rules
   2.5 Other Algorithms for Mining Frequent Itemsets
   2.6 Mining Sequential Patterns: Formal Problem Description
3. Classification and Prediction
   3.1 Classification
   3.2 Classification by Decision Tree Construction
   3.3 The Overfitting Problem
   3.4 Other Classification Methods
   3.5 Classifier Accuracy
   3.6 Prediction
4. Clustering
   4.1 Partitioning Methods
   4.2 Hierarchical Methods
   4.3 Other Clustering Methods
   4.4 Clustering Categorical Attributes
   4.5 Outlier Detection
5. Conclusions

Abstract. Data mining, also referred to as database mining or knowledge discovery
in databases (KDD), is a new research area that aims at the discovery of useful
information from large datasets. Data mining uses statistical analysis and inference
to extract interesting trends and events, create useful reports, support decision
making, etc. It exploits the massive amounts of data to achieve business, operational
or scientific goals.
In this chapter we give an overview of the data mining process and we describe
the fundamental data mining problems: mining association rules and sequential
patterns, classification and prediction, and clustering. Basic algorithms developed
to efficiently process data mining tasks are discussed and illustrated with examples
of their operation on real data sets.

1 Introduction
Recent advances in data capture, data transmission and data storage tech-
nologies have resulted in a growing gap between more powerful database sys-
tems and users' ability to understand and effectively analyze the information
collected. Many companies and organizations gather gigabytes or terabytes
of business transactions, scientific data, web logs, satellite pictures, text re-
ports, which are simply too large and too complex to support a decision
making process. Traditional database and data warehouse querying models
are not sufficient to extract trends, similarities and correlations hidden in
very large databases.
The value of the existing databases and data warehouses can be signif-
icantly enhanced with the help of data mining. Data mining is a new research
area which aims at nontrivial extraction of implicit, previously unknown and
potentially useful information from large databases and data warehouses.
Data mining, sometimes referred to as data dredging, knowledge extraction
or pattern discovery, can help answer business questions that were too time
consuming to resolve with traditional data processing techniques. The pro-
cess of mining the data can be perceived as a new way of querying - with
questions such as "which clients are likely to respond to our next promotional
mailing, and why?".
Data mining aims at the discovery of knowledge that can be potentially
useful and unknown. It is subjective whether the discovered knowledge is
new, useful or interesting, since it depends on the application. Data mining
algorithms can discover large numbers of patterns and rules. To reduce their
number, users may have to impose additional measures and constraints on
the patterns.
Two main types of data mining tasks are description and prediction. The
description consists in automated discovery of previously unknown patterns
which describe the general properties of the existing data. Example applica-
tions include the analysis of retail sales data to identify groups of products
that are often purchased together by customers, fraudulent credit card trans-
action detection, telecommunication network failure detection. The predic-
tion tasks typically attempt to do predictions of trends and behaviors based
on inference on available data. A typical application of a predictive problem
is targeted marketing, where the goal is to identify the targets most likely to
respond to the future mailings. Other predictive problems include customer
retention, promotion design, bankruptcy forecasting. Such applications may
help companies make proactive, knowledge-driven decisions.
Data mining is also popularly known as knowledge discovery in databases
(KDD), however, data mining is actually a part of the knowledge discovery
process. The knowledge discovery process is composed of seven steps that
lead from raw data collection to the new knowledge:
1. Data cleaning (data cleansing), which consists in removal of noise and
irrelevant data from the raw data collection.
2. Data integration, which consists in heterogeneous data source combination
into a common database.
3. Data selection, which consists in retrieving the data relevant to the anal-
ysis.
4. Data transformation (data consolidation), which consists in transform-
ing the selected data into the form which is appropriate for the mining
algorithm.
5. Data mining, which consists in extracting potentially useful patterns from
the data.
6. Pattern evaluation, which consists in identification of interesting patterns.
7. Knowledge representation, which consists in visual presentation of the
discovered patterns to the user to help the user understand and interpret
the data mining results.

Typically, some of the above steps are combined together; for example, data
cleaning and data integration represent a preprocessing phase of data ware-
house generation, while data selection and data transformation can be expressed
by means of a database query.
Depending on the type of patterns extracted, data mining methods are
divided into many categories, where the most important ones are:

• Association analysis: discovery of subsets of items which are most frequently
  co-occurring in the database of item sets. The discovered patterns are
  represented by means of association rules, characterized by two types of
  importance measures, support and confidence.
• Classification: a classification model is built from labeled data collection.
The classification model is then used to classify new objects.
• Clustering: similar to classification, clustering is the organization of data
in classes, however, class labels are unknown and it is the task of the
clustering algorithm to discover acceptable classes.
• Characterization: summarization of general features of objects in a target
class. The general features are represented by means of characteristic
rules.
• Discrimination: comparison of the general features of objects between
two classes, referred to as the target class and the contrasting class. The
differences are represented by means of discriminant rules.
• Outliers analysis: detection of data elements that cannot be grouped in
a given class or cluster.
• Evolution and deviation analysis: analysis of time-related data that
changes over time.

Knowledge discovery is an iterative and interactive process. Once the discovered
patterns and rules are presented, the users can enhance their evaluation
measures, refine the mining, or select new data, in order to get different, more
appropriate results. To support this form of interactivity, several query languages
have been proposed which enable users to declaratively formulate their data

mining problems. The languages employ the concept of a data mining query,
which can be optimized and evaluated by a data mining-enabled database
management system (KDDMS - knowledge discovery management system).

2 Mining Associations

Association rules are an interesting class of database regularities, introduced by
Agrawal, Imielinski, and Swami in [AIS93]. Association rules were originally
formulated in the context of a market basket analysis. Classic market basket
analysis treats the purchase of a number of items (the contents of a shop-
ping basket) as a single transaction. Basket data usually consists of products
bought by a customer along with the date of transaction, quantity, price, etc.
Such data may be collected, for example, at supermarket checkout counters.
The goal is to find trends across large number of purchase transactions that
can be used to understand and exploit natural buying patterns, and repre-
sent the trends in the form of association rules. Association rules identify
the sets of items that are most often purchased together with other sets
of items. For example, an association rule may state that "80% of customers
who bought items A, B, and C also bought D and E". This information may
be used for cross-selling, optimal use of shelf and floor space, effective sales
strategies, target marketing, catalogue design, etc.
An association rule is usually expressed as X → Y, where X and Y
are sets of items. Given a set of products, association rules can predict the
presence of other products in the same transaction to a certain degree of
probability, called confidence. Since confidence does not necessarily describe the
importance of a rule, the actual coverage of a rule is also considered. This
measure is called the support of a rule. For example, consider the following
association rule:
{A, B} → C with 10% support and 60% confidence
This rule states that (1) out of all customers who buy A and B, 60% of them
also buy C, and (2) 10% of the transactions involve the purchase of
A, B, and C. Both support and confidence should be taken into consideration
when assessing the significance of an association rule.
Interesting patterns can also be discovered when time information is
stored in the database. The problem of sequential patterns discovery was
introduced by Agrawal and Srikant in [AS95]. Sequential patterns discovery
consists in analyzing collections of records over a period of time to identify
trends. An example of a sequential pattern that holds in a video rental
database is that customers typically rent "Star Wars", then "Empire Strikes
Back", and then "Return of the Jedi". Note that these rentals need not be
consecutive. Sequential patterns discovery can be also used to detect the set
of customers associated with some frequent buying patterns. Use of sequen-
tial patterns discovery on, for example, a set of insurance claims can lead
to the identification of frequently occurring sequences of medical procedures
applied to patients which can help identify good medical practices as well as
to potentially detect some medical insurance fraud.

2.1 Mining Association Rules: Formal Problem Definition

Let L = {l1, l2, ..., lm} be a set of literals, called items. Let a non-empty set
of items T be called an itemset. Let D be a set of variable length itemsets,
where each itemset T ⊆ L. We say that an itemset T supports an item x ∈ L
if x is in T. We say that an itemset T supports an itemset X ⊆ L if T supports
every item in the set X.
An association rule is an implication of the form X → Y, where X ⊂ L,
Y ⊂ L, X ∩ Y = ∅. Each rule has associated measures of its statistical
significance and strength, called support and confidence. The support of the
rule X → Y in the set D is:

    support(X → Y, D) = |{T ∈ D : T supports X ∪ Y}| / |D|

In other words, the rule X → Y holds in the set D with support s if
s · 100% of itemsets in D support X ∪ Y. Support is an important measure
since it is an indication of the number of itemsets covered by the rule. Rules
with very small support are often unreliable, since they do not represent a
significant portion of the database.
The confidence of the rule X → Y in the set D is:

    confidence(X → Y, D) = |{T ∈ D : T supports X ∪ Y}| / |{T ∈ D : T supports X}|

In other words, the rule X → Y has confidence c if c · 100% of itemsets in
D that support X also support Y. Confidence indicates the strength of the
rule. Unlike support, confidence is asymmetric (confidence(X → Y) ≠
confidence(Y → X)) and non-transitive (the presence of highly confident
rules X → Y and Y → Z does not mean that X → Z will have the minimum
confidence).
The goal of mining association rules is to discover all association rules
having support greater than or equal to some minimum support threshold,
minsup, and confidence greater than or equal to some minimum confidence
threshold, minconf.

Illustrative example of association rules. Consider a supermarket with
a large collection of products. When a customer buys a set of products, the
whole purchase is stored in a database and referred to as a transaction having
a unique identifier, date, and a customer code. Each transaction contains the
set of purchased products together with their quantity and price. An example
of the database of customer transactions is depicted below. The attribute
trans_id represents the transaction identifier, cust_id - the customer code,
product - the purchased product, qty - the quantity, and price - the price.

trans_id  cust_id  product          date      qty  price
1         908723   soda_03          02/22/98  6    0.20
1         908723   potato_chips_12  02/22/98  3    0.99
2         032112   beer_10          02/22/98  4    0.49
2         032112   potato_chips_12  02/22/98  1    0.99
2         032112   diapers_b01      02/22/98  1    1.49
3         504725   soda_03          02/23/98  10   0.20
4         002671   soda_03          02/24/98  6    0.20
4         002671   beer_10          02/24/98  2    0.49
4         002671   potato_chips_12  02/24/98  4    0.99
5         078938   beer_10          02/24/98  2    0.49
5         078938   potato_chips_12  02/24/98  4    0.99
5         078938   diapers_b01      02/24/98  10   1.49

The strongest association rules (minsup = 0.4, minconf = 0.5) that can be
found in the example database are listed below:
beer_10 → potato_chips_12                      support = 0.60  confidence = 1.00
potato_chips_12 → beer_10                      support = 0.60  confidence = 0.75
beer_10 ∧ diapers_b01 → potato_chips_12        support = 0.40  confidence = 1.00
diapers_b01 ∧ potato_chips_12 → beer_10        support = 0.40  confidence = 1.00
diapers_b01 → beer_10 ∧ potato_chips_12        support = 0.40  confidence = 1.00
diapers_b01 → beer_10                          support = 0.40  confidence = 1.00
diapers_b01 → potato_chips_12                  support = 0.40  confidence = 1.00
beer_10 ∧ potato_chips_12 → diapers_b01        support = 0.40  confidence = 0.67
beer_10 → diapers_b01 ∧ potato_chips_12        support = 0.40  confidence = 0.67
beer_10 → diapers_b01                          support = 0.40  confidence = 0.67
soda_03 → potato_chips_12                      support = 0.40  confidence = 0.67
potato_chips_12 → beer_10 ∧ diapers_b01        support = 0.40  confidence = 0.50
potato_chips_12 → diapers_b01                  support = 0.40  confidence = 0.50
potato_chips_12 → soda_03                      support = 0.40  confidence = 0.50

For example, the association rule "beer_10 → potato_chips_12 (support =
0.60, confidence = 1.00)" states that every time the product beer_10 is pur-
chased, the product potato_chips_12 is purchased too and that this pattern
occurs in 60 percent of all transactions. Knowing that 60 percent of customers
who buy a certain brand of beer also buy a certain brand of potato chips can
help the retailer determine appropriate promotional displays, optimal use of
shelf space, and effective sales strategies. As a result of doing this type of
association rules discovery, the retailer might decide not to discount potato
chips whenever the beer is on sale, as doing so would needlessly reduce profits.

2.2 Basic Algorithm for Association Rules Discovery


The first algorithm for association rules discovery was presented in the pa-
per of Agrawal, Imielinski and Swami [AIS93]. The algorithm discovered all
association rules whose support and confidence were greater than some user
specified minimum values. In [HS93], an algorithm called SETM was pro-
posed to solve this problem using relational operators. In [AS94], two new
algorithms called Apriori and AprioriTID were proposed. These algorithms
achieved significant improvements over the previous algorithms and became
the core of many new ones [SA95,HF95,SON95,SA96a,Toiv96,CHN+96]. A
fundamentally new approach, called FP-growth, was introduced in [HPY00].
All the algorithms decompose the problem of mining association rules into
two subproblems:
1. Find all itemsets that have support greater than or equal to the minsup
threshold. These are called frequent itemsets. Frequent itemset discovery is
the most time-consuming operation.
2. Generate highly confident rules from the frequent itemsets (a minimal code
sketch of this step is given below). For each frequent itemset l, find all
non-empty subsets a of l. For each subset a, output a rule of the form
a → (l - a) if support(l)/support(a) is greater than or equal to the minconf
threshold. Notice that if a rule a → (l - a) has a confidence value less than
minconf, then any rule b → (l - b), where b ⊂ a, also has a confidence value
less than minconf. Thus, rule generation begins with an empty head that is
expanded until the confidence value falls below minconf.
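The following minimal Python sketch (illustrative names; it enumerates all subsets rather than using the head-expansion pruning described in point 2) shows how rules can be generated once the frequent itemsets and their supports are known:

from itertools import combinations

def generate_rules(freq, minconf):
    """freq maps each frequent itemset (frozenset) to its support.
    Returns (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for l, sup_l in freq.items():
        for k in range(1, len(l)):                     # non-empty proper subsets a of l
            for a in map(frozenset, combinations(l, k)):
                conf = sup_l / freq[a]                 # support(l)/support(a); a is frequent by anti-monotonicity
                if conf >= minconf:
                    rules.append((a, l - a, sup_l, conf))
    return rules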

Frequent itemset discovery. The goal of frequent itemset discovery is
to find all itemsets in D that satisfy the minsup threshold. For practical
applications, looking at all subsets of L is infeasible because of the huge search
space (there are 2^|L| - 1 possible subsets of L, while |L| tends to be 1,000-
100,000). The search space forms a lattice, visualized in Figure 2.1 for the
special case of L = {A, B, C, D}. The lattice is a systematic enumeration of
all the subsets of L, starting with the empty itemset, followed by all singleton
itemsets at the first level, all two-item itemsets at the second level, etc. In
the lattice structure, a k-item itemset appears at the kth level of the lattice
and is linked to all its (k - 1)-item subsets appearing at level k - 1. An
interesting property of an itemset is anti-monotonicity. This property says
that the support of an itemset can never be larger than the minimum support
of its subsets. In other words, all subsets of a frequent itemset must also be
frequent. The anti-monotonicity property can be used to prune the search
space [AS94]. For example, in Figure 2.1, if the support of the itemset {A, C} is
below minsup, then any of its supersets ({A, B, C}, {A, C, D}, {A, B, C, D})
will be infrequent. Thus, the entire subgraph containing the supersets of
{A, C} can be pruned immediately, reducing the size of the search space.
For dense databases, the number of all frequent itemsets can be extremely
large. Therefore, instead of discovering all frequent itemsets, sometimes we

{A}   {B}   {C}   {D}

{A,B}   {A,C}   {A,D}   {B,C}   {B,D}   {C,D}

{A,B,C}   {A,B,D}   {A,C,D}   {B,C,D}

{A,B,C,D}

Fig. 2.1. Lattice for L = {A, B, C, D}

are interested in finding maximal or closed frequent itemsets only. A frequent
itemset is called maximal if it is not a subset of any other frequent itemset.
The set of all maximal frequent itemsets is called the positive border. A
frequent itemset X is called closed if there exists no proper superset Y ⊃ X
with support(X) = support(Y). Generally, the number of closed frequent
itemsets can be orders of magnitude smaller than the number of all frequent
itemsets, while the number of maximal frequent itemsets can be orders of
magnitude smaller than the number of closed frequent itemsets. Thus, the
maximal and closed frequent itemsets allow us to compress the set of all
frequent itemsets. However, the closed sets are lossless in the sense that the
exact support of all frequent itemsets can be determined, while the maximal
sets lead to a loss of information.
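To make the distinction concrete, here is a short Python sketch (illustrative, not from the text) that derives the maximal and the closed itemsets from a dictionary of all frequent itemsets and their supports:

def maximal_and_closed(freq):
    """freq maps each frequent itemset (frozenset) to its support."""
    maximal, closed = [], []
    for x in freq:
        supersets = [y for y in freq if x < y]           # proper frequent supersets of x
        if not supersets:
            maximal.append(x)                            # no frequent superset at all
        if all(freq[y] != freq[x] for y in supersets):
            closed.append(x)                             # no superset with equal support
    return maximal, closed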
The itemset lattice can be decomposed into smaller, independent pieces,
which can fit in memory. Assuming a lexicographical ordering of items inside
itemsets, we say that an itemset Y is the k-length prefix for the itemset
X = {l1, l2, ..., lm}, k < m, if Y = {l1, l2, ..., lk}. For example, given the
itemset X = {beer_10, soda_03}, its 1-length prefix is {beer_10}. Using these
definitions, we can divide the itemset lattice into prefix-based sublattices.
Figure 2.2 shows the lattice decomposition using 1-length prefixes.

Frequent itemset discovery algorithms. A number of algorithms have
been proposed for association mining [AIS93, AS94, AS96, BMU+97, HF95,
HPY00, HS93, MTV94, SA95, SA96a, SON95, Toiv96, Zak98]. The algorithms
can be divided into two broad groups, according to the database layout used:

1. Horizontal mining algorithms, also called row-wise algorithms.


2. Vertical mining algorithms, also called column-wise algorithms.

Fig. 2.2. 1-length prefix sublattices for L = {A, B, C, D}

Horizontal mining algorithms assume that the database rows represent trans-
actions and each transaction consists of a set of items. Vertical mining algo-
rithms assume that the database rows represent items and with each item we
associate a set of transaction identifiers for the transactions that contain this
item. The two layouts of the database from the previous example are shown
in Figure 2.3.

Horizontal database layout:
tid  items
1    soda_03, potato_chips_12
2    beer_10, potato_chips_12, diapers_b01
3    soda_03
4    soda_03, beer_10, potato_chips_12
5    beer_10, potato_chips_12, diapers_b01

Vertical database layout:
item             tidlist
beer_10          2, 4, 5
diapers_b01      2, 5
potato_chips_12  1, 2, 4, 5
soda_03          1, 3, 4

Fig. 2.3. Horizontal vs. vertical database layout

The two groups of algorithms also differ in support counting methods.


In order to determine itemsets' support values, horizontal mining algorithms
must directly count their occurrences in the database. For that purpose a
counter is set up and initialized to zero for each itemset that is currently un-
der investigation. Then all transactions are scanned and whenever one of the
investigated itemsets is recognized as a subset of a transaction, its counter
is incremented. Vertical mining algorithms can use "tid list" set intersections
to find the identifiers of all transactions that contain the itemset, and then
evaluate the size of the resulting "tidlist" set to find the support. Generally,

horizontal mining algorithms perform better for shorter frequent itemsets,
while vertical mining algorithms are especially suited to discover long pat-
terns.

Horizontal mining: Apriori algorithm. The algorithm called Apriori
employs the anti-monotonicity property to discover all frequent itemsets. We
assume that items in each itemset are kept sorted in their lexicographic order.
The Apriori algorithm iteratively finds all possible itemsets that have support
greater or equal to a given minimum support value (minsup). The first pass of
the algorithm counts item occurrences to determine the frequent 1-itemsets
(each 1-itemset contains exactly one item). In each of the next passes, the
frequent itemsets Lk-1 found in the (k-1)th pass are used to generate the
candidate itemsets Ck, using the apriori-gen function described below. Then,
the database is scanned and the support of candidates in Ck is counted.
The output of the first phase of the Apriori algorithm consists of a set of k-
itemsets (k = 1,2, ... ), that have support greater or equal to a given minimum
support value. Figure 2.4 presents a formal description of this part of the
algorithm.

L1 = frequent 1-itemsets;
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = apriori-gen(Lk-1);
    forall transactions t ∈ D do begin
        Ct = subset(Ck, t);
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ minsup};
end
Answer = ∪k Lk;

Fig. 2.4. Frequent itemset generation phase of the Apriori algorithm
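A compact Python rendering of this loop (a simplification with illustrative names: candidate occurrences are counted with a plain subset test instead of the hash tree discussed below, and apriori_gen is sketched after its description):

from collections import defaultdict

def apriori(D, minsup):
    """D: list of transactions (sets of items); minsup: minimum support as a fraction.
    Returns a dict mapping each frequent itemset (frozenset) to its support."""
    min_count = minsup * len(D)
    counts = defaultdict(int)
    for t in D:                                   # first pass: count 1-itemsets
        for item in t:
            counts[frozenset([item])] += 1
    Lk = {c for c, n in counts.items() if n >= min_count}
    freq = {c: counts[c] / len(D) for c in Lk}
    k = 2
    while Lk:
        Ck = apriori_gen(Lk, k)                   # candidate k-itemsets (see sketch below)
        counts = defaultdict(int)
        for t in D:                               # one database scan per level
            for c in Ck:
                if c <= t:                        # naive subset() in place of the hash tree
                    counts[c] += 1
        Lk = {c for c in Ck if counts[c] >= min_count}
        freq.update({c: counts[c] / len(D) for c in Lk})
        k += 1
    return freq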

In the algorithm Apriori, candidate itemsets Ck are generated from pre-
viously found frequent itemsets Lk-1, using the apriori-gen function. The
apriori-gen function works in two steps: (1) the join step and (2) the prune
step. First, in the join step, frequent itemsets from Lk-1 are joined with other
frequent itemsets from Lk-1 in the following SQL-like manner:

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1
  and p.item2 = q.item2
  ...
  and p.itemk-2 = q.itemk-2
  and p.itemk-1 < q.itemk-1;

Next, in the prune step, each itemset c ∈ Ck such that some (k-1)-subset
of c is not in Lk-1 is deleted:

forall itemsets c ∈ Ck do
    forall (k-1)-subsets s of c do
        if (s ∉ Lk-1) then delete c from Ck;

The set of candidate k-itemsets Ck is then returned as the result of the function
apriori-gen.
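A matching sketch of the apriori_gen helper assumed in the loop above (illustrative; items are compared in lexicographic order, as the text assumes):

from itertools import combinations

def apriori_gen(Lk_1, k):
    """Candidate k-itemsets from the set Lk_1 of frequent (k-1)-itemsets (frozensets)."""
    candidates = set()
    for p in Lk_1:                                # join step
        for q in Lk_1:
            sp, sq = sorted(p), sorted(q)
            if sp[:k - 2] == sq[:k - 2] and sp[k - 2] < sq[k - 2]:
                candidates.add(p | q)
    return {c for c in candidates                 # prune step
            if all(frozenset(s) in Lk_1 for s in combinations(c, k - 1))}

On the five-transaction database of the worked example below, these two sketches reproduce the frequent itemsets L1, L2 and L3 derived step by step in the next paragraphs.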
After the candidate itemsets have been generated, we need to compute
their support in the database. This is the most time-consuming part of the
algorithm. The Apriori algorithm uses a hash tree structure to store the
candidates and to efficiently match transactions against the candidates. The
subset() function returns the current candidate itemsets that are contained in
the given transaction. To illustrate the idea of using a hash tree for can-
didate matching, consider the example in Figure 2.5. Given the set of can-
didates C3 = {{1,4,5}, {1,2,3}, {2,3,4}, {5,6,7}, {1,6,7}, {3,6,7}, {6,7,8}}
and the transaction t = {1, 2, 4, 5}, we want to find which candidates are
contained in the transaction. Starting from the root node, the subset() func-
tion finds all candidates contained in t as follows. If we are at a leaf, we find
which of the candidate itemsets in the leaf are contained in t. If we are at
an interior node and we have reached it by hashing the item i, we hash on
each item that comes after i in t and recursively apply this procedure. In our
example we begin from the root node with "1", so we follow the arc: 1 mod
3 = 1. Then, from the node n2, we follow the arcs: 2 mod 3 = 2, 4 mod 3 =
1, 5 mod 3 = 2. In this way, the leaves l5 and l6 are visited. Next, we start
again from the root node with "2", so we follow the arc: 2 mod 3 = 2. This
takes us to the leaf l7. In the next step we start from the root node with "4"
and we follow the arc 4 mod 3 = 1 again. Then, from the node n2, we follow
the arc 5 mod 3 = 2, which takes us to the leaf l6. Finally, we start from the
root node with "5", following the arc 5 mod 3 = 2, and we visit the leaf l7.

Simple example of the Apriori algorithm execution. Consider the
following illustrative example of the operation of the Apriori algorithm. Assume that


Fig. 2.5. The candidate hash tree structure

minimum support is 0.30 and minimum confidence is 0.70. The database D
is presented in the figure below.

trans_id  products
1         soda_03, potato_chips_12
2         beer_10, potato_chips_12, diapers_b01
3         soda_03
4         soda_03, beer_10, potato_chips_12
5         beer_10, potato_chips_12, diapers_b01

The first pass of the algorithm Apriori counts product occurrences to deter-
mine the frequent 1-itemsets L1. Each product that is contained in at least
2 purchase transactions (at least 30% of all five transactions) becomes a frequent
1-itemset. All 1-itemsets together with their support values are listed below:

itemset          support
beer_10          0.60
diapers_b01      0.40
potato_chips_12  0.80
soda_03          0.60

Next, the apriori-gen function is used to generate candidate 2-itemsets. Each
candidate 2-itemset consists of two products from L1. The set of candidate
2-itemsets together with their support values is presented below:

itemset                       support
beer_10, diapers_b01          0.40
beer_10, potato_chips_12      0.60
beer_10, soda_03              0.20
diapers_b01, potato_chips_12  0.40
diapers_b01, soda_03          0.00
potato_chips_12, soda_03      0.40

The set L2 of frequent 2-itemsets consists of those candidate 2-itemsets from
C2 whose support is at least 0.30:

itemset                       support
beer_10, diapers_b01          0.40
beer_10, potato_chips_12      0.60
diapers_b01, potato_chips_12  0.40
potato_chips_12, soda_03      0.40

In the next step, the apriori-gen function is used again, this time to generate
the candidate 3-itemsets C3. Each candidate 3-itemset is a superset of frequent
2-itemsets and every one of its subsets is contained in L2. The set of candidate
3-itemsets contains only one itemset and is presented below:

itemset                                 support
beer_10, diapers_b01, potato_chips_12   0.40

The set L3 of frequent 3-itemsets consists of this only itemset, because its
support is at least 0.30:

itemset                                 support
beer_10, diapers_b01, potato_chips_12   0.40
When we use the apriori-gen function to generate the set of candidate 4-itemsets
C4 from L3, it turns out to be empty, and the first phase of the algorithm
terminates. The output of the first phase of the algorithm consists of the set of
frequent 1-itemsets L1, 2-itemsets L2, and 3-itemsets L3.

Now, the frequent itemsets will be used to generate the desired association
rules. The frequent 1-itemsets from L1 will actually not be used for asso-
ciation rule generation directly, since each association rule must consist of at
least 2 items. However, those frequent 1-itemsets may be needed to compute
association rule confidence values. From the 2-itemsets of L2 the following
2-item association rules will be generated:

source 2-itemset              supp  generated rule                  conf
beer_10, diapers_b01          0.40  beer_10 → diapers_b01           0.67
beer_10, diapers_b01          0.40  diapers_b01 → beer_10           1.00
beer_10, potato_chips_12      0.60  beer_10 → potato_chips_12       1.00
beer_10, potato_chips_12      0.60  potato_chips_12 → beer_10       0.75
diapers_b01, potato_chips_12  0.40  diapers_b01 → potato_chips_12   1.00
diapers_b01, potato_chips_12  0.40  potato_chips_12 → diapers_b01   0.50
potato_chips_12, soda_03      0.40  potato_chips_12 → soda_03       0.50
potato_chips_12, soda_03      0.40  soda_03 → potato_chips_12       0.67

From the only frequent 3-itemset of L3, the following 3-item association rules
will be generated:

source 3-itemset                        supp  generated rule                            conf
beer_10, diapers_b01, potato_chips_12   0.40  beer_10 ∧ diapers_b01 → potato_chips_12   1.00
beer_10, diapers_b01, potato_chips_12   0.40  beer_10 ∧ potato_chips_12 → diapers_b01   0.67
beer_10, diapers_b01, potato_chips_12   0.40  diapers_b01 ∧ potato_chips_12 → beer_10   1.00
beer_10, diapers_b01, potato_chips_12   0.40  diapers_b01 → beer_10 ∧ potato_chips_12   1.00
beer_10, diapers_b01, potato_chips_12   0.40  potato_chips_12 → beer_10 ∧ diapers_b01   0.50
Notice that the association rule "beer_10 → diapers_b01 ∧ potato_chips_12"
has not even been generated, because the rule "beer_10 ∧ potato_chips_12 →
diapers_b01" did not have the minimum confidence, and moving further items
from the antecedent into the head can only lower the confidence.
Finally, all the generated association rules are filtered and only the rules
with minimum confidence (≥ 70%) form the result of the algorithm Apriori:

association rule                            support  confidence
diapers_b01 → beer_10                       0.40     1.00
beer_10 → potato_chips_12                   0.60     1.00
diapers_b01 → beer_10 ∧ potato_chips_12     0.40     1.00
diapers_b01 → potato_chips_12               0.40     1.00
beer_10 ∧ diapers_b01 → potato_chips_12     0.40     1.00
diapers_b01 ∧ potato_chips_12 → beer_10     0.40     1.00
potato_chips_12 → beer_10                   0.60     0.75

Horizontal mining: FP-Growth algorithm. The algorithm called
FP-Growth (Frequent Pattern Growth) divides the problem of frequent-pattern
discovery into two steps:

1. Compress the database to the form of a so-called FP-tree (frequent pattern
tree), where non-frequent items are skipped and similar transactions are
merged.
2. Mine the FP-tree using a pattern fragment growth method which avoids
the generation of a large number of candidate itemsets.

In the first step, the database is initially scanned to discover all frequent
items. Then, all non-frequent items are removed from each transaction and
the frequent items are sorted in frequency descending order. Next, all the
transactions are mapped to paths in an FP-tree.
An FP-tree consists of one root node labeled as "null" and of regular nodes,
each containing a frequent item and an integer. Each path in the FP-tree
originating from the root represents a set of transactions containing identical
frequent items. The last node of the path contains the number of supporting
transactions. To facilitate tree traversal, an item header table is built, in
which each item points, via links, to its first occurrence in the tree. Nodes
with the same item name are linked in sequence via links.
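A minimal Python sketch of this construction step (illustrative class and function names, not the authors' code; the header table is kept as a plain dictionary of node lists):

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}                        # item -> child FPNode

def build_fp_tree(D, minsup):
    """Build an FP-tree from the transactions in D; returns (root, header table)."""
    min_count = minsup * len(D)
    counts = Counter(item for t in D for item in t)
    frequent = {i for i, n in counts.items() if n >= min_count}
    root, header = FPNode(None, None), defaultdict(list)
    for t in D:
        # drop non-frequent items and sort the rest in frequency-descending order
        items = sorted((i for i in t if i in frequent), key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])   # node link for the header table
            node = node.children[item]
            node.count += 1                                # one more transaction on this path
    return root, header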
In the second step, the FP-tree is mined in order to find all frequent
itemsets. The mining algorithm is based on the property that for any frequent
item a, all possible frequent itemsets that contain a can be obtained by
following a's node links, starting from a's head in the FP-tree header. The
algorithm uses the concepts of a transformed prefix path, conditional pattern
base and conditional FP-tree. A transformed prefix path is the prefix subpath
of node a, with the frequency count of its nodes adjusted to the same value as
the count of node a. A conditional pattern base of a is a small database
of transformed prefix paths of a. A conditional FP-tree of a is the FP-tree
created over the conditional pattern base of a. The complete algorithm is
given in Figure 2.6. It starts with Tree = FP-tree and α = null.

procedure FP-growth(Tree, α) {
    if Tree contains a single path P
    then for each combination β of the nodes in P do
        generate itemset β ∪ α with support = minimum support of nodes in β
    else for each ai in the header of Tree do {
        generate itemset β = ai ∪ α with support = support(ai);
        construct β's conditional pattern base;
        construct β's conditional FP-tree Treeβ;
        if Treeβ ≠ ∅ then FP-growth(Treeβ, β);
    }
}

Fig. 2.6. FP-growth algorithm



Simple example of the FP-Growth algorithm execution. Consider the
following illustrative example of the operation of the FP-Growth algorithm. Assume
that minimum support is 0.30 and minimum confidence is 0.70. The database
D is presented in the figure below.
trans_id  products
1         soda_03, potato_chips_12, film_k100
2         beer_10, potato_chips_12, diapers_b01
3         soda_03, beer_25
4         soda_03, beer_10, potato_chips_12
5         beer_10, potato_chips_12, diapers_b01

First, a scan of the database derives a list of frequent items (ordered in
descending order of frequency):

item             count
potato_chips_12  4
beer_10          3
soda_03          3
diapers_b01      2

Next, all non-frequent items are removed from the original database and the
frequent items inside each transaction are sorted in frequency descending
order. Thus, the database looks like the following:

trans_id  products
1         potato_chips_12, soda_03
2         potato_chips_12, beer_10, diapers_b01
3         soda_03
4         potato_chips_12, beer_10, soda_03
5         potato_chips_12, beer_10, diapers_b01

The root of the FP-tree is created and labeled with "null". We scan the
database again. The scan of the first transaction leads to the construction
of the first branch of the tree: ((potato_chips_12 : 1)(soda_03 : 1)). The
number after ":" is the support counter. For the second transaction, since
its frequent items list shares a common prefix (potato_chips_12) with the
existing path, the count of the first node is incremented by 1, and two new
nodes are created and linked as a child path of (potato_chips_12). For the
third transaction, a new branch of the tree is constructed: ((soda_03 : 1)).
For the fourth transaction, since its frequent items list shares a common
prefix (potato_chips_12, beer_10) with an existing path, the count of each
node along the prefix is incremented by 1, and one new node (soda_03) is
created and linked as a child of (beer_10). For the last transaction, since it
can be completely mapped to an existing path ((potato_chips_12 : 3)(beer_10 :
2)(diapers_b01 : 1)), the count of each node along the path is incremented by
1. The resulting FP-tree with the associated item links is shown in Figure 2.7.


Fig. 2.7. Example FP-tree

After having created the FP-tree, we mine the tree to discover all fre-
quent itemsets. First, we collect all the transactions in which diapers_b01 partic-
ipates. This item derives a single path in the FP-tree. The path indicates
that the itemsets {potato_chips_12, beer_10, diapers_b01}, {potato_chips_12,
diapers_b01}, and {beer_10, diapers_b01} appear twice in the database (sup-
port = 0.4). Next, we collect all the transactions in which soda_03 participates.
This item derives three paths in the FP-tree: ((potato_chips_12 : 4)(soda_03 :
1)), ((potato_chips_12 : 4)(beer_10 : 3)(soda_03 : 1)), ((soda_03 : 1)). To find
which items appear together with soda_03, we build a conditional database:
{{potato_chips_12} : 1, {potato_chips_12, beer_10} : 1} and mine it recur-
sively. We find that potato_chips_12 appears twice, so the discovered frequent
itemset is {potato_chips_12, soda_03} (support = 0.4). Next, we collect all the
transactions in which beer_10 participates. This item derives a single path in the
FP-tree. The path indicates that the itemset {potato_chips_12, beer_10} ap-
pears three times in the database (support = 0.6). Finally, we skip the path
from the item potato_chips_12 since it is a single-item path.
The result of the FP-Growth algorithm is the following:

itemset                                 support
potato_chips_12                         0.80
beer_10                                 0.60
soda_03                                 0.60
diapers_b01                             0.40
potato_chips_12, beer_10, diapers_b01   0.40
potato_chips_12, diapers_b01            0.40
beer_10, diapers_b01                    0.40
potato_chips_12, soda_03                0.40
potato_chips_12, beer_10                0.60

Vertical mining: Eclat algorithm. The algorithm called Eclat (Equiva-
lence Class Transformation) employs prefix-based classes to reduce the search
space. We assume that items in each itemset are kept sorted in their lexi-
cographic order. The Eclat algorithm recursively merges discovered frequent
itemsets and uses "tidlists" to evaluate support. When two frequent (k - 1)-
itemsets are merged to form a candidate k-itemset, their "tidlists" are inter-
sected to form the "tidlist" of the new candidate. Due to significant compu-
tational complexity of 1-itemset and 2-itemset discovery, the Eclat algorithm
usually starts with known frequent 2-itemsets (discovered, e.g., by Apriori).
The Eclat algorithm is shown in Figure 2.8.

Eclat(Sk-1):
    forall itemsets Ia, Ib ∈ Sk-1, a < b do begin
        C = Ia.tidlist ∩ Ib.tidlist;
        if (|C| ≥ minsup)
            add C to Lk
    end
    Partition Lk into prefix-based (k-1)-length prefix classes
    foreach class Sk in Lk
        Eclat(Sk);
    end
Answer = ∪k Lk;

Fig. 2.8. Frequent itemset generation phase of the Eclat algorithm
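A recursive Python sketch of the same tidlist-intersection idea (illustrative names; unlike the figure, it starts from frequent 1-itemsets so that it is self-contained, and it carries each itemset together with its tidset):

def eclat(items, min_count, prefix=frozenset(), freq=None):
    """items: list of (itemset, tidset) pairs forming one prefix-based class,
    each already frequent. Collects every frequent itemset with its tidset in freq."""
    if freq is None:
        freq = []
    for i, (itemset_a, tids_a) in enumerate(items):
        freq.append((prefix | itemset_a, tids_a))
        suffix = []                                   # extensions of itemset_a
        for itemset_b, tids_b in items[i + 1:]:
            tids = tids_a & tids_b                    # tidlist intersection
            if len(tids) >= min_count:
                suffix.append((itemset_b, tids))
        if suffix:
            eclat(suffix, min_count, prefix | itemset_a, freq)
    return freq

# Vertical layout of the Eclat example database, minsup = 0.30 (at least 2 of 5 tids)
vertical = {
    "beer_10": {2, 4, 5},
    "diapers_b01": {2, 5},
    "potato_chips_12": {1, 2, 4, 5},
    "soda_03": {1, 2, 3, 4, 5},
}
items = [(frozenset([i]), vertical[i]) for i in sorted(vertical)]
frequent = eclat(items, min_count=2)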

Simple example of the Eclat algorithm execution. Consider the fol-
lowing illustrative example of the operation of the Eclat algorithm. Assume that
minimum support is 0.30. The database D is presented in the figure below.

item             tidlist
beer_10          2, 4, 5
diapers_b01      2, 5
potato_chips_12  1, 2, 4, 5
soda_03          1, 2, 3, 4, 5

First, the Eclat algorithm uses an auxiliary horizontal mining algorithm to
discover all frequent 1-itemsets and 2-itemsets.

itemset          support
beer_10          0.60
diapers_b01      0.40
potato_chips_12  0.80
soda_03          1.00

itemset                        tidlist     support
beer_10 ∩ diapers_b01          2, 5        0.40
beer_10 ∩ potato_chips_12      2, 4, 5     0.60
beer_10 ∩ soda_03              2, 4, 5     0.60
diapers_b01 ∩ potato_chips_12  2, 5        0.40
diapers_b01 ∩ soda_03          2, 5        0.40
potato_chips_12 ∩ soda_03      1, 2, 4, 5  0.80

Next, the frequent 2-itemsets L2 are divided into 1-length prefix-based classes.
All the frequent 2-itemsets that begin with {beer_10} form the first class, all
the frequent 2-itemsets that begin with {diapers_b01} form the second class,
and all the frequent 2-itemsets that begin with {potato_chips_12} form the
third class.

L2 divided into 1-length prefix classes

class prefix         frequent itemsets
{beer_10}            {beer_10, diapers_b01}
                     {beer_10, potato_chips_12}
                     {beer_10, soda_03}
{diapers_b01}        {diapers_b01, potato_chips_12}
                     {diapers_b01, soda_03}
{potato_chips_12}    {potato_chips_12, soda_03}

Next, for each of the classes we recursively call the Eclat algorithm. The
algorithm merges pairs of frequent itemsets inside each class to generate
new potentially large itemsets and then evaluates their supports by counting
"tidlist" items. Here, we have found four frequent 3-itemsets.

itemset                                   tidlist   support
beer_10 ∩ diapers_b01 ∩ potato_chips_12   2, 5      0.40
beer_10 ∩ diapers_b01 ∩ soda_03           2, 5      0.40
beer_10 ∩ potato_chips_12 ∩ soda_03       2, 4, 5   0.60
diapers_b01 ∩ potato_chips_12 ∩ soda_03   2, 5      0.40

Next, the frequent 3-itemsets from each 1-length prefix-based class
are recursively divided into 2-length prefix-based classes. For the 1-length
prefix-based class {beer_10} we generate two new 2-length prefix-based
classes: {beer_10, diapers_b01} and {beer_10, potato_chips_12}. For the 1-
length prefix-based class {diapers_b01} we generate a new 2-length prefix-
based class: {diapers_b01, potato_chips_12}. The 1-length prefix-based class
{potato_chips_12} is not decomposed further.

L3 divided into 2-length prefix classes

class prefix                      frequent itemsets
{beer_10, diapers_b01}            {beer_10, diapers_b01, potato_chips_12}
                                  {beer_10, diapers_b01, soda_03}
{beer_10, potato_chips_12}        {beer_10, potato_chips_12, soda_03}
{diapers_b01, potato_chips_12}    {diapers_b01, potato_chips_12, soda_03}

2.3 Quantitative Association Rules


Traditional association rules describe associations between occurrences of sin-
gle items. However, relational tables in most business and scientific domains
have richer structure: each item can be described by many categorical (e.g.
zip code, product code, marital status) and quantitative attributes (e.g. age,
income, price). The association rules that refer to many different attributes
are often called quantitative association rules or multi-attribute association
rules. The problem of mining quantitative association rules and an algorithm
to discover such rules were introduced by Srikant and Agrawal in [SA96a]. For
illustration of quantitative association rules, Figure 2.9 shows a supermar-
ket database table Customers with three non-key attributes. The attributes
Income and Num_trans are quantitative, whereas Married is a categorical
attribute. Let us assume that we want to discover quantitative association
rules between attributes of individual customers. Example quantitative asso-
ciation rules found in this table are: "income = 27 K .. 34K - t married = No
(support = 0.60,confidence = 0.67)" and "married = Yes 1\ income =
OK.. 5IK - t num_trans = 0. .46 (support = 0.40, confidence = 1.00)". The
first quantitative association rule states that customers with income between
27K and 34K are usually not married, the second rule states that married
customers with income below 5lK do not visit the supermarket too often
(less than 46 times).
In order to discover quantitative association rules, the quantitative at-
tribute ranges must be partitioned into discrete intervals, and then any of
the presented algorithms can be used. The problems of discretization were
addressed in [SA96a].
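As a simple illustration of this discretization step (a sketch with hypothetical interval boundaries; the partitioning method of [SA96a] is considerably more elaborate), quantitative values can be mapped to interval "items" before an ordinary mining algorithm is run:

def discretize(attribute, value, boundaries):
    """Map a quantitative value to an interval item such as 'income=27K..34K'."""
    for low, high in boundaries[attribute]:
        if low <= value <= high:
            return f"{attribute}={low}K..{high}K"
    return f"{attribute}=other"

# hypothetical interval boundaries (in thousands) for the income attribute
boundaries = {"income": [(0, 26), (27, 34), (35, 51), (52, 100)]}

row = {"married": "No", "income": 28, "num_trans": 115}
items = {f"married={row['married']}", discretize("income", row["income"], boundaries)}
# items == {'married=No', 'income=27K..34K'}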

2.4 Generalized and Multiple-Level Association Rules


In many applications, interesting associations between items often occur at a
relatively high concept level. For example, besides discovering that "40% of

cust_id  married  income  num_trans
002671   No       28K     115
032112   Yes      50K     45
078938   Yes      60K     436
504725   No       33K     230
908723   Yes      32K     21

Fig. 2.9. The Customers table with quantitative and categorical attributes

customers who purchase soda_03 also purchase potato_chips_12", it could be


informative to also show that "100% of customers who purchase any of bever-
ages also purchase potato_chips_12". Such rules utilize conceptual hierarchy
information and are called generalized association rules or multiple-level asso-
ciation rules. It is worth mentioning that generalized association rules cannot
be inferred from simple association rules that hold in a database. The rea-
son is that the support value for an item in a conceptual hierarchy is not
equal to the sum of the support values of its children, since several of the
children could be present in a single itemset. The first algorithms for discovering
generalized association rules were introduced by Han and Fu in [HF95] and
by Srikant and Agrawal in [SA95]. The general idea behind the algorithms
is to replace each itemset T in the database with an "extended itemset" T',
where T' contains all the items from T as well as the conceptual ancestors
of each item in T. Then, a slightly modified and optimized association rules
discovery algorithm (e.g. Apriori) can be used to find generalized association
rules. Some trivial association rules discovered this way should be pruned,
e.g. rules of the form "item → ancestor(item)" are always true with 100%
confidence, hence redundant.

products
  ├─ beverages
  ├─ candy
  └─ children

Fig. 2.10. An example of conceptual hierarchy for finding generalized association
rules
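A minimal Python sketch of the "extended itemset" idea (illustrative; the parent mapping below is an assumption that reproduces the extended transactions shown next, and the hierarchy root "products" is deliberately not added):

# assumed item-to-parent mapping for the hierarchy of Figure 2.10 (root omitted)
parent = {
    "soda_03": "beverages",
    "beer_10": "beverages",
    "potato_chips_12": "candy",
    "diapers_b01": "children",
}

def extend(itemset):
    """Add every conceptual ancestor of every item; duplicates disappear in the set."""
    extended = set(itemset)
    for item in itemset:
        while item in parent:
            item = parent[item]
            extended.add(item)
    return extended

# extend({"soda_03", "potato_chips_12"}) ==
#     {"soda_03", "potato_chips_12", "beverages", "candy"}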

Consider the database from Section 2.1 and the conceptual hierarchy in
Figure 2.10 for a simple example. Assume that minimum support for rules
to discover is 0.75 and minimum confidence is 0.90. First, every itemset in
the database is replaced with an extended itemset and any duplicates are
removed. It results in the following contents of the database:

trans_id  product
1         beverages
1         candy
1         potato_chips_12
1         soda_03
2         beer_10
2         beverages
2         candy
2         children
2         diapers_b01
2         potato_chips_12
3         beverages
3         soda_03
4         beer_10
4         beverages
4         candy
4         potato_chips_12
4         soda_03
5         beer_10
5         beverages
5         candy
5         children
5         diapers_b01
5         potato_chips_12

The derivation of large itemsets is shown in Figure 2.11. The first pass of
the algorithm simply counts item occurrences to determine large 1-itemsets.
Notice that the itemsets can contain items from the leaves of the conceptual
hierarchy or from interior nodes. In a subsequent pass, candidate itemsets
C2 are generated and the database is scanned to count their support. We
can also prune every itemset that consists of an item and its ancestor (not
to generate redundant rules). Then the algorithm ends, since no candidate
3-itemsets can be generated. Finally, the association rules are generated from
the large itemsets. The discovered generalized association rules are presented
in Figure 2.12.

2.5 Other Algorithms for Mining Frequent Itemsets


Database scan reduction. A disadvantage of the presented Apriori algo-
rithm is that it requires K or K + 1 passes over the database to generate
all large itemsets, where K is the size of the greatest large itemset found.
To solve this problem, a number of Apriori extensions have been proposed

Large 1-itemsets:
itemset          support
beverages        1.00
candy            0.80
potato_chips_12  0.80

Candidate 2-itemsets:
itemset                     support
beverages, candy            0.80
beverages, potato_chips_12  0.80
candy, potato_chips_12      0.80

Large 2-itemsets:
itemset                     support
beverages, candy            0.80
beverages, potato_chips_12  0.80

Fig. 2.11. Generation of large itemsets for finding generalized association rules

association rule               support  confidence
candy → beverages              0.80     1.00
potato_chips_12 → beverages    0.80     1.00

Fig. 2.12. Example generalized association rules

in the literature. A common way of reducing the database activity is to use
a random sample of the database and to find approximate association rules
in this sample. This can be very useful, because samples small enough to be
handled totally in main memory can give reasonably accurate results. More-
over, approximate association rules discovered from the sample can be used
to adjust parameters for a more complete discovery phase. However, although
sampling can be very effective, it is often necessary to know the support and
confidence values of association rules exactly. When relying on results from
sampling alone, there is a risk of losing valid association rules because their
support in the sample is below the user-specified minimum value. In [Toiv96],
Toivonen presents an algorithm that employs sampling, yet discovers exact
association rules. The key idea is to pick a random sample first, use it to dis-
cover approximate association rules that probably hold in the whole database,
and then to verify the results with the rest of the database. If all association
rules happen to be discovered from the sample, then the algorithm requires
only one full pass over the database. In case, where the sample does not
discover all association rules, the algorithm can find the missing rules in a
second pass.
Another way of reducing the number of database passes was proposed by
Savasere, Omiecinski and Navathe in the algorithm called Partition [SON95].
Partition algorithm is similar to Apriori but it discovers all association rules
in only two scans over the database. The algorithm executes in two phases.
In the first phase, the database is logically divided into a number of non-
overlapping partitions. Then, for each partition, all large itemsets are gener-
ated. At the end of the first phase, the large itemsets from all the partitions
are merged into a set of all potentially large itemsets. In the second phase
of the algorithm, the actual support values for these itemsets are computed.

The partition sizes are chosen such that each partition can be handled totally
in main memory so that the partitions are read from the database only once
in each phase.
The DIC (Dynamic Itemset Counting) algorithm [BMU+97] tries to generate
and count the itemsets earlier, thus reducing the number of database scans.
The database is treated as a set of intervals of transactions, and the intervals
are scanned sequentially. During the first interval scan, 1-itemsets are gener-
ated and counted. At the end of the first scan, potentially frequent 2-itemsets
are generated. During the second interval scan, all generated 1-itemsets and
2-itemsets are counted. At the end of the second scan, potentially frequent
3-itemsets are generated, etc. When the end of the database is reached, the
database is rewound to the beginning and the itemsets that were not fully
counted are processed. The actual number of database scans depends on the
interval size, however, the minimal number of database scans is two.
Data mining research has also focused on online algorithms for mining as-
sociation rules. CARMA (Continuous Association Rule Mining Algorithm)
shows current association rules to the user and allows the user to change the
minsup and minconf parameters online, at any transaction during the first
scan of the database. CARMA generates the itemsets in the first scan and
finishes counting all of them during the second scan, similarly to DIC. After
having read each transaction, CARMA first increments the counts of the
itemsets that are subsets of the transaction. Then, if all immediate subsets
of the itemsets are currently potentially frequent with respect to the current
minsup and the part of the database read, it generates new itemsets from the
transaction. For more accurate prediction of whether an itemset is potentially
large, CARMA calculates an upper bound for the count of the itemset, which
is the sum of its current count and an estimated number of occurrences before
the itemset is generated. The estimate (called maximum misses) is computed
when the itemset is first generated. CARMA needs at most 2 database scans
to discover the requested association rules.

Incremental updating of discovered association rules. Since it is
costly to discover association rules in large databases, there is often a need
for techniques that incrementally update the discovered association rules ev-
for techniques that incrementally update the discovered association rules ev-
ery time the database changes. In general, a database updates may not only
invalidate some existing strong association rules but also turn some weak
rules into strong ones. Thus it is nontrivial to maintain such discovered asso-
ciation rules in large databases. In [CHN+96J, Cheung, Han, Ng and Wong
presented an algorithm called FUP (Fast Update Algorithm) for computing
the large itemsets in the expanded database from the old large itemsets. The
algorithm FU P is an extension of Apriori algorithm, but it is much faster in
comparison with re-running of Apriori on the updated database. The major
idea of FU P algorithm is to reuse the information of the old large itemsets
11. Data Mining 511

and to integrate the support information of the new large itemsets in order
to reduce the pool of candidate itemsets to be re-examined.

Parallel and distributed algorithms. A number of parallel algorithms
for mining association rules have been proposed. The Count Distribution
algorithms [AS96] are Apriori extensions, where candidate itemsets are du-
plicated On all processors and the database is distributed across the proces-
sors. Each processor is responsible for computing local support counts of all
the candidate itemsets, which are the support counts in its database parti-
tion. All processors then compute the global support counts of the candidate
itemsets, which are the total support counts of the candidates in the whole
database, by exchanging the local support counts. Subsequently, frequent
itemsets are computed by each processor independently.
The Data Distribution algorithms [AS96] are other Apriori extensions,
where candidate itemsets as well as the database are partitioned and dis-
tributed across the processors. Each processor is responsible for keeping the
global support counts of only a subset of candidate itemsets. This approach
requires two rounds of communication at each iteration. In the first round,
every processor sends its database partition to all the other processors. In
the second round, every processor broadcasts the frequent itemsets that it
has found.

2.6 Mining Sequential Patterns: Formal Problem Description

Let L = {l1, l2, ..., lm} be a set of literals called items. Let D be a set
of variable length sequences, where each sequence S = (X1 X2 ... Xn) is an
ordered list of sets of items such that each set of items Xi ⊆ L.
We say that a sequence (X1 X2 ... Xn) is contained in another sequence
(Y1 Y2 ... Ym) if there exist integers i1 < i2 < ... < in such that X1 ⊆
Yi1, X2 ⊆ Yi2, ..., Xn ⊆ Yin. We say that in a set of sequences, a sequence is
maximal if it is not contained in any other sequence. We say that a sequence
S from D supports a sequence Q if Q is contained in S.
A sequential pattern is a maximal sequence in a set of sequences. Each
sequential pattern has an associated measure of its statistical significance,
called support. The support for the sequential pattern (X1 X2 ... Xn) in the
set of sequences D is:

    support((X1 X2 ... Xn), D) = |{S ∈ D : S supports (X1 X2 ... Xn)}| / |D|
In other words, the support for a sequential pattern is the fraction of total
sequences in D that support the sequential pattern.
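As a quick illustration of these definitions, here is a small Python sketch (the function names and the toy data are assumptions of this illustration) that checks sequence containment and computes the support of a pattern.

def is_contained(pattern, sequence):
    """True if pattern (a list of itemsets) is contained in sequence (a list of
    itemsets), i.e. each pattern element is a subset of a later, distinct itemset."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1                      # next pattern element must match a later itemset
    return True

def support(pattern, sequences):
    """Fraction of sequences in the database that support the pattern."""
    return sum(is_contained(pattern, s) for s in sequences) / len(sequences)

# Toy database of two data sequences (each a list of itemsets).
D = [
    [{"beer"}, {"diapers"}, {"potato_chips", "soda"}],
    [{"diapers", "beer"}, {"cookies"}, {"potato_chips"}],
]
print(support([{"beer"}, {"potato_chips"}], D))   # 1.0: both sequences contain it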
Given a sequence S = (s1 s2 ... sn) and a subsequence C, C is called a
contiguous subsequence of S if any of the following conditions hold:
• C is derived from S by dropping an item from either s1 or sn.
• C is derived from S by dropping an item from an element si which has
at least 2 items.
• C is a contiguous subsequence of C', and C' is a contiguous subsequence
of S.
For example, the sequences ({2}{3,4}{5}), ({1,2}{3}{5}{6}) and ({3}{5})
are contiguous subsequences of ({1,2}{3,4}{5}{6}), while ({1,2}{3,4}{6})
and ({1}{5}{6}) are not.
A transaction d is an itemset with a timestamp assigned. The timestamp
is called transaction time. The transaction time of a transaction d is denoted
as transaction-time (d). The size of a transaction d is the size of the itemset
forming the transaction.
A data-sequence D is an ordered list of transactions and is denoted as
D = (d1 d2 ... dn), where di is a transaction; di is called an element of the
data-sequence D. Each data-sequence has a unique identifier assigned. An
item can occur only once in an element of a data-sequence but can occur
multiple times in different elements. The size of a data-sequence is the sum
of sizes of all elements of the data-sequence. The length of a data-sequence
is the number of elements in the data-sequence.
We say that an itemset si is contained in a transaction dk if si is a subset
of the itemset forming dk. We say that an itemset si is contained in a set of
transactions di1, di2, ..., dik if si is a subset of the union of itemsets forming
di1, di2, ..., dik.
Sequence containment in a transaction can be further restricted by us-
ing max-gap, min-gap, and window-size constraints, that restrict time gaps
between sets of transactions that contain consecutive elements of the se-
quence. Given a window-size, max-gap and min-gap, the data-sequence C =
(c1 c2 ... cm) contains the sequence S = (s1 s2 ... sn) if there exist integers
l1 ≤ u1 < l2 ≤ u2 < ... < ln ≤ un such that:
• si is contained in the union c_li ∪ c_li+1 ∪ ... ∪ c_ui, for 1 ≤ i ≤ n, and
• transaction-time(c_ui) - transaction-time(c_li) ≤ window-size, for 1 ≤ i ≤ n,
• transaction-time(c_li) - transaction-time(c_ui-1) > min-gap, for 2 ≤ i ≤ n,
• transaction-time(c_ui) - transaction-time(c_li-1) ≤ max-gap, for 2 ≤ i ≤ n.
The goal of mining sequential patterns is to discover all sequential patterns
having support greater than or equal to some minimum support threshold,
minsup.

Illustrative example of sequential patterns. Consider a database of
purchase histories of two customers (Figure 2.13). For each customer a se-
quence of transactions ordered according to transaction time is stored. Each
transaction contains the customer identifier, the transaction time, and the
list of items purchased in the transaction.
cust_id   trans_time      products
1         June 1, 2000    beer_10
1         June 2, 2000    diapers_b01
1         June 15, 2000   potato_chips_12, soda_03
2         June 1, 2000    diapers_b01, beer_10
2         June 20, 2000   cookies_30
2         July 20, 2000   potato_chips_12

Fig. 2.13. Example database of data sequences

Assuming the minimum support is set to 75% (i.e. in our case a
sequence is frequent if it is supported by two customers), and no
time constraints are given, the only 2-element sequential patterns are
({beer_10}{potato_chips_12}) and ({diapers_b01}{potato_chips_12}). If the
window-size is set to 7 days, we get the third 2-element sequential pattern
({diapers_b01, beer_10}{potato_chips_12}), since diapers_b01 and beer_10 are
present in the data sequence of the first customer within a period of 7 days.
If we set a max-gap constraint of 30 days, the three patterns listed above will
not be frequent, because they will not be supported by the second customer.

Basic algorithm for sequential patterns discovery. An efficient min-
ing algorithm called GSP (Generalized Sequential Patterns) was introduced
in [SA96a]. GSP is an Apriori-like algorithm, exploiting a variation of the
Apriori anti-monotonicity property: any super-pattern of a non-frequent pat-
tern cannot be frequent.
The GSP algorithm makes multiple passes over the data. The first pass
determines the support of each item. At the end of the first pass, the algo-
rithm knows which items are frequent (their support exceeds minsup). Each
such item yields a 1-element frequent sequence consisting of that item. Each
subsequent pass starts with a seed set: the frequent sequences found in the
previous pass. The seed set is used to generate new potentially frequent se-
quences, called candidate sequences. Each candidate sequence has one more
item than a seed sequence; so all the candidate sequences in a pass will have
the same number of items. The support for these candidate sequences is found
during the pass over the data. At the end of the pass, the algorithm deter-
mines which of the candidate sequences are actually frequent. These frequent
candidates become the seed for the next pass. The algorithm terminates when
there are no frequent sequences at the end of a pass, or when there are no
candidate sequences generated. The two crucial steps of GSP are candidate
generation and counting.
It can be shown [SA96b] that any data-sequence that contains a sequence
s will also contain any contiguous subsequence of s. If there is no max-gap
constraint, the data-sequence will contain all subsequences of s (including
non-contiguous subsequences). Candidate generation is based on the above
properties. Let Lk denote the set of all frequent k-sequences, and Ck the set of
candidate k-sequences. Given Lk-1, the set of all frequent (k-1)-sequences,
the goal is to generate a superset of the set of all frequent k-sequences. The
candidate generation procedure can be decomposed into two phases: the Join
Phase and the Prune Phase. In the Join Phase, candidate sequences are generated
by joining Lk-1 with Lk-1. A sequence s1 joins with s2 if the subsequence
obtained by dropping the first item of s1 is the same as the subsequence
obtained by dropping the last item of s2. The candidate sequence generated
by joining s1 with s2 is the sequence s1 extended with the last item in s2.
The added item becomes a separate element if it was a separate element in
s2, and part of the last element of s1 otherwise. When joining L1 with L1,
we need to add the item in s2 both as part of an itemset and as a separate
element, since both ({x}{y}) and ({x, y}) give the same sequence ({y}) upon
deleting the first item. In the Prune Phase, candidate sequences that have
a contiguous (k-1)-subsequence that is not frequent are deleted. If there is
no max-gap constraint, candidate sequences that have any subsequence that
is not frequent are also deleted.
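The join step can be made concrete with a small sketch. In the following Python fragment (the function names and the encoding of a sequence as a list of tuples, one tuple per element, are assumptions of this illustration; the special case of joining L1 with L1, where both forms of the candidate must be generated, is omitted for brevity), two (k-1)-sequences are joined as described above.

def drop_first_item(seq):
    """Remove the first item of the first element of a sequence (list of tuples)."""
    head = seq[0][1:]
    rest = list(seq[1:])
    return ([head] if head else []) + rest

def drop_last_item(seq):
    """Remove the last item of the last element of a sequence."""
    tail = seq[-1][:-1]
    rest = list(seq[:-1])
    return rest + ([tail] if tail else [])

def join(s1, s2):
    """GSP-style join: if s1 minus its first item equals s2 minus its last item,
    extend s1 with the last item of s2 (as a new element, or as part of the
    last element, depending on how that item occurs in s2)."""
    if drop_first_item(s1) != drop_last_item(s2):
        return None
    last_item = s2[-1][-1]
    if len(s2[-1]) == 1:                       # separate element in s2
        return list(s1) + [(last_item,)]
    return list(s1[:-1]) + [tuple(s1[-1]) + (last_item,)]

# ({1,2}{3}) joined with ({2}{3,4}) yields ({1,2}{3,4})
print(join([(1, 2), (3,)], [(2,), (3, 4)]))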

While making a pass, one data-sequence at a time is read from the data-
base and the support count of candidates contained in the data-sequence is
incremented. Thus, given a set of candidate sequences C and a data-sequence
d, the goal is to find all sequences in C that are contained in d. The algorithm
for checking if the data-sequence d contains a candidate sequence s alternates
between forward and backward phases. The algorithm starts in the forward
phase from the first element. In the forward phase, the algorithm finds suc-
cessive elements of s in d as long as the difference between the end-time of
the element just found and the start-time of the previous element is less than
max-gap (for an element si, start-time(si) and end-time(si) correspond to the
first and last transaction-times of the set of transactions that contain si). If
the difference is more than max-gap, the algorithm switches to the backward
phase. If an element is not found, the data-sequence does not contain s.

In the backward phase, the algorithm backtracks and "pulls up" previous
elements. If si is the current element and end-time(si) = t, the algorithm
finds the first set of transactions containing si-1 whose transaction-times are
after t - max-gap. Pulling up si-1 may necessitate pulling up si-2 because
the max-gap constraint between si-1 and si-2 may no longer be satisfied.
The algorithm moves backwards until either the max-gap constraint between
the element just pulled up and the previous element is satisfied, or the first el-
ement has been pulled up. The algorithm then switches to the forward phase,
finding elements of s in d starting from the element after the last element
pulled up. If any element cannot be pulled up (that is, there is no subsequent
set of transactions which contain the element), the data-sequence does not
contain s. This procedure is repeated, switching between the backward and
forward phases, until all the elements are found or it is discovered that the
data-sequence does not contain s.
Other algorithms for mining sequential patterns. The GSP algorithm
assumes that a user specifies only one pattern constraint by providing the
minimum support threshold. However, users very often are interested only
in patterns satisfying certain constraints concerning pattern structure, and
are not willing to wait until a data mining system generates a huge volume
of mostly useless results. To address the above issue, in [GRS99a] incorpo-
ration of user-specified pattern constraints into sequential pattern discovery
process in order to improve performance was studied. The use of regular ex-
pressions was proposed as a flexible constraint specification method and a
family of novel algorithms (called SPIRIT - Sequential Pattern mIning with
Regular expressIon consTraints) for mining sequential patterns that satisfy
user-specified regular expression constraints was introduced. The SPIRIT
algorithms can be regarded as GSP extensions using advanced candidate
generation and pruning methods. It was observed that there exists a class of
pattern constraints whose integration with Apriori-like algorithms (such as
GSP) is straightforward. These are constraints having the anti-monotonicity
property. Informally, a sequential pattern constraint is anti-monotone if all
subsequences of a sequence satisfying the constraint are also guaranteed to
satisfy it. In fact, Apriori-like algorithms exploit the anti-monotonicity prop-
erty of the minimum support constraint. Thus, other anti-monotone con-
straints do not affect the candidate generation process and can be used in
candidate pruning together with the minimum support constraint.
The SPIRIT algorithms discover patterns whose structure satisfies a
user-specified regular expression constraint. Unfortunately, regular expres-
sion constraints are not anti-monotone. Therefore, the algorithms have to
use sophisticated candidate generation and pruning schemes. There are four
algorithms in the SPIRIT family, and each of them processes candidates
satisfying some relaxation of the constraint specified by a user. The weakest
of relaxations requires that each item of a candidate sequence is in the regu-
lar expression (which results in an anti-monotone constraint), the strongest
requires that each candidate sequence has to satisfy the user-specified regular
expression (the constraint is not relaxed). It has been shown experimentally
that pushing regular expression constraints deep into the mining process can
reduce processing time by more than an order of magnitude. However, using
some relaxation of the original regular expression constraint might be more
efficient than candidate pruning according to the constraint in the form spec-
ified by a user (there is a balance between the support-based pruning and
pruning according to pattern structure constraints).
In [PHM+00], an algorithm capable of efficient discovery of sequential
patterns from web access logs was presented. Web access pattern mining is a
particular case of sequential pattern mining, where source data-sequences and
patterns are sequences of items, rather than itemsets. The key element of the
proposed method was the transformation of the source database into a novel
data structure called Web access pattern tree (WAP-tree). WAP-tree stores
information required for access pattern mining in a highly compressed form.
After the creation of the WAP-tree, an algorithm called WAP-mine was used
for pattern discovery. The algorithm used conditional search techniques to
reduce the search space, while looking for frequent patterns in the WAP-tree.
Conditional search consists in searching for patterns with the same suffix,
instead of searching for all patterns at once (suffix is used as a condition to
narrow the search space).
[HPM+00] introduced FreeSpan, a sequential pattern discovery algo-
rithm exploiting the concept of projected databases to reduce the expensive
candidate generation and testing. The general idea of FreeSpan is to use
frequent items to recursively project sequence databases into a set of smaller
projected databases and grow subsequence fragments in each projected data-
base. This process partitions the data as well as the set of sequential patterns
to be tested, and confines each test being conducted to the corresponding
smaller projected database. FreeSpan examines a smaller number of subse-
quences and runs considerably faster than GSP, especially when the support
threshold is low and there are many long patterns.
In [PHM+01] another pattern-growth sequential pattern discovery
method, called PrefixSpan, offering further performance improvements,
was proposed. Its general idea is to examine only the prefix subsequences
and project only their corresponding postfix subsequences into projected
databases. In each projected database, sequential patterns are grown by ex-
ploring only local frequent patterns. PrefixSpan is a more efficient algorithm
than FreeSpan for two main reasons. Firstly, PrefixSpan analyzes smaller
projected databases than FreeSpan, since in case of FreeSpan a subsequence
may be generated by any substring combination in a sequence (not only its
prefixes), and as a result the whole sequences have to be kept in projected
databases. Secondly, the growth of a subsequence in case of FreeSpan is
more costly, because it is explored at any split point in a sequence.
An interesting approach to sequential pattern discovery was presented in
[Zak98]. The algorithm introduced there, called SPADE, assumed a different
structure of the source database than in other approaches. Usually the da-
tabase is seen as a collection of sequences, where each element of a sequence
(called transaction) contains a timestamp and a set of items. For the algo-
rithm proposed in [Zak98], the dataset was interpreted as a collection of items
with a list of occurrences in customer sequences maintained for each item. To
facilitate mining, the dataset had to be physically stored according to this
logical interpretation. This assumption seriously limits the application areas
of the proposed method, since source datasets are very likely to be organized
traditionally and the transformation of the original database might not be
possible due to the lack of disk space. The algorithm itself used the lattice
approach to decompose the original search space (lattice) into smaller pieces
(sub-lattices) which could be processed independently in main memory. The
advantage of SPADE, in comparison to GSP, is that it usually requires only
three scans of the database. It has been shown experimentally that SPADE
is at least twice as fast as GSP.

3 Classification and Prediction

Classification and prediction are two forms of data analysis that are used
to extract models describing data classes or to predict future data trends.
Classification is used to predict categorical labels, while prediction is used to
predict numerical values or value ranges. For example, a classification model
can be built to classify a medical treatment as either safe or risky. A prediction
model can be built to assess the value of a given company's stock, the blood
pressure of a given patient, or to assess the value of energy consumption by
a given company. In this section, we will briefly describe and discuss basic
techniques for data classification and data prediction.
Classification is a two-step process. In the first step, a concise model is
built describing a predetermined set of data classes. The model is constructed
by analyzing a dataset of training tuples, also called training database, de-
scribed by attributes. The data tuples of the training database are also called
samples, examples, or instances. Each tuple of the training database is a fea-
ture vector (i.e. a set of <attribute-value> pairs) with its associated class. At-
tributes whose domain is numerical are called numerical attributes, whereas
attributes whose domains are not numerical are called categorical1. There
is one distinguished attribute called the dependent attribute. The remaining
attributes are called predictor attributes. Predictor attributes may be either
numerical or categorical. If the dependent attribute is categorical, then the
problem is referred to as a classification problem. If the dependent attribute
is numerical, the problem is called a prediction problem.
In the second step of classification, the resulting model is used to assign
values to tuples where the values of the predictor attributes are known but the
value of the dependent attribute is unknown. First, the predictive accuracy
of the model is estimated based on the test set of tuples. These tuples are
randomly selected and are independent of the training tuples. The accuracy
of the learned model on a given test set of tuples is defined as a percentage
of test set tuples that are correctly classified by the model. If the accuracy
of the model is acceptable, the model can be used to classify future data
tuples and predict values of new tuples, for which the value of the dependent
attribute is missing or unknown.
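As a minimal illustration of this accuracy-estimation step (the split ratio, the toy data, the majority-class model and the function names below are assumptions of this sketch, not part of the chapter):

import random

def holdout_accuracy(records, labels, train_model, test_fraction=0.3, seed=0):
    """Hold out a random test set, train on the rest and report the fraction
    of test tuples whose class is predicted correctly."""
    rng = random.Random(seed)
    idx = list(range(len(records)))
    rng.shuffle(idx)
    cut = int(len(idx) * test_fraction)
    test_idx, train_idx = idx[:cut], idx[cut:]
    model = train_model([records[i] for i in train_idx], [labels[i] for i in train_idx])
    correct = sum(model(records[i]) == labels[i] for i in test_idx)
    return correct / len(test_idx)

# A trivial "majority class" model, used only to exercise the evaluation loop.
def train_majority(train_records, train_labels):
    majority = max(set(train_labels), key=train_labels.count)
    return lambda record: majority

records = [{"Income": "low"}, {"Income": "high"}, {"Income": "medium"}, {"Income": "low"}]
labels = ["high", "low", "low", "high"]
print(holdout_accuracy(records, labels, train_majority, test_fraction=0.5))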
Prediction is very similar to classification. However, in prediction, the
model is constructed and used to predict not a discrete class label of the
dependent attribute but a numeric value or value ranges of the numerical
dependent attribute.

1 Sometimes, a set of categorical attributes is divided into two classes: nominal
attributes and ordinal attributes. Nominal attributes take values from an un-
ordered set of categories, whereas ordinal attributes take values from an ordered
set of categories.
There are many different ways for representing the learned model for
classification and prediction, and each one dictates the kind of technique
that can be used to infer the output structure from the data. Typically, in
classification, the learned model is represented in the form of decision trees,
decision tables, or classification rules, and in prediction, the learned model is
usually represented by regression trees or regression equations.
Classification and prediction have a wide range of applications, including
scientific experiments, medical diagnosis, credit approval, target marketing,
fraud detection, or treatment effectiveness analysis.
In Section 3.1 we concentrate on the classification problem. The prediction
problem will be considered in Section 3.6.

3.1 Classification

The main goal of classification is to build a formal concise model called classi-
fier of the dependent attribute based upon values of the predictor attributes.
The input to the classification problem is a training set of tuples, each be-
longing to a predefined class as determined by the dependent attribute. In the
context of classification, the dependent attribute is usually called the class
label attribute. The elements of the domain of the class label attribute are
called class labels. The learned model can be used to predict classes of new
tuples, for which the class label is missing or unknown. Figure 3.1 shows a
sample training set of tuples where each tuple represents a credit applicant.
Here we are interested in building a model of what makes an applicant a
high or low credit risk. The class label attribute is Risk, the predictor at-
tributes are: Age, Marital Status, Income, and Children. Figure 3.2 shows
a sample classifier, in the form of a decision tree, that has been built based
on the training set from Figure 3.1. Once a model is built, it can be used to
determine a credit class of future unclassified applicants.
Many classification models have been proposed in the literature: deci-
sion trees [BFO+84,Mur98,Qui86,WK91], decision tables [Koh95], Bayes-
ian methods [CS96,Mit97,WK91], neural networks [Bis95,Rip96], genetic al-
gorithms [Gol89,Mit96], k-nearest neighbor methods [Aha92,DH73,Jam85],
rough set approach [CPS98,Paw91,SS96,Zia94], and other statistical meth-
ods [Jam85,MST94,WK91]. All mentioned classification models can be com-
pared and evaluated according to [HK00]: predictive accuracy, scalability
(efficiency in large databases), interpretability and understandability, speed,
and robustness with regard to noise and missing values. Among these models,
decision trees are particularly suited and attractive for data mining. First,
due to their intuitive representation the resulting classification model is easy
to understand by humans. Second, decision trees can be constructed rela-
tively fast compared to other classification methods. Third, decision trees

RID  Age  Marital Status  Income  Children  Risk
1    25   single          low     0         high
2    35   married         medium  1         low
3    38   divorced        high    2         low
4    45   married         medium  2         low
5    28   married         low     1         high
6    39   divorced        high    0         high
7    31   single          low     0         high
8    56   married         high    2         low
9    48   married         medium  1         low
10   38   divorced        low     2         high
11   29   single          high    0         low
12   42   divorced        medium  4         high
13   41   married         medium  1         low
14   56   divorced        high    2         low

Fig. 3.1. Example training database

Fig. 3.2. Example decision tree

scale well for large data sets and can handle high dimensional data. Last, the
accuracy of decision trees is comparable to that of other methods. Almost all major
commercially available data mining tools include some form of decision tree
model. The main drawback of decision trees is that they cannot capture
correlations among attributes without additional computation.
In the following section we describe basic issues regarding classification
by decision tree construction. In the next sections we briefly survey other
classification models.

3.2 Classification by Decision Tree Construction

A decision tree is a special type of classifier. It is a directed acyclic graph in
the form of a tree, where each internal node denotes a test on an attribute,
and each branch represents an outcome of the test. Each internal node is
labeled with one predictor attribute called the splitting or test attribute. An
internal node may have a predicate, called the splitting predicate, associated
with the node. Each leaf node of the tree is labeled with one class label
representing a given class of samples or a class distribution. The topmost node of
the tree is the root node. An example of a typical decision tree is shown in
Figure 3.2. Internal nodes are represented by rectangles, and leaf nodes are
denoted by ovals.
A decision tree is usually constructed in two phases. In the first phase,
called building phase, growth phase, or decision tree induction phase, a de-
cision tree is constructed from the training database. In the second phase,
called pruning phase, the final tree is determined by removing some branches
and nodes from the constructed tree, generally resulting in faster classifi-
cation and better ability of the tree to correctly classify independent data
samples2.
The basic algorithm for decision tree construction used during the building
phase is a greedy algorithm that constructs a decision tree in a top-down
recursive divide-and-conquer manner [HK00]. To illustrate the general idea
of the algorithm consider the schema, depicted in Figure 3.3, of a version of
the well known decision tree construction algorithm ID3 [Qui86].

The construction of the tree starts with a single root node N represent-
ing the training database D. If all tuples of D belong to the same class C,
then the node N becomes a leaf node labeled C, and the algorithm stops
(steps 2 and 3). Otherwise, the set of predictor attributes A is examined
according to the split selection method SS and a splitting attribute, called
"best_split", is selected (steps 6 and 7). The splitting attribute partitions the
training database D into a set of separate classes of samples S1, S2, ..., Sv,
where Si, i = 1, ..., v, contains all samples from D with splitting attribute =
ai (step 9). A branch, labeled ai, is created for each value ai of the splitting
attribute, and to each branch a set of samples Si is assigned. The par-
titioning procedure is repeated recursively to each descendant node to form
the decision tree for each partition of samples (step 10). The set of attributes
A is examined and from it a splitting attribute is selected. Once an attribute
has been selected as a splitting attribute at a given node, it is not necessary

2 It is possible to interleave the building and pruning phases for performance reasons.

Algorithm: Decision Tree Construction Algorithm

Input: training database D, a set of predictor attributes A, split selection method SS
Output: a decision tree rooted at node N

BuildTree(training database D, attribute_list A, split selection method SS)

(1) initialize a root node N of the tree;
(2) if all tuples of D are of the same class C then
(3)   return N as the leaf node labeled with the class label C, and exit;
(4) if attribute_list A is empty then
(5)   return N as the leaf node labeled with the most common class in the training database D, and exit;
(6) apply SS to select the best_split attribute from A;
(7) label node N with the best_split attribute;
(8) for each value ai of the attribute best_split do
(9)   let Si denote the set of samples in D with best_split = ai;
(10)  let Ni = BuildTree(Si, (attribute_list A) minus (best_split attribute), SS);
(11)  create a branch from N to Ni labeled with the test best_split = ai;

Fig. 3.3. Decision tree construction algorithm

to consider the attribute in any of the node's descendants (step 10). The
procedure stops when one of the following conditions is satisfied: (1) a node
is "pure", i.e. all samples for the node belong to the same class, (2) there are
no remaining attributes on which the samples belonging to a given internal
node may be further partitioned - in this case the given node is converted
into a leaf node and labeled with the majority class among its samples,
and (3) there are no samples for the branch splitting attribute = ai - in this
case a leaf node is created and labeled with the most common class among the
samples at the given node. In the given example all predictor attributes are
categorical, which is obviously not always the case. Continuous attributes have
to be discretized.
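The recursive structure of the algorithm can be sketched in a few lines of Python. The nested-dictionary representation of the tree and the helper names below are assumptions of this sketch, not the chapter's notation; best_split stands for any split selection function (for example, one based on the information gain measure described below).

from collections import Counter

def majority_class(samples, class_attr):
    return Counter(s[class_attr] for s in samples).most_common(1)[0][0]

def build_tree(samples, attributes, class_attr, best_split):
    """ID3-style recursive construction: returns a class label (leaf) or a dict
    {(attribute, value): subtree} describing the splits."""
    classes = {s[class_attr] for s in samples}
    if len(classes) == 1:                       # node is pure
        return classes.pop()
    if not attributes:                          # no attributes left to split on
        return majority_class(samples, class_attr)
    attr = best_split(samples, attributes, class_attr)
    tree = {}
    for value in {s[attr] for s in samples}:
        subset = [s for s in samples if s[attr] == value]
        remaining = [a for a in attributes if a != attr]
        tree[(attr, value)] = build_tree(subset, remaining, class_attr, best_split)
    return tree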
There are several variants of the basic decision tree construction algorithm
proposed by researchers from machine learning (ID3, C4.5) [Qui86,Qui93],
statistics (CART) [BFO+84], pattern recognition (CHAID) [Mag94], or data
mining (SLIQ, SPRINT, SONAR, CLOUDS, PUBLIC, BOAT, Rainforest)
[ARS98,FMM+96,GRG00,GGR+99,MAR96,SAM96,RS98].
The main difference between the above-mentioned algorithms lies in the
split selection method used during the building phase. The split selection
method should maximize the accuracy of the constructed decision tree or,
in other words, minimize the misclassification rate of the tree. Most of the split
selection methods used in practice by commercial data mining tools belong
to the class called impurity-based split selection methods [Shi99]. Impurity-
based split selection methods find the splitting attribute of a node of the
decision tree by minimizing an impurity measure, such as: the entropy (ID3,
C4.5) [Qui86,Qui93], the gini-index (CART, SPRINT) [BFO+84,SAM96],
or the χ2 index of correlation (CHAID) [Mag94]. In the following we will
briefly describe two popular split selection methods used in practice, based
on: information gain and gini-index measures, and give an intuition behind
impurity-based split selection methods.

Information gain. In the version of the tree construction algorithm de-
picted in Figure 3.3, to select the splitting attribute at a node of the decision
tree, the split selection method based on the information gain measure is used.
The problem of constructing a decision tree using the information gain mea-
sure can be expressed as follows. First, we have to select an attribute to place
at the root node of the decision tree and create a branch for each value of
that attribute. The selected attribute splits the set of samples D into a set of
subsets, one for each value of the attribute. The process is repeated recur-
sively for each branch using samples assigned to the branch. If all samples at
a given node created for the branch belong to the same class, the partitioning
process is terminated since there is no need to split the node further. The
problem is how to find which attribute to split on for a given set of samples?
The answer follows from the assumption that we are interested in the simplest
and smallest classifier (decision tree) that correctly classifies samples from D.
Since the partitioning process stops when all samples at a node belong to
the same class, we would like this to happen as soon as possible. Therefore,
we are looking for an attribute that partitions a set of samples into a set of
"purest" subsets, i.e. subsets whose samples belong to one class only. The
information gain measure is used to determine the purity of a partition of
a set of samples according to a given attribute. The greater the information
gain value, the greater the purity of the subset partition. Therefore, the best
splitting attribute for a given node is the attribute that maximizes infor-
mation gain measure for that node. This attribute minimizes the amount
of expected information needed to specify a class of a given sample in the
resulting partitions given that the sample reached that node.
Now, we will discuss more formally how to calculate the information gain
measure and we will present the split selection method based on this measure.
Assume a training database D consists of n samples. Suppose the class label
attribute has m distinct values defining m distinct classes Ci, for i = 1, ..., m.
Let si denote the number of samples of D in class Ci. The information needed
to classify an arbitrary sample to a given class is given by the following
formula:

   I(s1, s2, ..., sm) = - Σ_{i=1}^{m} pi log2(pi)

where pi is the probability that an arbitrary sample belongs to the class Ci
and is estimated by si/n.
Assume that the attribute A has v distinct values, {a1, a2, ..., av}. If
we select the attribute A as the splitting attribute for the set of samples
D, then the attribute will partition D into subsets S1, S2, ..., Sv, where
Si, i = 1, ..., v, contains samples from D for which A = ai. Each subset
Si will correspond to a branch grown from the node containing D. Let
sij denote the number of samples of the class Ci in a subset Sj. The in-
formation needed to classify samples in the resulting partitions according to an
attribute A, called entropy, is given by the following formula:

   E(A) = Σ_{j=1}^{v} ((s1j + s2j + ... + smj) / n) · I(s1j, s2j, ..., smj)

where I(s1j, s2j, ..., smj) is the amount of expected information needed to
classify a given sample from the subset Sj. The formula given above defines
the entropy as a weighted sum of I(s1j, s2j, ..., smj), j = 1, ..., v, where the
weight of the subset Sj is equal to (s1j + s2j + ... + smj)/n. According to the
formula defining expected information needed to classify a given sample, the
value of I(s1j, s2j, ..., smj) is defined as follows:

   I(s1j, s2j, ..., smj) = - Σ_{i=1}^{m} pij log2(pij)

where pij = sij / |Sj| is the probability that a sample from the subset Sj
belongs to the class Ci.
The information gain of an attribute A is defined as follows:

   Gain(A) = I(s1, s2, ..., sm) - E(A)

and determines the amount of information that would be gained if the set of
samples were split on the attribute A.
The classification algorithm based on the information gain measure works
as follows. First, the algorithm calculates the information gain of all predic-
tor attributes and selects the attribute with the highest information gain as
the splitting attribute for the root node. The root node is labeled with the
attribute, for each value of the attribute a branch is created, and the set
of samples D is partitioned into a set of subsets, one for each branch. The
process is repeated recursively for each branch using samples assigned to the
branch. The algorithm we have described only works when all attributes are
categorical. If a set of samples contains numerical attributes, they must be
discretized.
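These formulas translate directly into a few lines of Python. The following sketch (the function names and the way class counts are passed in are assumptions of this illustration) reproduces the values of the worked example below: the per-class counts of the Age partitions of Figure 3.1 give E(Age) ≈ 0.8221 and Gain(Age) ≈ 0.1631.

from math import log2

def info(class_counts):
    """I(s1, ..., sm): expected information for a list of per-class counts."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def entropy(partitions):
    """E(A): weighted sum of info() over the partitions induced by attribute A."""
    n = sum(sum(p) for p in partitions)
    return sum((sum(p) / n) * info(p) for p in partitions)

def gain(total_counts, partitions):
    return info(total_counts) - entropy(partitions)

# Class counts (high risk, low risk) for Age = "<30", "30..40", ">40" in Figure 3.1.
age_partitions = [[2, 1], [3, 2], [1, 5]]
print(round(entropy(age_partitions), 4))          # ~0.8221
print(round(gain([6, 8], age_partitions), 4))     # ~0.1631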
The information gain measure has the following properties. When the
splitting attribute partitions a set of samples into "pure" subsets, i.e. each
consisting of samples belonging to a single class only, the value of the entropy
is zero and the value of the information gain reaches a maximum. When the
splitting attribute partitions a set of samples into subsets, which consist of
samples uniformly distributed among different classes (i.e. each subset has an
equal number of samples belonging to different classes), then the value of the
entropy reaches the maximum and the information gain reaches the minimum.
The information gain measure tends to prefer attributes with many values.
Notice that the entropy of an attribute that has a different value for each
sample in a training database, e.g. an attribute that is an identifier of the
training database (e.g. credit applicant identifier), is zero. The attribute will
have the highest information gain. This is obvious since such an attribute
uniquely determines the class of each sample without any ambiguity. The
problem is that splitting on this attribute is unreasonable since it is useless
for predicting the class of a new unknown sample and tells nothing about the
structure of the decision tree. Therefore, to compensate for this effect of the
information gain, a correction of the measure called gain ratio was proposed
[Qui93].
To illustrate the idea of the split selection method based on information
gain consider once more the example training database D from Figure 3.1.
The training database D has four predictor attributes: Age, Marital Status,
Income, and Children. Attributes Marital Status and Income are categorical,
while attributes Age and Children are numerical attributes that have to be
discretized. Therefore, let us assume that the range of values of the attribute
Age is divided into 3 intervals: "< 30", "30, ... ,40", and "> 40", and the
range of values of the attribute Children is reduced to 3 distinct values: "0",
"1 ... 2", and "> 2". The class label attribute Risk has two distinct values:
high and low. Therefore, we distinguish two classes (m=2): C 1 and C 2 • Let
C 1 denotes the class of high risk credit applicants, while C 2 denotes the class
of low risk credit applicants. The training database D consists of 14 samples
(n=14), where 6 samples belong to the class C 1 (81 = 6) and 8 samples belong
to the class C 2 (82 = 8). First, we calculate the expected information needed
to classify a given sample:

The next step is the calculation of the entropy of each predictor attribute.
Assume we start with the attribute Age. The attribute has 3 distinct values:
"< 30", "30, ..., 40", and "> 40". The attribute Age would partition the
dataset D into 3 subsets S1, S2, S3. To calculate the entropy of Age, first we
have to calculate the amount of information needed to classify a given sample
from a given subset Si, i = 1, 2, 3.

S1 - samples from D for which Age = "< 30":
   s11 = 2, s21 = 1, I(s11, s21) = -(2/3)log2(2/3) - (1/3)log2(1/3) = 0.9183
S2 - samples from D for which Age = "30, ..., 40":
   s12 = 3, s22 = 2, I(s12, s22) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9709
S3 - samples from D for which Age = "> 40":
   s13 = 1, s23 = 5, I(s13, s23) = -(1/6)log2(1/6) - (5/6)log2(5/6) = 0.65
Thus, the entropy of the attribute Age is

   E(Age) = Σ_{j=1}^{3} ((s1j + s2j) / 14) · I(s1j, s2j) = 0.8221

The information gain from partitioning the dataset D according to the
attribute Age would be

   Gain(Age) = I(s1, s2) - E(Age) = 0.1631

Similarly, we calculate the entropy and information gain for the rest of the
attributes:

   E(Marital_Status) = 0.8221    Gain(Marital_Status) = 0.1631
   E(Income) = 0.5157            Gain(Income) = 0.4696
   E(Children) = 0.7231          Gain(Children) = 0.2622

The attribute Income has the highest information gain (0.4696); there-
fore, it is selected as the splitting attribute for the root node of the tree.
The root node is labeled with the attribute Income and for each value of the
attribute a corresponding branch is created. The set D is partitioned into
subsets Sl, S2, and S3, where each subset is assigned to the corresponding
descendant node. Notice that a set of samples assigned to the branch Income
= low is completely pure, i.e. all samples assigned to this branch belong to
the same class Risk = high. Therefore, a leaf node is created for this branch
and labeled with the class label high. For two other branches we repeat the
process of calculating the entropy and information gain using samples as-
signed to each branch. Consider the set of samples assigned to the branch
Income = medium. It is easy to notice that two attributes, namely, Marital
Status and Children, partition the set of samples into pure subsets. There-
fore, the entropy of both attributes is zero and the information gain reaches
a maximum. Assume the attribute Marital Status is selected as the splitting
attribute. A node at the end of the branch Income = medium is created and
labeled with Marital Status, and two branches labeled married and divorced
are grown. As we noticed before, the attribute Marital Status partitions the
set of samples into pure subsets. Therefore, for both branches leaf nodes are
created and labeled, respectively, low (for the branch Marital Status = mar-
ried) and high (for the branch Marital Status = divorced). Consider now the
set of samples assigned to the branch Income = high. The algorithm computes
the entropy of all predictor attributes but Income for the given set of sam-
ples: E(Age) = 0.46, E(Marital_Status) = 0.46, E(Children) = 0.4. The
attribute Children is selected as the splitting attribute since it has the high-
est information gain. A node is created and labeled with Children, and two
branches labeled, respectively, "0" and "1 ... 2", are grown. The samples be-
longing to the partition Children = "1 ... 2" all belong to the same class Risk
= low. Therefore, a leaf node is created for this branch labeled with low. The
algorithm continues with the set of samples assigned to the partition Children
= "0". Both remaining attributes, i.e. Marital Status and Age have the same
entropy value for this set of samples: E(Age) = 0, E(M aritaLStatus) = o.
The final decision tree induced by the algorithm is depicted in Figure 3.4.

Fig. 3.4. Decision tree obtained using information gain

It is easy to notice that if we selected the attribute Children instead
of the attribute Marital Status as the splitting attribute for the set of samples
assigned to the branch Income = medium, and the attribute Marital Status
instead of the attribute Age as the splitting attribute for the set of samples
assigned to the branch Children = "0", then we would obtain the decision tree
shown in Figure 3.2.

Gini index. Another popular split selection method is based on the gini
index measure. To illustrate the idea of the method and present the gini
index consider the schema, depicted in Figure 3.5, of a version of the well
known decision tree construction algorithm SPRINT [SAM96].

Notice that each internal node of a decision tree is labeled with a splitting
attribute A and, moreover, has a predicate QA, called the splitting predicate,
Algorithm: Decision Tree Construction Algorithm

Input: training database D, a set of predictor attributes A, split selection method SS
Output: a decision tree rooted at node N

BuildTree(training database D, attribute_list A, split selection method SS)

(1) initialize a root node N of the tree;
(2) if all tuples of D are of the same class C then
(3)   return N as the leaf node labeled with the class label C, and exit;
(4) if attribute_list A is empty then
(5)   return N as the leaf node labeled with the most common class in the training database D, and exit;
(6) apply SS to select the splitting criterion;
(7) if the node N splits then
(8)   label node N with the splitting criterion;
(9)   use the splitting criterion to partition D into D1 and D2;
(10)  for v = 1, 2: let Nv = BuildTree(Dv, attribute_list A, SS);
(11)  create a branch from N to Nv labeled with the corresponding outcome of the splitting criterion;

Fig. 3.5. Decision tree construction algorithm SPRINT

associated with it. If A is a numerical attribute, then QA is of the form A ≤ xA,
where xA ∈ dom(A). The value of xA is called the split point of the splitting
attribute A at a given node. If A is a categorical attribute, then QA is of
the form A ∈ XA, where XA ⊂ dom(A). The subset XA is called the splitting
subset of the splitting attribute A at a given node. The combined information
of a splitting attribute and a splitting predicate at a given node is called the
splitting criterion. The splitting predicate QA for each attribute A partitions
a training database D into two subsets D1 and D2, where D1 contains all
samples of D for which the outcome of the splitting predicate is true, while
D2 contains all remaining samples of D. Therefore, the split selection method
builds a binary decision tree. The problem is how to select the "best" splitting
criterion for a given node? The split selection method works as follows. For
each attribute A from a set of predictor attributes, a value of the gini index
measure of all possible split points or splitting subsets is evaluated. Since
the gini index measures the impurity of the split of a training dataset into
two partitions, the splitting criterion with the lowest value of the gini index
measure is selected for a given node.

Assume as before that a training database D consists of n samples. Sup-
pose the class label attribute has m distinct values defining m distinct classes
Ci, i = 1, ..., m. Let si denote the number of samples of D in the class Ci.
The gini index of a binary split of the training dataset D into subsets D1
and D2 is defined as follows:

   gini_split(D1, D2) = (|D1| / |D|) · gini(D1) + (|D2| / |D|) · gini(D2)

where

   gini(D) = 1 - Σ_{j=1}^{m} pj^2

and pj is the relative frequency of the class Cj in D.
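A small Python sketch of this computation (the function names are assumptions of this illustration) reproduces, for instance, the value 0.2286 reported below for the splitting subset Income ∈ {low} of Figure 3.1.

def gini(class_counts):
    """gini(D) = 1 minus the sum of squared relative class frequencies."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(counts1, counts2):
    """Weighted gini index of a binary split described by per-class counts."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return (n1 / n) * gini(counts1) + (n2 / n) * gini(counts2)

# Split on Income in {low}: D1 has 4 high-risk / 0 low-risk samples,
# D2 has 2 high-risk / 8 low-risk samples (Figure 3.1).
print(round(gini_split([4, 0], [2, 8]), 4))   # ~0.2286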

To illustrate the split selection method based on gini index consider once
more the example training database D from Figure 3.1. The method does not
require that all attributes are categorical, so it is not necessary to discretize
numerical attributes. However, the method requires that the training data-
base D is sorted for each numerical attribute at each node of the tree. Let's
start with the attribute Age. Table 3.1 shows the value of the gini index for
all possible split points in the domain of the numerical attribute Age.

Split point      25     28     29     31     35     38     39     41     42     45     48
               ≤  >   ≤  >   ≤  >   ≤  >   ≤  >   ≤  >   ≤  >   ≤  >   ≤  >   ≤  >   ≤  >
C1 (high)      1  5   2  4   2  4   3  3   3  3   4  2   5  1   5  1   6  0   6  0   6  0
C2 (low)       0  8   0  8   1  7   1  7   2  6   3  5   3  5   4  4   4  4   5  3   6  2
gini           0.4396 0.3809 0.5065 0.4786 0.4571 0.4490 0.3870 0.4318 0.3429 0.3896 0.4286

Table 3.1. Gini index for Age

Similarly, we have to compute the gini index of all split points and/or
splitting subsets for the other predictor attributes. Table 3.2 shows the value of
the gini index for all possible splitting subsets in the domain of the categorical
attribute Income. The splitting criterion Income ∈ {low} has the lowest value
of the gini index. Therefore, it is selected as the splitting predicate for the root
node. The predicate partitions D into D1 and D2, where D1 contains the set
of samples with Income ∈ {low} and D2 contains all remaining samples. Note
that D1 is "pure", i.e. all its samples belong to the class high. Therefore, we stop
developing that part of the tree and create a leaf node labeled with high.
We repeat the process of selecting a splitting criterion for the subset D2. We
compute the gini index of all candidate split points or splitting subsets, and
the splitting criterion with the lowest value is selected. The final decision
tree built by the algorithm is shown in Figure 3.6.

3.3 The Overfitting Problem


The goal of classification is to find a simple (but not necessarily the simplest)
classifier that fits the training database and generalizes well to unknown
Fig. 3.6. Decision tree obtained using gini index

Splitting subset:  {low}    {medium}  {high}   {low,medium}  {low,high}  {medium,high}
                   ∈   ∉    ∈   ∉     ∈   ∉    ∈   ∉         ∈   ∉       ∈   ∉
C1 (high)          4   2    1   5     1   5    5   1         5   1       2   4
C2 (low)           0   8    4   4     4   4    4   4         4   4       8   0
gini               0.2286   0.4318    0.4318   0.4318        0.4318      0.2286

Table 3.2. Gini index for Income

future data. However, a decision tree built during the growth phase is very
often too complex and overfits the training database. The pruning phase
of decision tree construction addresses the problem of overfitting the training
data and determines the size of the final tree by removing some branches and
nodes from the constructed tree. There are two basic approaches to avoid
overfitting: prepruning approach and postpruning approach.
In the prepruning approach, during the growth phase construction of a
tree is stopped earlier by deciding not to further split the training dataset at
a given node. Upon stopping, the node becomes a leaf node and is labeled
with the class to which most of the samples at the node belong. The decision
whether to stop growing the tree at a given node is based on a measure of
the goodness of a split, such as statistical significance, information gain, or χ2.
If the value of the measure computed at the node is below a given threshold,
the tree growth at the node is stopped [Mag94,Qui93].
In the postpruning approach, after a complex tree has been grown, some
branches are removed and a subtree rooted at a node is replaced with a
leaf node. The node is labeled with the most frequent class among its for-
mer branches. Examples of the postpruning approach are cost-complexity
pruning, pruning with a test set, and pruning based on the MDL principle
[BFO+84,MRA95,QR89,Qui93]. The cost complexity pruning algorithm cal-
culates for each internal node of the tree the expected error rate that would
occur if the subtree rooted at the node were pruned and compares it with the
expected error rate occurring if the node were not pruned. If the expected
error of the pruned tree is no worse than that of the more complex tree, the
subtree is pruned, otherwise, it is kept. Another popular pruning strategy is
the pruning based on the MDL principle [QR89].
Pruning based on the Minimum Description Length (MDL) principle
views the decision tree as a means for efficiently encoding classes of samples
in the training database given a set of predictor attributes A. The MDL
principle states that the "best" tree is the one that can encode samples using
the least number of bits. First, we have to define an encoding schema that
allows us to encode any decision tree. Then, given the encoding schema, we
can prune a given tree by selecting a subtree with minimum cost of encoding.
Cost of encoding a tree (subtree) is sum of costs of encoding each node of
the tree. Each node requires 1 bit to encode its type (e.g. leaf or internal
node). Then, the cost of encoding an internal node N includes the cost of
encoding its splitting criterion, which consists of the cost of encoding the
splitting attribute (if there are J predictor attributes we need log J bits to
encode attributes) and the cost of encoding the splitting predicate. Let A
be the splitting attribute at the node N, and assume that A has v distinct
values. If the splitting predicate is of the form A ≤ xA, then there are v-1
different split points and encoding a split point requires log(v-1) bits. If
the splitting predicate is of the form A ∈ XA, then there are 2^v - 2 different
splitting subsets and encoding a splitting subset requires log(2^v - 2) bits.
We will denote the cost of a splitting criterion at a node N by C_split(N). The
cost of encoding an internal node N is C_split(N) + 1. The cost of encoding a
leaf node N includes the cost of encoding the class labels of all samples assigned
to the node N. Let N be a leaf node with n samples belonging to m different
classes Ci, i = 1, ..., m, and let si denote the number of samples belonging to
the class Ci. The amount of information necessary to classify a given sample
at the node N is E = - Σ_{i=1}^{m} (si/n) log(si/n). Thus, the cost of encoding
a leaf node N with n samples is C_leaf(N) = nE + 1. Given this encoding
schema, a binary decision tree is pruned in a bottom-up fashion according to
the following recursive procedure. Consider an internal node N and assume that
the node N has two child nodes N1 and N2. Let minC_N denote the cost
of encoding the minimum cost subtree rooted at N. It is worthwhile to prune the
child nodes N1, N2 and transform N into a leaf node if C_leaf(N) is no
worse than C_split(N) + 1 + minC_N1 + minC_N2. In other words, if the cost of
encoding the samples (their class labels) at N is lower than or equal to the cost
of encoding the subtree rooted at N, then it is worth pruning the child nodes
N1, N2 and transforming N into a leaf node of the decision tree.
To illustrate the MDL pruning procedure let us consider a fragment of
a decision tree shown in Figure 3.7. The internal node N has the splitting
predicate "Age ≤ 22", which splits a set of samples into two subsets assigned
to leaf nodes N1 and N2. The cost of the split at the node N is C_split(N) =
log J + log(v-1) = 2.6, since J = 2 and v = 4. Since the splitting predicate
partitions the set of samples at the node N into pure subsets, then minC_N1 = 1
and minC_N2 = 1. Therefore, C_split(N) + 1 + minC_N1 + minC_N2 = 2.6 +
1 + 1 + 1 = 5.6. On the other hand, the cost of encoding the leaf node N is
nE + 1 = 4(-(1/4) log(1/4) - (3/4) log(3/4)) + 1 = 4.245. Since nE + 1 ≤
C_split(N) + 1 + minC_N1 + minC_N2, N1 and N2 are pruned and N is
transformed into the leaf node of the decision tree.

Age   Marital Status   Risk
22    single           high
35    married          low
39    single           low
42    divorced         low

[Node N: split "Age ≤ 22" into leaf N1 (class high) and leaf N2 (class low)]
Fig. 3.7. MDL principle
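The comparison carried out in this example can be reproduced with a short Python sketch; the helper names and the use of base-2 logarithms are assumptions of this sketch.

from math import log2

def leaf_cost(class_counts):
    """C_leaf(N) = n*E + 1, with E the per-sample entropy of the class counts."""
    n = sum(class_counts)
    entropy = -sum((c / n) * log2(c / n) for c in class_counts if c > 0)
    return n * entropy + 1

def split_cost(num_attributes, num_values):
    """C_split(N) for a numerical split: log J + log(v - 1) bits."""
    return log2(num_attributes) + log2(num_values - 1)

# Node N of Figure 3.7: 4 samples (1 high risk, 3 low risk), J = 2, v = 4,
# and two pure child leaves with minimum subtree cost 1 each.
cost_as_leaf = leaf_cost([1, 3])                          # ~4.245
cost_of_subtree = split_cost(2, 4) + 1 + 1 + 1            # ~5.6
print(round(cost_as_leaf, 3), round(cost_of_subtree, 3))
print("prune" if cost_as_leaf <= cost_of_subtree else "keep")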

Finally, both the prepruning and postpruning approaches may be integrated
in a combined approach [RS98]. The idea is to prune a decision tree dur-
ing the building phase (not after) by executing the pruning algorithm periodically
on the partially built tree. A good example of this approach is the PUBLIC
algorithm proposed in [RS98]. The main problem is how to compute the
minimum cost subtree rooted at a node N of a partial tree if we were
to stop further expanding N. The solution applied by the PUBLIC algorithm
consists in computing a lower bound L(N) on the MDL cost of any sub-
tree rooted at a node N and using it to evaluate the "goodness" of pruning:
prune the child nodes of N if C_leaf(N) ≤ L(N). It has been proved in [RS98]
that the cost of any subtree with s > 1 nodes rooted at a node N is at
least 2s + 1 + (s-1) log J + Σ_{i=s+1}^{m} si, where J is the number of predictor
attributes, m is the number of different classes Ci, i = 1, ..., m, and si is the
number of samples belonging to the class Ci. During the building phase, the
algorithm is executed from the root node of the tree. If for a given node N
C_leaf(N) ≤ L(N), then the node is marked as a leaf node, otherwise a split
selection method is applied to N to find the splitting criterion. It has been
shown that the PUBLIC algorithm generates a tree identical to that generated
by the SPRINT algorithm; however, the integration of the building phase with
pruning results in significant savings in the overall tree construction.

From decision trees to classification rules. The knowledge acquired
by the classification algorithms presented above is represented by decision trees.
Internal nodes in a decision tree involve testing a particular attribute. Usually,
the test at a node consists in comparing an attribute value with a constant.
Leaf nodes represent a classification that applies to all samples that reach
the node. Decision trees are easy to understand and to explain the basis
for new predictions. A popular alternative to decision trees are IF-THEN
classification rules. The antecedent of a rule ("IF" part) is a conjunction of
tests, while the consequent of a rule ("THEN" part) determines the class or
classes that apply to samples covered by that rule. It is very easy to extract
classification rules directly from a decision tree. One rule is generated for
each path from the root of the tree to a leaf node. The antecedent of the rule
is a conjunction of all tests encountered on the path from the root to the leaf,
and the consequent of the rule is the class label assigned to the leaf. Consider
the decision tree shown in Figure 3.2. The extraction procedure produces the
following set of classification rules for the tree:
IF Income='high' AND Children='0' AND Marital_Status='divorced'
THEN Risk='high'
IF Income='high' AND Children='0' AND Marital_Status='single'
THEN Risk='low'
IF Income='high' AND Children='1..2'
THEN Risk='low'
IF Income='medium' AND Children='1..2'
THEN Risk= 'low'
IF Income='medium' AND Children='>2'
THEN Risk='high'
IF Income='low'
THEN Risk='high'

The rules that are directly extracted from a decision tree are more complex
than necessary, and usually they are pruned by removing redundant tests.
Given a particular rule, each test in it is considered for deletion by tentatively
removing it, working out which of the training samples are covered by the
rule, calculating from this a pessimistic estimate of the accuracy of the new
rule, and comparing this with the pessimistic estimate of the accuracy of
the original rule. If the accuracy of the new rule is better than that of the
original rule, the test is deleted. The procedure continues, checking the other tests
for deletion, and the rule is left unchanged when there are no more tests to delete. Once all rules
have been pruned, it is necessary to check whether there are any duplicates and
remove them from the set of rules. Usually, the set of rules is extended by an
additional "default" rule that covers cases not specified by other rules. The
most frequent class label among training samples is assigned to the rule as a
default.
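Rule extraction itself is a simple tree walk. The following Python sketch (the nested-dictionary tree encoding matches the construction sketch given earlier and is an assumption of this illustration, not the chapter's notation) produces one IF-THEN rule per root-to-leaf path.

def extract_rules(tree, conditions=()):
    """Yield (conditions, class_label) pairs, one per path from root to leaf."""
    if not isinstance(tree, dict):               # leaf: tree is a class label
        yield conditions, tree
        return
    for (attribute, value), subtree in tree.items():
        yield from extract_rules(subtree, conditions + ((attribute, value),))

# A fragment of the tree from Figure 3.2, encoded as nested dictionaries.
tree = {
    ("Income", "low"): "high",
    ("Income", "medium"): {
        ("Children", "1..2"): "low",
        ("Children", ">2"): "high",
    },
}
for conds, label in extract_rules(tree):
    antecedent = " AND ".join(f"{a}='{v}'" for a, v in conds)
    print(f"IF {antecedent} THEN Risk='{label}'")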

3.4 Other Classification Methods

There are a number of classification methods described in the literature:
Bayesian classifiers, neural network classifiers, k-nearest neighbors, case-
based reasoning, genetic algorithms, rough sets, fuzzy sets, association-based
classification [Aha92,AP94,Bis95,CS96,DH73,Hec96,Kol93,Mic92,MST94],
[Mit96,Paw91,SS96,WF00,Zad65,Zia94]. Some of these methods are used in
commercial data mining tools, like Bayesian classifiers and backpropagation;
others are less popular but offer some interesting features from
the point of view of particular applications. In this section we will briefly
describe some of these methods.

Bayesian classifiers. Bayesian classifier is a statistical classifier. It can pre-


dict the probability that a given sample belongs to a particular class. Bayesian
classification is based on Bayes theorem of a-posteriori probability. Let X is
a data sample whose class label is unknown. Each sample is represented by
n-dimensional vector, X = (Xl, X2, •.• , Xk). The classification problem may
be formulated using a-posteriori probabilities as follows: determine P(C I X),
the probability that the sample X belongs to a specified class C. P( C I X)
is the a-posteriori probability of C conditioned on X. For example, consider the
set of samples describing credit applicants depicted in Figure 3.1. P(Risk =
low | Age = 38, Marital_Status = divorced, Income = low, Children = 2) is
the probability that a credit applicant X = (38, divorced, low, 2) is a low
credit risk applicant. The idea of Bayesian classification is to assign to a new
unknown sample X the class label C such that P(C | X) is maximal.
The main problem is how to estimate the a-posteriori probability P(C | X).
By Bayes' theorem, P(C | X) = P(X | C)P(C) / P(X), where P(C) is the a-priori
probability of C, that is, the probability that any given sample belongs to
the class C, P(X | C) is the a-posteriori probability of X conditioned on C,
and P(X) is the a-priori probability of X. In our example, P(X | C) is the
probability that X = (38, divorced, low, 2) given the class Risk = low, P(C)
is the probability of the class C, and P(X) is the probability of observing the
sample X = (38, divorced, low, 2).
Suppose a training database D consists of n samples. Suppose the class
label attribute has m distinct values defining m distinct classes Ci, for i =
1, ..., m. Let si denote the number of samples of D in class Ci. As we
said above, the Bayesian classifier assigns an unknown sample X to the class
Ci that maximizes P(Ci | X). Since P(X) is constant for all classes, the
class Ci for which P(Ci | X) is maximized is the class Ci for which P(X |
Ci)*P(Ci) is maximized. P(Ci) may be estimated by si/n (the relative frequency
of the class Ci), or we may assume that all classes have the same probability
P(C1) = P(C2) = ... = P(Cm). The main problem is how to compute P(X |
Ci). Given a large dataset with many predictor attributes, it would be very
expensive to compute P(X | Ci). Therefore, to reduce the cost of computing
P(X | Ci), the assumption of class conditional independence, or, in other
words, the attribute independence assumption, is made. The assumption states
that there are no dependencies among predictor attributes, which leads to
the following formula: P(X | Ci) = Π_{j=1}^{n} P(xj | Ci). The probabilities P(x1 |
Ci), P(x2 | Ci), ..., P(xn | Ci) can be estimated from the dataset:
• If the j-th attribute is categorical, then P(xj | Ci) is estimated as the relative
frequency of samples of the class Ci having value xj for the j-th attribute,
• If the j-th attribute is continuous, then P(xj | Ci) is estimated through a
Gaussian density function.
Due to the class conditional independence assumption, the Bayesian clas-
sifier is also known as the naive Bayesian classifier. The assumption makes
computation feasible. Moreover, when the assumption is satisfied, the naive
Bayesian classifier is optimal, that is, it is the most accurate classifier in
comparison to all other classifiers. However, the assumption is seldom satis-
fied in practice, since attributes are usually correlated. Several attempts are
being made to apply Bayesian analysis without assuming attribute indepen-
dence. The resulting models are called Bayesian networks or Bayesian belief
networks [Hec96]. Bayesian belief networks combine Bayesian analysis with
causal relationships between attributes.
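
To make the estimation step above concrete, here is a minimal sketch of a naive Bayesian classifier for categorical attributes, using the relative-frequency estimates described above (function and variable names are illustrative; a zero count would normally be handled with Laplace smoothing, which is omitted here):

from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Estimate P(Ci) and the counts needed for P(xj | Ci) from the training data."""
    n = len(samples)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    # cond[c][j][v] = number of class-c samples whose j-th attribute has value v
    cond = defaultdict(lambda: defaultdict(Counter))
    for x, c in zip(samples, labels):
        for j, v in enumerate(x):
            cond[c][j][v] += 1
    return priors, cond, class_counts

def classify(x, priors, cond, class_counts):
    """Assign the class Ci maximizing P(Ci) * prod_j P(xj | Ci)."""
    best, best_score = None, -1.0
    for c, p_c in priors.items():
        score = p_c
        for j, v in enumerate(x):
            score *= cond[c][j][v] / class_counts[c]   # relative frequency of value v in class c
        if score > best_score:
            best, best_score = c, score
    return best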

Neural network classifiers. A neural network classifier is a learning algorithm
that performs learning on a multilayer feed-forward neural network.
A neural network is a set of connected input/output units where each
connection has an associated weight. The inputs of the network correspond
to the attributes of training samples. The input units of the network make
up a layer called the input layer. The weighted outputs of the input layer
are input to a second layer of units known as the hidden layer. The weighted
outputs of the hidden layer can be input to another hidden layer. The net-
work may have several hidden layers. The weighted outputs of the last hidden
layer are input to the output layer, which produces class label prediction for
samples. The neural network is feed-forward if none of the weights cycles back
to an input unit or to a hidden unit of a previous layer. During the training
phase, the neural network classifier learns by adjusting the weights in order
to correctly predict the class label for new samples.
The most popular neural network algorithm for classification is the back-
propagation algorithm [RHW86]. The backpropagation algorithm learns by
processing iteratively a set of training samples. For each training sample, it
compares the actual class label with the network's class label prediction for
the sample. Then, the weights of connections are modified so as to minimize
the mean squared error between the network's predicted class label and the
actual class label. The modification process is made in the backward direction
from the output layer to the first hidden layer. Advantages of the backprop-
agation algorithm include its high accuracy and robustness with respect to
noisy data and outliers. However, the algorithm requires a long training time
and a number of parameters that have to be chosen empirically, such as the
network topology. For many years the algorithm has been criticized for its
poor interpretability. As we said already before, the aim of any classification
algorithm is to correctly predict a class label of new samples and to pro-
duce an explicit structural description of the knowledge that is learned (e.g.
sets of rules or decision trees). This description is used to explain what has
been learned and to explain the basis for new predictions. Neural network
algorithms learn to classify samples in ways that do not involve explicit struc-
tural descriptions of the knowledge that is learned. The knowledge acquired
by neural network algorithms is represented by a network of connected units
with different weights, which is very difficult to understand and interpret. Re-
cently, several algorithms have been proposed to extract rules from a neural
network, which makes neural networks more useful and convenient for data
classification [LSL95].

k-Nearest neighbor classifiers. The nearest neighbor classifier belongs to
instance-based learning methods. Instance-based learning methods differ from
other classification methods discussed earlier in that they do not build a clas-
sifier until a new unknown sample needs to be classified. Each training sample
is described by an n-dimensional vector representing a point in an n-dimensional
space called pattern space. When a new unknown sample has to be classi-
fied, a distance function is used to determine a member of the training set
which is closest to the unknown sample. Once the nearest training sample
is located in the pattern space, its class label is assigned to the unknown
sample. The main drawback of this approach is that it is very sensitive to
noisy training samples. The common solution to this problem is to adopt the
k-nearest neighbor strategy. When a new unknown sample has to be classi-
fied, the classifier searches the pattern space for the k training samples which
are closest to the unknown sample. These k training samples are called the
k "nearest neighbors" of the unknown sample. The most commOn class label
among k "nearest neighbors" is assigned to the unknown sample. To find the
k "nearest neighbors" of the unknown sample a multidimensional index is
used (e.g. R-tree, Pyramid tree, etc.).
Two different issues need to be addressed regarding k-nearest neighbor
method: the distance function and the transformation from a sample to a
point in the pattern space. The first issue is to define the distance function.
If the attributes are numeric, most k-nearest neighbor classifiers use Euclidean
distance. Assuming an n-dimensional Euclidean space, the distance between two
points X and Y, X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn), is defined as

    d(X, Y) = sqrt( Σ_{i=1}^{n} (xi - yi)² ).

Instead of the Euclidean distance, we may also apply other distance metrics
like Manhattan distance, maximum of dimensions, or Minkowski distance.
The choice of a given distance metric depends on an application. The second
issue is how to transform a sample to a point in the pattern space. Note that
different attributes may have different scales and units, and different variabil-
ity. Thus, if the distance metric is used directly, the effects of some attributes
might be dominated by other attributes that have larger scale or higher vari-
ability. A simple solution to this problem is to weight the various attributes.
One common approach is to normalize all attribute values into the range [0,
1]. This solution is sensitive to the outliers problem since a single outlier could
cause virtually all other values to be contained in a small subrange. Another
common approach is to apply a standardization transformation, such as sub-
tracting the mean from the value of each attribute and then dividing by its
standard deviation. Recently, another approach was proposed which consists
in applying the robust space transformation called Donoho-Stahel estimator
[KNZ01]. The estimator has some important and useful properties that make
the estimator very attractive for different data mining applications.
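
A small sketch of a k-nearest neighbor classifier with min-max normalization along the lines discussed above (a linear scan replaces the multidimensional index; names are illustrative):

import math
from collections import Counter

def min_max_normalize(samples):
    """Rescale every attribute into [0, 1] so attributes with large scales do not dominate the distance."""
    lows = [min(col) for col in zip(*samples)]
    highs = [max(col) for col in zip(*samples)]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(x, lows, highs)] for x in samples]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(query, samples, labels, k=3):
    """Return the most common class label among the k training samples closest to the query."""
    # In practice both the training samples and the query are normalized first,
    # e.g. with min_max_normalize above.
    nearest = sorted(range(len(samples)), key=lambda i: euclidean(query, samples[i]))[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
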
The description of other classification methods like case-based reasoning,
genetic algorithms, rough sets, and fuzzy sets can be found in [AP94,Kol93],
[CPS98,Mic92,Mit96,Paw91,SS96,Zad65,Zia94].

3.5 Classifier Accuracy

The accuracy of a classifier on a given test set of samples is defined as the
percentage of test samples correctly classified by the classifier, and it mea-
sures the overall performance of the classifier. Note that the accuracy of the
classifier is not estimated on the training dataset, since it would not be a
good indicator of the future accuracy on new data. The reason is that the
classifier generated from the training dataset tends to overfit the training
data, and any estimate of the classifier's accuracy based on that data will
be overoptimistic. In other words, the classifier is more accurate on the data
that was used to train the classifier, but very likely it will be less accurate on
an independent set of data. Therefore, to predict the accuracy of the classifier
on new data, we need to assess its accuracy on an independent dataset that
played no part in the formation of the classifier. This dataset is called the
test set. It is important to note that the test dataset should not be used
in any way to build the classifier.
There are several methods for estimating classifier accuracy. The choice
of a method depends on the amount of sample data available for training and
testing. If there are a lot of sample data, then the following simple holdout
method is usually applied. The given set of samples is randomly partitioned
into two independent sets, a training set and a test set. Typically, 70% of the
data is used for training, and the remaining 30% is used for testing. Provided
that both sets of samples are representative, the accuracy of the classifier on
the test set will give a good indication of accuracy on new data. In general,
it is difficult to say whether a given set of samples is representative or not,
but at least we may ensure that the random sampling of the data set is done
in such a way that the class distribution of samples in both training and test
set is approximately the same as that in the initial data set. This procedure
is called stratification.
Note that a classifier accuracy computed on a test set is only an estimate of
the true value of the classifier accuracy on the target (new) data set. Assume
p denotes a true (unknown) value of the classifier accuracy and f denotes the
classifier accuracy measured on a test set. The question is how close f is to p?
The answer is usually expressed as a confidence interval, that is, p lies within
a specified interval [f - z, f + z] with a certain specified confidence, which
depends on the size of the test set and the data distribution. The following
formula taken from [WF00] gives the values of the upper and lower confidence
boundaries for p:

    p = ( f + z²/(2N) ± z*sqrt( f/N - f²/N + z²/(4N²) ) ) / ( 1 + z²/N )

where z is the standard-normal value corresponding to the chosen confidence level.
For example, if f = 75% on a test set of size N = 1000, then with 80%
confidence (z ≈ 1.28) p lies within the interval [73.2%, 76.7%].
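
As a quick check, a short computation of the formula above (the function name is illustrative) reproduces the interval:

import math

def confidence_bounds(f, n, z):
    """Lower and upper bounds for the true accuracy p, given accuracy f measured on n test samples."""
    center = f + z * z / (2 * n)
    half_width = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - half_width) / denom, (center + half_width) / denom

low, high = confidence_bounds(0.75, 1000, 1.28)   # z = 1.28 for roughly 80% confidence
print(f"[{low:.1%}, {high:.1%}]")                 # prints approximately [73.2%, 76.7%]
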
If the amount of data for training and testing is limited, the problem is
how to use this limited amount of data for training to get a good classi-
fier and for testing to obtain a correct estimation of the classifier accuracy?
The standard and very common technique of measuring the accuracy of a
classifier when the amount of data is limited is k-fold cross-validation. In k-
fold cross-validation, the initial set of samples is randomly partitioned into k
approximately equal mutually exclusive subsets, called folds, S1, S2, ..., Sk.
Training and testing is performed k times. At each iteration, one fold is used
for testing while the remaining k - 1 folds are used for training. So, at the end,
each fold has been used exactly once for testing and k - 1 times for training. The
accuracy estimate is the overall number of correct classifications from k itera-
tions divided by the total number of samples N in the initial dataset. Often,
the k-fold cross-validation technique is combined with stratification and is
called stratified k-fold cross-validation.
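
A minimal sketch of (unstratified) k-fold cross-validation; train_fn and classify_fn stand for any training and prediction routines and are assumptions for illustration, not part of the chapter:

import random

def k_fold_accuracy(samples, labels, train_fn, classify_fn, k=10, seed=0):
    """Estimate accuracy: each fold is tested once on a model trained on the other k-1 folds."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    correct = 0
    for i in range(k):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = train_fn([samples[j] for j in train_idx], [labels[j] for j in train_idx])
        correct += sum(classify_fn(model, samples[j]) == labels[j] for j in folds[i])
    return correct / len(samples)   # total correct classifications over all N samples
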
There are many other methods of estimating classifier accuracy on a
particular dataset. Two popular methods are leave-one-out cross-validation
and bootstrapping. Leave-one-out cross-validation is simply N-fold cross-
validation, where N is the number of samples in the initial dataset. At each
iteration, a single sample from the dataset is left out for testing, and the remain-
ing samples are used for training. The result of testing is either success or
failure. The results of all N evaluations, one for each sample from the dataset,
are averaged, and that average represents the final accuracy estimate. The
second estimation method, called bootstrapping, is based on the sampling
with replacement. The general idea is the following. The initial dataset is
sampled N times, where N is the total number of samples in the dataset,
with replacement, to form another set of N samples for training. Since some
samples in this new "set" will be repeated, some samples
from the initial dataset will not appear in this training set. These samples will
form a test set. Both mentioned estimation methods are interesting especially
for estimating classifier accuracy for small datasets. However, in practice the
standard and most popular technique of estimating a classifier accuracy is
stratified tenfold cross-validation.
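
For completeness, a sketch of how a bootstrap training set and its complementary test set can be drawn, using sampling with replacement as described above (on average about 63.2% of the distinct samples end up in the training set):

import random

def bootstrap_split(n, seed=0):
    """Sample n indices with replacement for training; indices never drawn form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]     # sampling with replacement
    test = sorted(set(range(n)) - set(train))        # roughly 36.8% of the samples on average
    return train, test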

3.6 Prediction

The main goal of prediction is to construct a formal concise model for pre-
dicting numeric values or value ranges. The constructed model can be used
to predict, for example, the sales of a product given its price. As in the clas-
sification, the input to the prediction problem is a training set of tuples. The
outcome of the prediction is a value or a range of values. The classification
methods we have discussed in the previous section work well with numerical
as well as categorical predictor attributes. However, when the dependent at-
tribute is numeric, and all the predictor attributes are also numeric, then we
may apply well known statistical methods of regression.

Linear regression. Linear regression is an excellent, simple method from
statistics that can be used to predict numeric values. The idea is to express the
value of the dependent attribute X as a linear combination of the predictor
attributes A with predetermined regression coefficients:

    x = w0 + w1*a1 + w2*a2 + ... + wk*ak

where x is the value of the dependent attribute X, a1, a2, ..., ak are the
predictor attribute values, and w0, w1, w2, ..., wk are the regression coefficients.
This is a regression equation, and the process of determining the coefficients
is called regression.
The coefficients are calculated from the training dataset of tuples by
the method of least squares. Assume the first sample from the training
dataset has the dependent attribute value x^1, and predictor attribute values
a1^1, a2^1, ..., ak^1, where the superscript denotes that it is the first sample. For
simplicity of notation, assume an extra attribute a0 whose value is always 1.
The predicted value for the first sample is the following:

    w0*a0^1 + w1*a1^1 + w2*a2^1 + ... + wk*ak^1 = Σ_{j=0}^{k} wj*aj^1

Of interest is the difference between the actual value x^1 of the first sample and the
predicted value given by the above formula. The method of linear regression
is to choose the regression coefficients wi, i = 0, 1, ..., k, so as to minimize the
sum of the squares of these differences over all training samples. Given n
samples, the sum of the squares of the differences is defined as follows:

    Σ_{i=1}^{n} ( x^i - Σ_{j=0}^{k} wj*aj^i )²

This sum of squares is what we have to minimize by choosing the regres-
sion coefficients appropriately. There are several popular statistical software
packages solving regression problems, like SAS, SPSS, etc. The model learned
from the training database using linear prediction is called a linear model, since
the dependency between the dependent attribute and predictor attributes is
modeled as a linear function.
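
As a sketch of how such a model can be fitted in practice, the following uses NumPy's least-squares solver rather than any particular statistical package; the function names are illustrative:

import numpy as np

def fit_linear_model(A, x):
    """Regression coefficients w0, w1, ..., wk minimizing the sum of squared differences."""
    A = np.asarray(A, dtype=float)                  # n samples x k predictor attribute values
    x = np.asarray(x, dtype=float)                  # n dependent attribute values
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])   # extra attribute a0 = 1 for the intercept w0
    w, *_ = np.linalg.lstsq(A1, x, rcond=None)
    return w

def predict(w, a):
    """Predicted value w0 + w1*a1 + ... + wk*ak for a new sample a."""
    return float(w[0] + np.dot(w[1:], a))
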
Linear models suffer from one serious disadvantage: linearity. Basic lin-
ear models are inappropriate for modeling data that exhibits a nonlinear depen-
dency between attributes, which makes them too simple in many practical
situations. However, these models may serve as building blocks for more com-
plex models.

Regression trees. Decision trees are designed to predict class labels of new
unseen data. When it comes to predicting numeric values, the same kind of tree
representation can be used. Trees used for numeric prediction are just like or-
dinary decision trees except that each leaf node of the tree contains a numeric
value that represents the average value of all training samples that reach the
leaf node. This kind of tree is called a regression tree [BFO+84,WF00].
Regression trees are constructed in a similar way to decision trees. First,
a decision tree induction algorithm is used to build the initial tree. Then,
when the initial tree is constructed, the tree is pruned by removing some
branches and nodes from the constructed tree. The main difference between
the construction of a decision tree and a regression tree consists in the split-
ting criterion. Decision tree induction algorithms find the splitting attribute
of a node of the decision tree by minimizing an impurity measure (en-
tropy or Gini index). In regression tree construction the splitting criterion is
usually based on the standard deviation of the class values in training data-
base D as a measure of the error at the node. The attribute that maximizes
the expected error reduction is chosen for splitting at the node.
The expected error reduction, denoted SDR, is calculated by the following
formula [WF00]:

    SDR = sd(D) - Σ_i ( |S_i| / |D| ) × sd(S_i)

where S1, S2, ... are the sets of training data that result from splitting the
node according to the chosen attribute, and sd(D), sd(S_i) denote the standard
deviation of D and S_i, respectively.
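
A short sketch of evaluating this criterion for one candidate split (illustrative names; the population standard deviation is used):

import statistics

def sdr(node_values, subsets):
    """Expected error reduction: sd of the node's class values minus the weighted sd of the subsets."""
    def sd(values):
        return statistics.pstdev(values) if values else 0.0
    n = len(node_values)
    return sd(node_values) - sum(len(s) / n * sd(s) for s in subsets)

# e.g. sdr([10, 12, 30, 32], [[10, 12], [30, 32]]) is large, so this split would be preferred
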
Regression trees are usually larger, more complex and much more difficult
to interpret than corresponding regression equations.

4 Clustering
Clustering is a process of grouping a set of physical or abstract objects into a
set of classes, called clusters, according to some similarity function. A cluster is
a collection of objects that are similar to one another within the cluster and
dissimilar to objects in other clusters. Objects belonging to one cluster can be
treated collectively as one group. Unlike classification, there are no predefined
classes or class-labeled training objects. A "good" clustering method produces
a number of clusters in which the intra-cluster similarity is high, and the
inter-cluster similarity is low.
Clustering has a wide range of applications, including marketing, pattern
recognition, data analysis, image processing, biology, banking, and informa-
tion retrieval. For example, in marketing, clustering helps discover groups of
customers with similar behavior based on purchasing patterns, discover cus-
tomers with unusual behavior, discover companies with similar growth or
similar energy consumption, etc. Clustering can be used to classify similar
documents on the Web for information discovery or to discover groups of
Web users with similar access patterns. In biology, clustering can be used for
animal or plant classification. In general, by clustering we can discover overall
distribution patterns and interesting correlations among objects attributes.
For other examples of clustering applications see [FPS+96,HK00].
Clustering is a well known research problem intensively studied in many
areas including machine learning, statistics, biology, and data mining. In ma-
chine learning, clustering was analyzed as an example of unsupervised classi-
fication (or unsupervised learning). In statistics, cluster analysis was focused
mainly on distance-based cluster analysis, where each object is described as
an n-dimensional data feature vector. Recently, due to the huge amount of
data collected in databases and data warehouses, clustering has become a
highly active topic in data mining research. In data mining, current research
on clustering focuses on the scalability of clustering algorithms with respect
to the number of objects, the number of dimensions, and the noise level,
and effectiveness of clustering algorithms for new types of data (numerical,
categorical, sequences, unstructured documents, Web pages, etc.) [HK00].
A lot of clustering methods and algorithms have been proposed in the
literature. The choice of the clustering algorithm depends both on the type
of analyzed data and on the particular purpose and application. In general,
clustering algorithms can be classified into the following 5 categories:
• Partitioning methods.
• Hierarchical methods.
• Density-based methods.
• Grid-based methods.
• Model-based methods.
In the following we will briefly describe basic clustering methods.

4.1 Partitioning Methods


Partitioning methods construct a partition of n objects into a set of k clusters,
where each cluster represents a set of "similar" objects, whereas the objects
belonging to different clusters are "dissimilar". Formally, the problem can be
formulated as follows. Given a database of n objects and k, the number of
clusters to create, find a partition of n objects into k clusters that optimizes
the chosen partitioning criterion.
The partitioning method works as follows. It creates an initial partitioning
of objects, and, then, applying an iterative reallocation technique, attempts
to improve the partitioning by moving objects from one cluster to another.
Obviously, to achieve global optimality, the method would require the exhaus-
tive enumeration of all possible partitions. However, this is computationally
infeasible. Therefore, most partitioning methods adopt one of two popular
heuristic techniques: (1) the K-means algorithm, where each cluster is repre-
sented by the center of the cluster, or (2) K-medoids algorithm, where each
cluster is represented by one of the objects in the cluster.
The most popular and commonly used partitioning method is K-means
method and its variations [Fuk90,JD88,JMF99,Mcq67]. Given a set D =
{D1, D2, ..., Dn} of n objects, let Di = (di1, di2, ..., dim) be a point in the space
Rm representing an object Di. Let k be the number of desired clusters. The
K-means clustering problem is as follows. Find cluster centers (or means)
C1, C2, ..., Ck of k clusters such that the objective function E(k) is mini-
mized. Typically, the objective function to be minimized for the K-means algo-
rithm is the squared-error criterion
    E(k) = Σ_{i=1}^{k} Σ_{j=1}^{n_i} dist²(Dj, Ci)

where Dj is the point in space Rm representing an object Dj, Ci is the mean
of cluster i, ni is the number of objects in cluster i, and dist(Dj, Ci) is the
2-norm (Euclidean) distance between the point Dj and its nearest cluster
center Ci.
Algorithm: K-means Clustering Algorithm

Input: a training database D of n objects, the number of clusters k
Output: a set of k clusters that minimizes the squared-error criterion

(1) choose k objects at random as the initial cluster centers;
(2) repeat
(3) for each object Di ∈ D, assign Di to cluster i such that center Ci is nearest
to Di according to the Euclidean distance function;
(4) for each cluster i compute Ci as the mean of all objects assigned to the
cluster i;
(5) until no change;

Fig. 4.1. K-means clustering algorithm

The K-means algorithm solves the clustering problem iteratively. First,
k points (objects) are chosen at random as cluster centers. For each of the
remaining objects, an object is assigned to its closest cluster center according
to the Euclidean distance function. Then, for each cluster the new center is
calculated as the mean of all objects assigned to the cluster. Finally, the whole
process is repeated with the new centers. Iteration continues until the same
points are assigned to each cluster in consecutive iterations. The K-means
algorithm is presented in Figure 4.1.
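
A compact sketch of the algorithm in Figure 4.1, assuming the objects are given as tuples of numeric coordinates:

import random

def k_means(points, k, max_iter=100, seed=0):
    """Iteratively assign points to their nearest center and recompute each center as the cluster mean."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)                      # (1) k objects chosen at random
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                 # (3) assign to the nearest center
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        new_centers = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
                       for i, c in enumerate(clusters)]  # (4) recompute the means
        if new_centers == centers:                       # (5) stop when nothing changes
            return new_centers, clusters
        centers = new_centers
    return centers, clusters
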
The K-means clustering algorithm is simple, scalable and reasonably ef-
fective. The algorithm attempts to produce k partitions that minimize the
objective criterion. However, the final clusters do not represent a global min-
imum but only a local one. Depending on the initial choice of cluster centers,
the algorithm may produce completely different final clusters. The standard
procedure applied to increase the chance of obtaining a "good" partitioning
is to repeat the clustering algorithm several times with different initial cluster
centers and choose the best partitioning. The basic K-means algorithm has
several disadvantages. First, the algorithm is sensitive to the initial parti-
tion of objects. Second, the algorithm is sensitive to noisy data and outliers.
Third, it is not suitable to discover clusters with non-convex shapes. Finally,
it is applicable only to objects described by numerical attributes.
A large number of variants of the basic K-means algorithm have been
developed [And73,Mit97]. Some of them attempt to select good initial clus-
ter centers so that the algorithm is more likely to find the global minimum.
Another variation of the K-means algorithm is to permit splitting and merg-
ing of the resulting clusters. Typically, a cluster is split when its variance
is above a pre-specified threshold, and two clusters are merged when the
distance between their centers is below another pre-specified threshold. Us-
ing this variant, it is possible to obtain the optimal partition starting from
any arbitrary initial partition provided proper threshold values are specified.
Another variation of the K-means algorithm involves selecting a different
objective function or strategies to calculate cluster centers.
An interesting generalization of the K-means algorithm is the EM (Ex-
pectation Maximization) algorithm [Lau95,Mit97]. The algorithm is the com-
bination of the probability-based clustering with the K-means paradigm.
From a statistical perspective, the goal of clustering is to find the most likely
set of clusters given a set of objects. The foundation of statistical clustering
is a statistical model called finite mixture. A mixture is a set of k probability
distributions representing k clusters. Each distribution gives the probability
that a particular object would have a certain set of attribute values if it were
known to be a member of a given cluster. We assume that the individual
components of the mixture model are Gaussian but with different means and
variances. The clustering problem is to take a set of objects, a number of clus-
ters, and work out each cluster mean, standard deviation and the population
distribution between the clusters. If we knew which of the distributions each ob-
ject came from, finding the parameters of the mixture model would be easy.
On the other hand, if we knew the parameters of the model, then finding
the probabilities that a given object comes from given distribution would be
easy too. The problem is that we know neither the distribution that
each object came from, nor the parameters of the mixture model. So, EM
adopts the K-means paradigm to estimate these distributions and param-
eters from the objects. The EM algorithm begins with an initial estimate
of parameters, then uses them to calculate the cluster probabilities for each
object. The calculated probabilities are then used to update the parameter
estimates, and the process repeats.
One of the main disadvantages of the K-means algorithm is its sensitivity
to outliers. In general, by outliers we mean a set of objects that are consid-
erably dissimilar from the remainder of objects. Outliers may substantially
distort the distribution of objects among clusters. To deal with the problem
of outliers and noisy data, the K-medoids clustering method has been pro-
posed [KR90,NH94]. The basic idea of the algorithm consists in replacing
the center of a cluster, as a reference point of the cluster, by the medoid,
which is the most centrally located object in the cluster. The basic strategy
of K-medoids clustering algorithms consists in finding k representative ob-
jects (medoids) representing k clusters. The strategy then iteratively replaces
one of the medoids by one of the non-medoid objects if it improves the qual-
ity of the resulting partition. The quality is estimated using a cost function,
called total swapping cost, that measures the average dissimilarity between
an object and the medoid of its cluster. One of the first K-medoid clustering
algorithms was PAM (Partitioning Around Medoids) [KR90]. The algorithm
is less sensitive to outliers and noisy data than K-means algorithms, however,
it has higher processing cost than the K-means algorithms.
PAM works effectively for small data sets, but does not scale well for large
data sets. To cope with large data sets, a sampling-based K-medoid algo-
rithm, called CLARA (Clustering LARge Applications), can be used [KR90].
The algorithm works as follows. It draws multiple samples of the data set, ap-
plies PAM on each sample, and returns the best clustering as the output. If a
sample is selected in a fairly random manner, it should correctly represent the
original data set. Multiple samples increase the chance of producing a "good"
clustering. The problem is that a good clustering based on samples will not
necessarily represent a good clustering of the whole data set. Therefore, the
effectiveness of CLARA depends on the size of the sample. The larger the sample
size, the greater the probability of finding the best medoids for the clusters. Notice
that CLARA looks for the best k medoids among the selected sample of the data
set. It may happen that the sampled medoids are not among the best medoids of the
data set. In this case CLARA will never find the best clustering. Another
interesting variant of the K-medoids algorithm is the CLARANS (Cluster-
ing Large Applications based on RANdom Search) algorithm [NH94], which
combines randomized search with PAM and CLARA algorithms. The clus-
tering process is formalized as searching a graph in which each node is
a K-partition represented by a set of K medoids. Two nodes of the graph
are neighbors if they differ by only one medoid. CLARANS starts with a
randomly selected node. For the current node, it randomly checks maxneighbor
neighbors, where maxneighbor is a user-specified parameter. If a
better solution (neighbor) is found, CLARANS moves to the neighbor and
continues. Otherwise, it records the current node as a local optimum and
starts with a new randomly selected node. The algorithm stops after some
local optima have been found and returns the best one.

4.2 Hierarchical Methods


Hierarchical methods produce a nested series of partitions of the given set of
objects. In general, there are two types of hierarchical clustering algorithms
[HK00]: (1) agglomerative algorithms and (2) divisive algorithms. The ag-
glomerative approach, called also the bottom-up approach, begins with each
object in a distinct (singleton) cluster, and successively merges the nearest
pair of clusters until the number of clusters becomes k or a certain stop-
ping condition is satisfied. The divisive approach, called also the top-down
approach, begins with all objects in a single cluster, then finds the most
inhomogeneous cluster and performs splitting into smaller clusters until a
stopping criterion is met.
Most hierarchical clustering algorithms are variants of the agglomer-
ative approach. A typical hierarchical agglomerative clustering algorithm is
presented in Figure 4.2.
The algorithm works as follows. Initially, it places each object in its own
cluster and constructs a sorted list of intercluster distances for all pairs of
clusters. Then, for each distinct distance value dk the algorithm forms step by
Algorithm: Hierarchical Agglomerative Clustering Algorithm


Input: a database D of n objects
Output: a dendrogram representing the nested grouping of objects

(1) place each object in its own cluster;


(2) construct the matrix of intercluster distances for all distinct pairs of clusters;
(3) for each distinct dissimilarity value dk repeat
(4) form a graph on the clusters where pairs of clusters closer than dk are
connected by a graph edge;
(5) until all clusters are members of a connected graph;

Fig. 4.2. Hierarchical agglomerative clustering algorithm

step a graph in which pairs of clusters closer than dk are connected by a graph
edge. If all initial clusters are members of a connected graph, the algorithm
stops. The output of the algorithm is a nested hierarchy of graphs (tree of
clusters), called a dendrogram, representing the nested grouping of objects
and similarity levels at which groupings change. A clustering of objects is
obtained by cutting the dendrogram at the desired dissimilarity level. Then,
each connected component in the corresponding graph forms a cluster (see
Figure 4.3).

Fig. 4.3. A dendrogram (the vertical axis shows the similarity level; the leaves at the bottom correspond to the individual clusters)

The hierarchical agglomerative clustering algorithms differ only in their
definition of intercluster distance. According to the definition of intercluster
distance, they can be divided into four broad groups:

1. Single link clustering algorithms.
2. Complete link clustering algorithms.
3. Average link clustering algorithms.
4. Centroid link clustering algorithms.
In the single link clustering method, the distance between two clusters
Ci, Cj is defined as the minimum of the distances between all pairs of objects
drawn from the two clusters:

    Dist(Ci, Cj) = min_{oi∈Ci, oj∈Cj} dist(oi, oj)

where oi, oj are objects belonging to clusters Ci, Cj, respectively, and
dist(oi, oj) is the distance between objects oi and oj.
In the complete link clustering method, the distance between two clusters
Ci, Cj is defined as the maximum of all pairwise distances between objects
in the two clusters:

    Dist(Ci, Cj) = max_{oi∈Ci, oj∈Cj} dist(oi, oj)

In the average link clustering method, the distance between two clusters
Ci, Cj is defined as the average of all pairwise distances between
objects in the two clusters:

    Dist_avg(Ci, Cj) = (1 / (ni*nj)) Σ_{oi∈Ci} Σ_{oj∈Cj} dist(oi, oj)

where ni, nj denote the number of objects in clusters Ci and Cj, respectively.
Finally, in the centroid link clustering method, the distance between two
clusters Ci, Cj is defined as the distance between their centers (means):

    Dist(Ci, Cj) = dist(mi, mj)

where mi, mj denote the centers of clusters Ci and Cj, respectively.
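
These definitions translate directly into code; a small sketch, where dist is any pairwise distance function (e.g. the Euclidean distance) and clusters are lists of points:

def single_link(ci, cj, dist):
    return min(dist(oi, oj) for oi in ci for oj in cj)

def complete_link(ci, cj, dist):
    return max(dist(oi, oj) for oi in ci for oj in cj)

def average_link(ci, cj, dist):
    return sum(dist(oi, oj) for oi in ci for oj in cj) / (len(ci) * len(cj))

def centroid_link(ci, cj, dist):
    def center(cluster):                       # mean of the cluster, coordinate by coordinate
        return tuple(sum(col) / len(cluster) for col in zip(*cluster))
    return dist(center(ci), center(cj))
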
Hierarchical clustering algorithms are more versatile than partitioning
clustering algorithms. They can produce non-isotropic clusters, clusters in-
cluding well-separated, chain-like, and concentric clusters, while typical par-
titioning algorithms work well only on objects having isotropic clusters.
The basic weakness of hierarchical clustering algorithms is that once a
step has been executed (merge or split of clusters), it can never be undone.
Thus, merge or split decisions, if not well chosen at some step, may lead to
low-quality partitioning of objects. Moreover, these algorithms do not scale
well with regard to the number of objects. The time complexity of hierarchical
agglomerative clustering algorithms is O(n² log n), while the space complex-
ity is O(n²). So, the time and space complexity of hierarchical agglomerative
clustering algorithms is usually higher than those of the partitioning cluster-
ing algorithms.
An interesting direction for improving the quality of hierarchical
clustering algorithms is to combine the idea of hierarchical clustering with
distance-based clustering methods, as is done in the BIRCH and CURE meth-
ods [GRS98,ZRL96].
BIRCH algorithm. BIRCH (Balanced Iterative Reducing and Cluster-
ing using Hierarchies) [ZRL96] stores summary information about candidate
clusters in a dynamic tree data structure called clustering feature tree (CF-
tree). The CF-tree hierarchically organizes the candidate clusters represented
at the leaf nodes. The basic concept used by BIRCH to describe a subclus-
ter of objects is a clustering feature vector. A clustering feature vector (CF-
vector) is a triplet summarizing information about a given subcluster. Given
N d-dimensional objects {oi} in a subcluster, the CF-vector is defined as
follows:

    CF = (N, LS, SS)

where N is the number of objects in the subcluster, LS is the linear sum of
all objects in the subcluster, LS = Σ_{i=1}^{N} oi, and SS is the square sum of
the objects, SS = Σ_{i=1}^{N} oi².
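
A small sketch of computing CF-vectors and of exploiting their additivity, which is what allows an entry at a higher level of the tree to summarize the subclusters below it (here SS is taken as a single scalar sum of squares; conventions vary):

import numpy as np

def cf_vector(objects):
    """Clustering feature (N, LS, SS) of a subcluster of d-dimensional objects."""
    objs = np.asarray(objects, dtype=float)
    return len(objs), objs.sum(axis=0), float((objs ** 2).sum())

def cf_merge(cf_a, cf_b):
    """CF-vectors are additive, so merging two subclusters just adds their entries."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf_a, cf_b
    return n1 + n2, ls1 + ls2, ss1 + ss2
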
A CF-tree is a balanced tree storing the clustering feature vectors with 2
parameters: branching factor and threshold. The branching factor B defines
the maximum number of children per nonleaf node. The threshold T defines
the maximum diameter of sub clusters stored at the leaf nodes. Each non-
leaf node contains at most B entries of the form [CFi , childi ], where i =
1,2, ... , B, childi is a pointer to its i-th child node, and CFi is the CF-vector
of the subcluster represented by this child. Thus, a non-leaf node represents a
cluster made up of all the subclusters represented by its entries. A leaf node
contains at most L entries, each of the form [CFi ] , where i = 1,2, ... , L.
Since leaf nodes form a doubly-linked list, each leaf node has two pointers
prev and next which are used for efficient scan. Additionally, all entries in
a leaf node must satisfy a threshold requirement with respect to a threshold
value T. So, CF-tree uses sum of CF-vectors to build higher levels of the
CF-tree.
The CF-tree is built dynamically as objects are inserted. It is used to
guide a new insertion into a correct subcluster for clustering purposes just
the same as a B+-tree is used to guide a new insertion into the correct
position for sorting purposes. The algorithm for inserting a new object oi (an
entry) into a CF-tree is the following:
1. identifying the appropriate leaf: find the closest leaf entry (subcluster) ac-
cording to a chosen distance metric: centroid Euclidean distance, centroid
Manhattan distance, average inter-cluster distance, average intra-cluster
distance, or variance increase distance,
2. modifying the leaf: if an object oi fits into the closest leaf node (say E)
and fulfills the threshold requirement, then insert oi into E and update the
CF-vector for E. If not, a new entry (subcluster) is added to the leaf node.
If there is no space for the new entry (the diameter of the subcluster after
insertion > T), split the leaf node(s). Node splitting is done by choosing
the farthest pair of entries as seeds, and redistributing the remaining
entries based on the closest criterion.
3. modifying the path to the leaf node: after inserting a new entry into a leaf
node, update the CF entries for each non-leaf node on the path from the
root node to the leaf node. In the absence of a split, this simply involves
updating CF-vectors at non-leaf nodes, otherwise, split non-leaf nodes if
necessary,
4. merging refinement: if the CF-tree is too large, condense the tree by
merging the closest leaf nodes.

The general overview of the BIRCH algorithm is shown in Figure 4.4.

Fig. 4.4. BIRCH: the algorithm

The algorithm consists of 4 phases.

Phase 1 Scan all data and build an initial in-memory CF-tree using the
given amount of memory and recycling space on disk.
Phase 2 (optional) Scan the leaf entries in the initial CF-tree to rebuild a
smaller CF tree, while removing outliers and grouping crowded clusters
into larger ones.
Phase 3 Cluster all leaf entries by applying an existing (hierarchical ag-
glomerative) clustering algorithm directly to the subclusters represented
by their CF-vectors.
Phase 4 (optional) Pass over the data to correct inaccuracies and refine
clusters further.

Phases 1-2 produce a condensed representation of the set of objects
(dataset) as fine as possible under the memory limit. Phases 3-4 apply a
clustering algorithm to the leaf nodes of the CF-tree. The computational
complexity of the algorithm is O(n²), where n is the number of objects to
be clustered. BIRCH has several advantages. An important contribution of
BIRCH is the formulation of the clustering problem in a way that is appropri-
ate for very large databases, by making the time and memory constraints
explicit. In addition, BIRCH is a local (as opposed to global) clustering
method in that each clustering decision is made without scanning all objects
or all currently existing clusters. The algorithm exploits the observation that
the data space is usually not uniformly occupied, and hence not every ob-
ject is equally important for clustering purposes. A dense region of objects is
treated collectively as a single cluster. Objects in sparse regions are treated as
outliers and removed optionally. BIRCH makes full use of available memory
to derive the finest possible subclusters while minimizing I/O costs. Finally,
BIRCH is an incremental method that does not require the whole dataset
in advance, and only scans the dataset once.

CURE algorithm. Clustering algorithms that use one point (a center or med-
oid) to represent a cluster usually suffer from two serious disadvantages,
namely, they work well only on objects having convex clusters of similar
sizes, and they are very sensitive to outliers. CURE (Cluster-
ing Using REpresentatives) [GRS98] overcomes these disadvantages by using
multiple points, called representative points to represent a cluster. Having
more than one representative point per cluster allows CURE to adjust well
to arbitrary shaped clusters.
The general overview of the CURE algorithm is shown in Figure 4.5.

Data → Draw random sample → Partition sample → Partially cluster partitions
→ Eliminate outliers → Cluster partial clusters → Label data on disk
Fig. 4.5. CURE: the algorithm

The algorithm consists of the following steps.


1. Draw a random sample of n objects from the database.
2. Partition the random sample into p partitions, each of size n/p.
3. Partially cluster each partition until the final number of clusters in each
partition is reduced to n/pq, for some constant q > 1.
4. Eliminate outliers by random sampling (outliers do not belong to any of
the clusters); eliminate clusters which are growing very slowly.
5. Cluster partial clusters to generate the final k clusters.
6. Each data object is assigned to the cluster containing the representative
point closest to it.
CURE is a hierarchical agglomerative clustering algorithm which com-
bines random sampling with partitioning. Random sampling is used by
CURE for two purposes: (1) to reduce the size of the input data to
CURE's clustering algorithm, and (2) to filter outliers. A random sample of
n objects is partitioned into p partitions, each of size n/p. Then, each partition is
partially clustered until the final number of clusters in each partition is re-
duced to n/pq, for some constant q > 1. Once n/pq clusters are generated
for each partition, then, CURE runs a second clustering pass on the partial
clusters (for all partitions) to obtain the final k clusters. The partitioning
scheme is employed to ensure that the selected input set of objects to the
clustering algorithm always fits into main-memory even though the random
sample itself may not fit into main-memory. The problem appears with the
second pass since the size of the input of the second pass is the size of the
random sample. By storing only the representative points for each cluster,
CURE reduces the input size for the second pass.
Since the input to CURE's clustering algorithm is a set of randomly
sampled objects from the original data set, the final k clusters involve only a
subset of the entire set of objects. For assigning the appropriate cluster labels
to the remaining objects, CURE employs a fraction of randomly selected
representative points for each of the final k clusters. Each object is assigned
to the cluster containing the representative point closest to the object.
CURE is robust to outliers. It eliminates outliers in multiple steps. First,
random sampling filters out a majority of outliers. Then, during step 4, clus-
ters which are growing very slowly are identified and eliminated as outliers.
Finally, outliers are eliminated at the end of the clustering process.

4.3 Other Clustering Methods


A number of clustering methods and algorithms have been proposed in
the literature: partitioning methods, hierarchical methods, density-based
methods, grid-based methods, model-based methods, search-based methods,
evolutionary-based methods, etc. The presented list of methods does not
cover all methods proposed in the literature. Some clustering algorithms com-
bine ideas of different clustering methods, so it is difficult to classify them as
belonging to only one clustering method. Some of these methods, like partitioning
and hierarchical methods, are used in commercial data mining tools; others
are less popular but offer some interesting features from
the point of view of particular applications. In this section we will briefly
describe some of these methods.

Density-based clustering methods. Density-based clustering methods
have been mainly developed to discover clusters with arbitrary shape in spa-
tial databases. The clustering process in these methods is based on the notion
of density. The density-based methods regard clusters as dense regions of ob-
jects in the data space that are separated by regions of low density. The
basic idea of these methods is to grow the given cluster as long as the density
in the "neighborhood" of the cluster exceeds some threshold value. Density-
based methods have several interesting properties: they are able to discover
clusters of arbitrary shape, they handle outliers, and usually need only one
scan over data set. A well-known example of a density-based method is the
DBSCAN algorithm [EKS+96]. DBSCAN defines clusters as maximal
density-connected sets of objects. The algorithm requires the user to specify two
parameters to define the minimum density: ε, the maximum radius of the neighbor-
hood, and minpts, the minimum number of objects in an ε-neighborhood of that
object. If the ε-neighborhood of an object contains at least minpts objects,
then the object is called a core object. To determine clusters, DBSCAN uses
two concepts: density reachability and density connectivity. An object oj is
directly density reachable from an object oi with respect to ε and minpts if:
(1) oj belongs to the ε-neighborhood of oi, and (2) the ε-neighborhood of
oi contains more than minpts objects (oi is a core object). Density reach-
ability is the transitive closure of direct density reachability. An object oj
is density connected to an object oi with respect to ε and minpts if there
is an object ok such that both objects oj and oi are density reachable from
ok with respect to ε and minpts. The following steps outline the algorithm:
(1) start from an arbitrary object o, (2) if the ε-neighborhood of o satisfies the min-
imum density condition, a cluster is formed and the objects belonging to
the ε-neighborhood of o are added to the cluster, otherwise, if o is not a
core object, DBSCAN selects the next object, (3) continue the process until
all objects have been processed. A density-based cluster is a set of density
connected objects that is minimal with respect to the density reachability
relationship. Every object not contained in any cluster is considered to be an
outlier. To determine the ε-neighborhood of a given object, DBSCAN uses index
structures, like the R-tree or its variants, or nearest-neighbor search.
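
A compact sketch of the outlined algorithm; a plain linear scan plays the role of the region query, whereas a real implementation would use an index structure, and -1 marks outliers:

def dbscan(objects, eps, minpts, dist):
    """Grow density-connected clusters from core objects; unassigned objects are reported as outliers."""
    UNVISITED, OUTLIER = None, -1
    labels = [UNVISITED] * len(objects)

    def neighborhood(i):                       # eps-neighborhood of object i (includes i itself)
        return [j for j in range(len(objects)) if dist(objects[i], objects[j]) <= eps]

    cluster = 0
    for i in range(len(objects)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighborhood(i)
        if len(seeds) < minpts:                # i is not a core object
            labels[i] = OUTLIER
            continue
        labels[i] = cluster                    # start a new cluster from the core object i
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == OUTLIER:           # border object: density reachable but not core
                labels[j] = cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_seeds = neighborhood(j)
            if len(j_seeds) >= minpts:         # j is also a core object: keep expanding
                queue.extend(j_seeds)
        cluster += 1
    return labels
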
Other interesting examples of density-based algorithms are DBCLASD
[XEK98], OPTICS [ABK+99] (extensions of DBSCAN), and DENCLUE
[HK98].

Grid-based clustering methods. Grid-based methods quantize the object
space into a finite number of cells that form a grid structure on which all of
the operations for clustering are performed. The clustering process in these
methods is also based on the notion of density. Grid-based methods are very
efficient since the clustering process is usually independent of the number of
objects. Moreover, the grid structure facilitates parallel processing and incre-
mental updating of clusters. Well-known examples of grid-based methods
are the STING, Hierarchical Grid Clustering, and WaveCluster algorithms
[Sch96,SCZ98,WYM97].
STING (STatistical INformation Grid) [WYM97] uses a quadtree-like
structure of rectangular cells for condensing the data into grid cells. There are
several levels of cells corresponding to different levels of resolution - cells form
a hierarchical structure, in which a cell at a higher level is partitioned into a
number of cells at the lower level. The nodes of the quadtree contain statistical
information about objects in the corresponding cells, which is used to answer
queries. This statistical information is used in a top-down manner. The pro-
cess begins by determining a level of the hierarchical structure from which the
query answering process starts, determines relevant and irrelevant cells, and
moves to the next levels, in which only relevant cells are processed. This pro-
cess is repeated until the bottom level is reached. STING determines clusters
as the density connected components of the grid data structure. Hierarchical
Grid Clustering algorithm [Sch96] organizes the object space as a multidi-
mensional grid data structure. For each block of the grid structure (a block is
a d-dimensional rectangle), the algorithm calculates the density index and sorts
the blocks by their density indices. Then, the algorithm scans the blocks iteratively and
merges blocks which are adjacent over a (d-1)-dimensional hyperplane. The
order of the merges forms a hierarchy of clusters. WaveCluster [SCZ98] is a
multiresolution clustering algorithm which integrates grid-based and density-
based approaches. First, the algorithm partitions the object space by a multi-
dimensional grid structure and assigns objects to the grid cells. Each grid cell
summarizes information about a group of objects assigned to the cell. Due
to this quantization it reduces the number of objects to be processed. Then,
the algorithm applies wavelet transformation to the reduced feature space
and finds the connected components as clusters in the transformed feature
space at different levels. Finally, the algorithm assigns labels to the cells and
maps the objects to the clusters. The algorithm has linear complexity,
identifies clusters at different levels using multiresolution, and is robust to
outliers. However, the algorithm is applicable only to low-dimensional object
spaces. Another interesting example of a clustering algorithm that combines
both grid-based and density-based approaches is CLIQUE [AGG+98].

4.4 Clustering Categorical Attributes


Clustering algorithms presented in previous sections focused on numeri-
cal attributes which have a natural ordering of attribute values and for
which distance functions can be naturally defined. However, many data
sets consist of objects described by categorical attributes on which dis-
tance functions are not naturally defined. As an example, consider a
data set describing car dealers. Given two objects A and B and the
categorical attribute Car_name, which takes values from the domain
{Toyota, Nissan, Ford, Honda, ...}, objects A and B are either equal
on the attribute Car_name, Car_nameA = Car_nameB, or they have dif-
ferent values on Car_name, Car_nameA ≠ Car_nameB. It is hard to reason
about the distance between Toyota and Ford, or Ford and Honda, in a way
similar to numeric values, it is even difficult to say that one name of a car is
"like" or "unlike" another name.
Traditional clustering algorithms are, in general, not appropriate for
clustering data sets with categorical attributes [GRS99b]. Therefore, new
concepts and methods were developed for clustering categorical attributes
[Hua98,GGR99,GKR98,GRB99,GRS99b,HKK+98]. In the following subsec-
tion we briefly present one of the proposed methods, called ROCK, to illus-
trate basic concepts and ideas developed for clustering categorical attributes.

ROCK. ROCK is an adaptation of an agglomerative hierarchical clustering
algorithm for categorical attributes [GRS99b]. The algorithm is based on
new concepts of links and neighbors that are used to evaluate the similarity
between a pair of objects described by a set of categorical attributes.
Given a normalized similarity function sim(oi, oj) that captures the close-
ness between two objects oi, oj, objects oi and oj are said to be neighbors if the
similarity between them is greater than or equal to a certain threshold, sim(oi, oj) ≥ θ.
The threshold θ is a user-specified parameter. If θ = 1, two objects are neigh-
bors only if they are identical; if θ = 0, then any two objects in the data set are
neighbors.
To define a similarity between two objects (or two clusters) described by
categorical attributes, the Jaccard coefficient is often used as the similarity
measure [HK00]. However, clustering objects based only on the similarity between
them is not strong enough to distinguish two "not well separated clusters"
since it is possible for objects in different clusters to be neighbors. However,
even if a pair of objects in different clusters are neighbors, it is very unlikely
that the objects have a large number of common neighbors. This observation
motivates the definition of links that builds on the notion of closeness between
objects to determine more effectively when close objects belong to the same
cluster. The number of links between two objects oi, oj, denoted link(oi, oj),
is defined as the number of common neighbors they have in the data set.
The link-based approach used in the ROCK algorithm adopts a global approach
to the clustering problem. It captures the global knowledge of neighboring
objects into the relationship between individual pairs of objects.
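
A small sketch of these two notions, with objects represented as sets of (attribute, value) pairs and the Jaccard coefficient as the similarity function (names are illustrative):

from itertools import combinations

def jaccard(oa, ob):
    """Similarity of two objects given as sets of (attribute, value) pairs."""
    oa, ob = set(oa), set(ob)
    return len(oa & ob) / len(oa | ob) if oa | ob else 0.0

def link_counts(objects, theta):
    """link(oi, oj): number of common neighbors, where ox and oy are neighbors if sim(ox, oy) >= theta."""
    n = len(objects)
    neighbor = [[jaccard(objects[i], objects[j]) >= theta for j in range(n)] for i in range(n)]
    return {(i, j): sum(neighbor[i][m] and neighbor[j][m] for m in range(n))
            for i, j in combinations(range(n), 2)}
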
The objective function used by the ROCK algorithm to estimate the "good-
ness" of clusters is defined in terms of the links between objects:

    E = Σ_{i=1}^{k} ni × Σ_{oq,or∈Ci} link(oq, or) / ni^(1+2f(θ))

where Ci denotes cluster i of size ni, k denotes the required number of clus-
ters, and f(θ) denotes a function that is dependent on the data set as well
as the kind of clusters. The function has the following property: each object
belonging to cluster Ci has approximately ni^f(θ) neighbors in Ci. The best
clusters are the ones that maximize the value of the objective function.
Data → Draw random sample → Cluster with links → Label data on disk

Fig. 4.6. ROCK: the algorithm

The general overview of the ROCK algorithm is shown in Figure 4.6.


The algorithm accepts as input the set of randomly sampled objects from
the original data set and the number of desired clusters k. Initially, each object
is a separate cluster. Then, iteratively, the two closest clusters are merged until
only k clusters remain. To determine the best pair of clusters to merge, ROCK
uses the following goodness measure. Let link[Ci, Cj] store the number of
cross links between clusters Ci and Cj, that is, Σ_{oq∈Ci, or∈Cj} link(oq, or).
The goodness measure gm(Ci, Cj) for merging clusters Ci, Cj is defined as
follows:

    gm(Ci, Cj) = link[Ci, Cj] / ( (ni + nj)^(1+2f(θ)) - ni^(1+2f(θ)) - nj^(1+2f(θ)) )

where Ci, Cj denote clusters i and j of size ni, nj, respectively. The pair of
clusters for which the goodness measure gm is maximum is the best (closest)
pair of clusters to be merged at any given step.
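Continuing the sketch above (and reusing its neighbor_sets and link functions
together with the combinations import), the goodness measure and the choice
of the best pair to merge might be coded as follows; the concrete value of f(θ)
is left as a parameter, although [GRS99b] suggests f(θ) = (1 − θ)/(1 + θ) for
market basket data.

    def cross_links(nbrs, cluster_a, cluster_b):
        """link[C_i, C_j]: total number of links between objects of the two clusters."""
        return sum(link(nbrs, p, q) for p in cluster_a for q in cluster_b)

    def goodness(nbrs, cluster_a, cluster_b, f_theta):
        """gm(C_i, C_j): cross links normalized by the expected number of cross links."""
        na, nb = len(cluster_a), len(cluster_b)
        expected = ((na + nb) ** (1 + 2 * f_theta)
                    - na ** (1 + 2 * f_theta)
                    - nb ** (1 + 2 * f_theta))
        return cross_links(nbrs, cluster_a, cluster_b) / expected

    def best_pair_to_merge(nbrs, clusters, f_theta):
        """Indices (i, j) of the pair of clusters with the maximal goodness measure."""
        pairs = combinations(range(len(clusters)), 2)
        return max(pairs,
                   key=lambda ij: goodness(nbrs, clusters[ij[0]], clusters[ij[1]], f_theta))

ROCK's main loop then repeatedly calls best_pair_to_merge on the current
set of clusters, merges the selected pair, and stops when only k clusters remain.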
Since the input to ROCK's clustering algorithm is a set of objects randomly
sampled from the original data set, an additional labeling phase is necessary
to assign the appropriate cluster labels to the remaining objects residing on
disk.

Other algorithms for clustering categorical attributes. As we men-
tioned before, a number of algorithms have recently been proposed for clus-
tering categorical attributes. The algorithms proposed in [GRB99,Hua98] are
variants of the K-means algorithm adapted to clustering categorical at-
tributes. The STIRR algorithm is an iterative clustering algorithm based
on non-linear dynamical systems [GKR98]. The algorithm defines a similar-
ity between objects based on co-occurrences of values in the same attribute.
The dynamical system represents each distinct value of a categorical attribute
as a weighted vertex in a graph. Multiple copies of the set of vertices, called
basins, are maintained, and the weights of a given vertex may differ across
basins. Starting with a set of weights on all vertices, the dynamical system
propagates the weights iteratively until a fixed point is reached. When the
fixed point is reached, the weights in one or more of the basins isolate two
groups of attribute values on each categorical attribute: the first with large
positive weights and the second with small negative weights. These groups
of attribute values correspond to projections of clusters on the given cate-
gorical attribute. However, the algorithm requires non-trivial post-processing
to identify such sets of related attribute values and to determine the produced
clusters.
The algorithm proposed in [HKK+98] uses a weighted hypergraph struc-
ture to find clusters in a data set. To construct the hypergraph, frequent item-
sets (used to generate association rules) are used. Each frequent itemset is a
hyperedge in the hypergraph, and the weight of the hyperedge is computed
as the average of the confidences of all association rules that can be generated
from the itemset. Then, a hypergraph partitioning algorithm (e.g. hMETIS
[KAK+97]) is used to partition the items such that the sum of the weights of
the hyperedges cut by the partitioning is minimized. The result is a
clustering of items (not objects), so the next step is the labeling of objects
with item clusters using a scoring metric. The authors also proposed a function,
called fitness, which is used to evaluate the "goodness" of a cluster.
Another interesting algorithm for clustering categorical attributes, called
CACTUS, has been proposed in [GGR99]. The basic idea behind CACTUS
is that summary information constructed from the data set is sufficient to
discover a set of "candidate" well-defined clusters, which can then be vali-
dated to determine the final set of clusters. The properties that the summary
information typically fits into main memory, and that it can be constructed
efficiently, typically in a single scan of the data set, result in significant per-
formance improvements. CACTUS consists of three phases: summarization,
clustering, and validation. In the summarization phase, the summary infor-
mation (inter-attribute as well as intra-attribute summaries) is constructed
from the data set. In the two-step clustering phase, the summary informa-
tion computed in the previous phase is used to discover a set of "candidate"
clusters. In the validation phase, the final set of clusters is computed from
the set of candidate clusters. The algorithm is scalable since it requires only
two scans of the data set.

4.5 Outlier Detection


There is no single, generally accepted, formal definition of an outlier. The
popular intuitive definition given by Hawkins states that "an outlier is an
observation that deviates so much from other observations as to arouse sus-
picion that it was generated by a different mechanism" [Haw80]. Usually,
an outlier is defined as an object that is considerably dissimilar from the
remaining set of objects according to some measure. An outlier may be con-
sidered an anomaly in the data set caused by measurement or human error,
or it may be considered the result of data variability. Depending on the
point of view, we distinguish two approaches to outliers. The first approach,
represented by classification and clustering algorithms, focuses on
detecting and eliminating outliers from the data set, or at least on minimiz-
ing the influence of outliers on the resulting model (e.g. a set of clusters, a
decision tree). The second approach, represented by outlier detection algorithms,
in contrast, considers outliers as objects that may be of particular interest to
users since they often contain useful information on abnormal behavior of the
system described by a set of objects. Indeed, for some applications, rare
or abnormal events or objects are much more interesting than the common
ones from a knowledge discovery standpoint. Sample applications include
credit card fraud detection, network intrusion detection, monitoring of
criminal activities in electronic commerce, or monitoring the tectonic activity
of the earth's crust [KNT00]. Outlier detection and analysis is an interesting
and important data mining task, referred to as outlier mining.
The algorithms for outlier detection can be classified, in general, into the
following two approaches [HK00]: (1) the statistical approach and (2) the
distance-based approach.
The concept of outliers has been studied quite extensively in computa-
tional statistics [BL94,Haw80]. The statistical approach to outlier detection
assumes that the objects in the data set are modeled using a stochastic distri-
bution, and objects are determined to be outliers with respect to the model
using a discordancy/outlier test. Over 100 discordancy tests have been de-
veloped for different circumstances, depending on: (1) the data distribution,
(2) whether or not the distribution parameters (e.g. mean and variance) are
known, (3) the number of expected outliers, and even (4) the types of ex-
pected outliers (e.g. upper or lower outliers in an ordered sample). How-
ever, almost all of the discordancy tests suffer from two serious problems.
First, most of the tests are univariate (i.e. they deal with a single attribute).
This restriction makes them unsuitable for multidimensional data sets. Second,
all of them are distribution-based, i.e. they require parameters of the data set,
such as the data distribution. In many cases, we do not know the data distribution.
Therefore, we have to perform extensive testing to find a multidimensional
distribution that fits the data.
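As a simple illustration of the distribution-based idea (this is a generic
three-sigma rule, not one of the specific discordancy tests surveyed in [BL94]),
the following Python sketch flags values that lie far from the mean under the
assumption of an approximately normal, univariate distribution; the threshold
of three standard deviations is an arbitrary choice made for the example.

    import statistics

    def three_sigma_outliers(values, z_threshold=3.0):
        """Flag values whose distance from the mean exceeds z_threshold standard
        deviations, assuming the data roughly follow a normal distribution."""
        mean = statistics.mean(values)
        stdev = statistics.stdev(values)
        return [v for v in values if abs(v - mean) > z_threshold * stdev]

Such a test is univariate and presupposes knowledge of the distribution, which
is exactly the pair of limitations discussed above.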
The distance-based approach defines outliers by using the distances of the
objects from one another. For example, the definition by Knorr and Ng
[KN98,KNT00] defines an outlier in the following way: an object o in a data
set D is a distance-based (DB) outlier with respect to the parameters k and
d, that is, a DB(k, d) outlier, if no more than k objects in the data set are at
a distance d or less from o. According to this definition, DB outliers are those
objects that do not have "enough" neighbors, where neighbors are defined
in terms of the distance from the given object. As pointed out in [RRS00],
this measure is very sensitive to the parameter d, which is hard to
determine a priori. Moreover, when the dimensionality increases, it becomes
increasingly difficult to choose d, since most of the objects are likely to lie in
a thin shell about any other object. This means that it is necessary to define
d very accurately in order to find a reasonable number of outliers. To over-
come the problem of the distance parameter d, Ramaswamy, Rastogi, and
Shim [RRS00] introduce another definition of an outlier. Let Dk(o) denote
the distance of the k-th nearest neighbor of the object o in a data set D.
Then, according to [RRS00], outliers are defined as follows: given k and
n, an object o in a data set D is an outlier if there are no more than n - 1
other objects o' such that Dk(o') > Dk(o). Intuitively, Dk(o) is a measure
of how much of an outlier the object o is. So, according to the above definition,
the top n objects with the maximum Dk values are considered outliers. The
benefit of the distance-based approach is that it does not require any a priori
knowledge of data distributions, which the statistical approach does. Moreover,
both definitions are general enough to model statistical discordancy tests for
normal, Poisson, and other distributions. The authors proposed a number
of efficient outlier detection algorithms for finding distance-based outliers.
These algorithms are based on nested loops, grids, or multidimensional index
structures [KN98,RRS00]. However, the proposed algorithms do not scale well
for high-dimensional data sets. An interesting technique for finding outliers
based on the average density in the neighborhood of an object has been proposed
in [BKN+00]. Most outlier detection algorithms consider being an out-
lier a binary property. However, as the authors demonstrate in [BKN+00], in
many situations it is meaningful to consider being an outlier as the degree to
which an object is isolated from its surrounding neighborhood. They intro-
duce the notion of the local outlier factor, which captures this relative degree
of isolation. In order to compute the outlier factor of an object o, the method
in [BKN+00] computes its local reachability density based on the average
smoothed distances to objects in the locality of o. The authors show that the
proposed method is efficient for data sets where the nearest neighbor queries
are supported by index structures. Recently, a new interesting technique for
outlier detection has been proposed in [AY01]. This technique is especially
well suited for high-dimensional data sets. The technique finds the outliers
by studying the behavior of lower-dimensional projections of the data set.
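To make the two distance-based definitions concrete, the naive Python sketch
below computes the k-th nearest neighbor distance Dk(o) by brute force and
applies both the DB(k, d) test of [KN98] and the top-n ranking of [RRS00];
the quadratic nested loops and the Euclidean distance are simplifications for
illustration, whereas the cited algorithms rely on grids and index structures
to scale to large data sets.

    import math

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_distance(data, o, k):
        """Dk(o): distance from object o to its k-th nearest neighbor (assumes len(data) > k)."""
        dists = sorted(euclidean(o, p) for p in data if p is not o)
        return dists[k - 1]

    def db_outliers(data, k, d):
        """DB(k, d) outliers: objects with at most k other objects within distance d."""
        return [o for o in data
                if sum(1 for p in data if p is not o and euclidean(o, p) <= d) <= k]

    def top_n_outliers(data, k, n):
        """Top-n outliers in the sense of [RRS00]: the n objects with the largest Dk values."""
        return sorted(data, key=lambda o: knn_distance(data, o, k), reverse=True)[:n]

For example, top_n_outliers(points, k=5, n=10) returns the ten objects whose
fifth nearest neighbor is farthest away.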

There are some studies in the literature that focus on identifying deviations
in large multidimensional data sets [CSD98,JKM99,SAM98]. The proposed
techniques are significantly different from those of outlier detection, but the
idea behind them is very similar: to identify objects (or data values) that are
"intuitively surprising". Sarawagi, Agrawal, and Megiddo developed a devia-
tion detection technique to find deviations in OLAP data cubes [SAM98].
The authors define a deviation as a data value that is significantly different
from the expected value computed from a statistical model. The technique is
a form of discovery-driven exploration where some precomputed measures
indicating data deviations are used to guide the user in data analysis. The
user navigates through the data cube, visually identifying interesting cells
that are flagged as deviations, and can drill down further to analyze lower
levels of the cube; thus, deviations can be detected at various levels of data
aggregation. The deviation detection process is overlapped with cube com-
putation to increase efficiency. This interactive technique involves the user
in the discovery process, which may be difficult since the search space is
typically very large, particularly when there are many
dimensions of analysis. The work of Chakrabarti, Sarawagi, and Dom deals with
the problem of finding surprising temporal patterns in market basket data
[CSD98], while Jagadish, Koudas, and Muthukrishnan propose an efficient
method for finding data deviations in time-series databases [JKM99].

5 Conclusions

In this chapter, we have described and discussed the fundamental data mining
methods. Since data mining is an area of very intensive research, there are
many related problems that still remain open. The most commonly discussed
data mining issues include interactive and iterative mining, data mining query
languages, the pattern interestingness problem, and visualization of data mining
results. The reason for perceiving data mining as an interactive and iterative
process is that it is difficult for users to know exactly what they want to have
discovered. Typically, users experiment with different constraints imposed on
a data mining algorithm, e.g. different minimum support values, to narrow
the resulting patterns to those which are interesting to them. Such an
iterative process would normally require rerunning the basic data mining
algorithm. However, if the user constraints change only slightly between iterations,
then the previous results of the data mining algorithm can possibly be used
to answer the new request. Similarly, the concept of a materialized
view should be considered here to provide for optimizations of frequent data
mining tasks. Another method to provide for efficient iterative data mining of
very large databases is to use appropriate sampling techniques for the fast
discovery of an initial set of patterns. After the user is satisfied with the
rough result based on the sample, the complete algorithm can be executed
to deliver the final and precise set of resulting patterns.
Data mining can be seen as advanced database querying, in which a user
describes a data mining problem by means of a declarative query language
and then the data mining system executes the query and delivers the results
back to the user. The declarative data mining query language should be
based on a relational query language (such as SQL), since it would be useful
to mine relational query results. The language should allow users to define
data mining tasks by facilitating the specification of the data sources, the
domain knowledge, the kinds of patterns to be mined and the constraints to
be imposed on the discovered patterns. Such a language should be integrated
with a database query language and optimized for efficient and flexible data
mining.
The fundamental goal of data mining algorithms is to discover interesting
patterns. Patterns which are interesting to one user need not be interesting to
another. Users should provide the data mining algorithms with specific
interestingness measures, and the algorithms should employ these measures
to optimize the mining process. Such interestingness measures include sta-
tistical factors, logical properties of patterns, containment of a number of
user-specified items, etc.
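As a small illustration, interestingness-based filtering of discovered association
rules could be performed in a post-processing step such as the Python sketch
below; the rule representation and the lift measure are assumptions made for
the example and are not prescribed by any particular system.

    from dataclasses import dataclass

    @dataclass
    class Rule:
        antecedent: frozenset
        consequent: frozenset
        support: float             # fraction of transactions containing both sides
        confidence: float          # rule support divided by the support of the antecedent
        consequent_support: float  # fraction of transactions containing the consequent

    def lift(rule):
        """One possible statistical interestingness measure: confidence relative to the
        base rate of the consequent (values above 1 indicate a positive correlation)."""
        return rule.confidence / rule.consequent_support

    def select_interesting(rules, measure=lift, threshold=1.5):
        """Keep only the rules whose user-chosen interestingness measure exceeds the threshold."""
        return [r for r in rules if measure(r) > threshold]

Pushing such measures into the mining algorithm itself, rather than applying
them only afterwards, is what allows the search space to be pruned early.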
In order to provide humans with a better understanding of the discovered
knowledge, it should be expressed in high-level, possibly visual, languages or
other expressive forms. Visual representation of the discovered knowledge is
crucial if the data mining system is to be interactive. The representation
techniques include trees, tables, rules, charts, matrices, etc.
During the last ten years, many general-purpose commercial data mining systems
were developed. The most commonly known are IBM Intelligent Miner, SAS
Enterprise Miner, SGI MineSet, ISL Clementine, and DBMiner. Some of them
specialize in only one data mining task, while others provide a broad spectrum
of data mining functions. They also differ in the data types processed, DBMS
cooperation, visualization tools, and query languages implemented.

References

[ABK+99] Ankerst, M., Breunig, M., Kriegel, H-P., Sander, J., OPTICS: order-
ing points to identify the clustering structure, Proc. ACM SIGMOD
Conference on Management of Data, 1999, 49-60.
[AGG+98] Agrawal, R, Gehrke, J., Gunopulos, D., Raghavan, P., Automatic
subspace clustering of high dimensional data for data mining appli-
cations, Proc. ACM SIGMOD Conference on Management of Data,
1998,94-105.
[Aha92] Aha, D., Tolerating noisy, irrelevant, and novel attributes in instance-
based learning algorithms, International Journal of Man-Machine
Studies 36(2), 1992, 267-287.
[AIS93] Agrawal, R, Imielinski, T., Swami, A., Mining association rules be-
tween sets of items in large databases, Proc. ACM SIGMOD Confer-
ence on Management of Data, 1993, 207-216.
[And73] Anderberg, M.R, Cluster analysis for applications, Academic Press,
New York, 1973.
[AP94] Aamodt, A., Plaza, E., Case-based reasoning: foundational issues,
methodological variations, and system approaches, AI Communica-
tions 7, 1994, 39-52.
[ARS98] Alsabti, K., Ranka, S., Singh, V., CLOUDS: a decision tree classifier
for large datasets, Proc. 4th International Conference on Knowledge
Discovery and Data Mining (KDD'1998), 1998, 2-8.
[AS94] Agrawal, R, Srikant, R, Fast algorithms for mining association
rules, Proc. 20th International Conference on Very Large Data Bases
(VLDB'94), 1994, 478-499.
[AS95] Agrawal, R., Srikant, R., Mining sequential patterns, Proc. 11th In-
ternational Conference on Data Engineering, 1995, 3-14.
[AS96] Agrawal, R., Shafer, J.C., Parallel mining of association rules, IEEE
Transactions on Knowledge and Data Engineering, vol. 8, No.6, 1996,
962-969.
[AY01] Aggarwal, C.C., Yu, P.S., Outlier detection in high dimensional data,
Proc. ACM SIGMOD Conference on Management of Data, 2001, 37-
46.
[BFO+84] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., Classification
and regression trees, Wadsworth, Belmont, 1984.
[Bis95] Bishop, C., Neural networks for pattern recognition, Oxford Univer-
sity Press, New York, NY, 1995.
[BKN+00] Breunig, M.M., Kriegel, H-P., Ng, R.T., Sander, J., LOF: identify-
ing density-based local outliers, Proc. ACM SIGMOD Conference on
Management of Data, 2000, 93-104.
[BKS+90] Beckmann, N., Kriegel, H-P., Schneider, R, Seeger, B., The R*-tree:
an efficient and robust access method for points and rectangles, Proc.
ACM SIGMOD Conference on Management of Data, 1990, 322-331.
[BL94] Barnett, V., Lewis, T., Outliers in statistical data, John Wiley, 1994.
[BMU+97] Brin, S., Motwani, R, Ullman, J.D., Tsur, S., Dynamic itemset count-
ing and implication rules for market basket data, Proc. ACM SIG-
MOD Conference on Management of Data, 1997, 255-264.
[BWJ+98] Bettini, C., Wang, X.s., Jajodia, S., Lin, J., Discovering frequent
event patterns with multiple granularities in time sequences, IEEE
Transactions on Knowledge and Data Engineering, vol. 10, No.2,
1998, 222-237.
[CHN+96] Cheung, D.W., Han, J., Ng, V., Wong, C.Y., Maintenance of discov-
ered association rules in large databases: an incremental updating
technique, Proc. 12th International Conference on Data Engineering,
1996, 106-114.
[CHY96] Chen, M.S., Han, J., Yu, P.S., Data mining: an overview from a data-
base perspective, IEEE Trans. Knowledge and Data Engineering 8,
1996, 866-883.
[CPS98] Cios, K., Pedrycz, W., Swiniarski, R., Data mining methods for knowl-
edge discovery, Kluwer Academic Publishers, 1998.
[CS96] Cheeseman, P., Stutz, J., Bayesian classification (autoclass): theory
and results, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R Uthu-
rusamy (eds.), Advances in Knowledge Discovery and Data Mining,
MIT Press, 1996, 153-180.
[CSD98] Chakrabarti, S., Sarawagi, S., Dom, B., Mining surprising patterns
using temporal description length, Proc. 24th Conference on Very
Large Data Bases (VLDB'98), 1998, 606-617.
[DH73] Duda, R.O., Hart, P.E., Pattern classification and scene analysis,
John Wiley, New York, 1973.
[EKS+96] Ester, M., Kriegel, H-P., Sander, J., Xu, X., A density-based algo-
rithm for discovering clusters in large spatial database with noise,
Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining
(KDD'96), 1996, 226-231.
[FMM+96] Fukuda, T., Morimoto, Y., Morishita, S., Tokuyama, T., Construct-
ing efficient decision trees by using optimized association rules, Proc.
22nd Conference on Very Large Data Bases (VLDB'96), 1996, 146-
155.
[FPM91] Frawley, W.J., Piatetsky-Shapiro, G., Matheus, C.J., Knowledge dis-
covery in databases: an overview, G. Piatetsky-Shapiro, W. Frawley
(eds.), Knowledge Discovery in Databases, AAAI/MIT Press, Cam-
bridge, MA, 1991, 1-27.
[FPS+96] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R,
Advances in knowledge discovery and data mining, MIT Press, 1996.
[Fuk90] Fukunaga, K., Introduction to statistical pattern recognition, Academic
Press, San Diego, CA, 1990.
[GGR99] Ganti, V., Gehrke, J., Ramakrishnan, R, CACTUS - clustering cate-
gorical data using summaries, Proc. 5th International Conference on
Knowledge Discovery and Data Mining (KDD'99), 1999, 73-83.
[GGR+99] Gehrke, J., Ganti, V., Ramakrishnan, R., Loh, W.Y., BOAT - opti-
mistic decision tree construction, Proc. ACM SIGMOD Conference
on Management of Data, 1999, 169-180.
[GKR98] Gibson, D., Kleinberg, J., Raghavan, P., Clustering categorical data:
an approach based on dynamical systems, Proc. 24th International
Conference on Very Large Data Bases (VLDB'98), 1998, 311-323.
[Gol89] Goldberg, D.E., Genetic algorithms in search optimization and ma-
chine learning, Morgan Kaufmann Pub., 1989.
[GRB99] Gupta, S.K., Rao, K.S., Bhatnagar, V., K-means clustering algorithm
for categorical attributes, M. Mohania, A. Min Tjoa (eds.), Lecture
Notes in Computer Science 1676, Data Warehousing and Knowledge
Discovery, Springer-Verlag, Berlin, 1999,203-208.
[GRG00] Gehrke, J., Ramakrishnan, R., Ganti, V., RainForest - a framework
for fast decision tree classification of large datasets, Data Mining and
Knowledge Discovery, vol. 4, issue 2/3, 2000, 127-162.
[GRS98] Guha, S., Rastogi, R., Shim, K., Cure: an efficient clustering algo-
rithm for large databases, Proc. ACM SIGMOD Conference on Man-
agement of Data, Seattle, USA, 1998, 73-84.
[GRS99a] Garofalakis, M., Rastogi, R., Shim, K., Mining sequential patterns
with regular expression constraints, Proc. 25th International Confer-
ence on Very Large Data Bases (VLDB'99), 1999, 223-234.
[GRS99b] Guha, S., Rastogi, R, Shim, K., ROCK: a robust clustering algo-
rithm for categorical attributes, Proc. International Conference on
Data Engineering (ICDE'99), 1999, 512-521.
[GWS98] Guralnik, V., Wijesekera, D., Srivastava, J., Pattern directed mining
of sequence data, Proc. 4th International Conference on Knowledge
Discovery and Data Mining (KDD'98), 1998, 51-57.
[Hec96] Heckerman, D., Bayesian networks for knowledge discovery, U.M.
Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (eds.), Ad-
vances in Knowledge Discovery and Data Mining, MIT Press, 1996,
273-305.
[HF95] Han, J., Fu, Y., Discovery of multiple-level association rules from
large databases, Proc. 21st International Conference on Very Large
Data Bases (VLDB'95), 1995, 420-431.
[HK00] Han, J., Kamber, M., Data mining: concepts and techniques, Morgan
Kaufmann Pub., 2000.
[HK98] Hinneburg, A., Keim, D.A., An efficient approach to clustering in
large multimedia databases with noise, Proc. 4th International Con-
ference on Knowledge Discovery and Data Mining (KDD'98), 1998,
58-65.
[HKK+98] Han, E., Karypis, G., Kumar, V., Mobasher, B., Hypergraph based
clustering in high-dimensional data sets: a summary of results, Bul-
letin of the Technical Committee on Data Engineering, 21(1), 1998,
15-22.
[Haw80] Hawkins, D., Identification of outliers, Chapman and Hall, 1980.
[HPM+00] Han, J., Pei, J., Mortazavi-Asl, B., Chen, Q., Dayal, U., Hsu, M-
C., FreeSpan: frequent pattern-projected sequential pattern mining,
Proc. 6th International Conference on Knowledge Discovery and Data
Mining (KDD'00), 2000, 355-359.
[HPYOO] Han, J., Pei, J., Yin, Y., Mining frequent patterns without candi-
date generation, Proc. ACM SIGMOD Conference on Management
of Data, 2000, 1-12.
[HS93] Houtsma, M., Swami, A., Set-oriented mining of association rules,
Research Report RJ 9567, IBM Almaden Research Center, San Jose,
California, USA, 1993.
[Hua98] Huang, Z., Extensions to the K -means algorithm for clustering large
data sets with categorical values, Data Mining and Knowledge Dis-
covery 2, 1998, 283-304.
[IM96] Imielinski, T., Mannila, H., A database perspective on knowledge dis-
covery, Communications of the ACM 39, 1996, 58-64.
[Jam85] James, M., Classification algorithms, John Wiley, New York, 1985.
[JD88] Jain, A.K., Dubes, R.C., Algorithms for Clustering Data, Prentice
Hall, Englewood Cliffs, NJ, 1988.
[JKK99] Joshi, M., Karypis, G., Kumar, V., A universal formulation of se-
quential patterns, Technical Report 99-21, Department of Computer
Science, University of Minnesota, Minneapolis, 1999.
[JKM99] Jagadish, H.V., Koudas, N., Muthukrishnan, S., Mining deviants in
a time series database, Proc. 25th International Conference on Very
Large Data Bases (VLDB'99), 1999, 102-113.
[JMF99] Jain, A.K., Murty, M.N., Flynn, P.J., Data clustering: a survey, ACM
Computing Surveys 31, 1999, 264-323.
[KAK+97] Karypis, G., Aggarwal, R, Kumar, V., Shekhar, S., Multilevel hyper-
graph partitioning: application in VLSI domain, Proc. ACM/IEEE
Design Automation Conference, 1997, 526-529.
[KN98] Knorr, E.M., Ng, R.T., Algorithms for mining distance-based outliers
in large datasets, Proc. 24th International Conference on Very Large
Data Bases (VLDB'98), 1998, 392-403.
[KNT00] Knorr, E.M., Ng, R.T., Tucakov, V., Distance-based outliers: algo-
rithms and applications, VLDB Journal 8(3-4), 2000, 237-253.
[KNZ01] Knorr, E.M., Ng, R.T., Zamar, R.H., Robust space transformation
for distance-based operations, Proc. 8th International Conference on
Knowledge Discovery and Data Mining (KDD'2001), 2001, 126-135.
[Koh95] Kohavi, R., The power of decision tables, N. Lavrac, S. Wrobel (eds.),
Lecture Notes in Computer Science 912, Machine Learning: ECML-
95, 8th European Conference on Machine Learning, Springer Verlag,
Berlin, 1995, 174-189.
[Kol93] Kolodner, J.L., Case-based reasoning, Morgan Kaufmann, 1993.
[KR90] Kaufman, L., Rousseeuw, P.J., Finding groups in data: an introduc-
tion to cluster analysis, John Wiley & Sons, 1990.
[Lau95] Lauritzen, S.L., The EM algorithm for graphical association models
with missing data, Computational Statistics and Data Analysis 19,
1995, 191-201.
[LSL95] Lu, H., Setiono, R., Liu, H., Neurorule: a connectionist approach
to data mining, Proc. International Conference on Very Large Data
Bases (VLDB'95), 1995, 478-489.
[Mag94] Magidson, J., The CHAID approach to segmentation modeling: Chi-
squared automatic interaction detection, R.P. Bagozzi (ed.), Advanced
Methods of Marketing Research, Blackwell Business, Cambridge, MA,
1994, 118-159.
[MAR96] Mehta, M., Agrawal, R., Rissanen, J., SLIQ: a fast scalable classifier
for data mining, Proc. International Conference on Extending Data-
base Technology (EDBT'96), 1996, 18-32.
[Mcq67] Mcqueen, J., Some methods for classification and analysis of multi-
variate observations, Proc. 5th Berkeley Symposium on Mathematical
Statistics and Probability, 1967, 281-297.
[Mic92] Michalewicz, Z., Genetic algorithms + data structures = evolution
programs, Springer Verlag, 1992.
[Mit96] Mitchell, T.M., An introduction to genetic algorithms, MIT Press,
Cambridge, 1996.
[Mit97] Mitchell, T.M., Machine learning, McGraw-Hill, New York, 1997.
[MRA95] Mehta, M., Rissanen, J., Agrawal, R., MDL-based decision tree prun-
ing, Proc. 1st International Conference on Knowledge Discovery and
Data Mining (KDD'1995), 1995, 216-221.
[MST94] Michie, D., Spiegelhalter, D.J., Taylor, C.C., Machine learning, neural
and statistical classification, Ellis Horwood, 1994.
[MT96] Mannila, H., Toivonen, H., Discovering generalized episodes using
minimal occurrences, Proc. 2nd International Conference on Knowl-
edge Discovery and Data Mining (KDD'96), 1996, 146-151.
[MTV94] Mannila, H., Toivonen, H., Verkamo, A.I., Efficient algorithms for dis-
covering association rules, Proc. AAAI Workshop Knowledge Discov-
ery in Databases, 1994, 181-192.
[MTV95] Mannila, H., Toivonen, H., Verkamo, A.I., Discovering frequent
episodes in sequences, Proc. 1st International Conference on Knowl-
edge Discovery and Data Mining (KDD'95), 1995, 210-215.
[Mur98] Murthy, S.K., Automatic construction of decision trees from data: a
multi-disciplinary survey, Data Mining and Knowledge Discovery vol.
2, No. 4, 1998, 345-389.
[NH94] Ng, R., Han, J., Efficient and effective clustering methods for spatial
data mining, Proc. 20th International Conference on Very Large Data
Bases (VLDB'94), 1994, 144-155.
[Paw91] Pawlak, Z., Rough sets: theoretical aspects of reasoning about data,
Kluwer Academic Publishers, 1991.
[PF91] Piatetsky-Shapiro, G., Frawley, W.J., Knowledge discovery in
databases, AAAI/MIT Press, 1991.
[PFS96] Piatetsky-Shapiro, G., Fayyad, U.M., Smyth, P, From data mining
to knowledge discovery: an overview, U.M. Fayyad, G. Piatetsky-
Shapiro, P. Smyth, R. Uthurusamy (eds.), Advances in Knowledge
Discovery and Data Mining, AAAI/MIT Press, 1996, 1-35.
[PHM+00] Pei, J., Han, J., Mortazavi-Asl, B., Zhu, H., Mining access patterns ef-
ficiently from Web logs, Proc. 4th Pacific-Asia Conference on Knowl-
edge Discovery and Data Mining (PAKDD'00), 2000, 396-407.
[PHM+01] Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., Hsu
M-C., PrefixSpan: mining sequential patterns efficiently by prefix-
projected pattern growth, Proc. 17th International Conference on
Data Engineering (ICDE'01), 2001, 215-224.
[PZO+99] Parthasarathy, S., Zaki, M.J., Ogihara, M., Dwarkadas, S., Incremen-
tal and interactive sequence mining, Proc. 8th International Confer-
ence on Information and Knowledge Management, 1999, 251-258.
[QR89] Quinlan, J.R., Rivest, R.L., Inferring decision trees using the mini-
mum description length principle, Information and Computation 80,
1989, 227-248.
[Qui86] Quinlan, J.R., Induction of decision trees, Machine Learning, vol. 1,
No.1, 1986, 81-106.
[Qui93] Quinlan, J. R., C4.5: programs for machine learning, Morgan Kauf-
mann, 1993.
[RHW86] Rumelhart, D.E., Hinton, G.E., Williams, R.J., Learning internal rep-
resentation by error propagation, D.E. Rumelhart, J.L. McClelland
(eds.), Parallel Distributed Processing, MIT Press, 1986, 318-362.
[Rip96] Ripley, B., Pattern recognition and neural networks, Cambridge Uni-
versity Press, Cambridge, 1996.
[RRS00] Ramaswamy, S., Rastogi, R., Shim, K., Efficient algorithms for mining
outliers from large data sets, Proc. ACM SIGMOD Conference on
Management of Data, 2000, 427-438.
[RS98] Rastogi, R., Shim, K., PUBLIC: a decision tree classifier that inte-
grates building and pruning, Proc. 24th International Conference on
Very Large Data Bases (VLDB'98), 1998,404-415.
[SA95] Srikant, R., Agrawal, R., Mining generalized association rules, Proc.
21st International Conference on Very Large Data Bases (VLDB'95),
1995, 407-419.
[SA96a] Srikant, R., Agrawal, R., Mining quantitative association rules in
large relational tables, Proc. ACM SIGMOD Conference on Man-
agement of Data, 1996, 1-12.
[SA96b] Srikant, R., Agrawal, R., Mining sequential patterns: generalizations
and performance improvements, P.M.G. Apers, M. Bouzeghoub, G.
Gardarin (eds.) Lecture Notes in Computer Science 1057, Advances
in Database Technology - EDBT'96, 5th International Conference on
Extending Database Technology, 1996,3-17.
[SAM96] Shafer, J., Agrawal, R., Mehta, M., SPRINT: a scalable parallel clas-
sifier for data mining, Proc. International Conference on Very Large
Data Bases (VLDB'96), 1996, 544-555.
[SAM98] Sarawagi, S., Agrawal, R., Megiddo, N., Discovery-driven exploration
of OLAP data cubes, Proc. International Conference on Extending
Database Technology (EDBT'98), 1998, 168-182.
[Sch96] Schikuta, E., Grid clustering: an efficient hierarchical clustering
method for very large data sets, Proc. International Conference on
Pattern Recognition, 1996, 101-105.
[SCZ98] Sheikholeslami, G., Chatterjee, S., Zhang, A., WaveCluster: a multi-
resolution clustering approach for very large spatial databases, Proc.
24th International Conference on Very Large Data Bases (VLDB'98),
1998, 428-439.
[Shi99] Shih, Y.-S., Family of splitting criteria for classification trees, Statis-
tics and Computing 9, 1999, 309-315.
[SON95] Savasere, A., Omiecinski, E., Navathe, S., An efficient algorithm for
mining association rules in large databases, Proc. 21st International
Conference on Very Large Data Bases (VLDB'95), 1995, 432-444.
[SS96] Slowinski, R., Stefanowski, J., Rough-set reasoning about uncertain
data, Fundamenta Informaticae, vol. 27, No. 2-3, 1996, 229-244.
[Toiv96] Toivonen, H., Sampling large databases for association rules,
Proc. 22nd International Conference on Very Large Data Bases
(VLDB'96), 1996, 134-145.
[WFOO] Witten, I.H., Frank, E., Data mining: practical machine learning tools
and techniques with Java implementations, Morgan Kaufmann Pub.,
2000.
[WK91] Weiss, S.M., Kulikowski, C.A., Computer systems that learn: classi-
fication and prediction methods fmm statistics, neural nets, machine
learning, and expert systems, Morgan Kaufmann Pub., 1991.
[WT96] Wang, K., Tan, J., Incremental discovery of sequential patterns, The
ACM-SIGMOD's 96 Data Mining Workshop: on Research Issues on
Data Mining and Knowledge Discovery, 1996,95-102.
[WYM97] Wang, W., Yang, J., Muntz, R., Sting: a statistical information grid
approach to spatial data mining, Proc. 23rd International Conference
on Very Large Data Bases (VLDB'97), 1997, 186-195.
[XEK98] Xu, X., Ester, M., Kriegel, H-P., Sander, J., A distribution-based
clustering algorithm for mining in large spatial databases, Proc. 14th
International Conference on Data Engineering, 1998, 324-331.
[Zad65] Zadeh, L.A., Fuzzy sets, Information and Control 8, 1965, 338-353.
[Zak98] Zaki, M.J., Efficient enumeration of frequent sequences, Proc. 1998
ACM CIKM Int. Conf. on Information and Knowledge Management,
USA, 1998.
[Zia94] Ziarko, W., Rough sets, fuzzy sets and knowledge discovery, Springer
Verlag, 1994.
[ZRL96] Zhang, T., Ramakrishnan, R., Livny, M., BIRCH: An efficient data
clustering method for very large databases, Proc. ACM SIGMOD
Conference on Management of Data, 1996, 167-187.
Index

r,169 atomicity, 270


X, 169 attribute, 29, 88, 227, 517-540
- class label, 518-522
abstraction, 52 - class label attribute, 538, 540
access support relation, 139 - dependent, 517, 518
ACID principle, 270 - predictor, 517, 518, 521, 523
ACID properties, 301 - set-valued, 160
active database system, 234 -- indexing, 163
activities - - join on, 160
- group, 369 - splitting, 520-523, 525, 526
- optional, 369 - test, see splitting
- prescribed, 369 avg, 107
adaptability, 445 awareness, 370, 382
agent, 448
base node, 437
aggregate function, 107
Bayesian belief networks, see Bayesian
aggregation, 402
networks
algebra, 167
Bayesian networks, 534
algebraic operator, see operator
Bayou, 466
AMPS, 442
BCIS,469
anti-Join, 179
BIRCH algorithm, 547-549
application-aware adaptability, 446
- clustering feature tree (CF-tree), 547,
archiving, 407
548
artificial keys, 80
- clustering feature vector (CF-vector),
ASR,139
547
association mining algorithms
BLOB, 240, 298
- Apriori, 496
broadcast, 468
- CARMA, 510 - caching, 473
- Count Distribution, 511 - index, 471
- Data Distribution, 511 broadcast disks, 471
- DIC,51O bushy tree, 182
- Eclat, 504
- FP-tree, 500 C4.5 algorithm, 522
- FreeSpan, 516 cache, 458, 473
- FUP, 510 cache miss, 463
- GSP, 513 cache replacement, 473
- Partition, 509 caching, 462
- PrefixSpan, 516 CACTUS algorithm, 555
- SPADE,516 CART algorithm, 522
- SPIRlT, 515 categorization, 64
- WAP-mine, 516 cell, 437
association rules, 490 cellular digital packet data (CDPD),
- generalized, 507 441
- multiple-level, 507 CG-tree, 158
- quantitative, 506 CHAID algorithm, 522
circuit-switched wireless technologies, - overfitting problem, 528-532
439. - postpruning approach, 530
CLARA algorithm, 544 - predictive accuracy, 517
CLARANS algorithm, 544 - prepruning approach, 529
class, 237 - pruning phase, 529
class dependence, 71 - PUBLIC algorithm, 531
class diagrams, 61 - rough set algorithms, 518, 536
class hierarchy, 59 - split point, 527-528
class independence, 71 - split selection method, 521-528
classification, 67, 517-538 - splitting predicate, 527
- backpropagation, 534-535 - splitting subset, 527-528
- Bayesian classifier, 533-534 - SPRINT algorithm, 522, 526, 532
- Bayesian classifiers, 518 - stratification, 537
- bootstrapping, 538 - stratified k-fold cross-validation, 537
- C4.5 algorithm, 522 - test set, 536
- CART algorithm, 522 - training set, 536
- case-based reasoning, 536 classifier accuracy, 536-538
- CHAID algorithm, 522 clause
- class conditional independence - define, 113
assumption, 534 - from, 108
- classification rules, 532-533 - group by, 112
- classifier, 518-538 - select, 108
- classifier accuracy, 536-538 - where, 108
- decision table, 518 client-server, 135
- decision tree, 518-533 client-server computing, 447
- - building phase, 520 CLIQUE algorithm, 552
- - decision tree induction phase, see cluster, 353, 540-555
building phase cluster analysis, 415
- - growth phase, see building phase clustering, 129, 540-555
- - pruning phase, 520 - agglomerative approach
- entropy, 522 - - dendrogram, 545
- fuzzy set algorithms, 536 - categorical attributes, 552-555
- gain ratio, 524 - - CACTUS algorithm, 555
- genetic algorithms, 518, 536 - - ROCK algorithm , 553-554
- gini index, 522, 526-528 - - STIRR algorithm, 554
- holdout, 537 - CLARA algorithm, 544
- ID3 algorithm, 522 - CLARANS algorithm, 544
- impurity measure, 522, 539 - CLIQUE algorithm, 552
- impurity-based split selection - complexity, 129
method,521 - DBCLASD algorithm, 551
- information gain, 522-526 - DBSCAN algorithm, 551
- instance-based learning, 535 - DEN CLUE algorithm, 551
- k-fold cross-validation, 537 - density-based methods, 541, 550-551
- k-nearest neighbor classifiers, 518, - EM (Expectation Maximization)
535-536 algorithm, 543
- leave-one-out cross-validation, 538 - graph partitioning, 132
- neural network classifiers, 518, - grid-based methods, 541, 551-552
534-535 - hierarchical methods, 541, 544-550
- - agglomerative approach, 544-550, Content-Based Image Retrieval, 306
553 control flow, 376
- - average link algorithms, 546 conversion, 114
-- BIRCH algorithm, 547-549 CORBA,223
- - bottom-up approach, see agglomer- count, 107
ative approach cross product, 181
- - centroid link algorithms, 546 CURE algorithm, 549-550
- - complete link algorithms, 545, 546
- - CURE algorithm, 549-550 d-join, 168
- - divisive approach, 544 dangling reference, 127
- - single link algorithms, 545, 546 data
- - top-down, see divisive approach - semantics of, 50
- K-means algorithm, 541-543 data definition language, 20
- K-medoids algorithm, 541, 543-544 data distribution, 333
- model-based methods, 541 data hoarding, 455
- OPTICS algorithm, 551 data manipulation language, 20
- PAM algorithm, 543-544 data marts, 409
- partitioning methods, 541-544 data mining, 68, 73, 413
- STING algorithm, 551-552 data model, 20, 52
- hierarchical, 52, 56
- stochastic, 132
- network, 53, 56
- WaveCluster algorithm, 551-552
- object-oriented, 80-84
Coda, 458, 464
- relational, 55, 56
CODASYL DBTG, 22
- semantic, 57
Codd-complete language, 231
data modeling, 50, 51
code generation, 186
- conceptual, 50
coercion, 114 - semantic, 50
cognition, 63 data persistence, 33
cognitive economy, 64 data semantics, 51, 57
complex object, 236 data shipping, 136
composition, 60 data stream, 275
conceptual model, 57 data warehouse, 44, 223, 393
concurrency control, 465 data warehouse architecture, 44
confidence, 491 database, 19
conflict resolution, 460 - relational, 227
connection-oriented wireless technolo- database design, 19
gies, 439 database design process, 19, 227
connectionless wireless technologies, database management, 20, 222
439 database schema, 227
connectivity, 58 database system
consistency, 270 - federated, 38
consistency control, 475 - heterogeneous, 224
constructor, 95 - mediated, 42
- array, 107 - object-oriented, 223
- bag, 107 - object-relational, 224
- in queries, 107 database systems
- list, 107 - disconnected operation, 460
- set, 107 - weak connectivity, 465
content management, 372 Datacycle, 469
dataflow, 376 Exodus, 34, 134
Datalog, 232 expand operator, 168, 169
DB2,31 extensional relation, 232
DBCLASD algorithm, 551 extent, 88
DBSCAN algorithm, 551 - logical, 183
DCOM,223 - physical, 183
de-normalization, 403 - strict, 183
decision tree, 520
declarative query language, 230 factorization, 165, 166
declustering, 196 fault tolerance, 334
- DYOP, 202 Ficus, 465
- hash, 196 file systems, 21
- MAGIC, 199 - disconnected operation, 458
- range, 196 - weak connectivity, 462
- round-robin, 196 foreign key, 55
DECT,444 forward engineering, 228
deductive database, 232 frequent itemsets
define, 113 - anti-monotonicity, 493
define clause, 166 - closed, 494
DENCLUE algorithm, 551 - lattice, 493
dependency-join, 168 - maximal, 494
destructor, 95 from, 108
deviation detection, 557 function materialization, 149
dicing, 416 functional join, 120
difference operation, 230
dimensional business model, 397 gain ratio, 524
disconnected operation, 454 generalization, 59
Discrete Cosine Transformation (DCT), generalized materialization relation,
294 150
dissemination, 469 - completeness, 151
distinct, 108 - invalidation, 152
distinct type, 239 - reducing invalidation overhead, 153
Document Object Model, 249 - RRR, 153
DOM tree, 249 - storage representation, 152
domain, 29 - validity, 151
Donoho-Stahel estimator, 536 GEO systems, 443
DTD,247 gini index, 526-528
dual buffering, 138 GMR,150
durability, 270 granularity, 401
dynamic process creation, 369 greedy graph partitioning, 132
dynamic team creation, 369 group by, 112
dynamic workspace creation, 369 grouping, 112
DYOP declustering, 202 grouping operator, 169
- binary, 177
encapsulation, 33, 61, 236 - unary, 169
entity type, 57 groupware, 372
Entity-Relationship model, 227 GSM,442
entropy, 522-526
EOS, 135 hash partitioning, 196
having, 113 isolation, 270
heterogeneity, 41 isolation-only transactions, 461
heterogeneous configurations, 203 Itasca, 119
hierarchical data model, 27
hierarchical database systems, 25 Jaccard coefficient, 553
hoarding, 455 Java, 238
horizontal declustering, 196 join, 120
- DYOP,202 - functional, 120
- hash, 196 - on set set-valued attributes, 160
- MAGIC, 199 - pointer-based, 120
- range, 196 join index hierarchy, 145
- round-robin, 196
horizontal mining algorithms, 495 k-nearest neighbor classifiers, 535-536
horizontal partitioning, 196 key, 30,227
- DYOP,202 knowledge organization, 63
- hash, 196
- MAGIC, 199 large object management, 133
- range, 196 late binding, 101
- round-robin, 196 LEO systems, 443
HSM,302 Little Work project, 464
hybrid data delivery, 469 Lixto,234
logic-based query language, 231
ID3 algorithm, 522 loose reads, 465
IDEA methodology, 242 loosely coupled, 40
IDMS,22
Illustra, 117 MAGIC declustering, 199
image, 289 market basket analysis, 490
impedance mismatch, 82 materialization, 61
impurity measure, 539 materialize operator, 168, 169
IMS,26 max, 107
inclusion dependency, 227 MDL (Minimum Description Length)
index structures, 317 principle, 530--531
indexing, 471 media clients, 322
- function results, 149 media server, 322
- path expressions, 139 medoid,549
- set-valued attributes, 163 memory mapping, 127
indexing type hierarchies, 155 MEO systems, 443
inference, 64 meta algebra, 258
information gain, 522-526 meta-database, 407
information hiding, 237 Meta-SQL, 260
information model, 57 method, 250
Informix Dynamic Server, 240 min, 107
inheritance, 33, 96, 237 mobile agents, 452
- multiple, 98, 104 mobile computing, 431
insertion manager, 405 - adaptability, 445
integrator, 406 - architecture, 437
integrity constrains, 19 - challenges, 433
intensional relation, 232 - disconnected operation, 454
interoperability, 73 - infrastructure, 437
- models, 444 object-based language, 249
- resources, 476 object-orientation, 236
- software architectures, 444 object-oriented database systems, 33
- transaction management, 461, 466 object-oriented modeling, 85
- weak connectivity, 462 - advantages of, 84
models of concepts, 64 object-relational database, 239
- classical, 64 ObjectStore, 34, 117, 128
- exemplar, 65 observer, 95
- prototype, 64 Ode, 34
monitor, 45 ODL, 95
monitoring, 406 ODMG,106
MS Access, 31 ODMG standard, 85
multi-dimensional analysis, 416 OID,117
multi-tier architectures, 447 OLAP, 45, 415
multimedia data models, 304 on-line re-organization, 203
multimedia database system, 300 - migrate, 206
multimedia databases, 73 - monitor, 206
multimedia objects, 288 - predict, 206
multiple fact scheme, 399 online analytical processing, 45, 415
multiple star scheme, 400 ontology, 62
mutator, 95 operational data store, 410
operator
natural join operation, 230
- expand, 168
nested relation, 234
- grouping
network database, 23
- - binary, 177
network database systems, 22
-- unary, 169
network partition, 461
- join
neural network classifiers, 534-535
- - d-join, 168
NFST, 165, 166
- - outer-join, 175
normalization, 58, 165, 166
-- semi-join, 175
02,34,117 - project, 168
object - scan, 168
- identity, 87 - select, 168
- state, 87 OPTICS algorithm, 551
- type, 87 OQL, 106, 252
object algebra, 167 - abbreviations, 115
object diagrams, 61 - aggregate function, 107
object identifier, 117 - collection operation, 111
- B+-Tree, 119 - constructor, 107
- direct mapping, 119 - conversion, 114
- hash table, 119 - distinct, 108
- logical, 118 - grouping, 112
- physical, 117 - nested query, 110
object manager, 117 - nil, 108
object modeling, 61 - ordering, 112
object query language, 106 - path expression, 106
object server, 137 - quantifier, 109
object type definition, 88 - query block, 108
- struct, 109 - reverse reference list, 125
- UNDEFINED, 108 - RRR, 125
- view, 113 post-relational, 22
OQL queries, 35 prediction, 518, 538-540
Oracle, 32, 119 - linear model, 539
order by, 113 - linear regression, 538
ordering, 112 - regression, 538
ORE,203 - regression equation, 538
- migrate, 206 - regression tree, 539-540
- monitor, 206 prefetching, 473
- predict, 206 primary key, 55
- - OVERLAP, 208 Pro-motion infrastructure, 468
outboard table, 399 procedural data, 256
outlier, 548-550, 555 procedural query language, 230
outlier detection, 555-557 project operator, 168
- discordancy test, 556 projection operation, 229
- distance-based approach, 556 properties, 62
- statistical approach, 556 - behavioral, 70
outrigger table, 400 - relational, 70
OVERLAP, 208 - structural, 70
proxy, 448, 451
packet-oriented wireless technologies, public packet-switched networks, 441
439 pull-based data delivery, 468
page server, 137 push-based data delivery, 468
paging, 442
PAM algorithm, 543-544 quantifier
parallel join - existential, 178
- Grace hash-join, 201 - universal, 109, 178
- Hybrid hash-join, 202 Quel, 256
- sort-merge, 201 query, 20
partitioning, 196, 402, 415 - nested, 110
- DYOP, 202 query block, 108
- hash, 196 query execution plan, 165
- MAGIC, 199 query language, 20
- range, 196 query optimization, 230
- round-robin, 196 - cost-based, 165
path expression, 106, 140, 250 - disjunction, 185
PHS, 444 - in the presence of class hierarchies,
pivoting, 416 182
plan, 165 - phases, 164
plan generation, 181 query optimizer, 164
pointer chasing, 120 - architecture, 165
pointer swizzling, 124 query representation, 167
- adaptable, 124 query rewrite, 171
- copy, 124 - decorrelation, 173
- direct, 124 - quantifier, 178
- eager, 124 - semantic rewrite, 180
- in place, 124 - type-based rewriting, 171
- indirect, 124 - unnesting, 173
query server, 256 segmentation, 80
query shipping, 136 select, 108
querying techniques, 309 SELECT ... FROM ... WHERE, 31
select operator, 168
RAID architecture, 263 selection operation, 229
RAID-O, 263 semantic analysis, 165, 166
RAID-l,263 semantics, 53
RAID-2, 264 semi-structured data, 243
RAID-3,265 semi-structured processes, 369, 381
RAID-4, -5, -6, 265 sequential patterns, 490, 511
range partitioning, 196 session guarantees, 467
range predicate, 178 shared disk, 332
raster-graphics model, 269 shared everything, 331
recursion, 233 shared nothing, 332
redundancy, 53 Shore, 119
refinement, 101 signature, 161
reflective algebra, 256 similarity functions, 314
regression, 538 slicing, 416
reintegration, 457 snowflake scheme, 401
relational algebra, 30, 230 spatial data, 268
relational calculus, 230 specialization, 59
relational data model, 227 SPRINT algorithm, 522, 526
relational databases, 29 SQL, 29, 56, 68, 71
- deficiencies of, 80 star scheme, 399
relationship, 88 Starburst, 134
- recursive, 92 starflake scheme, 401
- ternary, 92 STING algorithm, 551-552
relationships, 54-57 STIRR algorithm, 554
- many-to-many, 53, 59 struct,109
- one-to-many, 52, 58 subclass, 60
relevance feedback, 303 substitutability, 96
relocation strategy, 138 subtype, 60, 96
retrieval manager, 405 sum, 107
reverse engineering, 228 superclass, 60
reverse reference list, 125 superimposed coding, 161
reverse reference relation, 153 superkey, 30
ROCK algorithm , 553-554 supertype,60
roles, 377 supply chain, 383
round-robin partitioning, 196 support, 491, 511
Rover, 468 support station, 437
RPC, 451 synchronization, 136
RRL,125
RRR,153 table, 227
rule, 232 template matching, 314
Rumor, 465 temporal database, 266
Texas persistent store, 127
satellite networks, 443 things, 62
scrubbing, 414 three-ball architecture, 380
search space, 165 tightly coupled, 40
transaction, 20 visitor node, 437
transaction concept, 270 VoD server, 323
transaction management, 461, 466
transaction processing, 51, 68 warehouse manager, 45, 405
transaction time, 267 warehouse repository, 407
translation of queries, 165 WaveCluster algorithm, 551-552
translation of queries into algebra, 168 weak connectivity, 462
tree-structure diagram, 25 weak transactions, 466
TSQL,267 Web service, 274
tuple, 227 web warehousing, 411
two-tier replication, 462 WebExpress, 451
type WfMS, 371
- hierarchy, 96 WfMS architecture, 378
type property where, 108
- behavior, 95 wireless local area networks, 443
- extent, 93 wireless networks, 438
- key, 93 workflow, 273
- status, 88 workflow activities, 372, 373
workflow coordination, 369
UNDEFINED,108 workflow definition tools, 377
union operation, 230 workflow dependencies, 372, 374
unsupervised classification, see workflow engine, 373
clustering Workflow Management Systems, 371
unsupervised learning, see clustering workflow models, 373
user-defined function, 107, 239 workflow monitoring tools, 378
user-defined time, 267 workflow participants, 374
user-defined type, 239 workflow reference architecture, 373
workflow resources, 372, 374
valid time, 267 workflow specification languages, 375
Versant, 119 wrapper, 45
vertical mining algorithms, 495
view, 113 XML,245
view integration, 66, 69 XPath,254
virtual data warehouse, 410 XQuery, 255
virtual memory, 127 XSL, 253
List of Contributors

Jacek Blazewicz Dimitrios Georgakopoulos


Institute of Bioorganic Chemistry Telcordia Technologies
Polish Academy of Sciences 106 East Sixth Street
Laboratory of Bioinformatics Austin
ul. Noskowskiego 12 Texas, USA
61-704 Poznan
Poland Shahram Ghandeharizadeh
941 W. 37th Place
Omran Bukhres Computer Science Department
Computer Science Department University of Southern California
School of Science Los Angeles
Purdue University CA 90089-0781, USA
723 W. Michigan St.
SL 280 Indianapolis Odej Kao
Indiana 46202, USA Department of Computer Science
TU Clausthal
Andrzej Cichocki Julius-Albert-Strasse 4
Telcordia Technologies D-38678 Clausthal-Zellerfeld
106 East Sixth Street Germany
Austin
Texas, USA
Alfons Kemper
Fakultat fur Mathematik
Ulrich Dorndorf
und Informatik
INFORM - Institut fur Operations
Research und Management GmbH Universitat Passau
Pascalstr. 23 Innstr.30
D-52076 Aachen 94030 Passau
Germany Germany

Chris Gahagan Russ Krauss


BMC Software BMC Software
2101 CityWest Blvd. 2101 City West Blvd.
Houston Houston
Texas 77042, USA Texas 77042, USA

Shan Gao Zbyszko Krolikowski


941 W. 37th Place Institute of Computing Science
Computer Science Department Poznan University of Technology
University of Southern California ul. Piotrowo 3a
Los Angeles 60-965 Poznan
CA 90089-0781, USA Poland
Guido Moerkotte Marek Rusinkiewicz
Fakultät für Mathematik Telcordia Technologies
und Informatik 106 East Sixth Street
Universität Mannheim Austin
D7,27 Texas, USA
68131 Mannheim
Germany
Gottfried Vossen
Tadeusz Morzy Dept. of Information Systems
Institute of Computing Science University of Munster
Poznan University of Technology Leonardo-Campus 3
ul. Piotrowo 3a D-48149 Munster
60-965 Poznan Germany
Poland and
PROMATIS Corp.
Jeffrey Parsons 3223 Crow Canyon Road
Faculty of Business Administration Suite 300
Memorial University of Newfound- San Ramon
land CA 94583, USA
St. John's
NF AlB 3X5, Canada
Maciej Zakrzewicz
Erwin Pesch Institute of Computing Science
University of Siegen Poznan University of Technology
FB5 - Management Information ul. Piotrowo 3a
Systems 60-965 Poznan
Hoelderlinstr. 3 Poland
D-57068 Siegen
Germany
Arkady Zaslavsky
Evaggelia Pitoura School of Computer Science
Department of Computer Science and Software Engineering
Metavatiko Building Monash University
Dourouti Campus 900 Dandenong Road
P.O. Box 1186 Caulfield East
GR 45110 - Ioannina Melbourne
Greece Vic 3145, Australia
