Series Editors
Peter Bernus · Jacek Blazewicz · Günter Schmidt · Michael Shaw
Springer
Berlin · Heidelberg · New York · Hong Kong · London · Milan · Paris · Tokyo
Handbook on Data Management in Information Systems
With 157 Figures and 9 Tables
Springer
Professor Jacek Blazewicz, e-mail: blazewic@put.poznan.pl
Institute of Bioorganic Chemistry
Polish Academy of Sciences
ul. Noskowskiego 12
61-704 Poznan, Poland
Jacek Blazewicz
Wieslaw Kubiak
Tadeusz Morzy
Marek Rusinkiewicz
Contents

3. Data Modeling
   Jeffrey Parsons
   1 Introduction
   2 Early Concerns in Data Management
   3 Abstraction in Data Modeling
   4 Semantic Data Models
   5 Models of Reality and Perception
   6 Toward Cognition-Based Data Management
   7 A Cognitive Approach to Data Modeling
   8 Research Directions

1. Introduction
   1.1 Database Systems
   1.2 Beyond Database Systems
   1.3 The Future Research
2. Survey of the Volume
1 Introduction

One of the most important applications of computers is the management of data in various forms, such as records, documents, scientific and business data, voice, video, and images. The systems used to store, manage, manipulate, analyze, and visualize these data are called data management systems. During the last 40 years the technology for managing data has evolved from simple file systems to multimedia databases, complex workflow systems, and large integrated distributed systems. Nowadays, such systems allow efficient, reliable, and secure access to globally distributed complex data.
The history of data management research is one of exceptional productivity and startling economic impact [SSU96]. Achievements in data management research underpin fundamental advances in communication systems, financial management, administration systems, medicine, law, knowledge-based systems, and a host of other civilian and defense applications. They also serve as the foundation for considerable progress in basic science, from computing to biology [SSU91, SSU96, BBC+98]. Research on data management has led to database systems becoming arguably the most important development in the field of software engineering, as well as the most important technology used to build information systems. It would now be unthinkable to manage the large volumes of data that keep corporations running without support from commercial database management systems (DBMSs). The field of database system research and development is an enormous success story over its 30-year history, both in terms of significant theoretical results and of significant practical commercial value. These achievements are documented and discussed in [EN00, Gra96, SSU91, SSU96].
The major strength of database systems is their ability to provide fast, nonprocedural, concurrent access to data while ensuring reliable storage and accurate maintenance of data. These features of database technology, efficiency and consistency, have enabled the development of huge financial systems, reservation systems, and other business systems. During the last decade, database systems have evolved from simple business-data-processing systems that operate on well-structured traditional data, such as numbers and character strings, to more complex object-relational systems that operate on multimedia "documents", video, geographic/spatial data, time series, voice, etc. Recent advances in database technology have led to exciting new applications of database systems: geographic information systems, CIM systems, CASE systems, data warehouses and OLAP systems, data mining systems, mobile systems, workflow systems, etc. However, despite the popularity and flexibility of database systems, which are now able to cope with data of increasing complexity, a large portion of data is still stored and processed in places other than database systems (flat files, data repositories, etc.). While the trend of building ever more powerful and flexible database management systems is justified by the increasing demands of their users, there is also a need for new data management solutions for new
physical schema, which describes the physical layout of the database records on storage devices. This logical-physical-subschema mechanism defined by the DBTG provided data independence. A number of DBMSs were subsequently developed following the DBTG proposal; these systems are known as CODASYL or DBTG systems. The IMS and CODASYL systems represented the first generation of DBMSs. The main disadvantage of both the IMS and CODASYL data models was the graph-based logical organization of data, in which programs navigate among records by following the relationships among them. This navigational interface to database systems was difficult even for programmers: to answer even simple queries they had to write complex programs to navigate these databases.
The fifth phase of data management evolution is related to relational databases. In 1970, E.F. Codd published the paper in which the relational data model was outlined. The relational data model gave database users and programmers high-level, set-oriented access to databases organized as sets of tables (relations). Many experimental relational DBMSs were implemented thereafter, with the first commercial products appearing in the late 1970s and early 1980s. The database research community in academia and industry, inspired by the relational data model, developed many important results and new ideas that changed database technology but that can also be applied to other information environments: the standard query language SQL and a theory of query language expressibility and complexity, query processing and optimization techniques, concurrent transaction management techniques, transactional recovery techniques, distributed and parallel processing techniques, etc. The list is not exhaustive, but rather illustrates some of the major technologies that have been developed by database research and development. The relational data model is still the most commonly supported among commercial database vendors. Relational DBMSs are referred to as second-generation DBMSs.
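The practical meaning of "set-oriented access" is that a query states what is wanted and leaves the access path to the DBMS. The following sketch illustrates this with Python's built-in sqlite3 module; the table, columns, and data are invented for the example.

    import sqlite3

    # A small in-memory relational database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?, ?)",
        [("Adam", "Sales", 3000), ("Ewa", "Sales", 3500), ("Jan", "R&D", 4000)],
    )

    # Declarative, set-oriented access: the query names the desired set of
    # tuples; the DBMS chooses the access path (scan, index, ...) itself.
    for row in conn.execute(
        "SELECT name, salary FROM employee WHERE dept = 'Sales' ORDER BY salary"
    ):
        print(row)

In a first-generation navigational system, the same request would have required a program that follows record-to-record links and filters and sorts the results itself.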
According to Jim Gray [Gra96], we are now in the sixth phase of data management evolution. This phase began in the mid-1980s with a new data model, called the object-oriented data model, based on object-oriented programming principles. Relational database systems have several shortcomings. First of all, they have limited modeling capabilities. Second, they offer a limited, predefined set of data types. Although SQL added new data types for time, time intervals, timestamps, dates, currency, and different types of numbers and character strings, this set of data types is still insufficient for some applications. Moreover, relational database systems maintain a clear distinction between programs and data. With new fields of database application, however, this separation between programs and data became problematic. New applications require new data types together with a definition of their behavior. In other words, DBMSs should let users create their own application-specific data types that would then be managed by the DBMS. The object-oriented data model assumes the unification of programs and data.
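A minimal sketch of what this unification means in practice is given below, in plain Python rather than in any particular object-oriented DBMS: an application-specific type packages its state together with its behavior, so the stored "data" know how to operate on themselves. The type and its method are invented for illustration.

    from dataclasses import dataclass

    @dataclass
    class TimeSeries:
        """An application-specific data type: state plus behavior."""
        timestamps: list
        values: list

        def moving_average(self, window: int) -> list:
            # Behavior stored together with the data it applies to.
            return [
                sum(self.values[i:i + window]) / window
                for i in range(len(self.values) - window + 1)
            ]

    ts = TimeSeries(timestamps=[1, 2, 3, 4], values=[10.0, 12.0, 11.0, 13.0])
    print(ts.moving_average(2))   # [11.0, 11.5, 12.0]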
on multimedia data, and the ability to deal with heterogeneous data sources in a uniform way. In the years ahead, multimedia DBMSs are expected to dominate the database marketplace.

This brief outline of the history of database systems research does not cover all developments and achievements in the database field. Database research has developed several different "types" of DBMS for specific application areas [CB02, EN00, LBK02]. Temporal database systems are used to support applications that require some aspect of time when organizing their data. Temporal data models incorporate time as a first-class element of the system, not merely as another data type. Therefore, they can store and manage a history of database changes, and allow users to query both current and past database states. Some temporal database models also allow users to store expected future information. Spatial database systems were developed to meet the needs of applications that store and manage data with spatial (multidimensional) characteristics. These database systems are used in applications such as weather information systems, environmental information systems, and cartographic information systems. For example, cartographic information systems store maps together with two- or three-dimensional spatial descriptions of their objects (countries, rivers, cities, roads, etc.). A special kind of spatial database system is used for building Geographic Information Systems (GIS), which can store and manage data originating from digital satellite images, roads, transportation networks, etc. Building a GIS requires advanced features in data storage, management, and visualization that are not supported by traditional DBMSs. Moreover, very often, new GIS applications process data that have both temporal and spatial characteristics; systems supporting such applications require new functionality for storing and managing those data types. In the 1990s several research prototypes were developed that combined spatial and temporal DBMSs to create a new type of DBMS called spatio-temporal database systems. Real-time database systems (RTDBSs) are used to process transactions (applications) that have timing constraints associated with them and that access data whose values and validity change in time. These constraints, usually expressed in the form of a deadline, arise from the need to make the results of transactions available in time to a system that has to make appropriate control decisions. The importance of real-time database systems results from the increasing number of real-time applications maintaining and processing large volumes of data. Such applications include computer-integrated manufacturing, factory automation and robotics, workflow systems, aerospace systems, military command and control, medical monitoring, traffic control, etc. RTDBSs were created by integrating real-time systems with traditional database systems. Active database systems are used to support applications that require some kind of activity on the side of the data. Active database systems provide additional functionality for specifying so-called active rules. These rules, also referred to as ECA (Event-Condition-Action) rules, specify an action that is executed automatically whenever a given event occurs and an associated condition holds.
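A concrete flavor of an ECA rule can be given with a database trigger. The sketch below uses SQLite's trigger mechanism through Python's sqlite3 module; the schema and the rule are invented: the event is an insertion into orders, the condition compares the ordered quantity with the stock, and the action aborts the offending statement.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, stock INTEGER);
    CREATE TABLE orders  (product_id INTEGER, qty INTEGER);
    INSERT INTO product VALUES (1, 5);

    -- An ECA rule: Event = INSERT ON orders, Condition = the WHEN clause,
    -- Action = abort the transaction.
    CREATE TRIGGER check_stock
    BEFORE INSERT ON orders
    WHEN NEW.qty > (SELECT stock FROM product WHERE id = NEW.product_id)
    BEGIN
        SELECT RAISE(ABORT, 'insufficient stock');
    END;
    """)

    conn.execute("INSERT INTO orders VALUES (1, 3)")       # accepted
    try:
        conn.execute("INSERT INTO orders VALUES (1, 10)")  # rejected by the rule
    except sqlite3.IntegrityError as e:
        print("rule fired:", e)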
Traditionally, database systems are used to store and manage large volumes of data, and, as outlined above, much database research has focused in this direction. However, the concepts and solutions developed in the field of database systems are of significant importance in many fields of computer science, and they can be applied and extended in different interesting ways. Recently, important new fields of data management have emerged, each a new environment to which data management technology, especially database technology, had to be adapted: data warehousing and OLAP, data mining, and workflow management. We discuss each briefly in turn.
Data mining. Over the last decades, many organizations have generated and collected large amounts of data in the form of files, documents, and databases. From the point of view of decision makers, simply storing information in databases and data warehouses does not provide the benefits an organization is seeking. To realize the value of stored data, it is necessary to extract the knowledge hidden within databases and/or data warehouses [HK00, Hol03a, Hol03b, WF00]. Useful knowledge can be partially discovered using OLAP tools; this kind of analysis is often called query-driven data analysis. However, as the amount and complexity of the data stored in large databases and data warehouses grow, it becomes increasingly difficult, if not impossible, for decision makers to manually identify trends, patterns, regularities, rules, constraints, and relationships in the data using query and reporting tools. Data mining is one of the best ways to extract or discover meaningful knowledge from huge amounts of data. Data mining is the process of discovering frequently occurring, previously unknown, and interesting patterns, relationships, rules, anomalies, and regularities in large databases and data warehouses. The main goal of this analysis is to help human analysts understand the data. To illustrate the difference between OLAP and data mining analysis, consider typical queries formulated with both technologies. A typical OLAP query is: How many bottles of wine did we sell in the first quarter of 2003 in Poland vs. Austria? Typical data mining queries are: How do the buyers of wine in Poland and Austria differ? What else do the buyers of wine buy along with wine? How can the buyers of wine be characterized? Which clients are likely to respond to our next promotional mailing, and why?
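The contrast can be made concrete on toy data. In the following sketch (data and names invented), the OLAP-style question is a precise aggregate that the analyst already knows how to phrase, while the data-mining-style question searches the data for a pattern, here which products co-occur with wine in customers' baskets.

    from collections import Counter

    # Toy transaction data: one basket of products per customer visit.
    baskets = [
        {"wine", "cheese", "bread"},
        {"wine", "cheese"},
        {"beer", "chips"},
        {"wine", "grapes", "cheese"},
    ]

    # OLAP-style query: a known, precisely formulated aggregate.
    wine_sales = sum("wine" in b for b in baskets)
    print("baskets containing wine:", wine_sales)

    # Data-mining-style question: which items co-occur with wine, how often?
    co_occurrence = Counter()
    for b in baskets:
        if "wine" in b:
            co_occurrence.update(b - {"wine"})
    print("bought along with wine:", co_occurrence.most_common())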
Data mining technology can be used in many industries and applications [HK00]: marketing, manufacturing, financial services, telecommunications, healthcare, scientific research, and even sport. Data mining is now one of the most exciting new areas of data management. It is still evolving, building on ideas from the latest scientific research, and it incorporates the latest developments from artificial intelligence, statistics, optimization, parallel processing, database systems, and data warehousing. From the conceptual point of view, data mining can be perceived as advanced database querying, since the resulting knowledge in fact already exists in the database or data warehouse.
development models that provide the extended transaction and workflow ca-
pabilities to suit the needs of complex applications accessing heterogeneous
systems.
One of the most important trends in databases is the increased use of parallel processing and data partitioning techniques in database management systems. Parallel DBMSs are based on the premise that single-processor systems can no longer meet the growing requirements for cost-effective scalability, reliability, and performance. A powerful and financially attractive alternative to a single-processor-driven DBMS is a parallel DBMS driven by multiple processors. Given predicted future database sizes and the complexity of queries, the scalability of parallel database systems to hundreds and thousands of processors is essential for satisfying the projected demands. Parallel DBMSs can improve the performance of complex query execution through parallel implementation of various operations (load, scan, join, sort) that allow multiple processors to share the processing workload automatically. Chapter 5 describes three key components of a high-performance parallel database management system: data partitioning techniques, algorithms for parallel processing of the join operation, and a data migration technique that controls the placement of data in response to changing workloads and evolving hardware platforms.
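The idea behind partitioned parallelism can be sketched compactly: a hash function assigns the tuples of both input relations to partitions, and tuples that can possibly join always land in the same partition, so each partition pair is an independent unit of work that a parallel DBMS would assign to a different processor. The sketch below processes the pairs sequentially for simplicity; the relations and names are invented.

    N_PARTITIONS = 3

    def partition(relation, key_index):
        # Hash-partition a relation on its join attribute.
        parts = [[] for _ in range(N_PARTITIONS)]
        for tup in relation:
            parts[hash(tup[key_index]) % N_PARTITIONS].append(tup)
        return parts

    def join_partition(r_part, s_part):
        # Classic in-memory hash join of one partition pair.
        table = {}
        for r in r_part:
            table.setdefault(r[0], []).append(r)
        return [r + s for s in s_part for r in table.get(s[0], [])]

    R = [(1, "a"), (2, "b"), (3, "c")]            # R(id, x)
    S = [(1, "east"), (3, "west"), (3, "north")]  # S(id, region)

    result = []
    for rp, sp in zip(partition(R, 0), partition(S, 0)):
        result.extend(join_partition(rp, sp))     # independent work units
    print(result)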
ing) are presented in detail. The chapter concludes with a discussion of future research directions.
We should note that there are also other subject areas relevant to the research and development agenda for next-generation data management systems, namely: deductive and object-deductive systems, XML and semistructured data management, genome data management, database tuning and administration, real-time database systems, and data stream issues. Unfortunately, due to the scope limitations of the handbook, it was impossible to present all aspects of data management. The reader should therefore not take the absence of chapters on these topics to mean that they are unimportant. The handbook covers a large amount of the available knowledge on current data management technologies and, we hope, will be useful for understanding further developments in the field of data management.
References

[LBK02] Lewis, P.M., Bernstein, A., Kifer, M., Databases and transaction processing: an application-oriented approach, Addison-Wesley, 2002.
[LR00] Leymann, F., Roller, D., Production workflow - concepts and techniques, Prentice Hall, Upper Saddle River, NJ, 2000.
[RS95] Rusinkiewicz, M., Sheth, A., Specification and execution of transactional workflows, W. Kim (ed.), Modern database systems, Addison-Wesley, Reading, MA, 1995, 592-620.
[SSU91] Silberschatz, A., Stonebraker, M.J., Ullman, J., Database systems: achievements and opportunities, SIGMOD Record 19(4), 1991, 6-22 (also in Communications of the ACM 34(10), 1991, 110-120).
[SSU96] Silberschatz, A., Stonebraker, M.J., Ullman, J. (eds.), Database research: achievements and opportunities into the 21st century, SIGMOD Record 25(1), 1996, 52-63.
[SZ96] Silberschatz, A., Zdonik, S.B., Strategic directions in database systems - breaking out of the box, ACM Computing Surveys 28(4), 1996, 764-778.
[WF00] Witten, I.H., Frank, E., Data mining: practical machine learning tools and techniques with Java implementations, Morgan Kaufmann, 2000.
2. Database Systems: from File Systems to Modern Database Systems
Z. Krolikowski, T. Morzy
During the past forty years, databases have ceased to be simple file systems and have become collections of data that simultaneously serve a community of users and several distinct applications. For example, an insurance company might store in its database the data for policies, investments, personnel, and planning. Although databases can vary in size from very small to very large, most databases are shared by multiple users or applications [Bro86].
Typically, a database is an enterprise resource in relation to which three human roles are distinguished: the database administrator, application programmers, and end users. A database administrator is responsible for designing and maintaining the database. Application programmers design and implement database transactions and application interfaces, whereas end users use prepared applications and, possibly, high-level database query languages. The design of database applications can be stated as follows: given the information and processing requirements of an information system, construct a representation of the application that captures the static and dynamic properties needed to support the required transactions and queries. A database represents the properties common to all applications, and hence is independent of any particular application. The process of capturing and representing these properties in the database is called database design.
The representation that results from database design must be able to meet the ever-changing requirements of both existing and new applications. A major objective of database design is to assure data independence, which means isolating the database and the associated applications from logical and physical changes. Ideally, the database could be changed logically (e.g., by adding objects) or physically (e.g., by changing access structures) without affecting applications, and applications could be added or modified without affecting the database.
Static properties include objects, object properties (called attributes), and relationships among objects. Dynamic properties encompass query and update operations on objects, as well as relationships among operations (e.g., to form complex operations called transactions). Properties that cannot be expressed conveniently as objects or operations are expressed as semantic integrity constraints. A semantic integrity constraint is a logical condition expressed over objects (i.e., database states) and operations.
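For illustration, the sketch below (an invented micro-example using Python's sqlite3 module) shows a static property, the EMPLOYEE object type with its attributes, together with a semantic integrity constraint stated as a logical condition that every database state must satisfy, and a dynamic property, an update operation wrapped in a transaction.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE employee (
            name   TEXT NOT NULL,
            salary INTEGER CHECK (salary >= 0)  -- holds in every state
        )
    """)
    with conn:  # a transaction: a unit of dynamic behavior
        conn.execute("INSERT INTO employee VALUES ('Anna', 4200)")
    try:
        conn.execute("INSERT INTO employee VALUES ('Piotr', -1)")
    except sqlite3.IntegrityError:
        print("rejected: violates the integrity constraint")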
The result of database design is a schema that defines the static properties, plus specifications for the transactions and queries that define the dynamic properties. A schema consists of definitions of all application object types, including their attributes, relationships, and static constraints. A database is then a data repository that corresponds to the schema: it consists of instances of the objects and relationships defined in the schema. A particular class of processes within an application may need to access only some of the static properties of a predetermined subset of the objects. Such a subset, which
translate those DML statements into the appropriate host language calling sequences. The syntax of DML statements resembles the syntax of the host language. A preprocessor is provided for the following host languages: COBOL, PL/I, FORTRAN, and System/370 Assembler Language.

C.J. Date in [Dat95] gave, among others, the following critical comments on network systems in general, and on CODASYL systems and IDMS in particular. Networks are complicated; consequently, the data structures are complex. The operators are complex, and they would remain complex even if they functioned at the set level instead of on just one record at a time.
[Figure: a network schema fragment linking the record types COURSE, TEACHER, STUDENT, and PREREQUISITE.]
In 1970, Codd's classic paper, "A Relational Model of Data for Large Shared Data Banks", presented the foundation for relational database systems. Since then, many commercial relational database systems, such as Oracle, DB2, Sybase, Informix, and Ingres, have been built. In fact, relational database systems have dominated the database market for years. The remarkable success of relational database technology can be attributed to such factors as having a solid mathematical foundation and employing an easy-to-use query language, i.e., SQL (Structured Query Language). SQL is a declarative language in the sense that users need only specify what data they are looking for in a database, without stating how to get the data. The relational data model, the basic relational operators, and the relational query language SQL are briefly reviewed below [Dat95, EN99, KS86, Nei94, Ull89, Ram03].
In a relational database [Ram03], data are organized in table format. Each table (or relation) consists of a set of attributes describing the table; each attribute corresponds to one column of the table and is associated with a domain indicating the set of values the attribute can take. Each row of a table is called a tuple, and usually describes one real-world entity and/or a relationship among several entities. It is required, for any tuple and any attribute of a relation, that the value of the tuple
under the attribute be atomic. The atomicity of an attribute value means that no composite or set value is allowed. For each relation, there exists an attribute or a combination of attributes such that no two tuples in the relation can have the same values under that attribute or combination of attributes. Such an attribute or combination of attributes is called a superkey of the relation: each tuple of a relation can be uniquely identified by its values under a superkey. If every attribute in a superkey is needed for it to uniquely identify each tuple, then the superkey is called a key. In other words, every key has the property that if any attribute is removed from it, the remaining attribute(s) can no longer uniquely identify each tuple. Clearly, any superkey consisting of a single attribute is also a key. Each relation must have at least one key, but a relation may have multiple keys; in this case, one of them is designated as the primary key, and each of the remaining keys is called a candidate key. Note that key and superkey are concepts associated with a relation, not just with its current set of tuples. In other words, a key (superkey) of a relation must remain a key (superkey) even when the instance of the relation changes through insertions and deletions of tuples.

Relational algebra is a collection of operations used to manipulate relations. Each operation takes one or two relations as input and produces a new relation as output. The operations are chosen in such a way that all well-known types of queries may be expressed by their composition in a rather straightforward manner. First, the relational algebra contains the usual set operations: Cartesian product, union, intersection, and difference. Second, it includes the operations of projection, selection, join, and division. The latter are in fact characteristic of the relational algebra and essential for its expressive power in stating queries. If we represent the relation R as a table, then the projection of R over a set of attributes X is interpreted as the selection of those columns of R that correspond to the attributes X, followed by the elimination of duplicate rows in the resulting table. Similarly, the operation of selection applied to R may be interpreted as the elimination of those rows of R that do not satisfy the specified condition.
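Projection and selection are easy to mimic on relations represented as sets of tuples, which also shows why duplicate elimination comes for free in the set-based formulation. The sketch below uses invented data.

    # A relation as a set of tuples, with a schema naming the attributes.
    schema = ("name", "dept", "salary")
    R = {
        ("Adam", "Sales", 3000),
        ("Ewa",  "Sales", 3500),
        ("Jan",  "R&D",   4000),
    }

    def project(relation, schema, attrs):
        # Keep the chosen columns; the set removes duplicate rows.
        idx = [schema.index(a) for a in attrs]
        return {tuple(t[i] for i in idx) for t in relation}

    def select(relation, predicate):
        # Keep the rows satisfying the condition.
        return {t for t in relation if predicate(t)}

    print(project(R, schema, ["dept"]))       # {('Sales',), ('R&D',)}
    print(select(R, lambda t: t[2] > 3200))   # Ewa and Jan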
    e.in_unit.unit_type

are called path expressions. Path expressions express so-called implicit (hidden) joins; implicit joins are also possible in OQL.

Predicates in an OQL query can be formed using set attributes and the set membership operator in. For example, the query below selects those professors who teach the course entitled 'Introduction to databases':

    select p.name from Professors p
    where 'Introduction to databases' in p.teaches.course.subject;

In an OQL query a path expression can be bound to a variable, called a reference variable; this variable can then be used within the query. Reference variables can be considered shorthand for path expressions.
An OQL query can use methods of two kinds: predicate methods and derived-attribute methods. A predicate method returns a Boolean value for each object it is invoked on, whereas a derived-attribute method is used to compute the value of an attribute (or attributes) of an object and return this value; such a method can be invoked for each object returned by a query. A derived-attribute method can be used in a query just like an attribute. For example, let us assume that the class Material defines an attribute melting_temperature measured in degrees Celsius. A method melt_tem_F could be defined in the same class to convert a melting temperature from Celsius to Fahrenheit. While querying Material, the method melt_tem_F can be invoked to return the melting temperature of objects in Fahrenheit.
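In plain Python the same idea looks as follows (the class and method names follow the example above; only the standard Celsius-to-Fahrenheit formula is assumed):

    class Material:
        def __init__(self, name: str, melting_temperature: float):
            self.name = name
            self.melting_temperature = melting_temperature  # stored, Celsius

        def melt_tem_F(self) -> float:
            # A derived attribute: computed from a stored one on demand.
            return self.melting_temperature * 9 / 5 + 32

    m = Material("aluminium", 660.3)
    print(m.melt_tem_F())   # 1220.54, i.e. the melting point in Fahrenheit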
The set of instances of a subclass is a subset of the instances of its superclass. For example, the instances of the Radio class are at the same time electronic devices. The existence of an inheritance hierarchy allows a new kind of querying technique. When querying along an inheritance hierarchy, one may be interested in retrieving objects from some, but not all, classes in the hierarchy. For example, one query rooted at ElectronicDevice may be issued in order to retrieve the instances of the ElectronicDevice class and the Radio class, but not of the TapeRecorder class, whereas another query rooted at ElectronicDevice may return the instances of ElectronicDevice as well as the instances of all its subclasses.
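The two kinds of rooted queries can be imitated with class extents; the extent bookkeeping below is invented purely for illustration.

    class ElectronicDevice: pass
    class Radio(ElectronicDevice): pass
    class TapeRecorder(ElectronicDevice): pass

    extent = [ElectronicDevice(), Radio(), Radio(), TapeRecorder()]

    # Rooted at ElectronicDevice, restricted to chosen classes only.
    some = [o for o in extent if type(o) in (ElectronicDevice, Radio)]

    # Rooted at ElectronicDevice, including all subclasses.
    all_devices = [o for o in extent if isinstance(o, ElectronicDevice)]

    print(len(some), len(all_devices))   # 3 4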
With the support of OQL, object-oriented database systems have to provide optimization techniques for such queries. Query optimization in OODBSs is more difficult than in relational systems for the following reasons:
• Different data types - the input type and the output type of an OQL query may differ, which makes designing an object algebra difficult. Most of the proposed object algebras use separate sets of algebra operators dedicated to individual types, e.g., object operators, tuple operators, set operators, and list operators. As a consequence, object algebras and their equivalence transformation rules are much more complicated than in relational systems.
• Methods - they are written in a high-level programming language and their code is usually hidden from the query optimizer in order to preserve
• versioning the whole database with its schema, e.g. Orion and Itasca.
[Figure: a federated architecture fragment, showing a user/external schema, a constructing processor, and export schemas.]

[Figure: a data warehouse architecture, with data sources at the bottom and data marts at the top.]
At the lowest layer of this architecture the data sources DS1, DS2, and DS3 are located. These sources may contain heterogeneous information: structured data, stored in relational, object-relational, or object-oriented databases; semistructured data, in the format of XML or SGML; or unstructured information. Data sources are usually distributed.
[Figure: the layers of the warehouse architecture: data marts (DM1, DM2) at the top, operational data stores below them, and data sources at the bottom.]
8 Conclusions

In this chapter, we have introduced and briefly discussed the basic definitions and concepts of database systems, including the data model, the database management system, the transaction, the query language, etc. We have also explained how databases are created and used. We then briefly described the evolution of database systems, starting from simple record-oriented navigational database systems to relational and object-relational systems, and explored the background, characteristics, advantages, and disadvantages of the main database models: hierarchical, network, relational, and object-oriented. Finally, we addressed the issue of data integration in a heterogeneous computing environment, briefly presenting the advantages and disadvantages of the three basic approaches to data integration, namely federated database systems, data warehousing systems, and mediated systems.
3. Data Modeling
Jeffrey Parsons
1. Introduction
2. Early Concerns in Data Management
3. Abstraction in Data Modeling
   3.1 Traditional Data Models
4. Semantic Data Models
   4.1 Specialization/Generalization
   4.2 Composition
   4.3 Materialization
   4.4 Encapsulation
   4.5 Emergent Themes in Data Modeling Abstractions
5. Models of Reality and Perception
   5.1 Ontology
   5.2 Cognition
   5.3 Reconciling Models of Data with Models of Perceived Reality
6. Toward Cognition-Based Data Management
   6.1 Classification Issues
   6.2 Well-Defined Entity Types
   6.3 Stable Entity Types
   6.4 Shared Entity Types
7. A Cognitive Approach to Data Modeling
   7.1 Other Applications of Cognition to Data Modeling
8. Research Directions
1 Introduction

In the field of information technology, there has been a tremendous focus on improvements in hardware, ranging from processing speed to primary and secondary storage capacity to telecommunications bandwidth. In addition, much has been made of advances in software, exemplified by generations of programming languages with increasing levels of abstraction [Sha84]. In the field of data management, there has been a similar, though perhaps less widely recognized, degree of progress. Several reviews of the evolution of data modeling have been written (e.g., [Bro84], [Nav92], [TL82]), focusing mainly on the structuring of data in various models. In this chapter, a case is made for viewing progress in data management in terms of the degree to which a database can be viewed as a model of knowledge about some segment of the real world.
Section 2 examines early concerns in data modeling. Section 3 outlines the
changes in focus during the evolution through flat file, hierarchical, network,
and relational approaches to organizing data. Section 4 discusses the increas-
ing level of abstraction demonstrated in research on semantic data models
and conceptual modeling. Section 5 introduces a framework for understand-
ing the ontological and conceptual foundation of data modeling by outlining
views of ontology and cognition as human activities of creating models of the
world. Section 6 describes what can be gained from a cognitive approach to
data modeling. Section 7 summarizes an information model based on cogni-
tive principles. Section 8 concludes by outlining some directions for future
research in data management.
method that constituted the only way to access data on the dominant early secondary storage medium - magnetic tape. Under a constraint of sequential access, efficient applications are those that process most or all of the data in a file. In particular, since updating a file on magnetic tape requires writing all the data in the file to a new tape, updates to only a few pieces of data at a time are extremely inefficient. Consequently, batch processing emerged as the dominant data processing strategy in early applications. In batch processing, a file of transaction data (e.g., all the transactions for a day) is used to update a master file of application data at certain intervals.

In conjunction with batch techniques, processing efficiency is maximized by sorting both master and transaction data on the same field, so that the master file can be updated by reading it sequentially only once.
This discussion of processing strategies in a sequential access world is rel-
evant to data modeling since it leads naturally to a particular approach to
organizing data. To illustrate, suppose that a master file contains accounts
payable balances for customers of an organization. Transactions consist of
purchases and payments by customers. The only way to support efficient
batch processing (i.e., processing the batch while reading the master and
transaction files only once) is to first organize the master file sorted by cus-
tomer identification data (e.g., customer number). The transaction file is
similarly sorted by customer identification data and all the transactions of
each customer for the batch are grouped and arranged in sequence on the
tape.
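The one-pass update can be sketched as a merge of two sorted sequences. In the Python sketch below, lists stand in for the tape files, and the record layouts are invented.

    # Master and transaction data, both sorted by customer number, as they
    # would be arranged on tape.
    master = [(1, 100.0), (2, 250.0), (3, 0.0)]        # (cust_no, balance)
    transactions = [(1, -40.0), (1, 15.0), (3, 99.0)]  # (cust_no, amount)

    new_master = []
    t = 0
    for cust_no, balance in master:           # read the master once, in order
        while t < len(transactions) and transactions[t][0] == cust_no:
            balance += transactions[t][1]     # apply this customer's batch
            t += 1
        new_master.append((cust_no, balance)) # write the new master "tape"

    print(new_master)   # [(1, 75.0), (2, 250.0), (3, 99.0)]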
This example implies that the data organization method appropriate to
sequential access is a variable-length record-based structure. A record consists
of a collection of related fields. Records may vary in length since, for any batch
of transactions, there may be anything from zero to many transactions for a
single master record. In this structure, all the transactions pertaining to the
same master record are arranged in sequence, followed by all the transactions
for the next master record, and so on.
In this structure, the data contain little, if any, semantics. Instead, the
relevant knowledge about how to interpret the next byte of data is contained
in the program(s) which access that data. One negative consequence of this is
that, if the data structure is changed for any reason, all programs that access
that data need to be changed. This limitation was a major factor motivat-
ing the subsequent development of the data modeling field. Data modeling
involves embedding domain semantics in the structure of the data.
The evolution of data modeling was, in some respects, possible due to
the development of direct-access secondary storage devices (disks) to replace
sequential access devices. This allowed data management to be driven by
semantic issues, instead of the constraints of the technology of secondary
storage devices.
[Figure: a hierarchical schema rooted at DEPARTMENT (DNAME, DNUMBER, MGRNAME, MGRSTARTDATE), with subordinate record types EMPLOYEE, PROJECT, DEPENDENT, and WORKER.]
and time-variant relationships that can exist between things in the applica-
tion domain [Ken78]. To illustrate, although the above hierarchical structure
is appropriate for updating department information as employees are hired or
fired and projects started, worked on, or completed, it does not easily support
finding information such as which projects a given worker works on. To model
such a relationship, a separate hierarchical structure would be needed (e.g.,
WORKER-+PROJECT). Alternatively, virtual pointers can be used to avoid
some of the duplication that would result from implementing distinct hierar-
chical structures [EN89]. Moreover, answering typical managerial queries such
as "Which departments have projects located in Canada?" would in prac-
tice require replicating the data in yet another hierarchical structure (e.g.,
PROJECT→DEPARTMENT, where PROJECT records might be sorted by
project location). In other words, accommodating a wide variety of uses re-
quires developing a potentially large number of distinct hierarchies. Moreover,
since it would be difficult, if not impossible, to determine all possible uses
during database design, it might be necessary to add new hierarchical struc-
tures as new uses of the data are identified. In other words, the hierarchical
structure does not capture enough domain semantics to support a wide range
of database uses. Providing additional hierarchies increases the complexity of
the database, and may result in a high level of redundancy, with the associ-
ated insertion, deletion, and update anomalies that accompany poor database
design [EN89].
To combat the replication associated with multiple hierarchies or virtual
pointers needed to represent many-to-many relationships in a pure hierar-
chical database, the network data model (or CODASYL DBTG model) was
developed [COD71], [COD78]. The essence of the network model is that it
[Figure: the relational schema corresponding to the company database example:
EMPLOYEE(FNAME, MINIT, LNAME, SSN, BDATE, ADDRESS, SEX, SALARY, SUPERSSN, DNO),
DEPARTMENT(DNAME, DNUMBER, MGRSSN, MGRSTARTDATE),
DEPT_LOCATIONS(DNUMBER, DLOCATION),
PROJECT(PNAME, PNUMBER, PLOCATION, DNUM),
WORKS_ON(ESSN, PNO, HOURS),
DEPENDENT(ESSN, DEPNAME, SEX, BDATE, RELATIONSHIP).]
in the network model. Terms for such an extended model include semantic
data model, information model, and conceptual model. While these terms can
be argued to refer to different kinds of models (on a continuum from data-
oriented to knowledge-oriented), the emphasis in semantic data models and
information models has been on mechanisms to capture more domain knowl-
edge in the structure of data. Therefore, the terms are used interchangeably
in this chapter, and follow the original authors' usage as much as possible.
Although the evolution of data models shows a continual increase in the
degree of domain semantics captured in the conceptual schema, the adjective
"semantic" gained popular use in describing data models only in the early
1980s. The term may have been adapted to data modeling because the se-
mantic models borrowed ideas from research in knowledge representation on
semantic networks [TL82], [BMS84]. In fact, one of the stated motivations of
semantic data modeling was to develop representation constructs that corre-
spond more closely to how humans think about a problem domain [Che76],
[Bro84], [HM81]. Semantic models recognized that the relational model did
not easily or naturally allow the database designer to capture a great deal of
what users know about how the data can be interpreted in the context of the
subject matter of the database; that is, the relational model cannot express
much of the semantics of the data.
The first of the widely known semantic data models is generally acknowl-
edged to be the entity-relationship (ER) model [Che76]. The ER model in-
troduced the notion of entity type as a fundamental modeling construct. An
entity type expresses the similarity of a set of real world entities. The sim-
ilarity of entities belonging to a given entity type is characterized by the
attributes defined for an entity type (all entities of a given type share all the
attributes defining that type) and by the relationships which link different
entity types.
The ER model contains more abstract representation constructs than any
of the classical data models. The model is purely conceptual in that its main
constructs - entities, attributes, and relationships - imply no implementation
mechanism (such as pointers). Indeed, an ER design can easily be converted
to a design in any of the hierarchical, network, or relational models, although
some semantics may be lost in the conversion. However, without additional
information (specifically the semantics lost in the conversion), it is not pos-
sible to convert a design in any of the classical models to a semantically
equivalent ER design.
The additional semantics of the ER model can be seen clearly by exam-
ining a conversion from an ER representation (ER diagram) to a relational
structure. Figure 4.1 contains an ER diagram for the project database exam-
ple introduced earlier.
Generally, an entity type can be converted to a relational table, with the
attributes of the entity type becoming attributes in the relation. However,
to avoid anomalies in a relational design, the table should be normalized or
4.1 Specialization/Generalization
Perhaps the most widely used construct in semantic data modeling is the
notion of a hierarchy of entity types or classes. Specialization/generalization
hierarchies or networks can be found in models such as [SS77], RM/T [Cod79],
2 The rationale for the distinction between entities and relationships is not always
clear in an ER model, as the designer has some latitude in deciding whether to
model something as an entity type or as a relationship [Ken78]. Hence, the in-
tended domain semantics is lost in the relational representation. In addition, there
is no explicit indication in the schema whether a relationship involves mandatory
or optional participation of the related entity types. As with one-to-many rela-
tionships, this semantics can only be revealed by examining the contents of the
database.
4.2 Composition
4.3 Materialization
Recently, another abstraction has been recognized in the data modeling liter-
ature - materialization [GS94]. The basis of materialization is the recognition
that some entities or things of interest in a domain have only a conceptual,
rather than physical, existence, but have a specific manifestation in a number
of physical entities. A classical example occurs in a video store between the
abstract concept of MOVIE and the concrete concept COPY. Each instance
of MOVIE is an abstract entity that is manifested in one or more instances
of COPY, the latter reflecting individual copies of the movie in a store's
inventory.
Goldstein and Storey [GS94] demonstrate that the semantics of this ab-
straction cannot be captured by traditional abstraction mechanisms such
as specialization/generalization and composition (either alone or in combi-
nation). They further give evidence that materialization is a relatively com-
mon abstraction for organizing knowledge about conceptual entities and their
manifestations across a variety of applications. Attempts to express material-
ization through a generic modeling construct such as the relationship in the
ER model will result in a loss of semantics about the nature of the linkage
between the entities.
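A minimal sketch of the MOVIE/COPY example (attribute names invented): each concrete COPY materializes one abstract MOVIE, carries physical attributes of its own, and reaches the conceptual attributes through the materialization link.

    class Movie:
        """An abstract, conceptual entity."""
        def __init__(self, title: str, director: str):
            self.title = title
            self.director = director

    class Copy:
        """A concrete entity materializing one Movie."""
        def __init__(self, movie: Movie, shelf_id: str):
            self.movie = movie        # the materialization link
            self.shelf_id = shelf_id  # a physical attribute of this copy

    vertigo = Movie("Vertigo", "Hitchcock")
    copies = [Copy(vertigo, "A-12"), Copy(vertigo, "B-03")]
    # Conceptual attributes are reached through the materialization link:
    print([(c.shelf_id, c.movie.title) for c in copies])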
4.4 Encapsulation
5.1 Ontology
5.2 Cognition
Knowledge organization. Cognition is the study of human thinking. An
integral part of cognitive research is the investigation of mechanisms for
organizing knowledge. Much research in cognitive psychology has studied
the semantic structure of memory, proposing a variety of memory models
[Smi78], [RN88]. Several of these models deal with the nature of concepts and
the categorization of individual things as instances of concepts [SM81]. In
other words, developing concepts and categorizing things involves developing
a model of the world.
Two aspects of cognition are particularly relevant to modeling the world.
First, perception involves recognizing features that distinguish things in the
world from other things. Since there are potentially a very large number of
features that could serve to identify and distinguish things, perception in-
herently involves abstraction, or choosing to focus on certain features and
ignoring others. From a Darwinian perspective, the ability to abstract based
members can diverge from the prototype [Smi88]. In some versions of this
view, the prototype does not have to be an actual thing, but may be idealized
[Ros78]. Numerous variations on the prototype approach have been proposed
[MS84], although the differences are not relevant to this discussion.
A third competing model that has received attention is the exemplar
approach [SM81]. Both the classical and prototype models view a concept as
an abstract characterization of the similarity of a set of instances. In contrast,
the exemplar view has no abstract notion of a concept. Instead, the members
of a class may be similar in different ways to other members of the class. In
fact, the similarity of any two members may not be direct, but only reflected
through a chain of similarity between pairs of intervening instances. This
view has gained popularity largely as a result of the work of George Lakoff
on the complexity of categorization and the differences in concept structures
between different cultures [Lak87].
In addition to these distinct models reflecting the presence or type of ex-
plicit abstractions in concepts, recent thinking on categorization has high-
lighted the degree to which, through human activity, concepts are "con-
structed" rather than "discovered". Historically, the classical view of con-
cepts as being well-defined was also associated with the notion that concepts
were fixed [Lak87]. Under the most extreme version of that view, concepts
were seen as having some objective existence outside of human observers,
and that the task of concept formation consisted largely of identifying these
preexisting objective differences.
That view has been largely replaced by a more subjective perspective in
which concept formation is seen as constructing abstractions that capture
useful differences among the things in the world [Lak87]. This evolution in
perspective helps to account for the observed fact that different people or
groups do not necessarily agree on a set of concepts by which to categorize
things in the world. Since what is useful differs among people and over time,
two consequences follow. First, different people may conceptualize the same
domain in different ways. Second, one's conceptualization of a domain may
change over time.
defines precisely the criteria for membership of entities in the type, data
modeling implicitly incorporates a classical view of well-defined concepts, or
a functional schema-based view of things. There is no provision in semantic
data models for degrees of membership of entities in a type. The ER model
does allow the idea of optional relationships to be defined for an entity type,
but does not treat any entities as being more or less typical instances of a
type. In addition, semantic data models do not allow an entity to be modeled
with attributes that are not attributes of the entity type to which it belongs.
Beyond this, data models do not easily accommodate varying views of
the data, or views that change over time. Semantic data models are used for
conceptual modeling of data requirements, and are typically translated to an
underlying representation, such as a relational one, for implementation. Dur-
ing conceptual modeling, multiple smaller data models may be developed for
different users in an organization. To prepare for database implementation,
these views must be combined into a global conceptual schema in a process
called view integration [NEL86]. However, the global schema is artificial in
that it does not correspond to any user's view of the domain [Par96a]. In
that sense, models, and databases developed from them, are not supportive
of a multiplicity of conceptualizations of a domain. Moreover, such models do
not easily accommodate changes in the conceptualization of a domain over
time. Indeed, many advocates of data modeling hold the view that while the
contents of a database change frequently, the schema is relatively stable over
time [CY91]. In this context, changing a database schema as a conceptual
model changes can be a very time consuming and expensive activity [LH90].
In sum, data models attempt to reflect the structure of the real world as
perceived by users. However, they do not draw their modeling constructs ex-
plicitly from ontological models of reality or cognitive models of how people
organize information about the things in their environment. In particular,
semantic models typically structure knowledge according to well-defined en-
tity types. When used as the basis for implementing a database, they do not
easily accommodate multiple or changing views. The next section considers
one line of research on the implications of basing a conceptual model on a
model of knowledge organization.
7 A Cognitive Approach to Data Modeling

The view that data modeling should be intended to represent users' knowledge of things in a domain has gained increasing attention in recent years [HM81], [Nav92], [Par96a]. While this view has typically been presented only
informally and without building on cognitive foundations, one line of research
has looked formally at building semantic information modeling constructs by
explicitly drawing on what has been learned in recent years about catego-
rization and knowledge organization.
The essence of the MIMIC model [Par96a] is the separation of instances
and classes (entity types). That is, instances (representing things) are mod-
eled independently of any classes to which they might be assigned. The basis
for this separation derives from an ontology in which things exist and classes
are regarded as a view of things according to the properties they possess
[Bun77], [WW88].
In MIMIC, instances represent things in the real world. Instances are rec-
ognized to have three kinds of properties, according to the kinds of properties
distinguished in the cognitive literature. Each of these can be represented in,
and is formally defined in, the model. Structural properties describe the state
of an instance in terms of primitive values (e.g., name, height, salary). Rela-
tional properties describe associations among instances. Behavioral proper-
ties describe the constraints or laws that determine the allowable changes in
values of structural and relational properties.
Structural, relational, and behavioral properties are not different in pur-
pose from similar notions developed in semantic and object-oriented data
models. However, in MIMIC, properties are defined in terms of instances or
sets of instances that possess them. This contrasts with the common approach
in semantic models of defining attributes, relationships, and behavior in terms
of classes of entities or objects that possess them. In those models, classes
precede properties; in MIMIC, properties precede, and are independent of,
classes.
In MIMIC, classes are defined intensionally. A class is defined by a set of
structural and relational properties. In addition, since behavioral properties
are defined in terms of constraints on changes to structural and relational
properties, a class also implies the allowable behavior of its instances. The
membership of a class is dynamic, and consists at any given time of the set
of instances that possess all the properties that define the class. Instances
in the model can acquire and lose properties at any time, and therefore can
enter or leave classes without any explicit operation to add or remove them.
This contrasts with the insert and delete operations in relational databases
(typically supported using SQL) to add and remove rows in relations.
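The separation can be sketched directly: instances are carriers of properties, a class is an intensional definition (a set of required properties), and membership is computed rather than stored. This is an illustrative reading of the model in Python, not Parsons' formal definition; all names and data are invented.

    # Instances exist independently of classes: each is simply the set of
    # properties (with values) that it currently possesses.
    instances = [
        {"name": "Anna",  "salary": 4000, "enrolled_in": "CS"},
        {"name": "Piotr", "salary": 3000},
        {"name": "Ewa",   "enrolled_in": "Math"},
    ]

    # A class is defined intensionally by the properties its members must have.
    EMPLOYEE = {"name", "salary"}
    STUDENT  = {"name", "enrolled_in"}

    def members(class_def, instance_base):
        # Membership is computed: whoever possesses all defining properties.
        return [i for i in instance_base if class_def <= i.keys()]

    print([i["name"] for i in members(EMPLOYEE, instances)])  # Anna, Piotr
    print([i["name"] for i in members(STUDENT, instances)])   # Anna, Ewa

    # Redefining a class changes membership; no instance has to be "moved".
    WORKING_STUDENT = EMPLOYEE | STUDENT
    print([i["name"] for i in members(WORKING_STUDENT, instances)])  # Anna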
The strongest contrast between MIMIC and traditional semantic models
can be seen in the area of classification. In all the models reviewed earlier,
there is no provision that entities or objects can be represented independent
of the type or class to which they belong. Such models can thus be said to
exhibit class dependence. Instances cannot be modeled except as members of
classes.
In MIMIC, however, instances (along with the properties they possess) can
and, in fact, must be modeled independent of any classification. A database
constructed according to an implementation of the MIMIC model would be
first and foremost a database of instances possessing properties. Hence, a
MIMIC representation can be said to exhibit class independence [PW97a].
Classification enters the MIMIC model as a way of facilitating access to
instances. Since users generally think of things in the domain of interest as
members of classes, it is neither natural nor cognitively reasonable to ex-
pect them to relate easily to unclassified instances. Therefore, a classification
structure in which classes are defined as sets of properties based on the con-
cepts relevant to users makes it convenient to both populate a database and
to retrieve information about instances from that database. However, unlike
databases based on traditional semantic models, the class structure does not
form the basis for structuring data in the implementation. Instead, structur-
ing is based on instances and their properties.
The treatment of classification in MIMIC points out another significant
departure from traditional semantic models. Since there is no classification
implied by the underlying instance/property orientation, multiple class struc-
tures can independently exist on top of an underlying instance collection.
Each of these structures can coexist, providing independent views of (portions of) the instance base corresponding to the interests and needs of different
groups of users. Integration of these local schemas or views is not necessary
for developing the underlying database. Instead, the local views can be pre-
served to provide various "windows" to the data.
Under such an instance-based data model, several difficult issues in data modeling are either non-existent or solved [PW00]. First, schema evolution is merely a matter of redefining classes, that is, of adding attributes and relationships to class definitions or dropping them. In contrast with conventional approaches,
instances do not need to be "moved" from one class to another as definitions
change. Second, view integration does not have to be performed. Each view
can stand alone and serve as the basis for accessing instances relevant to
specific users to whom the view is meaningful. Since view integration is a
time-consuming and difficult activity, this makes the database design process
easier.
If used as the foundation for a DBMS implementation, the model also
resolves several problems in databases that arise from the class-based models
8 Research Directions
Recent areas of intensive research in the database field have not focused
on data modeling. However, there are reasons to believe that cognitive ap-
proaches to data modeling can inform other questions of interest in the field.
As outlined above, adopting a pluralistic view of classification, as in the
MIMIC model, promises to help deal with the complexity of combining infor-
mation from existing independent and heterogeneous data sources. As cor-
porations merge and business becomes more global, and as valuable new
databases are made available over the Internet, the need to combine existing
databases has never been greater. This has sparked a great interest in re-
search on database integration [SL90]. However, existing research has drawn almost exclusively on a "class-based" data modeling paradigm such as that embodied in the models reviewed earlier.
References
[ACO85] Albano, A., Cardelli, L., Orsini, R., Galileo: a strongly-typed inter-
active conceptual language, ACM Transactions on Database Systems
10(2), 1985, 230-260.
[BCG+87] Banerjee, J., Chou, H.-T., Garza, J., Woelk, D., Ballou, N., Kim, H.-
J., Data model issues for object-oriented applications, ACM Trans-
actions on Office Information Systems 5(1), 1987, 3-26.
[Ken78] Kent, W., Data and reality: basic assumptions in data processing re-
considered, North-Holland, Amsterdam, 1978.
[Lak87] Lakoff, G., Women, fire, and dangerous things: what categories reveal
about the mind, University of Chicago Press, Chicago, IL, 1987.
[LH90] Lerner, B.S., Habermann, A.N., Beyond schema evolution to database
reorganization, Proc. Conference on Object-Oriented Programming
Systems, Languages, and Applications / European Conference on
Object-Oriented Programming (ECOOP/OOPSLA '90), 1990, 67-76.
[MS84] Medin, D.L., Smith, E.E., Concepts and concept formation, Annual
Review of Psychology 35, 1984, 113-138.
[Myl91] Mylopoulos, J., Conceptual Modeling and Telos, P. Loucopoulos, R.
Zicari (eds.), Conceptual modeling, databases, and CASE: an inte-
grated view of information systems development, McGraw-Hill, New
York, 1991.
[Nav92] Navathe, S.B., Evolution of data modeling for databases, Communi-
cations of the ACM 35(9), 1992, 112-123.
[NEL86] Navathe, S.B., Elmasri, R., Larson, J., Integrating user views in
database design, IEEE Computer, June 1986, 50-62.
[Nij76] Nijssen, G., A gross architecture for the next generation database
management systems, G. Nijssen (ed.), Modelling in Database Man-
agement Systems, North-Holland, 1976, 1-24.
[Par96a] Parsons, J., An information model based on classification theory,
Management Science 42(10), 1996, 1437-1453.
[Par96b] Parsons, J., An experimental investigation of local versus global
schemas in conceptual data modeling, Proc. 6th Workshop on Infor-
mation Technologies and Systems (WITS'96), Cleveland, OH, 1996,
61-70.
[PW97a] Parsons, J., Wand, Y., Choosing classes in conceptual modeling, Com-
munications of the ACM 40(6), 1997, 63-69.
[PW97b] Parsons, J., Wand, Y., Using objects for systems analysis, Commu-
nications of the ACM 40(12), 1997, 104-110.
[PW00] Parsons, J., Wand, Y., Emancipating instances from the tyranny of
classes in information modeling, ACM Transactions on Database Sys-
tems 25(2), 2000, 228-268.
[PM88] Peckham, J., Maryanski, F., Semantic data models, ACM Computing
Surveys 20(3), 1988, 153-189.
[Qui68] Quillian, R., Semantic Memory, M. Minsky (ed.), Semantic Informa-
tion Processing, MIT Press, Cambridge, MA, 1968.
[RB99] Ramesh, V., Browne, G.J., Expressing causal relationships in concep-
tual database schemas, Journal of Systems and Software 45, 1999,
225-232.
[Ros78] Rosch, E., Principles of Categorization, E. Rosch, B. Lloyd (eds.),
Cognition and categorization, Erlbaum, Hillsdale, NJ, 1978, 27-48.
[RBP+91] Rumbaugh, J., Blaha, M., Premerlani, W., Eddy, F., Lorensen, W.,
Object-oriented modeling and design, Prentice-Hall, Englewood Cliffs,
NJ, 1991.
[RN88] Rumelhart, D., Norman, D., Representations in memory, Stevens'
Handbook of Experimental Psychology (vol. 2): Representations in
Memory, 1988, 511-587.
[Sha84] Shaw, M., The impact of modelling and abstraction concerns on mod-
ern programming languages, in [BMS84], 1984, 49-78.
[Shi81] Shipman, D.W., The functional data model and the data language
DAPLEX, ACM Transactions on Database Systems 6(1), 1981, 140-
173.
[SL90] Sheth, A., Larson, J., Federated database systems for managing dis-
tributed, heterogeneous, and autonomous databases, ACM Comput-
ing Surveys 22(3), 1990, 184-236.
[SM81] Smith, E.E., Medin, D.L., Categories and concepts, Harvard Univer-
sity Press, Cambridge, MA, 1981.
[Smi78] Smith, E.E., Theories of semantic memory, W.K. Estes (ed.), Hand-
book of Learning and Cognitive Processes, vol. 6, Erlbaum, Hillsdale,
NJ, 1978, 1-56.
[Smi88] Smith, E.E., Concepts and thoughts, R. Sternberg, E.E. Smith (eds.),
The Psychology of Human Thought, Cambridge University Press,
Cambridge, England, 1988.
[SS77] Smith, J.M., Smith, D.C.P., Database abstractions: aggregation and
generalization, ACM Transactions on Database Systems 2(2), 1977,
105-133.
[Sow76] Sowa, J.F., Conceptual graphs for a database interface, IBM Journal
of Research and Development 20(4), 1976.
[Sto91] Storey, V.C., Meronymic relationships, Journal of Database Admin-
istration 2(3), 1991, 22-35.
[Sto93] Storey, V.C., Understanding semantic relationships, VLDB Journal
2(4), 1993, 455-488.
[Teo90] Teorey, T.J., Database modeling and design: the entity relationship
approach, Morgan Kaufmann, 1990.
[TYF86] Teorey, T.J., Yang, D., Fry, J.P., A logical design methodology for
relational databases using the Extended Entity-Relationship Model,
ACM Computing Surveys 18(2), 1986, 197-222.
[TL82] Tsichritzis, D.C., Lochovsky, F.H., Data models, Prentice-Hall, En-
glewood Cliffs, NJ, 1982.
[Wan89] Wand, Y., A proposal for a formal model of objects, W. Kim, F.
Lochovsky (eds.), Object-Oriented Concepts, Databases, and Applica-
tions, Addison-Wesley, Reading, MA, 1989, 537-559.
[WW88] Wand, Y., Weber, R., An ontological analysis of some fundamental
information systems concepts, Proc. 9th International Conference on
Information Systems, Minneapolis, MN, 1988, 213-225.
[WW93] Wand, Y., Weber, R., On the ontological expressiveness of informa-
tion systems analysis and design grammars, Journal of Information
Systems, 1993, 217-237.
[WW95] Wand, Y., Weber, R., Towards a deep structure theory of information
systems, Journal of Information Systems, 1995, 203-223.
[WSW99] Wand, Y., Storey, V., Weber, R., An ontological analysis of the re-
lationship construct in conceptual modeling, ACM Transactions on
Database Systems 24(4), 1999, 494-528.
[WWW90] Wirfs-Brock, R., Wilkerson, B., Wiener, L., Designing object-oriented
software, Prentice-Hall, Englewood Cliffs, NJ, 1990.
4. Object-Oriented Database Systems
Abstract. This section introduces the reader to object-oriented databases. After a brief motivation, in which we assess the disadvantages of relational database technology and the advantages of object-oriented technology, we introduce object-oriented modeling. The main modeling constructs are discussed and illustrated by examples. We then give an introduction to OQL, ODMG's query language for object-oriented databases. Finally, we turn to technical issues such as physical object management, architectures of client-server systems, indexing, dealing with set-valued attributes, and optimizing OQL.
Before introducing the object-oriented data model we will first assess the
shortcomings of the relational data model. In order to illustrate our discus-
sion, we will consider the following data modeling task: boundary representa-
tion of solid geometric objects. The conceptual schema of a simple boundary
representation of polyeder (i.e., polyhedron) objects is graphically depicted in Figure 1.1.
The schema consists of four entity sets: Polyeder modeling the highest
level abstraction of a solid geometric object; Faces modeling the outer hull
of a Polyeder in the form of polygons; Edges, which represent the boundaries
of the polygons; and finally Vertices, which contain the metric information
in the form of X, Y, Z coordinates. The four entity sets are associated by
three relationship types: Hull, Boundary, and StartEnd. We assume that distinct Polyeders have distinct Faces, which makes the relationship Hull 1 : N, whereas Boundary and StartEnd are N : M. In the ER-diagram we specify the cardinalities of these relationships
more precisely using the so-called (min,max)-notation. For example, every
edge of a polyeder bounds exactly two faces; every vertex is associated with
at least three edges; and every edge is bounded by two vertices.
Exploiting these cardinalities we can derive the concise relational database
schema shown in Figure 1.1b. The depicted database extension includes some of the tuples representing a geometric object of type cuboid, which is identified within the Polyeder relation by "cubo#5". The relationships Boundary and StartEnd are both represented by foreign keys in the relation Edges:
Polyeder:
PolyID     weight    material   ...
cubo#5     25.765    iron
tetra#7    37.985    glass

Faces:
FaceID    PolyID     surface
f1        cubo#5     ...
f2        cubo#5     ...
...       ...        ...
f6        cubo#5     ...
f7        tetra#7    ...

Edges:
EdgeID    F1    F2    V1    V2
e1        f1    f4    v1    v4
e2        f1    f2    v2    v3
...

Vertices:
VertexID    X      Y      Z
v1          0.0    0.0    0.0
v2          1.0    0.0    0.0
...
One consequence is that artificial, globally unique key values are needed because different application objects are mapped onto the same relational schema. For example, the FaceID values have to be unique within the entire relation Faces; i.e., no other faces may assume the identifier values "f1", ..., "f6", which are assigned to the Faces belonging to the Polyeder "cubo#5".
Lack of data abstraction. The relational model has only one very simple structuring concept, the relation. In advanced application domains, more heterogeneous structures occur. A complex object may be composed of a variety of differently structured subobjects - as, for example, a Polyeder in boundary representation. A natural (and user-friendly) representation of such complex objects demands more sophisticated abstraction mechanisms than the relational model offers. In particular, aggregation of different part-objects to form a higher-level composite object, and type hierarchies to support the concepts of generalization and specialization, should be integrated into the data model - although they are not supported in the pure relational model.
An application object comprises two parts:
1. The structural representation, which models the current state of the application object.
2. The behavioral specification, which consists of an interface of operations by which the object can be queried and modified.
[Figure: applications A and B each implement their own transformations (e.g., Transf. TA) on top of the relational database.]
In the relational world, operations on the application objects have to be coded as part of the application program. This has the disadvantage that the database system cannot serve as a repository for the operations. This makes the sharing of application-specific operations among different applications, say, applications A and B, difficult. In practice, one often experiences that the same operations are multiply coded by different application programmers, as exemplified in the graphic.
[Figure: with an object-oriented database, operations such as scale, volume, translate, and rotate are stored in the database itself; applications A and B invoke them directly, e.g., someCuboid->rotate('x', 10) or w := someCuboid->weight().]
[Figure: the ODMG Object Model, with language bindings for C++, Java, and Smalltalk.]
[Fig. 2.1. Conceptual schema (ER diagram) of the university database. Entity types with attributes: Students (StudentID: integer, Name: string, Semester: integer); Courses (CourseNo: integer, Title: string, Duration: integer); Exams (ExDate: date, Grade: number); Professors (Rank: string) and Assistants (Expertise: string), both specializations of Employees (SS#: integer, Name: string, BirthDate: date); Rooms (RoomNo: integer, Size: integer). Relationships: enrolled (M : N between Students and Courses); taking and contents (connecting Exams with Students and Courses); giving and teaching (connecting Professors with Exams and Courses); worksFor (N : 1 between Assistants and Professors); office (1 : 1 between Professors and Rooms).]
[Figure: example objects of the university schema.
id1 (Professors): SS#: 2137, Name: "Knuth", Rank: "full", residesIn: id9, givenExams: {...}, teaches: {id2, id3}
id2 (Courses): CourseNo: 5001, Title: "Foundations", Duration: 4, taughtBy: id1, enrollment: {...}, successors: {...}, predecessors: {...}
id3 (Courses): CourseNo: 4630, Title: "The Art", Duration: 4, taughtBy: id1, enrollment: {...}, successors: {...}, predecessors: {...}]

[Figure: the 1 : 1 relationship office between Professors and Rooms.]
class Professors {
    attribute long SS#;
    ...
    relationship Rooms residesIn inverse Rooms::occupiedBy;
};

class Rooms {
    attribute long RoomNo;
    attribute short Size;
    relationship Professors occupiedBy inverse Professors::residesIn;
};
Thus we have defined the relationship office of Figure 2.1 in both "directions"
- from Professors via residesln to Rooms as well as vice versa from Rooms
via occupiedBy to Professors. For a very small part of a university database
the example objects are shown in Figure 2.3.
id9 (Rooms): RoomNo: 007, Size: 18, occupiedBy: id1
id1 (Professors): SS#: 2137, Name: "Knuth", Rank: "full", residesIn: id9, givenExams: {...}, teaches: {...}
Fig. 2.3. Example objects illustrating the symmetry of relationships
[Figure: the 1 : N relationship teaches/taughtBy between Professors and Courses.]
class Professors {
    ...
    relationship set<Courses> teaches inverse Courses::taughtBy;
};

class Courses {
    ...
    relationship Professors taughtBy inverse Professors::teaches;
};
[Figure: the N : M relationship enrolled/enrollment between Students and Courses.]
Now, the relationship is represented by set-valued relationships in both
object types:
class Students {
    ...
    relationship set<Courses> enrolled inverse Courses::enrollment;
};

class Courses {
    ...
    relationship set<Students> enrollment inverse Students::enrolled;
};
[Figure: the recursive N : M relationship successor/predecessor on Courses.]

Taking all the relationships of Courses - with Professors, Students, Exams, and Courses itself - together, the complete definition of the object type reads:
class Courses {
    attribute long CourseNo;
    attribute string Title;
    attribute short Duration;
    relationship Professors taughtBy inverse Professors::teaches;
    relationship set<Students> enrollment inverse Students::enrolled;
    relationship set<Courses> successors inverse Courses::predecessors;
    relationship set<Courses> predecessors inverse Courses::successors;
    relationship set<Exams> examinedIn inverse Exams::contents;
};
class Students {
    attribute long StudentID;
    attribute string Name;
    attribute short Semester;
    relationship set<Courses> enrolled inverse Courses::enrollment;
    relationship set<Exams> takenExams inverse Exams::takenBy;
};
[Figure: overview of the relationships of the university schema: Professors - residesIn/occupiedBy - Rooms; Students - enrolled/enrollment - Courses; Students - takenExams/takenBy - Exams; Professors - givenExams/givenBy - Exams; Professors - teaches/taughtBy - Courses; Exams - contents/examinedIn - Courses; Courses - predecessors/successors - Courses.]

Type properties: extents and keys. The extent constitutes the set of all instances of a particular object type. (Further on, we will see that an extent also includes all instances of direct and indirect subtypes of the object type.) The extent of an object type can
serve as an anchor for queries, such as "find all Professors whose Rank is associate".
The ODMG model makes it possible to specify that an extent is automatically maintained: newly created objects are implicitly inserted into the corresponding extent(s), and deleted objects are removed from the extent.
Furthermore, the ODMG model allows a set of attributes to be declared as keys. The system automatically ensures the uniqueness of these keys throughout all objects in the object type's extent.
Extents and keys are object type properties because they are globally
maintained (enforced) for all instances of the object type. In contrast, the
attributes and relationships specified in the type definition are instance prop-
erties because they are associated with every individual object.
Let us illustrate these two type properties on our example object type
Students:
class Students (extent AllStudents key StudentID) {
    attribute long StudentID;
    attribute string Name;
    ...
};
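The contract behind these two type properties can be illustrated with a small C++ sketch (not the ODMG binding; all names are invented): every constructed object registers itself in the type's extent, and the key value is checked for uniqueness across that extent.

#include <set>
#include <stdexcept>

class Students {
public:
    explicit Students(long studentID) : id(studentID) {
        if (!keys.insert(id).second)           // enforce the key StudentID
            throw std::invalid_argument("duplicate StudentID");
        extent.insert(this);                   // maintain the extent
    }
    ~Students() {                              // deleted objects leave the extent
        extent.erase(this);
        keys.erase(id);
    }
    static const std::set<Students*>& allStudents() { return extent; }
private:
    long id;
    inline static std::set<Students*> extent;  // cf. the extent AllStudents
    inline static std::set<long> keys;         // key values currently in use
};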
class Professors {
    exception hasNotYetGivenAnyExams { };
    exception alreadyFullProf { };
    ...
};
A subtype is defined by specifying which additional properties are provided in the subtype. The subtype inherits all the properties of all of its (direct and indirect) supertypes. Inheritance covers not only structural properties (attributes and relationships) but also behavior. Thus, a subtype always has a superset of the supertype's properties. This way object-oriented models can safely allow the so-called substitutability: a subtype instance can always be substituted at places where a supertype instance is expected. Substitutability is the key factor in achieving a high degree of flexibility and expressive power in object models.
[Figure 2.7 shows a type hierarchy in which Type2 is a subtype of Type1 and Type3 a subtype of Type2, together with example instances: id1 of Type1 with attribute A; id2 of Type2 with attributes A and B; id3 of Type3 with attributes A, B, and C. The extents are drawn as nested sets, ExtType3 within ExtType2 within ExtType1.]
The extents are named ExtType1, ExtType2, and ExtType3, respectively. The different sizes of the elements of the particular extents were chosen to visualize the inheritance: objects of a subtype contain more information/behavior than objects of a supertype.
Fig. 2.7. Illustration of subtyping
• single inheritance: Every object type has at most one direct supertype.
[Figure: under single inheritance, the object types form a hierarchy rooted in the common supertype Object.²]
² Actually, in ODMG the root of the type hierarchy of all durable objects is called d_Object.
type hierarchy. An object type inherits only along this one unique path. For example, for object type OTn this unique path is:

OTn is-a OTn-1 is-a ... is-a OT1 is-a Object
Let us now move from abstract examples to a more practical, though small
example type hierarchy within our university administration. Employees of
a university can be specialized to Professors and Lecturers. This yields the
type structure shown in Figure 2.9. In ODL these types are defined as follows:
class Employees (extent AllEmployees) {
    attribute long SS#;
    attribute string Name;
    attribute date BirthDate;
    short Age();
    long Salary();
};

class Professors extends Employees (extent AllProfessors) {
    attribute string Rank;
    long Salary();    // refined
};

class Lecturers extends Employees (extent AllLecturers) {
    long Salary();    // refined
};

[Figure 2.9 shows the features SS#, Name, BirthDate, Age(), and Salary() defined in Employees and inherited by Professors and Lecturers, which refine Salary().]
Fig. 2.9. Inheritance of object properties (dotted ovals contain inherited features,
italicized operations are refined)
The figure illustrates why substitutability works: the Professors have all the "knowledge" that Employees have and can therefore safely be substituted in any context (operation argument, variable assignment, etc.) where Employees are expected. Likewise, Lecturers are substitutable for Employees. This inclusion polymorphism is highlighted in Figure 2.10, which shows that all Professors and all Lecturers are also contained in the extent of Employees.
[Fig. 2.10. Inclusion polymorphism: the extents AllProfessors and AllLecturers are contained in the extent AllEmployees.]
When an operation is invoked on an object, the implementation associated with the object's most specific type is determined, and it is this coding that is bound and executed. This procedure implies that every object (logically) knows its most specific type, i.e., the type from which it was instantiated.
For our example type hierarchy the determination of the most specific im-
plementation of Salary() is trivial because every type has its own specialized
implementation:
• For the object id1 the Professors-specific Salary()-computation is bound;
• for object id11 the implementation specialized for Lecturers is executed;
and
• for the object identified by id7 the most general implementation defined
in type Employees is dynamically bound.
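The dispatch rule can be mirrored in a short C++ sketch (illustrative only; the salary figures are invented): each type overrides a virtual Salary(), and a call through an Employees handle binds the most specific implementation.

#include <iostream>
#include <memory>
#include <vector>

struct Employees {
    virtual ~Employees() = default;
    virtual long Salary() const { return 2000; }   // most general implementation
};
struct Professors : Employees {
    long Salary() const override { return 6000; }  // refined for Professors
};
struct Lecturers : Employees {
    long Salary() const override { return 4000; }  // refined for Lecturers
};

int main() {
    std::vector<std::unique_ptr<Employees>> all;
    all.push_back(std::make_unique<Professors>());  // cf. id1
    all.push_back(std::make_unique<Lecturers>());   // cf. id11
    all.push_back(std::make_unique<Employees>());   // cf. id7
    for (const auto& e : all)
        std::cout << e->Salary() << '\n';           // 6000, 4000, 2000
}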
[Figure: the object type TAs inherits from Students and additionally implements the interface EmployeeIF, which Employees implements as well.]
interface EmployeeIF {
    short Age();
    long Salary();
};
class Employees: EmployeeIF (extent AllEmployees) {
attribute long SS#;
attribute string Name;
attribute date BirthDate;
};
class TAs extends Students: EmployeeIF (extent AllTAs) {
attribute long SS#;
attribute date BirthDate;
attribute short WorkLoad;
};
Let us concentrate on the object type TAs: It inherits all the features
(properties and behavior) of type Students and, in addition, it implements
the interface EmployeeIF. This makes TAs substitutable in any context where
Students or EmployeeIF objects are expected. However, TAs cannot be sub-
stituted for Employees because the two types are unrelated - in our example
model.
The simplest OQL queries consist of just a literal or a named database object. The literals

5
"Jeff"

are already perfect queries, returning the values 5 and "Jeff", respectively. If a named object Dean exists, then

Dean

is a query returning the dean object, and

Dean.spouse.name

retrieves the name of the spouse of the dean via a path expression. The query

Dean.subordinates

returns a collection-valued result, the dean's subordinates.
Assume that the spouse attribute of the Dean is not defined, i.e., it contains a nil value. Then, the path expression
Dean.spouse.name
produces the special value UNDEFINED. In general, any property of the nil
object is UNDEFINED. Any comparison (e.g. with =, <) produces false if
at least one of the compared values is UNDEFINED. There exists a special
function is_undefined to check whether some value is undefined. By
is_undefined(Dean.spouse.age)
we could check whether the path expression returns a legal result. Applying
any function other than a comparison operator on an undefined value results
in a run-time error. Hence, the query
Dean.spouse.age + 5
will result in a run-time error.
An SFW (select-from-where) query over the students may look as follows:

select s
from Student s
where s.year = 1

It returns a bag containing all first-year students. Duplicates are eliminated by adding distinct; the result is then a set:

select distinct s
from Student s
where s.year = 1

OQL offers three equivalent notations for binding a range variable in the from clause:

select distinct s
from Student s
where s.year = 1

select distinct s
from s in Student
where s.year = 1

select distinct s
from Student as s
where s.year = 1
Quantifiers can also occur within the where clause, where they play the role of a selection predicate. The following query retrieves all students that passed the database course:

select s
from s in Student
where for all c in (select c
                    from c in Course
                    where c.title = "database") :
      c in s.passedCourses()
This query is a little awkward. Another possibility is to use the subset predicate to verify that the set of database courses is a subset of the passed courses. This saves the universal quantifier. In fact, universal quantifiers can be replaced by a subset predicate and vice versa. The alternative formulation of the query is:

select s
from s in Student
where (select c
       from c in Course
       where c.title = "database")
      <= s.passedCourses()
In OQL, <= denotes the subset predicate. Likewise, >= denotes the superset
predicate, = denotes set equality. The comparison operators < and > can be
used if we test for strict subsets or strict supersets.
As can be seen from the above queries, SFW-blocks can occur nested within SFW-blocks. In fact, in OQL SFW-blocks can be nested anywhere (in the select, from, and where clause) as long as the typing rules are obeyed. The following three queries demonstrate nesting in different places. The first one nests an SFW-block in the select clause:

select struct(name: s.name,
              passed: (select c.title
                       from c in Course
                       where c in s.passedCourses()))
from s in Student
where s.name = "Smith"

This query retrieves students named "Smith" and, for each such student, the names of all courses passed. Note that a reference to the student s occurs in the inner block nested in the select clause. It occurs in the so-called correlation predicate c in s.passedCourses() that correlates a course with the passed courses of a student.
Often it is necessary to select the best. For example, we would like to query
the best students. By definition, the best students are those with the highest
gpa. The following query retrieves the best students by applying nesting in
the where clause:
select s
from s in Student
where s.gpa = max(select s.gpa
                  from s in Student)
Additionally, the query demonstrates a typical application of an aggregate
function (max) and shows a block without a where clause. The where clause
is optional and can be omitted, not only in nested blocks.
The next query demonstrates nesting in the from clause. Nesting in the from clause is a convenient means to restrict a variable's range:

select s
from s in (select s
           from s in Student
           where s.gpa > 10)
where s.supervisor = dean
The nested query retrieves all students whose gpa is greater than 10. From these, those students are selected whose supervisor is the dean. Obviously, the query could be stated much more simply by applying the boolean connective and, as done in the next query:
select s
from s in Student
where s.supervisor = dean and s.gpa > 10
Besides and, the other boolean connectives or and not are available in OQL.
The usual boolean expressions can be built from base predicates and the
boolean connectives. They can be used stand-alone or as selection predicates
in the where clause.
Sometimes it is preferable to express a query with collection operations
instead of boolean connectives. The last query can equivalently be stated as
follows:
select s
from s in Student
where s.supervisor = dean
intersect
select s
from s in Student
where s.gpa > 10
where intersect denotes set intersection. The other supported collection operations are union and except. The latter denotes set difference and is applied in the following query:
select s
from s in Student
where s.gpa > 10
except
select s
from s in Student
where s.supervisor = dean
The query retrieves the good students not supervised by the dean.
Grouping in OQL looks a little more complex than in SQL. Consider the query below. It is evaluated as follows. First, the from clause and the where clause of the SFW-block involved in the query are evaluated. Typically, this can be done by taking the cross product of the result of the expressions in the from clause. Then, the selection predicate of the where clause is evaluated. Second, this result is split into different partitions. For each partition, there will be one tuple in the output of the query. All but one of the attributes correspond to the properties used for grouping in the group by clause. The last attribute is always called partition and is collection-valued. Each collection contains the result elements from the second step that belong to the corresponding partition. Third, unwanted groups can be eliminated by a predicate given in the having clause.
Let us consider an example. We want to group students into good,
mediocre, and bad students according to their gpa. The following query does
exactly this:
select *
from s in Student
group by good: s.gpa >= 10
mediocre: s.gpa < 10 and s.gpa >= 5
bad: s.gpa < 5
In order to understand this query, it is useful to look at the result type. The result type is:

set<struct(good: bool, mediocre: bool, bad: bool,
           partition: bag<struct(s: Student)>)>
The result contains three tuples, whose values for the first three attributes are:

(true, false, false), (false, true, false), (false, false, true)

This is due to the fact that for the partition attributes (good, mediocre, bad) only three possible value assignments exist: the expressions following the partition attributes are boolean expressions such that for every student exactly one predicate results in true. Of course, such a restriction does not exist in OQL, but queries of the above kind are rather typical.
For each combination of values for the partition attributes, the students
exhibiting this value combination are collected in the partition attribute.
Grouping all students by their advisor results in more than a single value
for the partition attribute:
select *
from s in Student
group by adv: s.advisor
Restricting the result tuples to those whose advised student group has a good
gpa is managed by applying a having predicate:
select *
from s in Student
group by adv: s.advisor
having avg(select p.s.gpa from p in partition) >= 10
Results - with or without grouping - can be ordered by applying the order by clause. Assume that in the above query we would like to retrieve the average gpa and order the result by decreasing average gpa. This can be done as follows:

select adv, avgGpa: avg(select p.s.gpa from p in partition)
from s in Student
group by adv: s.advisor
order by avgGpa desc

The annex desc states that we want to order by descending average gpa. Ordering by an increasing value is specified by asc, which is also the default if no order specifier is given. In general, a list of expressions can be used as an order specification.
3.7 Views
OQL supports simple views and views with parameters that behave like func-
tions, often returning a collection. For example, if we are often interested in
good students, we might define a view GoodStudent:
define GoodStudent as
select s
from s in Student
where s.gpa >= 10
and to refer to them in another query:
select s.name
from s in GoodStudent
where s.age = 25
Views are persistent. That is, they are stored permanently in the schema
until they are explicitly deleted:
delete definition GoodStudent
In OQL, views are not called views but they are called named queries. A
named query can take parameters. An example is a named query that re-
trieves good students where the measure of what's good and what's not is
given as a parameter:
define GoodStudent(goodGPA) as
select s
from s in Student
where s.gpa >= goodGPA
The syntax for referencing named queries with parameters is the same as the
syntax for function calls:
select s.name
from s in GoodStudent(10)
where s.age = 25
It is important to note that names for named queries cannot be overloaded.
Whenever the same name occurs for a named query, the old definition is
overwritten.
3.8 Conversion
OQL provides for a couple of conversions. A collection can be turned into a
single element by the element operator. If the argument of element contains
more than a single element, element raises an exception. For example,
element(select s from s in Student where s.name = "Smith")
results in an exception, if there is more than one student named "Smith";
otherwise the single student named "Smith" is returned.
Other conversion operators are concerned with the conversion between different collection types; for instance, a list can be converted into a set, and nested collections can be flattened.
3.9 Abbreviations

OQL contains a couple of possible abbreviations - or syntactic sugar - that make OQL look more like SQL. The first important construct allows one to omit the explicit construction of tuples in the select clause. OQL allows for queries with multiple entries in the select clause:

select p1, ..., pn
from ...
where ...

where the pi are projections of one of the forms:

1. expression_i as identifier_i
2. identifier_i : expression_i
3. expression_i

Such a query is equivalent to:

select struct(identifier1: expression1, ..., identifiern: expressionn)
from ...
where ...

In the third case, an anonymous identifier is chosen by the system.
Let us consider an example query where we want to select the names and
ages of all good students:
select s.name, s.age
from Student s
where s.gpa > 10
This query does not look different from an SQL query. If we want to give
names to the projected expressions, we write:
select s.name as studentName, s.age as studentAge
from Student s
where s.gpa > 10
A query of the form

select aggr(expression)
from ...
where ...

with an aggregate function aggr (one of min, max, count, sum, avg) translates into:

aggr(select expression
     from ...
     where ...)

For example,

select count(*)
from ...
where ...

translates into:

count(select *
      from ...
      where ...)
The same abbreviations apply to SFW-blocks exhibiting a distinct.
SQL allows one to compare a single value via a comparison operator (=, <, <=, ...) and a quantifier (some, all) with a whole set of elements. The same applies to OQL. For example,
select s
from Student s
where s.gpa >= all (select s1.gpa
from Student s1)
is a perfect OQL query. It retrieves all students whose gpa is greater than or equal to every gpa found for students. This query is equivalent to:
select s
from Student s
where for all s1 in Student:
s.gpa >= s1.gpa
For a comparison operator θ ∈ {=, <, >, <=, >=, !=}, the predicate

expression θ some set

is equivalent to the existentially quantified predicate

exists x in set : expression θ x
[Figure: physical OIDs are composed of a page number and a slot within that page. For example, the OID 4711:3 refers to slot 3 of page 4711, where the course object (5001, "Foundations", ...) resides; page 4812 holds, e.g., the course object ("Mathematical Logic for CS", ...).]
Logical object identifiers. Logical OIDs do not contain the object address and are thus location independent. To find an object by OID, however, an additional mapping structure is required to map the logical OID to the physical address of the object. If an object is moved to a different address, only the entry in the mapping structure is updated. In the following, we describe three data structures for the mapping. [EGK95] give details and a performance comparison.
Mapping with a B+-tree. The logical OID serves as key to access the tree
entry containing the actual object address (cf. Figure 4.2a). In this graphic,
the letters represent the logical OIDs and the numbers denote the physical
address of the corresponding object (e.g., the object identified by a is stored
at address 6). Here, we use simplified addresses; in a real system the address
is composed of page identifier and location within that page - like physical
OIDs. For each lookup, the tree is traversed from the root. Alternatively, if
a large set of sorted logical OIDs needs to be mapped, a sequential scan of
the leaves is possible. Shore [CDF+94] and (presumably) Oracle8 [LMB97]
are systems employing B-trees for OID mapping.
[Fig. 4.2. OID mapping (a) with a B+-tree and (b) with a hash table; both structures store the pairs (a,6), (b,2), (c,3), (d,7), (e,5), (f,8), (g,9), (h,4), (i,1) of logical OID and physical address.]
Mapping with a hash table. The logical OID is used as key for a hash table lookup to find the map entry carrying the actual object address (cf. Figure 4.2b). For example, Itasca [Ita93] and Versant [Ver97] implement OID mapping via hash tables.
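A minimal C++ sketch of this indirection (illustrative only; a real system would keep the map in a persistent B+-tree or hash table) shows why relocation is cheap - moving an object touches a single map entry while all logical references stay valid:

#include <cstdint>
#include <optional>
#include <unordered_map>

using OID = std::uint64_t;                    // logical, location-independent
struct Address { std::uint32_t page; std::uint16_t slot; };

class OidMap {
public:
    void bind(OID oid, Address a) { map[oid] = a; }
    std::optional<Address> lookup(OID oid) const {
        auto it = map.find(oid);
        if (it == map.end()) return std::nullopt;
        return it->second;
    }
    // Relocation: only the entry in the mapping structure is updated.
    void move(OID oid, Address newAddr) { map[oid] = newAddr; }
private:
    std::unordered_map<OID, Address> map;
};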
Direct mapping. The logical OID constitutes the address of the map entry that in turn carries the object's address. In this respect, the logical OID can
[Fig. 4.3. Naive pointer join with logical OIDs: the relation R with references Sref (r1: b, r2: e, r3: c, r4: g, r5: i, r6: d, r7: a, r8: c, r9: h, r10: i) is joined with S by dereferencing each Sref value through the Map (a: 6, b: 2, c: 3, d: 7, e: 5, f: 8, g: 9, h: 4, i: 1) to the address of the corresponding S object with its S_Attr value (1: i/17, 2: b/11, 3: c/19, 4: h/13, 5: e/18, 6: a/12, 7: d/10, 8: f/14, 9: g/15).]
³ If a hash table is used for implementing the Map, the same hash function used for the hash table has to be applied on the Sref value before applying the partitioning function.
[Fig. 4.4. Partition-based pointer join with logical OIDs: R is split by a partitioning function hM into partitions R1 and R2; each partition is mapped via the corresponding portion M1, M2 of the Map and then merged with the matching partition S1, S2 of S.]
Again, one can formulate a generic query on our abstract schema and another one based on the university schema; the latter combines student information with the titles of the courses they are enrolled in.
[BCK98] proposed the partition/merge-algorithm for evaluating such
functional joins along set-valued relationships. It is an adaptation of the
above partition-algorithm. It flattens the R objects but it retains the group-
ing of the flattened R objects across an arbitrary number of functional joins.
This is achieved by interleaving partitioning and merging in order to retain
(very cheaply) the grouping after every intermediate partitioning step. This
is captured in the notation P(PM)* M. We will describe the basic idea of the
algorithm by way of an example.
Figure 4.5 shows a concrete example application of the P(PM)*M-algorithm with two partitioning steps. The tables Ri, RMij, and RMSj are labeled by a disk symbol to indicate that these temporary partitions are stored on disk.
We start with the extent R containing two objects with logical OIDs r1 and r2 - for simplicity, any additional R attributes are omitted. The set-valued relationship SrefSet contains sets of references (logical OIDs) to S. The first processing step flattens these sets and partitions the stream of flat objects N-way. In our example, the partitioning function hM is 2-way and maps {a, ..., d} to partition R1 and {e, ..., i} to partition R2. The next
[Fig. 4.5. Example application of the P(PM)*M-algorithm with two partitioning steps: the extent R (r1: {b, e, c, g, i}, r2: {a, d, c, h, i}) is flattened and partitioned into R1 and R2; mapping and repartitioning yields the partitions RM11, RM12, RM21, RM22; merging and dereferencing yields RMS1 and RMS2, which are finally merged and nested into the result RMS (r1: {11, 19, 17, 18, 15}, r2: {19, 13, 17, 12, 10}).]
processing step starts with reading R1 from disk, maps the logical OIDs in attribute Sref to object addresses using the portion M1 of the Map (note that the Map is not explicitly partitioned) and in the same step partitions the object streams K-way with partitioning function hS (in our example a 2-way partitioning was assumed, and hS maps {1, ..., 4} to partition 1 and {5, ..., 9} to partition 2). The resulting partitions RM1j (here 1 ≤ j ≤ 2) are written to disk. Processing then continues with partition R2, whose objects are partitioned into RM2j (1 ≤ j ≤ 2). The fine-grained partitioning into the N * K (here 2 * 2) partitions is essential to preserve the order of the flat R objects belonging to the same R object. The subsequent merge scans N (here 2) of these partitions in parallel in order to re-merge the fine-grained partitioning into the K partitions needed for the next functional join step. Skipping the fine-grained partitioning into N * K partitions and, instead, partitioning RM into the K partitions right away would not preserve the ordering of the R objects. In detail, the third phase starts with merging RM11 and RM21 and simultaneously dereferencing the S objects referred to. In the example, [r1, 2] is fetched from RM11 and the S object at address 2 is dereferenced. The requested attribute value (S_Attr) of the S object - here 11 - is then written to partition RMS1 as object [r1, 11]. After processing [r1, 3] from partition RM11, [r1, 1] is retrieved from RM21 and the object address 1 is dereferenced, yielding an object [r1, 17] in partition RMS1. Now that all flattened objects belonging to r1 from RM11 and RM21 are processed, the merge continues with r2. After the partitions RM11 and RM21 are processed, RM12 and RM22 are merged in the same way to yield a single partition RMS2. As a final step, the partitions RMS1 and RMS2 are merged to form the result RMS. During this step, the flat objects [r, S_Attr] are nested (grouped) to form set-valued attributes [r, {S_Attr}]. If an aggregation of the nested S_Attr values had been requested in the query, it would be carried out in this final merge.
1. In place/copy
Here, we distinguish whether the objects in which pointers are swizzled remain in place on the pages on which they are resident on secondary storage, or whether they are copied into a separate object buffer.
2. Eager/lazy
Along this dimension we differentiate between techniques that will swiz-
zle all pointers that are detected versus those swizzling techniques that
will only swizzle on demand, i.e., when the particular reference is deref-
erenced.
3. Direct/indirect
Under direct pointer swizzling, the swizzled attribute (reference) con-
tains a direct pointer to the referenced in-memory object. Under indirect
swizzling there exists one indirection; that is, the attribute contains a
pointer to a so-called descriptor, which then contains the pointer to the
referenced object.
The three dimensions are summarized in tabular form in Figure 4.6. In the
subsequent sections, we will discuss those three dimensions in a bit more
detail.
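The difference between the two reference forms can be captured in a few lines of C++ (a sketch, not taken from any of the systems discussed): under indirect swizzling, displacing an object amounts to clearing a single pointer in its descriptor, with no reverse reference list needed.

#include <cstdint>

struct Object;                      // in-memory representation of an object

struct Descriptor {                 // one descriptor per referenced object
    Object* resident = nullptr;     // nullptr while the object is not in memory
    std::uint64_t oid = 0;          // used to fault the object in on demand
};

struct DirectRef   { Object* target;     };   // direct swizzling
struct IndirectRef { Descriptor* target; };   // indirect swizzling

// Displacing an object under indirect swizzling: all references share the
// one descriptor, so clearing it suffices.
inline void displace(Descriptor& d) { d.resident = nullptr; }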
With direct swizzling, all swizzled pointers referring to an object have to be registered in a list called the reverse reference list (RRL).⁴ Figure 4.7 illustrates the scenario of direct swizzling.

[Fig. 4.7. Direct swizzling among the objects Father, Mother, and Child.]
Note that in case of eager direct swizzling, we are not allowed to simply
unswizzle the pointers, as eager swizzling guarantees that all pointers in the
buffer are swizzled - instead, we have to displace those pointers (i.e., their
"home objects"), too. This may result in a snowball effect - however, in this
presentation we will not investigate this effect in detail.
Maintaining the RRL can be very costly; especially in case the degree of
sharing of an object is very high. In our context, the degree of sharing can
be specialized to the fan-in of an object that is defined as the number of
swizzled pointers that refer to the object. Assume, for example, an attribute
of an object is assigned a new value. First, the RRL of the object that the old value of the attribute referenced needs to be updated. Then the attribute needs to
be registered in the RRL of the object it now references. Maintaining the
RRL in the sequence of an update operation is demonstrated in Figure 4.8,
in which an attribute, say, spouse, of the object Mary is updated due to
a divorce from John and subsequent remarriage to Jim. First, the reverse
reference to the object Mary is deleted from the RRL of the object John;
then a reverse reference is inserted into the RRL of the object Jim.
Indirect swizzling avoids this overhead of maintaining an RRL for every resident object by permitting pointers that reference nonresident objects to be swizzled. In order to realize indirect swizzling, a swizzled pointer materializes in the descriptor that stands between the reference and the referenced object.
⁴ In the RRL the OID of the object and the identifier of the attribute in which the pointer appears are stored - we say that the context of the pointer is stored.
[Fig. 4.8. Maintaining the RRL under an update: the spouse attribute of the object Mary is redirected from John to Jim; Mary's reverse reference is deleted from John's RRL and inserted into Jim's RRL.]
Fig. 4.10. Wave front of swizzled and mapped pages in the Texas persistent store
Whole segments are mapped into virtual memory, since mapping a whole segment at once is cheaper than mapping every page individually. On the other hand, more virtual memory is reserved by segments, or parts of segments, that are never accessed. Pages, however, are also loaded and swizzled incrementally by ObjectStore in a client/page-server architecture.
4.4 Clustering
The clustering problem is the problem of placing objects onto pages such
that for a given application the number of page faults becomes minimal.
This problem is computationally very complex - in fact, the problem is NP-
hard. Hence, several heuristics to compute approximations of the optimal
placement have been developed. Here we will just discuss a single heuristic
that is based on graph partitioning.
Let us first motivate clustering by way of an example. Assume that in many applications the three objects id1, id2, and id3 are used together. If they are stored on separate pages, as exemplified in Figure 4.11, the application induces three page faults. Assuming an average access time of 10 ms per page
Fig. 4.11. Placement of three related objects onto pages: unclustered (top) versus
clustered (bottom)
access, this fetch phase lasts 30 ms. The result after fetching these objects is shown at the top of Figure 4.11. Since the involved objects are quite small, they could all easily fit on a single page. If all the objects reside on a single page, only one page access - taking approximately 10 ms - is needed to fetch all three related objects into main memory. A factor of three is saved. Obviously, the saving increases with the number of logically related objects that fit onto a single page. It is obvious that these three objects should have been placed on the same page - as shown at the bottom of Figure 4.11.
Besides this obvious saving, there exists another less obvious advantage of
clustering several logically related objects onto a single page. We first observe
that all pages fetched into main memory occupy buffer space. Further, buffer
space is usually restricted. Hence, if too many pages are needed, some of them
must be stored back onto disk despite the fact that during the continuation of
the application certain objects they contain are again accessed. This results
in more page faults and, hence, in decreased performance. Less buffer space
is wasted if the percentage of objects on a page needed by an application is
high. Clustering of those objects that are accessed together in an application
increases this percentage and, hence, increases performance. From this point
of view, filling a page totally with objects always needed together is the best
clustering strategy possible.
Fig. 4.12. Referencing within the SIMPLE-example: schema references on the left
and the cluster graph on the right
The optimal clustering for the above example is very intuitive. To illus-
trate that this is not always the case consider the so-called SIMPLE-example
[TN91] exhibiting an interesting pitfall when following the above, intuitively
straightforward clustering strategy of filling pages maximally. There exist
objects with identifiers S, I, M, P, L, and E. They reference each other in
the way indicated on the left-hand side of Figure 4.12. The application we
consider is characterized by the following access pattern or reference string:
• [S,I,M], [P,L,E]
The GGP heuristic proceeds from the cluster graph: the nodes are the objects, and the edge weights record how often the two connected objects are accessed together. The edges are processed in descending weight order; for each edge, the heuristic determines the partitions P1 and P2 to which the objects o1 and o2 are assigned. If P1 ≠ P2 and if the total size of all objects assigned to P1 and P2 is less than the page size, the two partitions are joined.⁶ Otherwise, the edge is merely discarded - and the partitions remain invariant.
It is easy to see that the GGP-algorithm obtains the optimal clustering for the SIMPLE-example, consisting of three partially empty pages. It first assigns each of the six objects to a separate partition (page). Then it merges the pages with the objects S and I, M and P, and L and E, respectively - in no particular sequence, since there are ties with the weight 198. Having obtained these three pages [S, I, -], [M, P, -], and [L, E, -], no further merging is possible because a page was assumed to hold at most three objects.
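The heuristic sketched above can be written down compactly in C++ (illustrative only; for simplicity, partition size is measured in number of objects rather than bytes):

#include <algorithm>
#include <numeric>
#include <vector>

struct Edge { int o1, o2; long weight; };   // objects accessed together

struct GGP {
    std::vector<int> part;   // partition (page) assigned to each object
    std::vector<int> size;   // number of objects per partition
    int pageCapacity;        // objects that fit on one page

    GGP(int nObjects, int capacity)
        : part(nObjects), size(nObjects, 1), pageCapacity(capacity) {
        std::iota(part.begin(), part.end(), 0);  // one partition per object
    }

    void run(std::vector<Edge> edges) {
        // Visit the edges of the cluster graph in decreasing weight.
        std::sort(edges.begin(), edges.end(),
                  [](const Edge& a, const Edge& b) { return a.weight > b.weight; });
        for (const Edge& e : edges) {
            int p1 = part[e.o1], p2 = part[e.o2];
            if (p1 != p2 && size[p1] + size[p2] <= pageCapacity) {
                for (int& p : part)              // join the two partitions
                    if (p == p2) p = p1;
                size[p1] += size[p2];
            }                                    // otherwise discard the edge
        }
    }
};

For the SIMPLE-example one would construct GGP ggp(6, 3) and pass the weighted edges of the cluster graph; the heuristic then produces the three pages described above.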
So far we have only dealt with objects that fit into one page. However, in advanced applications there are many "bulky" data types, e.g., multimedia data like video, audio, and images, where this premise no longer holds. Therefore, techniques for mapping large objects of any size - ranging from several hundred kilobytes to gigabytes - are needed. It is, of course, not feasible to simply map such a large object onto a chained list of pages. This naive approach would severely penalize reading an entire large object, or a part in the "middle" of a large object, from secondary memory. Therefore, smarter techniques are needed that map large objects onto large chunks of consecutive pages - called segments - while, at the same time, allowing dynamic growth and shrinking of objects "in the middle". Also, the object structure has to provide for efficient access to random byte positions within the large object - without having to read the entire part preceding the desired position.
[Figure: the Exodus storage structure for large objects - a tree, rooted in a node root, over the pages holding the object's bytes.]
The Starburst approach. The Exodus storage structure has the disadvan-
tage of fixed segment sizes. This may be a problem if very differently sized
objects need to be stored. Therefore, in Starburst [LL89] segments with a
fixed growth pattern were introduced. That is, a large object is created by
starting with a segment of a chosen size. Each additionally allocated segment is twice the size of its predecessor segment, except for the last segment, which can have an arbitrary size in order to avoid storage waste. The segments are
chained by a so-called Descriptor - as illustrated in Figure 4.15. The De-
scriptor contains the number of segments (here 5), the size of the first (here
100) and the last segment (here 340), and the pointers to the segments.
This approach seems to favor sequential reads because the segments of
really large objects can be chosen accordingly large. On the other hand,
dynamic growth and shrinking in the middle is more complex than in the
Exodus approach.
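Given the Descriptor, locating a random byte position reduces to a short scan over the (few) segment sizes, as the following C++ sketch illustrates (illustrative only; the sizes 100, 200, 400, 800, 340 correspond to the example of Figure 4.15, and the function name is invented):

#include <cstddef>
#include <cstdint>
#include <vector>

struct SegmentPos { std::size_t segment; std::uint64_t offset; };

// 'sizes' holds the actual segment sizes from the Descriptor,
// e.g. {100, 200, 400, 800, 340}.
SegmentPos locate(const std::vector<std::uint64_t>& sizes, std::uint64_t pos) {
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        if (pos < sizes[i]) return {i, pos};    // position falls into segment i
        pos -= sizes[i];
    }
    return {sizes.size(), 0};                   // position past the object's end
}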
The EOS approach. In EOS [Bil92] the Exodus and Starburst approaches were combined such that variable-sized segments are possible and a B+-tree is used as a directory in order to support dynamic growth and shrinking
efficiently. This is illustrated in Figure 4.16.
Fig. 4.16. Representation of a large object in EOS
[Bil92] also describes a buddy scheme for allocating the variable-sized segments.
5 Architecture of Client-Server Systems

With the advent of powerful desktop computers in the early 1980s, client-server computing has become the predominant architecture of database applications. The database is installed on a powerful backend server, while the application programs are executed on the client computers. Here, we will briefly survey the architectural design choices for client-server databases.
For a data-shipping client-server architecture there are two choices with respect to the granularity of the data items being shipped between the server and the client(s): page versus object server. In a page server, the client requests entire pages (of a predetermined size of, e.g., 8 KB). The effectiveness of this architecture depends on a good clustering of objects onto pages. This is a prerequisite for making good use of the resources: the bandwidth of the network and the buffer space in the client.
In the object server architecture, the client requests individual objects from the server. This way, only those objects that are actually needed in the client are sent over the network and placed in the client's buffer. That
is, the object server minimizes resource consumption as far as network bandwidth and client buffer space are concerned. On the negative side, explicitly requesting each individual object easily leads to performance degradation if many objects are accessed in the application.
The advantage of the dual buffer management is that well clustered pages
containing many objects relevant for the particular application are left intact
in the buffer. On the other hand, pages that contain only a few relevant ob-
jects are evicted from the buffer after these few objects have been extracted.
Under dual buffer management, the client's main memory buffer is effectively
utilized because only relevant objects occupy the precious buffer space. This
is achieved without incurring the high client-server interaction rate exhibited
by an object server. It is, of course, the buffer manager's task to maintain access statistics such that the two types of pages - those containing a high portion of relevant objects and those containing only a few relevant objects -
are detected.
6 Indexing

In this and the next section we will use the object base shown in Figure 6.1 for the illustration of new indexing techniques in object-oriented database systems.
Fig. 6.1. Example object base with Students, Exams, and Professors
Access support relations (ASRs) support the evaluation of path expressions along arbitrarily long attribute chains, where the chain may even contain collection-valued attributes. The ASRs make it possible to avoid the actual evaluation of the functional joins by materializing frequently traversed reference chains.
R0 := {o}
Ri := ∪ { v.Ai | v ∈ Ri-1 }    for 1 ≤ i ≤ n
7 This means that the attribute Ai can be associated with objects of type ti or any
subtype thereof.
8 Note, however, that we do not permit powersets.
1. The canonical extension [[t0.A1. ... .An]]can contains only the complete paths, i.e., those originating in an object of type t0 and leading all the way to an object of type tn.
2. The left-complete extension [[t0.A1. ... .An]]left contains all paths originating in t0 but not necessarily leading to tn, possibly ending in a NULL.
3. The right-complete extension [[t0.A1. ... .An]]right, analogously, contains paths leading to tn, but possibly originating in some object oj of type tj which is not referenced by any object of type tj-1 via the Aj attribute.
4. Finally, the full extension [[t0.A1. ... .An]]full contains all partial paths, even if they do not originate in t0 or do end in a NULL.
Definition 4 (Extensions). Let ⋈ denote the natural join on the last column of the first relation and the first column of the second relation, and let ⟗, ⟕, and ⟖ denote the corresponding full, left, and right outer joins. Then the different extensions are obtained as follows:

[[t0.A1. ... .An]]can   = [[t0.A1]] ⋈ ... ⋈ [[tn-1.An]]
[[t0.A1. ... .An]]full  = [[t0.A1]] ⟗ ... ⟗ [[tn-1.An]]
[[t0.A1. ... .An]]left  = [[t0.A1]] ⟕ ... ⟕ [[tn-1.An]]
[[t0.A1. ... .An]]right = [[t0.A1]] ⟖ ... ⟖ [[tn-1.An]]
This extension contains all paths and subpaths corresponding to the underlying path expression. The first four tuples actually constitute complete paths, which would be present in the canonical extension as well; the fifth path, however, would be omitted in the canonical extension. In the left-complete extension only the first four tuples would be present, whereas the fifth tuple would also be present in the right-complete extension.
It should be obvious that the full extension of an ASR contains more information than the left- or right-complete extensions which, in turn, contain more information than the canonical extension. The right- and left-complete
extensions are incomparable. The next definition states under what condi-
tions an existing access support relation can be utilized to evaluate a path
expression that originates in an object (or a set of objects) of type s.
Definition 5 (Applicability). An access support relation [[t0.A1. ... .An]]X under extension X is applicable for a path s.Ai. ... .Aj, where s is a subtype⁹ of ti-1, under the following condition, depending on the extension X:

Applicable([[t0.A1. ... .An]]X, s.Ai. ... .Aj)  holds iff
    X = full   and  1 ≤ i ≤ j ≤ n,
    X = left   and  1 = i ≤ j ≤ n,
    X = right  and  1 ≤ i ≤ j = n,
    X = can    and  1 = i ≤ j = n.
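The condition amounts to one comparison pattern per extension, as the following C++ sketch (illustrative only) makes explicit:

enum class Ext { Can, Left, Right, Full };

// Applicability of an ASR over t0.A1. ... .An under extension x for the
// path s.Ai. ... .Aj, per Definition 5.
bool applicable(Ext x, int i, int j, int n) {
    switch (x) {
        case Ext::Full:  return 1 <= i && i <= j && j <= n;
        case Ext::Left:  return i == 1 && i <= j && j <= n;
        case Ext::Right: return 1 <= i && i <= j && j == n;
        case Ext::Can:   return i == 1 && i <= j && j == n;
    }
    return false;
}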
We will call the left B+-tree the "forward clustered" tree and, analogously, the right one the "backward clustered" tree. The left-hand B+-tree supports the evaluation of a forward query, e.g., retrieving the name of the professor who has examined the student identified by id35. The right-hand B+-tree supports the evaluation of backward queries - with respect to the underlying path expression. For our example, an entry point for finding the Students who have taken exams from "Knuth" is provided by the backward clustered B+-tree.
This storage scheme is also well suited for traversing paths from left to right (forward) as well as from right to left (backward), even if they span
⁹ Note that every type is a subtype of itself.
[[Students.takenExams.givenBy]]can
S0: OID Students    S1: OID Exams    S2: OID Professors
id35                id21             id1
id35                id23             id6
id37                id22             id1
...                 ...              ...

[[Professors.Name]]can
S0: OID Professors    S1: string
id6                   "Babbage"
id1                   "Knuth"
id5                   "Turing"
The above example illustrates the virtues of the redundant storage model for ASRs. The right B+-tree of the ASR [[Professors.Name]]can directly supports the lookup of those Professors whose Name is "Knuth", i.e., the one with OID id1 in our example. Then, the right B+-tree of the ASR [[Students.takenExams.givenBy]]can supports the traversal to the corresponding Students to obtain the result {id35, id37}.
Thus, the backward traversal constitutes a "right-to-left" semi-join across ASRs:
ΠS0( [[Students.takenExams.givenBy]]can ⋉ σS1="Knuth"([[Professors.Name]]can) )
Analogously, the "forward clustered" B+-tree supports the semi-join from left to right, such that, for instance, the Name of the Professor who has examined student id35 can be retrieved efficiently. This corresponds to the "left-to-right" semi-join across ASRs:

ΠS3( σS0=id35([[Students.takenExams.givenBy]]can) ⋈ [[Professors.Name]]can )
Join index hierarchies. Recently, [XH94] adapted the ASR scheme to a so-called join index hierarchy. Their key idea is to omit the intermediate objects in the join index and merely store the OIDs of the start and the target object. In addition, the number of possible paths between the start and the target object is counted and stored.
Their approach is still based on the binary access support relations [[t0.A1]], ..., [[tn-1.An]]. A join index covering the sub-path from ti to tj - denoted JI(ti.Ai+1. ... .Aj) - is obtained from the binary ASRs as follows:
Let us briefly explain the derivation of the first tuple [id57, id88, 17]. There are 3 paths connecting id57 with id67 and 3 paths connecting id67 with id88. Therefore, there are 3 * 3 = 9 different ways to traverse from id57 to id88 via id67. Likewise, there are 2 * 4 = 8 different ways to traverse from id57 to id88 via id78. This amounts to 9 + 8 = 17 different paths between id57 and id88.
For our university database the join index JI(Students.takenExams.givenBy.Name) looks as follows:

JI(Students.takenExams.givenBy.Name)
OID Students    string       count
id35            "Knuth"      1
id35            "Babbage"    1
id37            "Knuth"      1
id37            "Babbage"    1
Note that join indices are always ternary relations - no matter how long a path expression they cover - because intermediate objects are omitted. In our case, the join index does not contain fewer tuples than the canonical ASR because none of the students in our example database has taken two (or more) exams from the same professor.
Maintaining just one join index covering the entire path expression is usually not sufficient because it allows only those queries to be evaluated that span the entire path. The other extreme is to materialize all the possible join
indices that cover anyone of the subpaths. This results in precomputing (and
maintaining) the so-called complete join index hierarchy. For the abstract
path expression to.Al.A2.A3.A4 the complete join index hierarchy is shown
in Figure 6.2.
The disadvantage of the complete join index hierarchy is that materializ-
ing all the possible join indices leads to high storage and high update costs.
10 For simplicity we assume that the binary ASRs are augmented with a count
attribute which is set to 1 in all tuples.
Fig. 6.2. The complete join index hierarchy for a path of length 4
Fig. 6.3. A partial join index hierarchy for a path of length 4

The insertion of tuples into a join index has to be done with care: If a tuple [id_i, id_j, n] exists in the relation and another tuple [id_i, id_j, m] representing m additional paths between id_i and id_j is inserted, the two have to be combined into the single tuple [id_i, id_j, (n + m)].
Let us illustrate this bottom-up update propagation on our partial join index hierarchy of Figure 6.3 (a sketch of the combine rule follows the list). Inserting the additional tuple(s) Δ[[t1.A2]] into [[t1.A2]] is propagated to the other join indices as follows:
1. JI(t0.A1.A2) is updated by inserting the tuples
   ΔJI(t0.A1.A2) := [[t0.A1]] ⋈_c Δ[[t1.A2]].
2. JI(t1.A2.A3.A4) is updated by inserting the tuples
   Δ[[t1.A2]] ⋈_c JI(t2.A3.A4).
3. In updating JI(t0.A1.A2.A3.A4) the set of new tuples ΔJI(t0.A1.A2) for the join index JI(t0.A1.A2) that was computed in step 1 is reused. JI(t0.A1.A2.A3.A4) is updated by inserting the tuples
   ΔJI(t0.A1.A2) ⋈_c JI(t2.A3.A4).
Here ⋈_c denotes the count-preserving join that multiplies the path counts of joined tuples.
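The combine rule itself can be sketched as follows (a simplified model of ours, in which a join index is a dictionary from OID pairs to path counts); the Δ tuple sets produced in the steps above, e.g. by the count-join sketched earlier, would be inserted with this rule.

def insert_tuples(ji, delta):
    # Combine an existing [id_i, id_j, n] with an incoming
    # [id_i, id_j, m] into the single tuple [id_i, id_j, n + m].
    for start, end, m in delta:
        ji[(start, end)] = ji.get((start, end), 0) + m

ji = {("id57", "id88"): 9}
insert_tuples(ji, [("id57", "id88", 8)])
assert ji[("id57", "id88")] == 17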
Consider the following example query Q3:

select s
from s in AllStudents
where s.gpa() > 3.0
Storing materialized results. There are two obvious locations where materialized results could possibly be stored: in or near the argument objects of the materialized function or in a separate data structure. Storing the results near the argument objects means that the argument and the function result are stored within the same page such that the access from the argument to the appropriate result requires no additional page access. In general, storing results near the argument objects has several disadvantages:
• If the materialized function f : t1, ..., tn → t_{n+1} has more than one argument (n > 1), one of the argument types must be designated to hold the materialized result. But this argument has to maintain the results of all argument combinations, which, in general, won't fit on one page.
• Clustering of function results would be beneficial to support selective queries on the results. But this is not possible if the location of the materialized results is determined by the location of the argument objects.
Therefore, materialized results are better stored in a separate data structure, and the results of several functions over the same argument types may be stored within the same data structure. This provides for more efficiency when evaluating queries that access results of several of these functions and, further, avoids storing the arguments redundantly. These thoughts lead to the following definition:
Definition 6 (Generalized Materialization Relation, GMR).
Let t1, ..., tn, t_{n+1}, ..., t_{n+m} be types and let f1, ..., fm be side-effect free functions with fj : t1, ..., tn → t_{n+j} for 1 ≤ j ≤ m. Then the generalized materialization relation ⟪f1, ..., fm⟫ for the functions f1, ..., fm is of arity n + 2*m and has the following form:

⟪f1, ..., fm⟫ ⊆ [O1 : t1, ..., On : tn, f1 : t_{n+1}, V1 : bool, ..., fm : t_{n+m}, Vm : bool]
Intuitively, the attributes O1, ..., On store the arguments (i.e., values if the argument type is atomic or references to objects if the argument type is complex); the attributes f1, ..., fm store the results or, if the result is of complex type, references to the result objects of the invocations of the functions f1, ..., fm; and the attributes V1, ..., Vm (standing for validity) indicate whether the stored results are currently valid.
An extension of the GMR ⟪f1, ..., fm⟫ is consistent if a true validity indicator implies that the associated materialized result is currently valid. As an example, consider the GMR ⟪Students.gpa⟫:
⟪Students.gpa⟫
O1 : OID_Students    gpa : float    V_gpa : bool
id35                 2.0            true
id37                 2.5            true
id53                 -              true
...                  ...            ...
Upon the creation of a new GMR the database administrator can choose
whether the GMR extension has to be complete or whether the extension
may be set up incrementally (starting with an empty GMR extension). In-
crementally set up GMR extensions can be used as a cache for function results
that were computed during the evaluation of queries. If the number of entries
is limited (due to space restrictions), specialized replacement strategies for the
GMR entries can be applied. Note that GMRs must be set up incrementally
if they contain at least one partial function.
It should now be obvious that the example query Q3 can be evaluated as

Π_{O1}(σ_{gpa>3.0}(⟪Students.gpa⟫))

as long as the GMR ⟪Students.gpa⟫ is gpa-valid and complete.
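A rough executable model of an incrementally set up GMR, ours purely for illustration (the class name GMR and the example values are invented):

class GMR:
    """Caches results of a single materialized function per argument OID."""
    def __init__(self, fn):
        self.fn = fn
        self.table = {}            # OID -> [result, validity]

    def lookup(self, oid):
        entry = self.table.get(oid)
        if entry is None or not entry[1]:
            entry = [self.fn(oid), True]   # (re-)materialize lazily
            self.table[oid] = entry
        return entry[0]

    def invalidate(self, oid):
        if oid in self.table:
            self.table[oid][1] = False     # lazy rematerialization

# Q3 over a complete, gpa-valid GMR reduces to a selection/projection:
gpa = GMR(lambda oid: {"id35": 2.0, "id37": 2.5}[oid])
for oid in ("id35", "id37"):
    gpa.lookup(oid)
q3 = [oid for oid, (result, valid) in gpa.table.items()
      if valid and result > 3.0]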
Invalidation of materialized results is controlled by a data structure called the Reverse Reference Relation (RRR). The RRR contains tuples of the following form:

[id(o), f, ⟨id(o1), ..., id(on)⟩]

Herein, id(o) is the identifier of an object o utilized during the materialization of the result f(o1, ..., on). Note that o need not be one of the arguments o1, ..., on; it could be some object related to one of the arguments. Thus, each tuple of the RRR constitutes a reference from an object o influencing a materialized result to the tuple of the appropriate GMR in which the result is stored. We call this a reverse reference as there exists a reference chain in the opposite direction in the object base.
Definition 9 (Reverse Reference Relation). The Reverse Reference Relation RRR is a set of tuples of the form

[O : OID, F : FunctionId, A : ⟨OID⟩]

For each tuple r ∈ RRR the following condition holds: The object (with the identifier) r.O has been accessed during the materialization of the function r.F with the argument list r.A. Remember that the angle brackets ⟨...⟩ denote the list constructor.
The reverse references are inserted into the RRR during the materialization process. Therefore, each materialized function f and all functions invoked by f are modified: the modified versions are extended by statements that inform the GMR manager about the set of accessed objects. During a (re-)materialization of some result the modified versions of these functions are invoked.
For our University object base a part of the RRR that controls the invalidation of precomputed results in the GMR ⟪Students.gpa⟫ is shown in Figure 6.4. Each time an object is updated in the object base, the RRR
is inspected to find out which materialized results have to be invalidated
(lazy rematerialization) or recomputed (immediate rematerialization). Ref-
erence [KKM94] describes ways to detect object updates by schema modifi-
cation and efficient algorithms for maintaining the RRR - which, of course,
changes under object base updates.
Fig. 6.4. The RRR controlling invalidation in the GMR ⟪Students.gpa⟫

RRR
O      F              A
id21   Students.gpa   ⟨id35⟩
id22   Students.gpa   ⟨id37⟩
id23   Students.gpa   ⟨id35⟩
id27   Students.gpa   ⟨id37⟩
...    ...            ...

⟪Students.gpa⟫
O1 : OID_Students    gpa : float    V_gpa : bool
id35                 2.0            true
id37                 2.5            true
id53                 -              true
...                  ...            ...
For the discussion of this section consider the type hierarchy shown in Figure 6.5. Based on this type hierarchy we can phrase the example queries discussed below.

Person
├── Emp
│   └── Manager
│       └── CEO
└── Student

Fig. 6.5. Sample type hierarchy
[[Emp.salary]]
S0 : OID_Emp    S1 : int
id4             90000
id5             100000
id7             100000
id11            150000
id8             260000
id13            900000
id77            1500000
id88            2000000
[[Emp.salary]] (direct Emp instances only)
S0 : OID_Emp    S1 : int
id4             90000
id5             100000
id7             100000
id8             260000

σ_{S1>200000}([[Emp.salary]])
∪ σ_{S1>200000}([[Manager.salary]])
∪ σ_{S1>200000}([[CEO.salary]])
This problem appears even more severe when considering query Q3 under
the assumption of separate single type indexing on the age attribute.
This layout of the leaf nodes provides support for extracting the (OIDs
of) objects of a particular type by jumping to the corresponding offset which
is maintained in the key directory.
[LOL92] developed an indexing scheme, called H-trees, for combining type hierarchy indexing with single type indexing. The basic idea consists of nesting B+-trees, i.e., nesting the index tree of a subtype within the tree of the supertype. For our example this is graphically visualized in Figure 6.6, where the three H-trees H_Emp for direct Emp instances, H_Manager for direct Manager instances, and H_CEO for CEO instances are sketched.
The nesting is achieved by incorporating so-called L pointers which refer from the supertype index to the subtype index tree. There are two essential conditions for a valid H-tree nesting.
The CG-tree. [KM94] pointed out the principal difference between a key grouping index, such as the CH-tree, and a set grouping index, such as the H-tree. Figure 6.7 sketches the relative performance of these two indexing schemes for exact match and range queries. The key grouping scheme has very good performance (i.e., low numbers of pages have to be read) for exact match queries whereas the set grouping scheme degenerates if many sets (i.e., many levels of a type hierarchy) have to be processed. This is due to the fact that basically every type extent is covered by a separate B-tree. On the other hand, the key grouping scheme shows poor performance for range queries because it has to process a large number of leaf pages. It cannot draw profit from a restriction on the number of sets (type extents) that should be processed because all sets' objects are intermixed on the leaf pages.

Fig. 6.7. Relative performance of a key grouping index and a set grouping index for exact match and range queries, plotted over the number n of queried sets

Observing this principal difference, [KM94] designed the so-called CG-tree which combines the advantages of both schemes. The idea is to replace the leaf pages of a B+-tree by several linked lists, one for each set (type extent) being indexed. This basic idea is illustrated in Figure 6.8 for two sets (type extents) S1 and S2 only.

Fig. 6.8. The CG-tree leaf level for two indexed sets: one linked list of leaf pages for the S1-objects and one for the S2-objects
The linked lists of leaf pages are considered to be at level 1 of the tree.
Then, at level 2 of the tree particularly structured so-called directory pages
are needed that reference the pages at level 1. The directory pages have the following structure for n indexed sets S1, ..., Sn:
The directory page contains m search keys K1, ..., Km. Thereby, the m ranges R1, ..., Rm are defined. For each range, the directory contains n pointers to level 1 pages. The pointer Ri.Sj refers to the page of Sj-objects whose keys are in the range Ri, i.e., whose keys are in the interval [Ki, K_{i+1}). If the set Sj does not contain any such elements, Ri.Sj is null.
The higher-up nodes of the CG-tree are regular B+ -tree nodes - having
just one emanating node pointer per range.
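A directory page and its lookup can be sketched as follows (our simplification; the leaf page names L1 through L4 and the example key values are invented):

import bisect

class DirectoryPage:
    def __init__(self, keys, pointers):
        self.keys = keys          # search keys K_1, ..., K_m (sorted)
        self.pointers = pointers  # pointers[i][j]: leaf page holding the
                                  # S_j objects with keys in [K_i, K_{i+1})

    def leaf_for(self, key, set_no):
        i = bisect.bisect_right(self.keys, key) - 1
        return self.pointers[i][set_no]   # None if the range is empty

# Two indexed sets (Emp salaries and CEO salaries) and three ranges:
page = DirectoryPage(
    keys=[0, 200000, 1000000],
    pointers=[["L1", None], ["L2", "L3"], [None, "L4"]],
)
assert page.leaf_for(90000, 0) == "L1"      # Emp object, low salary
assert page.leaf_for(2000000, 1) == "L4"    # CEO object, high salary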
The cardinality of the indexed sets and their distribution of key values may be non-uniform. In our example, one can expect that higher salaries are typically found for CEO objects whereas the lower salaries are usually paid to "regular" Emp objects. To compensate for this skew in attribute value distribution, leaf nodes may be shared by several neighboring directory entries. Such a situation is shown in Figure 6.9. Assume that an underflow occurs in the leaf pages L1 and/or L2 of the tree shown on the left-hand side. This underflow is compensated by merging the two leaves into a single leaf page L12, as shown on the right-hand side. This combined leaf page is now referenced by two neighboring directory entries via the pointers R1.S1 and R2.S1.
7.1 Introduction
In queries, selection and join predicates may refer not only to single-valued attributes but also to set-valued attributes. Predicates on set-valued attributes can be used as selection predicates and join predicates. The next query is an example of the former:
select s
from s in Student
where Requirements <= s.coursesPassed
The query retrieves students who have passed at least all courses contained
in the set of courses Requirements. This query can be evaluated efficiently by
using an index on Student.coursesPassed.
The following query contains a join predicate based on set-valued attributes. It matches students with courses. The result consists of pairs of courses and students such that the student passed all the courses which are a prerequisite for the course:
select c, s
from c in Course, s in Student
where c.prerequisites <= s.coursesPassed
No traditional join algorithm designed for fast join processing, like hash-join or sort-merge join, is able to handle this query. The only possible evaluation strategy is to use the slow nested-loop join where every course's prerequisites are compared with every student's coursesPassed. Obviously, this is quite expensive.
Both queries contained the subset predicate <= but other queries could use set equality, the strict or non-strict superset predicate, and other variants. All these possible set predicates can be treated by the techniques introduced in this section. They rely on signatures, which will be discussed in the next section. Both the new join algorithms and the new index structures use signatures as their essential ingredient. Another common variant is to test two sets for a non-empty intersection. This case cannot be treated by any of the methods in this section.
Superimposed coding and signatures. The join operators and the index structures for set-valued attributes represent sets by their signature. When applying the technique of superimposed coding, each element of a given set s is mapped via a coding function to a bit field of length b (called signature length) in which exactly k < b bits are set. These bit fields of all elements in the set are superimposed by a bitwise or operation to yield the final signature denoted by sig(s).
The following property of signatures is essential. Given two sets s and t, the implication

s θ t ⟹ sig(s) θ sig(t)

holds for any comparison operator θ ∈ {=, ⊆, ⊇}, where sig(s) ⊆ sig(t) and sig(s) ⊇ sig(t) are defined as

sig(s) ⊆ sig(t) := sig(s) & ~sig(t) = 0
sig(s) ⊇ sig(t) := sig(t) & ~sig(s) = 0

As in the programming language C, & denotes bitwise and and ~ denotes bitwise complement.
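A sketch of superimposed coding (ours; the signature length b = 64, the choice of k = 3 bits per element, and the hash-based coding function are arbitrary assumptions):

import hashlib

B, K = 64, 3   # signature length b and number of set bits k per element

def element_field(elem):
    # Derive k distinct pseudo-random bit positions from hashes of the
    # element and superimpose them into one b-bit field.
    positions = set()
    counter = 0
    while len(positions) < K:
        digest = hashlib.sha256(f"{elem}:{counter}".encode()).digest()
        positions.add(digest[0] % B)
        counter += 1
    field = 0
    for p in positions:
        field |= 1 << p
    return field

def sig(s):
    result = 0
    for elem in s:
        result |= element_field(elem)   # bitwise or superimposes
    return result

# s <= t implies sig(s) & ~sig(t) == 0; the converse may fail ("false
# drops"), so candidates must still be verified on the actual sets.
s, t = {"db1", "db2"}, {"db1", "db2", "os1"}
assert sig(s) & ~sig(t) == 0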
Fig. 7.1. An extendible signature hashing directory with global depth d = 3: the 2^3 entries 000, 001, ..., 111 reference buckets, whose local depths d' may be smaller than d
An extendible signature hashing index is divided into two parts, the directory and the buckets [FNP+79]. A bucket contains pairs [sig(oi.A), ref(oi)]. The directory begins with a header holding the global depth d. Further, it consists of 2^d entries containing references to buckets (see Fig. 7.1). When looking up a data item in the directory, the lowest d bits of its signature are used if the predicate to be evaluated is based on set equality. If a ⊆-predicate is employed, then the same mechanism as for the signature-based hash join is used: all possible d-bit endings of signatures for subsets are generated and each one is looked up in the hash table. Insertions and deletions are treated the same way as in the original proposal for extendible hashing [FNP+79].
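For a query like Requirements <= s.coursesPassed, the stored signature must be a superset of the query signature, so its lowest d bits must be a superset of the query's lowest d bits. Generating these endings can be sketched as follows (our illustration; the subset direction is handled symmetrically):

def superset_endings(query_sig, d):
    # All d-bit endings whose set bits include the query's lowest d bits.
    fixed = query_sig & ((1 << d) - 1)       # bits that must be set
    free = [i for i in range(d) if not (fixed >> i) & 1]
    for mask in range(1 << len(free)):       # every choice of free bits
        ending = fixed
        for j, bit in enumerate(free):
            if (mask >> j) & 1:
                ending |= 1 << bit
        yield ending

# With d = 3 and query ending 101, the buckets 101 and 111 are probed.
print(sorted(superset_endings(0b101, 3)))   # [5, 7]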
8 Query Optimization
8.1 Overview
A query optimizer for object bases includes optimization techniques from relational query optimization (e.g., join ordering) as well as new optimization techniques (e.g., type-based rewriting). We will concentrate on the new optimization problems and techniques. For traditional optimization techniques see [vBii90,JK84].
A query optimizer typically involves the phases shown in Fig. 8.1. Within the first phase, syntactic analysis takes place. Syntactic analysis is divided into two substeps: lexical analysis and parsing. During lexical analysis a token stream is generated, which is then translated into an abstract syntax tree during parsing. The techniques involved here are standard compiler techniques and are not discussed here.
Fig. 8.1. The phases of query optimization: Syntactic Analysis, NFST, Rewrite I, Query Optimization, Rewrite II, and Code Generation
The fifth phase is again a rewrite phase. Here, small cosmetic rewrites are
applied to the plan in order to prepare it for the code generation phase. The
phases Rewrite II and Code Generation are discussed in section 8.5.
Different implementations of query optimizers use different names for
these phases and sometimes permute the phases or steps within the phases.
For example, some optimizers perform the semantic analysis after the trans-
lation into the algebra. Since the implementation of query optimizers is quite
tricky, different architectures have been designed to facilitate the organiza-
tion of the optimization process. Among them are rule-based query optimiz-
ers, region-based query optimizers, blackboard-based query optimizers, and
query optimizers using object-oriented implementation techniques. However,
we do not go into the details of these architectural approaches but instead
discuss the main tasks and techniques of each phase.
8.2 NFST
The first two steps of this phase consist of the normalization and factorization of the expressions occurring in the query. During normalization we introduce a new variable for every function or operator call in the original query and bind these variables to the corresponding expressions. All function applications are gathered in a define clause appended to the SFW-block. Consider the following example query:
following example query:
select distinct s.name, s.age, s.supervisor.name, s.supervisor.age
from s in Student
where s.gpa > 8 and s.supervisor.age < 30
Here, the only (explicit) function calls are attribute accesses. For each attribute access a new variable holding the result is introduced. Since some attributes (e.g., s.supervisor) are accessed multiple times, these accesses are factorized. The definitions of the newly introduced variables are gathered in the define clause. Hence, the result of the normalization and factorization steps is:
select distinct sn, sa, ssn, ssa
from s in Student
where sg > 8 and ssa < 30
define sn = s.name
sg = s.gpa
sa = s.age
ss = s.supervisor
ssn = ss.name
ssa = ss.age
The next step during the NFST-phase is the semantic analysis. It works recursively through the query; for every identifier in the from clause, the corresponding type information is determined.

Fig. 8.2. Algebraic representation of the example query; its leaf is the expression SCAN[s:Student]
In the last step of the NFST-phase the query is translated into some internal representation. Several such representations have been proposed. They are based on object calculi, comprehensions, or an object algebra. Here, we favor a simplified version of the object algebra. We explain the translation process into the algebra by means of two examples. The above query serves as our first example. Its translation into the algebra can be found in Fig. 8.2. The expression SCAN[s:Student] scans the extent Student and produces tuples with a single attribute s successively bound to the object identifiers of the Students. We assume that all algebraic operators work on sets of tuples where the attribute values may be complex and not just atomic values as in the relational model. The EXPAND operator expands the given input tuples by new attribute values. For example, the bottommost EXPAND operator adds the three attributes sn, sg, and ss to its input tuples. The expand operator comes in different flavors and is also called map or materialize operator [BK90,BMG93,KMP92]. The SELECT operator selects those input tuples that satisfy the given predicate. The PROJECT operator performs a projection for a given set of attributes.
Besides the algebraic operators stated above, we also treat an SFWD-block as an n-ary algebraic operator. The collection-valued entries in the from clause are considered as the input arguments of this special operator.
However, in a typical runtime system of an object-oriented database manage-
ment system there is no direct evaluation possibility for such a block. Prior
to execution these blocks are translated into "regular" algebraic expressions.
We briefly describe this translation process.
In standard relational query processing multiple entries in the from clause
are translated into a cross product. This is not always possible in object-
oriented query processing. Consider the following query:
select distinct s
from s in Student, c in s.courses
where c.name = "Database"
After normalization and factorization, we have:
select distinct s
from s in Student, c in s.courses
where cn = "Database"
define cn = c.name
The entry s.courses in the from clause depends on s, so the two entries cannot simply be translated into a cross product; instead a d-join is used: for every student s from its left input, the d-join computes the set s.courses. For every course c in s.courses an output tuple containing the original student s and a single course c is produced. If the evaluation of the right argument of the d-join is not dependent on the left argument, the d-join is equivalent to a cross product. The first optimization is to replace d-joins by cross products whenever possible.
The resulting plan applies the d-join, then EXPAND[cn:c.name], a selection on cn, and finally PROJECT[s].
The general SFWD-block with grouping has the form

select ...
from ...
where p
group by a1: e1, ..., an: en

and its translation uses χ_{ai:ei}, the EXPAND operator (also sometimes called materialize operator). We often use Greek letters for algebraic operators to make plans more compact. The operator χ_e is similar to the EXPAND operator: it evaluates the expression e for every input element, and the results are collected in the output. The unary grouping operator will play another major role during unnesting nested queries (cf. Sec. 8.3). The traditional nest operator [SS86] is a special case of unary grouping; it is equivalent to Γ_{g;=A;id}.
Let us consider a small example query:
select struct(age: s.age, gpa: s.gpa, cnt: count(partition))
from s in Student
group by a: s.age
g: s.gpa
After normalization and factorization, we have:
select struct(age: sa, gpa: sg, cnt: cp)
from s in Student
define sa: s.age
sg: s.gpa
group by a: sa
g: sg
define cp: count(partition)
We had to introduce a second define clause in order to normalize those expressions in the select clause that can only be computed after grouping. This applies to all expressions referring to partition. We easily see that the entries in the group by clause of the normalized and factorized query block can be simplified to contain only the group variables (attributes) sa and sg. Abbreviating SCAN[x:X] by X[x], the translation into the algebra uses the unary grouping operator and a special case of the PROJECT (Π) operator which includes renaming.
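The effect of unary grouping can be sketched as follows (a toy model of ours): each output tuple carries the group attributes plus the partition, i.e., the input tuples forming the group, from which count(partition) is then computed.

from itertools import groupby

def unary_grouping(tuples, attrs):
    # Group input tuples by the group attributes; attach the partition.
    key = lambda t: tuple(t[a] for a in attrs)
    for k, group in groupby(sorted(tuples, key=key), key=key):
        partition = list(group)
        yield dict(zip(attrs, k)) | {"partition": partition}

students = [{"sa": 22, "sg": 3.1}, {"sa": 22, "sg": 3.1},
            {"sa": 25, "sg": 3.9}]
for g in unary_grouping(students, ("sa", "sg")):
    print({"age": g["sa"], "gpa": g["sg"], "cnt": len(g["partition"])})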
8.3 Rewrite I
The goal of this phase is to rewrite the query with rules that either allow for a more efficient evaluation of the query or that facilitate later query optimization. The prevailing example of the first case is unnesting. Nested queries enforce a nested-loops evaluation strategy and fix certain parts of the join order. Unnesting typically leads to plans which are orders of magnitude faster than the nested counterparts [PHH92]. Type-based rewriting is a technique of the second kind. After its application, the optimizer can consider a larger search space and, hence, most probably finds better plans. However, it should be noted that it is not always clear which of the techniques discussed in this section will occur in the rewrite phase of an actual optimizer implementation and which will occur in the optimization phase. Some optimizers don't even have a rewrite phase.
The rewrite phase includes traditional optimization techniques [JK84] invented in the relational context as well as optimization techniques especially tailored for queries against object bases. The traditional optimization techniques used in the query rewriting phase include pushing of the boolean connector not, simplifications, introduction of transitively implied equality predicates, and the introduction of indexes. The latter point subsumes not only the introduction of traditional index structures like B-trees [BM72,Com79] but also of more advanced index structures like ASRs [KM90], join index hierarchies [XH94], and GMRs [KKM91]. These plans are represented by replacing the SCAN on extents by a corresponding INDEX-SCAN operator.
The algebraic expression in Fig. 8.2 implies a scan of all students and a subsequent dereferentiation of the supervisor attribute in order to access the supervisors. If not all supervisors fit into main memory, this may result in many page accesses. Further, if there exists an index on the supervisor's age, and the selection condition ssa < 30 is highly selective, the index should be applied in order to retrieve only those supervisors required for answering the query. Type-based rewriting enables this kind of optimization: for any expression of a certain type with an associated extent, the extent is introduced in the from clause, so that the implicit join becomes explicit and amenable to reordering and index support.
A further rewriting opportunity concerns inverse relationships. Consider the following query:
select distinct p
from p in Professor
where p.room.number = 209
Straightforward evaluation of this query would scan all professors. For every professor, the room relationship would be traversed to find the room where the professor resides. Last, the room's number would be retrieved and tested to be 209. Using the inverse relationship, the query could as well be rewritten to select the rooms with number 209 and to traverse the inverse relationship to the professors residing in them. The evaluation of this rewritten query can be much more efficient, especially if there exists an index on the room number. Rewriting queries by exploiting inverse relationships is another rewrite technique to be applied during Rewrite Phase I.
Consider next a query that retrieves the students of maximum age. The subquery for computing the maximum age of all students is not correlated to the outer query. Uncorrelated subqueries behave like constant expressions in a query. Hence, unnesting can already take place during the normalization step in the NFST phase. For this query, the result is:
define ma = max( select sa
                 from s in Student
                 define sa = s.age)
select s
from s in Student
where sa = ma
define sa = s.age
The define preceding the SFWD-block is then evaluated prior to the block.
However, sometimes more efficient ways to evaluate a query are possible.
According to Kim's classification of nested queries [Kim82], there are the
following types of nested queries:
• Type A nested queries have a constant inner block returning single ele-
ments.
• Type N nested queries have a constant inner block returning sets.
• Type J nested queries have an inner block that is dependent on the outer
block and returns a set.
• Type JA nested queries have an inner block that is dependent on the outer block and returns a single element.
A second dimension of the classification of nested queries in the object-oriented context is their location: nested queries can occur in the select, from, and where clause. We concentrate on unnesting of queries in the where clause. Unnesting in the select clause is treated similarly. Unnesting nested queries in the from clause can be performed by techniques given in [CM95a,PHH92].
Type A nested queries can be unnested by moving them one block up (like in the example). Sometimes, more efficient ways to unnest these queries are possible. In the example the extent of Student has to be scanned twice. This can be avoided by introducing the new algebraic operator MAX defined as

MAX_f(e) := {x | x ∈ e, f(x) = max_{y∈e} f(y)}

The MAX operator can be computed in a single pass over e.
Using MAX the above query can be expressed in the algebra as

q ≡ MAX_{s.age}(Student[s])
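A single-pass implementation of MAX_f is straightforward (our sketch, with invented example data):

def max_op(e, f):
    # Keep all elements attaining the maximum of f seen so far.
    best, result = None, []
    for x in e:
        v = f(x)
        if best is None or v > best:
            best, result = v, [x]
        elif v == best:
            result.append(x)
    return result

students = [("s1", 24), ("s2", 29), ("s3", 29)]
print(max_op(students, lambda s: s[1]))   # [('s2', 29), ('s3', 29)]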
Type N nested queries can also be unnested by moving them one block up, since they also do not depend on their surrounding block. Again, more efficient evaluation plans are sometimes possible. We distinguish three different kinds of predicates occurring within the outer where clause:
1. f(x) in select ...
2. not (f(x) in select ... )
3. f(x) θ select ... for a set comparison θ ∈ {=, ⊆, ⊇, ...}
where x represents variables of the outer block and f a function (or subquery) on these variables.
Subsequent equivalences will be subject to constraints. To express these constraints we need some abbreviations. We denote by F the free variables/attributes of an algebraic expression and by A the attributes in the result of an algebraic expression. Further we use the standard shorthands for SELECT (σ), JOIN (⋈), EXPAND (χ), left semi-join (⋉), left outer-join (⟕), and left anti-join (▷).
1. Type N queries with an in operator can be transformed into a semi-join by using the following equivalence inspired by relational type N unnesting:

   σ_{A1 ∈ χ_{A2}(e2)}(e1) ≡ e1 ⋉_{A1=A2} e2    (1)
   if Ai ⊆ A(ei), F(e2) ∩ A(e1) = ∅

   The first condition is obvious; the second merely stipulates that expression e2 must be independent of expression e1.
2. Also inspired by the relational type N unnesting is the following equivalence which turns a type N query with a negated in operator into an anti-join:

   σ_{A1 ∉ χ_{A2}(e2)}(e1) ≡ e1 ▷_{A1=A2} e2    (2)
   if Ai ⊆ A(ei), F(e2) ∩ A(e1) = ∅
The third case does not have a counterpart in SQL. However, if we formulate the corresponding queries on a relational schema using the non-standard SQL found in [Kim82], they would be of type D, resolved by a division. Using standard SQL, they would require a double nesting using EXISTS operations. Unnesting type D queries using a relational division can only handle very specific queries where the comparison predicate corresponds, in our context, to a non-strict inclusion. Hence, the third case is typically treated by moving the nested query to the outer block, so that it is evaluated only once, and then relying on fast set comparison operators.
The algebraic expression for the query

select p
from p in Professor
where p.residesIn in select r
                     from r in Room
                     where r.size > 30

is:

q ≡ σ_{pr ∈ χ_r(e2)}(e1)
e1 ≡ χ_{pr:p.residesIn}(Professor[p])
e2 ≡ σ_{rs>30}(χ_{rs:r.size}(Room[r]))

and Eq. 1 can be applied. The result is

q ≡ e1 ⋉_{pr=r} e2

where we reuse expressions e1 and e2 from above.
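Operationally, equivalence (1) replaces the nested-loops evaluation of the inner block by a single scan of e2. A sketch of ours, using a hash-based semi-join over invented example tuples:

def semi_join(e1, e2, a1, a2):
    # e1 semijoin_{a1=a2} e2: evaluate e2 once, then filter e1.
    keys = {t[a2] for t in e2}
    return [t for t in e1 if t[a1] in keys]

professors = [{"p": "id1", "pr": "r7"}, {"p": "id6", "pr": "r9"}]
big_rooms = [{"r": "r7", "rs": 35}]       # sigma_{rs>30} over Room
print(semi_join(professors, big_rooms, "pr", "r"))
# [{'p': 'id1', 'pr': 'r7'}]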
Contrary to Kim's unnesting technique for the relational context, type J and JA queries are treated by the same set of equivalences in the object-oriented context. For queries featuring an in or not in in the where clause, the equivalences for type N queries only need slight modifications, for instance:

1. q ≡ σ_{pr ∈ χ_r(σ_{pd=rb}(e2))}(e1)
The following query retrieves for every student the number of better students. It translates into the algebra as

q ≡ Π_{stud:s1,cnt:c}(χ_{c:count(σ_{s2g>s1g}(e2))}(e1))
e1 ≡ χ_{s1g:s1.gpa}(Student[s1])
e2 ≡ χ_{s2g:s2.gpa}(Student[s2])

Applying Eq. 7 yields an equivalent unnested expression. Here, exists denotes the test for emptiness as in SQL. Unnesting then proceeds by the technique given in [PHH92].
The general template for a query with a universal quantifier is:

select e1
from e1 in E1
where for all e2 in (select e2
                     from e2 in E2
                     where p): q

The predicate p is called range predicate and q is called quantifier predicate. Both of them may refer to e1 and/or e2. This results in 16 different cases. All but three of them are rather trivial [CKM+97a,CKM+97b]. The more complex cases give rise to three classes:
select s.name
from s in Student
where for all c in (select c
                    from c in Course
                    where c.name like "%database%"):
      c in s.coursesPassed

The unnested plan for this query starts from the expression χ_{s.name}(χ_{sc:s.coursesPassed}(Student[s])) and compares sc with the set of database courses.

select d.name
from d in Department
where for all p in (select p
                    from p in Professor
                    where p.dept = d):
      p.status != "full professor"

Here the unnested plan starts from χ_{d.name}(Department[d]).

select d.name
from d in Department
where for all s in (select s
                    from s in Student
                    where s.city = d.city):
      s.dept = d

Here the unnested plan starts from χ_{d.name}(χ_{dc:d.city}(Department[d])).
select distinct *
from Professor p1, Professor p2
where p1.university.name = p2.university.name

A variant of this query compares the university objects directly by identity:

select distinct *
from Professor p1, Professor p2
where p1.university = p2.university

8.4 Query Optimization
Classical dynamic programming algorithms for join ordering exclude plans containing cross products in order to keep the search space small [OL90]. For queries over several small relations, however, a plan containing a cross product of small relations is often superior to those plans without cross products. Hence, newer dynamic programming algorithms consider cross products as well.
One such algorithm that generates plans with cross products, selections, and joins is given in Figure 8.5. The algorithm is described in pseudo code. It generates optimal bushy trees, that is, plans where both join partners can be intermediate relations. Efficient implementation techniques for the algorithm can be found in [SM98]. As input parameters, the algorithm takes a set of relations R and a set of predicates P. The set of relations for which a selection predicate exists is denoted by Rs. We identify relations and predicates that apply to these relations. For all subsets Mk of the relations and subsets Pl of the predicates, an optimal plan is constructed and entered into the table T. The loops range over all Mk and Pl. Thereby, the set Mk is split into two disjoint subsets L and L', and the set Pl is split into three parts (line 7). The first part (V) contains those predicates that apply to relations in L only. The second part (V') contains those predicates that apply to relations in L' only. The third part (p) is a conjunction of all the join predicates connecting relations in L and L' (line 8). Line 9 constructs a plan by joining the two plans found for the pairs [L, V] and [L', V'] in the table T. If this plan has the best cost so far, it is memorized in the table (lines 10-12). Last, different possibilities of not pushing predicates in Pl are investigated (lines 15-19). A condensed sketch of the core enumeration follows.
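The following is our reading of the core of the algorithm, not the original pseudo code: the cost model (intermediate result sizes, a flat selectivity of 0.1 per join predicate) and the relation names R, S, T are invented, and all selection predicates are assumed to be pushed, i.e., lines 15-19 are omitted.

from itertools import combinations

def optimize(relations, card, join_preds):
    # card: relation -> cardinality; join_preds: frozenset pair -> predicate.
    table = {frozenset([r]): (0, r, card[r]) for r in relations}
    for k in range(2, len(relations) + 1):
        for m in combinations(relations, k):
            mk, best = frozenset(m), None
            for i in range(1, 2 ** (k - 1)):       # proper splits L, L'
                left = frozenset(x for j, x in enumerate(m) if i >> j & 1)
                right = mk - left
                lcost, lplan, lsize = table[left]
                rcost, rplan, rsize = table[right]
                preds = [p for pair, p in join_preds.items()
                         if pair & left and pair & right]
                size = lsize * rsize * (0.1 ** len(preds))  # crude guess
                cost = lcost + rcost + size    # cost: intermediate sizes
                if best is None or cost < best[0]:
                    op = " JOIN " if preds else " X "
                    best = (cost, f"({lplan}{op}{rplan})", size)
            table[mk] = best
    return table[frozenset(relations)]

card = {"R": 1000, "S": 10, "T": 10}
preds = {frozenset({"R", "S"}): "R.a=S.a",
         frozenset({"S", "T"}): "S.b=T.b"}
print(optimize(["R", "S", "T"], card, preds))
# approximately (1010.0, '(R JOIN (S JOIN T))', 1000.0)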
For queries against object-oriented databases, the third major operator is the EXPAND operator (χ). Equivalences analogous to those for selections show that the EXPAND operator is also freely reorderable with selections and joins.
Extents vs. strict extents. A strict extent contains the objects (or their OIDs)
of a class excluding those of its subclasses. A non-strict extent contains the
objects of a class and all objects of its subclasses.
Logical level: the strict extents contain only the direct instances, e.g., Employee: {e1, e2, ...}, Manager: {m1, ...}, CEO: {c1, ...}, whereas the non-strict extents also contain the objects of all subtypes, e.g., Employee': {e1, e2, ..., m1, ..., c1, ...}, Manager': {m1, ..., c1, ...}, and CEO': {c1, ...}.
Physical level: every object is stored exactly once, e.g., e1: [name: Peter, salary: 20.000, boss: m1], e2: [name: Mary, salary: 21.000, boss: m1], m1: [name: Paul, salary: 100.000, boss: c1], c1: [name: May, salary: 500.000, boss: c1]; the non-strict extents merely reference these objects.
input stream that do not have a join partner in the right stream. Generation of bypass plans is beyond the scope of this chapter and the reader is referred to the literature [KMP+94,SPM+95].
During the Rewrite II phase, selections are pushed down using equivalences such as

σ_p(σ_q(e)) ≡ σ_{p∧q}(e)
σ_p(e1 ⋈_q e2) ≡ e1 ⋈_{q∧p} e2

Similar equivalences exist for projections, which are also pushed down during the Rewrite II phase. Another major performance improvement can be achieved by factorizing common algebraic subexpressions [CD92].
The code generation phase heavily depends on the runtime system. The relatively fixed part is the translation of the algebraic operators within the query evaluation plan: they are translated into corresponding iterators. The flexibility concerns the translation of the subscripts, for example the selection predicates. Three alternatives exist. First, they can be translated directly into machine code. This approach is rather efficient but makes code generation dependent on the underlying hardware. The second alternative is to interpret the expressions. This is easiest to implement and machine independent, but also less efficient. The third alternative is a compromise: the query evaluation plan is translated into an interpreted code similar to machine code. The generated code is then executed by a virtual machine. This guarantees hardware independence and a performance between the other two alternatives.
9 Conclusion
References
[AF95] Aberer, K., Fischer, G., Semantic query optimization for methods in
object-oriented database systems, Proc. IEEE Conf. on Data Engi-
neering, 1995, 70-79.
[AG96] Arnold, K., Gosling, J., The Java programming language, Addison-
Wesley, Reading, MA, USA, 1996.
[AL80] Adiba, M.E., Lindsay, B.G., Database snapshots, Proc. 6th Interna-
tional Conference on Very Large Data Bases (VLDB), 1980, 86-91.
[BCK98] Braumandl, R, Claussen, J., Kemper, A., Evaluating functional joins
along nested reference sets in object-relational and object-oriented
databases, Proc. 24th International Conference on Very Large Data
Bases (VLDB), 1998, 110-122.
[BCL89] Blakeley, J.A., Coburn, N., Larson, P.-A., Updating derived relations:
detecting irrelevant and autonomously computable updates, ACM
Trans. on Database Systems 14(3), 1989, 369-400.
[Bil92] Biliris, A., The performance of three database storage structures for
managing large objects, Proc. ACM SIGMOD Conf. on Management
of Data, 1992, 276-285.
[BK89] Bertino, E., Kim, W., Indexing techniques for queries on nested ob-
jects, IEEE Trans. Knowledge and Data Engineering 1(2), 1989,
196-214.
[BK90] Beeri, C., Kornatzky, Y., Algebraic optimization of object-oriented
query languages, S. Abiteboul, P.C. Kanellakis (eds.), Lecture Notes
in Computer Science 470, 3rd International Conference on Database
Theory (ICDT'90), Springer-Verlag, Berlin, 1990, 72-88.
[BLT86] Blakeley, J.A., Larson, P.-A., Tompa, F.W., Efficiently updating
materialized views, Proc. ACM SIGMOD Conf. on Management of
Data, 1986, 61-71.
[BM72] Bayer, R., McCreight, E., Organization and maintenance of large
ordered indices, Acta Informatica 1(4), 1972, 290-306.
[BMG93] Blakeley, J., McKenna, W., Graefe, G., Experiences building the
Open OODB query optimizer, Proc. ACM SIGMOD Conf. on Man-
agement of Data, 1993, 287-295.
[Boo94] Booch, G., Object-oriented analysis and design, Benjamin/Cum-
mings, Redwood City, CA, USA, 1994.
[BP95] Biliris, A., Panagos, E., A high performance configurable storage
manager, Proc. IEEE Conf. on Data Engineering, 1995, 35-43.
[Cat94] Cattell, R.G.G. (ed.), Object database standard, Morgan Kaufmann
Publishers, San Mateo, CA, USA, 1994.
[CBB+97] Cattell, R., Barry, D., Bartels, D., Berler, M., Eastman, J., Gamer-
man, S., Jordan, D., Springer, A., Strickland, H., Wade, D., The
object database standard: ODMG 2.0, The Morgan Kaufmann Series
in Data Management Systems, Morgan Kaufmann Publishers, San
Mateo, CA, USA, 1997.
[CD92] Cluet, S., Delobel, C., A general framework for the optimization of
object-oriented queries, Proc. ACM SIGMOD Conf. on Management
of Data, 1992, 383-392.
[CDF+94] Carey, M.J., DeWitt, D.J., Franklin, M.J., Hall, N.E., McAuliffe,
M.L., Naughton, J.F., Schuh, D.T., Solomon, M.H., Tan, C.K., Tsa-
talos, O.G., White, S.J., Zwilling, M.J., Shoring up persistent appli-
cations, Proc. ACM SIGMOD Conf. on Management of Data, 1994,
383-394.
[CDR+86] Carey, M., DeWitt, D., Richardson, J., Shekita, E., Object and file
management in the EXODUS extensible database system, Proc. 12th
International Conference on Very Large Data Bases (VLDB), 1986,
91-100.
[CDV88] Carey, M.J., DeWitt, D.J., Vandenberg, S.L., A data model and query
language for EXODUS, Proc. ACM SIGMOD Conf. on Management
of Data, 1988, 413-423.
[CKM+97a] Claussen, J., Kemper, A., Moerkotte, G., Peithner, K., Optimizing
queries with universal quantification in object-oriented and object-
relational databases, Proc. 23rd International Conference on Very
Large Data Bases (VLDB), 1997, 286-295.
[CKM+97b] Claussen, J., Kemper, A., Moerkotte, G., Peithner, K., Optimizing
queries with universal quantification in object-oriented and object-
relational databases, Technical Report MIP-9706, University of Pas-
sau, Fak. f. Mathematik u. Informatik, 1997.
[CM93] Cluet, S., Moerkotte, G., Nested queries in object bases, Proc. 4th In-
ternational Workshop on Database Programming Languages - Object
Models and Languages, 1993, 226-242.
[CM95a] Cluet, S., Moerkotte, G., Classification and optimization of nested
queries in object bases, Technical Report 95-6, RWTH Aachen, 1995.
[CM95b] Cluet, S., Moerkotte, G., Query optimization techniques exploiting
class hierarchies, Technical Report 95-7, RWTH Aachen, 1995.
[Com79] Comer, D., The ubiquitous B-tree, ACM Computing Surveys 11(2),
1979, 121-137.
[CS97] Chaudhuri, S., Shim, K., Optimization of queries with user-defined
predicates, Technical Report, Microsoft Research, Advanced Technol-
ogy Division, One Microsoft Way, Redmond, WA 98052, USA, 1997.
[Day87] Dayal, U., Of nests and trees: a unified approach to processing queries
that contain nested subqueries, aggregates, and quantifiers, Proc. 13th
International Conference on Very Large Data Bases (VLDB), 1987,
197-208.
[Dep86] Deppisch, U., S-tree: a dynamic balanced signature index for office
retrieval, Proc. 9th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval (SIGIR'86),
1986, 77-87.
[EGK95] Eickler, A., Gerlhof, C.A., Kossmann, D., A performance evaluation
of OID mapping techniques, Proc. 21st International Conference on
Very Large Data Bases (VLDB), 1995, 18-29.
[FJK96] Franklin, M.J., Jonsson, B., Kossmann, D., Performance tradeoffs
for client-server query processing, Proc. ACM SIGMOD Conf. on
Management of Data, 1996, 149-160.
[Flo96] Florescu, D., Espaces de recherche pour l'optimisation de requêtes
objet (Search spaces for query optimization), PhD thesis, Université
de Paris VI, 1996.
[FNP+79] Fagin, R., Nievergelt, J., Pippenger, N., Strong, H., Extendible hash-
ing - a fast access method for dynamic files, ACM Trans. on Database
Systems 4(3), 1979, 315-344.
[Fra96] Franklin, M., Client data caching: a foundation, Kluwer Academic
Press, 1996.
[GKK+93] Gerlhof, C.A., Kemper, A., Kilger, C., Moerkotte, G., Partition-
based clustering in object bases: from theory to practice, Lecture
Notes in Computer Science 730, Intl. Conf. on Foundations of Data
Organization and Algorithms (FODO), Springer-Verlag, Berlin, 1993,
301-316.
[GKM96] Gerlhof, C.A., Kemper, A., Moerkotte, G., On the cost of monitoring
and reorganization of object bases for clustering, ACM SIGMOD
Record 25(3), 1996, 28-33.
[GL87] Ganski, R.A., Long, H.K.T., Optimization of nested SQL queries
revisited, Proc. ACM SIGMOD Conf. on Management of Data, 1987,
22-33.
[Gra93] Graefe, G., Query evaluation techniques for large databases, ACM
Computing Surveys 25(2), 1993, 73-170.
[Gut84] Guttman, A., R-trees: a dynamic index structure for spatial search-
ing, Proc. ACM SIGMOD Conf. on Management of Data, 1984,
47-57.
[Han87] Hanson, E., A performance analysis of view materialization strategies,
Proc. ACM SIGMOD Conf. on Management of Data, 1987, 440-453.
[Han88] Hanson, E., Processing queries against database procedures: a perfor-
mance analysis, Proc. 1988 ACM SIGMOD International Conference
on Management of Data, 1988, 295-302.
[Här78] Härder, T., Implementing a generalized access path structure for a
relational database system, ACM Trans. on Database Systems 3(3),
1978, 285-298.
[HM96] Helmer, S., Moerkotte, G., Evaluation of main memory join algo-
rithms for joins with set comparison join predicates, Technical Report
13/96, University of Mannheim, Mannheim, Germany, 1996.
[HM97] Helmer, S., Moerkotte, G., Evaluation of main memory join algo-
rithms for joins with set comparison join predicates, Proc. 23rd In-
ternational Conference on Very Large Data Bases (VLDB), 1997,
386-395.
[HP94] Hellerstein, J., Pfeffer, A., The RD-tree: an index structure for sets,
Technical Report 1252, University of Wisconsin, Madison, Wisconsin,
1994.
[HS93] Hellerstein, J.M., Stonebraker, M., Predicate migration: optimizing
queries with expensive predicates, Proc. ACM SIGMOD Conf. on
Management of Data, 1993,267-276.
[IK093] Ishikawa, Y., Kitagawa, H., Ohbo, N., Evaluation of signature files
as a set access facility in OODBMS, Proc. ACM SIGMOD Conf. on
Management of Data, 1993, 247-256.
[Ita93] Itasca Systems Inc., Technical summary for Release 2.2, Itasca Sys-
tems, Inc., USA, 1993.
[Jhi88] Jhingran, A., A performance study of query optimization algorithms
on a database system supporting procedures, Proc. 14th International
Conference on Very Large Data Bases (VLDB), 1988, 88-99.
[JK84] Jarke, M., Koch, J., Query optimization in database systems, ACM
Computing Surveys 16(2), 1984, 111-152.
[KC86] Khoshafian, S.N., Copeland, G.P., Object identity, Proc. ACM Conf.
on Object-Oriented Programming Systems and Languages (OOP-
SLA), 1986, 408-416.
[KD91] Keßler, U., Dadam, P., Auswertung komplexer Anfragen an hier-
archisch strukturierte Objekte mittels Pfadindexen, Proc. der GI-
Fachtagung Datenbanksysteme für Büro, Technik und Wissenschaft
(BTW), Informatik-Fachberichte No. 270, Springer-Verlag, 1991,
218-237.
[Kie84] Kiessling, W., SQL-like and Quel-like correlation queries with ag-
gregates revisited, ERL/UCB Memo 84/75, University of Berkeley,
1984.
[Kim82] Kim, W., On optimizing an SQL-like nested query, ACM Trans. on
Database Systems 7(3), 1982, 443-469.
[Kim89] Kim, W., A model of queries for object-oriented databases, Proc.
15th International Conference on Very Large Data Bases (VLDB),
1989, 423-432.
[KK94] Kemper, A., Kossmann, D., Dual-buffering strategies in object
bases, Proc. 20th International Conference on Very Large Data Bases
(VLDB), 1994, 427-438.
[KK95] Kemper, A., Kossmann, D., Adaptable pointer swizzling strategies
in object bases: design, realization, and quantitative analysis, The
VLDB Journal 4(3), 1995, 519-566.
[KKD89] Kim, W., Kim, K.C., Dale, A., Indexing techniques for object-
oriented databases, W. Kim, F.H. Lochovsky (eds.), Object-Oriented
Concepts, Databases, and Applications, Addison-Wesley, 1989, 371-
394.
[KKM90] Kemper, A., Kilger, C., Moerkotte, G., Materialization of functions
in object bases: design, realization, and evaluation, Technical Re-
port 28/90, Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, 1990.
[KKM91] Kemper, A., Kilger, C., Moerkotte, G., Function materialization in
object bases, Proc. ACM SIGMOD Conf. on Management of Data,
1991, 258-268.
[KKM94] Kemper, A., Kilger, C., Moerkotte, G., Function materialization in
object bases: design, implementation and assessment, IEEE Trans.
Knowledge and Data Engineering 6(4), 1994, 587-608.
[KL70] Kernighan, B., Lin, S., An efficient heuristic procedure for partition-
ing graphs, Bell System Technical Journal 49(2), 1970, 291-307.
[KM90] Kemper, A., Moerkotte, G., Access support in object bases, Proc.
ACM SIGMOD Conf. on Management of Data, 1990, 364-374.
[KM92] Kemper, A., Moerkotte, G., Access support relations: an indexing
method for object bases, Information Systems 17(2), 1992, 117-146.
[KM94] Kilger, C., Moerkotte, G., Indexing multiple sets, Proc. 20th Interna-
tional Conference on Very Large Data Bases (VLDB), 1994, 180-191.
[KMP92] Kemper, A., Moerkotte, G., Peithner, K., Object-orientation axioma-
tised by dynamic logic, Technical Report #92-30, RWTH Aachen,
Germany, 1992.
[KMP+94] Kemper, A., Moerkotte, G., Peithner, K., Steinbrunn, M., Optimizing
disjunctive queries with expensive predicates, Proc. ACM SIGMOD
International Conference on Management of Data, 1994, 336-347.
[KMS92] Kemper, A., Moerkotte, G., Steinbrunn, M., Optimization of Boolean
expressions in object bases, Proc. 18th International Conference on
Very Large Data Bases (VLDB), 1992, 79-90.
[Kru56] Kruskal, J.B., On the shortest spanning subtree of a graph and the
travelling salesman problem, Proc. Amer. Math. Soc. 7, 1956, 48-50.
[LL89] Lehman, T.J., Lindsay, B.G., The Starburst long field manager, Proc.
15th International Conference on Very Large Data Bases (VLDB),
1989, 375-383.
[LLO+91] Lamb, C., Landis, G., Orenstein, J., Weinreb, D., The ObjectStore
database system, Communications of the ACM 34(10), 1991, 50-63.
[LMB97] Leverenz, L., Mateosian, R., Bobrowski, S., Oracle8 Server - concepts
manual, Oracle Corporation, Redwood Shores, CA, USA, 1997.
[LOL92] Low, C.C., Ooi, B.C., Lu, H., H-trees: a dynamic associative search
index for OODB, Proc. ACM SIGMOD Conf. on Management of
Data, 1992, 134-143.
[Lum70] Lum, V.Y., Multi-attribute retrieval with combined indexes, Com-
munications of the ACM 13, 1970, 660-665.
[MS86] Maier, D., Stein, J., Indexing in an object-oriented DBMS, K.R. Dit-
trich, U. Dayal (eds.), Proc. IEEE Intl. Workshop on Object-Oriented
Database Systems, IEEE Computer Society Press, 1986, 171-182.
[MS93] Melton, J., Simon, A., Understanding the new SQL: a complete guide,
Morgan Kaufman, San Mateo, California, 1993.
[Mur89] Muralikrishna, M., Optimization and dataflow algorithms for nested
tree queries, Proc. 15th International Conference on Very Large Data
Bases (VLDB), 1989, 77-85.
[Mur92] Muralikrishna, M., Improved unnesting algorithms for join aggregate
SQL queries, Proc. 18th International Conference on Very Large Data
Bases (VLDB), 1992, 91-102.
[MZD94] Mitchell, G., Zdonik, S., Dayal, U., Optimization of object-oriented
queries: problems and applications, A. Dogac, M.T. Ozsu, A.
Biliris, T. Sellis (eds.), Advances in Object-Oriented Database Sys-
tems, NATO ASI Series F: Computer and Systems Sciences, vol. 130,
Springer-Verlag, Berlin, 1994, 119-146.
[NHS84] Nievergelt, J., Hinterberger, H., Sevcik, K.C., The grid file: an adapt-
able, symmetric multikey file structure, ACM Trans. on Database
Systems 9(1), 1984, 38-71.
[O2T94] O2 Technology, Versailles Cedex, France, A technical overview of the
O2 system, 1994.
[OL90] Ono, K., Lohman, G.M., Measuring the complexity of join enumer-
ation in query optimization, Proc. 16th International Conference on
Very Large Data Bases (VLDB), 1990, 314-325.
[PHH92] Pirahesh, H., Hellerstein, J., Hasan, W., Extensible/rule-based query
rewrite optimization in Starburst, Proc. ACM SIGMOD Conf. on
Management of Data, 1992, 39-48.
[SAC+79] Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price,
T.G., Access path selection in a relational database management
system, Proc. ACM SIGMOD Conf. on Management of Data, 1979, 23-34.
[WK92] Wilson, P., Kakkad, S., Pointer swizzling at page fault time: efficiently
supporting huge address spaces on standard hardware, Proc. Int.
Workshop on Object Orientation in Operating Systems, Paris, IEEE
Press, 1992, 364-377.
[WM95] Wilhelm, R., Maurer, D., Compiler design, Addison-Wesley, 1995.
[XH94] Xie, Z., Han, J., Join index hierarchies for supporting efficient nav-
igations in object-oriented databases, Proc. 20th International Con-
ference on Very Large Data Bases (VLDB), 1994, 522-533.
5. High Performance Parallel Database
Management Systems
Abstract. Parallelism is the key to realizing high performance, scalable, fault tol-
erant database management systems. With the predicted future database sizes and
complexity of queries, the scalability of these systems to hundreds and thousands
of nodes is essential for satisfying the projected demand. This paper describes three
key components of a high performance parallel database management system. First,
data partitioning strategies that distribute the workload of a table across the avail-
able nodes while minimizing the overhead of parallelism. Second, algorithms for
parallel processing of a join operator. Third, ORE as a framework that controls
the placement of data to respond to changing workloads and evolving hardware
platforms.
1 Introduction
Database management systems (DBMS) have become an essential component of many application domains, e.g., airline reservation, stock market trading, etc. In the arena of high performance DBMS, parallel database systems have gained increased popularity. Example research prototypes include Gamma [DGS+90], Bubba [BAC+90], XPRS [SKP+88], Volcano [Gra94b], Omega [GCK+93], etc. Products from the industry include Tandem's NonStop SQL [Tan88], NCR's DBC/1012 [Ter85], Oracle Parallel Server [Ora94], IBM's DB2 parallel edition [BFG+95], etc. The hardware platform of these
machines is typically a multi-node platform, see Figure 1.1a, where each node might be a computer with one or more disks, see Figure 1.1b. In these sys-
tems, several forms of parallelism can be utilized to improve the performance
of the system. First, parallelism can be applied by executing several queries or
transactions simultaneously. This form of parallelism is termed inter-query
parallelism. Second, inter-operator parallelism can be employed to execute
several operators in the same query concurrently. For example, multiple nodes
could execute two or more relational join operators of a complex bushy join
query in parallel. Finally, intra-operator parallelism can be applied to each
operator within a query. For example, multiple nodes can be employed to
execute a single relational selection operator. This chapter describes how a
system employs these alternative forms of parallelism.
2 Partitioning Strategies
"BMC" to node 1 with range partitioning. (With hash, this query would
be directed to node 2.) This frees up the other two nodes to process other
queries.
When a transaction updates the partitioning attribute value of a record,
the system might migrate the record from one node to another in order to
preserve the integrity of the partitioning strategy. In the example of Fig-
ure 2.2, if the partitioning attribute (Symbol) value of a record changes from
"AXP" to "XAP", the system migrates this record from node 1 to node 3 to
preserve the integrity of the range partitioning strategy.
In [GD90], we quantified the performance tradeoff associated with range,
hash and round-robin partitioning strategies using the alternative indexing
mechanisms provided by the Gamma database machine. This study reveals
that for a shared-nothing multiprocessor database machine, no partitioning
strategy is superior under all circumstances. Rather, each partitioning strat-
egy outperforms the others for certain query types. The major reason for
this is that there exists a tradeoff between exploiting intra-query parallelism
by distributing the work performed by a query across multiple nodes and
the overhead associated with controlling the execution of a multisite query.
Localizing the execution of queries that require a minimal amount of resources results in the best system response time and throughput since the overhead
associated with controlling the execution of the query is either minimized or
eliminated. On the other hand, for queries requiring more resources, certain
tradeoffs are involved. In general, with access methods that result in the re-
trieval of only the relevant tuples from the disk, if the selectivity factor of the
query is very low, it is advantageous to localize the execution of the query
to a single node. While the hash partitioning strategy localizes the execution
of the exact match selection queries that reference the partitioning attribute,
the range partitioning strategy attempts to localize the execution of all query
types that reference the partitioning attribute regardless of their selectivity
factor. At the other end of the spectrum, the round-robin partitioning strat-
egy directs a query to all the nodes containing the fragments of the referenced
relation.
Fig. 2.2. Range partitioning of the Stock table on the Symbol attribute across three nodes, with ranges A-I, J-Q, and R-Z
For sequential scan queries, the best response time and throughput is
observed when the partitioning strategy constructs the smallest fragment size
on each node, the execution of each query is localized to a single node, and the
simultaneously executing queries are evenly dispersed across the nodes. The
system generally performs best when the query executes all by itself at a site
and performs a series of sequential disk requests. By localizing the execution
of the query to a single node, there is a higher probability of maintaining
the sequential nature of disk requests made by a query, free from interference
of the other concurrently executing queries. Thus, for the sequential scan
queries, the optimal partitioning strategy is the range partitioning strategy.
Fig. 2.3. MAGIC declustering of the Stock table: a two-dimensional grid directory on Symbol and P/E, with each entry naming the node that stores the corresponding fragment

P/E \ Symbol    A-D   E-H   I-L   M-P   Q-T   U-Z
0-10              1     2     3     4     5     6
11-20             7     8     9    10    11    12
21-30            13    14    15    16    17    18
31-40            19    20    21    22    23    24
41-50            25    26    27    28    29    30
51-60            31    32    33    34    35    36
Next, contrast the execution of queries A and B when the Stock table is
hash partitioned on the Symbol attribute with when it is declustered using
MAGIC and the assignment presented in Figure 2.3. Query type A is an exact
match query on the Symbol attribute. The hash partitioning strategy local-
izes the execution of this query to a single node. The MAGIC declustering
strategy employs six nodes to execute this query because its selection pred-
icate maps to one column of the two dimensional directory. As an example,
consider the query that selects the record corresponding to BMC Software
(Stock.Symbol = "BMC"). The predicate of this query maps to the first col-
umn of the grid directory and nodes 1, 7, 13, 19, 25, and 31 are employed to
execute it.
Query type B is a range query on the P/E attribute. The hash partitioning strategy must direct this query to all 36 nodes because P/E is not the partitioning attribute. Again, MAGIC directs this query to six nodes since its predicate value maps to one row of the grid directory and the entries of each row have been assigned to six different nodes. If instead the Stock relation was range partitioned on the P/E attribute, a single node would have been
used to execute the second query; however, then the first query would have
been executed by all 36 nodes.
Consequently, the MAGIC partitioning strategy uses an average of six
nodes per query, while the range and hash partitioning strategies both use an
average of 18.5 nodes (one node for one query type and all 36 for the other).
Ideally, however, a single node should have been used for each
query since they both have minimal resource requirements. Approximating
the optimal number of nodes closely provides two important benefits. First,
the average response time of both queries is reduced because query initia-
tion overhead [CAB+88] is reduced. Second, using fewer nodes increases the
overall throughput of the system because the "freed" nodes can be used to
execute additional queries.
To achieve this objective, the algorithm starts with a very large value for N. This reduces the
probability of a bucket exceeding the memory size. If the buckets are much
smaller than main memory, the algorithm combines several buckets into one
during the third phase to approximate the available memory.
This algorithm differs from sort-merge join in one fundamental way: in its
last step, the tuples from bucket Bi of R are stored in a memory-resident hash
table (built on the join attribute, attribute A). The tuples from bucket Bi of S
are used to probe this hash table for matching tuples. Grace-join may use the
smaller relation (say R) to determine the number of buckets; this calculation is
independent of the larger relation (S).
Hybrid hash-join. The hybrid hash-join also operates in three steps. Its
main difference from Grace hash-join is as follows: it retains the tuples of the
first bucket of R in a memory-resident hash table, while the remaining N-1
buckets are stored in temporary files. Relation S is partitioned using the same
hash function. Again, the last N-1 buckets are stored in temporary files, while
the tuples in the first bucket are used to immediately probe the memory-resident
hash table for matching tuples.
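The following simplified sketch captures the partition/build/probe structure common to both algorithms; bucket files are modeled as in-memory lists for brevity, and the relation and key-function names are our own:

from collections import defaultdict

def grace_hash_join(R, S, key_r, key_s, n_buckets=4):
    # Phases 1 and 2: partition both relations into N buckets using the
    # same hash function on the join attribute.
    r_buckets = defaultdict(list)
    s_buckets = defaultdict(list)
    for t in R:
        r_buckets[hash(key_r(t)) % n_buckets].append(t)
    for t in S:
        s_buckets[hash(key_s(t)) % n_buckets].append(t)
    # Phase 3: for each bucket pair, build a hash table on R's bucket and
    # probe it with the tuples of S's corresponding bucket.
    result = []
    for b in range(n_buckets):
        table = defaultdict(list)
        for t in r_buckets[b]:
            table[key_r(t)].append(t)
        for s in s_buckets[b]:
            for r in table.get(key_s(s), []):
                result.append((r, s))
    return result

The hybrid variant differs only in keeping bucket 0 of R memory resident and probing it on the fly while S is being partitioned.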
3.1 Discussion
The bandwidth of a disk is a function of the block size (B) and its physical
characteristics [GG97,BGM+94]: seek time, rotational latency, and transfer
rate (tfr). It is defined as:

BW(d_i) = B / (B/tfr + seek time + rotational latency)    (2)

Given a fixed seek time and rotational latency, BW(d_i) approaches the disk
transfer rate with larger block sizes.
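Plugging assumed drive parameters into equation (2) shows this limit numerically (the parameter values are illustrative, not from the chapter):

def bandwidth(block_mb, tfr_mb_s=20.0, seek_s=0.008, rot_latency_s=0.004):
    # BW = B / (B/tfr + seek time + rotational latency)
    return block_mb / (block_mb / tfr_mb_s + seek_s + rot_latency_s)

for b in (0.064, 0.5, 4.0, 32.0):
    print(f"{b:6.3f} MB blocks -> {bandwidth(b):6.2f} MB/s")
# With larger blocks, the result approaches the 20 MB/s transfer rate.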
There are F files stored on the underlying storage. The number of files
might change over time, causing the value of F to change. A file f_i might
be partitioned into two or more fragments. Its number of fragments is in-
dependent of the number of storage devices, i.e., K. Fragments of a file
may have different sizes. Fragment j of file f_i is denoted as f_i,j. In our
assumed environment, two or more fragments of a file might be assigned
to the same disk drive¹. Moreover, a file f_i may specify a certain availabil-
ity requirement from the underlying system. For example, it may specify
that its mean-time-to-data-loss, MTTDL(f_i), should exceed 200,000 hours,
MTTDL_min(f_i) = 200,000 hours.
We assume physical disk drives fail independently of one another. Each
disk has a certain failure rate [ZG00,SS82,Gib92], termed λ_failure. Its mean-
time-to-failure (MTTF) is simply 1/λ_failure. When a file (say f_i) is partitioned
into n fragments and assigned to n disks (say d_1 to d_n), then the data be-
comes unavailable in the presence of a single failure². Hence, its mean-time-
to-data-loss is defined as follows [ZG00,SS82,Gib92]:

MTTDL(f_i) = 1 / Σ_{i=1}^{n} λ_failure(d_i)    (3)
For example, if the MTTF of disk A and B is 1 million and 2 million hours,
respectively, then the MTTDL of a file with fragments scattered across these
two disks is 666,666 hours.
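Equation (3) in code, reproducing the example (the per-disk MTTFs are the stated 1 and 2 million hours):

def mttdl(mttf_hours):
    # 1 / sum of per-disk failure rates, assuming independent failures.
    return 1.0 / sum(1.0 / m for m in mttf_hours)

print(mttdl([1_000_000, 2_000_000]))  # 666666.67 hours, as in the example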
Fig. 4.1. Physical division of disk space into blocks and the corresponding logical
view of the sections with an example base of B = 2
fragment into many smaller pieces and disperse them amongst the available
disk drives.
The Monitor constructs a profile of the load imposed on each disk drive and the
average response time of each disk d_i. The load imposed on disk drive d_i is
quantified as the bandwidth required from d_i: the total number of
bytes retrieved from d_i during a time slice divided by the duration of the
time slice. The average response time of d_i is the average response time of
the requests it processes during the time interval.
This process produces three tables that are used by the other two steps (a sketch of their layout follows the list):
• the FragProfiler table maintains the average block request size, heat, and
  load imposed by each fragment f_i,j per time slice,
• the DiskProfiler table maintains, for each disk drive d_i per time slice, the
  heat, load, standard deviation in system load, average response time,
  average queue length, and utilization of d_i,
• the FragOvlp table maintains the OVERLAP between two fragments per
  time slice. The concept of OVERLAP is detailed in Section 4.2.
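As a sketch of what these tables hold, the record layouts below simply mirror the fields listed above; the field names are ours, not those of an actual implementation:

from dataclasses import dataclass

@dataclass
class FragProfilerRow:
    time_slice: int
    fragment: str            # identifier of fragment f_i,j
    avg_block_request: float
    heat: float
    load: float

@dataclass
class DiskProfilerRow:
    time_slice: int
    disk: int
    heat: float
    load: float
    load_stddev: float
    avg_response_time: float
    avg_queue_length: float
    utilization: float

@dataclass
class FragOvlpRow:
    time_slice: int
    fragment_a: str
    fragment_b: str
    overlap: float           # the OVERLAP measure, see Section 4.2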
availability of bandwidth from d_src. It assumes some buffer space for staging
data from the primary copy to facilitate construction of its secondary copy. This
buffer space might be provided as a component of the embedded device.
Depending on its size, the system might read and write units larger than
a block. Moreover, it might perform writes against d_dst in the background,
depending on the amount of free buffer space. Once the free space falls below
a certain threshold, the system might perform writes as foreground tasks that
compete with active user requests [AKN+97].
In this section, we describe two algorithms that strive to distribute the load
of an application evenly across the K disks. These are termed EVEN and
EVENc/B. As implied by their names, EVENc/B is a variant of EVEN. A
taxonomy of alternative techniques can be found in [GGG+01].
⁴ Given two disks, d_1 and d_2, with negative imbalances of -0.5 and -2.0, respectively,
d_2 has the minimum negative load imbalance.
The average response time of a fragment, RT_avg(f_i,j), is the sum of the
average service time, S_avg(f_i,j), and wait time, W_avg(f_i,j), of requests that
reference it:

RT_avg(f_i,j) = S_avg(f_i,j) + W_avg(f_i,j)    (5)
[Figure: number of requests issued per time slice in the trace; y-axis: # of
Requests (0 to 140,000), x-axis: Time Slice ID]
locates the fragment referenced by the request and resolves which disk
contains the referenced data. It consults with the file system of the disk
drive to identify the appropriate cylinder and track that contains the
referenced block.
4.4a. Starting with time slice 1    4.4b. Starting with time slice 200
Fig. 4.4. Cumulative average response time for the homogeneous environment
For every time slice, the framework maintains the total number of requests and
the sum of their response times till the end of that time slice. The cumulative
average response time is the ratio of these two numbers, i.e., total response
time / total requests. If during a time slice no requests are issued, then the
cumulative average response time remains constant. This explains the periodic
flat portions.
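A minimal sketch of this bookkeeping (the per-slice totals are assumed inputs):

def cumulative_avg(per_slice):
    # per_slice: list of (num_requests, total_response_time) per time slice.
    out, requests, resp = [], 0, 0.0
    for n, t in per_slice:
        requests += n
        resp += t
        out.append(resp / requests if requests else 0.0)
    return out

print(cumulative_avg([(10, 50.0), (0, 0.0), (5, 40.0)]))  # [5.0, 5.0, 6.0]

The middle slice issues no requests, so the cumulative average stays flat, as described above.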
In addition to EVEN and EVENc/B, these figures present the response
time for three other configurations. These correspond to:
• No-reorganization: this represents the base configuration that processes
requests without on-line reorganization.
• Optimal: this configuration assigns requests to the disks in a round-robin
manner, ignoring the placement of data and files referenced by each re-
quest. This configuration represents the theoretical lower bound on re-
sponse time that can be obtained from the 9 disk configuration.
• Heat-Based: this is an implementation of the re-organization algorithm
presented in [SWZ98]. Briefly, this algorithm monitors the heat [CAB+88]
of disks and migrates the fragment with the highest temperature from the
hottest disk to the coldest one if: a) the heat of the target disk after
this migration does not exceed the heat of the source disk, and b) the
hottest disk does not have a queue of pending requests (see the sketch
after this list). The heat of a fragment is defined as the number of block
accesses to the fragment per time unit, as computed using statistical
observation during some period of time. The temperature of a fragment is
the ratio between its heat and size. The heat of a disk is the sum of the
heat of its assigned fragments [CAB+88,KH93].
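A sketch of the migration test just described, under the two stated conditions (the data layout and names are illustrative assumptions):

def pick_migration(disks):
    # disks: dict disk_id -> {'heat': float, 'queue': int,
    #        'fragments': dict frag_id -> {'heat': float, 'size': float}}
    hottest = max(disks, key=lambda d: disks[d]['heat'])
    coldest = min(disks, key=lambda d: disks[d]['heat'])
    if disks[hottest]['queue'] > 0:
        return None  # condition (b): hottest disk must have no pending queue
    frags = disks[hottest]['fragments']
    # Temperature of a fragment = heat / size; pick the hottest-temperature one.
    frag = max(frags, key=lambda f: frags[f]['heat'] / frags[f]['size'])
    if disks[coldest]['heat'] + frags[frag]['heat'] > disks[hottest]['heat']:
        return None  # condition (a): target's heat after migration may not
                     # exceed the source disk's heat
    return frag, hottest, coldest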
Figures 4.4a and b show the cumulative average response time starting
with the 1st and 200th time slice, respectively. The former represents a cold
start, while the latter is a warm start after 20 hours of using the framework.
In both cases, ORE is a significant improvement when compared with no-
reorganization. (ORE refers to the framework consisting of the three possible
algorithms: EVEN, EVENc/B, and Heat-Based.) The peaks in this figure
correspond to the bursty arrival of requests which result in the formation of
queues. Even though Optimal assigns requests to the nodes in a round-robin
manner, it also observes formation of queues because many requests arrive
in a short span of time.
We also analyzed the performance of alternative algorithms on a daily
basis. This was done as follows. We set the cumulative average response time
to zero at midnight on each day. When compared with the theoretical Opti-
mal, ORE is slower by an order of magnitude. Figure 4.5 shows how inferior
EVEN, EVENc/B and Heat-Based are when compared with Optimal. The
y-axis on this figure is the percentage difference between an algorithm (say
EVEN) and Optimal. A large percentage difference is undesirable because it
is further away from the ideal. We show two different days, corresponding to
the best and worst observed performance. During day 2, ORE is 50 to 300
percent slower than the theoretical Optimal. During day 6, ORE is at times
several orders of magnitude slower than Optimal.
[Fig. 4.5. Percentage difference in cumulative average response time between
each algorithm (EVEN, EVENc/B, Heat-Based) and the theoretical Optimal,
shown for the days with the best and worst observed performance; x-axis:
Time Slice ID]
4.6a. Starting with time slice 1 4.6b. Starting with time slice 200
Fig. 4.6. Cumulative average response time for the heterogeneous environment
As before, the cumulative average response time is reset to zero at the beginning of each day,
12 am. Generally speaking, EVENc/B is superior to EVEN. In Figures 4.7a
and b, we show the percentage degradation relative to EVENc/B observed
for two different days, days 3 and 6. These correspond to the best and worst
observed performance with EVEN. During day 3, EVEN provides a perfor-
mance that is at times better than EVENc/B. During day 6, EVEN exhibits
a performance that is several orders of magnitude slower than
EVENc/B. In this case, no re-organization outperforms EVEN.
4.7a. Day 3    4.7b. Day 6
[Fig. 4.7. Percentage degradation of EVEN, Heat-Based, and No-reorganization
relative to EVENc/B for days 3 and 6; x-axis: Time Slice ID]
Acknowledgments
We wish to thank Anouar Jamoussi and Sandra Knight of BMC Software
for collecting and providing traces used in this study. We also thank William
Wang, Sivakumar Sethuraman, and Dinakar Yanamandala of USC for assist-
ing with the implementation of our simulation model.
References
[AKN+97] Aref, W., Kamel, I., Niranjan, T., Ghandeharizadeh, S., Disk schedul-
ing for displaying and recording video in non-linear news editing sys-
tems, Proc. Multimedia Computing and Networking Conference, SPIE
Proceedings, vol. 3020, 1997, 1003-1013.
[Bab79] Babb, E., Implementing a relational database by means of specialized
hardware, ACM Transactions on Database Systems 4(1), 1979, 1-29.
[BAC+90] Boral, H., Alexander, W., Clay, L., Copeland, G., Danforth, S.,
Franklin, M., Hart, B., Smith, M., Valduriez, P., Prototyping Bubba,
a highly parallel database system, IEEE Transactions on Knowledge
and Data Engineering 2(1), 1990, 4-24.
[BFG+95] Baru, C.K., Fecteau, G., Goyal, A., Hsiao, H., Jhingran, A., Pad-
manabhan, S., Copeland, G.P., Wilson, W.G., DB2 Parallel Edition,
IBM Systems Journal 34(2), 1995, 292-322.
[BGM+94] Berson, S., Ghandeharizadeh, S., Muntz, R.R., Ju, X., Staggered strip-
ing in multimedia information systems, Proc. ACM Special Interest
Group on Management of Data, Minneapolis, Minnesota, SIGMOD
Record 23(2), 1994, 79-90.
[Bra84] Bratbergsengen, K., Hashing methods and relational algebra oper-
ations, Proc. Very Large Databases Conference, Singapore, Morgan
Kaufmann, 1984, 323-333.
[BSC00] Bhatia, R., Sinha, R.K., Chen, C., Declustering using Golden Ratio
Sequences, Proc. 16th International Conference on Data Engineering,
San Diego, California, 2000, 271-280.
[CAB+88] Copeland, G., Alexander, W., Boughter, E., Keller, T., Data place-
ment in Bubba, Proc. ACM Special Interest Group on Management
of Data, Chicago, Illinois, SIGMOD Record 17(3), 1988, 99-108.
[CR93] Chen, L.-T., Rotem, D., Declustering objects for visualization, Proc.
Very Large Databases Conference, Dublin, Ireland, Morgan Kauf-
mann, 1993, 85-96.
[DG85] DeWitt, D.J., Gerber, R., Multiprocessor hash-based join algorithms,
Proc. Very Large Databases Conference, Stockholm, Sweden, Morgan
Kaufmann, 1985, 151-164.
[DGS+90] DeWitt, D., Ghandeharizadeh, S., Schneider, D., Bricker, A., Hsiao,
H., Rasmussen, R., The Gamma database machine project, IEEE
Transactions on Knowledge and Data Engineering 2(1), 1990, 44-62.
[DKO+84] DeWitt, D.J., Katz, R.H., Olken, F., Shapiro, L.D., Stonebraker,
M.R., Wood, D., Implementation techniques for main memory data-
base systems, ACM Special Interest Group on Management of Data
Record 14(2), 1984, 1-8.
[DS82] Du, H.C., Sobolewski, J.S., Disk allocation for Cartesian product files
on multiple-disk systems, ACM Transactions on Database Systems
7(1), 1982, 82-101.
[FB93] Faloutsos, C., Bhagwat, P., Declustering using fractals, Proc. 2nd In-
ternational Conference on Parallel and Distributed Information Sys-
tems, 1993, 18-25.
[FM89] Faloutsos, C., Metaxas, D., Declustering using error correcting codes,
Proc. Symp. on Principles of Database Systems, 1989, 253-258.
[GCK+93] Ghandeharizadeh, S., Choi, V., Ker, C., Lin, K., Design and imple-
mentation of the Omega object-based system, Proc. 4th Australian
Database Conference, 1993, 198-209.
[GD90] Ghandeharizadeh, S., DeWitt, D., A multiuser performance analysis
of alternative declustering strategies, Proc. 6th IEEE Data Engineer-
ing Conference, 1990, 466-475.
[GD92] Ghandeharizadeh, S., DeWitt, D., A performance analysis of al-
ternative multi-attribute declustering strategies, Proc. ACM Special
Interest Group on Management of Data, San Diego, California, SIG-
MOD Record 21(2), 1992, 29-38.
[GD94] Ghandeharizadeh, S., DeWitt, D.J., MAGIC: a multiattribute declus-
tering mechanism for multiprocessor database machines, IEEE Trans-
actions on Parallel and Distributed Systems 5(5), 1994, 509-524.
[GG97] Gray, J., Graefe, G., The Five-Minute Rule ten years later, and other
computer storage rules of thumb, ACM Special Interest Group on
Management of Data Record 26(4), 1997, 63-68.
[GGG+01] Ghandeharizadeh, S., Gao, S., Gahagan, C., Krauss, R., An on-line
reorganization framework for embedded SAN file systems, Submitted
for publication, 2001.
[Gib92] Gibson, G., Redundant disk arrays: reliable, parallel secondary stor-
age, The MIT Press, 1992.
[GIZ96] Ghandeharizadeh, S., Ierardi, D., Zimmermann, R., An algorithm for
disk space management to minimize seeks, Information Processing
Letters 57, 1996, 75-81.
[GIZ01] Ghandeharizadeh, S., Ierardi, D., Zimmermann, R., Management of
space in hierarchical storage systems, M. Arbib, J. Grethe (eds.), A
Guide to Neuroinformatics, Academic Press, 2001.
[GM94] Golubchik, L., Muntz, R.R., Fault tolerance issues in data decluster-
ing for parallel database systems, Data Engineering Bulletin 17(3),
1994, 14-28.
[GO93] Gottemukkala, V., Omiecinski, E., The sensible sharing approach
to a scalable, high-performance database system, Technical Report
GIT-CC-93-24, Georgia Institute of Technology, 1993.
[Gra93] Graefe, G., Query evaluation techniques for large databases, ACM
Computing Surveys 25(2), 1993, 73-170.
[Gra94a] Graefe, G., Sort-merge-join: an idea whose time has passed? Proc.
IEEE Conf. on Data Engineering, 1994, 406-417.
[Gra94b] Graefe, G., Volcano - an extensible and parallel query evaluation sys-
tem, IEEE 7ransactions on Knowledge and Data Engineering 6(1),
1994, 120-135.
[HD90] Hsiao, H., DeWitt, D., Chained declustering: a new availability strat-
egy for multiprocessor database machines, Proc. 6th International
Data Engineering Conference, 1990, 456-465.
[HD91] Hsiao, H.-I., DeWitt, D., A performance study of three high avail-
ability data replication strategies, Proc. 1st International Conference
on Parallel and Distributed Information Systems, 1991, 18-28.
[HL90] Hua, K., Lee, C., An adaptive data placement scheme for parallel
database computer systems, Proc. Very Large Databases Conference,
Brisbane, Australia, Morgan Kaufmann, 1990, 493-506.
[KH93] Katz, R.H., Hong, W., The performance of disk arrays in shared-
memory database machines, Distributed and Parallel Databases 1(2),
1993, 167-198.
[KP88] Kim, M.H., Pramanik, S., Optimal file distribution for partial match
retrieval, Proc. ACM Special Interest Group on Management of Data,
Chicago, Illinois, SIGMOD Record 17(3), 1988, 173-182.
[LKB87] Livny, M., Khoshafian, S., Boral, H., Multi-disk management algo-
rithms, Proc. 1987 ACM SIGMETRICS Conference on Measurement
and Modeling of Computer Systems, 1987, 69-77.
[LKO+00] Lee, M.L., Kitsuregawa, M., Ooi, B.C., Tan, K., Mondal, A., To-
wards self-tuning data placement in parallel database systems, Proc.
ACM Special Interest Group on Management of Data, Dallas, Texas,
SIGMOD Record 29(2), 2000, 225-236.
[LSR92] Li, J., Srivastava, J., Rotem, D., CMD: a multidimensional decluster-
ing method for parallel data systems, Proc. 18th Very Large
Databases Conference, Vancouver, Canada, Morgan Kaufmann,
1992, 3-14.
[MS98] Moon, B., Saltz, J., Scalability analysis of declustering methods for
multidimensional range queries, IEEE Transactions on Knowledge
and Data Engineering, 10(2), 1998, 310-327.
[NH84] Nievergelt, J., Hinterberger, H., The grid file: an adaptive, symmetric
multikey file structure, ACM Transactions on Database Systems 9(1),
1984, 38-71.
[NKT89] Nakano, M., Kitsuregawa, M., Takagi, M., Query execution for large
relation on functional disk system, Proc. 5th International Conference
on Data Engineering, Los Angeles, 1989, 159-167.
[OO85] Ozkarahan, E., Ouksel, M., Dynamic and order preserving data par-
titioning for database machines, Proc. Very Large Databases Confer-
ence, Stockholm, Sweden, 1985, 358-368.
[Ora94] Oracle & Digital, Oracle Parallel Server in Digital Environment,
Technical Report, Oracle Inc., 1994.
[PGK88] Patterson, D., Gibson, G., Katz, R., A case for Redundant Arrays
of Inexpensive Disks (RAID), Proc. ACM Special Interest Group on
Management of Data, Chicago, Illinois, SIGMOD Record 17(3), 1988,
109-116.
[RE78] Ries, D., Epstein, R., Evaluation of distribution criteria for dis-
tributed database systems, Technical Report UCB/ERL, Technical
Report M78/22, UC Berkeley, 1978.
[SD89] Schneider, D.A., DeWitt, D.J., A performance evaluation of four
parallel join algorithms in a shared-nothing multiprocessor environ-
ment, Proc. ACM Special Interest Group on Management of Data,
Portland, Oregon, SIGMOD Record 18(2), 1989, 110-121.

6. Advanced Database Systems
Gottfried Vossen
Abstract. Database systems have evolved into a ubiquitous tool in computer ap-
plications over the past 35 years; they offer comprehensive capabilities for stor-
ing, retrieving, querying, and processing data that allow them to interact efficiently
and appropriately with the information-system landscape found in present-day fed-
erated enterprise and Web-based environments. They are standard software on vir-
tually any computing platform, and they are increasingly used as an "embedded"
component in both large and small (software) systems (e.g., workflow management
systems, electronic commerce platforms, Web services, smart cards); they continue
to grow in importance as more and more data needs to get stored in a way that
supports efficient and application-oriented ways of processing. As the exploitation
of database technology increases, the capabilities and functionality of database sys-
tems need to keep pace. Advanced database systems try to meet the requirements
of present-day database applications by offering advanced functionality in terms of
data modeling, multimedia data type support, data integration capabilities, query
languages, system features, and interfaces to other worlds. This article surveys the
state-of-the-art in these areas.
1 Introduction
The practical need for efficient organization, creation, manipulation, and
maintenance of large collections of data, together with the recognition that
data about the real world, which is manipulated by application programs,
should be treated as an integrated resource independently of these programs,
has led to the development of database management. In a nutshell, a data-
base system consists of a piece of software, the database management system,
and some number of databases. Modern database systems are mostly client-
server systems where a database server is responding to requests coming
from clients; the latter could be end-users or applications (e.g., a browser on
a notebook, a query interface on a palmtop) or even application servers (e.g.,
a workflow management system, a Web server). Database systems have be-
come a fundamental tool in many applications over the past 35 years, ranging
from the original ones in administrative and business applications to more
recent ones in science and technology, and to current ones in electronic com-
merce and the World-Wide Web. They are now standard software on virtually
any computing platform; in particular relational database systems, that is,
database systems that are based on the relational model of data, are avail-
able on any type of machine, from laptops (or even personal digital assistants
and smartcards) to large-scale supercomputers. Moreover, database servers
and systems continue to grow in importance as more and more data needs to
get stored in a way that supports efficient and application-oriented ways of
processing.
Historically, database systems have started out in the late 60s as simple
data stores with a conceptual level added to file systems and could hence
provide an early form of data independence; the field was soon taken over
by relational systems in the 70s. However, the quest for departing from pure
relational systems has been around for more than 20 years; indeed, techni-
cal areas such as CAD/CAM or CASE have early on demanded so-called
"non-standard" database systems that departed from simple data types such
as numbers and character strings; later, applications such as geography or
astronomy requested an integration of images, text, audio, and video data.
Nowadays, database systems are strategic tools that are integrated into the
enterprise-wide landscape of software and its information-related processes
and workflows, and to this end can provide features such as user-defined
types, standardized data exchange formats, object semantics, user-defined
functions, rules, powerful query tools, and sophisticated transactional tech-
niques. They support a variety of applications from simple data tables, to
complex data integration from multiple sources, to analysis by warehousing,
to business processes through a close interaction even with workflow man-
agement systems. This requires a number of properties, and we try to give a
glimpse of these properties and features in this article.
Commercial vendors have been picking up on these developments for a
variety of reasons.
[Fig. 1.1. A federated information-system landscape: users interact with
clients, which connect to application servers, which in turn access database
servers and their databases]
Present-day information-system landscapes comprise a variety of database
and data servers, as indicated in Figure 1.1. These servers will typically be
heterogeneous in that they use different products, different interfaces, differ-
ent data models and access languages. In addition, the servers can differ in
their degree of autonomy in the sense that some may focus on the workload
of a specific business process (e.g., stock exchange), while less autonomous
servers may be willing to interact with other servers. We are not elaborating
here on the technical problems that have to be solved when living in a fed-
erated system [WV02], but we will use Figure 1.1 as a motivation for data
integration challenges we will come across.
Advanced database systems try to meet these requirements by offering
advanced functionality in terms of data modeling and integration support,
query languages, and system features; we will survey these areas below. Data
modeling essentially refers to the question of how to build a high-quality
database schema for a given application (and to maintain it over time under
evolution requests). Data modeling is commonly perceived as a multi-step
process that tries to integrate static and dynamic aspects derived from data
and functional requirements. Recent achievements here include the possibility
of taking integrity constraints into account, and of designing triggers or even
event-condition-action (ECA) rules that capture active aspects. There are now
methodologies that make it possible to design and model even advanced
applications in terms of a unified framework, which can then be declared and
used in an advanced database system.
The latter is due to the fact that a category of products called object-
relational (OR) is now common. Corresponding systems essentially combine
relational technology with object-oriented features.
[Figure: the layered architecture of a database system.
Interface Layer: data model and query language, host language interfaces,
other interfaces.
Language Processing Layer: view management, semantic integrity control,
authorization, language compiler, language interpreter, query decomposition,
query optimization, access plan generation.
Transaction Management Layer: access plan execution, transaction
generation, concurrency control, recovery.
Storage Management Layer: physical data structure management, buffer
management, disk accesses]
2 Preliminaries
Figure 2.2 shows an ER diagram for the relational database from Figure
2.1. As can be seen, some relations stem from entity types, while others are
derived from relationship types (and some optimizations have already been
applied). In this example, it is even easy to go from one representation to the
other in either direction. In many applications, doing forward engineering
is as important as being able to do reverse engineering, in which a given
database is "conceptualized" [FV95b], e.g., for the purpose of migrating to
another data model.
An important observation regarding the entries in a relational table or the
types of attributes is that there is no "built-in" restriction in the database
concept saying that data has to be numerical or to consist of simple character
strings only. Indeed, by taking a closer look at Figure 2.1 we see that the URL
of a document is essentially a path expression that represents a unique local
address; we can easily imagine the path to be even globally unique or being
computed via an expression that takes, for example, search parameters into
account or that itself has an inner structure. In other words, a data entry in
a table could as well be the description of a program, and by the same token
it could be of a completely different type: an image in gif or jpg format, an
mp3 music file, an avi video. We will see later what the impact of this will
be and how such unconventional types can be handled in a database that is
essentially relational.
For example, the selection

σDomainName='www.um.de'(Computer)

yields

IP-Address    DomainName   OSType
128.176.6.1   www.um.de    Unix
The three binary operations we introduce are as follows. Union as well as
difference are the usual set operations, applicable to relations that have the
same attributes and are thus "compatible". The natural join of relations R
and S combines the tuples of the operands into new ones according to equal
values for common attributes. For example, to compute address and name of
those users who have participated in a session, we can write
Now consider the query "who are the children of John?" In SQL, we write
this as
select Child from Parent where Name = 'John'
whereas in Datalog we simply write
?- Parent(John, X).
The answer, computed from the above two tables, will be X = {Jeff, Anthony}.
The important feature is to be able to define new relations intensionally,
i.e., through rules. To this end, let Father and Mother be intensional relations,
defined from the extensional (given) relations above through the following
rules:
We can now use intensional relations like extensional ones. For example, the
query "who is the mother of Annie?" is written as
?- Mother(X, Annie).
The answer is obtained by evaluating the right-hand side of the rule defining
the intensional relation in question, which is
Mother(X,Y) :- Person(X,_,female), Parent(X,Y).
Variable Y occurring in this rule is unified (equated) with value 'Annie', while
variable X is unified with value 'Margaret,' such that the following is obtained:
Person(Margaret, 32, female), Parent(Margaret, Annie).
This immediately gives the answer X = {Margaret}.
Another important feature is the possibility to make use of recursion in
Datalog, i.e., to let the same predicate occur in the body and in the head of a
rule. Consider the following rules defining predecessors, siblings, and cousins:
Predecessor(X,Y) :- Parent(X,Y).
Predecessor(X,Y) :- Parent(X,Z), Predecessor(Z,Y).
Sibling(X,Y) :- Parent(Z,X), Parent(Z,Y), not(X=Y).
Cousin(X,Y) :- Parent(X1,X), Parent(Y1,Y), Sibling(X1,Y1).
Cousin(X,Y) :- Parent(X1,X), Parent(Y1,Y), Cousin(X1,Y1).
An evaluation of these rules relative to the state shown earlier will yield the
following intensional relations (where attribute names are again shown for
clarity, but are not part of the definition):
Predecessor  Successor
John         Jeff
John         Anthony
Jeff         Margaret
Margaret     Annie
Anthony      Bill
John         Margaret
Jeff         Annie
John         Bill
John         Annie

Cousin1    Cousin2
Margaret   Bill
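The Predecessor relation above can be reproduced with a naive fixpoint computation over the Parent facts; the facts below are inferred from the shown results:

parent = {("John", "Jeff"), ("John", "Anthony"), ("Jeff", "Margaret"),
          ("Margaret", "Annie"), ("Anthony", "Bill")}

predecessor = set(parent)              # Predecessor(X,Y) :- Parent(X,Y).
while True:
    # Predecessor(X,Y) :- Parent(X,Z), Predecessor(Z,Y).
    new = {(x, y) for (x, z) in parent for (z2, y) in predecessor if z == z2}
    if new <= predecessor:             # fixpoint reached
        break
    predecessor |= new

print(sorted(predecessor))             # nine pairs, matching the table above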
Fig. 3.1. An ER diagram for the sample database (entities Computer, User,
and Document) using nested tuples (×) and sets (*)
Research went into the question of how to generalize the results obtained for the flat model
to the nested one, and proved theorems about language equivalence or com-
pleteness, expressive power and complexity. Of particular relevance to nested
algebras are the structure-manipulating operations nest and unnest.
Figure 3.1 shows an alternative representation of the information previ-
ously shown in Figure 2.2, in which some aspects are modeled more directly.
In particular, a document now has a set of URLs that can be reached from it,
as opposed to a recursive relationship structure needed earlier. In addition,
log entries now do not need a numbering any more, since the entries associ-
ated with a particular user are put in a nested relation, i.e., a set of tuples
of computer as well as document keys.
More generality and additional flexibility are achieved by allowing con-
structors (typically beyond tuple and set, i.e., including list, bag etc.) that
can be applied to (base or already derived) types in an orthogonal fashion.
Finally, if such "complex" types and their instantiations are combined with
type-specific behavior, we arrive at what is known as object-orientation.
The investigation and study of data models has followed at least three di-
rections over the years: The first focused on so-called complex objects, which
are typed objects recursively constructed from atomic objects using construc-
tors for tuples, sets, or other data structures (e.g., bags, lists, arrays). It was
soon recognized that complex structures alone are not sufficient. Indeed, an
increasing interest has recently been in modeling behavioral aspects of ob-
jects as well, and in encapsulating object structure and object behavior. This
has paved the way for including object-oriented features in databases, which
in turn has given rise to the other two directions: One focused on so-called
pure objects, in which basically everything in a database is considered as an
abstract object that has a unique identity. The schema of a database can
then be considered as a directed graph, whose nodes are class names, and
whose edges represent single- or multi-valued attributes. A database instance
becomes another directed graph, whose nodes represent objects, and whose
edges are references between these objects (i.e., attribute values). While such
a model is theoretically appealing, it appears too sophisticated for many real-
world applications; the third and currently most active direction therefore is
to distinguish between objects and their values, and to let only the former
have an identity. The exposition in this section will mostly center around this
latter direction, in particular since it nicely carries over to object-relational
structures as found in several present-day "universal server" systems.
Object-orientation has been recognized as an important new paradigm in
the area of programming languages ever since the arrival of the language Sim-
ula. It is roughly based on the following five fundamental principles [LV98]:
1. Each entity of the real world is modeled as an object which has an exis-
tence of its own, manifested in terms of a unique identifier (distinct from
its value).
Binary large objects (BLOBs) accommodate large unstructured objects like medical images or audio messages. BLOBs can typically hold up
to 2 GB of binary data. Most of the time, they are not directly stored in
tables, but are represented by descriptors, and they can be loaded directly
from files. With BLOBs, the definition of more complex user-defined types
is more tricky, since the BLOB's structure is typically hidden in the data
structure of a corresponding program that reads or writes the BLOB, and
the same applies to the BLOB's behavior. Indeed, a BLOB becomes useful
only through attached functions that "decompose" the BLOB as desired.
On the positive side, BLOBs can be used for any kind of data that would
normally not fit into the structures or the types of a (relational) system.
We should point out that for object-relational database systems it is nowa-
days common to provide predefined class libraries for specific applications,
e.g., text, HTML pages, audio data, video data, graphics, images. Informix
calls them DataBlades, while IBM calls them Database Extenders and Oracle
Cartridges.
In this graph representation, for example, the set {3,6,9} will have a root object
with three edges all labelled member to objects with values 3, 6, 9. More
complex graphs may represent nested collections, shared objects, cyclic struc-
tures, etc. Note that there are no a-priori restrictions on the structure: objects
may have any combinations of attributes, even repeated ones, collections may
be heterogeneous, and attributes may have any type.
The most popular model of semi-structured data is OEM (Object Ex-
change Model), originally developed for the Tsimmis data integration project
at Stanford University [CGH+94,Ull00]. The literature also contains a few
non-essential variations of this basic data model, e.g., labels can be placed on
nodes or on edges; see [ABS00] for a good account of the relevant literature.
At the physical level, semi-structured data depends on the application
at hand. For example, in applications like the integration of heterogeneous
sources, some external sources happen to be relational databases. Here the
mapping into the logical model of semi-structured data is easy; the hard
part is dealing with the fact that these sources often have limited access
capabilities to their data. Other sources, especially those on the Web, export
semi-structured data simply as text files. Each source has its own preferred
way of formatting the text file, and even that can change without notice;
writing wrappers to map such data into the logical level is a work-intensive
task and, to a serious extent, also an important research topic.
Research on semi-structured data has focused, among other issues, on
schema specification, schema extraction (from data), expressive power and
complexity of query languages, and optimizations. Several research proto-
types have been built and are publicly available [ABS00,FLM98,SV98]. On
the other hand, many discussions about models for and modeling of semi-
structured data have been ended by the arrival of XML, the Extensible
Markup Language, which represents an important linguistic framework for
describing data that is to be transported or exchanged without any regard
for layout. Essentially, XML is a meta-language, i.e., a language in which
other languages can be specified, yet it can readily be used for writing sim-
ple (and, of course, also complex) documents that describe (semi-structured
or structured) data. Although XML is already widely covered in books, e.g.,
[ABS00,Hoq00,CZ01], the "world" of XML is still changing at such a fast pace
that the interested reader is best advised to check the Web for the latest
information, for example at www.w3.org/xml.
A simple example of an XML document describing bibliographic infor-
mation is shown in Figure 3.5. Here, a bibliography can comprise books as
well as articles, and each such element can have an inner structure captured
by nested elements. For example, a book can have an author plus (possibly)
additional authors, and besides that has a title, a publisher, and a (publica-
tion) year. Elements start with an opening and end with a closing tag in a
similar way as this is done in other markup languages (e.g., LaTeX, HTML); if
every element has both an opening and a closing tag (and the tags are properly
nested), the document is called well-formed:
<bibliography>
<book>
<author> S. Abiteboul
<additional_author>
<name> R. Hull </name>
<name> V. Vianu </name>
</additional_author>
</author>
<title> Foundations of Databases </title>
<publisher> Addison-Wesley </publisher>
<year> 1995 </year>
</book>
<article>
<author> E.F. Codd </author>
<title>
A Relational Model of Data for Large Shared Data Banks
</title>
<journal> Communications of the ACM </journal>
<year> 1970 </year>
</article>
</bibliography>
[Fig. 3.6. Tree representation of the XML document of Figure 3.5: the
bibliography root has a book child (author S. Abiteboul with additional
authors R. Hull and V. Vianu, title, publisher, year 1995) and an article child
(author Codd, title, journal CACM, year 1970)]
Fig. 3.7. A DTD for the XML document containing book information
News agencies employ NewsML for publishing news that radio stations or newspa-
per publishers can subscribe to. Other examples of XML-based languages are
SMIL (Synchronized Multimedia Integration Language), MathML (Mathe-
matical Markup Language), WML (Wireless Markup Language), or BSML
(Bioinformatic Sequence Markup Language). More recently, XML Schema
has been proposed as a way of adding more database-like features, in partic-
ular type information, to a conceptual language or data structure specifica-
tion. XML schemas offer additional data types and features like inheritance
(or "derivation"), so that semi-structured data as well as many other appli-
cations can be adequately supported.
The relevance of XML to databases stems from several facts: First, XML
appears as an appropriate way of handling semi-structured data, and from
a terminological point of view resembles many database concepts: structure
vs. contents, schemas, or typing. We will see in the next section that another
such concept is declarative querying, which is perceived as a good way to ex-
plore large collections of XML documents. Second, as many XML documents
will be generated automatically, since XML is in many applications consid-
ered as a reasonable format for exchanging data, there is a growing need of
storing XML documents in a database. To this end, database vendors are
picking up and offer extensions to their systems or native XML support. We
refer the reader to [Kos99,CFP00,SSB+01] for further information. Third,
XML is easily coupled with programming languages such as Java. Indeed,
an easy transition from XML to Java can be accomplished using the Document Object Model (DOM).
4 Query Languages
The second major area we will survey for advanced database systems is the
wide field of query languages. In particular, we will look at three represen-
tative subareas here: object-based languages, rule-based languages, and pro-
cedural data; as we go along, we will also touch upon the issue of querying
XML documents.
• x.manufacturer.headquarter.street
(the street of the headquarter of the manufacturer of vehicle x)
• x.president.familyMembers.ownedVehicles.color
(the colors of the vehicles owned by the family members of the president
of company x)
When path expressions are used, establishing a desired navigation path may
be complex, for example due to the requirement that correct typing must
be obeyed. However, by picking up ideas from universal relation interfaces
that were developed during the 1980s, path expressions can be simplified con-
siderably, as has been demonstrated in [VV93a]. We sketch the central idea
next. Consider a schema like the one shown in Figure 4.1 and the definition
of a path expression given above. This definition has several consequences,
which impose unnecessary limitations on the usage of path expressions in
queries: They have to be specified "in full", i.e., it is not allowed that a
sequence A1.A2.....An of attributes is interrupted at any point. For exam-
ple, if we ask for the cc value of an automobile x, we would have to use a
path expression spanning several classes of the schema in Figure 4.1,
[Fig. 4.1. A sample object schema: class Automobile (a Vehicle) with
drivetrain (body, transmission, engine with hp and cc); class Vehicle with
model, manufacturer, color; class Employee (a Person) with name, age,
residence, ownedVehicles*, familyMembers*, qualifications*, salary; class
Company with name, headquarter (street, city), divisions*, president; class
Division with name, location, manager, employees]
where variable x stands for an employee and variable y creates a link from
vehicles (owned by employees as persons) to class Automobile. It is easily seen
that this formulation is far from perfect. Indeed, the [y]-selector enforces the
requirement that the vehicle must actually be an automobile. This could
be performed automatically by the underlying system if class names were
taken into account.
Since there is only one way to connect employees to cc values, the path
expression could even be simplified to

Employee.cc
Implicitly, this assumes inheritance links to be bidirectional, and the same
can be applied to aggregation links. The approach is further developed in
[VV93a], where a formal treatment as well as more examples can be found.
It demonstrates on the one hand again the impact that previous studies in
the context of the relational model may have in more advanced models; at
the same time it shows the higher potential associated with object-based
languages.
Features such as path expressions show up in basically every object-based
database language. A quasi-standard is currently provided by the ODMG
proposal OQL, an acronym for Object Query Language [CBB+00]. In brief,
OQL relies on the ODMG object model, extends SQL with object features
such as path expressions, polymorphism, or late binding, and provides high-
level primitives to deal with sets of objects. It is a functional language whose
operators can be freely composed, but it is not computationally complete.
OQL is a pure query language without update operators, which can be in-
voked from a programming language for which an ODMG binding is defined.
where wildcards * allow us to navigate the part of the data that has an
unknown or uncertain schema.
The study of semi-structured data and, somewhat related, that of the
Web have raised several new theoretical questions. The regular expressions
found in query languages for semi-structured data are particular instances of
recursion. While most problems for recursive queries are undecidable (con-
tainment, equivalence, etc.), it turns out that they become decidable for the
restricted class of regular path expression queries. This has motivated re-
search on the optimization and containment of queries with regular path ex-
pressions. The Web and its browsing-style computation has generated other
kinds of questions: What is a good model of computation on the Web? What
kind of queries can we ever hope to answer? Computations can only traverse
links in forward direction, and, since the Web is ever growing, queries will
never be able to exhaust a search. Such questions have generated interesting
research on the limitations of query computations on the Web [AV00,ABS00],
and have also started to generate a conceptual understanding of the Web
[KRR+00].
When looking at the tree structure of (semi-structured) documents (e.g.,
the movie database previously shown in Figure 3.4) and in particular at that
of an XML document, it is obvious that a common way of retrieving infor-
mation will be to specify how the tree should be traversed (typically starting
from the root) and what conditions have to be satisfied by subtrees in order
to be considered a match (or relevant to the result). For example, consider the
sample XML document from Figure 3.5 once more, whose tree representation
was shown in Figure 3.6. It is easy to imagine that this bibliography is much
larger, with many books and many articles; we can also imagine that the tree
structure is deeper nested, for example in order to reflect more information
about a publisher, keyword sections for articles, or even links between the
various publications of a particular author. "Queries" to such a tree would
then amount to descriptions of how to search through the tree. For example,
looking for all titles of books or articles occurring in the bibliography could
be written as follows:
bibliography//title
This path specification essentially asks for a tree traversal that starts from the
root of the document in question (bibliography) and that descends from there
to (the values of) all title elements, which could be at arbitrary depths (the
latter is indicated by "//"). In order to express selection conditions (e.g., "all
titles from publications by author Ullman") additional language constructs
would be needed. We mention that all this is provided in a language called
XSL (XML Stylesheet Language) [Kay01], which is not exactly a language for
specifying style sheets, i.e., layout information, but which is mainly a language
for describing tree transformations (in its XSLT portion), i.e., procedures for
transforming input XML trees into output XML trees.
As an example, assume that we had represented the Person class (actually
just a single instance from that class) from Figure 4.1 as an XML document as
shown in Figure 4.2 (where it is assumed that the particular person considered
owns three vehicles). Then an XSL program returning all vehicles of the
Person
4.3 XQuery
The aforementioned XSL is a language for "querying" XML documents that
closely follows the syntactical spirit of XML, but it is not as easy to use as a
common database language (under the assumption that a collection of XML
documents is considered a "database"). For this reason, alternative propos-
als have been under discussion for quite a while [STZ+99,BC00]. Recently, a
convergence of this discussion has been reached through XQuery, the XML
Query Language (see www. w3. org/TR/xquery as well as [CRL+02]) that is
vastly based on an earlier proposal called Quilt [CRF01j. XQuery knows a
number of different expressions for specifying queries, including path expres-
sions (in style of XSL, see above), element constructors (to make sure that
query results can conform to XML syntax), expressions involving operators
and functions, conditional or quantified expressions, list constructors, ex-
pressions that test or modify data types, and FLWR (pronounced "flower")
expressions.
A FLWR expression is reminiscent of an SQL query expression and gen-
erally consists of (up to) four types of clauses: a FOR clause, a LET clause,
a WHERE clause, and a RETURN clause (in this order). The first part of a
FLWR expression serves to bind values to one or more variables, where values
to be bound can be represented by expressions (e.g., by path expressions).
A FOR clause is used whenever iteration is needed, as each expression in a
FOR clause returns a list of nodes (from the XML document to which the
query is applied). The result of the entire clause then is a list of tuples, each
of which contains a binding for each of the variables. A LET clause serves
local binding purposes as in functional languages such as Scheme. A WHERE
clause acts as a filter for the binding tuples generated by preceding FOR and
LET clauses; only those for which the given WHERE predicates are true are
used to invoke the RETURN clause. The latter, finally, generates the output
of the FLWR expression, which may be a node, an ordered forest of nodes,
or just a value.
As an example, consider our XML document from Figures 3.5 (text) and
3.6 (tree) once more. The following FLWR expression lists the titles of books
from the bibliography that were published by Addison-Wesley in 1995:
FOR $b IN document("bibliography.xml")//book
WHERE $b/publisher = "Addison-Wesley" AND $b/year = "1995"
RETURN $b/title
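For illustration only, the same selection can be phrased over the document of Figure 3.5 with Python's standard XML library rather than XQuery (the file name is the one assumed in the query above):

import xml.etree.ElementTree as ET

doc = ET.parse("bibliography.xml")
for book in doc.getroot().iter("book"):
    # Mirror the WHERE clause: filter on publisher and year.
    if (book.findtext("publisher", "").strip() == "Addison-Wesley"
            and book.findtext("year", "").strip() == "1995"):
        # Mirror the RETURN clause: emit the title.
        print(book.findtext("title").strip())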
The language core seems stable, and database system vendors are already looking
at ways to support XQuery. Moreover, the responsible committees have made
a serious effort to "surround" XQuery with a number of other documents:
query requirements, query use cases, a data model, a syntax and a formal se-
mantics; all of these documents can be found at www.w3c.org/XML/Query. A
largely complete account of the currently studied interactions between XML
and databases, in particular database querying, is given by [Wil00].
in the traditional closed form and which is executed line by line) computes
the result in the variable that occurs in the last assignment (Xs):
The program relation storing this little program would have the following
contents:
a   σ1=2(S) × σ1=2(T)                  d   S
b   σ1=4(S × T) ∪ (S × S)              e   ∪
c   π1,2,3,4 σ2=6 σ4=5 (S × (T × S))   f   S

For example, an extract operation over this relation, which attaches subexpressions of the stored queries, yields tuples such as:

a d   S   σ1=2(S) × σ1=2(T)
b e   ∪   σ1=4(S × T) ∪ (S × S)
b e   ∪   σ1=4(S × T)
b e   ∪   S × S
b e   ∪   S × T
c f   S   π1,2,3,4 σ2=6 σ4=5 (S × (T × S))
c f   S   T × S

Rewriting a single occurrence of the column-4 expression in column 2 by T, i.e., π1,3,5 rewrite2:D4→T(R), yields:

a d   σ1=2(T) × σ1=2(T)
c f   π1,2,3,4 σ2=6 σ4=5 (T × (T × S))
c f   π1,2,3,4 σ2=6 σ4=5 (S × (T × T))
Similarly, π1,3,5 rewrite-all2:D4→T(R) rewrites all occurrences of an expres-
sion from column 4 in column 2 (simultaneously) by T (and projects as
before):

a d   σ1=2(T) × σ1=2(T)
b e   σ1=4(S × T) ∪ (S × S)
c f   π1,2,3,4 σ2=6 σ4=5 (T × (T × T))
Think of rewrite as operating on the parse-tree representation of a query: It
takes a subtree of a parse tree and replaces one or all occurrences of the sub-
tree by another subtree. The last new operator, eval, takes a query column
and attaches the result of evaluating all the queries in that column to the
given relation.
General definitions of these operators appear in [NVV+99], and it is
shown there that extract, rewrite, eval are primitive operators, i.e.,
MA is non-redundant, and that MA is a conservative extension of the
relational algebra, i.e., it coincides with the relational algebra on ordinary
databases.
As a concrete application, consider a bookstore database which is queried
over the Internet. Let queries be algebra expressions. Imagine we want to
monitor the database usage by maintaining a meta relation Log of type [D, Q(4)],
containing pairs (u, q), where u is a username and q is a query u has posed.
Our focus thus is on queries of arity 4 returning 4-ary relations, e.g., sets of
book records. The query "show the results of all queries posed by every user"
is expressed as

π1,3,4,5,6 eval2(Log)

Similarly, "show the results of all queries posed by Jones" is

π3,4,5,6 eval2 σ1='Jones'(Log)
More examples appear in [NVV+99].
Interestingly, for MA there is a "Codd theorem" in style of the one men-
tioned previously, as a meta calculus, restricted to safe expressions, can be
shown to be equivalent to MA. However, when compared to the reflective
algebra, there exists a limitation on the expressive power of MA, due to
its typed nature, since some computationally simple, well-typed queries are
not definable in MA. The intuitive reason is that the computation of such a
query requires untyped intermediate results.
4.5 Meta-SQL
In this subsection, we briefly sketch a practical meta-querying system called
Meta-SQL [VVV02], where stored queries are represented as syntax trees
in XML format. This representation allows us to use XSLT for a (syntacti-
cal) manipulation of stored queries. Many syntactical meta-queries can then
directly be expressed simply by allowing XSLT function calls within SQL
expressions. We note that it would be easy to substitute XSLT by XQuery
in this approach.
We consider relational databases as before, except that in a table columns
can now be of type "XML". In any row of such a table, the attribute cor-
responding to a column of type XML holds an XML document. To query
databases containing XML in this way, it seems natural to extend SQL by
allowing calls to XSLT functions, in the same way as extensible database sys-
tems extend SQL with calls to external functions. However, in these systems,
external functions have to be precompiled and registered before they can be
used. In Meta-SQL, the programmer merely includes the source of the needed
XSLT functions and can then call them directly.
As an example, consider a simplified system catalog table called Views
which contains view definitions. There is a column name of type string, holding
the view name, and a column def of type XML, holding the syntax tree of
the SQL query defining the view, in XML format. For example, over a movies
database, suppose we have a view DirRatings defined as follows:
create view DirRatings as
select director, avg(rating) as avgrat
from Movies group by director
Then catalog table Views would have a row with the value for name equal to
'DirRatings', and the value for def equal to the following XML document:
<query>
<select>
<sel-item>
<column>director</column>
</sel-item>
<sel-item>
<aggregate>
<avg/>
<column-ref>
<column>rating</column>
</column-ref>
</aggregate>
<alias>avgrat</alias>
</sel-item>
</select>
<from>
<table-ref>
<table>Movies</table>
</table-ref>
</from>
<group-by>
<column-ref>
<column>director</column>
</column-ref>
</group-by>
</query>
Clearly, for writing such XML representations of SQL expressions in a uni-
form way, an XML DTD is needed, which can be derived, for example, from a
BNF syntax for SQL such as the grammar given by Date [DD97]. The derived
DTD is given in [VVV02].
Now consider the (meta-) query "which queries do the most joins?" which
is to be applied to our Views table. For simplicity, let us identify the number
of joins an SQL query performs with the number of table names occurring
in it. To express this meta-query in Meta-SQL, we write an auxiliary XSLT
function count_tables, followed by an obvious SQL query calling this func-
tion:
function count_tables returns number
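The body of the function is omitted here; as a stand-in, the sketch below counts the <table> elements in a query's XML syntax tree, which is exactly the join measure adopted above, and indicates the shape of a Meta-SQL query around it (the query text is hypothetical):

import xml.etree.ElementTree as ET

def count_tables(def_xml: str) -> int:
    # Number of <table> elements anywhere in the syntax tree.
    return len(ET.fromstring(def_xml).findall(".//table"))

# Hypothetical Meta-SQL query calling the function:
#   select name from Views v
#   where count_tables(v.def) =
#         (select max(count_tables(w.def)) from Views w)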
The combination of SQL and XSLT just sketched provides a basic level
of expressive power, yet for more complex syntactical meta-queries SQL can
be enriched with XML variables which come in addition to SQL's standard
range variables. XML variables range over the subelements of an XML tree,
where the range can be narrowed by an XPath expression. XML variables thus
allow to go from an XML document to a set of XML documents. Conversely,
we also add XML aggregation, which allows us to go from a set of XML
documents to a single one. SQL combined with XSLT and enriched with
XML variables and aggregation offers all the expressive power needed for
ad-hoc (syntactical) meta-querying. To allow for a form of semantical meta-
querying as well, it suffices to add again an evaluation function that takes
the syntax tree of some query as input and produces the table resulting from
executing the query as output.
The resulting language Meta-SQL is compatible with modern SQL im-
plementations offered by contemporary extensible database systems. Indeed,
these systems support calls to external functions from within SQL expres-
sions, which allows us to implement the XSLT calls. Furthermore, XML vari-
ables and the evaluation function can be implemented using set-valued exter-
nal functions. XML aggregation, finally, can be implemented as a user-defined
aggregate function.
To conclude this section, we mention that, starting with the seminal paper
on HiLog [CKW93], the concept of schema querying has received considerable
attention in the recent database literature. Clearly, schema querying is a spe-
cial kind of meta-querying. For instance, SchemaSQL [LSS01] augments SQL
with generic variables ranging over table names, rows, and column names. It
is not difficult to simulate SchemaSQL in Meta-SQL. Although the focus here
has been on meta-querying as opposed to general XML querying, it should
be understood that Meta-SQL, even without eval, can serve as a general
query language for databases containing XML documents in addition to or-
dinary relational data. Its closeness to standard SQL and object-relational
processing is a major advantage.
Traditionally, databases have always been kept on magnetic disks, and since
disks are a relatively cheap storage medium these days, it has become common
to use a larger number of disks for storing the data of a database. If these
disks are uniform from a technical point of view, i.e., have the same access
times and storage capacities, it makes sense to organize data on these disks
in such a way that a speed-up in processing time is achieved. The common
way to do so is to distribute data over the available disks such that accesses
can be performed in parallel. However, there is a second aspect to be kept in
mind, that of protecting data against loss and corruption, so that it might
also be a good idea to keep a little redundancy among multiple available
disks.
The most successful form of disk-oriented data storage nowadays is the
disk array or the RAID architecture, which stands for Redundant Array of
Inexpensive Disks [CLG+94]. Its success is mainly due to the fact that it
allows for an adaptable balance between efficiency and safety. In its simplest
form, known as RAID-0, there are n disks which are accessible
through a single disk controller. In such a setting data can be stored so
that parallel access becomes an option. The common way to do so is by
striping data items in a bit-oriented or a block-oriented fashion. With bit-
wise striping, each byte to be stored is spread over several disks, each of
which takes another bit. For example, if there are 8 disks, each can take one
of the bits in a byte; since all 8 bits can be read in parallel, access is 8 times
faster than with a single disk. Under block-wise striping, consecutive storage
blocks are distributed over consecutive disks, usually in a circular fashion.
More precisely, for n disks the ith block of a file or a data set is stored on
disk (i mod n) + 1, or on disk ((i + j - 1) mod n) + 1 if storing the blocks
starts with block 0 on disk j > 1. This is illustrated in Figure 5.1, where
block a_0 is stored on disk 1, a_1 on disk 2, etc.; for the b blocks, storing
starts from disk 2.
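As a small illustration of this placement rule (a sketch; the three-disk
layout is an assumption borrowed from the description of Figure 5.1), the
following Python function returns the 1-based disk number for block i when
striping starts on disk j:

def raid0_disk(i: int, n: int, j: int = 1) -> int:
    # Disk (1-based) that stores block i of a data set striped over n disks,
    # when block 0 is placed on disk j.
    return ((i + j - 1) % n) + 1

# With n = 3: the a blocks start on disk 1, the b blocks on disk 2.
print([raid0_disk(i, 3) for i in range(6)])       # [1, 2, 3, 1, 2, 3]
print([raid0_disk(i, 3, j=2) for i in range(6)])  # [2, 3, 1, 2, 3, 1]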
Clearly, the approach just described increases throughput and shortens
access times through an exploitation of the available parallelism, but appar-
ently this scheme is sensitive to crashes of single or even multiple disks. There
are at least two work-arounds: replication of data, or keeping additional in-
formation through which data can be reconstructed in case of an error. RAID
levels higher than 0 can essentially be distinguished by the way they trade
space utilization for reliability.
A RAID-1 architecture uses mirror disks, so that only half of the available
disks can be used for storing data, and the other half is a copy of the first.
A disk and its mirror are together considered a logical disk. This principle is
illustrated in Figure 5.2, where the striping shown is again block-oriented as
in Figure 5.1. Apparently, RAID-1 is good for applications such as logging in
database systems, where high reliability is mandatory. The underlying idea
is that a disk and its mirror will rarely crash together. For reading data in a
RAID-1 setting, reading the corresponding disk (or its mirror if the disk has
crashed) suffices; for writing, both disks need to be accessed.
[Fig. 5.1. Block-wise striping of data blocks over the disks attached to a single disk controller]
[Fig. 5.2. RAID-1: block-wise striping as in Fig. 5.1, with each data disk paired with a mirror disk]
Other RAID levels partially give up reliability for the sake of space uti-
lization. In particular, RAID-2 uses bit striping; bits distributed over various
disks are additionally encoded, so that data bits are augmented with code
bits. The techniques used are related to those used for other storage com-
ponents as well (hence the name "memory-style ECC") and are often based
on Hamming codes or on block codes, which, for example, encode 4-bit data
in 7-bit code words and are then able to locate and correct single-bit errors.
Thus, 4-bit striping would require seven disks, four of which would take data
bits, the others the additional code bits.
RAID-3 makes use of the observation that it is generally easy for a disk
controller to detect whether one of the attached disks has crashed. If the
goal is just to detect a disk crash, a single parity bit per byte or half-byte
suffices, which would be set according to odd or even parity. For bit striping,
individual bits would again be stored on separate disks, and an extra disk
stores all parity bits. This is illustrated in Figure 5.3 for four data disks.
When data is read in the case shown in Figure 5.3, bits are read from all four
disks; the parity disk is not needed (unless a disk has crashed). However,
when data is written, all five disks need to be written.
[Fig. 5.3. RAID-3: bit striping over four data disks plus a separate parity disk]
RAID-4 uses block-oriented striping with parity; one parity block per set
of blocks from the other disks is kept on a separate disk. Reading data is now
faster than with RAID-3; however, the parity block can become a bottleneck
when writing data. RAID-5 is also block-oriented, now with distributed par-
ity which tries to avoid that bottleneck. Data and parity are distributed over
all disks; for example, with five disks the parity block for the nth set of blocks
is written to disk (n mod 5) + 1, while the data blocks are stored on the four
other disks. Finally, RAID-6 stores additional information to make the disk
array robust against a simultaneous crash of multiple disks; this is called P
+ Q redundancy. Reed-Solomon codes are used to protect an array against
a parallel crash of two disks, using two additional disks for encoding. Table
5.1 summarizes the various RAID levels.
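To make the parity mechanism concrete, here is a minimal sketch (an
illustration only, not any particular RAID implementation): the RAID-5
placement rule quoted above, and a bytewise XOR parity block from which any
single lost block of a stripe can be reconstructed.

from functools import reduce

def parity_disk(stripe: int, n_disks: int = 5) -> int:
    # RAID-5 with distributed parity: the parity block of the n-th set of
    # blocks goes to disk (n mod 5) + 1 (1-based), as described in the text.
    return (stripe % n_disks) + 1

def parity(blocks: list[bytes]) -> bytes:
    # Parity is the bytewise XOR of the blocks of one stripe.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"abcd", b"efgh", b"ijkl", b"mnop"]  # four data blocks of one stripe
p = parity(data)
# A crashed disk's block is recovered by XOR-ing the parity with the others:
assert parity([p, data[1], data[2], data[3]]) == data[0]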
As the data that is stored in a file system or a database grows, disk arrays
are becoming more and more popular. A trend for the near future seems to
be to make disks more and more "intelligent", so that, for example, searching
can be directed by the disk controller instead of the database or the oper-
ating system. Clearly, disk arrays are particularly suited for data intensive
applications that have to deal with versioning, temporal data, spatial data,
or more generally multimedia data. On the other hand, a clever and efficient
logical organization of the data in index structures is still crucial for achiev-
ing reasonable performance; see [GG98] for a survey on index structures, and
[Vit01] for one on external memory algorithms and data structures.
Since the latter is rarely the case, a temporal database provides system
support for time, and typically distinguishes several kinds of time:
1. Transaction time can be used to answer queries like "what was Mary's
rank on 10.12.1992", but on the other hand it only records activ-
ities on the database, not in the application. A tuple becomes valid as
soon as it is stored in the database.
2. Valid time tries to reflect the validity of a fact in the application at hand,
independent of the time at which this fact gets recorded in the database.
Our sample relation could now look as shown in Figure 5.6, where ∞ is
used to denote the fact that something is still valid. Notice that valid
time makes it possible to update data pro-actively, i.e., with an effect for
the future, but also retro-actively, i.e., with an effect for the past.
select { select-list }
from { relations-list}
where { conditions}
when { time-clauses }
in which the when clause is new. In this clause, several temporal comparison
operators may be used, including before, after, during, overlap, follows, or
precedes, which refer to time intervals. As an example, the query asking for
Mary's rank at the time Tom arrived is written as
select X.Rank
from Professor X, Professor Y
where X.Name = 'Mary' and Y.Name = 'Tom'
when X.interval overlap Y.interval
As can be seen, the time interval stored in a relation can now be accessed
via the .interval extension of the relation name in question. As TSQL (and
more recently TSQL2) gets standardized, we will see temporal capabilities
emerge as ordinary capabilities of database systems in the near future.
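The semantics of such temporal comparison operators is simple interval
arithmetic; the following sketch (hypothetical names, with ∞ standing for
"still valid" as in Figure 5.6) shows how overlap and before can be
evaluated on stored time intervals:

from dataclasses import dataclass

@dataclass
class Interval:
    start: float              # e.g., days since some epoch
    end: float                # float("inf") models "still valid"

    def overlap(self, other: "Interval") -> bool:
        # Two valid-time intervals overlap iff they share at least one point.
        return self.start <= other.end and other.start <= self.end

    def before(self, other: "Interval") -> bool:
        return self.end < other.start

mary, tom = Interval(0, float("inf")), Interval(100, float("inf"))
print(mary.overlap(tom))  # True: Mary's interval covers the time Tom arrived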
We next look at an advanced system functionality that has been of interest for
many years already, and that has only recently opened up to formal research.
formal research. Spatial data arises when spatial information has to be stored,
which is typically information in two or three dimensions. Examples include
maps, polygons, bodies, shapes, etc. A spatial database supports spatial data
as well as queries to such data, and provides suitable storage structures for
efficient storage and retrieval of spatial data. Applications include geographic
information systems, computer-aided design, cartography, medical imaging,
and more recently multimedia databases [SK98,Sub98,SJ96].
Data models for representing spatial data have several properties that
clearly distinguish them from classical data models:
For illustrating some of the problems with representing spatial data, we briefly
consider the so-called raster-graphics model. In this model, spatial informa-
tion is intensionally represented in discretized form, namely as a finite set of
raster points which are equally distributed over the object in question; this is
reminiscent of a raster graphics screen which is an integer grid of pixels each
of which can be switched on or off (i.e., be set to 1 or 0). Infinity is captured
in this model by assuming that for each point p, infinitely many points in
the neighborhood of p have the same properties as p. Now this model can
exhibit anomalies which are due to the absence of the properties of Euclidean
geometry.
For example, a straight line is represented in the raster model by two
of its raster points. In case a line does not exactly touch two points, it is
assumed that points that are "close" to the line can be used to represent it.
The following situation, illustrated in Figure 5.7, is now possible: Straight
line g1 is represented by points A and B, g2 by A and C, and g3 by D and
E. Apparently, g2 and g3 have an intersection point, which, however, is not
a raster point. So following the raster philosophy, the point closest to the
intersection is chosen as its representative; in the example shown, this is F.
Now as an intersection point, F needs to be a point on line g2; on the other
hand, since it is also a point of g1, it is also an intersection point of g1 and
g2. Therefore, g1 and g2 have two intersections (the other is A), which is
impossible from a classical geometric point of view.
We mention that there are other models for representing spatial data.
Moreover, many data structures exist for storing such multi-dimensional data
[GG98]. Efficient algorithms are then needed for answering typical queries,
which may be exact or partial match queries or, more often, range queries.
Imagine, for example, the data to represent a map of some region of the world;
then a range query might ask for all objects having a non-empty intersection
with a given range, e.g., "all cities along the shores of a river, with a distance
of at most 50 miles from a given point".
Fig. 5.7. Line intersections in the raster-graphics data model
Similar problems arise with image data, with complex graphics, and with
pictures, and the situation is technically made more complicated by the facts
that (i) often all these various types of data occur together, and (ii) pic-
tures may be silent or moving, i.e., video data. A major problem then is to
guarantee a continuous retrieval of a specific bandwidth for a certain period
of time, e.g., in applications such as on-demand video [EJH+97]. Another
problem area is given by image processing based on the contents of an image
database, which amounts to the task of not only retrieving images, but also
interpreting them or searching them for specific patterns. Finally, a combina-
tion of spatial data with temporal aspects has to deal with geometries that
change over time; if changes occur continuously, the data represents moving
objects, an area whose study has only just begun [GBE+00].
The final system capability we look into here is transactions. Database sys-
tems allow shared data access to multiple users and simultaneously provide
fault tolerance. In the 1970s, the transaction concept [Gra78] emerged as a
tool to achieve both purposes. The basic idea (the "ACID principle") is to
consider a given program operating on a database as a logical unit (Atomic-
ity), to require that it leaves the Consistency of the database invariant, to
process it as if the database was at its exclusive disposal (Isolation), and to
make sure that program effects survive later failures (Durability). To put this
to work, two services need to be provided: Concurrency control brings along
synchronization protocols which allow an efficient and correct access of mul-
tiple transactions to a shared database; recovery provides protocols that can
restore a consistent state of the database after a failure.
For the former, many options are available, including providing more oper-
ations, providing higher-level operations, providing more execution control
within and between transactions, or providing more transaction structure.
Structure, in turn, can refer to parallelism inside a transaction, it can refer
to transactions inside other transactions, or it can even refer to transactions
plus other operations inside other transactions. In essence, the goal thus is to
their specific properties, the study of which has only recently begun to address
such controversial issues as commit/abort vs. fail, compensation vs. undo, inter-
ruptability of long-running activities, coordination and collaboration even at
a transactional level, transactional vs. non-transactional tasks, decoupling
transactional properties (in particular atomicity and isolation) into appro-
priately small spheres, serializability vs. non-serializable (e.g., goal-correct)
executions [RMB+93,VV93b], and the distinction between local and global
correctness criteria [EV97].
To conclude our system considerations for databases, we briefly touch
upon an area that is of increasing interest these days, and that at the same
time uses database management systems as an "embedded" technology hardly
visible from the outside. Transactional workflows become particularly rele-
vant today in the context of the Internet, which is, among other things, a
platform for offering electronic services. While Internet services in the past
have widely relied on forms, present-day services are offered over the Web
and are more and more oriented towards an automated use of computers as
well as an automated exchange of documents between them. A Web service
aims at the provision of some kind of service that represents an interoperation
between multiple service providers. For example, one could think of a moving
service on the Web that combines a service for arranging the moving of fur-
niture with a service that orders a rental car and a service for changing the
address of the mover in various places. More common already are electronic
shopping services in which a catalog service is combined with a payment col-
lection service and a shipping service. In business-to-business scenarios, Web
services come in the form of marketplaces where buying, selling, and trading
within a certain community (e.g., the automotive industry) is automated.
Each service by itself typically relies on a database system.
From a conceptual viewpoint, each individual service could be perceived
as a workflow with its underlying transactional capabilities, so that the goal
becomes to integrate these workflows into a common one that can still provide
certain transactional guarantees. Thus, what was said above about advanced
transactions becomes readily applicable. On the other hand, there are more
conceptual problems to make Web services fly, including ways for uniform
communication so that services can talk to each other in a standardized way
(in particular beyond database system borders), or possibilities to describe,
publish, and find Web services conveniently and easily. A recent account of
the situation in this area is provided by [CGS01].
References
[AAA+96] Alonso, G., Agrawal, D., El Abbadi, A., Kamath, M., Günthor, R.,
Mohan, C., Advanced transaction models in workflow contexts, Proc.
12th IEEE Int. Conf. on Data Engineering, 1996, 574-581.
[AAH+97] Arpinar, I.B., Arpinar, S., Halici, U., Dogac, A., Correctness of work-
flows in the presence of concurrency, Proc. 3rd Int. Workshop on Next
[CRF01] Chamberlin, D., Robie, J., Florescu, D., Quilt: an XML query lan-
guage for heterogeneous data sources, Proc. 3rd Int. Workshop on the
Web and Databases (WebDB 2000), in [SV01].
[CRL+02] Cagle, K., Russell, M., Lopez, N., Maharry, D., Saran, R., Early
Adopter XQuery, Wrox Press, 2002.
[CW00] Chaudhuri, S., Weikum, G., Rethinking database system architecture:
towards a self-tuning RISC-style database system, Proc. 26th Int.
Conf. on Very Large Data Bases, 2000, 1-10.
[CZ01] Chaudhri, A.B., Zicari, R., Succeeding with object databases - a prac-
tical look at today's implementations with Java and XML, John Wiley
& Sons, New York, 2001.
[DD97] Date, C.J., Darwen, H., A guide to the SQL standard, Addison-
Wesley, Reading, MA, 4th edition, 1997.
[DFS99] Deutsch, A., Fernandez, M.F., Suciu, D., Storing semistructured data
with STORED, Proc. ACM SIGMOD International Conference on
Management of Data, 1999, 431-442.
[EJH+97] Elmagarmid, A.K., Jiang H., Helal A.A., Joshi A., Ahmed M., Video
database systems - issues, products, and applications, Kluwer Aca-
demic Publishers, 1997.
[EJS98] Etzion, O., Jajodia, S., Sripada, S. (eds.), Temporal databases:
research and practice, Lecture Notes in Computer Science 1399,
Springer-Verlag, Berlin, 1998.
[Elm92] Elmagarmid, A.K. (ed.), Database transaction models for advanced
applications, Morgan Kaufmann Publishers, San Francisco, CA, 1992.
[EN00] Elmasri, R., Navathe, S.B., Fundamentals of database systems,
Addison-Wesley, Reading, MA, 3rd edition, 2000.
[EV97] Ebert, J., Vossen, G., I-serializability: generalized correctness for
transaction-based environments, Information Processing Letters 63,
1997, 221-227.
[FLM98] Florescu, D., Levy, A., Mendelzon, A., Database techniques for the
World-Wide Web: a survey, ACM SIGMOD Record 27(3), 1998,59-
75.
[FST00] Fernandez, M.F., Suciu, D., Tan, W.C., SilkRoute: trading between
relations and XML, Computer Networks 33, 2000, 723-745.
[FV95a] Fahrner, C., Vossen, G., A survey of database design transformations
based on the Entity-Relationship model, Data & Knowledge Engi-
neering 15, 1995, 213-250.
[FV95b] Fahrner, C., Vossen, G., Transforming relational database schemas
into object-oriented schemas according to ODMG-93, Lecture Notes
in Computer Science 1013, 4th Int. Conf. on Deductive and Object-
Oriented Databases, Springer-Verlag, Berlin, 1995, 429-446.
[Gar98] Gardner S.R., Building the data warehouse, Communications of the
ACM 41(9), 1998, 52-60.
[GBE+00] Güting, R.H., Böhlen, M.H., Erwig, M., Jensen, C.S., Lorentzos, N.A.,
Schneider, M., Vazirgiannis, M., A foundation for representing and
querying moving objects, ACM Transactions on Database Systems
25, 2000, 1-42.
[GG98] Gaede, V., Günther, O., Multidimensional access methods, ACM
Computing Surveys 30, 1998, 170-231.
[GHK+94] Georgakopoulos, D., Hornick, M., Krychniak, P., Manola, F., Specifi-
cation and management of extended transactions in a programmable
transaction environment, Proc. 10th IEEE Int. Conf. on Data Engi-
neering, 1994, 462-473.
[GHM96] Georgakopoulos, D., Hornick, M., Manola, F., Customizing transac-
tion models and mechanisms in a programmable environment sup-
porting reliable workflow automation, IEEE Trans. Knowledge and
Data Engineering 8, 1996, 630-649.
[GHS95] Georgakopoulos, D., Hornick, M., Sheth, A., An overview of workflow
management: from process modeling to workflow automation infras-
tructure, Distributed and Parallel Databases 3, 1995, 119-153.
[Gra78] Gray, J., Notes on data base operating systems, R. Bayer, M.R. Gra-
ham, G. Seegmüller (eds.), Operating systems - an advanced course,
Lecture Notes in Computer Science 60, Springer-Verlag, Berlin, 1978,
393-481.
[GUW02] Garcia-Molina, H., Ullman, J.D., Widom, J., Database systems: the
complete book, Prentice Hall, Upper Saddle River, NJ, 2002.
[Hoq00] Hoque, R., XML for real programmers, Morgan Kaufmann Publish-
ers, San Francisco, CA, 2000.
[JS82] Jäschke, G., Schek, H.J., Remarks on the algebra of non first nor-
mal form relations, Proc. 1st ACM SIGACT-SIGMOD Symposium
on Principles of Database Systems, 1982, 124-138.
[Kay01] Kay, M., XSLT programmer's reference, 2nd edition, Wrox Press,
2001.
[KKS92] Kifer, M., Kim, W., Sagiv, Y., Querying object-oriented databases,
Proc. ACM SIGMOD Int. Conf. on Management of Data, 1992, 393-
402.
[KLW95] Kifer, M., Lausen, G., Wu, J., Logical foundations of object-oriented
and frame-based languages, Journal of the ACM 42, 1995, 741-843.
[KM94] Kemper A., Moerkotte G., Object-oriented database management -
applications in engineering and computer science, Englewood-Cliffs,
NJ, Prentice-Hall, 1994.
[Kor95] Korth, H.F., The double life of the transaction abstraction: funda-
mental principle and evolving system concept, Proc. 21st Int. Conf.
on Very Large Data Bases, 1995, 2-6.
[Kos99] Kossmann, D. (ed.), Special issue on XML, Bulletin of the IEEE
Technical Committee on Data Engineering 22(3), 1999.
[KRR+00] Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins,
A., Upfal, E., The web as a graph, Proc. 19th ACM SIGMOD-
SIGACT-SIGART Symp. on Principles of Database Systems, 2000,
1-10.
[Liu99] Liu, M., Deductive database languages: problems and solutions, ACM
Computing Surveys 31, 1999, 27-62.
[LR00] Leymann, F., Roller, D., Production workflow - concepts and tech-
niques, Prentice Hall, Upper Saddle River, NJ, 2000.
[LSS01] Lakshmanan, L.V.S., Sadri, F., Subramanian, I.N., SchemaSQL: an
extension of SQL for multidatabase interoperability, ACM Transac-
tions on Database Systems 26, 2001, 476-519.
[LV98] Lausen, G., Vossen, G., Object-oriented databases: models and lan-
guages, Addison-Wesley, Harlow, UK, 1998.
7. Parallel and Distributed Multimedia Database Systems
Odej Kao
Abstract. This chapter presents an introduction to the area of parallel and dis-
tributed multimedia database systems. The first part describes the characteristics
of multimedia data and depicts the storage and annotation of such data in con-
ventional and in multimedia databases. The main aim is to explain the process of
multimedia retrieval by using images as an example. The related computational,
storage, and network requirements create an urgent need for the integration of
parallel and distributed computer architectures in modern multimedia information
systems. Different hardware and software aspects have to be examined, for example
the partitioning of multimedia data and the distribution over multiple nodes have
a decisive impact on the performance, efficiency, and the usability of such multime-
dia databases. Other distributed aspects such as streaming techniques, proxy and
client issues, security, etc. are only briefly mentioned and are not in the focus of this
chapter. The last section gives an overview over an existing cluster-based prototype
for image retrieval named CAIRO.
1 Introduction
1 www.wapforum.org
2 www.umts-forum.org
7. Parallel and Distributed Multimedia Database Systems 287
The largest operational areas for multimedia applications are still the
mass information systems and marketing communications. The first group
incorporates information systems at heavily frequented, public areas, such
as railway stations, airports, etc. Furthermore, newspapers, magazines, and
books are published in digital form. A part of these is used for advertisements,
while another part co-exists with the traditional printed media. Product-
catalogues are found at many points of sale and enable a fast overview
of the products offered and their prices. Electronic stores, reservation, and
booking sites at terminals as well as on the Internet supplement these sys-
tems. Detailed outlines of multimedia applications are found, for example, in
[Fur99,Ste00,GJM97].
The development of digital technologies and applications allows the pro-
duction of huge amounts of multimedia data. The scope and spread of doc-
ument management systems, digital libraries, photo archives used by public
authorities, hospitals, corporations, etc., as well as satellite and surveillance
photos, grow day by day. Each year, petabytes of multimedia data are
produced, with an increasing tendency. All this information has to be system-
atically collected, registered, stored, organised, and classified. Furthermore,
search procedures, methods to formulate queries, and ways to visualise the
results, have to be provided. For this purpose a large number of prototypes
and operational multimedia database management systems are available.
This chapter concerns mainly parallel and distributed aspects - hardware
architectures as well as data engineering - for such multimedia databases. It
is organised as follows: after the introduction of the basic properties of mul-
timedia objects together with the accompanying methods for compression,
content analysis and processing (Section 2, 3.1), the storage and management
of such data in traditional and multimedia database systems are discussed in
Section 4. Thereby, existing data models, algorithms, and structures for mul-
timedia retrieval are presented and explained by considering image retrieval
as an example (Section 5, 6).
The analysis of the related storage, computational, and bandwidth re-
quirements in Section 7 shows that powerful parallel and distributed archi-
tectures and database systems are necessary for the organisation of the huge
archives with multimedia data and the implementation of novel retrieval
approaches, for example an object-based similarity search. Therefore, the
properties and requirements of distributed multimedia applications, such as
Video-on-Demand servers and federated multimedia databases are described
in Section 8.
The parallel and distributed processing of multimedia data is depicted
in greater detail in the last part of the chapter by considering an image da-
tabase as an example. The main attention is given to the partitioning, the
distribution, and the processing of the multimedia data over the available
database nodes, as these methods have a major impact on the speedup and
the efficiency of the parallel and distributed multimedia databases. Section
9.1 gives an overview over some existing approaches for partitioning of im-
ages, whereas Section 9.5 explains the functionality of dynamic distribution
strategies. Section 10 closes this chapter with a case study of a cluster-based
prototype for image retrieval named CAIRO.
2 Media Fundamentals
The foundation of the entire construct "multimedia" is the media contained
therein; these media are called multimedia objects. An often used classification
divides these into
2.1 Images
2.2 Video
3.1 MPEG I
these frame types. The B-Frames cannot be used as reference images for
other frames.
Figure 3.2 depicts the different frame types and the relations between
them. Not used for the prediction of other pictures are the D-Frames, which
allow simple fast forward mode.
The general structure of all frame types is identical, thus no further differ-
entiation beyond the three classes mentioned is necessary. Each frame consists
of an introductory part - so called header - and a body. The header contains
information about time, coding, and the frame type. The frame body consists
of at least one slice, which can be separated into macro blocks. Each of these
blocks is compounded of 16×16 pixels and can be further subdivided into
8×8 blocks.
Coding of the video stream. The MPEG I coding method for video
streams is based on six different processing levels, which are graphically de-
picted in Fig. 3.3.
Motion compensation is used in order to eliminate the multiple coding
of the redundant information in succeeding frames. Thus, it is necessary to
identify the spatial redundancy present in each frame of the video sequence.
This static information is subsequently supplemented by the changing parts
of the frame and transmitted.
Two translation vectors, also called motion vectors, describe the estimated
motion. These contain the number of pixels in x- and y-direction, which are
used for the offset calculation of the examined region in the next frame. The
combination of the offset values and of the co-ordinates of the region in the
reference image gives the new position of the region. In the case of MPEG I
coding, not objects but similar 8×8 blocks are searched for in the neighbouring
frames. The new position of these blocks can be interpolated with sub-pixel
accuracy. Well-known methods are 2D search, logarithmic search, and tele-
scopic search [172].
The foundation for the MPEG compression is the two-dimensional Dis-
crete Cosine Transformation (DCT). The DCT is a lossless, reversible trans-
formation converting spatial amplitude data into spatial frequency data. For
Fig. 3.3. Levels of the MPEG I coding process (from the raw video stream to the coded video stream)
[Figure: zig-zag ordering of the 64 DCT coefficients of an 8×8 block, starting at the DC component and running through the AC components]
This process leads to a suitable starting position for the data reduction,
which is performed using the well-known methods RLC (Run Length Coding)
and VLC (Variable Length Coding) [Huf52]. Thereby only the values
different from zero and the number of zero values between them are consid-
ered, i.e. pairs of the form (length of the zero run, next non-zero value) are
generated.
These pairs serve as input data for the next processing level, where the
VLC is applied. VLC identifies common patterns in the data and uses fewer
bits to represent frequently occurring values. The coding of the DC com-
ponents is realised using a difference building approach: for each 8 x 8 block
the difference between the current DC component and the DC component
of the preceding 8 x 8 block is calculated and coded. The already run length
encoded AC components are subsequently represented by a VLC code: the
MPEG standard provides an exhaustive table with VLC codes for every pos-
sible value combination.
A detailed description of the MPEG I compression process can be found
for example in the MPEG standard [172]. It also contains further information
about the bit representation of the introduced codes and other coding-related
attributes.
The latest MPEG standard at the moment is called MPEG VII. It has
been designed for communication purposes and offers new methods for the
description of the media content by using manually, semi-automatically or
automatically generated meta-information. The existing coding methods are
extended by an additional track, which includes all meta-information. This is
used to improve the retrieval and presentation properties, for the maintenance
of the data consistency, etc. Moreover, besides the descriptive elements, the
MPEG VII system parts focus on compression and filtering issues which are
a key element of the MPEG VII application in distributed systems. The
standard, however, does not provide any details about the methods for the
extraction or creation of the meta-information. This is also true for algorithms
for retrieval of the information.
An overview of the basic principles of the MPEG VII data model is given
in Section 5.
pared, and combined in a final result. The corresponding raw data is then
determined and displayed. Discourses on data modelling, modified query lan-
guages, and further analysis are found, among others, in [KB96].
The advantages of object orientation, opposed to relational database sys-
tems, are the result of supporting complex objects, i.e. the media can be
treated and processed as a unit. This includes modelling, meta-information
management, and storing the complex content. Extended relational and ob-
ject oriented data models add some concepts of object oriented data models
to the classical relational data model, thereby reducing the known drawbacks
of these approaches.
Neglecting to demand content-based search procedures leads to the fact
that a series of databases with multimedia content are falsely designated mul-
timedia databases. Some examples for such databases with media are found
in the following enumeration [KB96]:
Often used data sets are moved upwards in the hierarchy. In the mean-
time HSM is just a small part of the architectural characteristics demanded:
sufficient computing resources are necessary for a proper search in the mul-
timedia stock, high-speed networks are used to transfer the media to the
processing and to the presentation components. Distributing the data among
several nodes in a suitable way can not only increase processing speed by
using parallelism, but also make valuable features possible. Further technical
problems are connected to the introduction of new architectures, such as
efficient backup systems, fault tolerance and thus the provision of redundancy,
workload balancing, etc.
The main difference between traditional and multimedia databases is a
result of the complex media content: analysing, describing, and sorting of
the media, as well as deriving similarity statements, are orders of magnitude
more difficult than with the corresponding operations on alphanumerical data.
This requires the following aspects to be considered:
[Figure: levels of media description, from feature extraction up to data models]
Physical level representation is the lowest level in the hierarchy and con-
tains the raw data of the image and its objects.
Logical level representation is above the physical layer and contains the
logical attributes of an image.
Semantic level representation makes it possible to model the different,
user-dependent layers onto the data, as well as synthesising the semantic
query features by derived logical and meta-features.
The main problem when using these models is that methods for extracting
the meta-information from a higher level of abstraction are not available.
query and the analysed media. This process is repeated for all n media in
the database, resulting in a similarity ranking. The first k entries, k being
a user-defined constant, represent the k best hits, whose raw data are then
displayed. The comparison process can be accelerated by using index struc-
tures. These contain features extracted a priori, and are organised in such a
way that the comparisons can be focused on a certain area around the query.
In the following, a selection of querying techniques, extraction methods,
index structures, and metrics is presented, using the example of image re-
trieval.
A priori feature extraction: in this case, only pre-defined features are al-
lowed in the query, so that the stored images are not processed. These fea-
tures were extracted during insertion in the database, and can be searched
in the form of an index tree.
Dynamic feature extraction: this is a more flexible approach, where the
user marks relevant elements in the sample image as querying parameters
for the similarity search. This could be, for example, a person's face, or
an object. Then, all media in the database are searched for this feature.
Combining the a priori and dynamic feature extraction: some stan-
dard features and the corresponding index trees are computed during
insertion. These are then completed with user-defined features during
the search.
Some well-known image features and extraction methods will now be pre-
sented.
$c_i$. The fundamental algorithms are adapted to the used colour model,
the current application, the characteristics of the given image class, etc.
The result is a set $C = \{c_1, c_2, \ldots, c_n\}$, where $n \in \mathbb{N}$ and
$c_i \cap c_j = \emptyset$ for all $i, j \in [1, n]$, $i \neq j$.
Computing a histogram: after the colour cells are determined, a his-
togram is computed for each image in the database. The colour value
of each pixel is converted in a reference colour with a given metric and
the counter of this colour cell is incremented.
Comparing histograms: this step is performed during the runtime of the
image retrieval and is used to determine the similarity of a query image
and a stored image. The histogram of the query image needs to be com-
puted first. Then, this histogram is compared with the histograms of all
stored images using a given metric. The results are sorted and yield a
ranking of the hits.
$$f_b = (\lambda_1^b, \lambda_2^b, \ldots, \lambda_n^b), \qquad (2)$$
where $\lambda_i^b$ is the relative frequency of the reference colour $\lambda_i$ in the image $b$.
The similarity $A(b_1, b_2)$ between two images $b_1$ and $b_2$ corresponds to the
Euclidean Distance of the two colour vectors:
$$A(b_1, b_2) = \|f_{b_1} - f_{b_2}\|_2 = \sqrt{\sum_{i=1}^{n} (\lambda_i^{b_1} - \lambda_i^{b_2})^2}. \qquad (3)$$
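A minimal sketch of the whole pipeline (illustrative only; it assumes that
each pixel has already been mapped to its reference-colour cell):

import math

def histogram(cells: list[int], n: int) -> list[float]:
    # Relative frequency of each of the n reference colours in one image.
    counts = [0] * n
    for c in cells:
        counts[c] += 1
    return [x / len(cells) for x in counts]

def distance(f1: list[float], f2: list[float]) -> float:
    # Euclidean distance of two colour vectors, Equation (3); 0 = identical.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))

query = histogram([0, 0, 1, 2], n=3)
stored = histogram([0, 1, 1, 2], n=3)
print(distance(query, stored))  # small value = similar colour distribution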
Similar procedures are used among others by the QBIC [ABF+95] and
the ARBIRS [Gon98] system. Many systems combine the histogram informa-
tion with other features, thus increasing robustness and precision.
Calculation of statistical colour moments [SO95] is a further approach
for describing colour distributions. Usually the first moment, as well as the
second and third central moments are used, since these represent the average
intensity $E_i$, the variance $\sigma_i$, and the skewness $s_i$ of each colour channel.
These are computed for the $i$-th colour channel of the $j$-th pixel $p_{ij}$ of the
image $b$ with $N$ pixels as follows:
$$E_i = \frac{1}{N} \sum_{j=1}^{N} p_{ij}, \qquad (5)$$
$$\sigma_i = \Big( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - E_i)^2 \Big)^{1/2}, \qquad (6)$$
$$s_i = \Big( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - E_i)^3 \Big)^{1/3}. \qquad (7)$$
For comparing two images $b_1 = (E_i^{b_1}, \sigma_i^{b_1}, s_i^{b_1})$ and $b_2 = (E_i^{b_2}, \sigma_i^{b_2}, s_i^{b_2})$
with $r$ colour channels each, a weighted similarity function $L_f$ is introduced:
$$L_f(b_1, b_2) = \sum_{i=1}^{r} \big( w_{i1}\,|E_i^{b_1} - E_i^{b_2}| + w_{i2}\,|\sigma_i^{b_1} - \sigma_i^{b_2}| + w_{i3}\,|s_i^{b_1} - s_i^{b_2}| \big). \qquad (8)$$
The weights $w_{i1}, w_{i2}, w_{i3} \geq 0$ are user-defined and serve to adapt the
similarity function $L_f$ to the current application.
A description of contours or image segments can serve as a measurable
and comparable image description, if contours are extracted from the
image. Again, many different methods exist for representing these segments.
The QBIC System uses 18 different parameters, such as [NBE+93]:
• Area is the number of pixels within an enclosed region.
• Circularity is computed from the quotient of the square of the circumfer-
ence and the area.
• Direction of the largest eigenvector of the covariance matrix.
• Eccentricity is the relation of the length of the smallest to the length of
the largest eigenvector.
• Algebraic moments are the eigenvalues of a set of pre-defined matrices.
MEHROTRA and GARY use polygon lines for describing contours [MG95].
The polygon nodes can be the nodes of a line strip, approximating the con-
tour, or nodes computed from the features of the contour, such as the points
of largest curvature. So a contour is displayed as a sequence of so called
interest points.
Another category is made up of the texture-based features. The most
often used texture characteristics are computed from the co-occurrence matrix.
ASENDORF and HERMES offer a survey of the different features [AH96].
The following are well-suited for classifications (Equations are taken from
[HKK+95]):
1. Energy:
$$f_1 = \sum_i \sum_j p(i,j)^2. \qquad (9)$$
2. Contrast:
$$f_2 = \sum_i \sum_j (i - j)^2 \, p(i,j). \qquad (10)$$
3. Correlation:
$$f_3 = \frac{\sum_i \sum_j (ij) \, p(i,j) - \mu_x \mu_y}{\sigma_x \sigma_y}. \qquad (11)$$
4. Variance:
$$f_4 = \sum_i \sum_j (i - \mu)^2 \, p(i,j). \qquad (12)$$
5. Entropy:
$$f_5 = -\sum_i \sum_j p(i,j) \log(p(i,j)). \qquad (13)$$
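As an illustration of how such texture features are evaluated (a sketch
assuming a normalised grey-level co-occurrence matrix whose entries sum
to 1):

import math

def texture_features(p: list[list[float]]) -> dict[str, float]:
    # Energy (9), contrast (10), and entropy (13) of a normalised
    # co-occurrence matrix p.
    energy = sum(v * v for row in p for v in row)
    contrast = sum((i - j) ** 2 * p[i][j]
                   for i in range(len(p)) for j in range(len(p)))
    entropy = -sum(v * math.log(v) for row in p for v in row if v > 0)
    return {"energy": energy, "contrast": contrast, "entropy": entropy}

print(texture_features([[0.4, 0.1], [0.1, 0.4]]))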
in an attribute vector, and stored in the database. The same number of co-
efficients is also used for the query image or sketch, so that the similarity
of two images can be determined by computing an adapted difference of the
corresponding wavelet vectors. An exact description of the criteria used to
select the coefficients, as well as a comparison metric and weights, can be
found in [JFS95,WHH+99]. Figure 6.2 and Figure 6.3 in Section 6.4 show
examples for image retrieval with wavelet coefficients.
6.3 Metrics
The similarity of two images in the content-based retrieval process is deter-
mined by comparing the representing feature vectors. A set of metrics and
similarity functions was developed for this purpose. They can be classified as
follows [JMC95]:
• Distance-based measures,
• Set-theoretic-based measures, and
• Signal Detection Theory-based measures.
$$d_r(x, y) = \left[ \sum_{i=1}^{n} |x_i - y_i|^r \right]^{1/r}, \quad r \geq 1, \qquad (14)$$
Fig.6.1. Example for template matching: (a) Manually selected region of interest
represented by MBR; (b) Search for the object in an unsuitable image; (c) The
object is found in a different environment
where $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ are arbitrary points in
an $n$-dimensional space. The fuzzy logic based MINKOWSKI r-METRIC re-
places the component subtraction by subtracting the corresponding member-
ship functions $\mu(x_i)$ and $\mu(y_i)$.
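For instance, the Minkowski r-metric of Equation (14) can be written down
directly (a sketch; r = 2 yields the Euclidean distance used above):

def minkowski(x: list[float], y: list[float], r: float = 2.0) -> float:
    # Equation (14): the r-th root of the sum of the r-th powers of the
    # componentwise absolute differences.
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

print(minkowski([0.0, 1.0], [3.0, 5.0]))         # 5.0 (Euclidean, r = 2)
print(minkowski([0.0, 1.0], [3.0, 5.0], r=1.0))  # 7.0 (city-block distance)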
The set-theoretic measures are based on using the number of same or
different components of the feature vectors. Set operations, such as intersec-
tion, difference, and union, are applied here. A family of such functions was
proposed for example by TVERSKY [Tve77].
Let $a, b$ be images and $A, B$ the associated feature sets. The measure for
the similarity of both images $S(a, b)$ is computed using the following rule:
$$S(a, b) = \frac{f(A \cap B)}{f(A \cap B) + \alpha \, f(A - B) + \beta \, f(B - A)}, \quad \alpha, \beta \geq 0. \qquad (15)$$
The function $f$ is usually used to determine the cardinality of the result set.
Not only the quality of the features, but also their existence is inspected
with similarity measurements of the third category. Signal Detection Theory
- also called Decision Theory - gives measures for the special case, where
feature components have binary values. Each image is assigned a vector with
binary values, so that comparisons can detect similarities. This makes the
following four cases possible: a feature is present in both images ($w$), present
only in the first ($x$), present only in the second ($y$), or absent from both ($z$).
A typical similarity measure of this kind is
$$S(a, b) = \frac{w}{w + x + y}. \qquad (16)$$
This classification emphasises the advantages of vector-oriented similarity
measurements: the features can be computed automatically and can be used
to determine the nearest neighbour employing proven algorithms.
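As a sketch of the decision-theoretic measure of Equation (16) on binary
feature vectors (the reading of w, x, y as match/mismatch counts is an
assumption here, joint absences being ignored):

def binary_similarity(a: list[int], b: list[int]) -> float:
    # w: feature present in both images; x, y: present in only one of them.
    w = sum(1 for i, j in zip(a, b) if i == 1 and j == 1)
    x = sum(1 for i, j in zip(a, b) if i == 1 and j == 0)
    y = sum(1 for i, j in zip(a, b) if i == 0 and j == 1)
    return w / (w + x + y) if (w + x + y) else 0.0

print(binary_similarity([1, 1, 0, 0], [1, 0, 1, 0]))  # 1/3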
Partial Range Query: this query type looks for objects whose feature val-
ues are within given intervals. The space spanned by the intervals defines
the region in which two objects are regarded as similar, so the similarity
term can be introduced with this query.
Next Neighbour Query: this query selects a single object, which has the
smallest distance to the query object with respect to a similarity function. An
extension is realised by looking for the k nearest neighbours. This feature
is for example necessary for ranking pictures by their similarity.
All Pair Query: in a given set of objects, all pairs are selected which satisfy
a certain distance condition.
Data structures that are employed to support such a search are called
multidimensional index structures. Well-known examples are k-
d-trees and their extensions, like k-d-B-trees [Rob81], grid files [Knu73], R-
and R*-trees, SS- and SR-trees [WJ96], TV-trees (telescopic-vector trees), VP-
trees (vantage point trees) [Chi94a,Chi94b], or VA files (Vector Approximation)
[FTA+00,WSB98].
$$S = X \cdot Y \cdot \sum_{i=1}^{r} \frac{w_i}{8} + C, \qquad (17)$$
where $X \times Y$ is the dimension of the image in pixels, $r$ is the number of
colour channels, $w_i$ is the word-length of the colour channel $i$ - usually
8 bits - and the constant $C \ll S$ represents additional, technical and
format-specific details. According to Equation (17), the storage space required
for a page of text is the same as for an uncompressed RGB image of the
dimension 26×26. Images
The content provider thus has one or more video servers at its disposal.
A centralised solution is linked to high transfer costs and strong signalling
network traffic [DePOOl. A replicated media distribution among several video
servers, independent of one another, significantly reduce the transfer costs,
so that the quality of service demanded can be obtained. The location of the
individual servers can be determined according to different criteria:
policies, time striping and space striping. In the first case, a video is striped
in frame units across multiple servers. In contrast, space striping is a
technique to divide a video stream into fixed-size units. These are easy to
manage and simplify the storage requirements.
The re-ordering and the merging of the video fragments in a coherent
video stream are performed by a component called proxy. There are three
main directions for the realisation of a proxy:
• Proxy at Server: a proxy is directly assigned to each storage server. The
proxy analyses the requests, determines the storage position of the other
video fragments, and requests these from the corresponding proxies. The
computational resources of the storage server are used for this aim and
for the video retrieval.
• Independent Proxy: the proxies and the storage server are connected via
a network, so that the proxy can directly address all servers, and request
the required fragments. This assumes a corresponding control logic for
the proxy and the network, as well as sufficient bandwidth.
• Proxy at Client: a proxy is assigned to each client, which then takes care of
communicating with the storage servers. The communication complexity
is reduced, as the video fragments are transferred directly to the client.
On the other hand, the demands on the client complexity are significantly
increased, as they will need to realise the proxy functionality.
• The local databases store different media and the corresponding meta-
information, such as an image database with portraits of actors, an im-
age database with keyframes from different movies, video servers with
digitised movies, conventional databases with bibliographical and movie
information.
• Every local database contains a subset of the movie material, for example
sorted by production country: a server contains all information mentioned
for American, another server for French movies, etc.
Yet a central interface exists in both cases, which enables accesses to the
data in all local databases. In the case of a heterogeneous, distributed data-
base system, different database systems are allowed on the same node. The
individual systems can be entirely integrated in the global system by transla-
tion schemes, or merely supply interfaces - so called gateways - to the local
data. The latter are comparable to meta-search-engines on the Internet: the
keywords are entered - a syntactical transformation is assumed - in a num-
ber of search engines that will then analyse their databases in parallel. The
syntactical transformation mostly concerns the formulation of logical expres-
sions, for example Word1 AND Word2 is transformed into +Word1 +Word2.
The results are then combined in a final result and presented to the user.
Figure 8.2 shows an example for a heterogeneous, distributed multimedia
database system.
Completely integrated database systems - called multi database manage-
ment systems - connect different local database systems with already
partially existing databases, by means of conversion subsystems, into a new,
global system. A centralised interface can query all subsystems and
combine the results. As opposed to homogeneous database systems, local data ac-
cess is still allowed: the users can continue to use "their" part of the database
as before, without having to resort to the global interfaces and mechanisms,
i.e. the membership in the global system is transparent for these users. The
functionality of this architecture is visualised in Fig. 8.3 [BG98].
[Fig. 8.2. Example of a heterogeneous, distributed multimedia database system: global user views 1..n on top of local user views and the local internal schemas of the local multimedia databases]
[Fig. 8.3. Multi database management system: global user views 1..n, a global schema with fragmentation and allocation schemas, and local conceptual and internal schemas of the local multimedia databases 1..n]
The advantage of this approach is that the used operators do not need
to be adapted with a complicated process, i.e. a large part of the sequential
code can be used unchanged. Furthermore, the computation time depends
mostly on the number of elements to be processed, so that all nodes require
nearly the same processing time.
On the other hand, this kind of parallelising cannot be employed for all
operators, as dependencies between the media elements need to be consid-
ered. The partitioning of the data can thus distort the result. These
methods are not well-suited for architectures with distributed memory, since
the large transport costs for the media significantly reduce the performance
gain. The reason for this is the large amount of time the nodes spend wait-
ing for their data. In the worst case the parallel processing may take longer
than the sequential execution [GJK+00]. Therefore, this kind of parallelism
prefers architectures with shared memory. The protracted transfers between
nodes are not performed, simplifying the segmentation, synchronisation, and
composition.
Parallel architectures are expected to have the same impact on the develop-
ment of computer technology during the next 20 years as the microprocessors
had in the past 20 years [CSG99].
ALMASI and GOTTLIEB [AG89] define a parallel architecture as a collection
of communicating and cooperating processing elements (PEs), which are used
to solve large scale problems efficiently.
The way in which individual PEs communicate and cooperate with each
other depends on many factors: the type and the attributes of the connecting
structures, the chosen programming model, the problem class to be solved,
etc. The organisational and the functional connections of these and other
components result in a multitude of different architectures, processing con-
cepts, system software, and applications. From the database community's
point of view the parallel architectures are divided into three categories:
Shared everything architecture: multiple, identically constructed PEs ex-
ist in the system and every one of them can take care of exactly the
same assignments as all other processors (symmetry). These are, for ex-
ample, memory accesses, controlling and managing the input and output
activities, reacting to interrupts, etc. The other elements are regarded
by the shared operating system and the applications as single units, despite
possibly being composed of several replicated components, such
as hard disk arrays. The synchronisation and communication is usually
performed via shared variables and other regions in the
memory. Figure 8.6 shows the principal composition of a shared every-
thing architecture.
[Fig. 8.6. Shared everything architecture: multiple PEs accessing a common memory]
Shared everything systems are the main platform for parallel database sys-
tems, and most vendors offer parallel extensions for their relational
and object-oriented databases. Independent query parts are distributed
over a number of PEs.
Shared disk architecture: in this class, each processor has a local memory
and access to a shared storage device. The data to be processed is
transferred from the hard disk to the local memory and it is processed
there. The results are then written back to the hard disk, thus being
available for other processors. Special procedures are necessary to retain
data consistency - analogous to the cache coherency problem: the current
data can already be in the cache of a processor, so that accesses per-
formed in the meantime may return stale data. The graphical display of this
architecture can be seen in Fig. 8.7.
Conversion schemes are used to map the local data models onto the new,
global data models. The conversion can take place directly or via an abstract
meta-model. The most important goal is to entirely preserve the data features
that are transferred, and the operators that can be applied to the data sets.
on the type of application, the volume of the data to be stored and its trans-
portation costs, the required access rate, etc.
Often used solutions are RAID systems (Redundant Array of Inexpensive
/ Independent Disks): several hard disks are bundled and configured by a
controller. The data is distributed among all the hard disks. A parity block
takes care of the redundancy and needs to be brought up to date with each
write access.
Another approach is mirroring the hard disks: several independent copies
of the same data set are stored on different hard disks. In the simplest form,
the storage media exist twice, as seen in Fig. 8.9. In addition to the required
doubling of the storage resources, this solution incurs a higher management
overhead, since each write or delete operation needs to be performed on every
copy. The existing redundancy improves the workload balancing within the
system, as each read access can be re-directed to a mirrored, non-overloaded
unit.
[Fig. 8.9. Mirrored disks: independent copies of the same data set on separate disk pairs]
Fig. 8.10. Example for Chained De-clustering with five independent disks: (a)
Normal case; (b) Re-distribution after disk number 4 failed
• Inter-transaction parallelism,
• Parallelism of operations within a transaction,
• Parallel execution of individual database operations, and
• Accessing the stored data in parallel.
• De-clustering: the entire data set is divided into disjoint subsets, ac-
cording to a given attribute or a combination of attributes.
• Placement: the subsets of the first phase are distributed among the indi-
vidual nodes.
• Re-distribution: the partitioning and assignment process is repeated in
certain intervals, or on demand, to eliminate disturbances in the equilib-
rium, for example after a node has been added.
Strategies for static data distribution. Different memory models for the
organisation of complex objects in conventional database systems - for exam-
ple relational database systems - have already been analysed. The direct
memory model stores the main and sub-objects together. This eases object
accesses and reduces the necessary I/O activity. On the other hand, the tables
grow disproportionately and executing database operations becomes inefficient.
In a normalised memory model, the objects and the corresponding at-
tributes are divided into tuple sets. These are then mapped to one or more
files. Two basic partitioning methods are possible:
ObjectID    | Histogram         | Colour Moments | ... | Wavelet Coefficients
Beach0001   | (0.03, ..., 0.05) | (273.45, ...)  | ... | (54.17, ...)
BeachAthens | (0.01, ..., 0.08) | (125.37, ...)  | ... | (98.65, ...)
              File 1              File 2           ...   File n
Fig. 9.1. Vertical partitioning of a relational database: each attribute is mapped to a file of its own
[Fig. 9.2. Horizontal partitioning of the same relation: complete rows are assigned to different files]
The following basic strategies are available for the most often used hori-
zontal partitioning (a small sketch in code follows the list):
Range: the data is divided into ranges, based on the value of an attribute
or a combination of attributes. A simple example for this is the mapping
of visitors of an office to one of the available counters based on the first
letter of their surname.
Hashing-Strategy: the attributes are transformed with a given hashing
function and thus mapped on the corresponding partition.
Round-Robin-Strategy: if n nodes are available, the data is sent to node
k, with k < n, the next data set is sent to node k + 1 mod n, and so forth.
After a certain runtime, an even distribution across all nodes is reached.
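The three strategies can be sketched as follows; the node count, range boundaries, and records are invented for illustration:

```python
# A minimal sketch of the three horizontal partitioning strategies for
# N_NODES nodes. A real system would use a stable hash function instead of
# Python's built-in hash().

from itertools import count

N_NODES = 4

def range_partition(surname: str) -> int:
    """Range strategy: map by first letter, like counters in an office."""
    bounds = ["G", "N", "T"]          # A-F -> 0, G-M -> 1, N-S -> 2, T-Z -> 3
    for node, upper in enumerate(bounds):
        if surname.upper() < upper:
            return node
    return len(bounds)

def hash_partition(key: str) -> int:
    """Hashing strategy: a hash function maps the attribute to a partition."""
    return hash(key) % N_NODES

_rr = count()
def round_robin_partition() -> int:
    """Round-robin: send the i-th data set to node i mod n."""
    return next(_rr) % N_NODES

for name in ["Ahrens", "Meyer", "Novak", "Zhou"]:
    print(name, range_partition(name), hash_partition(name),
          round_robin_partition())
```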
the graph. The edges of this graph are given weights that correspond to the transfer costs to neighbouring nodes. Each node pair is analysed according to these costs, and the pair with the highest costs is merged into one node. This is repeated until the number of nodes in the graph equals the number of physically existing nodes. Variations of this fundamental algorithm, with respect to fragment allocation or the grouping of PEs, are examined for example in [IEW92].
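A minimal sketch of this pair-merging heuristic follows, with invented transfer costs; a full implementation would also combine the edge weights of merged pairs, as the cited variations do:

```python
# Greedily merge the pair of graph nodes connected by the highest-cost edge
# until only n_physical (super-)nodes remain. Edges intra-merged groups are
# simply skipped in this simplified sketch.

def merge_to(n_physical: int, edges: dict) -> list:
    nodes = {v for pair in edges for v in pair}
    parent = {v: v for v in nodes}        # union-find parent pointers

    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v

    cost = dict(edges)                    # {(u, v): transfer cost}
    n_groups = len(nodes)
    while n_groups > n_physical and cost:
        (u, v), _ = max(cost.items(), key=lambda kv: kv[1])
        del cost[(u, v)]
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[rv] = ru               # merge the two groups
            n_groups -= 1
    groups = {}
    for v in nodes:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())

# Four logical fragments reduced to two physical nodes:
print(merge_to(2, {("A", "B"): 9.0, ("B", "C"): 4.0,
                   ("C", "D"): 7.0, ("A", "D"): 2.0}))
```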
Strategies based on the processing size have been developed for systems that support fine-grained parallelism. The compute time of all participating PEs is balanced by transferring approximately the same volume of data to all nodes for processing. According to Hua's strategy [HL90], the data set is divided into a large number of heuristically determined cells that are combined in a list. The first element is transferred to the node with the most free space on its storage device and is then removed from the list. This procedure is repeated until the list is empty.
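A sketch of this placement loop, with invented cell sizes and node capacities:

```python
# Greedy size-based placement: each cell in the list goes to the node that
# currently has the most free storage space.

def place_cells(cell_sizes: list[int], node_capacity: list[int]) -> list[list[int]]:
    free = node_capacity[:]                   # remaining space per node
    assignment = [[] for _ in node_capacity]  # cell indices per node
    for cell, size in enumerate(cell_sizes):
        target = max(range(len(free)), key=lambda n: free[n])
        assignment[target].append(cell)
        free[target] -= size
    return assignment

print(place_cells([40, 30, 30, 20, 10], [100, 80]))  # -> [[0, 2, 4], [1, 3]]
```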
The I/O system is a bottleneck when large amounts of data are processed. Strategies in the third class therefore reduce the frequency with which the secondary memory is accessed and spread the accesses evenly across all nodes. The BUBBA system applies such a strategy, which defines the terms heat and temperature: the heat of a fragment is the frequency with which it is accessed, and its temperature is the quotient of this frequency and the relation size. Heat is the measure according to which the number of nodes needed to process the relation is computed. The temperature determines whether the relation is to be kept in main memory or whether it should be swapped to secondary memory. The relations are sorted according to their temperature and distributed among the nodes with a greedy algorithm, so that every node has nearly the same temperature.
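The greedy distribution step can be sketched as follows, with invented temperature values:

```python
# BUBBA-style greedy distribution: relations are sorted by temperature
# (access frequency / relation size), and each is assigned to the node with
# the lowest accumulated temperature so far.

def distribute(relations: dict[str, float], n_nodes: int) -> list[list[str]]:
    load = [0.0] * n_nodes
    nodes = [[] for _ in range(n_nodes)]
    for name, temp in sorted(relations.items(), key=lambda kv: -kv[1]):
        coolest = min(range(n_nodes), key=lambda i: load[i])
        nodes[coolest].append(name)
        load[coolest] += temp
    return nodes

print(distribute({"orders": 9.0, "media": 7.5, "logs": 4.0, "tags": 3.5}, 2))
```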
Let DVP_1, DVP_2, ..., DVP_n be the sums of the memory usage of all media on the local hard disks of the database nodes 1, 2, ..., n, i.e.

    DVP_i = Σ_{j=1}^{a_i} sizeof(m_{ij}),  i = 1, 2, ..., n,   (18)

where the function sizeof(m_{ij}) returns the storage space required for the medium m_{ij}, and a_i represents the number of media m_{ij} at node i. The values DVP_i are managed in a master list, which is updated whenever a new medium is added.
Let a medium m_new be given with x = sizeof(m_new) that is to be inserted in the database. The node k, k ∈ [1, n], with the least storage space used is determined for this aim:

    DVP_k = min_{1 ≤ i ≤ n} DVP_i.   (19)
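A small sketch of equations (18) and (19), with invented sizes, might look as follows:

```python
# Maintain the master list of DVP values and insert a new medium on the
# least-loaded node, per equations (18) and (19).

def insert_medium(dvp: list[int], size_new: int) -> int:
    """Return the chosen node index and update the master list in place."""
    k = min(range(len(dvp)), key=lambda i: dvp[i])   # eq. (19)
    dvp[k] += size_new                               # keep DVP_k up to date
    return k

master_list = [120, 80, 95]   # DVP_1..DVP_3 in, say, megabytes
print(insert_medium(master_list, 25), master_list)  # -> 1 [120, 105, 95]
```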
Each document and the two categories are represented by vectors consisting of TF·IDF values. The degree of the match is calculated with the scalar product of these vectors. The text and the corresponding image are assigned to the category with the highest match. Different restrictions and modifications of this general principle are introduced in [SH99], which make matches larger than 80% possible.
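As an illustration of this matching step, the following sketch (with invented TF·IDF vectors and hypothetical category names) assigns a document to the category with the highest scalar product:

```python
# Documents and categories are TF*IDF vectors over a shared vocabulary; the
# degree of the match is their scalar product.

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def assign(document: list[float], categories: dict[str, list[float]]) -> str:
    """Assign the text (and its image) to the best-matching category."""
    return max(categories, key=lambda c: dot(document, categories[c]))

cats = {"indoor": [0.9, 0.1, 0.2], "outdoor": [0.1, 0.8, 0.7]}
print(assign([0.2, 0.6, 0.5], cats))   # -> "outdoor"
```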
The query image is assigned to class j if P_j > 0.5. The hit ratio achieved lies between 66% and 93.9%, depending on the number of neighbours considered and the attributes included.
Combined approaches that use textual as well as content-based attributes are described, for example, in [SO95,SC97].
1. q(B) = s(B),
2. q(B) = d(B),
3. q(B) = s ∘ d(B),
4. q(B) = d ∘ s(B), and
5. q(B) = d(B) {∪, ∩, ...} s(B).
Queries 1 and 2 consider only one feature type, so there is no need for task scheduling or for the unification of partial results. Query number 5 requires a parallel execution of s and d; there is still no need for dynamic re-distribution, as all of the (initially distributed) data has to be processed. The failure of a computing node and the migration of its tasks to other nodes, as well as execution on heterogeneous architectures, are not considered.
The query types 3 and 4 are compositions of the s and d sequences. Query number 3 performs a retrieval with dynamically extracted features in the first stage. The results are then processed with the a priori extracted features in order to determine the final ranking. From a scheduling perspective this is a non-critical case, as approximately the same processing time is assumed for all available nodes, a consequence of the initially equal-sized data distribution. The second processing step considers only a priori extracted features and the corresponding operations, all executed on a single node.
Query number 4 represents a critical case. The retrieval with a priori extracted features reduces the data set that has to be considered during the retrieval with dynamically extracted features. The equal data distribution over the nodes is distorted; in the worst case all data is located on a single node, so that no parallel processing can be done. Only this particular node performs the retrieval operations while the other nodes idle, resulting in much longer system response times.
Let t_d(b_{i1}), t_d(b_{i2}), ..., t_d(b_{in}) be the processing times for the media b_{i1}, ..., b_{in} stored on node i, and let t_i denote the total processing time of node i. Now the following important time parameters can be approximated:
• System response time t_r is the maximal processing time of all nodes:

    t_r = max_{1 ≤ i ≤ n} t_i.

• Minimal processing time t_min equals the processing time of the node with the smallest number of relevant media:

    t_min = min_{1 ≤ i ≤ n} t_i.

During 0 ... t_min, all nodes are fully loaded, so no media re-distribution is necessary. After this period at least one node idles, and media re-distribution is necessary in order to avoid unused resources.
• Optimal execution time t_opt is reached when the total workload is spread evenly over all nodes:

    t_opt = (1/n) Σ_{i=1}^{n} t_i.   (25)
In this case the idle times of all nodes are minimised, and the best possible
system response time is reached. The goal of the scheduling strategy is
to approximate this time as well as possible.
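These parameters can be computed directly from the per-node processing times; a minimal sketch with invented t_d values:

```python
# Per-node processing times are the sums of the t_d values of the media
# stored locally; t_r, t_min, and t_opt follow directly.

def time_parameters(per_node: list[list[float]]):
    t = [sum(media) for media in per_node]   # t_i per node
    t_r   = max(t)                           # system response time
    t_min = min(t)                           # first node to run idle
    t_opt = sum(t) / len(t)                  # perfectly balanced case
    return t_r, t_min, t_opt

# Three nodes with uneven workloads:
print(time_parameters([[2.0, 3.0, 4.0], [1.0, 1.5], [6.0]]))
```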
Figure 9.3 depicts the described time parameters and gives an example of the differences between the processing times of the nodes.
Fig. 9.3. Graphical representation of the processing times t_i of the individual PEs P_i, as well as the three points in time vital for using dynamic re-distribution strategies
It follows that data sets of different sizes inevitably lead to varying processing times of the individual nodes, so that a significant difference between the response time t_r and the optimal execution time t_opt can result.
With regard to the total performance, the point in time of a dynamic re-distribution is decisive. In the simplest case, such a strategy is activated as soon as a node finishes processing the media assigned to it. A better utilisation of the available resources is achieved when the current situation is analysed during the processing interval [0, t_min], while all nodes are busy processing the media on their local storage devices. The special case t_min = 0 occurs when no media have to be processed on one or more nodes. Generally, t_min > 0 time units remain to analyse the current situation and to generate a re-distribution plan.
But creating an a priori execution plan requires data on
Pattern recognition systems have been used for a long time. Specialised medical information systems were developed to evaluate images as well as to manage, organise, and retrieve patient data. A medical database for computing and comparing the geometry and density of organs was developed as early as 1980 [HSS80]. Similar improvements happened in the field of remote sensing. A generalisation of the procedures used, as well as the extension of the application areas, led to the specification of so-called pictorial information systems in the 1980s. A significant functional requirement was image content analysis and content-based image retrieval.
The importance of image databases has risen enormously in recent years. One of the reasons is the spread of digital technology and of multimedia applications producing petabytes of pictorial material per year.
The application areas are numerous. Document libraries offer their multimedia stock world-wide. The same holds for art galleries, museums, research institutions, photo agencies for publishing houses, press agencies, civil services, etc., all managing many current and archived images. Document imaging systems are tools that digitise paper documents and insert them into a computer-based database. Further areas are trademark databases, facial recognition, textile and
fashion design, etc. Systems are also created in combination with applied image processing, in which the image database is only part of a more complex system. Medical information systems, for example, manage ultrasound images, X-ray exposures, and other medical images.
CAIRO, the image database presented here, combines standard methods for image description and retrieval with efficient processing on a cluster architecture. The data is distributed among several nodes and then processed in parallel. The components necessary for this are
• User interfaces.
• Algorithms for feature extraction.
• Relational database system for storing a priori extracted image attributes.
• Index structures to speed up the retrieval.
• Mechanisms for the parallel execution of retrieval operations consisting
of
- Transaction manager: sets the order of the commands to be executed
and balances the workload across the cluster.
- Distribution manager: combines the algorithms to be used with the
identifiers of the sample and the target images and sends these to the
nodes.
- Processing manager: initiates and controls the feature extraction and
the comparison at the individual nodes.
- Result manager: collects the partial results and determines the global
hits.
• Update manager: takes care of inserting new images in the database, the
computation of the a priori defined features, and the updating of the
index structures.
The functionality of the individual nodes is described more closely in the
following.
Figure 10.1 displays the interface for query by example image or sketch with the corresponding browser.
Fig. 10.1. Graphical user interface: sketching tools and browser for the retrieval
results
10.3 Features
To describe the image content, as well as to conduct an image comparison, CAIRO offers a set of algorithms for feature extraction and comparison, among them histograms, colour moments, format attributes, texture characteristics, wavelet-based approaches, etc. Some of these features are extracted a priori and stored in the index structures.
One of CAIRO's specialities is the support of dynamic feature extraction. In this case the user can select a certain region manually and use it as a starting point for the search. Other regions of the query image and the object background are not regarded, so that a detail search can be performed. However, this method requires the analysis of all image sections in the database and produces an enormous processing overhead. The different approaches and the results to be expected are introduced in the following.
higher abstraction level. The similarity degree of a query image and the target
images is determined by calculation of a distance between the corresponding
features.
An example of this approach is given by the a priori extraction of wavelet coefficients. Let I = {I_1, ..., I_n} be a set of images to be inserted in a catalogue. The main feature for the content description is a vector with the largest wavelet coefficients. Therefore the wavelet transformation is applied to all images in I, resulting in a set of vectors w_{I_1}, ..., w_{I_n} with w_{I_j} = (c_{j1}, ..., c_{j64}).
At query time the user creates a sample sketch or loads an image, which is subsequently processed in the same manner as the images in the database. The wavelet transformation is applied to this image Q too, and the wavelet coefficients w_Q = (c_{Q1}, ..., c_{Q64}) are determined. Subsequently the distances between the vector of the query image and the vectors of all images in the database are calculated. Each of these results gives an indication of the similarity of the compared images. The images with the smallest distance are the most similar images, and the corresponding raw data is sent to the user interface for visualisation. The extraction algorithms for the wavelet coefficients, which are applied to the sample image, as well as the similarity functions, are embedded in an SQL command sequence and executed using the available mechanisms of the relational database system. Further algorithms can also be included and invoked as user-defined functions.
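The following sketch illustrates the general idea. It replaces the full wavelet transformation with a single Haar-like averaging step, assumes all images are scaled to a common size, and uses invented names throughout; it is not the CAIRO implementation:

```python
import numpy as np

def signature(img: np.ndarray, k: int = 64) -> np.ndarray:
    """One Haar-like averaging step, keeping only the k largest coefficients."""
    a = img.astype(float)
    a = (a[0::2, :] + a[1::2, :]) / 2.0      # average row pairs
    a = (a[:, 0::2] + a[:, 1::2]) / 2.0      # average column pairs
    flat = a.ravel()
    keep = np.argsort(np.abs(flat))[-k:]     # indices of the k largest values
    sig = np.zeros_like(flat)
    sig[keep] = flat[keep]
    return sig

def rank(query: np.ndarray, catalogue: dict) -> list:
    """Sort catalogue image names by distance to the query signature."""
    q = signature(query)
    return sorted(catalogue,
                  key=lambda name: np.linalg.norm(q - signature(catalogue[name])))

rng = np.random.default_rng(0)
imgs = {f"img{i}": rng.integers(0, 256, (32, 32)) for i in range(3)}
print(rank(rng.integers(0, 256, (32, 32)), imgs))
```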
However, with this approach only entire images are compared with each other, so global features such as dominant colours, shapes, or textures define the similarity. For example, a query with an image showing a person on a beach results in a set of images with beach scenes or images with large yellow and blue coloured objects. Images containing the same person in other environments - for example a canyon or a forest - are sorted at the end of the ranking. Figure 10.2 shows an example of such a query image and the results obtained with a priori extracted features.
Acceptable system response times are achieved because no further processing of the image raw data is necessary during the retrieval process, resulting in an immense reduction of computing time. The straightforward integration into existing database systems is a further advantage of this approach.
Extraction of simple features results in a disadvantageous reduction of the image content. Important details like objects, topological information, etc. are not sufficiently considered in the retrieval process, so a precise detail search is not possible. Furthermore, it is not clear whether the known, relatively simple features can be correctly combined for the retrieval of all kinds of images.
Fig. 10.2. Query image and results retrieved with a priori extracted features
Fig. 10.3. Query image and results retrieved with dynamically extracted features
software and hardware components. A master node controls the whole cluster and serves files to the client nodes. It is also the cluster's console and gateway to the outside world [SS99].
Clusters of symmetric multiprocessors - so-called CLUMPs - combine the advantages and disadvantages of two parallel paradigms: an easily programmable Symmetric Multiprocessing (SMP) model with the scalability and data distribution over many nodes of architectures with distributed memory. A number of well-constructed parallel image operators, which were developed and tested for the SMP model, are available; these can be used for the image analysis on each node. The multiple nodes share the transfer effort and eliminate the bottleneck between the memory and the I/O subsystem. Disadvantages result from the time-consuming message-passing communication, which is necessary for workload distribution and synchronisation. The proposed image partitioning, however, minimises the communication between the nodes and enables the use of the PEs to nearly full capacity. Based on their functionality, the nodes are subdivided into three classes:
their functionality the nodes are subdivided in three classes:
• Query stations host the web-based user interfaces for the access to the
database and visualisation of the retrieval results .
• Master node controls the cluster, receives the query requests and broad-
casts the algorithms, search parameters, the sample image, features, etc.
to the computing nodes. Furthermore, it acts as a redundant storage
server and contains all images in the database. It unifies the intermediate
results of the compute nodes and produces the final list with the k best
hits.
Fig. 10.4. Schematic of the CAIRO cluster architecture: the master node handles cluster control and a priori feature extraction, while the slave nodes perform the dynamic feature extraction
A partition can consist of multiple image classes, whose elements differ significantly from those of other partitions. On the other hand, the images within a partition should be characterisable by a shared feature, like landscape images or portraits.
The discussion of existing features for image classification in the previous section shows that a reliable, content-based partitioning of the images into independent subsets is currently not realisable. This is especially the case when a general image stock is used. An unsuitable assignment can lead to some images becoming unfindable, since they are not even considered during the corresponding queries.
This is the reason why the initial partitioning of the image set B uses the content-independent, size-based strategy, which leads to a set of partitions P = {P_1, P_2, ..., P_n} with the following characteristics:

    ∀ P_i, P_j ⊆ B: P_i ∩ P_j = ∅,  i, j = 1, ..., n,  i ≠ j,   (27)
    size(P_i) ≈ size(P_j),  i, j = 1, ..., n.   (28)
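A greedy sketch that approximately satisfies (27) and (28), with invented image sizes:

```python
# Content-independent, size-based initial partitioning: images are assigned
# greedily (largest first) so that the partitions are disjoint and of
# approximately equal total size.

def partition_by_size(image_sizes: dict[str, int], n: int) -> list[list[str]]:
    parts = [[] for _ in range(n)]
    totals = [0] * n
    for name, size in sorted(image_sizes.items(), key=lambda kv: -kv[1]):
        i = min(range(n), key=lambda j: totals[j])
        parts[i].append(name)
        totals[i] += size
    return parts

print(partition_by_size({"a": 9, "b": 7, "c": 6, "d": 4, "e": 3, "f": 1}, 3))
```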
suitable images have been considered. The transaction manager is not invoked if only a priori or only dynamically extracted features exist. Usually, however, the query consists of a combination of a priori and dynamically extracted features, so that three basic approaches are possible:
1. The a priori extracted features are evaluated in the first phase, and a list of all potential hits is constructed. This list is forwarded - together with the algorithms for the dynamic extraction of features - to the distribution manager, which causes the procedures to be applied only to these images.
2. Inverting the order of operations of the first case leads to the case where the list of potential hits is determined according to the dynamically extracted features and then further narrowed down by considering the a priori extracted features.
3. Both processing streams can initially be regarded as independent of each other and be executed in parallel. The resulting intermediate lists are transformed into a final hit list by a comparison process.
Each of these possibilities has certain advantages and disadvantages regarding speed of execution and precision. The combination of a priori and dynamically extracted features limits the set of images that have to be processed dynamically and enables the fastest system response time. On the other hand, suitable images can be removed from the result set by imprecise comparisons with the a priori extracted features and are then no longer considered in the second step. This disadvantage is eliminated in the other two approaches, but the necessary processing time clearly grows, as every image needs to be analysed for each query.
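A minimal sketch of the first approach, with stand-in scoring functions in place of the real feature comparisons:

```python
# Phase 1 narrows the candidate set with cheap a priori features; phase 2
# applies the expensive dynamic extraction only to the shortlisted images.
# `shortlist` is the length of the list of potential hits handed on.

def retrieve(images, apriori_score, dynamic_score, shortlist=100, k=10):
    # Phase 1: cheap, index-supported ranking on a priori extracted features.
    candidates = sorted(images, key=apriori_score, reverse=True)[:shortlist]
    # Phase 2: expensive dynamic feature extraction, restricted to candidates.
    return sorted(candidates, key=dynamic_score, reverse=True)[:k]

hits = retrieve(range(1000),
                apriori_score=lambda i: -abs(i - 500),
                dynamic_score=lambda i: -abs(i - 510),
                shortlist=50, k=5)
print(hits)   # images closest to 510 among the 50 closest to 500
```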
The transaction manager also controls the module for dynamic re-distribution of images across the nodes. If only a selection of images needs to be processed, the list is handed to the scheduler, which returns a re-distribution plan. This is the foundation from which the transaction manager creates the execution lists for each node.
Fig. 10.5. Schedule for the parallel execution of the retrieval operations in a cluster architecture
This component realises the insertion of images into the database via a web-based interface. First, the raw image data is transformed into a uniform format and tagged with a unique identifier. All existing procedures for a priori feature extraction are then applied to this image. Furthermore, the technical and, if present, world-oriented data is determined and extended by a set of user-defined keywords. All information is composed in a given data structure and stored in the relational database.
The next phase determines the cluster node on whose hard disk the raw image data is to be stored. In the case of an even data distribution, the image data is sent to the node with the smallest data volume. If larger images are used, it may be necessary to re-distribute the data to achieve a balanced storage load. The exact image position is stored in the data structure, and the image is sent to the corresponding node. Any existing index structures are updated in the last phase.
11 Conclusions
The development of Internet technology enables online access to a huge set of digital information, represented by different multimedia objects such as images, audio and video sequences, etc. Thus the Internet can be considered a general digital library offering a comprehensive knowledge collection distributed over millions of independent nodes. Thereby an urgent need for the organisation, management, and retrieval of multimedia information arises. The large memory, bandwidth, and computational requirements of multimedia data often surpass the capabilities of traditional database systems and architectures. The performance bottlenecks can be avoided, for example, by partitioning the data over multiple nodes and by creating a configuration that supports parallel storage and processing.
The chapter gives an overview of the different techniques, and of their interoperability, necessary for the realisation of distributed multimedia database systems. Existing data models, algorithms, and structures for multimedia retrieval are presented and explained. The parallel and distributed processing of multimedia data is depicted in greater detail by considering an image database as an example. The main attention is given to the partitioning and the distribution of the multimedia data over the available nodes, as these methods have a major impact on the speedup and the efficiency of parallel and distributed multimedia databases. Moreover, different approaches for the parallel execution of retrieval operations on multimedia data are shown. The chapter closes with a case study of a cluster-based prototype for image retrieval.
References
[172] ISO/IEC 11172-1, Information technology - coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s, part 1-3: Systems, Video, Compliance testing, 1993.
[818] ISO/IEC 13818, Information technology - generic coding of moving pictures and associated audio information, part 1-3, 1995.
[ABF+95] Ashley, J., Barber, R., Flickner, M., Hafner, J., Lee, D., Niblack,
W., Petkovic, D., Automatic and semi-automatic methods for image
annotation and retrieval in QBIC, Proc. Storage and Retrieval for
Image and Video Databases III, 1995, 24-35.
[FSN+95] Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom,
B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Query by image and
video content: the QBIC system, IEEE Computer 28, 1995, 23-32.
[FTA+00] Ferhatosmanoglu, H., Tuncel, E., Agrawal, D., Abbadi, A.E., Vector approximation based indexing for non-uniform high dimensional data sets, Proc. 2000 ACM CIKM International Conference on Information and Knowledge Management, 2000, 202-209.
[Fur99] Furht, B., Handbook of internet and multimedia systems and applications, CRC Press, 1999.
[GJK+00] Gaus, M., Joubert, G.R., Kao, O., Riedel, S., Stapel, S., Distributed high-speed computing of multimedia data, E. D'Hollander, G.R. Joubert, F.J. Peters, H.J. Sips (eds.), Parallel computing: fundamentals and applications, Imperial College Press, 2000, 510-517.
[GJM97] Grosky, W.I., Jain, R., Mehrotra, R., The handbook of multimedia information management, Prentice Hall, 1997.
[GM00] Golubchik, L., Muntz, R.R., Parallel database servers and multimedia object servers, J. Blazewicz, K. Ecker, B. Plateau, D. Trystram (eds.), Handbook on parallel and distributed processing, Springer-Verlag, Berlin, 2000, 364-409.
[Gob97] Goble, C., Image database prototypes, W.I. Grosky, R. Jain, R. Mehrotra (eds.), The handbook of multimedia information management, Prentice Hall, 1997, 365-404.
[GOI92] Goble, C.A., O'Doherty, M.H., Ireton, P.J., The Manchester multimedia information system, Proc. 3rd International Conference on Extending Database Technology, Springer-Verlag, Berlin, 1992, 39-55.
[Gon98] Gong, Y., Intelligent image databases: towards advanced image re-
trieval, Kluwer Academic Publishers, 1998.
[Gro94] Grosky, W.I., Multimedia information systems, IEEE Multimedia 1,
1994, 12-24.
[GRV96] Gudivada, V.N., Raghavan, V.V., Vanapipat, K., A unified approach to data modelling and retrieval for a class of image database applications, V.S. Subrahmanian, S. Jajodia (eds.), Multimedia database systems: issues and research directions, Springer-Verlag, Berlin, 1996, 37-78.
[GS98] Griwodz, C., Steinmetz, R., Media servers, Technical Report TR-KOM-1998-08, TU Darmstadt, 1998.
[GWJ91] Gupta, A., Weymouth, T., Jain, R., Semantic queries with pictures:
the VIMSYS model, Proc. 17th Conference Very Large Databases,
1991, 69-79.
[HD90] Hsiao, H.-I., DeWitt, D.J., A new availability strategy for multiprocessor database machines, Proc. International Conference on Data Engineering (ICDE 1990), 1990, 456-465.
[HKK+95] Hermes, T., Klauck, C., Kreyss, J., Zhang, J., Image retrieval for information systems, Proc. Storage and Retrieval for Image and Video Databases III 2420, SPIE, 1995, 394-405.
[HL90] Hua, K., Lee, C., An adaptive data placement scheme for parallel database computer systems, Proc. 16th Conference on Very Large Databases, 1990, 493-506.
[HPN97] Haskell, B.G., Puri, A., Netravali, A.N., Digital video: an introduction to MPEG-2, Chapman & Hall, New York, NY, 1997.
[HSS80] Huang, H.K., Shiu, M., Suarez, F.R., Anatomical cross-sectional geometry and density distribution database, S.K. Chang, K.S. Fu (eds.), Pictorial information systems, Springer-Verlag, Berlin, 1980, 351-367.
[Huf52] Huffman, D.A., A method for the construction of minimum redundancy codes, Proc. Institute of Radio Engineers (IRE) 40, 1952, 1098-1101.
[IEW92] Ibiza-Espiga, M.B., Williams, M.H., Data placement strategy for a
parallel database system, Proc. Database and Expert Systems Appli-
cations, Springer-Verlag, Berlin, 1992, 48-54.
[Jae91] Jaehne, B., Digital image processing - concepts, algorithms and scientific applications, Springer-Verlag, Berlin, 1991.
[JFS95] Jacobs, C.-E., Finkelstein, A., Salesin, D.-H., Fast multiresolution image querying, Proc. ACM Siggraph 95, Springer-Verlag, 1995, 277-286.
[JMC95] Jain, R., Murthy, S.N.J., Chen, P.L-J., Similarity measures for image
databases, Proc. Storage and Retrieval for Image and Video Databases
III 2420, 1995, 58-65.
[JTC99] ISO/IEC JTC1/SC29/WG11/N2725, MPEG-4 overview, 1999, Web site: www.cselt.stet.it/mpeg/standards/mpeg-4/mpeg-4.htm.
[KA97] Klas, W., Aberer, K., Multimedia and its impact on database system architectures, P.M.G. Apers, H.M. Blanken, M.A.W. Houtsma (eds.), Multimedia Databases in Perspective, Springer-Verlag, Berlin, 1997, 31-62.
[Kat92] Kato, T., Database architecture for content-based image retrieval,
Proc. Storage and Retrieval for Image and Video Databases III 1662,
SPIE, 1992, 112-123.
[KB96] Khoshafian, S., Baker, A.B., Multimedia and imaging databases, Morgan Kaufmann Publishers, 1996.
[Knu73] Knuth, D.E., The art of computer programming, Addison Wesley, Reading, MA, 1973.
[KSD01] Kao, O., Steinert, G., Drews, F., Scheduling aspects for image retrieval in cluster-based image databases, Proc. IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2001), IEEE Society Press, 2001, 329-336.
[KT00] Kao, O., La Tendresse, I., CLIMS - a system for image retrieval by using colour and wavelet features, T. Yakhno (ed.), Advances in information systems, Lecture Notes in Computer Science 1909, Springer-Verlag, Berlin, 2000, 238-248.
[Lee98] Lee, J., Parallel video servers, IEEE Transactions on Multimedia 5,
1998, 20-28.
[LZ95] Liu, H.-C., Zick, G.L., Scene decomposition of mpeg compressed
video, A.A. Rodriguez, R.J. Safranek, E.J. Delp (eds.), Digital Video
Compression: Algorithms and Technologies, vol. 2419, SPIE - The
International Society for Optical Engineering Proceedings, 1995, 26-
37.
[MG95] Mehrotra, R., Gary, J.E., Similar-shape retrieval in shape data man-
agement, IEEE Computer 28, 1995, 57-62.
[MPE98] MPEG Requirement Group, MPEG7 requirements document, ISO /
MPEG N2462, 1998.
[SO95] Stricker, M., Orengo, M., Similarity of color images, Storage and Retrieval for Image and Video Databases III, 1995, 381-392.
[SP98] Szummer, M., Picard, R.W., Indoor-outdoor image classification,
IEEE Workshop on Content Based Access of Image and Video
Databases (CAVID-98), IEEE Society Press, 1998, 42-51.
[SS99] Savarese, D.F., Sterling, T., Beowulf, R. Buyya (ed.), High perfor-
mance cluster computing - architectures and systems, Prentice Hall,
1999, 625-645.
[SSU94] Sakamoto, H., Suzuki, H., Uemori, A., Flexible montage retrieval for
image data, Storage and Retrieval for Image and Video Databases II,
1994, 25-33.
[ST96] Stonebraker, M., Moore, D., Object-relational DBMSs - the next wave, Morgan Kaufmann, 1996.
[SteOO] Steinmetz, R., Multimedia technology, Springer-Verlag, Berlin, 2000.
[Swe97] Sweet, W., Chiariglione and the birth of MPEG, IEEE Spectrum,
1997, 70-77.
[Tve77] Tversky, A., Features of similarity, Psychological Review 84, 1977,
327-352.
[VJZ98] Vailaya, A., Jain, A., Zhang, H.J., On image classification: city vs.
landscape, Proc. IEEE Workshop on Content-Based Access of Image
and Video Libraries, IEEE Computer Society Press, 1998, 3-8.
[WSB98] Weber, R., Schek, H., Blott, S., A quantitative analysis and per-
formance study for similarity-search methods in high-dimensional
spaces, Proc. International Conference on Very Large Data Bases,
1998, 194-205.
[WHH+99] Wen, X., Huffmire, T.D., Hu, H.H., Finkelstein, A., Wavelet-based
video indexing and querying, Journal of Multimedia Systems 7, 1999,
350-358.
[WJ96] White, D.A., Jain, R., Similarity indexing with the SS-tree, Proc.
12th International Conference on Data Engineering, IEEE Computer
Society Press, 1996, 516-523.
[WNM+95] Wu, J.K., Narasimhalu, A.D., Mehtre, B.M., CORE: a content-based retrieval engine for multimedia information systems, ACM Multimedia Systems 3, 1995, 25-41.
[WZ98] Williams, M.H., Zhou, S., Data placement in parallel database sys-
tems, M. Abdelguerfi, K.-F. Wong (eds.), Parallel Database Tech-
niques, IEEE Computer Society Press, 1998, 203-219.
[YL95] Yeo, B.-L., Liu, B., Rapid scene analysis on compressed video, IEEE
Transactions on circuits and systems for video technology 5, 1995,
533-544.
8. Workflow Technology: the Support
for Collaboration
products in the context of an advanced application that tests the strength of state-of-the-art solutions. We will conclude this chapter with a discussion of unsolved problems and research directions.
1 Introduction
The product is either a single report released at one time, or report parts
released as the analysis unfolds. Frequently the product is electronically co-
authored across multiple business areas and different intelligence gathering
organizations that reside in different locations.
Next, the product undergoes multiple levels of electronic reviews before it
is electronically published. The product review process involves the following
electronic review activities:
2.2 Coordination
Distributed electronic teams require tools for finding, accessing, and maintaining shared content such as documents, images, team calendars, and electronic bulletin boards. The analysts in our example may participate in several teams and projects. To facilitate the management of shared content and of project and team information, and to provide the tools needed for the function of each project and team, the common information and tools must be organized in different workspaces. Furthermore, just like teams and projects, analysts and supervisors may create workspaces as they need them to perform their functions.
The main advantage of providing team workspaces is that content and tools for communication, content analysis, and discussion are presented in one virtual space. This reduces tool and content setup time for each team activity, regardless of how the shared content is maintained.
2.4 Awareness
participants will act inappropriately or be less effective. With too much information, participants must deal with an information overload that adds to their work and masks important information.
Simple forms of awareness include determining who is present or active in an electronic team workspace, whether somebody else is editing a document, or whether a document you are editing has been read or edited since you created its latest version. In addition to these simple forms of awareness, electronic teams require awareness provisioning technology that supports the following complementary types of awareness:
• activities,
• resources,
• dependencies.
[Figure: the workflow enactment service manipulates workflow control data and workflow relevant data, may refer to the organisation/role model, invokes workflow applications, and exchanges administration and control data with a supervisor]
details about the actual content, tools, and applications required to support the implementation of workflow activities.
Workflow processes are specified in one of the workflow specification languages. In the following paragraphs we describe the primitives of the Workflow Process Definition Language (WPDL) defined by the WfMC [WfM97]. Although WPDL is currently incomplete, it is an attempt to define an industry-standard scripting language for representing workflow processes. We discuss WPDL because it supports a fairly common set of primitives:
sources, and role information identifying the function that resources can perform.
• Workflow Control Data: internal control data maintained by the workflow engine. They are used to identify the state of individual process or activity instances. These data may not be accessible or interchangeable outside of the workflow engine, but some of the information content may be provided in response to specific commands (e.g., process status, performance metrics, etc.).
4.4 Roles
In commercial WfMSs, roles are global (i.e., organizational) and static (i.e., they must be fully defined before the execution of a process begins). Just like WfMSs, groupware tools provide static activity roles (e.g., "meeting moderator" and "attendee"). Role assignment in WfMSs and process-oriented systems determines who is doing what activity within a process. The term role assignment is typically used in WfMSs because process participants are usually addressed only via roles. Role assignment in existing WfMSs is limited to a one-out-of-n semantics. This means that an activity in a process specification corresponds to exactly one activity instance at runtime, and this activity instance is performed by exactly one participant out of the n eligible participants that play the role(s) assigned to this activity. This traditional role assignment is well suited to applications where a task must be distributed among a group of workers. However, in cases where a number of people have to execute the same task, such as participating in the same meeting or performing concurrent analysis of the same intelligence data, the traditional role assignment is not sufficient.
State-of-the-art commercial WfMSs can currently support several hundred workflow instances per day. However, older WfMS technology offers limited (or no) engine scalability, distribution, and component redundancy for dealing with load balancing and engine failures.
Workflow vendors have recognized some of these limitations in earlier ver-
sions of their products, and they are currently introducing improvements to
address them. In particular, WfMSs from several vendors allow the use of mul-
tiple WfMS engines for supporting distributed workflow process execution. In
addition, vendors currently provide capacity planning tools that can estimate
the number of WfMS engines required to support the execution requirements
of a given process. However, in many of these WfMSs distributed workflow process execution requires manual replication of the process definition in all engines that may be involved in the process execution. This approach suffers from potential configuration problems related to the consistency of process definitions residing in different engines.
Another serious limitation of the current approaches to distributed workflow execution is the lack of automatic load balancing. Workflow engine scalability and component redundancy issues can be addressed by choosing an appropriate system architecture [GE95]:
• A server process per client. Such an architecture does not scale well because of the large number of connections in the system and the large number of server processes running on the server machine.
• A process per server. The functionality of the applications is provided by one multi-threaded server process. In this case the server process becomes a bottleneck, and the server program, packed with several applications, becomes hard to maintain, as faults cannot be easily isolated.
• The server functionality and data are partitioned, and there is a server process for each partition. As long as the partitioning of the functionality balances the load on the server processes, this architecture adequately addresses the scalability problem. However, each client has to be aware of the application partition, and any change in the partitioning requires considerable reorganization.
data object updates are rare. Thus, consistency can be handled by humans
who review the data object versions and decide which version to keep.
To support forward recovery, contemporary WfMSs utilize transaction mechanisms provided by the DBMSs that maintain the process-relevant data.
In particular, such WfMSs issue database transactions to record workflow
process state changes in the DBMS. In the event of a failure and restart,
the WfMS accesses the DBMS(s) to determine the state of each interrupted
workflow instance, and attempts to continue executing workflow processes.
However, such forward recovery is usually limited to the internal components
of the WfMS.
Very few WfMSs currently offer support for automatic undoing of incom-
plete workflow instances. In such systems, the workflow designers may specify
the withdrawal of a specific instance from the system while it is running, pos-
sibly at various locations.
The workflow vendors and the research community are debating whether
it is possible to use database management system technology and transaction
processing technology, or the extended/relaxed transaction models [GHM96]
that have been developed to deal with the limitations of database transactions
in the workflow applications.
5.2 Awareness
The term awareness has been used in many collaborative systems (not man-
aged by a process specification) primarily to cover information about one's
fellow collaborators and their actions [BGS+99,PS97,SC94]. However, usually
only raw information is provided. This limited form of awareness is sometimes
called telepresence [GGR96]. One motivation for telepresence is that it allows
users to readily determine who is available at remote locations, so that ad
hoc collaboration may be initiated [DB92].
Commercial WfMSs and the WfMC's Reference Model [Hol94] currently provide standard monitoring APIs. However, unless WfMS users are willing to develop specialized awareness applications that analyze process monitoring logs, their awareness choices are limited to a few built-in options and process-relevant events, usually delivered via e-mail or simple built-in tools. Elvin
6 Summary
References
[Act97] Action Technologies, Action Workflow and Metro, http://www.actiontech.com/, 1997.
[BGS+99] Baker, D., Georgakopoulos, D., Schuster, H., Cassandra, A., Cichocki,
A., Providing customized process and situation awareness in the col-
laboration management infrastructure, Proc. 4th IFCIS Conference
on Cooperative Information Systems (CoopIS'99) , Edinburgh, Scot-
land, 1999, 79-91.
[BK95] Bogia, D., Kaplan, S.M., Flexibility and control for dynamic work-
flows in the worlds environment, Proc. ACM Conf. on Organizational
Computing Systems, 1995, 148-159.
[Bro01] BroadVision: one-to-one publishing, http://www.broadvision.com/, 2001.
[CBR99] Cassandra, A.R., Baker, D., Rashid, M., CEDMOS Complex Event
Detection and Monitoring System, MCC Technical Report CEDMOS-
002-99, Microelectronics and Computer Technology Corporation,
1999.
[CCP+96] Casati, F., Ceri, S., Pernici, B., Pozzi, G., Workflow evolution, Proc. 15th Conf. on Conceptual Modeling (ER'96), 1996, 438-455.
[DB92] Dourish, P., Bly, S., Portholes: supporting awareness in a distributed work group, Proc. Conference on Computer Human Interaction (CHI'92), 1992, 541-547.
[Doc01] Documentum: Documentum CME, http://www.documentum.com/, 2001.
[Eas01] Eastman Software, http://www.eastmansoftware.com, 2001.
[EM97] Ellis, C., Maltzahn, C., The Chautauqua workflow system, Proc. 30th Hawaii Int. Conf. on System Sciences, 1997, 427-437.
[Fil01] FileNet: Panagon and Workflow, http://www.filenet.com/, 2001.
[GE95] Gray, J., Edwards, J., Scale up with TP monitors, Byte, April, 1995,
123-128.
[GGR96] Gutwin, C., Greenberg, S., Roseman, M., Workspace awareness in
real-time distributed groupware: framework, widgets, and evaluation,
R. J. Sasse, A. Cunningham, R. Winder (eds.), People and Comput-
ers XI, Human Computer Interaction Conference (HCl'96), Springer-
Verlag, London, 281-298.
[GHM96] Georgakopoulos, D., Hornick, M., Manola, F., Customizing transaction models and mechanisms in a programmable environment supporting reliable workflow automation, IEEE Transactions on Data and Knowledge Engineering 8(4), August 1996, 630-649.
[GPS99] Godart, C., Perrin, O., Skaf, H., COO: a workflow operator to improve cooperation modeling in virtual processes, Proc. 9th Int. Workshop on Research Issues on Data Engineering: Information Technology for Virtual Enterprises (RIDE-VE'99), 1999, 126-131.
[Gro01] Groove, http://www.groove.net/, 2001.
[GSC+00] Georgakopoulos, D., Schuster, H., Cichocki, A., Baker, D., Managing escalation of collaboration processes in crisis response situations, Proc. 16th Int. Conference on Data Engineering (ICDE'2000), San Diego, 2000, 45-56.
[Hew97] Hewlett Packard: AdminFlow, http://www.ice.hp.com, 1997.
[HHJ+99] Heinl, P., Horn, S., Jablonski, S., Neeb, J., Stein, K., Teschke, M., A comprehensive approach to flexibility in workflow management systems, Proc. Int. Joint Conference on Work Activities Coordination and Collaboration (WACC'99), San Francisco, 1999, 79-88.
[Hol94] Hollingsworth, D., Workflow reference model, Workflow Management Coalition, Document Number TC00-1003, 1994.
[Hol97] Holosofx: workflow analyzer, http://www.holosofx.com, 1997.
[HSB98] Han, Y., Sheth, A., Bussler, C., A taxonomy of adaptive workflow management, On-line Proc. Workshop of the 1998 ACM Conference on Computer Supported Cooperative Work (CSCW'98) "Towards Adaptive Workflow Systems", Seattle, 1998.
[Ids97] IDS-Scheer: Aris toolset, http://www.ids-scheer.de/, 1997.
[IF97] ICL/Fujitsu: ProcessWise, http://www.process.icl.net.co.uk/, 1997.
[Lot01] Lotus: Lotus Notes, http://www.lotus.com/home.nsf/welcome/notes, 2001.
[Met97] MetaSoftware, http://www.metasoftware.com/, 1997.
[MQS01] IBM: MQSeries workflow, http://www.ibm.com/software/ts/mqseries/workflow/, 2001.
[Net01] Microsoft: NetMeeting, http://www.microsoft.com/windows/NetMeeting/, 2001.
[OMG97] Object Management Group, http://www.omg.org/, 1997.
[Ope01] OpenMarket: Content Server, http://www.openmarket.com/, 2001.
[PS97] Pedersen, E.R., Sokoler, T., AROMA: abstract representation of presence supporting mutual awareness, Proc. Conf. on Human Factors in Computing Systems (CHI'97), 1997, 51-58.
[Qui01] Lotus: QuickPlace, http://www.lotus.com/home.nsf/welcome/quickplace, 2001.
[RD98] Reichert, M., Dadam, P., ADEPTflex - supporting dynamic changes of workflows without losing control, Journal of Intelligent Information Systems (JIIS), Special Issue on Workflow Management Systems 10(2), 1998, 93-129.
[Sam01] Lotus: Sametime, http://www.lotus.com/home.nsf/welcome/sametime, 2001.
[SAP01] SAP, http://www.sap.com/, 2001.
[SC94] Sohlenkamp, M., Chwelos, G., Integrating communication, coopera-
tion, and awareness the DIVA virtual office environment, Proc. Conf.
on Computer Supported Cooperative Work (CSCW'94), 1994, 331-
343.
[Tib01] InConcert, http://www.tibco.com/products/in_concert/index.html, 2001.
Abstract. Data warehouse systems have become a key component of the corporate information system architecture. Data warehouses are built in the interest of business decision support and contain historical data obtained from a variety of enterprise-internal and external sources. By collecting and consolidating data that was previously spread over several heterogeneous systems, data warehouses try to provide a homogeneous information basis for enterprise planning and decision making.
After an intuitive introduction to the concept of a data warehouse, the initial
situation starting from operational systems or decision support systems is described
in Section 2. Section 3 discusses the most important aspects of the database of a data
warehouse, including a global view on data sources and the data transformation
process, data classification and the fundamental modelling and design concepts for
a warehouse database. Section 4 deals with the data warehouse architecture and
reviews design alternatives such as local databases, data marts, operational data
stores and virtual data warehouses. Section 5 is devoted to data evaluation tools
with a focus on data mining systems and online analytical processing, a real time
access and analysis tool that allows multiple views into the same detailed data. The
chapter concludes with a discussion of concepts and procedures for building a data
warehouse as well as an outlook on future research directions.
1 Introduction
Enterprises must react appropriately and in time to rapidly changing environmental conditions, recognize trends early, and implement their own ideas as quickly as possible in order to survive and to strengthen their own position in an environment of increasing competition. Globalization, mergers, orientation towards clients' needs in a competitive market, mechanization, and the growing worldwide importance of the Internet determine this entrepreneurial environment. In order to plan, decide, and act properly, information is of the utmost importance for an enterprise. It is essential that the right information is available in the appropriate form, at the right time, and at the right place. New procedures are necessary to obtain and evaluate this information. PC-based databases and spreadsheets for business analysis have the drawback of leaving the data fragmented and oriented towards very specific needs, usually limited to one or a few users. Decision support systems and executive information systems, which can both be considered as predecessors of data warehouses, are usually also tailored to specific requirements rather than to the overall business structure. Enormous advances in hardware and software technologies have enabled the quick analysis of extensive business information. Business globalization, the explosion of Intranet and Internet based applications, and business process re-engineering have increased the necessity for a centralized management of data [Tan97,Hac99].
The much discussed and meanwhile well-known concept of the data ware-
house addresses the tasks mentioned above and can solve many of the prob-
lems that arise. A data warehouse is a database built to support information
access by business decision makers in functions such as pricing, purchasing,
human resources, manufacturing, etc. Data warehousing has quickly evolved
into the center of the corporate information system architecture. Typically,
a data warehouse is fed from one or more transaction databases. The data
needs to be extracted and restructured to support queries, summaries, and
analyses. Related technologies like Online Analytical Processing (OLAP) and
Data Mining supplement the concept. Integrating the data warehouse into
the corresponding application and management support system is indispens-
able in order to effectively and efficiently use the information, which is now
available at any time.
2 Basics
2.1 Initial Situation and Previous Development
The amount of internal as well as environmental enterprise-related data is continuously increasing. Despite this downright flood of data, there is a lack of information relevant for decisions. The data is distributed and concealed in various branches of the firm, where it is often tied to special purposes, and can also be found in countless sources outside the firm. In addition, the data evaluations and reports are neither sufficiently up to date nor sufficiently edited.
In the past, manifold concepts have been developed for using the data already at hand in the firm in order to support planning and decision making. Especially noteworthy are the endeavours regarding Management Information Systems (MIS), by means of which attempts were made as early as the 1960s to evaluate data effectively. However, most ideas and developments have failed so far, for various reasons. In particular, the requirements and expectations were often too high and could not be satisfied with the existing technology. Consequently, an early enthusiasm rapidly changed into disappointment, and started projects were swiftly declared failures and terminated.
Several types of information systems that are related to the data ware-
house concept have been described in the literature. They have become known
under different names such as Decision Support System (DSS), Executive In-
formation System (EIS), Management Information System (MIS), or Man-
agement Support System (MSS) [GGC97,Sch96]. A data warehouse consti-
tutes not only a part but the basis of any of these information systems.
Sauter [Sau96], Marakas [Mar99], Mallach [MaI94] or Sprague and Watson
[SW96] present an overview of decision support systems. Turban [Tur98] gives
an overview of all types of decision support systems and shows how neural
networks, fuzzy logic, and expert systems can be used in a DSS. Humphreys
et al. [HBM+96] discuss a variety of issues in DSS implementation. Dhar and
Stein [DS97] describe various types of decision support tools.
Information has become one of the strategically most relevant success factors of an enterprise, because the quality of any strategic decision directly reflects the quality of its underlying information. Mucksch and Behme [MB97] consider information to be the major enterprise bottleneck resource. Management requires decision-related and area-specific information on markets, clients, and competitors. The data must be relevant and of high quality with respect to precision, completeness, connectedness, access, flexibility, time horizon, portability, and reliability. As an immediate consequence, a large amount of data does not necessarily imply a comprehensive set of information [Dyc00,IZG97]; Huang et al. [HLW98] discuss how to define, measure, analyze, and improve information quality.
Heterogeneous data on the operational level are derived from a variety of different external or internal sources, each of which is bound to its particular purpose. In order to provide these data as a basis for the enterprise's management decisions and for post-decision monitoring of the effects of those decisions, an appropriate adaptation is unavoidable. This is precisely the concept of a data warehouse: flexible and quick access to relevant information and knowledge from any database.
The early database discussion was dominated by the concept of a single uni-
fying, integrating system for all relevant enterprise decisions. The inappropri-
ateness of such a system results from different requirements on the operational
and strategic decision levels regarding relevant data, procedural support, user
interfaces, and maintenance.
Systems on the operational level mainly focus on processing the daily events and are characterized by a huge amount of data that has to be processed and updated in a timely manner. Hence the system's processing time becomes a crucial factor. Utmost topicality should be assured, whereas the time aspects of decision-relevant data are less important because data sources are updated daily on the basis of short-term decisions. Since the environment is stable in the short run, many operations become repetitious and can possibly be automated.
On the strategic level, fast data processing and updating is less critical, while different views corresponding to different levels of data aggregation over various time horizons become more important. Time is a key element in a data warehouse: it is important with respect to the data's history in order to forecast future trends and developments. A continuous updating of the data is not required, as a specific daily modification will not affect any long-term tendency. Data access should be restricted to reading in order to ensure that the database remains consistent.
Hence, powerful information retrieval systems are needed which are able to
retrieve all relevant, latest and appropriately worked up information at any
time in order to provide this information for the decision making process.
Data warehouses are an important step towards this goal.
data warehouse. Inmon et al. [IIS97] explain how data warehousing fits into the corporate information system architecture. Kelly [Kel94] discusses how data warehousing can enable a better understanding of customer requirements and permit organizations to respond to customers more quickly and flexibly. Morse and Issac [MI97] address the use of parallel systems technology for data warehouses.
Inmon's understanding of a data warehouse has been generally accepted
although sometimes other concepts like information warehouse have been
introduced in order to focus on specific commercial products. Hackathorn
[Hac95] uses the term data warehousing in order to focus on the dynamic
aspects of data warehouses and to emphasize that the important aspect of
a warehouse is not data collection but data processing. The common aim
behind all concepts is to considerably improve the quality of information.
In order to provide the right information for decision support, a database has
to be created. The database must be loaded with the relevant data, so that the
required information can be retrieved in the appropriate form. The process of
transforming data that has been collected from a variety of enterprise internal
and external sources and that is to be stored in the warehouse database is
outlined in the example in Figure 3.1.
Fig. 3.1. Transformation of data from internal sources (Marketing, Finance, Personnel) and external sources (Online DB, Media) into the data warehouse
Data sources. Data from several sources have to be transformed and inte-
grated into the data warehouse database.
According to Poe [Poe97], who gives an overview of data warehousing that includes project management aspects, the largest amount of relevant data is produced by enterprise-internal operational systems. These data are distributed over different areas of the enterprise. Acquisition, processing, and storage of these data are difficult because of frequent updates, which occur not only once per year or month but daily or even several times a day and usually affect a large amount of data. Enterprise areas with large data volumes are controlling, distribution, marketing, and personnel management.
Other data is collected from enterprise-external sources like the media or other kinds of publications, databases from other companies (possibly competing ones) as far as they are available, and information from firms that collect and sell data. Technological developments, new communication media, and in particular the Internet have led to a rapid increase of these external data sources. These sources provide additional information for the evaluation of the enterprise's own data on the competitive markets, for an early recognition of the market evolution, and for the analysis of the enterprise's own weaknesses, strengths, and opportunities.
Another information source is meta-information, i.e., information obtained
by processing other information or data. It is the result of an examination
of data obtained from decision support systems and takes the form of
tables, figures, reports, etc., which are of importance to different people in
different branches of the company. It can be very costly to extract this kind
of information whenever it is needed; hence, although the importance and
relevance of this information for future decisions is hard to predict, it may
be preferable to store the meta-information instead of repeatedly generating it.
A data model for strategic decision support differs substantially from models
used on the operational decision level [AM97,Poe97]. The complex data sets
represent a multi-dimensional view of a multi-objective world. Thus the data
structures are multi-dimensional as well. Modelling of data in the data ware-
house means finding a mapping of concepts and terms arising in business
applications onto data structures used in the warehouse. Special attention
must be paid to those structures that are most relevant for the intended
analysis.
A dimensional business model [Poe97] splits the information into facts and
dimensions that can easily be described by key-codes and relations among the
objects. The goal of the model is to provide the information in a user-friendly
way and to simultaneously minimize the number of joins, i.e., the links among
tables in a relational model, required when retrieving and processing the
relevant information.
Each dimension is described in a dimension table whose primary key provides
links to fact tables; the columns are used for hierarchies that provide a logical
grouping of dimensions, and for descriptions or references that introduce
details.
Dimensional data frequently undergo changes. Some changes are the result
of the warehouse development, as at the very beginning not all kinds of queries
can be predicted. Dimension tables should therefore be established in a way
that allows easy later extension and refinement. Product types, sales regions or
time horizons may be considered as business dimensions.
Holthuis [Hol97] differentiates between several types or groups of dimensions
which, once again, may be divided into subtypes or subgroups, etc. Business
dimension types may be standardized with respect to time or some other
kind of measure; they may also be defined individually and task-specifically.
Structural dimension types are hierarchical due to their vertical relations.
Their data may be aggregated in a hierarchical structure or they may consist
of different internal structures. Moreover, there are also categorical aspects
relevant for dimensions. For example, categorical dimension types are marital
status, salary, etc. Categories result from specific attributes of the informa-
tion objects and can be partitioned into several sub-categories.
As about 70% of a database's volume is occupied by fact measures,
queries are often separated into steps. In the first step, access to the dimension
tables restricts the data volume that has to be searched. In order to select
the desired information, SQL queries are limited to a number of predefined
or user-defined links of fact and dimension tables.
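A minimal sketch may make this two-step procedure concrete. The code below (Python; the tables and values are invented for illustration) first collects the qualifying keys from the small dimension tables and only then scans the large fact table:

    # Sketch: two-step restriction over a star-like layout (toy in-memory data).
    region_dim = {1: "Bavaria", 2: "Berlin"}       # region key -> name
    time_dim = {1: "July", 2: "August"}            # time key -> month
    fact_table = [                                 # (region key, time key, sales)
        (1, 1, 120), (1, 2, 90), (2, 1, 40), (2, 2, 75),
    ]

    # Step 1: access the dimension tables to restrict the search space.
    region_keys = {k for k, name in region_dim.items() if name == "Bavaria"}
    time_keys = {k for k, month in time_dim.items() if month in ("July", "August")}

    # Step 2: scan the fact table only for the pre-selected key combinations.
    total = sum(s for r, t, s in fact_table if r in region_keys and t in time_keys)
    print(total)  # 210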
A huge data set may be partitioned into smaller fact or dimension tables.
However, a large number of tables is not desirable, and it has been suggested
that their number should not exceed 500 [AM97]. Horizontal partitioning of
dimension tables should be considered if their size approaches the size of
fact tables, in order to reduce query time. Partitioning is discussed in detail
in Section 3.3 below.
Database schemes. Several database schemes have been used for data
warehouse databases. In analogy to the structure and links between elements
of the de-normalized fact and dimension tables, their names are star scheme,
starflake scheme or snowflake scheme. These schemes have their particular
functional properties and can be visualized in different ways. The schemes
consist of facts and dimensions documented by means of tables, i.e., the
schemes basically consist of tables and differ in their structural design [GG97].
Kimball [Kim96] gives an excellent explanation of dimensional modelling (star
schemes) including examples of dimensional models applicable to different
types of business problems.
Dimensions are represented in fact tables by means of foreign key entries.
A detailed description can be found in the inter-linked dimension tables.
According to their key, the columns of the tables are called the primary
or the foreign key columns. The primary key of a dimension table usually
appears as a foreign key column in the fact table.
Fig. 3.2. An example of a star scheme: the primary key of the fact table consists
of foreign keys into the dimension tables "product", "region", "time-horizon" and
"color"; "sales" and "revenue" are data columns of the fact table.
[Figure: a further scheme example, not reproduced; dimension tables such as
"product", "region", "time-horizon", "color" and "supplier" are linked to a
fact table holding plan and actual cost values per facility.]
In the snowflake scheme, a primary dimension table may contain foreign keys
that serve as primary keys to other dimension tables. The latter dimension
tables, called outrigger tables or secondary dimension tables, are used in
order to specify a primary dimension through this secondary dimension.
Usually this kind of foreign key only exists in fact tables, in which an
appropriate combination of the keys defines a primary key.
In the multiple star scheme, fact tables may, besides their foreign keys to
dimension tables, contain primary keys without any link to a dimension table
but to columns of some fact tables. This happens if the keys linked to the
dimension tables do not sufficiently specify the fact table. The primary key
characterizing the fact table may be any combination of foreign or primary
keys.
The star scheme has a simple structure and well-defined tables and links.
Updates can easily be handled by a user familiar with any kind of database
design. The system response time for a query is short. One of the main
disadvantages is the simplicity of the links among the tables.
Partitioning. Partitioning the database means splitting the data
into smaller, independent and non-redundant parts. Partitioning is always
closely connected to some partitioning criteria which can be extracted from
the data. For instance there might be enterprise related data, geographical
data, organizational units or time related criteria, or any combination of
these. A flexible access to decision-relevant information as one of the most
important goals of data-warehousing implies that partitioning is particularly
a tool to structure current detailed data into easily manageable pieces.
Anahory and Murray [AM97] differentiate between horizontal and verti-
cal partitioning. Horizontal partitioning splits data into parts covering equal
time horizon lengths. Non-equal time horizon lengths might be advantageous
whenever the frequency of data access is known in advance. More frequently
accessed data, e.g., the most recent information, should be contained in
smaller parts so that it can easily be kept online-accessible. Horizontal par-
titioning may also split data with respect to some other criteria, e.g., prod-
ucts, regions, or subsidiary enterprises. This kind of partitioning should
be independent of time. Irrespective of the dimension, Anahory and Murray
recommend the round-robin method for horizontal partitioning,
i.e., whenever a certain threshold is reached, the current data partition is
archived in order to free the online memory for new data partitions.
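A small sketch of this round-robin scheme, with an invented row threshold, might look as follows: whenever the online partition reaches the threshold, it is moved to the archive and a fresh partition is started.

    from collections import deque

    # Sketch of round-robin horizontal partitioning (threshold is assumed).
    THRESHOLD = 3                      # rows per partition, for the sketch
    archive = deque()                  # archived (offline) partitions, oldest first
    current = []                       # the online partition holding recent data

    def load_row(row):
        global current
        current.append(row)
        if len(current) >= THRESHOLD:  # threshold reached: rotate partitions
            archive.append(current)
            current = []

    for day, sales in enumerate([5, 7, 3, 9, 4, 6, 8]):
        load_row({"day": day, "sales": sales})

    print(len(archive), "archived partitions;", len(current), "rows online")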
The vertical partitioning of data is closely related to the table representa-
tion of the data. Hence, columns or a set of columns may define a partition.
Moreover, enterprise functions may also be considered as a kind of vertical
partition. Vertical partitioning avoids an extensive memory usage because
less frequently used columns are separated from the partition.
Partitioning has several advantages; in particular, a smaller data volume
increases the flexibility of data management, as the administration of large
data tables is reduced to smaller and manageable ones. Data can more eas-
ily be restructured, indexed or reorganized; data monitoring and checking
are also easier. In addition, partitioning facilitates a regular data backup
and allows a faster data recovery. Finally, partitioning increases the system's
performance because a small data volume can be searched more quickly.
Frequently used data are linked in small tables in order to achieve increased query efficiency. Any
kind of structured data access, e.g., a certain access probability, data access
sequences, etc., can be reflected by means of linked tables of data blocks in
order to minimize the number of required queries. Data redundancy might be
quite efficient for data whose use is widely spread and rather stable. This is
even more important if costly calculations of data are the only way to avoid
redundancy.
Updates. After loading the data warehouse with the decision relevant
information, the data have to be updated on a regular basis, i.e., current
external or internal data have to be stored in the warehouse. This procedure,
called warehouse loading, is executed either at well-defined time steps or
whenever there is a need for new information. The required topicality of the
warehouse data depends on the enterprise-specific requirements. For instance,
financial data typically need a daily update. Regular data updates within a
certain time interval can be shifted to the night or to the weekend in order
to avoid overloading the machines or lengthening query response times. Time
marks are used to indicate the changes of data over time, and monitoring
mechanisms register changes of data.
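A minimal sketch of such a time-mark driven loading step (all names and dates are invented) could look like this: only source rows whose change mark lies after the last load are copied into the warehouse, and the time mark is then advanced.

    from datetime import datetime

    # Sketch of time-mark based warehouse loading (names are assumptions).
    source = [
        {"id": 1, "value": 10, "changed": datetime(2002, 5, 1)},
        {"id": 2, "value": 20, "changed": datetime(2002, 5, 3)},
    ]
    warehouse = []
    last_load = datetime(2002, 5, 2)   # time mark of the previous loading run

    def load_warehouse(now):
        global last_load
        delta = [r for r in source if r["changed"] > last_load]
        warehouse.extend(delta)        # append the changes; history is preserved
        last_load = now                # advance the time mark
        return len(delta)

    print(load_warehouse(datetime(2002, 5, 4)), "rows loaded")  # 1 rows loaded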
Archiving. Data that are no longer needed online are put to a cheaper offline
memory, such as optical disks or sometimes magnetic tapes, while the data's
aggregated information is still accessible online. The archive keeps the size
of the necessary online memory limited. In order to guarantee that simple
standard or ad-hoc queries can be answered in a reasonable time, the archive
memory also provides the necessary effective access procedures.
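The following sketch illustrates the idea (cutoff, row format and stores are assumptions): detail rows older than a cutoff are aggregated, the aggregate stays online, and the detail rows are moved to the offline store.

    # Sketch of archiving: old detail rows go offline, their aggregate stays online.
    CUTOFF = 2000
    detail_online = [
        {"year": 1998, "revenue": 100}, {"year": 1999, "revenue": 120},
        {"year": 2001, "revenue": 150},
    ]
    aggregates_online, offline_store = {}, []

    for row in list(detail_online):
        if row["year"] < CUTOFF:
            key = row["year"]
            aggregates_online[key] = aggregates_online.get(key, 0) + row["revenue"]
            offline_store.append(row)      # e.g., written to optical disk or tape
            detail_online.remove(row)

    print(aggregates_online)  # {1998: 100, 1999: 120} still answerable online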
Data marts. Local databases, the so-called data marts, are databases lim-
ited to some enterprise departments such as marketing, controlling, etc. In-
mon [Inm96] considers data marts as departmental level databases. They are
built and adjusted to the specific departmental requirements. Data marts
contain all components and functions of a data warehouse; however, they are
limited to a particular purpose or environment.
The data is usually extracted from the data warehouse and further de-
normalized and indexed to support intense usage by the targeted customers.
Data marts never provide insight into global enterprise information but only
consider the relevant aspects of their particular field of application. Data
marts serve specific user groups. As data marts consider only subsets of the
whole set of data and information, the amount of processed data and the
database are naturally smaller than the corresponding sets of the overall
data warehouse. This advantage is frequently used for local data redundancy,
where data on customers, products, regions or time intervals, etc., are inte-
grated as several copies. In order to provide reasonable data marting, the
data should be kept separate as long as this reflects the functional or natural
separation of the organization.
Data marts can also be created by decomposing a complete data ware-
house. Inversely, an enterprise-wide database can also be created by com-
posing departmental level data marts. Data marts may be organized as in-
dependently functioning warehouses with data access to their own sources.
Alternatively, the data access may be realized through a central data ware-
house. For consistency purposes the latter is preferable. Semantically, there
is no difference between the data model of a data mart and that of a data
warehouse.
The data mart design should be in analogy to the design of the database
and should always use the data-inherent structure and clustering if this does
not clash with the access tools. Anahory and Murray [AM97] recommend the
snowflake scheme, integrating possibly different data-types or meta-data on
certain aggregation levels. The data updating of the data marts can be sim-
plified if the technologies are identical and if a data mart only consists of a
subpart of the central data warehouse. Kirchner [Kir97] reports on updating
problems when different data marts are supposed to be updated simultane-
ously.
There are various reasons for using data marts. If particular areas have
to provide frequent access to their data, a local copy of the data may be
useful. Data marts can accelerate queries because the amount of data that
has to be searched is smaller than in the global warehouse. The implementation
of data marts provides the chance to structure and partition data, e.g., in the
way that the access tools require. Simultaneously arriving queries that might
create problems in a single data warehouse can be de-coupled into query
clusters that each address only one data mart. Finally, data marts more easily
guarantee the necessary data protection against uncontrolled access by a
complete physical separation of the data.
Web warehousing. Data warehouse solutions which use the world wide
web are called web warehousing [Mat99,Sin98]. The world wide web provides
a huge source of information for the enterprise as well as an easy and fast
data distribution and communication medium. Information collection and
integration into the data warehouse is also called web farming. The internet
is used for data access to external data while enterprise-internal data and
information distribution and access is supported by intranets. Nets for data
and information exchange between cooperating enterprises, and from and
into their intranets are called extranets.
Database systems. There are different database technologies that are ap-
plicable to a data warehouse. They must have the ability to process huge
amounts of data arising from a large variety of different, detailed, aggregated
or historical enterprise information. Relational database systems have been
successfully applied in operational systems and provide a good solution to
data warehouse concepts as well. Relational databases have the advantage
of parallelism and familiarity. Alternatively, other technologies for decision
support have been applied. For instance there are multi-dimensional data-
base management systems that have been developed for the processing of
multi-dimensional data structures in online analytical processing (OLAP).
These database systems process data with respect to their dimensions. In
order to guarantee efficient OLAP queries, they use multi-dimensional in-
dices [GHR+97]. Moreover, there are hybrid database systems that combine
relational as well as multi-dimensional elements in order to process large
data volumes and to provide possibilities for multi-dimensional data analysis
[Sch96].
The data warehouse concept has proved useful for the support of enterprise
planning and decision making through the generation, evaluation and analysis
of relevant data and information. The variety of applications for analysing
and evaluating data stored in a warehouse is as large as the variety of different
environmental and internal problems and tasks.
Many software tools have been integrated into a data warehouse system.
Middleware and gateways make it possible to extract data from different systems. Trans-
formation tools are needed for the correction and modification of data. Other
tools have proved useful for the creation of meta-data. Finally, a large num-
ber of tools are available for retrieval, assessment and analysis purposes. The
following sections first discuss general data evaluation tools and then review
two important data analysis technologies: data mining and online analytical
processing.
A large number of evaluation tools enable the user to use the data warehouse
easily and effectively [Sch96,Poe97]. It is debatable whether the evaluation
tools of the front-end area necessarily belong to the set of data warehouse
components. However, they are indispensable for a sensible use of the data
warehouse concept, and the effort for selecting and integrating them must
not be underestimated; the selection of the tools should be done in
cooperation with the users. Ryan [Rya00] discusses the evaluation and
selection of such tools.
The manifold user tools for information retrieval in a warehouse can be
classified according to different criteria. The spectrum of tools ranges from
those for simple queries and report functions to the complex tools necessary
for the multi-dimensional analysis of data. One can differentiate between
ad-hoc reporting tools, data analysis tools, EIS tools and business process
engineering tools, as well as navigation elements, which are typically
implemented in all tools.
Query processing techniques are an essential element of many evaluation
tools. There may be ad-hoc as well as standard queries. The knowledge of
frequently required queries can help to prepare and provide a standardized
form in the warehouse in order to accelerate the response time and to increase
the user interface quality. Such queries may be stored as forms; additionally,
the scheduling and retrieval procedures necessary for frequently repeated
assessments should be provided to the user. In contrast to standard queries,
the kind and frequency of ad-hoc queries are difficult to predict and prepare
for in advance. Data warehouse queries are sometimes split into three groups:
those providing only information, those that allow a subsequent analysis of
information and data, and finally causal queries. Warehouse query processing
aspects have, e.g., been discussed by Cui and Widom [CW00], Cui et al.
[CWW99], O'Neil and Quass [OQ97], and Gupta et al. [GHQ95].
An important feature of a useful tool is that it allows a comprehensive
warehouse usage without a deeper knowledge of database systems. This is
achieved through a graphic interface which provides either a direct or an
indirect (via an additional level of abstraction) data access. The interme-
diate level of abstraction enables the user to assign his own specific names
to the data or tables. The graphic tool support allows a simple handling
of queries without a detailed knowledge of the SQL language. The results
are finally transformed into data masks or data tables, which are frequently
connected with report generators or various kinds of graphic presentation
systems [Sch96]. Hence, the system supports the user in generating any kind
of business-related reference numbers without requiring specific knowledge of
the underlying system.
Report generators allow an individual report design. Statistical methods
supplement the data warehouse and provide tools ranging from simple
probability analyses to more complex methods. The presentation of results
might be integrated into data mining tools or be left for additional
presentation programs. The reliability of the derived results might be
questionable and must therefore be verified by means of statistical
evaluations. Data mining systems incorporate mathematical, statistical,
empirical and knowledge-based approaches.
Incomplete databases, as well as databases containing only a minimal amount
of relevant data, limit a successful application of data mining tools and can,
moreover, lead to false evaluations. To a certain degree, defective or false data
can be detected, filtered and then processed further by some data mining
tools. This kind of data cleaning, called scrubbing, is, of course, only possible
up to a certain level of data corruption and heavily depends on the data
quality and data redundancy. The importance of scrubbing is due to the fact
that data warehouse systems prove most successful when the user can focus
on using the data that are in the warehouse without having to wonder about
their credibility or consistency.
Data mining has successfully been applied in various business areas, such
as banking, insurance, finance, telecommunication, medicine, or public health
services. As a typical example, the shopping behaviour of customers in a su-
permarket has been examined in order to draw conclusions for the market's
presentation of its products. Type and number of all products in the cus-
tomer's shopping basket have been recorded in order to draw conclusions
with respect to the customer's behaviour. For instance, it might be the case
that customers buying coffee frequently also buy milk, or customers buying
wine frequently also buy cheese. A typical correlation between diaper and
beer has been detected in US supermarkets: men buying diapers tend to buy
beer for themselves too. Conclusions of this kind could lead to an appropriate
location and presentation of the market's products and could even influence
the product mix. In addition, the information is important for estimating the
influence of withdrawing a product from the mix onto the sales figures of
other products.
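A toy sketch of such a basket analysis (the baskets are invented) counts how often product pairs co-occur and derives a simple confidence value for rules of the form "customers buying X also buy Y":

    from itertools import combinations
    from collections import Counter

    # Sketch of basket analysis on toy data: pair co-occurrence and confidence.
    baskets = [
        {"coffee", "milk"}, {"coffee", "milk", "bread"},
        {"wine", "cheese"}, {"diapers", "beer"}, {"coffee", "bread"},
    ]
    item_count, pair_count = Counter(), Counter()
    for basket in baskets:
        item_count.update(basket)
        pair_count.update(frozenset(p) for p in combinations(sorted(basket), 2))

    for pair, n in pair_count.items():
        x, y = sorted(pair)
        conf = n / item_count[x]        # confidence of the rule x -> y
        print(f"{x} -> {y}: support={n}, confidence={conf:.2f}")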
Comprehensive overviews on data mining and related tools are provided
by Han and Kamber [HK02], Groth [Gro97b,Gro99], Fayyad et al. [FPS+95],
Cabena [Cab97], Berry and Linoff [BL00], Bigus [Big96], Weiss and Indurkhya
[WI97], Adriaans and Zantiage [AZ96], Westphal and Blaxton [WB98],
Anand [Ana00], Mena [Men99], and Lusti [Lus02]. Data preparation for data
mining is discussed by Pyle [Pyl98].
The following subsections review some commonly used methods for data
mining.
Knowledge-based methods. Further methods applied for pattern recognition
include, e.g., inductive learning, genetic algorithms and neural networks.
Additionally, "if-then" analysis has been found to be useful.
Cluster analysis. Cluster analysis groups data with respect to their
attributes so that the data within a group are as homogeneous as possible.
Basically there are two ways of clustering: hierarchical clustering and
partitioning. One way of hierarchical clustering is to start off with the two
most homogeneous elements in order to create the first group of more than
one element; the process continues until a sufficiently small number of groups
has been reached. Other methods of clustering pursue the opposite direction:
groups are continuously split until a certain level of homogeneity is reached.
Hierarchical clustering always creates hierarchy trees.
Partitioning groups the data without going through a hierarchical cluster-
ing process. One can think of the objects represented by the data as vertices
of an edge-weighted graph; each positive or negative weight represents some
measure of similarity or dissimilarity, respectively, of the object pair defining
an edge. A clustering of the objects into groups is a partition of the graph's
vertex set into non-overlapping subsets. The set of edges connecting vertices
of different subsets is called a cut. In order to find groups as homogeneous as
possible, positive edges should appear within groups and negative edges in
the cut. Hence, a best clustering is one with a minimal cut weight. Cut
minimization subject to some additional constraints arises in many
applications, and the literature covers a large number of disciplines, as
demonstrated by the remarkable variety in the reference section of [DJ80].
In general there are two steps to be performed during the clustering
process. Firstly, some measure of similarity between distinct objects must be
derived, and secondly the objects must be clustered into groups according to
these similarities (clique partitioning) [DP94,GW89].
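The following sketch illustrates the two steps for the hierarchical variant (toy similarity values; average linkage is one possible choice): given pairwise similarities, the two most similar groups are merged repeatedly until a sufficiently small number of groups remains.

    # Sketch of hierarchical (agglomerative) clustering on toy similarities.
    sim = {
        ("a", "b"): 0.9, ("a", "c"): 0.2, ("a", "d"): 0.1,
        ("b", "c"): 0.3, ("b", "d"): 0.2, ("c", "d"): 0.8,
    }

    def similarity(g1, g2):
        # Average pairwise similarity between two groups (average linkage).
        pairs = [sim.get((min(x, y), max(x, y)), 0.0) for x in g1 for y in g2]
        return sum(pairs) / len(pairs)

    groups = [{"a"}, {"b"}, {"c"}, {"d"}]
    while len(groups) > 2:                   # stop at a sufficiently small number
        i, j = max(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda ij: similarity(groups[ij[0]], groups[ij[1]]),
        )
        groups[i] |= groups.pop(j)           # merge the most homogeneous pair

    print(groups)  # [{'a', 'b'}, {'c', 'd'}]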
A pivoting operation allows the user to consider the data from the opposite
perspective; other dimensions are cut out after the pivoting operation.
As the first step of a data warehouse project, a precise definition of the goals
is indispensable. In general, a survey of the needs of the various user groups
is necessary in order to generate knowledge about the required information
and data; one of the most difficult problems is to specify the management's
future information needs. When the warehouse is developed, this knowledge
is very incomplete and undergoes continuous modification.
The users of a data warehouse may be characterized with respect to the
management hierarchy within the enterprise. Another classification might be
based on the users' experience with the data warehouse concept. Poe [Poe97]
differentiates several groups. The novice or casual user, with no or very
limited computing experience, needs frequent support and a simple user
interface. Business analysts are regular users with a basic knowledge of the
daily requests for information; they are able to use the system on the basis
of the predefined navigation and reporting tools without further special
support. Power users are able to specify their own individual environment
through parameters and macro definitions; they are sufficiently qualified to
generate individual reports and analyses independently of the provided
support tools. The application developer is the most skillful
user who is responsible for the warehouse environment and the availability
of tools.
Another differentiation of user groups can be made according to the users'
demands on the warehouse. Frequent and regular use of the warehouse
requires a different quality of decision-relevant information than occasional
usage. The design of the user interface must, however, observe the needs of
the weakest group of occasional users, in order to avoid their total exclusion
from the use of the warehouse. Quality, contents and current or future
demands on a warehouse have to reflect this aspect of usage frequency.
A further user group differentiation arises from the functional differentia-
tion of an enterprise into, e.g., product management, marketing, distribution,
accounting and controlling, finance, etc. For any of these business functions
a standard warehouse can be supplemented with additional, specific tools
and applications or a specific warehouse can be designed. Dyer and Forman
[DF95] discuss how to build systems for marketing analysis. Mentzl and Ludwig
[ML97] report on the use of a warehouse as a marketing database in order
to improve client care and to quickly recognize trends. The marketing de-
partment might also need an access to geographic information systems for
the generation of client relevant data.
Many users have developed their own databases that meet their needs.
These users may be skeptical whether the new data warehouse can support
their reporting needs as well as their own solutions do, and they may feel
threatened by the prospect of automation. Users may prefer their own data
marts for a variety of reasons. They may want to put their data on different
hardware platforms, or they may not want to work with other groups on
resolving data definition issues. One functional area of the enterprise may not
want another functional area to see or to have access to its data, e.g., because
of concerns about misinterpretations or misunderstandings. Besides,
disagreements about the correctness of added or processed data might arise.
Extracting, transforming and loading data takes the majority of the time in
initial data warehouse development; estimates of the average effort for these
steps run as high as 80% of the total time spent on building a warehouse. A
very common problem is that data must be stored which are not kept in any
transaction processing system, so that the data warehouse developer faces
the problem of building a system dedicated to generating the missing
information.
Prototyping may help to keep the time and costs of a warehouse development
under control. The warehouse is first constructed for a small, limited
and well-defined business area and later extended to the whole enterprise.
A prototype allows results and the quality of the warehouse characteristics
to be presented quickly, which is quite important in order to gain user
acceptance as early as possible. Additionally, modifications and corrections
of the concepts and goals can be recognized early enough to allow an
appropriate restructuring. Prototyping is a central part of rapid application
development (RAD) and joint application design (JAD) methodologies.
Consultants are assigned to work directly with the clients and a continuous
collaboration, mentoring, and supervision ensures the desired outcome. The
traditional software development cycle follows a rigid sequence of steps with
a formal sign-off at the completion of each. A complete detailed requirements
analysis is done to capture the system requirements at the very beginning. A
specification step has to be signed-off before the development phase starts.
But the design steps frequently reveal technical infeasibilities or extremely ex-
pensive implementations unknown at the requirements' definition step. RAD
is a methodology for compressing the analysis, design, implementation and
test phases into a series of short, iterative development cycles. The advantage
gained is that iterations allow a self-correction of the complex efforts by small
refinements and improvements. Small teams working in short development it-
erations increase the speed, communication, management, etc. An important
aspect of the iterative improvement steps is that each iteration cycle delivers
a fully functional sub-version of the final system. JAD [WS95,Wet91] centers
around structured workshop sessions. JAD meetings bring together the users
and the builders of the system in order to avoid any delay between questions
and answers. The key people involved are present, so the situation does not
arise that, once everyone is finally in agreement, one discovers that even
more people should have been consulted because their needs require
something entirely different.
Besides the costs of the warehouse installation, one should not underestimate
maintenance and support costs as well as the personnel costs for the
system's useful application. Large and complex warehouses may take on a
life of their own. Maintaining the warehouse can quickly become a very
expensive task; the more successful the warehouse is with the users, the more
maintenance it may require. Possibly the enterprise has to introduce new
hardware or software technologies. When a data warehouse has been built,
questions arise such as: Who should administer the database? Who is
responsible for data quality monitoring? Who makes the final decision about
the correctness of data? Who has access to what data? Inmon et al. [IWG97],
Yang and Widom [YW00], Labio et al. [LYG99], Huyn [Huy96,Huy97], Quass
and Widom [QW97], Quass et al. [QGM+96], and Mumick et al. [MQM97]
discuss maintenance issues in data warehouses.
The Data Warehousing Institute estimates that over 3000 companies offer
data warehouse products and services.
8 Conclusions
References
[Ago99] Agosta, L., The essential guide to data warehousing: aligning technol-
ogy with business imperatives, Prentice-Hall, 1999.
[AM97] Anahory, S., Murray, D., Data warehousing in the real world,
Addison-Wesley, 1997.
[Ana00] Anand, S., Foundations of data mining, Addison-Wesley, 2000.
[AV98] Adamson, C., Venerable, M., Data warehouse design solutions, John
Wiley & Sons, 1998.
[AZ96] Adriaans, P., Zantiage, D., Data mining, Addison-Wesley, 1996.
[BA97] Bischoff, J., Alexander, T. (eds.), Data warehouse: practical advice
from the experts, Prentice-Hall, 1997.
[BE96] Barquin, R., Edelstein, H. (eds.), Planning and designing the data
warehouse, Prentice-Hall, 1996.
[BE97] Barquin, R., Edelstein, H. (eds.) , Building, using and managing the
data warehouse, Prentice-Hall, 1997.
[Big96] Bigus, J.P., Data mining with neural networks, McGraw-Hill, 1996.
[Bis94] Bischoff, J., Achieving warehouse success, Database Programming &
Design 7, 1994, 27-33.
[BL00] Berry, M., Linoff, G., Mastering data mining, John Wiley & Sons,
2000.
[Bra96] Brackett, M.H., The data warehouse challenge - taming data chaos,
John Wiley & Sons, 1996.
[Bro99] Brosius, G., Microsoft OLAP services, Addison-Wesley, 1999.
[BS96] Bontempo, C.J., Saracco, C., Database management: principles and
products, Prentice-Hall, 1996.
[BS97] Berson, A., Smith, S.J., Data warehousing, data mining, and OLAP,
McGraw-Hill, 1997.
[Bur97] Burleson, D., High performance Oracle data warehousing, Coriolis
Group, 1997.
[CA96] Corey, M., Abbey, M., Oracle data warehousing, McGraw-Hill, 1996.
[CAA+98] Corey, M., Abbey, M., Abramson, I., Taub, B., Oracle8 data ware-
housing, McGraw-Hill, 1998.
[CAA+99] Corey, M., Abbey, M., Abramson, I., Venkitachalam, R., Barnes, L.,
Taub, B., SQL Server 7 data warehousing, McGraw-Hill, 1999.
[Cab97] Cabena, P., Discovering datamining: from concept to implementation,
Prentice-Hall, 1997.
[CG98] Chamoni, P., Gluchowski, P. (eds.), Analytische Informationssys-
teme, Springer, Berlin, 1998.
[CVB99] Craig, R.S., Vivona, J.A., Bercovitch, D., Microsoft data warehousing:
building distributed decision support systems, John Wiley & Sons,
1999.
[CW00] Cui, Y., Widom, J., Lineage tracing in a data warehousing system,
Proc. 16th International Conference on Data Engineering, 2000, 683-
684.
[CWW99] Cui, Y., Widom, J., Wiener, J.L., Tracing the lineage of view data
in a data warehousing environment, Technical Report, Stanford Uni-
versity, 1999.
[Deb98] Debevoise, T., The data warehouse method, Prentice-Hall, 1998.
[Dev97] Devlin, B., Data warehouse: from architecture to implementation,
Addison-Wesley, 1997.
[DF95] Dyer, R., Forman, E., An analytic approach to marketing decisions,
Prentice-Hall, 1995.
[DG00] Dodge, G., Gorman, T., Essential Oracle8i data warehousing, John
Wiley & Sons, 2000.
[DJ80] Dubes, R., Jain, A.K., Clustering methodologies in exploratory data
analysis, Advances in Computers 19, 1980, 113-228.
[DP94] Dorndorf, U., Pesch, E., Fast clustering algorithms, ORSA Journal
on Computing 6, 1994, 141-153.
[DS97] Dhar, V., Stein, R., Intelligent decision support methods: the science
of knowledge work, Prentice-Hall, 1997.
[Dyc00] Dyche, J., e-Data: turning data into information with data warehous-
ing, Addison-Wesley, 2000.
[FPS+95] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R.,
Advances in knowledge discovery and data mining, MIT Press, 1995.
[Fra98] Franco, J.M., Le datawarehouse, Eyrolles, 1998.
[GG97] Gabriel, R., Gluchowski, P., Semantische Modellierungstechniken für
multidimensionale Datenstrukturen, HMD, Theorie und Praxis der
Wirtschaftsinformatik 34, 1997, 18-37.
[GGC97] Gluchowski, P., Gabriel, R., Chamoni, P., Management Support
Systeme, Computergestützte Informationssysteme für Führungskräfte
und Entscheidungsträger, Springer-Verlag, Berlin, 1997.
[GHQ95] Gupta, A., Harinarayan, V., Quass, D., Aggregate-query processing
in data warehousing environments, Proc. 21st Conf. on Very Large
Data Bases (VLDB), 1995, 358-369.
[GHR+97] Gupta, H., Harinarayan, V., Rajaraman, A., Ullman, J., Index selec-
tion for OLAP, Proc.International Conference on Data Engineering,
1997, 208-219.
[Gio00] Giovinazzo, W., Object-oriented data warehouse design, Prentice-
Hall, 2000.
[GLW+99] Garcia-Molina, H., Labio, W.J., Wiener, J.L., Zhuge, Y., Distributed
and parallel computing issues in data warehousing, Proc. ACM Prin-
ciples of Distributed Computing Conference, 1999, 7-10.
[Inm00] Inmon, W.H., Exploration warehousing, John Wiley & Sons, 2000.
[IRB+98] Inmon, W.H., Rudin, K., Buss, C.K., Sousa, R., Data warehouse per-
formance, John Wiley & Sons, 1998.
[IWG97] Inmon, W.H., Welch, J.D., Glassey, K., Managing the data warehouse,
John Wiley & Sons, 1997.
[IZG97] Inmon, W.H., Zachman, J., Geiger, J., Data stores, data warehousing,
and the Zachman framework, McGraw-Hill, 1997.
[JLV+00] Jarke, M., Lenzerini, M., Vassiliou, Y., Vassiliadis, P., Fundamentals
of data warehouses, 2nd edition, Springer-Verlag, 2000.
[Kai98] Kaiser, B.-D., Corporate information with SAP-EIS, Morgan Kauf-
mann, 1998.
[Kel94] Kelly, S., Data warehousing: the route to mass customization, John
Wiley & Sons, 1994.
[Kel97a] Kelly, B.W., AS/400 data warehousing: the complete implementation
guide, Midrange Computing, 1997.
[Kel97b] Kelly, S., Data warehousing in action, John Wiley & Sons, 1997.
[Kim96] Kimball, R., The data warehouse toolkit, John Wiley & Sons, 1996.
[Kir97] Kirchner, J., Transformationsprogramme und Extraktionsprozesse
entscheidungsrelevanter Basisdaten, H. Mucksch, W. Behme (eds.),
Das Data Warehouse-Konzept, Gabler, 1997, 237-266.
[KLM+97] Kawaguchi, A., Lieuwen, D., Mumick, I., Quass, D., Ross, K., Con-
currency control theory for deferred materialized views, Proc. Inter-
national Conference on Database Theory, 1997, 306-320.
[KM00] Kimball, R., Merz, R., The data webhouse toolkit: building the web-
enabled data warehouse, John Wiley & Sons, 2000.
[KRR+98] Kimball, R., Reeves, L., Ross, M., Thornwaite, W., The data ware-
house lifecycle toolkit: tools and techniques for designing, developing
and deploying data marts and data warehouses, John Wiley & Sons,
1998.
[LL96] Laudon, K.C., Laudon, J.P., Management information systems, orga-
nization and technology, 4th edition, Prentice-Hall, New Jersey 1996.
[Lus02] Lusti, M., Data warehousing und Data Mining, 2nd edition, Springer-
Verlag, 2002.
[LYG99] Labio, W.J., Yerneni, R., Garcia-Molina, H., Shrinking the warehouse
update window, Proc. ACM SIGMOD Conference, 1999, 383-394.
[LZW+97] Labio, W.J., Zhuge, Y., Wiener, J.L., Gupta, H., Garcia-Molina, H.,
Widom, J., The WHIPS prototype for data warehouse creation and
maintenance, Proc. ACM SIGMOD Conference, 1997, 557-559.
[MA00] Moss, L., Adelman, S., Data warehouse project management,
Addison-Wesley, 2000.
[Mal94] Mallach, E., Understanding decision support systems and expert sys-
tems, McGraw-Hill, 1994.
[Mar99] Marakas, G., Decision support systems in the 21st century, Prentice-
Hall, 1999.
[Mat96] Mattison, R., Data warehousing: strategies, tools and techniques,
McGraw-Hill, 1996.
[Mat97] Mattison, R., Data warehousing and data mining for telecommunica-
tions, Artech House, 1997.
[Mat99] Mattison, R., Web warehousing and knowledge management,
McGraw-Hill, 1999.
[Sin98] Singh, H.S., Interactive data warehousing via the web, Prentice-Hall,
1998.
[Spe99] Sperley, E., The enterprise data warehouse, vol. 1, Planning, building
and implementation, Prentice-Hall, 1999.
[SW96] Sprague, R.H., Watson, H., Decision support for management,
Prentice-Hall, 1996.
[Tan97] Tanler, R., The intranet data warehouse: tools and techniques for con-
necting data warehouses to intranets, John Wiley & Sons, 1997.
[Thi97] Thierauf, R.J., On-line analytical processing systems for business,
Quorum Books, 1997.
[Tho97] Thomsen, E., OLAP solutions: building multidimensional information
systems, John Wiley & Sons, 1997.
[TSC99] Thomsen, E., Spofford, G., Chase, D., Microsoft OLAP solutions,
John Wiley & Sons, 1999.
[Tur98] Turban, E., Decision support systems and expert systems, Prentice-
Hall, 1998.
[VA98] Venerable, M., Adamson, C., Data warehouse design solutions, John
Wiley & Sons, 1998.
[WB98] Westphal, C., Blaxton, T., Data mining solutions: methods and tools
for solving real-world problems, John Wiley & Sons, 1998.
[Wel98] Welbrock, P.R., Strategic data warehousing principles using SAS soft-
ware, SAS Institute, 1998.
[Wet91] Wetherbe, J.C., Executive information requirements: getting it right,
MIS Quarterly, 1991.
[WG97] Watson, H., Gray, P., Decision support in the data warehouse,
Prentice-Hall, 1997.
[WHR97] Watson, H.J., Houdeshel, G., Rainer, R.K., Building executive infor-
mation systems and other decision support applications, John Wiley
& Sons, 1997.
[WI97] Weiss, S.M., Indurkhya, N., Predictive data mining: a practical guide,
Morgan Kaufmann, 1997.
[WS95] Wood, J., Silver, D., Joint application development, 2nd edition, John
Wiley & Sons, 1995.
[WW99a] Whitehorn, M., Whitehorn, M., Business intelligence: the IBM solu-
tion, Springer, 1999.
[WW99b] Whitehorn, M., Whitehorn, M., SQL server: data warehousing and
OLAP, Springer-Verlag, 1999.
[You00] Youness, S., Professional data warehousing with SQL Server 7.0 and
OLAP services, Wrox, 2000.
[YW97] Yazdani, S., Wong, S., Data warehousing with Oracle: an administra-
tor's handbook, Prentice Hall, 1997.
[YW00] Yang, J., Widom, J., Making temporal views self-maintainable for
data warehousing, Proc. 7th International Conference on Extending
Database Technology, 2000, 395-412.
[ZGH+95] Zhuge, Y., Garcia-Molina, H., Hammer, J., Widom, J., View mainte-
nance in a warehousing environment, Proc. ACM SIGMOD Confer-
ence, 1995, 316-327.
[ZGW96] Zhuge, Y., Garcia-Molina, H., Wiener, J.L., The strobe algorithms
for multi-source warehouse consistency, Proc. Conference on Parallel
and Distributed Information Systems, 1996, 146-157.
Mobile computing environments introduce new data management problems
and pose many research challenges. This chapter addresses data management
issues in mobile computing environments. It analyzes the past and present of
mobile computing, wireless networks, mobile computing devices, architectures
for mobile computing, and advanced applications for mobile computing
platforms. It covers extensively weak connectivity and disconnections in
distributed systems as well as broadcast delivery. The chapter also lists
available (at the time of writing) online mobile computing resources.
1 Introduction
The technical challenges that mobile computing must resolve are hardly triv-
ial. Many challenges in developing software and hardware for mobile comput-
ing systems are quite different from those involved in the design of today's
stationary, or fixed network systems [FZ94]. Also, the implications of host
mobility on distributed computations are quite significant. Mobility brings
about a new style of computing. It affects both fixed and wireless networks.
On the fixed network, mobile users can establish a connection from different
locations.
[Figure: mobile computing at the intersection of systems support, user
applications, telecommunications, and networking engineering.]
Much effort is devoted to reducing battery weight and lengthening the life of
a charge. Power can be conserved not only by the design of energy-efficient
software, but also by efficient operation [DKL+94,ZZR+98].
down individual components when they are idle, for example, spinning down
the internal disk or turning off screen lighting. Applications may have to con-
serve power by reducing the amount of computations, communication, and
memory, and by performing their periodic operations infrequently to mini-
mize the start-up overhead. Database applications may use energy efficient
query processing algorithms. Another characteristic of mobile computing is
that the cost of communication is asymmetric between the mobile host and
the stationary host. Since radio modem transmission normally requires about
10 times as much power as the reception operation, power can be saved by
substituting reception operations for transmissions. For example, a
mobile support station (MSS) might periodically broadcast information that
otherwise would have to be explicitly requested by the mobile host. This
way, mobile computers can obtain this information without wasting power to
transmit a request.
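The saving can be illustrated with a back-of-the-envelope sketch based on the 10:1 power ratio mentioned above (the unit costs and query count are arbitrary):

    # Sketch: transmission is assumed to cost ~10x the energy of reception.
    TX_COST, RX_COST = 10, 1

    def request_response(n_queries):
        # Each query: the mobile host transmits a request and receives a reply.
        return n_queries * (TX_COST + RX_COST)

    def periodic_broadcast(n_queries):
        # The MSS broadcasts; the mobile host only listens, never transmits.
        return n_queries * RX_COST

    n = 100
    print(request_response(n), "vs", periodic_broadcast(n))  # 1100 vs 100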
Mobile computing is also characterized by frequent disconnections and
the possible dozing of mobile computers. The main distinction between a
disconnection and a failure is its elective nature. In traditional distributed
systems, the loss of connectivity is considered to be a failure and leads to
network partitioning and other emergency procedures. Disconnections in mo-
bile computing, on the other hand, should be treated as planned activities,
which can be anticipated and prepared for. There may be various degrees
of disconnection ranging from a complete disconnection to a partial or weak
disconnection, e.g., a terminal is weakly connected to the rest of the network
via a low bandwidth radio channel. Disconnections may be due to the costs
involved, as it is expensive to maintain an idle wireless communication link,
or there may simply be no networking capabilities at the current location. In
addition, for some technologies, such as cellular modems, there is a high
start-up charge for each communication session [BBI+93,SKM+93].
Moreover, the increasing scale of distributed systems
will result in more frequent disconnections. Disconnections are undesirable
because they may impede computation.
Security and privacy is another major concern in mobile computing. Since
mobile computers appear and disappear on various networks, prevention of
impersonation of one machine by another is problematic. When a mobile com-
puter is taken away from its local environment, the data it sends and receives
are subject to possible theft and unauthorized copying. A network that al-
lows visiting mobile computers to connect cannot perform the type of packet
filtering now used as a security mechanism, since certain foreign packets will
be legitimate packets destined for the visiting mobile host. The administrator
of the foreign environment has security concerns as well. These concerns are
much greater than in the current mode of mobile computing, in which a user in a
foreign environment is logged into a local guest account from which the user
may have a communication session (e.g., telnet protocol) to his/her home en-
vironment. In the nomadic computing paradigm, a guest machine may harm
its host/server - either accidentally or maliciously [Aso94]. The possibility of
such harm is much greater than that likely caused by the typical user of a
guest account on a fixed network.
Another major issue is establishing a connection when a mobile host has
no prior knowledge about the targeted network [NSZ97]. The point of entry
in a network is through the physical medium or interface to the access point.
The choices of physical medium include radio, infrared, wire/coaxial cable
and optical means. Furthermore, a mobile host needs to communicate using
one of the host network's protocols for meaningful exchange of information
to occur. In addition, networks may have established security schemes. In
order to join the targeted network, information about the "code of behavior"
is normally provided to the incoming member of the community. This ar-
rangement, characteristic of legacy computing systems, works well in a static
environment. This approach does not apply to mobile hosts, which migrate
within and across networks. It is important to note that the complexity of
connectivity depends on the variety of choices presented to the node. For
example at the signal level, there are several choices regarding the medium,
access method and encoding. Also, once a protocol is known, there are several
ways it can be used by the upper layers. To start communicating with a
network, a mobile host needs to "speak the same language" as the targeted
network. The situation can be likened to visiting an unknown country where
one has no prior knowledge of the language, customs, or behavior but some-
how hopes to communicate and ask for directions, food or any other services.
Such a paradigm can be called "the ET (extraterrestrial) effect" [NSZ97]. A
mobile computer that intends to establish a connection in a foreign computer
network is viewed as an outsider and may have no prior knowledge of how
to instigate communications. This situation will arise over and over again
as people demand computing anywhere, without geographic barriers, as
partially achieved in GSM technology.
Wireless data networks are a natural extension and enhancement to exist-
ing wireline computer networks and services. Wireless data networks support
mobile users who may require remote access to their base computer networks.
Wireless data services and systems represent a rapidly growing and increas-
ingly important segment of the telecommunications industry. It is easy to
notice that current computer applications follow the rapid advancements in
the telecommunications industry. Eventually, information systems will be in-
fluenced by the rapid evolution of the wireless segment of this industry. Since
mobility affects many assumptions upon which today's distributed systems
are based, such systems will have to move to where tomorrow's technology
can support them. Wireless data technology is foreseen to be a main infras-
tructure platform for future applications, which are naturally distributed,
dynamic and require much flexibility and mobility. In mobile computing sys-
tems, the underlying network infrastructure is somewhat different from tra-
ditional distributed systems. Designers of mobile information systems have
much less control over wireless networks, since not only is the communication
medium provided by telecommunications providers, but base stations and
servers are also part of a proprietary wireless network. For example, the location
of base stations is considered commercial information and is unavailable to
application developers.
The fixed hosts and the communication links between them constitute the static or fixed
network, and can be considered to be the reliable part of the infrastructure.
Thus, the general architecture for the network with mobile hosts is a two
tier structure consisting of a potentially high-performance and reliable fixed
network with mobile support stations and a large number of mobile hosts,
which are roaming within and across multiple heterogeneous networks and
are connected by slow and often unreliable wireless links.
[Figure: a two tier architecture in which mobile hosts connect to home base
nodes of the fixed/wireline network via wireless LAN (e.g., Aironet, WaveLAN,
Xircom), cellular data networks (e.g., CDPD, DataTac), or GSM connections.]
Mobile hosts may move out of coverage or enter areas of high interference. Unlike typical wired networks, the num-
ber of devices in a wireless cell varies dynamically and large concentrations
of mobile users, say, at conventions, hotels and public events, may overload
network capacity.
Circuit switched networks can be used for the transfer of large volumes of data (> 20 Kbytes) and for short
connect time batch transactions. A major disadvantage of circuit switched
networks is the high cost when connecting for a long time or on a regular
basis.
In terms of cost-efficiency, packet switched networks offer an alternative to
circuit switching for data transmission. The bursty nature of data traffic leads
to an inefficient utilization of the pre-allocated bandwidth under wireless
circuit switching technology. Wireless packet switching, on the other hand,
allocates transmission bandwidth dynamically, hence allowing an efficient
sharing of the transmission bandwidth among many active users. Although
it is possible to send wireless data streams over dedicated channels by using
circuit-switched cellular networks, such methods are too expensive for most
types of data communications [Inc95]. Packet data networks are well suited
for short data transmissions where the overhead of setting up a circuit is not
warranted for the transmission of data bursts lasting only seconds or less.
In packet switching, data is sent in limited-size blocks called packets. The
information at the sending end is divided into a number of packets and
transmitted over the network to the destination, where it is reassembled into
its intended representation. The data is broken into packets of a certain size,
for example, 240 or 512 bytes. Each packet includes the origin and destination
address, allowing multiple users to share a single channel or transmission
path. Packet switches use this information to send the packet to the next
appropriate transmission link. The actual route is not specified and does
not matter; it can change in the middle of the process to accommodate a
varying network load. Main advantages of packet-switched networks include
[Inc95,Inc96]:
• efficient sharing of the transmission bandwidth among many active users;
• charges based on the amount of data transmitted rather than on connection time;
• suitability for short, bursty transmissions that do not warrant the overhead of setting up a circuit.
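A sketch of the packetizing step (the packet format is an assumption; real networks add further header fields) splits a payload into fixed-size blocks carrying origin, destination and a sequence number, from which the receiver restores the original message:

    # Sketch of packetizing a message into fixed-size addressed blocks.
    PACKET_SIZE = 240   # bytes per payload block, as in the example above

    def packetize(payload: bytes, src: str, dst: str):
        packets = []
        for seq, i in enumerate(range(0, len(payload), PACKET_SIZE)):
            packets.append({
                "src": src, "dst": dst, "seq": seq,
                "data": payload[i:i + PACKET_SIZE],
            })
        return packets

    def reassemble(packets):
        # The receiver restores the intended representation via sequence numbers.
        return b"".join(p["data"] for p in sorted(packets, key=lambda p: p["seq"]))

    msg = b"x" * 1000
    pkts = packetize(msg, src="mobile-1", dst="server-9")
    assert reassemble(pkts) == msg
    print(len(pkts), "packets")  # 5 packets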
Public wireless data networks are provided to the public by service providers
that offer telecommunications services in general. Private networks, used by
fleet operators and support services such as emergency services, also use these
types of networks. These networks use the existing infrastructure of base
stations, network control centers, and switches to transmit data. Enterprise
systems and third-party service providers can connect host data systems to
the wireless networks via wire line communications.
Public packet-switched wireless data networks are more economical to
operate than similar circuit-switched networks. They allow many devices to
share a small number of communication channels. Charges are based on the
amount of data transmitted, not on the connection time. Transmission speeds
vary from 4800 bps to 19.2 Kbps. However, the actual transmission time and
throughput is determined by the network load and overhead and cannot be
precisely specified. Two widely used packet data networks worldwide are
Motorola's DataTac [Inc95] and Ericsson's Mobitex [Inc96].
Cellular digital packet data (CDPD) is another packet-based technology
that transmits data packets over existing analogue cellular networks. It is
ideally suited for established voice cellular analogue network operators who
wish to add wireless data to their existing services. CDPD has the same
in-building coverage as the current voice cellular analogue networks. CDPD
transmits over channels not in use for voice calls, making efficient use of
capacity that would otherwise be wasted. It always relinquishes a channel
when needed for voice. Packet-switched communication is optimized for the
burst-like transmission of data, and the fact that many CDPD users share the
same channel optimizes the use of scarce radio frequencies. Packet-switched
network resources are only used when data is actually being sent or received.
Depending on the application, CDPD allows for as many as 1,000 users
per channel with a bit rate of 19,200 bps [Inc95].
Among circuit-switched networks there are two standards for digital
networks: Code Division Multiple Access (CDMA) and Time Division Multiple
Access (TDMA), which includes GSM [Bro95]. CDMA, International
Standard-95 (IS-95), was adopted as a standard in 1992. A CDMA system is
a spread spectrum system: the total occupied RF bandwidth is much larger
than the information signal. All users share the same range of radio spectrum,
and different digital code sequences are used to differentiate between
subscriber conversations. Ericsson, the leading TDMA producer, claims that
the CDMA technology is too complex and still years from being ready for
commercial use [Bro95]. PDC is the Japanese digital standard based on
TDMA and is mainly used in Japan. In a TDMA system, a portion of the
frequency spectrum is designated as a carrier and then divided into time
slots. One conversation at a time is assigned a time slot (channel). The
channel is occupied until the call is finished or handed over by the system to
another channel.
Satellite systems form a further class of wireless networks:
• LEO satellite systems (Low Earth Orbit) orbit the earth at 500-1000 km,
completing an orbit in 90-100 minutes; a constellation of 66 satellites served
by 200 ground stations was proposed by Motorola's Iridium project, which
was recently discontinued due to economic efficiency considerations.
• GEO satellites (Geostationary) orbit the earth at 36,000 km and take
23 hours and 56 minutes to complete an orbit; three GEO satellites are
required for global coverage. This technology uses car-mounted or pluggable
handsets rather than genuine lightweight portables.
• MEO satellite systems (Medium Earth Orbit) orbit the earth at 10,400 km;
a system of 12 satellites has been proposed by ICO Global Communications
(formerly Inmarsat-P). MEO will be capable of 45,000 simultaneous calls and
is planned to be in full operation by 2000. A further 12 satellite system is
planned to be launched by TRW in the US.
There are three main standards in the cordless technology: Digital Euro-
pean Cordless Telephony (DECT), Telepoint (or CT-2), and Personal Handy
Phone System (PHS) [Rap96]. Cordless telecommunications systems are
suited mostly to high density business environments. They are of central
importance to suppliers in two markets: PABX suppliers gain a lucrative
next-generation market, and LAN suppliers can open a new market for
wireless LANs. DECT is suitable for large installations and CT-2 for smaller
operations. Telepoint (or CT-2) was pioneered in the UK in 1989-1990; it
was further developed by the Hong Kong, France and Canada telecoms,
which produced successful CT-2 service systems.
a medical application cannot tolerate losing the detail of an image used for
medical diagnosis.
An issue germane to system design is where support for mobility and
adaptivity should be placed: should applications be aware of their environ-
ment? Strategies range between two extremes [Sat96a,NPS95]. At one ex-
treme, adaptivity is solely the responsibility of the underlying system and is
performed transparently to applications. In this case, existing applications
continue to work unchanged. However, since there is no single best way to
serve applications with diverse needs, this approach may be inadequate or
may even make performance worse than providing no support for adaptivity
at all. For example, consider the following application-transparent way to pro-
vide operation during disconnection. Before disconnection, the most recently
used files are preloaded into the mobile host's cache. Upon re-connection, file
and directory updates are automatically integrated at the server and any
conflicting operations are aborted. This method performs poorly if the appli-
cation does not exhibit any temporal locality in file accesses or if most conflicts
are semantically acceptable and can be effectively resolved, for example, in a
calendar application by reconciling conflicting entries. Often, completely hid-
ing mobility from applications is not attainable. For instance, during periods
of long disconnections, applications may be unable to access critical data.
At the other extreme, adaptation is left entirely to individual applica-
tions. No support is provided by the operating system. This approach lacks
a focal point to resolve the potentially incompatible resource demands of dif-
ferent applications or to enforce limits on the usage of resources. In addition,
applications must be written anew, and writing such applications is very
complicated.
Application-aware [SNK+95] support for mobility lies in between han-
dling adaptivity solely by applications and solely by the operating system. In
this approach, the operating system co-operates with the application in vari-
ous ways. Support for application awareness places additional requirements on
mobile systems [ARS97]. First, a mechanism is required to monitor the level
and quality of resources and inform applications about any relevant changes
in their environment. Then, applications must be agile [NSN+97,ARS97],
that is, able to receive events in an asynchronous manner and react appro-
priately. Finally, there is a need for a central point for managing resources
and authorizing any application-initiated request for their use. Environmental
changes include changes of the location of the mobile unit and the availability
of resources such as bandwidth, memory or battery power.
Informing the application of a location update or a change in the availabil-
ity of a resource involves addressing a number of issues. To name just a few:
how does the system monitor the environment, which environmental changes
are detectable by the system and which by the application, how and when are
any changes detected by the system conveyed to the application. In [WB97],
changes in the environment are modeled as asynchronous events which are
delivered to the application. Events may be detected either within the kernel
or at the user-level. The detection of an event is decoupled from its delivery so
that only relevant events are delivered. In Odyssey [NS95,SNK+95,NPS95],
the application negotiates and registers a window of tolerance with the system
for a particular resource. If availability of that resource rises above or falls
below the limits set in the tolerance window, Odyssey notifies the applica-
tion. Once notified, it is the application's responsibility to adapt its behavior
accordingly.
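As a concrete illustration of this negotiation style, the following Python
sketch registers a tolerance window with a hypothetical resource monitor;
the class and method names are invented for the example and are not
Odyssey's actual interface.

class ResourceMonitor:
    """Hypothetical monitor in the spirit of Odyssey's tolerance windows."""
    def __init__(self):
        self.registrations = []          # (resource, low, high, callback)

    def register(self, resource, low, high, callback):
        # The application negotiates a window of tolerance [low, high]
        # for a particular resource.
        self.registrations.append((resource, low, high, callback))

    def report(self, resource, level):
        # Invoked by the system whenever it measures a new resource level.
        for res, low, high, callback in self.registrations:
            if res == resource and not (low <= level <= high):
                callback(resource, level)    # availability left the window

def on_bandwidth_change(resource, level):
    # Once notified, adapting is the application's own responsibility,
    # e.g., switching to more aggressive compression.
    print(resource, "changed to", level, "kbps: lowering fidelity")

monitor = ResourceMonitor()
monitor.register("bandwidth", low=32, high=10000, callback=on_bandwidth_change)
monitor.report("bandwidth", 9.6)    # below the window, so the callback fires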
Nevertheless, handling mobility spans multiple levels. Take for example
a mobile application that is built on top of a database management sys-
tem that is in turn built on top of an operating system that uses a specific
communication mechanism. At what level should mobility be handled?
[Figure: client-server based models over the wireless link; (a) shows the
application client on the mobile host interacting with the application server
on the fixed network, (b) adds an agent on the fixed network between the
client and the server]
The client's presence on the fixed network can be maintained via the agent. Furthermore,
agents split the interaction between mobile clients and fixed servers in two
parts, one between the client and the agent, and one between the agent and
the server. Different protocols can be used for each part of the interaction
and each part of the interaction may be executed independently of the other.
In both its surrogate and service-specific roles, an agent may undertake
various functions. Agent functionality includes support for messaging
and queuing for communication between the mobile client and the server. The
agent can use various optimizations for weak connectivity. It can manipulate
the data prior to their transmission to the client [ZD97,FGB+96,TSS+96],
by changing their transmission order so that the most important information
is transferred first, by performing data specific lossy compression that tailors
content to the specific constraints of the client, or by batching together mul-
tiple replies. The agent can also assume a more active role [Ora97,TSS+96],
for instance, it can notify the client appropriately, when application-specific
predefined events occur. To reduce the computation burden on the mobile
client, the agent might be made responsible for starting/stopping specific
functions at the mobile unit or for executing client specific services. For ex-
ample, a complex client request can be managed by the agent with only the
final result transmitted to the client. Located on the fixed network, the agent
has access to high bandwidth links and large computational resources, that
it can use for its client's benefit.
To deal with disconnections, a mobile client can submit its requests to
the agent and wait to retrieve the results when connection is re-established.
In the meantime, any requests to the disconnected client can be queued at
the agent to be transferred upon re-connection. The agent can be used in a
similar way to preserve battery life.
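A minimal sketch of this queuing behavior follows, assuming a hypothetical
agent object on the fixed network; the class and its methods are invented
for illustration.

from collections import deque

class Agent:
    """Sketch of an agent queuing traffic for a disconnected mobile client."""
    def __init__(self):
        self.connected = False
        self.pending_results = deque()   # replies waiting for re-connection

    def submit(self, request, server):
        # The client can submit a request and disconnect immediately;
        # the agent interacts with the server on the client's behalf.
        result = server(request)
        if self.connected:
            return result
        self.pending_results.append(result)   # hold until re-connection
        return None

    def reconnect(self):
        # Upon re-connection, drain everything queued while disconnected.
        self.connected = True
        while self.pending_results:
            yield self.pending_results.popleft()

agent = Agent()
agent.submit("SELECT ...", server=lambda q: "rows for " + q)  # client offline
for result in agent.reconnect():                              # later, online
    print(result)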
The exact position of the agent at the fixed network depends on its role.
Placing the agent at the fringe of the fixed network, i.e., at the base station,
has some advantages especially when the agent acts as the surrogate of the
mobile hosts under its coverage [ZD97,BBI+93]: it is easier to gather infor-
mation about the wireless link characteristics; a special link-level protocol can
be used between the mobile host and the agent; and personalized information
about the mobile hosts is available locally. On the other hand, the agent may
need to move along with its mobile host, or the current base station may not
be trustworthy. In the case of service-specific agents, it makes sense to place
them either closer to the majority of their clients or closer to the server.
To accommodate the change in the system configuration induced by client
mobility, there may be a need to move the agents within the fixed network. Again,
relocating the agent depends on the role of the agent. If the agent is service-
specific, a client's request for this service and the associated server's reply is
transmitted through the agent. Moving the agent closer to the client does not
necessarily reduce communication since it may increase the cost of the agent's
interaction with the server especially when the agent serves multiple clients.
When the agent acts as a surrogate of a client, any message to and from the
client passes through the client's agent. In this case, moving the agent along
with the client seems justifiable. Additional support is now needed to manage
information regarding the location of the mobile surrogate. A mobile motion
prediction algorithm to predict the future location of a mobile user according
to the user's movement history is proposed in [LMJ96]. A new proxy is then
pre-assigned at the new location before the mobile user moves in.
While the client/agent/server model offers a number of advantages, it
fails to sustain the current computation at the mobile client during periods
of disconnection. Furthermore, although the server notices no changes, the
model still requires changes to the client code to implement the client/agent
interaction, rendering the execution and maintenance of legacy applications
problematic. Finally, the agent can directly optimize only data
transmission over the wireless link from the fixed network to the mobile client
and not vice versa.
Fig. 3.2. Client-server based models: (a) pair of agents model, (b) peer-to-peer
model [figure omitted]
3.4 Taxonomy
The agents, placed between the mobile client and the fixed server, alleviate
both the constraints of the wireless link, by performing various communica-
tion optimizations, and any resource constraints at the client, by undertaking
part of the client's functionality.
4 Disconnected Operation
4.1 Overview
Disconnections can be categorized in various ways. First, disconnections may
be voluntary, e.g., when the user deliberately avoids network access to re-
duce cost, power consumption, or bandwidth use, or forced, e.g., when the
portable enters a region where there is no network coverage. Then, discon-
nections may be predictable or sudden. For example, voluntary disconnections
are predictable. Other predictable disconnections include those that can be
detected by changes in the signal strength, by predicting the battery lifetime,
or by utilizing knowledge of the bandwidth distribution. Finally, disconnec-
tions can be categorized based on their duration. Very short disconnections,
such as those resulting from handoffs, can be masked by the hardware or
low-level software. Other disconnections may either be handled at various
levels, e.g., by the file system or an application, or may be made visible to
the user. Since disconnections are very common, supporting disconnected op-
eration, that is allowing the mobile unit to operate even when disconnected,
is a central design goal in mobile computing.
The idea underlying the support for disconnected operation is simple.
When a network disconnection is anticipated, data items and computation
are moved to the mobile client to allow its autonomous operation during dis-
connection. Preloading data to survive a forthcoming disconnection is called
hoarding. Disconnected operation can be described as a transition between
three states [KS92] (Figure 4.1).
Data hoarding. Prior to disconnection, the mobile host is in the data hoard-
ing state. In this state, data items are preloaded into the mobile unit. The
items may be simply relocated from the fixed host to the mobile unit. How-
ever, by doing so, these data items become inaccessible to other sites. Alter-
natively, data items may be replicated or cached at the mobile unit. The type
of data objects transferred to a mobile host depends on the application and
the underlying data model. For instance, in the case of file systems, the data
Issue           Approach
What to log     data values / timestamps / operations
Table 4.1 [PS98] summarizes some of the issues regarding each of the three
states. The complexity of operation in each state depends on the type of the
distributed system and the dependencies among the data operated on. In the
following, we will discuss disconnected operation in distributed file systems
and database management systems.
Most proposals for file system support for disconnected operation are based
on extending cache management to take into account disconnections. Files
are preloaded at the mobile client's cache to be used during disconnection.
Caching to support disconnected operation is different from caching during
normal operation in many respects. First, cache misses cannot be served.
Then, updates at a disconnected client cannot be immediately propagated to
its server. Similarly, a server cannot notify a disconnected client of updates
at other clients. Thus, any updates must be integrated upon re-connection.
Data hoarding. Hoarding is the process of preloading data into the cache
in anticipation of a disconnection, so that the client can continue its op-
eration while disconnected. Hoarding is similar to prefetching used in file
and database systems to improve performance. However, there are impor-
tant differences between hoarding and prefetching. Prefetching is an ongoing
process that transfers to the cache soon-to-be-needed files during periods of
low network traffic. Since prefetching is continuously performed, in contrast
to hoarding, keeping its overhead low is important. Furthermore, hoarding
is more critical than prefetching, since during disconnections, a cache miss
cannot be serviced. Thus, hoarding tends to overestimate the client's need
for data. On the other hand, since the cache at the mobile client is a scarce
resource, overly generous estimates cannot be satisfied. An important parameter
is the unit of hoarding, ranging from a disk block, to a file, to groups of files or
directories. Another issue is when to initiate hoarding. The Coda file system
[KS92] runs a process called hoard walk periodically to ensure that critical
files are in the mobile user's cache.
The decision on which files to cache can be either (a) assisted by instruc-
tions explicitly given by the user, or (b) taken automatically by the system by
utilizing implicit information, which is most often based on the past history
of file references. Coda [KS92] combines both approaches in deciding which
data to hoard. Data are prefetched using priorities based on a combination
of recent reference history and user-defined hoard files. A tree-based method
is suggested in [TLA+95] that processes the history of file references to build
an execution tree. The nodes of the tree represent the programs and data files
referenced. An edge exists from parent node A to child node B, when either
program A calls program B, or program A uses file B. A GUI is used to assist
the user in deploying this tracing facility to determine which files to hoard.
Besides clarity of presentation to users, the advantage of this approach is that
it helps differentiate between the files accessed during multiple executions of
the same program. Seer [Kue94] is a predictive caching scheme based on the
user's past behavior. Files are automatically prefetched based on a measure
called semantic distance, which quantifies how closely related files are. The
measure chosen is the local reference distance from a file A to a file B.
In the case of file systems, the only conflicts detected are write/write
conflicts because they produce divergent copies. Read/write conflicts are not
considered. Such conflicts occur, for instance, when the value of a file read
by a disconnected client is not the most recent one, because the file has been
updated at the server after the client's disconnection. Extensions to provide
such semantics are discussed in the following section.
Data hoarding. There are many problems that remain open regarding
hoarding in databases. First, what is the granularity of hoarding. The gran-
ularity of hoarding in relational database systems can range from tuples, to
set of tuples, to whole relations. Analogously, in object-oriented database
systems, the granularity can be at the object, set of objects or class (ex-
tension) level. A logical approach would be to hoard by issuing queries; i.e.,
by prefetching the data objects that constitute the answer to a given query.
This, in a sense, corresponds to loading materialized views on the mobile unit.
Then, operation during disconnection is supported by posing queries against
these views.
Another issue is how to decide which data to hoard. In terms of views,
this translates to: how to identify the views to materialize, or how to specify
the hoarding queries that define the views. Then, users may explicitly specify
their preferences by issuing hoarding queries. Alternatively, the users' past
behavior may be used by the system as an indication of the users' future
needs. In such a case, the system automatically hoards the set of most com-
monly used or last referenced items along with items related to the set. Using
the history of past references to deduce dependencies among database items is
harder than identifying dependencies among files. Furthermore, issues related
to integrity and completeness must also be taken into account.
To decide which data to hoard, [GKL+94] proposes (a) allowing users to
assist hoarding by specifying their preferences using an object-oriented query
to describe hoarding profiles, and (b) maintaining a history of references by
using a tracing tool that records queries as well as objects. To efficiently
handle hoarding queries from mobile clients, [BP97] proposes an extended
database organization. Under the proposed organization, the database de-
signer can specify a set of hoard keys along with the primary and secondary
key for each relation. Hoard keys are supposed to capture typical access pat-
terns of mobile clients. Each hoard key partitions the relation into a set of
disjoint logical horizontal fragments. Hoard fragments constitute the hoard
granularity, i.e., clients can hoard and reintegrate within the scope of these
fragments.
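As a rough illustration of hoard-key fragmentation, the sketch below parti-
tions a relation into disjoint horizontal fragments by a chosen hoard key; the
relation and the key are invented for the example.

from collections import defaultdict

def fragment(relation, hoard_key):
    """Partition a relation into disjoint horizontal fragments by hoard key."""
    fragments = defaultdict(list)
    for row in relation:
        fragments[row[hoard_key]].append(row)
    return dict(fragments)

# Hypothetical relation: orders accessed by mobile sales staff per region.
orders = [
    {"order_id": 1, "region": "north", "amount": 120},
    {"order_id": 2, "region": "south", "amount": 80},
    {"order_id": 3, "region": "north", "amount": 45},
]

# A client hoards (and later reintegrates) whole fragments, e.g. region "north".
hoarded = fragment(orders, hoard_key="region")["north"]
print(hoarded)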
5 Weak Connectivity
Weak connectivity is the connectivity provided by slow or expensive networks.
In addition, in such networks connectivity is often lost for short periods of
time. Weak connectivity sets various limitations that are not present when
connectivity is normal and thus instigates revisions of various system proto-
cols. An additional characteristic of weak connectivity in mobile computing
is its variation in strength. Connectivity in mobile computing varies in cost,
provided bandwidth and reliability. Many proposals for handling weak con-
nectivity take this characteristic into consideration and provide support for
operation that adapts to the current degree of connectivity. In such systems,
disconnected operation is simply the form of operation in the extreme case
of a total lack of connectivity. The aim of most proposals for weak connectivity
is prudent use of bandwidth. Often, fidelity is traded off for a reduction in
communication cost.
When connectivity is intermittent, the client cannot rely on the server
sending invalidation notifications. Thus, upon re-connection, the client must validate
its cache against the server's data. Cache invalidation may impose substan-
tial overheads on slow networks. To remedy this problem, [MS94] suggests
increasing the granularity at which cache coherence is maintained. In par-
ticular, each server maintains version stamps for volumes, i.e., sets of files,
in addition to stamps on individual objects. When an object is updated, the
server increments the version stamp of the object and that of its containing
volume. Upon reintegration, the client presents volume stamps for validation.
If a volume stamp is still valid, so is every object cached from that volume.
So, in this case there is no need to check the validity of each file individually.
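The following sketch illustrates volume-level validation in the spirit of
[MS94]; the data structures are simplified assumptions, not the system's
actual ones.

def validate_cache(cached_volumes, server_volumes):
    """Return the files whose validity must still be checked individually.

    cached_volumes maps a volume id to (version stamp, cached files);
    server_volumes maps a volume id to its current version stamp.
    """
    suspect_files = []
    for vol_id, (stamp, files) in cached_volumes.items():
        if server_volumes[vol_id] == stamp:
            continue                 # stamp valid: every cached file is valid
        suspect_files.extend(files)  # stamp changed: check file by file
    return suspect_files

cached = {"vol1": (7, ["a.txt", "b.txt"]), "vol2": (3, ["c.txt"])}
server = {"vol1": 7, "vol2": 5}          # vol2 was updated while disconnected
print(validate_cache(cached, server))    # ['c.txt']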
Case studies. In the Coda [MES95] file system, cache misses are serviced
selectively. In particular, a file is fetched only if the service time for the
cache miss, which depends among other factors on the bandwidth, is below
the user's patience threshold for this file, i.e., the time the user is willing to
wait for getting the file. Reintegration of updates to the servers is done through trickle
reintegration. Trickle reintegration is an ongoing background process that
propagates updates to servers asynchronously. To maintain the benefits of
log optimization while ensuring a reasonably prompt update propagation,
a technique called aging is used. A record is not eligible for reintegration
until it spends a minimal amount of time, called aging window, in the log.
Transferring the replay log in one chunk may saturate a slow network for an
extended period. To avoid this problem, the reintegration chunk size is made
adaptive, thus bounding the duration of communication degradation. If a file
is very large, it is transferred as a series of fragments, each smaller than the
currently acceptable chunk size.
In the Little Work project [HH95b], update propagation is performed in
the background. To avoid interference of the replay traffic with other network
traffic, the priority queuing in the network driver is augmented. Three levels of
queuing are used: interactive traffic, other network traffic, and replay traffic.
A number of tickets are assigned to each queue according to the level of
service deemed appropriate. When it is time to transmit a packet, a drawing
is held. The packet in the queue holding the winning ticket is transmitted.
File updates at the servers are propagated to the client immediately through
callbacks. Thus a client opening a file is guaranteed to see the most recently
stored data. Directory updates are tricky to handle, thus only the locally
updated directory is used by mobile clients. Cache misses are always serviced.
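A minimal sketch of such ticket-based queuing follows; the queue contents
and ticket counts are invented for illustration.

import random

# Ticket counts reflect the level of service deemed appropriate per queue.
queues = {
    "interactive": {"tickets": 6, "packets": ["key press", "echo"]},
    "other":       {"tickets": 3, "packets": ["mail fetch"]},
    "replay":      {"tickets": 1, "packets": ["log record 1", "log record 2"]},
}

def transmit_next():
    """Hold a drawing among the non-empty queues and send from the winner."""
    candidates = [(name, q["tickets"]) for name, q in queues.items() if q["packets"]]
    if not candidates:
        return None
    names, tickets = zip(*candidates)
    winner = random.choices(names, weights=tickets)[0]
    return queues[winner]["packets"].pop(0)

packet = transmit_next()
while packet is not None:
    print("sending:", packet)
    packet = transmit_next()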
In the variable-consistency approach [TD91,TD92], a client/server archi-
tecture with replicated servers that follow a primary-secondary schema is
used mainly to avoid global communication, but it also works well with weak
connectivity. The client communicates with the primary server only. The pri-
mary makes periodic pickups from the clients it is servicing and propagates
updates back to the secondaries asynchronously. Once some number N of
Overview. The mobile host can play many roles in a distributed database
setting. For example, it may simply submit operations to be executed on a
server or an agent at the fixed network [JBE95,YZ94,DHB97,M097,BMM98].
In this case, it may either submit the operations of a transaction to the fixed
server one at a time, sequentially, or submit the whole transaction as one
atomic unit [JBE95]. In [YZ94], the second approach is taken. Each mobile client submits
a transaction to a coordinating agent. Once the transaction has been submit-
ted, the coordinating agent schedules and coordinates its execution on behalf
of the mobile client. A different approach to the role of the mobile host is to
allow local database processing at the mobile host. Such an approach is nec-
essary to allow autonomous operation during disconnection but complicates
data management and may cause unacceptable communication overheads.
Concurrency control in the case of distributed transactions that involve
both mobile and fixed hosts is complicated. For transactions that access data
at both mobile and stationary hosts, accessing the wireless link imposes large
overheads. Take, for instance, the case of a pessimistic concurrency control
protocol that requires transactions to acquire locks at multiple sites. In this
case, transactions may block if they request locks at sites that get discon-
nected or if they request locks held by transactions at disconnected sites. On
the other hand, techniques such as timestamps may lead to a large number
of transactions being aborted because operations may be overly delayed in
slow networks.
To avoid delays imposed by the deployment of slow wireless links, open-
nested transaction models are more appropriate [Chr93]. According to these
models, a mobile transaction that involves both stationary and mobile hosts
is not treated as one atomic unit but rather as a set of relatively independent
component transactions some of which run solely at the mobile host. Compo-
nent transactions can commit without waiting for the commitment of other
component transactions. In particular, as in the disconnected case, transac-
tions that run solely at the mobile host are only tentatively committed at
the mobile host and their results are visible to subsequent local transactions.
These transactions are certified at the fixed hosts, i.e., checked for correctness,
at a later time. Fixed hosts can broadcast to mobile hosts information about
other committed transactions prior to the certification event, as suggested
in [Bar97]. This information can be used to reduce the number of aborted
transactions.
Case studies. Transactions that run solely at the mobile host are called
weak in [PB95b,Pit96,PB99], while the rest are called strict. A distinction
is drawn between weak copies and strict copies. In contrast to strict copies,
weak copies are only tentatively committed and hold possibly obsolete values.
Weak transactions update weak copies, while strict transactions access strict
copies. Weak copies are integrated with strict copies either when connectiv-
ity improves or when an application-defined limit to the allowable deviation
among weak and strict copies is passed. Before reconciliation, the result of
a weak transaction is visible only to weak transactions at the same site.
Applications at weakly connected sites may choose to issue strict transactions
when they require strict consistency. Strict transactions are slower than weak
transactions since they involve the wireless link but guarantee permanence
of updates and currency of reads. During disconnection, applications can use
only weak transactions. In this case, weak transactions have semantics similar
to second-class IOTs [LS95] and tentative transactions [GHN+96]. Adapt-
ability is achieved by adjusting the number of strict transactions and the
degree of divergence among copies based on the current connectivity.
The approach taken in Bayou [TDP+94,TTP+95,DPS+94] does not sup-
port transactions. Bayou is built on a peer-to-peer architecture with a number
of replicated servers weakly connected to each other. In this schema, a user
application can read-any and write-any available copy. Writes are propagated
to the server. Clients can also use the backchannel to directly request time-
critical data. The backchannel is used in [AFZ97] along with caching at the
clients to allow clients to pull pages that are not available in their local cache
and are expected to appear in the broadcast after a threshold number of
items.
One approach in hybrid delivery is, instead of broadcasting all data items
in the database, to broadcast an appropriately selected subset of the items
and provide the rest on demand. Determining which subset of the database
to broadcast is a complicated task since the decision depends on many factors
including the clients' access patterns and the server's capacity to service re-
quests. Broadcasting the most popular data is the approach taken in [SRB97J,
where the broadcast medium is used as an air-cache for storing frequently
requested data. A technique is presented that continuously adjusts the broad-
cast content to match the hot-spot of the database. The hot-spot is calculated
by observing the broadcast misses indicated by explicit requests for data not
on the broadcast. These requests provide the server with tangible statistics on
the actual data demand. Partitioning the database into two groups, a "publi-
cation group" that is broadcast and an "on-demand" group, is also suggested
in [IV94]. The same medium is used for both the broadcast channel and the
backchannel. In this approach, the criterion for partitioning the database is
minimizing the backchannel requests while keeping the response time below
a predefined upper limit.
Another approach is to broadcast pages on demand. In this approach, the
server chooses the next item to be broadcast on every broadcast tick based
on the requests for data it has received. Various strategies have been stud-
ied [Won88], such as broadcasting the pages in the order they are requested
(FCFS), or broadcasting the page with the maximum number of pending
requests. A parameterized algorithm for large-scale data broadcast that is
based only on the current queue of pending requests is proposed in [AF98].
Mobility of users is also critical in determining the set of broadcast items.
Cells may differ in their type of communication infrastructure and thus in
their capacity to service requests. Furthermore, as users move between cells,
the distribution of requests for specific data at each cell changes. Two vari-
ations of an adaptive algorithm that takes into account mobility of users
between cells of a cellular architecture are proposed in [DCK+97]. The algo-
rithms statistically select the data to be broadcast based both on user profiles
and on registration in each cell.
to be in the active mode and consumes power. The broadcast data should be
organized so that the access and tuning time are minimized.
The simplest way to organize the transmission of broadcast data is a flat
organization. In a flat organization, given an indication of the data items
desired by each client listening to the broadcast, the server simply takes the
union of the required items and broadcasts the resulting set cyclically. More
sophisticated organizations include broadcast disks and indexing.
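A flat organization is easy to sketch: the server takes the union of the items
its clients indicated and cycles over it. The client demand sets below are
invented for the example.

import itertools

def flat_broadcast(client_demands):
    """Cyclically broadcast the union of all items the clients indicated."""
    program = sorted(set().union(*client_demands))  # the flat broadcast set
    return itertools.cycle(program)                 # the server repeats it

channel = flat_broadcast([{"stock_A", "stock_B"}, {"stock_B", "weather"}])
for item in itertools.islice(channel, 6):           # two full broadcast cycles
    print(item)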
In many applications, the broadcast must accommodate changes. At least
three different types of changes are possible [AFZ95]. First, the content of the
broadcast can change in terms of including new items and removing existing
ones. Second, the organization of the broadcast data can be modified, for
instance by changing the order by which the items are broadcast or the
frequency of transmission of a specific item. Finally, if the broadcast data are
allowed to be updated, the values of data on the broadcast change.
is a linked list of pages that are most likely to be requested next by the client.
When a request for a page p is satisfied, the client enters a phase during which
it prefetches the D most likely referenced items associated with p, where D
is the cache size in pages. This phase terminates either when D pages are
prefetched or when the client submits a new request.
that the item has not changed. The set of bit sequences is organized in a
hierarchical structure.
A client may miss cache invalidation reports, because of disconnections
or doze mode operation. Synchronous methods surpass asynchronous ones in that
clients need only periodically tune in to read the invalidation report instead of
continuously listening to the channel. However, if the client remains inactive
longer than the period of the broadcast, the entire cache must be discarded,
unless special checking is deployed. In simple checking, the client sends the
identities of all cached objects along with their timestamps to the server for
validation. This requires a lot of uplink bandwidth as well as battery energy.
Alternatively, the client can send group identifiers and timestamps, and the
validity can be checked at the group level. This is similar to volume checking
in the Coda file system. Checking at the group level reduces the uplink re-
quirements. On the other hand, a single object update invalidates the whole
group. As a result the amount of cached items retained may significantly
reduce by discarding possibly valid items of the group. To remedy this situa-
tion, in GCORE [WYC96j, the server identifies for each group a hot update
set and excludes it from the group when checking the group's validity.
Books that focus on data management for mobile and wireless computing
include [PS98] and [IK95], the latter an edited collection of papers covering a
variety of aspects of mobile computing.
There are various extensive on-line bibliographies on mobile computing
that include links to numerous research projects, reports, commercial prod-
ucts and other mobile-related resources [Ali,Mob].
There is a major annual conference solely devoted to mobile computing,
the ACM/IEEE International Conference on Mobile Computing and Net-
working. Many database, operating systems, networking and theory confer-
ences have included mobile computing in their topics of interest, and several
Who                                                  What
DATAMAN                                              data management in mobile computing,
http://www.cs.rutgers.edu/dataman/                   distributed algorithms and services,
T. Imielinski, B. Badrinath                          data broadcasting, indirect protocols,
Rutgers University, NJ, U.S.A.                       data replication, wireless networks,
                                                     location management, software architectures

Distributed Multimedia Research Group                multimedia support for mobile computing,
http://www.comp.lancs.ac.uk/computing/research/mpg/  middleware for mobile computing,
Lancaster Univ., U.K., G. Blair, N. Davies           mobility-aware applications

Active Badge                                         location management, mobility-aware
http://www.cam.orl.co.uk/ab.html                     applications
Olivetti U.K.

DPMC (Distributed, Parallel & Mobile Computing)      data management for mobile computing,
http://www.ct.monash.edu.au/DPMC/                    wireless networks, interoperability,
School of Comp. Sci. and Software Eng.               mobile agents & objects,
Monash Univ., Australia                              caching, adaptive protocols

Wireless LAN Alliance (WLANA)                        wireless LANs, protocols, handoff,
http://www.wlana.com                                 IEEE 802.11
Major LAN vendors
8 Conclusions
References
[AAF+95] Acharya, S., Alonso, R., Franklin, M.J., Zdonik, S., Broadcast
disks: data management for asymmetric communications environ-
ments, Proc. ACM SIGMOD Intl. Conference on Management of
Data (SIGMOD 95), 1995, 199-210. Reprinted in T. Imielinski, H.
Korth (eds.), Mobile Computing, Kluwer Academic Publishers, 1996,
331-361.
[ABG90] Alonso, R., Barbara, D., Garcia-Molina, H., Data caching issues in
an information retrieval system, ACM Transactions on Database Sys-
tems 15(3), 1990, 359-384.
[AD93] Athan, A., Duchamp, D., Agent-mediated message passing for con-
strained environments, Proc. USENIX Symposium on Mobile and
Location-Independent Computing, Cambridge, Massachusetts, 1993,
103-107.
[AF98] Aksoy, D., Franklin, M.J., Scheduling for large-scale on-demand
data broadcasting, Proc. Conference on Computer Communications
(IEEE INFOCOM '98), 1998, 651-659.
[AFZ95] Acharya, S., Franklin, M.J., Zdonik, S., Dissemination-based data
delivery using broadcast disks, IEEE Personal Communications 2(6),
1995, 50-60.
[AFZ96a] Acharya, S., Franklin, M.J., Zdonik, S., Disseminating updates on
broadcast disks, Proc. 22nd International Conference on Very Large
Data Bases (VLDB 96), 1996, 354-365.
[AFZ96b] Acharya, S., Franklin, M.J., Zdonik, S., Prefetching from a broad-
cast disk, Proc. 12th International Conference on Data Engineering
(ICDE 96), 1996, 276-285.
[AFZ97J Acharya, S., Franklin, M., Zdonik, S., Balancing push and pull for
data broadcast, Proc. ACM Sigmod Conference, 1997, 183-194.
[Air] Air Media, AirMedia Live, www.airmedia.com.
[AK93] Alonso, R., Korth, H.F., Database system issues in nomadic com-
puting, Proc. 1993 ACM SIGMOD Conference, Washington, D.C., 1993,
388-392.
[Ali] Aline Baggio's bookmarks on mobile computing,
http://www-sor.inria.fr/~aline/mobile/mobile.html.
[Amm87] Ammar, M.H., Response time in a Teletext system: an individual
user's perspective, IEEE Transactions on Communications 35(11),
1987, 1159-1170.
[ARS97] Acharya, A., Ranganathan, M., Saltz, J., Sumatra: a language for
resource-aware mobile programs, J. Vitek, C. Tschudin (eds.), Mobile
Object Systems, Lecture Notes in Computer Science 1222, Springer-
Verlag, Berlin, 1997, 111-130.
[Aso94] Asokan, N., Anonymity in mobile computing environment, IEEE
Workshop on Mobile Computing Systems and Applications, 1994,
200-204,
http://snapple.cs.washington.edu:600/library/mcsa94/asokan.ps.
[AW85] Ammar, M.H., Wong, J.W., The design of Teletext broadcast cycles,
Performance Evaluation 5(4), 1985, 235-242.
[Bar97] Barbara, D., Certification reports: supporting transactions in wire-
less systems, Proc. IEEE International Conference on Distributed
Computing Systems, 1997, 466-473.
[BB97] Bakre, A., Badrinath, B., Implementation and performance evalua-
tion of indirect TCP, IEEE Transactions on Computers 46(3), 1997,
260-278.
[BBI+93] Badrinath, B.R., Bakre, A., Imielinski, T., Marantz, R., Handling
mobile clients: a case for indirect interaction, Proc. 4th Workshop on
Workstation Operating Systems, Aigen, Austria, 1993, 91-97.
[BC96] Bestavros, A., Cunha, C., Server-initiated document dissemination
for the WWW, IEEE Data Engineering Bulletin 19(3), 1996, 3-11.
[BGH+92] Bowen, T., Gopal, G., Herman, G., Hickey, T., Lee, K., Mansfield, W.,
Raitz, J., Weinrib, A., The Datacycle architecture, Communications
of the ACM 35(12), 1992, 71-81.
[BGZ+96] Bukhres, O., Goh, H., Zhang, P., Elkhammas, E., Mobile computing
architecture for heterogeneous medical databases, Proc. 9th Inter-
national Conference on Parallel and Distributed Computing Systems,
1996.
[BI94] Barbara, D., Imielinski, T., Sleepers and workaholics: caching strate-
gies in mobile environments, Proc. ACM SIGMOD Intl. Conference
on Management of Data (SIGMOD 94), 1994, 1-12.
[BJ95] Bukhres, O.A., Jing, J., Performance analysis of adaptive caching
algorithms in mobile environments, Information Sciences, An Inter-
national Journal 95(2), 1995, 1-29.
[BMM98] Bukhres, O., Mossman, M., Morton, S., Mobile medical database
approach for battlefield environments, The Australian Journal on
Computer Science 30(2), 1998, 87-95.
[BP97] Badrinath, B.R., Phatak, S., Database server organization for han-
dling mobile clients, Technical Report DCS-342, Department of Com-
puter Science, Rutgers University, 1997.
[Bro95] Brodsky, I., The revolution in personal telecommunications, Artech
House Publishers, Boston, London, 1995.
[CGH+95] Chess, D., Grosof, B., Harrison, C., Levine, D., Parris, C., Tsudik,
G., Itinerant agents for mobile computing, IEEE Personal Commu-
nications 2(5), 1995, 34-49.
[Chr93] Chrysanthis, P.K., Transaction processing in mobile computing en-
vironment, Proc. IEEE Workshop on Advances in Parallel and Dis-
tributed Systems, Princeton, New Jersey, 1993, 77-83.
[DCK+97] Datta, A., Celik, A., Kim, J., VanderMeer, D., Kumar, V., Adaptive
broadcast protocols to support efficient and energy conserving re-
trieval from databases in mobile computing environments, Proc. 19th
IEEE International Conference on Data Engineering, 1997, 124-133.
[DGS85] Davidson, S.B., Garcia-Molina, H., Skeen, D., Consistency in parti-
tioned networks, ACM Computing Surveys 17(3), 1985, 341-370.
[DHB97] Dunham, M., Helal, A., Balakrishnan, S., A mobile transaction model
that captures both the data and movement behavior, ACM/Baltzer
Journal on Special Topics on Mobile Networks, 1997, 149-162.
[DKL+94] Douglis, F., Kaashoek, F., Li, K., Caceres, R., Marsh, B., Tauber,
J.A., Storage alternatives for mobile computers, Proc. 1st Symp. on
Operating Systems Design and Implementation, Monterey, California,
USA, 1994, 25-37.
[DPS+94] Demers, A., Petersen, K., Spreitzer, M., Terry, D., Theimer, M.,
Welch, B., The Bayou architecture: support for data sharing among
mobile users, Proc. IEEE Workshop on Mobile Computing Systems
and Applications, Santa Cruz, CA, 1994, 2-7.
[EZ97] Elwazer, M., Zaslavsky, A., Infrastructure support for mobile in-
formation systems in Australia, Proc. Pacific-Asia Conference on
Information Systems (PACIS'97), Brisbane, QLD, Australia, 1997,
33-43.
[FGB+96] Fox, A., Gribble, S.D., Brewer, E.A., Amir, E., Adapting to net-
work and client variability via on-demand dynamic distillation, Proc.
International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS- VII), Cambridge, MA,
1996, 160-170.
[FZ94] Forman, G.H., Zahorjan, J., The challenges of mobile computing,
IEEE Computer 27(6), 1994, 38-47.
[GHM+90] Guy, R.G., Heidemann, J.S., Mak, W., Page, T.W.J., Popek, G.J.,
Rothmeier, D., Implementation of the Ficus replicated file system,
Proc. Summer 1990 USENIX Conference, 1990, 63-71.
[GHN+96] Gray, J., Helland, P., Neil, P.O., Shasha, D., The dangers of repli-
cation and a solution, Proc. ACM SIGMOD Conference, Montreal,
Canada, 1996, 173-182.
[Gif90] Gifford, D., Polychannel systems for mass digital communication,
Communications of the ACM 33(2), 1990, 141-151.
[GKL+94] Gruber, R., Kaashoek, F., Liskov, B., Shrira, L., Disconnected opera-
tion in the Thor object-oriented database system, Proc. IEEE Work-
shop on Mobile Computing Systems and Applications, Santa Cruz,
CA, 1994, 51-56.
[HH93] Huston, L.B., Honeyman, P., Disconnected operation for AFS, Proc.
USENIX Symposium on Mobile and Location-Independent Comput-
ing, Cambridge, Massachusetts, 1993, 1-10.
[HH94] Huston, L., Honeyman, P., Peephole log optimization, Proc. IEEE
Workshop on Mobile Computing Systems and Applications, Santa
Cruz, CA, 1994, http://citeseer.nj.nec.com/huston94peephole.html.
[HH95a] Honeyman, P., Huston, L.B., Communication and consistency in mo-
bile file systems, IEEE Personal Communications 2(6), 1995, 44-48.
[HH95b] Huston, L.B., Honeyman, P., Partially connected operation, Com-
puting Systems 4(8), 1995, 365-379.
[HPG+92] Heidemann, J., Page, T.W., Guy, R.G., Popek, G.J., Primarily dis-
connected operation: experience with Ficus, Proc. 2nd Workshop on
the Management of Replicated Data, 1992, 2-5.
[HSL98] Housel, B.C., Samaras, G., Lindquist, D.B., WebExpress: a
client/intercept based system for optimizing Web browsing in a wire-
less environment, ACM/Baltzer Mobile Networking and Applications
(MONET) 3(4), Special Issue on Mobile Networking on the Internet,
1998, 419-431. Also, University of Cyprus, CS-TR 96-18, 1996.
[IB94] Imielinski, T., Badrinath, B.R., Wireless mobile computing: chal-
lenges in data management, Communications of the ACM 37(10),
1994, 18-28.
[IK95] Imielinski, T., Korth, H. (eds.), Mobile computing, Kluwer Academic
Publishers, 1995.
[Inc95] Motorola, Inc., Wireless data communications: an overview,
http://www.mot.com/wdg/.
[Inc96] Ericsson, Inc., GSM: the future,
http://www.ericsson.se/systems/gsm/future.htm.
[IV94] Imielinski, T., Viswanathan, S., Adaptive wireless information sys-
tems, Proc. SIG Data Base Systems Conference, Japan, 1994, 19-41.
[IVB94a] Imielinski, T., Viswanathan, S., Badrinath, B.R., Energy efficient
indexing on air, Proc. ACM SIGMOD Intl. Conference on Manage-
ment of Data (SIGMOD 94), 1994, 25-36.
[IVB94b] Imielinski, T., Viswanathan, S., Badrinath, B.R., Power efficient fil-
tering of data on air, Proc. 4th International Conference on Extending
Database Technology, 1994, 245-258.
[JBE95] Jing, J., Bukhres, O., Elmagarmid, A., Distributed lock management
for mobile transactions, Proc. 15th IEEE International Conference
on Distributed Computing Systems, 1995, 118-125.
[JBE+95] Jing, J., Bukhres, O., Elmagarmid, A.K., Alonso, R., Bit-sequences:
a new cache invalidation method in mobile environments, Technical
Report CSD-TR-94-074, Revised May 95, Department of Computer
Sciences, Purdue University, 1995.
[JK94] Jain, R., Krishnakumar, N., Network support for personal informa-
tion services for PCS users, Proc. IEEE Conference on Networks for
Personal Communications, 1994, 1-7.
[JTK97] Joseph, A.D., Tauber, J.A., Kaashoek, M.F., Mobile computing with
the Rover toolkit, IEEE Transactions on Computers 46(3), 1997,
337-352.
[Kat94] Katz, R.H., Adaptation and mobility in wireless information systems,
IEEE Personal Communications 1, 1994, 6-17.
[KB92] Krishnakumar, N., Bernstein, A., High throughput escrow algorithms
for replicated databases, Proc. 18th VLDB Conference, 1992, 175-
186.
[KJ95] Krishnakumar, N., Jain, R., Mobility support for sales and inven-
tory applications, T. Imielinski, H. Korth (eds.), Mobile Computing,
Kluwer Academic Publishers, 1995, 571-594.
[KS92] Kistler, J.J., Satyanarayanan, M., Disconnected operation in the
Coda file system, ACM Transactions on Computer Systems 10(1),
1992, 213-225.
[KS93] Kumar, P., Satyanarayanan, M., Log-based directory resolution in
the Coda file system, Proc. 2nd International Conference on Parallel
and Distributed Information Systems, San Diego, CA, 1993, 202-213.
[KS95] Kumar, P., Satyanarayanan, M., Flexible and safe resolution of file
conflicts, Proc. Winter 1995 USENIX Conference, New Orleans, LA,
1995, 95-106.
[Kue94] Kuenning, G.H., The design of the Seer predictive caching system,
Proc. IEEE Workshop on Mobile Computing Systems and Applica-
tions, Santa Cruz, CA, 1994, 37-43,
ftp://ftp.cs.ucla.edu/pub/ficus/mcsa94.ps.gz.
[LMJ96] Liu, G.Y., Marlevi, A., Maguire Jr., G.Q., A mobile virtual-
distributed system architecture for supporting wireless mobile com-
puting and communications, ACM Journal on Wireless Networks 2,
1996, 77-86.
[LS94] Lu, Q., Satyanarayanan, M., Isolation-only transactions for mobile
computing, Operating Systems Review, 1994, 81-87.
[LS95] Lu, Q., Satyanarayanan, M., Improving data consistency in mobile
computing using isolation-only transactions, Proc. 5th Workshop on
Hot Topics in Operating Systems, Orcas Island, Washington, 1995,
124-128, http://citeseer.nj.nec.com/lu95improving.html.
[MB96] Morton, S., Bukhres, O., Mobile transaction recovery in distributed
medical databases, Proc. 8th International Conference on Parallel
and Distributed Computing and Systems, 1996.
[MB97] Morton, S., Bukhres, O., Utilizing mobile computing in the Wishard
Memorial Hospital Ambulatory Service, Proc. 12th ACM Symposium
on Applied Computing (ACM SAC'97), 1997,287-294.
[MBM96] Morton, S., Bukhres, O., Mossman, M., Mobile computing architec-
ture for a battlefield environment, Proc. International Symposium
on Cooperative Database Systems for Advanced Applications, 1996,
130-139.
[MBZ+97] Morton, S., Bukhres, O., Zhang, P., Vanderdijs, E., Platt, J., Moss-
man, M., A proposed architecture for a mobile computing environ-
ment, Proc. 5th Euromicro Workshop on Parallel and Distributed
Processing, 1997.
1 Introduction
Recent advances in data capture, data transmission and data storage tech-
nologies have resulted in a growing gap between more powerful database sys-
tems and users' ability to understand and effectively analyze the information
collected. Many companies and organizations gather gigabytes or terabytes
of business transactions, scientific data, web logs, satellite pictures, and text
reports, which are simply too large and too complex to support a decision-
making process. Traditional database and data warehouse querying models
are not sufficient to extract trends, similarities and correlations hidden in
very large databases.
The value of existing databases and data warehouses can be signif-
icantly enhanced with the help of data mining. Data mining is a new research
area which aims at nontrivial extraction of implicit, previously unknown and
potentially useful information from large databases and data warehouses.
Data mining, sometimes referred to as data dredging, knowledge extraction
or pattern discovery, can help answer business questions that were too time
consuming to resolve with traditional data processing techniques. The pro-
cess of mining the data can be perceived as a new way of querying - with
questions such as "which clients are likely to respond to our next promotional
mailing, and why?" .
Data mining aims at the discovery of knowledge that can be potentially
useful and unknown. Whether the discovered knowledge is new, useful, or
interesting is subjective, since it depends on the application. Data mining
algorithms can discover large numbers of patterns and rules. To reduce this
number, users may have to impose additional measures and constraints on
the patterns.
Two main types of data mining tasks are description and prediction.
Description consists of the automated discovery of previously unknown patterns
that describe the general properties of the existing data. Example applica-
tions include the analysis of retail sales data to identify groups of products
that are often purchased together by customers, fraudulent credit card trans-
action detection, telecommunication network failure detection. The predic-
tion tasks typically attempt to do predictions of trends and behaviors based
on inference on available data. A typical application of a predictive problem
is targeted marketing, where the goal is to identify the targets most likely to
respond to future mailings. Other predictive problems include customer
retention, promotion design, and bankruptcy forecasting. Such applications may
help companies make proactive, knowledge-driven decisions.
Data mining is also popularly known as knowledge discovery in databases
(KDD); however, data mining is actually only one part of the knowledge
discovery process. The knowledge discovery process is composed of seven
steps that lead from the raw data collection to new knowledge:
1. Data cleaning (data cleansing), which consists in the removal of noise and
irrelevant data from the raw data collection.
2. Data integration, in which data from multiple, possibly heterogeneous,
sources are combined into a common source, such as a data warehouse.
3. Data selection, in which the data relevant to the analysis are retrieved
from the data collection.
4. Data transformation, in which the selected data are transformed into
forms appropriate for mining.
5. Data mining, in which intelligent methods are applied in order to extract
potentially useful patterns.
6. Pattern evaluation, in which the interesting patterns representing knowl-
edge are identified based on given measures.
7. Knowledge presentation, in which the discovered knowledge is presented
to the user.
Typically, some of the above steps are combined; for example, data cleaning
and data integration constitute the preprocessing phase of data warehouse
generation, while data selection and data transformation can be expressed
by means of a database query.
Depending on the type of patterns extracted, data mining methods are
divided into many categories, of which the most important are:
mining problems. The languages employ the concept of a data mining query,
which can be optimized and evaluated by a data mining-enabled database
management system (KDDMS - knowledge discovery management system).
2 Mining Associations
Let L = {l1, l2, ..., lm} be a set of literals, called items. Let a non-empty set
of items T be called an itemset. Let D be a set of variable-length itemsets,
where each itemset T ⊆ L. We say that an itemset T supports an item x ∈ L
if x is in T. We say that an itemset T supports an itemset X ⊆ L if T supports
every item in the set X.
An association rule is an implication of the form X → Y, where X ⊂ L,
Y ⊂ L, X ∩ Y = ∅. Each rule has associated measures of its statistical
significance and strength, called support and confidence. The support of the
rule X → Y in the set D is:

support(X → Y, D) = |{T ∈ D : T supports X ∪ Y}| / |D|
In other words, the rule X → Y holds in the set D with support s if
s · 100% of itemsets in D support X ∪ Y. Support is an important measure
since it is an indication of the number of itemsets covered by the rule. Rules
with very small support are often unreliable, since they do not represent a
significant portion of the database.
The confidence of the rule X → Y in the set D is:

confidence(X → Y, D) = |{T ∈ D : T supports X ∪ Y}| / |{T ∈ D : T supports X}|

In other words, the rule X → Y has confidence c if c · 100% of the itemsets in
D that support X also support Y. Confidence indicates the strength of the
rule. Unlike support, confidence is asymmetric (confidence(X → Y) ≠
confidence(Y → X)) and non-transitive (the presence of highly confident
rules X → Y and Y → Z does not mean that X → Z will have the minimum
confidence).
The goal of mining association rules is to discover all association rules
having support greater than or equal to some minimum support threshold,
minsup, and confidence greater than or equal to some minimum confidence
threshold, minconf"
The strongest association rules (minsup = 0.4, minconf = 0.5) that can be
found in the example database are listed below:
beer_10 → potato_chips_12                     support = 0.60, confidence = 1.00
potato_chips_12 → beer_10                     support = 0.60, confidence = 0.75
beer_10 ∧ diapers_b01 → potato_chips_12       support = 0.40, confidence = 1.00
diapers_b01 ∧ potato_chips_12 → beer_10       support = 0.40, confidence = 1.00
diapers_b01 → beer_10 ∧ potato_chips_12       support = 0.40, confidence = 1.00
diapers_b01 → beer_10                         support = 0.40, confidence = 1.00
diapers_b01 → potato_chips_12                 support = 0.40, confidence = 1.00
beer_10 ∧ potato_chips_12 → diapers_b01       support = 0.40, confidence = 0.67
beer_10 → diapers_b01 ∧ potato_chips_12       support = 0.40, confidence = 0.67
beer_10 → diapers_b01                         support = 0.40, confidence = 0.67
soda_03 → potato_chips_12                     support = 0.40, confidence = 0.67
potato_chips_12 → beer_10 ∧ diapers_b01       support = 0.40, confidence = 0.50
potato_chips_12 → diapers_b01                 support = 0.40, confidence = 0.50
potato_chips_12 → soda_03                     support = 0.40, confidence = 0.50
For example, the association rule "beer_10 → potato_chips_12 (support =
0.60, confidence = 1.00)" states that every time the product beer_10 is pur-
chased, the product potato_chips_12 is purchased too, and that this pattern
occurs in 60 percent of all transactions. Knowing that 60 percent of customers
who buy a certain brand of beer also buy a certain brand of potato chips can
help the retailer determine appropriate promotional displays, optimal use of
shelf space, and effective sales strategies. As a result of this type of
association rule discovery, the retailer might decide not to discount potato
chips whenever the beer is on sale, as doing so would needlessly reduce profits.
[Figure: the lattice of itemsets over {A, B, C, D}, partitioned into prefix-based
classes]
Horizontal mining algorithms assume that the database rows represent trans-
actions and each transaction consists of a set of items. Vertical mining algo-
rithms assume that the database rows represent items and with each item we
associate a set of transaction identifiers for the transactions that contain this
item. The two layouts of the database from the previous example are shown
in Figure 2.3.
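For concreteness, here is a small Python sketch of the two layouts and the
conversion from horizontal to vertical; the transaction ids and items follow
the running example.

from collections import defaultdict

# Horizontal layout: each row is a transaction with its set of items.
horizontal = {
    1: {"soda_03", "potato_chips_12"},
    2: {"beer_10", "potato_chips_12", "diapers_b01"},
    3: {"soda_03"},
}

def to_vertical(horizontal):
    """Vertical layout: each item maps to the transaction ids containing it."""
    vertical = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)

# In the vertical layout, the support count of an itemset is the size of the
# intersection of its items' tid-sets.
vertical = to_vertical(horizontal)
print(vertical["potato_chips_12"] & vertical["soda_03"])   # {1}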
L1 = {frequent 1-itemsets};
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = apriori_gen(Lk-1);
    forall transactions t ∈ D do begin
        Ct = subset(Ck, t);
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = {c ∈ Ck | c.count ≥ minsup};
end
Answer = ∪k Lk;
Fig. 2.4. Frequent itemset generation phase of the Apriori algorithm
insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1
  and p.item2 = q.item2
  ...
  and p.itemk-2 = q.itemk-2
  and p.itemk-1 < q.itemk-1;
Next, in the prune step, each itemset c ∈ Ck such that some (k-1)-subset
of c is not in Lk-1 is deleted:

forall itemsets c ∈ Ck do
    forall (k-1)-subsets s of c do
        if (s ∉ Lk-1) then delete c from Ck;
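Both steps can be sketched directly in Python, representing k-itemsets as
lexicographically sorted tuples; this is an illustrative reimplementation, not
the book's code.

from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets."""
    L_prev = set(L_prev)
    # Join step: merge pairs agreeing on their first k-2 items.
    Ck = {p[:-1] + (p[-1], q[-1])
          for p in L_prev for q in L_prev
          if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Prune step: drop candidates having an infrequent (k-1)-subset.
    return {c for c in Ck
            if all(s in L_prev for s in combinations(c, k - 1))}

L2 = [("beer_10", "diapers_b01"), ("beer_10", "potato_chips_12"),
      ("diapers_b01", "potato_chips_12"), ("potato_chips_12", "soda_03")]
print(apriori_gen(L2, 3))  # {('beer_10', 'diapers_b01', 'potato_chips_12')}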
trans-id  products
1         soda_03, potato_chips_12
2         beer_10, potato_chips_12, diapers_b01
3         soda_03
4         soda_03, beer_10, potato_chips_12
5         beer_10, potato_chips_12, diapers_b01
The first pass of the Apriori algorithm counts product occurrences to deter-
mine the frequent 1-itemsets L1. Each product that is contained in at least
2 purchase transactions (i.e., in at least 30% of the five transactions) becomes
a frequent 1-itemset. All 1-itemsets together with their support values are
listed below:
itemset           support
beer_10           0.60
diapers_b01       0.40
potato_chips_12   0.80
soda_03           0.60
itemset                        support
beer_10, diapers_b01           0.40
beer_10, potato_chips_12       0.60
beer_10, soda_03               0.20
diapers_b01, potato_chips_12   0.40
diapers_b01, soda_03           0.00
potato_chips_12, soda_03       0.40
itemset                        support
beer_10, diapers_b01           0.40
beer_10, potato_chips_12       0.60
diapers_b01, potato_chips_12   0.40
potato_chips_12, soda_03       0.40
In the next step, the apriori_gen function is used again, this time to generate
the candidate 3-itemsets C3. Each candidate 3-itemset is a superset of frequent
2-itemsets, and its every subset is contained in L2. The set of candidate 3-
itemsets contains only one itemset, {beer_10, diapers_b01, potato_chips_12}.
The set L3 of frequent 3-itemsets consists of this single itemset, because its
support (0.40) is at least 0.30.
From the only frequent 3-itemset of L3, the following 3-item association rules
will be generated:

source 3-itemset                        supp   generated rule                             conf
beer_10, diapers_b01, potato_chips_12   0.40   beer_10 ∧ diapers_b01 → potato_chips_12    0.67
beer_10, diapers_b01, potato_chips_12   0.40   beer_10 ∧ potato_chips_12 → diapers_b01    0.67
beer_10, diapers_b01, potato_chips_12   0.40   diapers_b01 ∧ potato_chips_12 → beer_10    1.00
beer_10, diapers_b01, potato_chips_12   0.40   diapers_b01 → beer_10 ∧ potato_chips_12    1.00
beer_10, diapers_b01, potato_chips_12   0.40   potato_chips_12 → beer_10 ∧ diapers_b01    0.50
Notice that the association rule "beer_10 → diapers_b01 ∧ potato_chips_12" has not even been generated, because the rule "beer_10 ∧ potato_chips_12 → diapers_b01" did not have the minimum confidence. Finally, all the generated association rules are filtered and only the rules with minimum confidence (≥ 70%) form the result of the algorithm Apriori.
The FP-Growth algorithm discovers frequent itemsets without candidate generation and works in two steps. In the first step, the database is initially scanned to discover all frequent items. Then, all non-frequent items are removed from each transaction and the frequent items are sorted in frequency descending order. Next, all the transactions are mapped to paths in an FP-tree.

An FP-tree consists of one root node labeled "null" and of regular nodes, each containing a frequent item and an integer counter. Each path in the FP-tree originating from the root represents a set of transactions containing identical frequent items. The last node of the path contains the number of supporting transactions. To facilitate tree traversal, an item header table is built, in which each item points, via links, to its first occurrence in the tree. Nodes with the same item name are linked in sequence via such links.

In the second step, the FP-tree is mined in order to find all frequent itemsets. The mining algorithm is based on the property that for any frequent item a, all possible frequent itemsets that contain a can be obtained by following a's node links, starting from a's head in the FP-tree header. The algorithm uses the concepts of a transformed prefix path, a conditional pattern base, and a conditional FP-tree. A transformed prefix path is the prefix subpath of a node a, with the frequency counts of its nodes adjusted to the same value as the count of the node a. A conditional pattern base of a is a small database of transformed prefix paths of a. A conditional FP-tree of a is the FP-tree created over the conditional pattern base of a. The complete algorithm is given in Figure 2.6. It starts with Tree = FP-tree and a = null.
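As an illustration of the first step, the following minimal Python sketch (class and variable names are ours, not part of the original formulation) builds an FP-tree and its header table of node links from transactions that have already been filtered and sorted as described above.

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}   # item -> FPNode
        self.link = None     # next node carrying the same item

def build_fp_tree(sorted_transactions):
    """sorted_transactions: lists of frequent items in frequency-descending order."""
    root = FPNode(None, None)
    header = {}              # item -> first node with that item
    for trans in sorted_transactions:
        node = root
        for item in trans:
            child = node.children.get(item)
            if child is None:            # grow a new branch
                child = FPNode(item, node)
                node.children[item] = child
                if item not in header:   # maintain the node-link chain
                    header[item] = child
                else:
                    n = header[item]
                    while n.link:
                        n = n.link
                    n.link = child
            child.count += 1             # shared prefix: just increment
            node = child
    return root, header

# the example database after filtering and sorting (see below):
db = [["potato_chips_12", "soda_03"],
      ["potato_chips_12", "beer_10", "diapers_b01"],
      ["soda_03"],
      ["potato_chips_12", "beer_10", "soda_03"],
      ["potato_chips_12", "beer_10", "diapers_b01"]]
tree, header = build_fp_tree(db)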
Consider once more the example database and minsup = 0.30. The first scan of the database discovers the following frequent items, sorted in frequency descending order:

item             count
potato_chips_12  4
beer_10          3
soda_03          3
diapers_b01      2
Next, all non-frequent items are removed from the original database and the frequent items inside each transaction are sorted in frequency descending order. Thus, the database looks like the following:

trans_id  products
1         potato_chips_12, soda_03
2         potato_chips_12, beer_10, diapers_b01
3         soda_03
4         potato_chips_12, beer_10, soda_03
5         potato_chips_12, beer_10, diapers_b01
The root of the FP-tree is created and labeled with "null". We scan the database again. The scan of the first transaction leads to the construction of the first branch of the tree: ((potato_chips_12 : 1), (soda_03 : 1)). The number after ":" is the support counter. For the second transaction, since its frequent items list shares a common prefix (potato_chips_12) with the existing path, the count of the first node is incremented by 1, and two new nodes are created and linked as a child path of (potato_chips_12). For the third transaction, a new branch of the tree is constructed: ((soda_03 : 1)). For the fourth transaction, since its frequent items list shares a common prefix (potato_chips_12, beer_10) with an existing path, the count of each node along the prefix is incremented by 1, and one new node (soda_03) is created and linked as a child of (beer_10). For the last transaction, since it can be completely mapped to an existing path ((potato_chips_12 : 3), (beer_10 : 2), (diapers_b01 : 1)), the count of each node along the path is incremented by 1. The resulting FP-tree with the associated item links is shown in Figure 2.7.
After having created the FP-tree, we mine the tree to discover all frequent itemsets. First, we collect all the transactions in which diapers_b01 participates. This item derives a single path in the FP-tree. The path indicates that the itemsets {potato_chips_12, beer_10, diapers_b01}, {potato_chips_12, diapers_b01}, and {beer_10, diapers_b01} appear twice in the database (support = 0.4). Next, we collect all the transactions in which soda_03 participates. This item derives three paths in the FP-tree: ((potato_chips_12 : 4), (soda_03 : 1)), ((potato_chips_12 : 4), (beer_10 : 3), (soda_03 : 1)), and ((soda_03 : 1)). To find which items appear together with soda_03, we build a conditional pattern base: {{potato_chips_12} : 1, {potato_chips_12, beer_10} : 1} and mine it recursively. We find that potato_chips_12 appears twice, so the discovered frequent itemset is {potato_chips_12, soda_03} (support = 0.4). Next, we collect all the transactions in which beer_10 participates. This item derives a single path in the FP-tree. The path indicates that the itemset {potato_chips_12, beer_10} appears three times in the database (support = 0.6). Finally, we skip the path from the item potato_chips_12 since it is a single-item path.
The result of the FP-Growth algorithm is the following:

itemset                                support
potato_chips_12                        0.80
beer_10                                0.60
soda_03                                0.60
diapers_b01                            0.40
potato_chips_12, beer_10, diapers_b01  0.40
potato_chips_12, diapers_b01           0.40
beer_10, diapers_b01                   0.40
potato_chips_12, soda_03               0.40
potato_chips_12, beer_10               0.60
The Eclat algorithm operates on the vertical layout of the database and discovers frequent itemsets by recursively intersecting tidlists within prefix-based classes:

Eclat(S_{k-1}):
forall itemsets I_a, I_b ∈ S_{k-1}, a < b do begin
  C = I_a.tidlist ∩ I_b.tidlist;
  if (|C| ≥ minsup)
    add the itemset I_a ∪ I_b with tidlist C to L_k;
end
Partition L_k into (k−1)-length prefix-based classes
foreach class S_k in L_k
  Eclat(S_k);
end
Answer = ∪_k L_k;
The vertical layout of the example database is the following:

item             tidlist
beer_10          2, 4, 5
diapers_b01      2, 5
potato_chips_12  1, 2, 4, 5
soda_03          1, 3, 4

In the first step, the frequent 1-itemsets and their tidlists are determined:

itemset          support
beer_10          0.60
diapers_b01      0.40
potato_chips_12  0.80
soda_03          0.60
Next, the frequent 2-itemsets L_2 are computed by intersecting the tidlists of the frequent 1-itemsets and are divided into 1-length prefix-based classes. All the frequent 2-itemsets that begin with beer_10 form the first class, all the frequent 2-itemsets that begin with diapers_b01 form the second class, and all the frequent 2-itemsets that begin with potato_chips_12 form the third class.

Next, for each of the classes we recursively call the Eclat algorithm. The algorithm merges pairs of frequent itemsets inside each class to generate new potentially frequent itemsets and then evaluates their supports by counting tidlist items. Here, we find one frequent 3-itemset: {beer_10, diapers_b01, potato_chips_12}.
[Figure omitted: a tree with root products and three interior nodes: beverages (with leaves soda_03 and beer_10), candy (with leaf potato_chips_12), and children (with leaf diapers_b01).]
Fig. 2.10. An example of a conceptual hierarchy for finding generalized association rules
Consider the database from Section 2.1 and the conceptual hierarchy in Figure 2.10 for a simple example. Assume that the minimum support for rules to discover is 0.75 and the minimum confidence is 0.90. First, every itemset in the database is replaced with an extended itemset, containing also the ancestors of its items, and any duplicates are removed. This results in the following contents of the database:
trans_id  products
1         beverages, candy, potato_chips_12, soda_03
2         beer_10, beverages, candy, children, diapers_b01, potato_chips_12
3         beverages, soda_03
4         beer_10, beverages, candy, potato_chips_12, soda_03
5         beer_10, beverages, candy, children, diapers_b01, potato_chips_12
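A small Python sketch of this extension step (names are ours; the root of the hierarchy is omitted from the parent map, since it would occur in every transaction and carry no information):

def extend(transaction, parent):
    """Add all ancestors of each item (parent maps an item to its parent node)."""
    extended = set(transaction)
    for item in transaction:
        while item in parent:
            item = parent[item]
            extended.add(item)
    return extended

parent = {"soda_03": "beverages", "beer_10": "beverages",
          "potato_chips_12": "candy", "diapers_b01": "children"}
print(sorted(extend({"soda_03", "potato_chips_12"}, parent)))
# ['beverages', 'candy', 'potato_chips_12', 'soda_03'] -- transaction 1 above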
The derivation of large itemsets is shown in Figure 2.11. The first pass of the algorithm simply counts item occurrences to determine the large 1-itemsets. Notice that the itemsets can contain items from the leaves of the conceptual hierarchy as well as from interior nodes. In a subsequent pass, candidate itemsets C_2 are generated and the database is scanned to count their support. We can also prune every itemset that consists of an item and its ancestor, so as not to generate redundant rules. Then the algorithm ends, since no candidate 3-itemsets can be generated. Finally, the association rules are generated from the large itemsets. The discovered generalized association rules are presented in Figure 2.12.
The Partition algorithm divides the database into disjoint partitions that are mined separately in main memory, and requires only two scans of the database. The partition sizes are chosen such that each partition can be handled entirely in main memory, so that the partitions are read from the database only once in each phase.
The DIC (Dynamic Itemset Counting) algorithm [BMU+97] tries to generate and count the itemsets earlier, thus reducing the number of database scans. The database is treated as a set of intervals of transactions, and the intervals are scanned sequentially. During the first interval scan, 1-itemsets are generated and counted. At the end of the first scan, potentially frequent 2-itemsets are generated. During the second interval scan, all generated 1-itemsets and 2-itemsets are counted. At the end of the second scan, potentially frequent 3-itemsets are generated, and so on. When the end of the database is reached, the database is rewound to the beginning and the itemsets that have not been fully counted are processed. The actual number of database scans depends on the interval size; the minimal number of database scans is two.
Data mining research has also focused on online algorithms for mining association rules. CARMA (Continuous Association Rule Mining Algorithm) shows current association rules to the user and allows the user to change the minsup and minconf parameters online, at any transaction during the first scan of the database. CARMA generates the itemsets in the first scan and finishes counting all of them during the second scan, similarly to DIC. After having read each transaction, CARMA first increments the counts of the itemsets that are subsets of the transaction. Then, if all immediate subsets of an itemset are currently potentially frequent with respect to the current minsup and the part of the database read so far, it generates the new itemset from the transaction. For more accurate prediction of whether an itemset is potentially large, CARMA calculates an upper bound for the count of the itemset, which is the sum of its current count and an estimated number of occurrences before the itemset was generated. The estimate (called maximum misses) is computed when the itemset is first generated. CARMA needs at most 2 database scans to discover the requested association rules.
Incremental mining algorithms reuse previously discovered large itemsets and integrate the support information of the new large itemsets in order to reduce the pool of candidate itemsets to be re-examined.
Let L = {l_1, l_2, ..., l_m} be a set of literals called items. Let D be a set of variable-length sequences, where each sequence S = (X_1 X_2 ... X_n) is an ordered list of sets of items such that each set of items X_i ⊆ L.

We say that a sequence (X_1 X_2 ... X_n) is contained in another sequence (Y_1 Y_2 ... Y_m) if there exist integers i_1 < i_2 < ... < i_n such that X_1 ⊆ Y_{i_1}, X_2 ⊆ Y_{i_2}, ..., X_n ⊆ Y_{i_n}. We say that in a set of sequences, a sequence is maximal if it is not contained in any other sequence. We say that a sequence S from D supports a sequence Q if Q is contained in S.

A sequential pattern is a maximal sequence in a set of sequences. Each sequential pattern has an associated measure of its statistical significance, called support. The support for the sequential pattern (X_1 X_2 ... X_n) in the set of sequences D is the fraction of sequences in D that support (X_1 X_2 ... X_n).
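A simple Python sketch of the containment test and the support computation implied by these definitions (names are ours; sequences are represented as lists of item sets):

def contains(big, small):
    """True if the sequence `small` (a list of sets) is contained in `big`."""
    i = 0
    for elem in big:
        if i < len(small) and small[i] <= elem:   # subset match, in order
            i += 1
    return i == len(small)

def support(D, pattern):
    return sum(contains(s, pattern) for s in D) / len(D)

D = [[{"a"}, {"b", "c"}, {"d"}],
     [{"a", "b"}, {"c"}],
     [{"b"}, {"d"}]]
print(support(D, [{"a"}, {"c"}]))   # 2/3: contained in the first two sequences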
The GSP algorithm makes multiple passes over the database. While making a pass, one data-sequence at a time is read from the database and the support counts of the candidates contained in the data-sequence are incremented. Thus, given a set of candidate sequences C and a data-sequence d, the goal is to find all sequences in C that are contained in d. The algorithm for checking if the data-sequence d contains a candidate sequence s alternates between forward and backward phases. The algorithm starts in the forward phase from the first element. In the forward phase, the algorithm finds successive elements of s in d as long as the difference between the end-time of the element just found and the start-time of the previous element is less than max-gap (for an element s_i, start-time(s_i) and end-time(s_i) correspond to the first and last transaction-times of the set of transactions that contain s_i). If the difference is more than max-gap, the algorithm switches to the backward phase. If an element is not found, the data-sequence does not contain s.

In the backward phase, the algorithm backtracks and "pulls up" previous elements. If s_i is the current element and end-time(s_i) = t, the algorithm finds the first set of transactions containing s_{i-1} whose transaction-times are after t − max-gap. Pulling up s_{i-1} may necessitate pulling up s_{i-2}, because the max-gap constraint between s_{i-1} and s_{i-2} may no longer be satisfied. The algorithm moves backwards until either the max-gap constraint between the element just pulled up and the previous element is satisfied, or the first element has been pulled up. The algorithm then switches to the forward phase, finding elements of s in d starting from the element after the last element pulled up. If any element cannot be pulled up (that is, there is no subsequent set of transactions which contain the element), the data-sequence does not contain s. This procedure is repeated, switching between the backward and forward phases, until all the elements are found or it is discovered that the data-sequence does not contain s.
The SPADE algorithm discovers all sequential patterns in at most three scans of the database. It has been shown experimentally that SPADE is at least twice as fast as GSP.
3 Classification and Prediction

Classification and prediction are two forms of data analysis that are used to extract models describing data classes or to predict future data trends. Classification is used to predict categorical labels, while prediction is used to predict numerical values or value ranges. For example, a classification model can be built to classify a medical treatment as either safe or risky. A prediction model can be built to assess the value of a given company's stock, the blood pressure of a given patient, or the energy consumption of a given company. In this section, we will briefly describe and discuss basic techniques for data classification and data prediction.
Classification is a two-step process. In the first step, a concise model is built describing a predetermined set of data classes. The model is constructed by analyzing a dataset of training tuples, also called a training database, described by attributes. The data tuples of the training database are also called samples, examples, or instances. Each tuple of the training database is a feature vector (i.e., a set of <attribute, value> pairs) with its associated class. Attributes whose domain is numerical are called numerical attributes, whereas attributes whose domains are not numerical are called categorical. There is one distinguished attribute called the dependent attribute. The remaining attributes are called predictor attributes. Predictor attributes may be either numerical or categorical. If the dependent attribute is categorical, then the problem is referred to as a classification problem. If the dependent attribute is numerical, the problem is called a prediction problem.
In the second step of classification, the resulting model is used to assign
values to tuples where the values of the predictor attributes are known but the
value of the dependent attribute is unknown. First, the predictive accuracy
of the model is estimated based on the test set of tuples. These tuples are
randomly selected and are independent of the training tuples. The accuracy
of the learned model on a given test set of tuples is defined as a percentage
of test set tuples that are correctly classified by the model. If the accuracy
of the model is acceptable, the model can be used to classify future data
tuples and predict values of new tuples, for which the value of the dependent
attribute is missing or unknown.
Prediction is very similar to classification. However, in prediction, the model is constructed and used to predict not a discrete class label of a tuple, but a continuous value or a value range.
3.1 Classification
The main goal of classification is to build a formal concise model, called a classifier, of the dependent attribute based upon the values of the predictor attributes. The input to the classification problem is a training set of tuples, each belonging to a predefined class as determined by the dependent attribute. In the context of classification, the dependent attribute is usually called the class label attribute. The elements of the domain of the class label attribute are called class labels. The learned model can be used to predict the classes of new tuples, for which the class label is missing or unknown. Figure 3.1 shows a sample training set of tuples where each tuple represents a credit applicant. Here we are interested in building a model of what makes an applicant a high or low credit risk. The class label attribute is Risk; the predictor attributes are Age, Marital Status, Income, and Children. Figure 3.2 shows a sample classifier, in the form of a decision tree, that has been built based on the training set from Figure 3.1. Once a model is built, it can be used to determine a credit class of future unclassified applicants.
Many classification models have been proposed in the literature: decision trees [BFO+84,Mur98,Qui86,WK91], decision tables [Koh95], Bayesian methods [CS96,Mit97,WK91], neural networks [Bis95,Rip96], genetic algorithms [Gol89,Mit96], k-nearest neighbor methods [Aha92,DH73,Jam85], the rough set approach [CPS98,Paw91,SS96,Zia94], and other statistical methods [Jam85,MST94,WK91]. All mentioned classification models can be compared and evaluated according to [HK00]: predictive accuracy, scalability (efficiency in large databases), interpretability and understandability, speed, and robustness with regard to noise and missing values. Among these models, decision trees are particularly suited and attractive for data mining. First, due to their intuitive representation, the resulting classification model is easy for humans to understand. Second, decision trees can be constructed relatively fast compared to other classification methods. Third, decision trees scale well for large data sets and can handle high-dimensional data. Last, the accuracy of decision trees is comparable to that of other methods. Almost all major commercially available data mining tools include some form of decision tree model. The main drawback of decision trees is that they cannot capture correlations among attributes without additional computation.
The construction of the tree starts with a single root node N representing the training database D. If all tuples of D belong to the same class C, then the node N becomes a leaf node labeled C, and the algorithm stops (steps 2 and 3). Otherwise, the set of predictor attributes A is examined according to the split selection method SS and a splitting attribute, called "best_split", is selected (steps 6 and 7). The splitting attribute partitions the training database D into a set of separate classes of samples S_1, S_2, ..., S_v, where S_i, i = 1, ..., v, contains all samples from D with splitting attribute = a_i (step 9). A branch is created for each value a_i of the splitting attribute, and to each branch the set of samples S_i is assigned. The partitioning procedure is applied recursively to each descendant node to form the decision tree for each partition of samples (step 10): the set of attributes A is examined and from it a splitting attribute is selected. Once an attribute has been selected as a splitting attribute at a given node, it need not be considered in any of the node's descendants (step 10). The procedure stops when one of the following conditions is satisfied: (1) a node is "pure", i.e. all samples for the node belong to the same class; (2) there are no remaining attributes on which the samples belonging to a given internal node may be further partitioned — in this case the given node is converted into a leaf node and labeled with the majority class among its samples; or (3) there are no samples for the branch splitting attribute = a_i — in this case a leaf node is created and labeled with the majority class of the parent's samples. In the given example all predictor attributes are categorical, which is obviously not always the case. Continuous attributes have to be discretized.
There are several variants of the basic decision tree construction algorithm, proposed by researchers from machine learning (ID3, C4.5) [Qui86,Qui93], statistics (CART) [BFO+84], pattern recognition (CHAID) [Mag94], and data mining (SLIQ, SPRINT, SONAR, CLOUDS, PUBLIC, BOAT, RainForest) [ARS98,FMM+96,GRG00,GGR+99,MAR96,SAM96,RS98].
The main difference between the above-mentioned algorithms lies in the split selection method used during the building phase. The split selection method should maximize the accuracy of the constructed decision tree or, in other words, minimize its misclassification rate. Most split selection methods used in practice by commercial data mining tools belong to the class called impurity-based split selection methods [Shi99]. Impurity-based split selection methods find the splitting attribute of a node of the decision tree by minimizing an impurity measure, such as the entropy (ID3, C4.5) [Qui86,Qui93], the gini index (CART, SPRINT) [BFO+84,SAM96], or the χ² index of correlation (CHAID) [Mag94]. In the following we will briefly describe two popular split selection methods used in practice, based on the information gain and gini index measures, and give an intuition behind impurity-based split selection methods.
Information gain. Assume that the attribute A has v distinct values {a_1, a_2, ..., a_v}. If we select the attribute A as the splitting attribute for the set of samples D, then the attribute will partition D into subsets S_1, S_2, ..., S_v, where S_i, i = 1, ..., v, contains the samples from D for which A = a_i. Each subset S_i will correspond to a branch grown from the node containing D. Let s_{ij} denote the number of samples of the class C_i in a subset S_j. The information needed to classify samples in the resulting partitions according to the attribute A, called the entropy, is given by the following formula:

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \cdots + s_{mj}}{n} \, I(s_{1j}, s_{2j}, \ldots, s_{mj})

where

I(s_{1j}, s_{2j}, \ldots, s_{mj}) = - \sum_{i=1}^{m} p_{ij} \log_2(p_{ij})

and p_{ij} = s_{ij} / |S_j| is the probability that a sample from the subset S_j belongs to the class C_i.
The information gain of an attribute A is defined as follows:

Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)

where s_i denotes the number of samples of the class C_i in the whole set D and I(s_1, ..., s_m) is the expected information needed to classify a given sample of D. The information gain determines the amount of information that would be gained if the set of samples were split on the attribute A.
The classification algorithm based on the information gain measure works as follows. First, the algorithm calculates the information gain of all predictor attributes and selects the attribute with the highest information gain as the splitting attribute for the root node. The root node is labeled with the attribute, for each value of the attribute a branch is created, and the set of samples D is partitioned into a set of subsets, one for each branch. The process is repeated recursively for each branch using the samples assigned to the branch. The algorithm we have described only works when all attributes are categorical. If a set of samples contains numerical attributes, they must be discretized.
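A minimal Python sketch of the information gain computation for categorical attributes (function names and the sample-as-dictionary representation are ours, not part of the original formulation):

from collections import Counter
from math import log2

def info(class_counts):
    """Expected information I(s1,...,sm) for a list of class counts."""
    n = sum(class_counts)
    return -sum(c / n * log2(c / n) for c in class_counts if c > 0)

def gain(samples, attr, label):
    """Information gain of the categorical attribute `attr` w.r.t. class `label`.
    samples: list of dicts mapping attribute names to values."""
    base = info(list(Counter(s[label] for s in samples).values()))
    n, entropy = len(samples), 0.0
    for value in {s[attr] for s in samples}:
        subset = [s for s in samples if s[attr] == value]
        entropy += len(subset) / n * info(list(Counter(s[label] for s in subset).values()))
    return base - entropy

D = [{"Income": "low", "Risk": "high"}, {"Income": "low", "Risk": "high"},
     {"Income": "high", "Risk": "low"}, {"Income": "high", "Risk": "high"}]
print(gain(D, "Income", "Risk"))   # approximately 0.311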
The information gain measure has the following properties. When the splitting attribute partitions a set of samples into "pure" subsets, i.e. each consisting of samples belonging to a single class only, the value of the entropy is zero and the value of the information gain reaches its maximum. When the splitting attribute partitions a set of samples into subsets in which the samples are uniformly distributed among the different classes (i.e. each subset has an equal number of samples belonging to each class), then the value of the entropy reaches its maximum and the information gain reaches its minimum. The information gain measure tends to prefer attributes with many values. Notice that the entropy of an attribute that has a different value for each sample in a training database, e.g. an attribute that is an identifier of the training database (e.g. credit applicant identifier), is zero. Such an attribute will have the highest information gain. This is obvious, since such an attribute uniquely determines the class of each sample without any ambiguity. The problem is that splitting on this attribute is unreasonable, since it is useless for predicting the class of a new unknown sample and tells nothing about the structure of the decision tree. Therefore, to compensate for this effect of the information gain, a correction of the measure, called the gain ratio, was proposed [Qui93].
To illustrate the idea of the split selection method based on information gain, consider once more the example training database D from Figure 3.1. The training database D has four predictor attributes: Age, Marital Status, Income, and Children. The attributes Marital Status and Income are categorical, while the attributes Age and Children are numerical attributes that have to be discretized. Therefore, let us assume that the range of values of the attribute Age is divided into 3 intervals: "< 30", "30, ..., 40", and "> 40", and the range of values of the attribute Children is reduced to 3 distinct values: "0", "1 ... 2", and "> 2". The class label attribute Risk has two distinct values: high and low. Therefore, we distinguish two classes (m = 2): C_1 and C_2. Let C_1 denote the class of high risk credit applicants, while C_2 denotes the class of low risk credit applicants. The training database D consists of 14 samples (n = 14), where 6 samples belong to the class C_1 (s_1 = 6) and 8 samples belong to the class C_2 (s_2 = 8). First, we calculate the expected information needed to classify a given sample:

I(s_1, s_2) = -\frac{6}{14}\log_2\frac{6}{14} - \frac{8}{14}\log_2\frac{8}{14} = 0.985

Next, we calculate the entropy of the attribute Age:

E(Age) = \sum_{j=1}^{3} \frac{s_{1j} + s_{2j}}{14} \, I(s_{1j}, s_{2j}) = 0.8221
Similarly, we calculate the entropy and information gain for the rest of the attributes.
The attribute Income has the highest information gain, Gain(Income) = 0.4696; therefore, it is selected as the splitting attribute for the root node of the tree. The root node is labeled with the attribute Income and for each value of the attribute a corresponding branch is created. The set D is partitioned into subsets S_1, S_2, and S_3, where each subset is assigned to the corresponding descendant node. Notice that the set of samples assigned to the branch Income = low is completely pure, i.e. all samples assigned to this branch belong to the same class Risk = high. Therefore, a leaf node is created for this branch and labeled with the class label high. For the two other branches we repeat the process of calculating the entropy and information gain using the samples assigned to each branch. Consider the set of samples assigned to the branch Income = medium. It is easy to notice that two attributes, namely Marital Status and Children, partition the set of samples into pure subsets. Therefore, the entropy of both attributes is zero and the information gain reaches a maximum. Assume the attribute Marital Status is selected as the splitting attribute. A node at the end of the branch Income = medium is created and labeled with Marital Status, and two branches labeled married and divorced are grown. As we noticed before, the attribute Marital Status partitions the set of samples into pure subsets. Therefore, for both branches leaf nodes are created and labeled, respectively, low (for the branch Marital Status = married) and high (for the branch Marital Status = divorced). Consider now the set of samples assigned to the branch Income = high. The algorithm computes the entropy of all predictor attributes but Income for the given set of samples: E(Age) = 0.46, E(Marital Status) = 0.46, E(Children) = 0.4. The attribute Children is selected as the splitting attribute, since it has the highest information gain. A node is created and labeled with Children, and two branches labeled, respectively, "0" and "1 ... 2", are grown. The samples belonging to the partition Children = "1 ... 2" all belong to the same class Risk = low. Therefore, a leaf node is created for this branch and labeled with low. The algorithm continues with the set of samples assigned to the partition Children = "0". Both remaining attributes, i.e. Marital Status and Age, have the same entropy value for this set of samples: E(Age) = 0, E(Marital Status) = 0. The final decision tree induced by the algorithm is depicted in Figure 3.4.
Gini index. Another popular split selection method is based on the gini index measure. To illustrate the idea of the method and present the gini index, consider the schema, depicted in Figure 3.5, of a version of the well known decision tree construction algorithm SPRINT [SAM96].

Notice that each internal node of a decision tree is labeled with a splitting attribute A; moreover, it has an associated predicate q_A, called the splitting predicate, which defines a binary split of the samples at the node.
The gini index of a binary split of the training dataset D into subsets D_1 and D_2 is defined as follows:

gini_{split}(D_1, D_2) = \frac{|D_1|}{|D|} \, gini(D_1) + \frac{|D_2|}{|D|} \, gini(D_2)

where

gini(D) = 1 - \sum_{j=1}^{m} p_j^2

and p_j denotes the relative frequency of the class C_j in D.
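A small Python sketch of the gini index computation (names are ours); the usage line reproduces the value 0.3429 for the split point Age ≤ 42 from Table 3.1 below.

def gini(class_counts):
    """gini(D) = 1 - sum of squared class relative frequencies."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_split(counts1, counts2):
    """Weighted gini index of a binary split into two class-count lists."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

# split point Age <= 42 from Table 3.1: left (C1=6, C2=4), right (C1=0, C2=4)
print(round(gini_split([6, 4], [0, 4]), 4))   # 0.3429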
To illustrate the split selection method based on the gini index, consider once more the example training database D from Figure 3.1. The method does not require that all attributes are categorical, so it is not necessary to discretize numerical attributes. However, the method requires that the training database D is sorted on each numerical attribute at each node of the tree. Let us start with the attribute Age. Table 3.1 shows the value of the gini index for all possible split points in the domain of the numerical attribute Age.
Age split point  C_1 (≤ / >)  C_2 (≤ / >)  gini
25               1 / 5        0 / 8        0.4396
28               2 / 4        0 / 8        0.3809
29               2 / 4        1 / 7        0.5065
31               3 / 3        1 / 7        0.4786
35               3 / 3        2 / 6        0.4571
38               4 / 2        3 / 5        0.4490
39               5 / 1        3 / 5        0.3870
41               5 / 1        4 / 4        0.4318
42               6 / 0        4 / 4        0.3429
45               6 / 0        5 / 3        0.3896
48               6 / 0        6 / 2        0.4286

Table 3.1. Gini index for the attribute Age
Similarly, we have to compute the gini index of all split points and/or splitting subsets for the other predictor attributes. Table 3.2 shows the value of the gini index for all possible splitting subsets in the domain of the categorical attribute Income. The splitting criterion Income ∈ {low} has the lowest value of the gini index; therefore, it is selected as the splitting predicate for the root node. The predicate partitions D into D_1 and D_2, where D_1 contains the set of samples with Income ∈ {low} and D_2 contains all the remaining samples. Note that D_1 is "pure", i.e. all its samples belong to the class high. Therefore, we stop developing that part of the tree and create a leaf node labeled with high. We repeat the process of selecting a splitting criterion for the subset D_2: we compute the gini index of all candidate split points and splitting subsets, and the splitting criterion with the lowest value is selected. The final decision tree built by the algorithm is shown in Figure 3.6.
future data. However, a decision tree built during the growth phase is often too complex, i.e. it overfits the training database. The pruning phase of decision tree construction addresses the problem of overfitting the training data and determines the size of the final tree by removing some branches and nodes from the constructed tree. There are two basic approaches to avoiding overfitting: the prepruning approach and the postpruning approach.

In the prepruning approach, the construction of a tree is stopped early during the growth phase by deciding not to further split the training dataset at a given node. Upon stopping, the node becomes a leaf node and is labeled with the class to which most of the samples at the node belong. The decision when to stop splitting is typically based on a statistical significance test or on a threshold for the impurity measure.
Suppose the node N has two child nodes N_1 and N_2. Let minC_N denote the cost of encoding the minimum cost subtree rooted at N. It is worthwhile to prune the child nodes N_1, N_2 and transform N into a leaf node if C_leaf(N) is no worse than C_split(N) + 1 + minC_{N_1} + minC_{N_2}. In other words, if the cost of encoding the samples (their class labels) at N is lower than or equal to the cost of encoding the subtree rooted at N, then it is worth pruning the child nodes N_1, N_2 and transforming N into a leaf node of the decision tree.
To illustrate the MDL pruning procedure, let us consider a fragment of a decision tree shown in Figure 3.7. The internal node N has the splitting predicate "Age ≤ 22", which splits a set of samples into two subsets assigned to the leaf nodes N_1 and N_2. The cost of the split at the node N is C_split(N) = log J + log(v − 1) = 2.6, since J = 2 and v = 4. Since the splitting predicate partitions the set of samples at the node N into pure subsets, minC_{N_1} = 1 and minC_{N_2} = 1. Therefore, C_split(N) + 1 + minC_{N_1} + minC_{N_2} = 2.6 + 1 + 1 + 1 = 5.6. On the other hand, the cost of encoding the leaf node N is n·E + 1 = 4(−(1/4)log(1/4) − (3/4)log(3/4)) + 1 = 4.245. Since n·E + 1 ≤ C_split(N) + 1 + minC_{N_1} + minC_{N_2}, the nodes N_1 and N_2 are pruned and N is transformed into a leaf node of the decision tree.
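A minimal Python sketch of this MDL test (names are ours; class_counts are the class frequencies of the samples at N, and the subtree costs are assumed to be given):

from math import log2

def should_prune(c_split, min_c1, min_c2, class_counts):
    """MDL test: prune if encoding the leaf is no costlier than the subtree."""
    n = sum(class_counts)
    entropy = -sum(c / n * log2(c / n) for c in class_counts if c > 0)
    c_leaf = n * entropy + 1
    return c_leaf <= c_split + 1 + min_c1 + min_c2

print(should_prune(2.6, 1, 1, [1, 3]))   # True: 4.245 <= 5.6, as in the example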
The rules that are directly extracted from a decision tree are more complex than necessary, and usually they are pruned by removing redundant tests. Given a particular rule, each test in it is considered for deletion by tentatively removing it, working out which of the training samples are covered by the new rule, calculating from this a pessimistic estimate of the accuracy of the new rule, and comparing this with the pessimistic estimate of the accuracy of the original rule. If the accuracy of the new rule is better than that of the original rule, the test is deleted, and the procedure continues by checking the remaining tests for deletion. The rule is left unchanged when there are no more tests that can be deleted. Once all rules have been pruned, it is necessary to check for duplicate rules and remove them from the set of rules. Usually, the set of rules is extended by an additional "default" rule that covers the cases not covered by the other rules. The most frequent class label among the training samples is assigned to this rule as the default.
If the attributes are numeric, most k-nearest neighbor classifiers use the Euclidean distance. In an n-dimensional Euclidean space, the distance between two points X = (x_1, x_2, ..., x_n) and Y = (y_1, y_2, ..., y_n) is defined as

d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
Instead of the Euclidean distance, we may also apply other distance metrics, like the Manhattan distance, the maximum of dimensions, or the Minkowski distance. The choice of a given distance metric depends on the application. The second issue is how to transform a sample into a point in the pattern space. Note that different attributes may have different scales and units, and different variability. Thus, if the distance metric is used directly, the effects of some attributes might be dominated by other attributes that have a larger scale or higher variability. A simple solution to this problem is to weight the various attributes. One common approach is to normalize all attribute values into the range [0, 1]. This solution is sensitive to outliers, since a single outlier could cause virtually all other values to be contained in a small subrange. Another common approach is to apply a standardization transformation, such as subtracting the mean from the value of each attribute and then dividing by its standard deviation. Recently, another approach was proposed which consists in applying the robust space transformation called the Donoho-Stahel estimator [KNZ01]. The estimator has some important and useful properties that make it very attractive for different data mining applications.
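A minimal Python sketch of a k-nearest neighbor classifier with Euclidean distance and min-max normalization (all names are ours, not part of any original formulation):

from math import sqrt
from collections import Counter

def euclidean(x, y):
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(training, query, k=3):
    """training: list of (feature_vector, class_label) pairs."""
    nearest = sorted(training, key=lambda s: euclidean(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def normalize(vectors):
    """Rescale every attribute into [0, 1] (min-max normalization)."""
    dims = range(len(vectors[0]))
    lows = [min(v[i] for v in vectors) for i in dims]
    highs = [max(v[i] for v in vectors) for i in dims]
    return [[(v[i] - lows[i]) / ((highs[i] - lows[i]) or 1) for i in dims]
            for v in vectors]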
The description of other classification methods like case-based reasoning,
genetic algorithms, rough sets, and fuzzy sets can be found in [AP94,KoI93],
[CPS98,Mic92,Mit96,Paw91,SS96,Zad65,Zia94].
There are several methods for estimating classifier accuracy. The choice of a method depends on the amount of sample data available for training and testing. If a large amount of sample data is available, then the following simple holdout method is usually applied. The given set of samples is randomly partitioned into two independent sets, a training set and a test set. Typically, 70% of the data is used for training, and the remaining 30% is used for testing. Provided that both sets of samples are representative, the accuracy of the classifier on the test set will give a good indication of its accuracy on new data. In general, it is difficult to say whether a given set of samples is representative or not, but at least we may ensure that the random sampling of the data set is done in such a way that the class distribution of samples in both the training and test sets is approximately the same as that in the initial data set. This procedure is called stratification.
Note that the classifier accuracy computed on a test set is only an estimate of the true value of the classifier accuracy on the target (new) data set. Assume p denotes the true (unknown) value of the classifier accuracy and f denotes the classifier accuracy measured on a test set. The question is: how close is f to p? The answer is usually expressed as a confidence interval, that is, p lies within a specified interval [f − z, f + z] with a certain specified confidence, which depends on the size of the test set and the data distribution. The following formula, taken from [WF00], gives the values of the upper and lower confidence boundaries for p:

p = \frac{f + \frac{z^2}{2N} \pm z \sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}

where N is the size of the test set and z is the standard-normal quantile corresponding to the desired confidence level. For example, if f = 75% on a test set of size N = 1000, then with 80% confidence (z ≈ 1.28) p lies within the interval [73.3%, 76.8%].
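A small Python sketch of this interval computation (names are ours; z = 1.28 corresponds roughly to 80% confidence):

from math import sqrt

def confidence_interval(f, n, z):
    """Interval for the true accuracy p given observed accuracy f on n samples."""
    center = f + z * z / (2 * n)
    spread = z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - spread) / denom, (center + spread) / denom

print(confidence_interval(0.75, 1000, 1.28))   # roughly (0.732, 0.767)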
If the amount of data for training and testing is limited, the problem is how to use this limited amount of data both for training, to get a good classifier, and for testing, to obtain a correct estimate of the classifier accuracy. The standard and very common technique of measuring the accuracy of a classifier when the amount of data is limited is k-fold cross-validation. In k-fold cross-validation, the initial set of samples is randomly partitioned into k approximately equal, mutually exclusive subsets, called folds, S_1, S_2, ..., S_k. Training and testing is performed k times. At each iteration, one fold is used for testing while the remaining k − 1 folds are used for training. So, at the end, each fold has been used exactly once for testing and k − 1 times for training. The accuracy estimate is the overall number of correct classifications from the k iterations divided by the total number of samples N in the initial dataset. Often, the k-fold cross-validation technique is combined with stratification and is then called stratified k-fold cross-validation. A simple sketch of the fold construction is given below.
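A minimal Python sketch of the fold construction (names are ours; for simplicity the sketch shuffles indices randomly and does not stratify):

import random

def kfold_indices(n, k, seed=0):
    """Partition the sample indices 0..n-1 into k approximately equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# each fold serves once as the test set, the remaining k-1 folds for training
for fold in kfold_indices(10, 3):
    test = set(fold)
    train = [i for i in range(10) if i not in test]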
There are many other methods of estimating classifier accuracy on a particular dataset. Two popular methods are leave-one-out cross-validation and bootstrapping.
3.6 Prediction

The main goal of prediction is to construct a formal concise model for predicting numeric values or value ranges. The constructed model can be used to predict, for example, the sales of a product given its price. As in classification, the input to the prediction problem is a training set of tuples. The outcome of the prediction is a value or a range of values. The classification methods we have discussed in the previous sections work well with numerical as well as categorical predictor attributes. However, when the dependent attribute is numeric, and all the predictor attributes are also numeric, then we may apply well known statistical methods of regression.

Linear regression expresses the dependent attribute as a linear function of the predictor attributes:

x = w_0 + w_1 a_1 + w_2 a_2 + \ldots + w_k a_k

where x is the value of the dependent attribute X, a_1, a_2, ..., a_k are the predictor attribute values, and w_0, w_1, w_2, ..., w_k are regression coefficients. This is a regression equation, and the process of determining the coefficients is called regression.
The coefficients are calculated from the training dataset of tuples by the method of least squares. Assume the first sample from the training dataset has the dependent attribute value x^1 and predictor attribute values a_1^1, a_2^1, ..., a_k^1, where the superscript denotes that it is the first sample. The predicted value for this sample is obtained by substituting its predictor attribute values into the regression equation. Of interest is the difference between the actual value of the sample x^1 and the predicted value given by the above formula. The method of linear regression chooses the regression coefficients w_i, i = 0, 1, ..., k, so as to minimize the sum of the squares of these differences over all training samples. Given n samples, the sum of the squares of the differences is defined as follows:

\sum_{i=1}^{n} \left( x^i - w_0 - \sum_{j=1}^{k} w_j a_j^i \right)^2
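Assuming the numpy library is available, the following is a minimal sketch of the least-squares fit (function and variable names are ours):

import numpy as np

def fit_linear_regression(A, x):
    """A: n-by-k matrix of predictor values, x: n-vector of dependent values.
    Returns w0, w1, ..., wk minimizing the sum of squared differences."""
    A1 = np.hstack([np.ones((A.shape[0], 1)), A])   # column of ones for w0
    w, *_ = np.linalg.lstsq(A1, x, rcond=None)
    return w

A = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 5.0]])
x = np.array([6.0, 6.0, 12.0, 18.0])
print(fit_linear_regression(A, x))   # approximately [0, 2, 2]: x = 2*a1 + 2*a2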
Regression trees. Decision trees are designed to predict class labels of new unseen data. When it comes to predicting numeric values, the same kind of tree representation can be used. Trees used for numeric prediction are just like ordinary decision trees, except that each leaf node of the tree contains a numeric value that represents the average value of all training samples that reach the leaf node. This kind of tree is called a regression tree [BFO+84,WF00]. Regression trees are constructed in a similar way to decision trees. First, a decision tree induction algorithm is used to build the initial tree. Then, when the initial tree is constructed, the tree is pruned by removing some branches and nodes from the constructed tree. The main difference between the construction of a decision tree and a regression tree lies in the splitting criterion. Decision tree induction algorithms find the splitting attribute of a node of the decision tree by minimizing an impurity measure (entropy or gini index). In regression tree construction, the splitting criterion is usually based on the standard deviation of the class values in the training database D as a measure of the error at the node. The attribute that maximizes the expected error reduction is chosen for splitting at the node.
The expected error reduction, called the standard deviation reduction (SDR), is defined as:

SDR = sd(D) - \sum_{i} \frac{|S_i|}{|D|} \cdot sd(S_i)

where S_1, S_2, ... are the sets of training data that result from splitting the node according to the chosen attribute, and sd(D), sd(S_i) denote the standard deviation of D and S_i, respectively.
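A small Python sketch of the SDR computation (names are ours; pstdev computes the population standard deviation):

from statistics import pstdev

def sdr(values, partitions):
    """Standard deviation reduction of splitting `values` into `partitions`."""
    n = len(values)
    return pstdev(values) - sum(len(p) / n * pstdev(p) for p in partitions)

# splitting 6 class values into two subsets by some attribute
values = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
print(sdr(values, [values[:3], values[3:]]))   # large reduction: a good split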
Regression trees are usually larger, more complex and much more difficult
to interpret than corresponding regression equations.
4 Clustering
Clustering is a process of grouping a set of physical or abstract objects into a set of classes, called clusters, according to some similarity function. A cluster is a collection of objects that are similar to one another within the cluster and dissimilar to objects in other clusters. Objects belonging to one cluster can be treated collectively as one group. Unlike in classification, there are no predefined classes or class-labeled training objects. A "good" clustering method produces a number of clusters in which the intra-cluster similarity is high and the inter-cluster similarity is low.
Clustering has a wide range of applications, including marketing, pattern recognition, data analysis, image processing, biology, banking, and information retrieval. For example, in marketing, clustering helps discover groups of customers with similar behavior based on purchasing patterns, discover customers with unusual behavior, discover companies with similar growth or similar energy consumption, etc. Clustering can be used to classify similar documents on the Web for information discovery or to discover groups of Web users with similar access patterns. In biology, clustering can be used for animal or plant classification. In general, by clustering we can discover overall distribution patterns and interesting correlations among object attributes. For other examples of clustering applications see [FPS+96,HK00].
Clustering is a well known research problem intensively studied in many areas, including machine learning, statistics, biology, and data mining. In machine learning, clustering was analyzed as an example of unsupervised classification (or unsupervised learning). In statistics, cluster analysis focused mainly on distance-based cluster analysis, where each object is described as an n-dimensional data feature vector. Recently, due to the huge amount of data collected in databases and data warehouses, clustering has become a highly active topic in data mining research. In data mining, current research on clustering focuses on the scalability of clustering algorithms with respect to the number of objects, the number of dimensions, and the noise level, and on the effectiveness of clustering algorithms for new types of data (numerical, categorical, sequences, unstructured documents, Web pages, etc.) [HK00].
Using this variant, it is possible to obtain the optimal partition starting from any arbitrary initial partition, provided proper threshold values are specified. Another variation of the K-means algorithm involves selecting a different objective function or different strategies to calculate cluster centers. A minimal sketch of the basic K-means procedure is given below.
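The sketch below (names are ours) assumes Euclidean distance and uses a fixed number of iterations in place of the usual convergence test:

import random

def kmeans(points, k, iterations=100):
    """points: list of numeric tuples; returns k cluster centers."""
    centers = random.sample(points, k)
    for _ in range(iterations):
        # assignment step: attach each point to its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # update step: recompute each center as the mean of its cluster
        for j, cluster in enumerate(clusters):
            if cluster:
                centers[j] = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
    return centers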
An interesting generalization of the K-means algorithm is the EM (Expectation Maximization) algorithm [Lau95,Mit97]. The algorithm is a combination of probability-based clustering with the K-means paradigm. From a statistical perspective, the goal of clustering is to find the most likely set of clusters given a set of objects. The foundation of statistical clustering is a statistical model called a finite mixture. A mixture is a set of k probability distributions representing k clusters. Each distribution gives the probability that a particular object would have a certain set of attribute values if it were known to be a member of the given cluster. We assume that the individual components of the mixture model are Gaussian, but with different means and variances. The clustering problem is to take a set of objects and a number of clusters, and work out each cluster's mean and standard deviation and the population distribution between the clusters. If we knew which of the distributions each object came from, finding the parameters of the mixture model would be easy. On the other hand, if we knew the parameters of the model, then finding the probabilities that a given object comes from a given distribution would be easy too. The problem is that we know neither the distribution that each object came from, nor the parameters of the mixture model. So, EM adopts the K-means paradigm to estimate these distributions and parameters from the objects. The EM algorithm begins with an initial estimate of the parameters, then uses them to calculate the cluster probabilities for each object. The calculated probabilities are in turn used to update the parameter estimates, and the process repeats.
One of the main disadvantages of the K-means algorithm is its sensitivity to outliers. In general, by outliers we mean objects that are considerably dissimilar from the remainder of the objects. Outliers may substantially distort the distribution of objects among clusters. To deal with the problem of outliers and noisy data, the K-medoids clustering method has been proposed [KR90,NH94]. The basic idea of the algorithm consists in replacing the center of a cluster, as the reference point of the cluster, by the medoid, which is the most centrally located object in the cluster. The basic strategy of K-medoids clustering algorithms consists in finding k representative objects (medoids) representing k clusters. The strategy then iteratively replaces one of the medoids by one of the non-medoid objects if it improves the quality of the resulting partition. The quality is estimated using a cost function, called the total swapping cost, that measures the average dissimilarity between an object and the medoid of its cluster. One of the first K-medoids clustering algorithms was PAM (Partitioning Around Medoids) [KR90]. The algorithm is less sensitive to outliers and noisy data than K-means algorithms; however, it has a higher processing cost than the K-means algorithms.
PAM works effectively for small data sets, but does not scale well for large data sets. To cope with large data sets, a sampling-based K-medoids algorithm, called CLARA (Clustering LARge Applications), can be used [KR90]. The algorithm works as follows. It draws multiple samples of the data set, applies PAM on each sample, and returns the best clustering as the output. If a sample is selected in a fairly random manner, it should represent the original data set correctly. Multiple samples increase the chance of producing a "good" clustering. The problem is that a good clustering based on samples will not necessarily represent a good clustering of the whole data set. Therefore, the effectiveness of CLARA depends on the size of the sample: the larger the sample size, the greater the probability of finding the best medoids for the clusters. Notice that CLARA looks for the best k medoids among the selected sample of the data set. It may happen that the best medoids of the data set are not among the sampled objects; in this case CLARA will never find the best clustering. Another interesting variant of the K-medoids algorithm is the CLARANS (Clustering Large Applications based on RANdomized Search) algorithm [NH94], which combines randomized search with the PAM and CLARA algorithms. The clustering process is formalized as searching a graph in which each node is a K-partition represented by a set of K medoids. Two nodes of the graph are neighbors if they differ by only one medoid. CLARANS starts with a randomly selected node. For the current node, it checks a random sample of maxneighbor neighbors, where maxneighbor is a user-specified parameter. If a better solution (neighbor) is found, CLARANS moves to that neighbor and continues. Otherwise, it records the current node as a local optimum and starts again with a new randomly selected node. The algorithm stops after a number of local optima have been found and returns the best one.
At each step the algorithm constructs a graph in which pairs of clusters closer than d_k are connected by a graph edge. If all initial clusters are members of a connected graph, the algorithm stops. The output of the algorithm is a nested hierarchy of graphs (a tree of clusters), called a dendrogram, representing the nested grouping of objects and the similarity levels at which groupings change. A clustering of objects is obtained by cutting the dendrogram at the desired dissimilarity level; each connected component in the corresponding graph then forms a cluster (see Figure 4.3).
3. modifying the path to the leaf node: after inserting a new entry into a leaf
node, update the CF entries for each non-leaf node on the path from the
root node to the leaf node. In the absence of a split, this simply involves
updating CF-vectors at non-leaf nodes, otherwise, split non-leaf nodes if
necessary,
4. merging refinement: if the CF-tree is too large, condense the tree by
merging the closest leaf nodes.
Phase 1 Scan all data and build an initial in-memory CF-tree using the
given amount of memory and recycling space on disk.
Phase 2 (optional) Scan the leaf entries in the initial CF-tree to rebuild a
smaller CF tree, while removing outliers and grouping crowded clusters
into larger ones.
Phase 3 Cluster all leaf entries by applying an existing (hierarchical ag-
glomerative) clustering algorithm directly to the subclusters represented
by their CF-vectors.
Phase 4 (optional) Pass over the data to correct inaccuracies and refine
clusters further.
Data → Draw random sample → Partition sample → Partially cluster partitions → Eliminate outliers → Cluster partial clusters → Label data on disk

Fig. 4.5. CURE: the algorithm
Density-based clustering methods have been developed mainly for clustering spatial databases. The clustering process in these methods is based on the notion of density: they regard clusters as dense regions of objects in the data space that are separated by regions of low density. The basic idea of these methods is to grow a given cluster as long as the density in the "neighborhood" of the cluster exceeds some threshold value. Density-based methods have several interesting properties: they are able to discover clusters of arbitrary shape, they handle outliers, and they usually need only one scan over the data set. A well-known example of a density-based method is the DBSCAN algorithm [EKS+96]. DBSCAN defines clusters as maximal density-connected sets of objects. The algorithm requires the user to specify two parameters that define the minimum density: ε — the maximum radius of the neighborhood, and minpts — the minimum number of objects in the ε-neighborhood of an object. If the ε-neighborhood of an object contains at least minpts objects, then the object is called a core object. To determine clusters, DBSCAN uses two concepts: density-reachability and density-connectivity. An object o_j is directly density-reachable from an object o_i with respect to ε and minpts if: (1) o_j belongs to the ε-neighborhood of o_i, and (2) the ε-neighborhood of o_i contains at least minpts objects (o_i is a core object). Density-reachability is the transitive closure of direct density-reachability. An object o_j is density-connected to an object o_i with respect to ε and minpts if there is an object o_k such that both o_j and o_i are density-reachable from o_k with respect to ε and minpts. The following steps outline the algorithm: (1) start from an arbitrary object o; (2) if the ε-neighborhood of o satisfies the minimum density condition, a cluster is formed and the objects belonging to the ε-neighborhood of o are added to the cluster; otherwise, if o is not a core object, DBSCAN selects the next object; (3) continue the process until all objects have been processed. A density-based cluster is a set of density-connected objects that is maximal with respect to the density-reachability relationship. Every object not contained in any cluster is considered to be an outlier. To determine the ε-neighborhood of a given object, DBSCAN uses index structures, like the R-tree or its variants, or nearest-neighbor search. A minimal sketch of the algorithm is given below.
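The following Python sketch (names are ours) implements the cluster-growing idea; the neighborhood query is a naive linear scan rather than an index lookup:

def dbscan(points, eps, minpts, dist):
    """Returns clusters as lists of point indices; unassigned points are outliers."""
    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]
    assigned, clusters = set(), []
    for i in range(len(points)):
        if i in assigned or len(neighbors(i)) < minpts:
            continue                      # not an unvisited core object
        cluster, frontier = [], [i]       # grow a new cluster from the core object
        while frontier:
            j = frontier.pop()
            if j in assigned:
                continue
            assigned.add(j)
            cluster.append(j)
            nbrs = neighbors(j)
            if len(nbrs) >= minpts:       # expand only through core objects
                frontier.extend(nbrs)
        clusters.append(cluster)
    return clusters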
Other interesting examples of density-based algorithms are DBCLASD [XEK98] and OPTICS [ABK+99] (extensions of DBSCAN), and DENCLUE [HK98].
Categorical attributes cannot be ordered or averaged in a way similar to numeric values; it is even difficult to say that one name of a car is "like" or "unlike" another name. Traditional clustering algorithms are, in general, not appropriate for clustering data sets with categorical attributes [GRS99b]. Therefore, new concepts and methods have been developed for clustering categorical attributes [Hua98,GGR99,GKR98,GRB99,GRS99b,HKK+98]. In the following subsection we briefly present one of the proposed methods, called ROCK, to illustrate the basic concepts and ideas developed for clustering categorical attributes.
where C_i denotes cluster i of size n_i, k denotes the required number of clusters, and f(θ) denotes a function that depends on the data set as well as on the kind of clusters. The function has the following property: each object belonging to cluster C_i has approximately n_i^{f(θ)} neighbors in C_i. The best clusters are the ones that maximize the value of the criterion function.
Data → Draw random sample → Cluster with links → Label data on disk
Outliers are of interest to users, since they often carry useful information about abnormal behavior of the system described by the set of objects. Indeed, for some applications, rare or abnormal events or objects are much more interesting than the common ones from a knowledge discovery standpoint. Sample applications include credit card fraud detection, network intrusion detection, monitoring of criminal activities in electronic commerce, and monitoring the tectonic activity of the earth's crust [KNT00]. Outlier detection and analysis is an interesting and important data mining task, referred to as outlier mining.
The algorithms for outlier detection can, in general, be classified into two approaches [HK00]: (1) the statistical approach and (2) the distance-based approach.
The concept of outliers has been studied quite extensively in computational statistics [BL94,Haw80]. The statistical approach to outlier detection assumes that the objects in the data set are modeled using a stochastic distribution, and objects are determined to be outliers with respect to the model using a discordancy test. Over 100 discordancy tests have been developed for different circumstances, depending on: (1) the data distribution, (2) whether or not the distribution parameters (e.g. mean and variance) are known, (3) the number of expected outliers, and even (4) the types of expected outliers (e.g. upper or lower outliers in an ordered sample). However, almost all of the discordancy tests suffer from two serious problems. First, most of the tests are univariate (i.e. concern a single attribute). This restriction makes them unsuitable for multidimensional data sets. Second, all of them are distribution-based, i.e. they require knowledge of the properties of the data set, such as the data distribution. In many cases, we do not know the data distribution, and we would have to perform extensive testing to find a multidimensional distribution that fits the data.
The distance-based approach defines outliers by using the distances of the
objects from one another. For example, Knorr and Ng [KN98,KNT00] define
an outlier in the following way: an object o in a data set D is a
distance-based (DB) outlier with respect to the parameters k and d, that is,
a DB(k, d) outlier, if no more than k objects in the data set are at a
distance d or less from o. According to this definition, DB outliers are
those objects that do not have "enough" neighbors, where neighbors are
defined in terms of the distance from the given object. As pointed out in [RRS00],
this measure is very sensitive to the parameter d, which is hard to
determine a priori. Moreover, when the dimensionality increases, it becomes
increasingly difficult to define d, since most of the objects are likely to
lie in a thin shell about any other object. This means that d has to be
defined very accurately in order to find a reasonable number of outliers. To
overcome the problem of the distance parameter d, Ramaswamy, Rastogi, and
Shim [RRS00] introduced another definition of an outlier. Let D^k(o) denote
the distance of the k-th nearest neighbor of the object o in a data set D.
Then, according to [RRS00], outliers are defined as follows: given k and n,
an object o is an outlier if no more than n - 1 other objects in D have a
higher value of D^k; in other words, the n objects with the largest D^k
values are reported as outliers.
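Both definitions translate directly into the following naive Python sketches; the quadratic-time scans merely stand in for the efficient index-based, cell-based, and partition-based algorithms developed in [KN98,KNT00,RRS00], and the function names are ours.

from math import dist

def db_outliers(data, k, d):
    # Knorr and Ng's DB(k, d) outliers: objects with at most k other
    # objects within distance d.
    return [o for i, o in enumerate(data)
            if sum(1 for j, p in enumerate(data)
                   if j != i and dist(o, p) <= d) <= k]

def top_n_dk_outliers(data, k, n):
    # Ramaswamy, Rastogi, and Shim: the n objects with the largest
    # distance D^k to their k-th nearest neighbor (requires k < len(data)).
    def dk(i):
        dists = sorted(dist(data[i], data[j])
                       for j in range(len(data)) if j != i)
        return dists[k - 1]
    ranked = sorted(range(len(data)), key=dk, reverse=True)
    return [data[i] for i in ranked[:n]]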
There are some studies in the literature that focus on identifying the
deviations in large multidimensional data sets [CSD98,JKM99,SAM98]. The
proposed techniques are significantly different from those of outlier detection,
but the idea behind the techniques is very similar: to identify objects (or data
values) that are "intuitively surprising". Sarawagi, Agrawal, and Megiddo de-
veloped the deviation detection technique to find deviations in OLAP data
cubes [SAM98]. The authors define a deviation as a data value that is signif-
icantly different from the expected value computed from a statistical model.
The technique is a form of discovery-driven exploration where some precom-
puted measures indicating data deviations are used to guide the user in data
analysis. So, the user navigates through the data cube, visually identifying
interesting cells that are flagged as deviations. The user can drill down further
to analyze lower levels of the cube, thus, the user can detect deviations at var-
ious levels of data aggregation. The deviation detection process is overlapped
with cube computation to increase efficiency. This interactive technique
involves the user in the discovery process, which may be difficult since the
search space is typically very large, particularly when there are many
dimensions of analysis. The work of Chakrabarti, Sarawagi, and Dom deals with
the problem of finding surprising temporal patterns in market basket data
[CSD98], while Jagadish, Koudas, and Muthukrishnan propose an efficient
method for finding data deviations in time-series databases [JKM99].
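The following Python sketch conveys the flavor of this approach on a single two-dimensional slice of a cube: expected cell values come from a simple additive model of row and column effects, and cells with large standardized residuals are flagged. The additive model and the fixed threshold are simplifying assumptions; the model actually fitted in [SAM98] is considerably richer.

def flag_deviations(cube, threshold=2.0):
    rows, cols = len(cube), len(cube[0])
    total = sum(sum(r) for r in cube) / (rows * cols)
    row_mean = [sum(r) / cols for r in cube]
    col_mean = [sum(cube[i][j] for i in range(rows)) / rows
                for j in range(cols)]
    # Residual of each cell against the additive expectation.
    residuals = [[cube[i][j] - (row_mean[i] + col_mean[j] - total)
                  for j in range(cols)] for i in range(rows)]
    flat = [r for row in residuals for r in row]
    sigma = (sum(r * r for r in flat) / len(flat)) ** 0.5
    return [(i, j) for i in range(rows) for j in range(cols)
            if sigma > 0 and abs(residuals[i][j]) > threshold * sigma]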
5 Conclusions
In this chapter, we have described and discussed the fundamental data mining
methods. Since data mining is an area of very intensive research, there are
many related problems that still remain open. The most commonly discussed
data mining issues include interactive and iterative mining, data mining query
languages, the pattern interestingness problem, and visualization of data mining
results. The reason for perceiving data mining as an interactive and iterative
process is that it is difficult for users to know exactly what they want to have
discovered. Typically, users experiment with different constraints imposed on
a data mining algorithm, e.g. different minimum support values, to narrow
the resulting patterns down to those which are interesting to them. Such an
iterative process would normally require rerunning of the basic data mining
algorithm. However, if the user constraints change slightly between iterations,
then possibly previous results of the data mining algorithm can be used in
order to answer the new request. Similarly, the concept of a materialized
view should be considered here to provide for optimizations of frequent data
mining tasks. Another method to provide for efficient iterative data mining
of very large databases is to apply appropriate sampling techniques for fast
discovery of an initial set of patterns. After a user is satisfied with the
rough result based on the sample, the complete algorithm can be executed
to deliver the final and precise set of resulting patterns.
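A rough Python sketch of this sample-first strategy is given below; candidate generation is omitted, the candidate patterns are assumed to come from some frequent itemset algorithm, and the slightly lowered sample threshold is an illustrative heuristic intended to reduce false misses.

import random

def support(transactions, itemset):
    # Fraction of transactions (sets of items) containing the itemset.
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def sample_then_verify(transactions, candidates, minsup, sample_size=1000):
    sample = random.sample(transactions, min(sample_size, len(transactions)))
    rough = [c for c in candidates if support(sample, c) >= 0.9 * minsup]
    # One scan of the full database confirms or rejects the rough patterns.
    return [c for c in rough if support(transactions, c) >= minsup]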
Data mining can be seen as advanced database querying, in which a user
describes a data mining problem by means of a declarative query language
and then the data mining system executes the query and delivers the results
back to the user. The declarative data mining query language should be
based on a relational query language (such as SQL), since it would be useful
to mine relational query results. The language should allow users to define
data mining tasks by facilitating the specification of the data sources, the
domain knowledge, the kinds of patterns to be mined and the constraints to
be imposed on the discovered patterns. Such a language should be integrated
with a database query language and optimized for efficient and flexible data
mining.
The fundamental goal of data mining algorithms is to discover interesting
patterns. Patterns which are interesting to one user need not be interesting
to another. Users should provide the data mining algorithms with specific
interestingness measures, and the algorithms should employ these measures
to optimize the mining process. Such interestingness measures include
statistical measures, such as support and confidence.
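For illustration, the following Python sketch computes two such statistical measures for an association rule A -> B over transactions represented as sets of items; the helper names are ours.

def support(transactions, itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    return support(transactions, a | b) / support(transactions, a)

def lift(transactions, a, b):
    # lift > 1: A and B occur together more often than expected under
    # statistical independence.
    return confidence(transactions, a, b) / support(transactions, b)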
References
[ABK+99] Ankerst, M., Breunig, M., Kriegel, H-P., Sander, J., OPTICS: order-
ing points to identify the clustering structure, Proc. ACM SIGMOD
Conference on Management of Data, 1999, 49-60.
[AGG+98] Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P., Automatic
subspace clustering of high dimensional data for data mining appli-
cations, Proc. ACM SIGMOD Conference on Management of Data,
1998, 94-105.
[Aha92] Aha, D., Tolerating noisy, irrelevant, and novel attributes in instance-
based learning algorithms, International Journal of Man-Machine
Studies 36(2), 1992, 267-287.
[AIS93] Agrawal, R., Imielinski, T., Swami, A., Mining association rules be-
tween sets of items in large databases, Proc. ACM SIGMOD Confer-
ence on Management of Data, 1993, 207-216.
[And73] Anderberg, M.R., Cluster analysis for applications, Academic Press,
New York, 1973.
[AP94] Aamodt, A., Plaza, E., Case-based reasoning: foundational issues,
methodological variations, and system approaches, AI Communica-
tions 7, 1994, 39-52.
[ARS98] Alsabti, K., Ranka, S., Singh, V., CLOUDS: a decision tree classifier
for large datasets, Proc. 4th International Conference on Knowledge
Discovery and Data Mining (KDD'98), 1998, 2-8.
[AS94] Agrawal, R., Srikant, R., Fast algorithms for mining association
rules, Proc. 20th International Conference on Very Large Data Bases
(VLDB'94), 1994, 478-499.
[AS95] Agrawal, R., Srikant, R., Mining sequential patterns, Proc. 11th In-
ternational Conference on Data Engineering, 1995, 3-14.
[AS96] Agrawal, R., Shafer, J.C., Parallel mining of association rules, IEEE
Transactions on Knowledge and Data Engineering, vol. 8, No. 6, 1996,
962-969.
[AY01] Aggarwal, C.C., Yu, P.S., Outlier detection in high dimensional data,
Proc. ACM SIGMOD Conference on Management of Data, 2001, 37-
46.
[BFO+84] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., Classification
and regression trees, Wadsworth, Belmont, 1984.
[Bis95] Bishop, C., Neural networks for pattern recognition, Oxford Univer-
sity Press, New York, NY, 1995.
[BKN+00] Breunig, M.M., Kriegel, H-P., Ng, R.T., Sander, J., LOF: identify-
ing density-based local outliers, Proc. ACM SIGMOD Conference on
Management of Data, 2000, 93-104.
[BKS+90] Beckmann, N., Kriegel, H-P., Schneider, R., Seeger, B., The R*-tree:
an efficient and robust access method for points and rectangles, Proc.
ACM SIGMOD Conference on Management of Data, 1990, 322-331.
[BL94] Barnett, V., Lewis, T., Outliers in statistical data, John Wiley, 1994.
[BMU+97] Brin, S., Motwani, R., Ullman, J.D., Tsur, S., Dynamic itemset count-
ing and implication rules for market basket data, Proc. ACM SIG-
MOD Conference on Management of Data, 1997, 255-264.
[BWJ+98] Bettini, C., Wang, X.S., Jajodia, S., Lin, J., Discovering frequent
event patterns with multiple granularities in time sequences, IEEE
Transactions on Knowledge and Data Engineering, vol. 10, No. 2,
1998, 222-237.
[CHN+96] Cheung, D.W., Han, J., Ng, V., Wong, C.Y., Maintenance of discov-
ered association rules in large databases: an incremental updating
technique, Proc. 12th International Conference on Data Engineering,
1996, 106-114.
[CHY96] Chen, M.S., Han, J., Yu, P.S., Data mining: an overview from a data-
base perspective, IEEE Trans. Knowledge and Data Engineering 8,
1996, 866-883.
[CPS98] Cios, K., Pedrycz, W., Swiniarski, R., Data mining methods for knowl-
edge discovery, Kluwer Academic Publishers, 1998.
[CS96] Cheeseman, P., Stutz, J., Bayesian classification (autoclass): theory
and results, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthu-
rusamy (eds.), Advances in Knowledge Discovery and Data Mining,
MIT Press, 1996, 153-180.
[CSD98] Chakrabarti, S., Sarawagi, S., Dom, B., Mining surprising patterns
using temporal description length, Proc. 24th International Conference
on Very Large Data Bases (VLDB'98), 1998, 606-617.
[DH73] Duda, R.O., Hart, P.E., Pattern classification and scene analysis,
John Wiley, New York, 1973.
[EKS+96] Ester, M., Kriegel, H-P., Sander, J., Xu, X., A density-based algo-
rithm for discovering clusters in large spatial databases with noise,
Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining
(KDD'96), 1996, 226-231.
[FMM+96] Fukuda, T., Morimoto, Y., Morishita, S., Tokuyama, T., Construct-
ing efficient decision trees by using optimized association rules, Proc.
22nd Conference on Very Large Data Bases (VLDB'96), 1996, 146-
155.
[FPM91] Frawley, W.J., Piatetsky-Shapiro, G., Matheus, C.J., Knowledge dis-
covery in databases: an overview, G. Piatetsky-Shapiro, W. Frawley
(eds.), Knowledge Discovery in Databases, AAAI/MIT Press, 1991.
[HKK+98] Han, E., Karypis, G., Kumar, V., Mobasher, B., Hypergraph based
clustering in high-dimensional data sets: a summary of results, Bul-
letin of the Technical Committee on Data Engineering, 21(1), 1998,
15-22.
[Haw80] Hawkins, D., Identification of outliers, Chapman and Hall, 1980.
[HPM+00] Han, J., Pei, J., Mortazavi-Asl, B., Chen, Q., Dayal, U., Hsu, M-
C., FreeSpan: frequent pattern-projected sequential pattern mining,
Proc. 6th International Conference on Knowledge Discovery and Data
Mining (KDD'00), 2000, 355-359.
[HPY00] Han, J., Pei, J., Yin, Y., Mining frequent patterns without candi-
date generation, Proc. ACM SIGMOD Conference on Management
of Data, 2000, 1-12.
[HS93] Houtsma, M., Swami, A., Set-oriented mining of association rules,
Research Report RJ 9567, IBM Almaden Research Center, San Jose,
California, USA, 1993.
[Hua98] Huang, Z., Extensions to the K-means algorithm for clustering large
data sets with categorical values, Data Mining and Knowledge Dis-
covery 2, 1998, 283-304.
[IM96] Imielinski, T., Mannila, H., A database perspective on knowledge dis-
covery, Communications of the ACM 39, 1996, 58-64.
[Jam85] James, M., Classification algorithms, John Wiley, New York, 1985.
[JD88] Jain, A.K., Dubes, R.C., Algorithms for Clustering Data, Prentice
Hall, Englewood Cliffs, NJ, 1988.
[JKK99] Joshi, M., Karypis, G., Kumar, V., A universal formulation of se-
quential patterns, Technical Report 99-21, Department of Computer
Science, University of Minnesota, Minneapolis, 1999.
[JKM99] Jagadish, H.V., Koudas, N., Muthukrishnan, S., Mining deviants in
a time series database, Proc. 25th International Conference on Very
Large Data Bases (VLDB'99), 1999, 102-113.
[JMF99] Jain, A.K., Murty, M.N., Flynn, P.J., Data clustering: a survey, ACM
Computing Surveys 31, 1999, 264-323.
[KAK+97] Karypis, G., Aggarwal, R., Kumar, V., Shekhar, S., Multilevel hyper-
graph partitioning: application in VLSI domain, Proc. ACM/IEEE
Design Automation Conference, 1997, 526-529.
[KN98] Knorr, E.M., Ng, R.T., Algorithms for mining distance-based outliers
in large datasets, Proc. 24th International Conference on Very Large
Data Bases (VLDB'98), 1998, 392-403.
[KNT00] Knorr, E.M., Ng, R.T., Tucakov, V., Distance-based outliers: algo-
rithms and applications, VLDB Journal 8(3-4), 2000, 237-253.
[KNZ01] Knorr, E.M., Ng, R.T., Zamar, R.H., Robust space transformation
for distance-based operations, Proc. 7th International Conference on
Knowledge Discovery and Data Mining (KDD'2001), 2001, 126-135.
[Koh95] Kohavi, R., The power of decision tables, N. Lavrac, S. Wrobel (eds.),
Lecture Notes in Computer Science 912, Machine Learning: ECML-
95, 8th European Conference on Machine Learning, Springer Verlag,
Berlin, 1995, 174-189.
[Kol93] Kolodner, J.L., Case-based reasoning, Morgan Kaufmann, 1993.
[KR90] Kaufman, L., Rousseeuw, P.J., Finding groups in data: an introduc-
tion to cluster analysis, John Wiley & Sons, 1990.
11. Data Mining 563
[PHM+00] Pei, J., Han, J., Mortazavi-Asl, B., Zhu, H., Mining access patterns ef-
ficiently from Web logs, Proc. 4th Pacific-Asia Conference on Knowl-
edge Discovery and Data Mining (PAKDD'00), 2000, 396-407.
[PHM+01] Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., Hsu,
M-C., PrefixSpan: mining sequential patterns efficiently by prefix-
projected pattern growth, Proc. 17th International Conference on
Data Engineering (ICDE'01), 2001, 215-224.
[PZO+99] Parthasarathy, S., Zaki, M.J., Ogihara, M., Dwarkadas, S., Incremen-
tal and interactive sequence mining, Proc. 8th International Confer-
ence on Information and Knowledge Management, 1999, 251-258.
[QR89] Quinlan, J.R., Rivest, R.L., Inferring decision trees using the mini-
mum description length principle, Information and Computation 80,
1989, 227-248.
[Qui86] Quinlan, J.R., Induction of decision trees, Machine Learning, vol. 1,
No. 1, 1986, 81-106.
[Qui93] Quinlan, J. R., C4.5: programs for machine learning, Morgan Kauf-
mann, 1993.
[RHW86] Rumelhart, D.E., Hinton, G.E., Williams, R.J., Learning internal rep-
resentation by error propagation, D.E. Rumelhart, J.L. McClelland
(eds.), Parallel Distributed Processing, MIT Press, 1986, 318-362.
[Rip96] Ripley, B., Pattern recognition and neural networks, Cambridge Uni-
versity Press, Cambridge, 1996.
[RRS00] Ramaswamy, S., Rastogi, R., Shim, K., Efficient algorithms for mining
outliers from large data sets, Proc. ACM SIGMOD Conference on
Management of Data, 2000, 427-438.
[RS98] Rastogi, R., Shim, K., PUBLIC: a decision tree classifier that inte-
grates building and pruning, Proc. 24th International Conference on
Very Large Data Bases (VLDB'98), 1998, 404-415.
[SA95] Srikant, R., Agrawal, R., Mining generalized association rules, Proc.
21st International Conference on Very Large Data Bases (VLDB'95),
1995, 407-419.
[SA96a] Srikant, R., Agrawal, R., Mining quantitative association rules in
large relational tables, Proc. ACM SIGMOD Conference on Man-
agement of Data, 1996, 1-12.
[SA96b] Srikant, R., Agrawal, R., Mining sequential patterns: generalizations
and performance improvements, P.M.G. Apers, M. Bouzeghoub, G.
Gardarin (eds.) Lecture Notes in Computer Science 1057, Advances
in Database Technology - EDBT'96, 5th International Conference on
Extending Database Technology, 1996, 3-17.
[SAM96] Shafer, J., Agrawal, R., Mehta, M., SPRINT: a scalable parallel clas-
sifier for data mining, Proc. International Conference on Very Large
Data Bases (VLDB'96), 1996, 544-555.
[SAM98] Sarawagi, S., Agrawal, R., Megiddo, N., Discovery-driven exploration
of OLAP data cubes, Proc. International Conference on Extending
Database Technology (EDBT'98), 1998, 168-182.
[Sch96] Schikuta, E., Grid clustering: an efficient hierarchical clustering
method for very large data sets, Proc. International Conference on
Pattern Recognition, 1996, 101-105.