Fuxman Thesis

Efficient Query Processing Over Inconsistent Databases
by
Ariel Damian Fuxman
A thesis submitted in conformity with the requirements
for the degree of Ph.D. in Computer Science
Graduate Department of Computer Science
University of Toronto
Copyright © 2007 by Ariel Damian Fuxman
Abstract
Efficient Query Processing Over Inconsistent Databases
Ariel Damian Fuxman
Ph.D. in Computer Science
Graduate Department of Computer Science
University of Toronto
2007
Although integrity constraints have long been used to maintain data consistency, there
are situations in which they may not be enforced or satisfied. In this thesis, we present
ConQuer, a system for efficient and scalable answering of SQL queries on databases
that may violate a set of constraints. ConQuer permits users to postulate a set of key
constraints together with their queries. The system rewrites the queries to retrieve all
(and only) data that is consistent with respect to the constraints. The rewriting is into
SQL, so the rewritten queries can be efficiently optimized and executed by commercial
database systems.
The problem of obtaining consistent answers for primary key constraints and Select-
Project-Join (SPJ) queries is known to be intractable in general. However, we identify
a large and practical class of SPJ queries for which the problem is tractable. For this
class of queries, we provide a query rewriting algorithm that can be executed in linear
time in the size of the query. We consider SPJ queries that may have either set or bag
semantics. For the latter case, the queries may also have grouping and aggregation. We
show the maximality of the class of queries, in the sense that minimal relaxations of its
conditions may lead to intractability. Finally, we study the efficiency and scalability of the
query rewritings on a commercial database system. The study shows that the overhead
of the rewritings is reasonable, when we consider the original (non-rewritten) queries
as a baseline. The experiments use representative queries from TPC-H (the standard
benchmark for decision support systems) and databases of up to 20 GB.
A mis padres Silvia y Miguel
Acknowledgements
First and foremost, I would like to thank my supervisor, Renee J. Miller, for her constant
encouragement and support. During these years, I have benefited tremendously from her
remarkable vision and experience. She has been the greatest mentor, always available for
discussion and guidance. I will always be grateful for the endless hours she devoted to
reading and correcting my drafts, and for the numerous times she stayed at the university
until very late to help me out before conference deadlines.
I am grateful to the members of my committee (John Mylopoulos, Mariano Consens,
and Thodoros Topaloglou) for thoroughly reading my thesis and for their valuable feedback.
I also thank Leopoldo Bertossi for serving as the external reviewer of the thesis, and
for coming to Canada during his sabbatical in Chile with the sole purpose of attending
my thesis defense.
I am indebted to Alberto Mendelzon, who sadly passed away the year before I completed
my thesis. Alberto was not only an outstanding researcher, but also the warmest
and most generous person. At the beginning of my stay in Canada, I needed a job
offer in order to obtain permanent resident status. Alberto hardly knew me at that time
(I was then not even a member of the Database Group), but as soon as he heard about
my situation, he offered me a position as Research Associate in his group.
In 2004, I had the opportunity of visiting Phokion Kolaitis and Wang-Chiew Tan at
University of California at Santa Cruz. It was a joy to work with both of them. They
were also wonderful hosts, and I thank them for their hospitality. During the summer of
2005, I did an internship with the Clio group at IBM Almaden, working with Mauricio
Hernandez, Lucian Popa, and Howard Ho. I very much enjoyed my time at Almaden,
where I had an opportunity to learn how research is done at an industrial lab. Special
thanks go to Mauricio for his unwavering support during the internship.
For the implementation of the ConQuer system, I received invaluable help from my
brother Diego. I convinced him to do his final undergraduate project on the topic of
consistent query answering, and his contribution was fundamental for the demo that we
gave at VLDB in Trondheim. Diego, I am proud of your work! I also thank Jiang Du for
his help in building up the experimental framework used in Chapter 7.
Many people helped to make these years in Toronto a very enjoyable experience. I
especially thank the Latin American gang (Sebastian Sardiña, Andres Lagar-Cavilla,
Carlos Hurtado, Blas Melissari, Flavio Rizzolo, Pablo Sala, and many others) for their
friendship. I will always remember our long, heated debates at the Graduate Lounge,
which gained us the reputation of being the loudest group of people in the Department.
I am also grateful to Patricia Rodriguez Gianolli for her support during the last year of
my Ph.D.
And last, but definitely not least, I would like to thank my parents, Silvia and Miguel,
and my brothers, Adrian and Diego, for always being there, despite the distance: without
their love and support none of this would have ever been possible.
Contents
1 Introduction
1.1 Motivation
1.2 Consistent Query Answering
1.3 Contributions
1.4 Organization of the Document
2 Formal Framework
2.1 Repairs
2.2 Query Answering Semantics
2.3 Query Rewritings
2.4 Constraints
2.5 Related Work
3 Rewritings for Conjunctive Queries
3.1 A Broad Class of First-Order Rewritable Queries
3.1.1 Notation for Conjunctive Queries
3.1.2 Join Graph
3.1.3 The Class $\mathcal{C}_{forest}$ of First-Order Rewritable Queries
3.2 Query Rewriting Algorithm
3.3 Correctness of the Algorithm
3.3.1 Properties of Repairs
3.3.2 A Structural Property of $\mathcal{C}_{forest}$
3.3.3 A Pessimistic Repair
3.3.4 Correctness of RewriteLocal
3.3.5 Correctness of RewriteTree
3.3.6 Correctness of RewriteForest
3.4 Related Work
4 Rewritings for Queries with Grouping and Aggregation
4.1 Formal Language
4.2 Algorithms
4.2.1 Queries with Bag Semantics
4.2.2 Queries with the sum, min, and max Functions
4.3 Correctness of the Algorithms
4.3.1 Building Upon First-Order Rewritings
4.3.2 An Optimistic Repair
4.3.3 Sound Ranges
4.3.4 Tight Ranges
4.3.5 Putting It All Together
4.4 Related Work
5 Complexity-Theoretic Analysis
5.1 Minimal Relaxations of $\mathcal{C}_{forest}$
5.2 A Dichotomy Result
5.2.1 The Class $\mathcal{C}$
5.2.2 Basic Intractable Cases
5.2.3 Generalizing the Basic Cases
5.3 Related Work
6 ConQuer: System Implementation and SQL Rewritings
6.1 System Implementation
6.2 ConQuer Rewritings for Queries without Aggregation
6.2.1 Rewriting Algorithm
6.2.2 Examples
6.3 ConQuer Rewritings for SPJ Queries with Aggregation
6.3.1 Rewriting Algorithm
6.3.2 Examples
6.4 Exploiting Precomputed Annotations
6.5 Related Work
7 Experimental Analysis
7.1 Experimental Framework
7.1.1 System and Database Manager Configuration
7.1.2 Inconsistent Database Instances
7.1.3 Workload
7.2 Experimental Results
7.2.1 Scalability
7.2.2 Effect of Degree of Inconsistency
8 Conclusions and Future Work
Bibliography
A TPC-H Queries and their Rewritings
B Design Advisor Indices
Chapter 1
Introduction
1.1 Motivation
The presence of inconsistent data is known to be a major problem in enterprises. However,
data analysts often make business decisions based on inconsistent data; and their
database systems rarely give any warning or indication about this situation. In fact,
current database management systems are largely unable to give such a warning because
they rely upon the fundamental assumption that the underlying data is consistent. In
this thesis, we tackle this problem by providing a set of tools that enable users to obtain
meaningful answers from databases even if they are partially inconsistent.
Integrity constraints have long been used by database management systems in order
to maintain data consistency. The typical data design process focuses on developing a set
of constraints that ensure that every possible database reflects a valid, consistent state
of the world. However, integrity constraints may not always be enforced or satisfied for
a number of reasons. For example, when data is integrated from multiple sources, each
source may satisfy a constraint (for example, a key constraint), but the merged data may
not (for example, if the same key value exists in multiple sources). More generally, when
data is exchanged between independently designed sources with different constraints, the
exchanged data may not satisfy the constraints of the destination schema. As another
example, in some environments, checking the consistency of constraints may be too
expensive, particularly for workloads with high update rates. Hence, the database may
become inconsistent with respect to the (unenforced) integrity constraints. In addition
to these long-standing problems, the trend toward autonomous computing is making the
need to manage inconsistent data more acute. In autonomous environments, we can no
longer assume that data are married with a single set of constraints that define their
semantics. As constraints are used in an increasing number of roles (from modelling
the query capabilities of a system, to defining mappings between independent sources),
there is an increasing number of applications in which data must be used with a set
of independently designed constraints. In such applications, a static approach where
consistency (with respect to a fixed set of constraints) is enforced on the database may
not be appropriate. Rather, a dynamic approach in which inconsistent data is tolerated,
but consistency is taken into account at query time, permits the constraints to evolve
independently from the data.
One strategy for managing inconsistent databases is data cleaning [DJ03]. Data
cleaning techniques seek to identify and correct errors in the data, and can be used to
restore an inconsistent database to a consistent state. Data cleaning, when applicable,
can be very successful. However, it is necessarily a semiautomatic process, which makes
it infeasible or unaffordable for some applications. Furthermore, committing to a single
cleaning strategy may not always be appropriate. A user may wish to experiment with
different cleaning strategies, or may desire to retain all data, even inconsistent data,
for tasks such as lineage tracing. Finally, data cleaning is only applicable to data that
contains errors. However, the violation of a constraint may also indicate that the data
contains exceptions, that is, clean data which simply does not satisfy a constraint.
In this thesis, we consider inconsistent databases that may violate a set of primary
key constraints. This type of constraint (together with foreign key constraints) is the
most commonly used in commercial database systems. Furthermore, databases that
violate primary key constraints are ubiquitous in enterprises. For example, in the domain
of Customer Relationship Management (CRM), data sources often contain conflicting
information about the same customer. Notably, commercial CRM tools provide limited
support for merging tuples corresponding to the same customer into one tuple in the
integrated database. Although they typically support some form of conflict resolution
rules (e.g., rules that take the average between two conflicting incomes of the same
customer), these rules may be difficult to design. In the absence of conflict resolution
rules, some CRM tools transfer all conflicting tuples to the integrated database. Thus,
even if the sources satisfy the key constraints, the integrated database may not.
1.2 Consistent Query Answering
While it is well known how to answer queries over consistent databases, we must give
a clear and precise semantics to the notion of a meaningful answer obtained from an
inconsistent database. In this thesis, we make use of a semantics based upon the notions
of possible worlds and certain answers, concepts that are widely used not only in the
context of database theory and data integration [Lip79, Lip81, AKG87, AD98], but also
in the field of knowledge representation [Lev81, Moo85]. These notions were first adapted
to the context of inconsistent databases by Arenas, Bertossi and Chomicki [ABC99], who
defined the semantics of consistent query answers.
The semantics of consistent query answers relies on the intuition that an inconsistent
database can be cleaned (or repaired) by adding or deleting tuples in such a way that
the resulting database satisfies some given integrity constraints. The semantics is agnostic
about which tuples should be added or removed. Therefore, each inconsistent database
may be associated with more than one clean, consistent database. A consistent answer is
then an answer that is obtained from every possible consistent database. Intuitively, this
means that the consistent answers are obtained no matter how the database is cleaned.
The semantics of consistent query answers provides a sound and elegant basis for the
study of the problem of query answering over inconsistent databases. However, despite
considerable work on its theoretical underpinnings [ABC99, CB00, ABC+03b, CLR03a,
CLR03b, BB03a, BB03b, CM05], to the best of our knowledge, little work has been
done on its practical applications. A key contribution of this thesis is to bridge the
gap between theory and practice by providing an efficient and scalable system to obtain
consistent query answers from inconsistent databases. In particular, we report the design
and evaluation of ConQuer, a system for managing inconsistent data.¹ In ConQuer, a
user may postulate a set of integrity constraints, possibly at query time, and the system
automatically retrieves all (and only) the query answers that are consistent with respect
to the constraints. ConQuer also helps users take advantage of the query results in order
to interactively clean the inconsistent database.
¹ ConQuer stands for Consistent Querying. ConQuer's web page can be found at
www.cs.toronto.edu/db/conquer.
The major challenge in consistent query answering is the potentially huge number
of consistent databases that can be associated with a given inconsistent database. In
the case of primary key constraints, which are the focus of this thesis, the number of
consistent databases can be exponential in the size of the inconsistent database. This problem
is tackled in ConQuer by implementing a query rewriting approach. Given a query q,
ConQuer rewrites q into another query Q that has the following property: for every
inconsistent database, the rewritten query Q retrieves the consistent answers for the original
query q. The rewriting is done independently of the data, and works on every inconsistent
database. This approach has two fundamental advantages. First, it avoids constructing
the (potentially huge number of) consistent databases associated with the inconsistent
database. Second, the rewritten query is a SQL query that can be executed using any
commercial relational database management system (in ConQuer, we use IBM's DB2).
In an extensive set of experiments, reported in Chapter 7, we show that the overhead
in the execution of the rewritten queries is reasonable, when compared to the original
(non-rewritten) ones.
      emplKey   salary
t1    John      1000
t2    John      2000
t3    Mary      1000
Figure 1.1: An inconsistent database
In the next example, we illustrate the semantics of consistent answers and the query
rewriting approach.
Example 1.1. Consider the database of Figure 1.1, which contains information about
employees and their salaries. In particular, the schema of the database has one relation
called employee, with two attributes: emplKey (the name of the employee) and salary.
Assume that a user specifies that the key of the relation should be the attribute
emplKey. Note that the database violates this key constraint, perhaps because its data
has been integrated from many operational sources. In particular, there are two tuples
for employee John, one stating that he makes a salary of 1000, and the other stating that
he makes a salary of 2000. Suppose that we do not know which of these alternatives is
correct, but we still want to be able to draw meaningful answers from the database. Let
us consider the consistent databases (i.e., databases that satisfy the key constraint) that
can be built from the inconsistent database. We would like these databases to be not
only consistent, but also as close as possible to the inconsistent database. This leaves
us with two possible consistent databases (shown in Figure 1.2), obtained by deleting
exactly one tuple for John in each of them.
Consistent database 1            Consistent database 2
      emplKey   salary                 emplKey   salary
t1    John      1000             t2    John      2000
t3    Mary      1000             t3    Mary      1000
Figure 1.2: Consistent databases for the inconsistent database of Figure 1.1
Consider a query q1 that retrieves the employees whose salary is less than
or equal to 1000.
q1: select distinct emplKey
    from employee
    where salary <= 1000
If we execute this query directly over the inconsistent database, we obtain {John, Mary}.
Intuitively, this is not a consistent answer because it may be the case that John has a
salary over 1000. In fact, if the consistent database turns out to be the database on the
right-hand side of Figure 1.2, then John would not appear in the answer.
One strategy to obtain the consistent answer would be to apply query q1 to each
of the consistent databases of Figure 1.2. While this may be feasible in this simple
example, it is clearly impractical when the number of tuples violating the constraint
grows. In particular, even for the schema and single constraint of this example, the
number of consistent databases can be exponential in the size of the inconsistent database.
For this reason, in ConQuer, we never build the consistent databases explicitly. Instead,
we follow a query rewriting approach, where we rewrite the original query (q1 in this
case) into another query that can be executed directly on the inconsistent database and
is guaranteed to always return the consistent answers for the original query.
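The naive strategy discussed above (apply the query to every consistent database and keep the answers common to all of them) can be sketched in a few lines of Python. This is only an illustration of the semantics, not part of ConQuer; the helper names consistent_databases, q1, and consistent_answers are hypothetical, and the sketch assumes a single key constraint on emplKey.

```python
from itertools import product

# Inconsistent database of Figure 1.1 as (emplKey, salary) pairs.
employee = [("John", 1000), ("John", 2000), ("Mary", 1000)]

def consistent_databases(tuples):
    """Yield every database that keeps exactly one tuple per key value."""
    groups = {}
    for key, salary in tuples:
        groups.setdefault(key, []).append((key, salary))
    # One choice per key group: the number of combinations is the product
    # of the group sizes, which is exponential in the worst case.
    for choice in product(*groups.values()):
        yield set(choice)

def q1(db):
    """select distinct emplKey from employee where salary <= 1000"""
    return {key for key, salary in db if salary <= 1000}

def consistent_answers(tuples, query):
    """Keep only the answers returned on every consistent database."""
    answers = [query(db) for db in consistent_databases(tuples)]
    return set.intersection(*answers)

direct = q1(employee)
consistent = consistent_answers(employee, q1)
print(sorted(direct))      # ['John', 'Mary']
print(sorted(consistent))  # ['Mary']
```

The enumeration is exactly what the rewriting approach avoids: the loop in consistent_databases visits one database per combination of choices, so its cost grows multiplicatively with every key value that has conflicting tuples.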
In this case, it is quite simple to obtain a rewriting of q1. Notice that John appears
associated with two different salaries in the inconsistent database: one satisfying the
query, the other not. This suggests that in the rewriting we should return the employees
that satisfy q1 (i.e., have a salary less than or equal to 1000) in every tuple of the
inconsistent database where they appear. This can be obtained using the following
query:
Q1: select distinct e.emplKey
    from employee e
    where e.salary <= 1000
    and not exists (select *
                    from employee e2
                    where e2.emplKey = e.emplKey
                    and e2.salary > 1000)
Notice the use of a nested subquery introduced by not exists. The purpose of this
subquery is to filter out those key values that satisfy q1 in some tuples, but violate it in
others. In our example, this subquery filters John out of the answer because he appears
in tuple t2 with a salary above 1000.
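To see the effect of the not exists subquery concretely, the following sketch loads the database of Figure 1.1 into an in-memory SQLite database and runs both the original query and a rewriting of the kind described above. SQLite stands in for IBM DB2 purely for convenience; the alias e2, the constant names, and the run helper are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (emplKey text, salary integer)")
conn.executemany(
    "insert into employee values (?, ?)",
    [("John", 1000), ("John", 2000), ("Mary", 1000)],  # tuples t1, t2, t3
)

# Original query q1, evaluated directly on the inconsistent database.
Q1_ORIGINAL = """
select distinct emplKey
from employee
where salary <= 1000
"""

# Rewriting: keep a key value only if no tuple with the same key
# violates the selection condition (salary <= 1000).
Q1_REWRITTEN = """
select distinct e.emplKey
from employee e
where e.salary <= 1000
  and not exists (select *
                  from employee e2
                  where e2.emplKey = e.emplKey
                    and e2.salary > 1000)
"""

def run(sql):
    """Execute a query and return the sorted list of key values."""
    return sorted(row[0] for row in conn.execute(sql))

original_answers = run(Q1_ORIGINAL)
rewritten_answers = run(Q1_REWRITTEN)
print(original_answers)   # ['John', 'Mary']
print(rewritten_answers)  # ['Mary']
```

The rewritten query touches only the inconsistent table itself; no consistent database is ever materialized.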
Despite the simplicity of the previous example, it has been shown in the literature
[CLR03a, CM05] that there are Select-Project-Join queries for which there is no rewriting
into SQL (under a very likely complexity-theoretic assumption). However, we observe
that the presence of these negative results does not necessarily preclude the existence of
classes of queries for which there is a SQL rewriting. In fact, in Chapter 3, we show a
large and practical class of Select-Project-Join queries for which there is a SQL rewriting.
In Chapter 5, we show that this is a maximal class of queries, in the sense that minimal
relaxations of its conditions lead to queries for which there is no SQL rewriting.
Most of the previous work on consistent query answering (except [ABC+03b]) focuses
on queries with set semantics and no aggregation. However, practical query languages
like SQL have bag semantics (duplicates are not eliminated unless explicitly requested),
and support aggregation functions and grouping of results. In Chapter 2, we present
a generalization of the semantics of consistent answers for queries with bag semantics,
grouping and aggregation. In Chapter 4, we provide query rewritings that work under
this semantics.
In the thesis, we are concerned not only with the correctness of the rewritings (i.e.,
ensuring that they retrieve all and only the consistent answers), but also with their
efficiency when executed using existing database technology. We address efficiency issues
and their empirical validation in Chapters 6 and 7.
1.3 Contributions
The main contributions of this thesis are the following:
• We identify a large and practical class of Select-Project-Join queries for which the
  problem of computing consistent answers is tractable. The class consists of queries
  that can have two kinds of joins. First, they can have joins between key attributes.
  Second, they can have joins from non-key attributes of a relation (possibly a foreign
  key) to the primary key of another relation. Arguably, these two types of joins are
  the most commonly used in practice (and certainly the most common in industry
  standard benchmarks like TPC-H). (Chapter 3)
• For the class of tractable queries that we identify, we provide a query rewriting
  algorithm that produces a query in first-order logic that returns the consistent answers.
  The algorithm runs in polynomial time in the size of the query. The rewritings
  are sound and complete, in the sense that they return all (and only) the consistent
  answers. Since first-order queries can be written in SQL, the rewritings in first-order
  logic are a first step towards reusing existing commercial database technology.
  This work was first published at the International Conference on Database Theory
  (ICDT) [FM05], and an extended journal version has been invited to the Journal
  of Computer and System Sciences (JCSS) [FM06]. (Chapter 3)
• We consider not only Select-Project-Join queries with set semantics, but also queries
  with bag semantics, grouping and aggregation. These extensions are needed to
  enable practical use in decision support applications. For this purpose, we extend
  the semantics of consistent answers originally proposed by Arenas, Bertossi and
  Chomicki [ABC99, ABC+03b]. We provide sound and complete algorithms
  under this semantics for the most common SQL aggregation functions (count, min,
  max, sum). This work has been published at the ACM International Conference
  on the Management of Data (SIGMOD) [FFM05a]. (Chapters 2 and 4)
• We show a large class of Select-Project-Join queries for which the conditions of
  applicability of our rewriting algorithm are not only sufficient but also necessary.
  In particular, we show a class in which the problem of computing the consistent
  answers is coNP-complete (and, assuming P $\neq$ NP, inexpressible in first-order logic)
  for every query of the class that violates the conditions of the class of queries for
  which we give a rewriting algorithm. This type of result is stronger than the
  complexity results given in the consistent query answering literature [CLR03a, CM05],
  which consist of showing intractability of a class by exhibiting at least one query for
  which the problem is intractable. As a corollary of our result, we get a dichotomy
  for this class of queries: given a query q in our class, either the problem of
  computing the consistent answers for q is first-order rewritable (and thus it is in PTIME),
  or it is a coNP-complete problem. (Chapter 5)
• We present the implementation of ConQuer, a system for querying inconsistent
  databases. We also explain in detail the SQL rewritings produced by the system.
  ConQuer has been demonstrated at the International Conference on Very Large
  Databases (VLDB) [FFM05b]. (Chapter 6)
• We study the running time of ConQuer's SQL rewritings on a commercial database
  system, in particular IBM DB2. To this end, we present a detailed performance
  study using the data and queries of the TPC-H decision support benchmark. The
  study focuses on the overhead of the rewritings, using the original (non-rewritten)
  queries as a baseline. We study the scalability of the approach (with databases of
  up to 172 million tuples), and the effect of the degree of inconsistency (in terms
  of the percentage of tuples that are inconsistent and the number of conflicting
  tuples per key value). The experiments show that our approach can be applied to
  large databases, several orders of magnitude larger than those considered in other
  approaches for querying inconsistent databases. (Chapter 7)
1.4 Organization of the Document
The rest of this document is organized as follows. In Chapter 2, we present the formal
framework for querying inconsistent databases that will be used throughout the thesis.
In Chapters 3 and 4, we present query rewritings and focus on proving their correctness.
In Chapter 3, we consider a large and practical class of conjunctive queries (that is,
Select-Project-Join queries) and present rewritings in first-order logic. In Chapter 4, we
consider queries with bag semantics, grouping and aggregation, and present rewritings
in an extension of first-order logic with grouping and aggregation functions. In Chapter
5, we show the maximality of the class of queries that is the input to the rewriting
algorithms.
In Chapter 6, we present ConQuer, a system for efficiently querying inconsistent
databases. We present in detail the SQL query rewritings produced by ConQuer for
queries with and without aggregation. The efficiency of these rewritings is empirically
validated in Chapter 7 with an extensive set of experiments. We present related work in
separate sections at the end of each of the chapters. In Chapter 8, we finish the document
with conclusions and directions for future work.
Chapter 2
Formal Framework
In this chapter, we present the formal framework that will be used throughout the thesis.
In this framework, an inconsistent database is associated with a space of consistent
databases called repairs. In Section 2.1, we formally define the notion of repair. Then, in
Section 2.2, we introduce the semantics for query answering over inconsistent databases.
This semantics involves the exploration of all repairs of an inconsistent database. Since
the number of repairs can be very large, in this thesis we advocate a query rewriting
approach, where queries are rewritten in such a way that their consistent answer can be
obtained by posing another query directly on the inconsistent database, without explicitly
building any repair. In Section 2.3, we formally define the notion of a query rewriting.
Finally, in Section 2.4, we introduce the integrity constraints that are the focus of this
thesis.
2.1 Repairs
A schema R is a finite collection of relation symbols, each of which has an associated
arity. A database instance (or database) I over R is a function that associates each
relation symbol r of R to a relation I(r). A relation I(r) of arity k is a set of k-tuples
whose elements belong to some underlying fixed domain.¹ Whenever it is clear from
context, we will abuse notation and use the same symbol r to denote both a relation
symbol and a relation. Given a tuple $\bar{t}$ occurring in relation I(r), we denote by $r(\bar{t})$ the
association between $\bar{t}$ and r.
¹ Although we will consider both set and bag semantics for queries, we always assume the relations of
a database instance (including inconsistent databases) to be sets.
A database instance I is consistent with respect to a set of integrity constraints $\Sigma$ if
I satisfies $\Sigma$ in the standard model-theoretic sense, that is, $I \models \Sigma$. (As customary, an
integrity constraint may be any first-order formula [AHV95].) Throughout this thesis,
we will consider databases that may violate a given set of integrity constraints. That is,
given R and a set of integrity constraints $\Sigma$ over R, a database I may be inconsistent with
respect to $\Sigma$, that is, $I \not\models \Sigma$.
Intuitively, we will assume that an inconsistent database can be cleaned (or
repaired) by adding or deleting tuples in such a way that the resulting database satisfies
the given integrity constraints. We will be agnostic about which tuples should be added
or removed. Therefore, each inconsistent database may be associated with more than one
possible clean, consistent database. Furthermore, no matter how the clean databases are
obtained, we would like them to be as close as possible to the original, inconsistent
database (that is, to minimize the number of tuples that are added or removed). We will
call each consistent database a repair.
The notion of repair was originally introduced by Arenas, Bertossi and Chomicki
[ABC99]. A repair is a database instance that satises the given integrity constraints,
and which has a minimal distance to the inconsistent database. The distance between
two database instances I and I
is dened as their symmetric dierence, i.e., (I, I
) =
(I I
) (I
I). The formal denition of repair is the following.

Definition 2.1 (Repair [ABC99]). Let I be a database instance, and $\Sigma$ be a set of
integrity constraints. We say that an instance $\mathcal{I}$ is a repair of I with respect to $\Sigma$ if [footnote 2]:
• $\mathcal{I} \models \Sigma$, and
• there is no instance $I'$ such that $I' \models \Sigma$ and $\Delta(I, I') \subsetneq \Delta(I, \mathcal{I})$ (i.e., $\Delta(I, \mathcal{I})$ is
  minimal under set inclusion in the class of instances that satisfy $\Sigma$).
Example 2.1. Let R be a schema with one relation symbol employee. Assume that
employee has two attributes: emplKey (the name of the employee) and salary, and
that the only constraint in $\Sigma$ is that attribute emplKey is the key of relation employee.
Let $I = \{employee(John, 1000), employee(John, 2000), employee(Mary, 1000)\}$. The
database I is inconsistent with respect to $\Sigma$ because it violates the key constraint stating
that every employee has exactly one salary.
There are two repairs of I wrt $\Sigma$: $\mathcal{I}_1 = \{employee(John, 1000), employee(Mary, 1000)\}$
and $\mathcal{I}_2 = \{employee(John, 2000), employee(Mary, 1000)\}$. Notice that, according to
Definition 2.1, the databases $\{employee(John, 2000)\}$ and $\{employee(Mary, 1000)\}$ are
not repairs because their distance with respect to I is not minimal under set inclusion.

Footnote 2: Whenever $\Sigma$ is clear from the context, we will just say that $\mathcal{I}$ is a repair of I.
The minimality condition for repairs is crucial in the definition. Otherwise, the empty
set would trivially be a repair of every database that violates a set of key constraints.
Notice that repairs do not need to be unique. For example, if the given set of constraints
consists of key dependencies, the number of repairs can be exponential in the
size of the inconsistent database.
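As an illustration of the discussion above, the following Python sketch enumerates the repairs of a single relation under one key constraint. It is not an algorithm from the text. It assumes that tuples are Python tuples whose first `key_len` positions form the key, and it uses the fact that, for key constraints, a repair keeps exactly one tuple from each group of tuples sharing a key value. The names `repairs` and `count_repairs` are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import prod


def repairs(relation, key_len=1):
    """Yield every repair of `relation` under its key constraint.

    Assumption: the first `key_len` positions of each tuple are the key.
    """
    groups = defaultdict(list)
    for t in sorted(relation):
        groups[t[:key_len]].append(t)  # tuples that share a key value
    # A repair keeps exactly one tuple from each key group.
    for choice in product(*groups.values()):
        yield frozenset(choice)


def count_repairs(relation, key_len=1):
    """Number of repairs: the product of the key-group sizes."""
    sizes = defaultdict(int)
    for t in relation:
        sizes[t[:key_len]] += 1
    return prod(sizes.values())


# The instance I of Example 2.1.
employee = {("John", 1000), ("John", 2000), ("Mary", 1000)}
all_repairs = list(repairs(employee))
for r in all_repairs:
    print(sorted(r))
print(count_repairs(employee))
```

With n key values that each have two conflicting tuples, `count_repairs` returns 2^n, which is the exponential growth mentioned above.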
2.2 Query Answering Semantics
The notion of repair can be used to give a precise meaning to query answering over
inconsistent databases. Intuitively, each repair corresponds to one particular way of
cleaning the database. Since we are agnostic about how the database should be cleaned,
it makes sense to consider the answers that would be obtained from every repair. This
notion is formalized with the concept of consistent answers, which we define next.
Definition 2.2 (Consistent Answer [ABC99]). Let R be a schema. Let $\Sigma$ be a set
of integrity constraints. Let I be an instance over R (possibly inconsistent with respect
to $\Sigma$). Let q be a query over R. We say that a tuple $\vec{t}$ is a consistent answer for q with
respect to $\Sigma$ if $\vec{t} \in q(\mathcal{I})$, for every repair $\mathcal{I}$ of I with respect to $\Sigma$. We denote this as
$\vec{t} \in \mathit{consistent}_{\Sigma}(q, I)$.
This definition was originally given by Arenas, Bertossi and Chomicki [ABC99]. It is
based on the semantics of certain answers [Lip79, Lip81, AKG87] that has been used in
database theory, and possible worlds, which is well-known in knowledge representation
[Lev81]. In the case of consistent answers, the space of possible worlds corresponds to
the repairs of the inconsistent database.
Example 2.1. (continued) Consider a query that retrieves all the employees from
the database, expressed as $q_1(e) = \exists s.\ employee(e, s)$. Recall that there are two
repairs of I wrt $\Sigma$: $\mathcal{I}_1 = \{employee(John, 1000), employee(Mary, 1000)\}$ and
$\mathcal{I}_2 = \{employee(John, 2000), employee(Mary, 1000)\}$. The result of applying $q_1$ on both
$\mathcal{I}_1$ and $\mathcal{I}_2$ is $\{(John), (Mary)\}$. Thus, the consistent answers for $q_1$ on I are the tuples
(John) and (Mary).
Now, consider a query that retrieves employees together with their salaries, expressed
as $q_2(e, s) = employee(e, s)$. Notice that $q_2$ is the identity on the repairs. Thus, the
consistent answer can be obtained as the intersection of $\mathcal{I}_1$ and $\mathcal{I}_2$. In consequence, the only
consistent answer for $q_2$ on I is (Mary, 1000). Notice that the tuples (John, 1000) and
(John, 2000) are not consistent answers. The reason is that neither of them is present
in both repairs. Intuitively, this reflects the fact that John's salaries are inconsistent data,
and we do not want to retrieve possibly erroneous results.
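The example can be checked mechanically with a brute-force reading of Definition 2.2. The Python sketch below is only an illustration: it enumerates all repairs and intersects the query results, which is exactly the exponential procedure that query rewriting is meant to avoid. The tuple layout (key first) and the names `repairs`, `consistent_answers`, `q1` and `q2` are assumptions made for this sketch.

```python
from collections import defaultdict
from itertools import product


def repairs(relation, key_len=1):
    """Yield the repairs of `relation`: one tuple per key value (assumed layout)."""
    groups = defaultdict(list)
    for t in sorted(relation):
        groups[t[:key_len]].append(t)
    for choice in product(*groups.values()):
        yield frozenset(choice)


def consistent_answers(query, relation, key_len=1):
    """Brute-force reading of Definition 2.2: intersect q over all repairs."""
    results = [set(query(r)) for r in repairs(relation, key_len)]
    return set.intersection(*results)


def q1(instance):
    """q1(e) = exists s. employee(e, s)"""
    return {(e,) for (e, s) in instance}


def q2(instance):
    """q2(e, s) = employee(e, s)"""
    return set(instance)


employee = {("John", 1000), ("John", 2000), ("Mary", 1000)}
print(sorted(consistent_answers(q1, employee)))  # [('John',), ('Mary',)]
print(sorted(consistent_answers(q2, employee)))  # [('Mary', 1000)]
```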
For convenience, we will use the following notation for the consistent answers of
Boolean queries.
Definition 2.3. Let R be a schema. Let $\Sigma$ be a set of integrity constraints. Let
I be a database instance over R. Let q be a Boolean query over R. We say that
$\mathit{consistent}_{\Sigma}(q, I) = true$ if for every repair $\mathcal{I}$ of I with respect to $\Sigma$, $\mathcal{I} \models q$. We
say that $\mathit{consistent}_{\Sigma}(q, I) = false$ if there exists at least one repair $\mathcal{I}$ of I with respect
to $\Sigma$ such that $\mathcal{I} \not\models q$.
Notice the asymmetry between the case for $\mathit{consistent}_{\Sigma}(q, I) = true$ and
$\mathit{consistent}_{\Sigma}(q, I) = false$. While for the former, every repair must satisfy the query,
for the latter it suffices to have just one non-satisfying repair. This is not intrinsic to
Boolean queries: by Definition 2.2, it is also the case that $\vec{t} \notin \mathit{consistent}_{\Sigma}(q, I)$ if there
exists at least one repair $\mathcal{I}$ such that $\vec{t} \notin q(\mathcal{I})$.
The definition of consistent answers is independent of the language used to express
the input query q, and it makes perfect sense for queries that, for example, return tuples
from the active domain of the database. However, for queries that compute aggregates
over groups of tuples, it may be useful to relax this definition, as we motivate next.
Example 2.1. (continued) Let $q_3(s, v)$ be a SQL query that counts the number of
occurrences of each salary in the database:
select salary as s, count(*) as v
from employee
group by salary
Recall that there are two repairs of I with respect to $\Sigma$: $\mathcal{I}_1 = \{employee(John, 1000),
employee(Mary, 1000)\}$ and $\mathcal{I}_2 = \{employee(John, 2000), employee(Mary, 1000)\}$. The
result of applying query $q_3$ to the repairs is the following: $q_3(\mathcal{I}_1) = \{(1000, 2)\}$, and
$q_3(\mathcal{I}_2) = \{(1000, 1), (2000, 1)\}$. Since the intersection of these results is empty, according
to Definition 2.2, the set of consistent answers for $q_3$ is empty. However, notice that the
salary 1000 appears in every query result (but together with a different number for the
count of occurrences). Intuitively, it would be desirable to report this salary in the result.
In the previous example, the value 1000 appears in every query result. However, it
appears a different number of times in each of them. How do we report the number of
times that it appears? In the semantics that we define next, we employ tight bounds
for this purpose. In this particular example, we will say that the minimum (greatest
lower bound) is one, since the salary 1000 appears exactly once in $q_3(\mathcal{I}_2)$; and that the
maximum (lowest upper bound) is two, since salary 1000 appears exactly twice in $q_3(\mathcal{I}_1)$.
In the following definition, we formalize this notion. The definition applies to any query
that computes an aggregate over a group (in our example, the aggregate is the count
of occurrences of each salary). We will denote with $\mathit{aggconsistent}_{\Sigma}(q, I)$ the modified
semantics for consistent answers for a query q on an instance I with respect to a set of
constraints $\Sigma$.
Definition 2.4 (Consistent Answer for Queries with Aggregation). Let R be
a schema. Let $\Sigma$ be a set of integrity constraints. Let I be a database instance over
R. Let q be a query over R with free variables $\vec{z}$ and v, where v is a variable over a
numeric domain (possibly computed by an aggregate function). We say that
$(\vec{t}, glb, lub) \in \mathit{aggconsistent}_{\Sigma}(q, I)$ if all the following conditions hold:
• for every repair $\mathcal{I}$ of I wrt $\Sigma$, there is some d such that $(\vec{t}, d) \in q(\mathcal{I})$ and
  $glb \le d \le lub$; and
• there is some repair $\mathcal{I}$ of I wrt $\Sigma$ such that $(\vec{t}, glb) \in q(\mathcal{I})$; and
• there is some repair $\mathcal{I}$ of I wrt $\Sigma$ such that $(\vec{t}, lub) \in q(\mathcal{I})$.
We also say that glb is the greatest lower bound of $\vec{t}$ in q, and that lub is the lowest
upper bound of $\vec{t}$ in q.
This definition is particularly well suited to the case of queries with bag semantics,
grouping and aggregation, which are prevalent in practice. For instance, consider the
query $q_3(s, v)$ of Example 2.1:
select salary as s, count(*) as v
from employee
group by salary
In this case, $q_3$ has free variables s and v. The variable s corresponds to the attribute
salary, on which there is a grouping condition; the numerical argument v, for which we
give tight ranges, corresponds to the result of count(*). Essentially, for a query $q(\vec{z}, v)$,
$\mathit{aggconsistent}_{\Sigma}(q, I)$ gives the consistent answers on I with respect to $\Sigma$ for each value
of $\vec{z}$ (the salary in our example), together with a tight range for the possible associated
numerical values.
Example 2.1. (continued) Let us obtain the $\mathit{aggconsistent}_{\Sigma}$ answers for $q_3$ on I.
Recall that the result of applying $q_3$ to the repairs of the inconsistent database is:
$q_3(\mathcal{I}_1) = \{(1000, 2)\}$, and $q_3(\mathcal{I}_2) = \{(1000, 1), (2000, 1)\}$. Then, we have that
$\mathit{aggconsistent}_{\Sigma}(q_3, I) = \{(1000, 1, 2)\}$. This means that the salary 1000 appears in every
query result, and the value of count(*) for 1000 has a greatest lower bound of one and a
lowest upper bound of two. Notice that the salary 2000 does not appear in
$\mathit{aggconsistent}_{\Sigma}(q_3, I)$. The intuitive reason is that 2000 is not a consistent answer, since it
does not occur in repair $\mathcal{I}_1$. According to the definition of $\mathit{aggconsistent}_{\Sigma}$ above, 2000 is
not in the answer because it fails to satisfy the first condition of Definition 2.4. This
condition is violated because $\mathcal{I}_1$ is a repair such that $(2000, d) \notin q_3(\mathcal{I}_1)$, for every d.
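The computation in this example can be reproduced with a brute-force reading of Definition 2.4. The Python sketch below is an illustration only. It assumes that a grouping query is modelled as a function returning a dictionary from group value to aggregate value, so that each repair yields at most one aggregate per group. The names `repairs`, `agg_consistent` and `q3` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product


def repairs(relation, key_len=1):
    """Yield the repairs of `relation`: one tuple per key value (assumed layout)."""
    groups = defaultdict(list)
    for t in sorted(relation):
        groups[t[:key_len]].append(t)
    for choice in product(*groups.values()):
        yield frozenset(choice)


def agg_consistent(query, relation, key_len=1):
    """Brute-force reading of Definition 2.4.

    `query(instance)` is assumed to return a dict {group value: aggregate}.
    """
    per_repair = [query(r) for r in repairs(relation, key_len)]
    # First condition: the group value occurs in the result on every repair.
    common = set.intersection(*(set(res) for res in per_repair))
    answers = set()
    for t in common:
        values = [res[t] for res in per_repair]
        # Second and third conditions: both bounds are attained on some repair.
        answers.add((t, min(values), max(values)))
    return answers


def q3(instance):
    """select salary as s, count(*) as v from employee group by salary"""
    return dict(Counter(salary for (_, salary) in instance))


employee = {("John", 1000), ("John", 2000), ("Mary", 1000)}
print(agg_consistent(q3, employee))  # {(1000, 1, 2)}
```

Salary 2000 is dropped by the intersection step, which corresponds to the failure of the first condition of Definition 2.4.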
To the best of our knowledge, the problem of computing consistent answers for queries
with aggregation has only been studied before by Arenas et al. [ABC+03b]. In particular,
they were the first to propose a generalization of the semantics of consistent answers,
where ranges rather than exact values are returned. In their work, they consider a class
of SQL queries with no grouping, no selection conditions (i.e., no conditions in the where
clause) and on exactly one relation. In Chapter 4, we will present results for a much
larger class of queries. For the class of queries considered by Arenas et al., our semantics
and theirs coincide. However, we need to extend their semantics in order to be able to
deal with grouping.
2.3 Query Rewritings
The definition of consistent answers introduced in the previous section involves the
exploration of a potentially huge number of repairs (in the case of keys, it can be exponential in
the size of the inconsistent database). In this thesis, we approach this problem by designing
algorithms that compute consistent answers directly from the inconsistent database,
without explicitly building the repairs. Given a query q, our algorithms will return another
query Q such that, for every instance I, the consistent answers for the original
query q can be obtained by just evaluating Q on I. We call Q a query rewriting for the
problem of computing the consistent answers of q.
In order to give a formal definition of query rewriting, we first define the computational
problems associated to computing consistent answers using the $\mathit{consistent}_{\Sigma}$ and
$\mathit{aggconsistent}_{\Sigma}$ operators (the latter for the case in which the query computes numerical
values over a group of tuples).
Definition 2.5. Let R be a schema. Let q be a query over R. Let $\Sigma$ be a set of integrity
constraints.
The problem CONSISTENT(q, $\Sigma$) is the following: given an instance I over R, and
tuple $\vec{t}$, is it the case that $\vec{t} \in \mathit{consistent}_{\Sigma}(q, I)$?
The problem AGGCONSISTENT(q, $\Sigma$) is the following: given an instance I over R, tuple
$\vec{t}$ and real numbers glb and lub, is it the case that $(\vec{t}, glb, lub) \in \mathit{aggconsistent}_{\Sigma}(q, I)$?
We can now define the notion of query rewriting for the problems CONSISTENT(q, $\Sigma$)
and AGGCONSISTENT(q, $\Sigma$). The definition is given for a fixed (but unspecified) query
language.
Definition 2.6 ($\mathcal{L}$-query rewriting). Let R be a schema. Let $\Sigma$ be a set of integrity
constraints. Let q be a query over R. Let Q be a query expressed in a query language $\mathcal{L}$
(possibly different from the language used to express q).
We say that Q is an $\mathcal{L}$-rewriting of CONSISTENT(q, $\Sigma$) if for every instance I over R
and tuple $\vec{t}$, $\vec{t} \in Q(I)$ iff $\vec{t} \in \mathit{consistent}_{\Sigma}(q, I)$.
We say that Q is an $\mathcal{L}$-rewriting of AGGCONSISTENT(q, $\Sigma$) if for every instance I
over R, tuple $\vec{t}$ and real numbers glb and lub, $(\vec{t}, glb, lub) \in Q(I)$ iff
$(\vec{t}, glb, lub) \in \mathit{aggconsistent}_{\Sigma}(q, I)$.
We also define the rewritability of a problem in a language $\mathcal{L}$ as follows. We say that
CONSISTENT(q, $\Sigma$) is $\mathcal{L}$-rewritable if there exists a query Q expressed in language $\mathcal{L}$ such
that Q is a query rewriting for CONSISTENT(q, $\Sigma$). A similar definition can be given for
AGGCONSISTENT(q, $\Sigma$).
In Chapter 3, we will consider classes of conjunctive queries, and present query rewritings
in first-order logic. Notice that if CONSISTENT(q, $\Sigma$) is first-order rewritable, then
it is tractable. This is because the data complexity of first-order logic is in PTIME (in
fact, in $AC^0$, which is a subset of PTIME). Thus, the query rewriting Q can be executed
on the inconsistent database in polynomial time. Besides this, an approach based on
first-order query rewriting is attractive because first-order queries can be written in SQL. In
Chapter 4, we will focus on classes of conjunctive queries with bag semantics, grouping,
and aggregation. We will give query rewritings for the problem AGGCONSISTENT(q, $\Sigma$) in
a language that extends first-order logic with operators for grouping and aggregation. In
Chapter 5, we will study the computational complexity of the problem CONSISTENT(q, $\Sigma$).
Finally, in Chapters 6 and 7, we will present SQL query rewritings and show experimentally
that they can be run efficiently and scalably on a commercial relational database
system.
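To make the idea of a rewriting concrete, here is a small sketch for the two queries of Example 2.1, run through Python's standard sqlite3 module. The SQL text is an illustrative assumption (a rewriting of the kind known for quantifier-free queries), not output copied from the algorithms of this thesis. For $q_2$, a tuple is returned only if no other tuple shares its key with a different salary. For $q_1$, the query is evaluated as is, since every repair keeps one tuple per key value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No key is declared: the stored data is allowed to violate the key constraint.
conn.execute("create table employee (emplKey text, salary integer)")
conn.executemany(
    "insert into employee values (?, ?)",
    [("John", 1000), ("John", 2000), ("Mary", 1000)],
)

# Assumed rewriting of q2(e, s) = employee(e, s): keep a tuple only if no
# other tuple has the same key value and a different salary.
rewritten_q2 = """
    select e1.emplKey, e1.salary
    from employee e1
    where not exists (
        select * from employee e2
        where e2.emplKey = e1.emplKey and e2.salary <> e1.salary)
    order by e1.emplKey
"""

# Assumed rewriting of q1(e) = exists s. employee(e, s): the query itself.
rewritten_q1 = "select distinct emplKey from employee order by emplKey"

answers_q2 = conn.execute(rewritten_q2).fetchall()
answers_q1 = conn.execute(rewritten_q1).fetchall()
print(answers_q2)  # [('Mary', 1000)]
print(answers_q1)  # [('John',), ('Mary',)]
```

Both results agree with the consistent answers obtained in Example 2.1 by inspecting the repairs, but here the repairs are never built.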
2.4 Constraints
The most commonly used types of constraints in database systems are keys and foreign
keys. Of these, keys pose a particular challenge since databases that are inconsistent
with respect to a set of key dependencies admit an exponential number of repairs in the
worst case. This potentially large number of repairs leads to the question of whether it is
possible to compute consistent answers efficiently. The answer to this question is known
to be negative in general [CLR03a, CM05]. However, this does not necessarily preclude
the existence of classes of queries for which the problem is easier to compute. Hence, we
consider the following question: for what queries is the problem of computing consistent
answers under key constraints in polynomial time (in data complexity)? And can the
corresponding query rewritings be executed efficiently in practice? We address the first
question in Chapters 3 and 4, and the second question in Chapter 6.
A key constraint is an integrity constraint of the form
$\forall \vec{x}, \vec{y}, \vec{z}.\ (r(\vec{x}, \vec{y}) \wedge r(\vec{x}, \vec{z})) \rightarrow \vec{y} = \vec{z}$
In the above constraint, we say that $\vec{x}$ is a key of relation r. Notice that a key may
consist of many attributes. Throughout the thesis, we will assume that $\Sigma$ is a set of key
constraints that includes one key constraint per relation of the schema. This corresponds
to the notion of primary keys in database systems.
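The key constraint above can be checked directly on a stored relation. The following Python sketch is a minimal illustration: it assumes that tuples list their key attributes first, and the function name `key_violations` is hypothetical. It returns the key values $\vec{x}$ for which two tuples $r(\vec{x}, \vec{y})$ and $r(\vec{x}, \vec{z})$ with $\vec{y} \neq \vec{z}$ exist.

```python
from collections import defaultdict


def key_violations(relation, key_len=1):
    """Return the key values that are associated with more than one nonkey value.

    Assumption: the first `key_len` positions of each tuple are the key.
    """
    nonkey_values = defaultdict(set)
    for t in relation:
        nonkey_values[t[:key_len]].add(t[key_len:])
    return {key for key, values in nonkey_values.items() if len(values) > 1}


employee = {("John", 1000), ("John", 2000), ("Mary", 1000)}
print(key_violations(employee))  # {('John',)}
```

A relation satisfies its key constraint exactly when this set is empty.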
To facilitate specifying the key constraints each time that we give a query, we will
underline the positions in each literal that correspond to key attributes. Furthermore,
by convention, the key attributes will be given first. For example, the query
$q = \exists x, y, z.\ r_1(\underline{x}, y) \wedge r_2(\underline{y}, z)$ indicates that the first and second literals correspond to
binary relations whose first attribute is the key. We will use vector notation (e.g., $\vec{x}$, $\vec{y}$) to
denote vectors of variables or constants from a query or tuple. In addition, when we give
a tuple, we will underline the values that appear at the position of key attributes. For
instance, for a tuple $r(\underline{\vec{c}}, \vec{d})$, we will say that $\vec{c}$ is a key value, and $\vec{d}$ is a nonkey value.
Using this notation, the key constraints of $\Sigma$ that are relevant to the query are denoted
directly in the query expression.
2.5 Related Work
In this section, we survey work on related formal frameworks for managing inconsistent
data. For two excellent surveys of the area of consistent query answering, we refer the
reader to Bertossi and Chomicki [BC03] and Bertossi [Ber06].
Intuitively, a repair is a consistent database that is as close as possible to the given
inconsistent database. To formalize this intuition, it is necessary to define a notion of
distance between databases. The notion of distance that we employ in this thesis (and
which was initially proposed by Arenas, Bertossi, and Chomicki [ABC99]) is defined in
terms of the symmetric difference between sets. Other notions of distance have been
explored in the literature, which we review next.
Some proposals adopt a cardinality-based notion of distance between database
instances, instead of set-theoretic. For example, Lin and Mendelzon [LM96] propose a
semantics where conflicts are resolved according to a majority criterion. Their framework
is presented in the context of belief revision for first-order theories, and is therefore
broader in scope than consistent query answering. However, the complexity of query
answering under this semantics has not been studied. Other approaches [FPL+01, BBFL05,
FFP05, BMFR05] consider cost-based notions of distance, where each operation that can
be used to restore consistency is given a cost. Then, repairs are defined as the consistent
databases that can be obtained from the inconsistent database with a minimum cost.
These operations include not only insertion and deletion of tuples, but also modification
of values. While a cost-based notion of distance is attractive from a semantic point of
view, it can be computationally more expensive than the set-theoretic semantics. For
example, in the case of inconsistencies with respect to primary key dependencies, the
problem of obtaining a repair of an inconsistent database is NP-complete [BMFR05],
whereas it can be obtained in linear time under the set-theoretic semantics.
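The last remark can be illustrated with a short Python sketch that builds one repair under the set-theoretic semantics in a single pass over the data. It is an illustration under stated assumptions, not a procedure given in the text: tuples list their key attributes first, the name `one_repair` is hypothetical, and the linear running time assumes that dictionary operations take constant time on average.

```python
def one_repair(relation, key_len=1):
    """Build a single repair in one pass: keep the first tuple seen per key value."""
    chosen = {}
    for t in relation:
        chosen.setdefault(t[:key_len], t)  # ignore later tuples with the same key
    return set(chosen.values())


employee = [("John", 1000), ("John", 2000), ("Mary", 1000)]
repair = one_repair(employee)
print(sorted(repair))  # [('John', 1000), ('Mary', 1000)]
```

Any other choice of one tuple per key value would give an equally valid repair; the sketch simply takes the first one encountered.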
In some of the cost-based approaches mentioned above [FPL+01, BBFL05, FFP05],
tuples can be modified to contain values that are not in the active domain of the
inconsistent database. Thus, the domain of the attributes that can be modified must have
an intrinsic distance metric. In particular, these approaches consider only numerical
attributes (it is not clear how their techniques could be extended to categorical values).
An approach based on tuple modification which allows arbitrary attribute domains is
given by Wijsen [Wij05]. In his work, the repaired databases may contain variables, and
the semantics is given in terms of homomorphisms to the inconsistent database. Instead
of answering queries directly on the inconsistent database (as we do in ConQuer), his
approach requires the offline processing of the inconsistent databases to construct
condensed representations. The consistent answers to certain classes of queries can then be
obtained by directly executing the original query on the condensed representation.
In contrast to consistent answers, we could also consider possible answers, where
we retrieve answers that appear in at least one repair. This notion has received
less attention than consistent answers, perhaps because it is less challenging from a
computational point of view. In fact, for broad classes of queries and constraints for which
obtaining consistent answers is intractable, the problem of obtaining possible answers
is tractable (and it usually suffices to compute the original query on the inconsistent
database). Although they are easier to obtain, possible answers are as important as
consistent answers in the context of inconsistent databases. While consistent answers are
best suited for decision making, possible answers can be used to understand the reasons
why a database is inconsistent. For example, in ConQuer, we give the option of retrieving
not only the consistent answers but also the possible answers (see Chapter 6). If the user
decides that a possible answer should have been a consistent answer, he or she can request
an explanation from the system in terms of the underlying database. This explanation
often helps the user to detect incorrect data and to (interactively) correct it.
The notions of possible and consistent answers are two opposite ends of a spectrum:
the former being the most aggressive, and the latter the most cautious. In some
scenarios, it is desirable to give preference to (or rank) tuples in the answer according to the
number of repairs where they appear. Furthermore, some repairs may be preferable
to others. To formalize this intuition, it is natural to appeal to a semantics based on
probabilities, where each repair is assigned a probability of being the consistent database
that the user has in mind. There has been considerable research on the topic of
probabilistic databases [CP87, BMP92, LLRS97, FR97, DS04]. Recently, Dalvi and Suciu
[DS04] presented a framework for query rewriting over probabilistic databases. Their
rewriting algorithms rely on the fundamental assumption that each tuple has an
independent probability of being in the (in our terms) consistent database. In the context
of databases that violate primary key constraints, which is the focus of this thesis, we
cannot assume that all tuples are independent. In fact, tuples that share the same key
value are mutually exclusive. In recent work (which is not covered in this thesis), we
and other authors [AFM06] presented query rewriting algorithms that work under the
probabilistic semantics for databases that may violate primary key constraints. In that
paper, we also considered the important problem of obtaining the probabilities. In
particular, we explored the use of a clustering-based technique that works particularly well
on categorical values [ATMS04]. The non-probabilistic semantics that we consider in this
thesis is a special case of the probabilistic semantics. However, the class of rewritable
queries that we can handle under the probabilistic semantics [AFM06] is considerably
more restricted than the classes considered in Chapters 3 and 4 of this thesis for the
non-probabilistic case.
Databases that are inconsistent with respect to primary key constraints can be
modelled as disjunctive databases [vdM98]. In particular, if $\Sigma$ is a set of key dependencies, the
set of all repairs of an inconsistent database can be represented as a disjunctive database
D in such a way that each repair corresponds to a minimal model of D. However, to
the best of our knowledge, there are no results in the literature for query rewritings over
disjunctive databases. A relevant special case of disjunctive databases are databases with
OR-objects [IvdMV95]. If an inconsistent relation has two attributes (a key and a nonkey
attribute), then it can be modelled with OR-objects. However, this is no longer the case
for relations whose arity is greater than two.
To the best of our knowledge, DeMichiel [DeM89] and Agarwal et al. [AKWS95] are
the first authors to recognize the need to manage inconsistent databases. They propose
semantics analogous to the one for OR-objects. DeMichiel proposes algorithms that are
sound but not necessarily complete with respect to the semantics. Agarwal et al. do not
discuss the implementation of the projection and join operations which, as we will see in
Chapter 3, are particularly challenging under the consistent query answering semantics,
and whose treatment is an important contribution of this thesis.
We conclude this section by pointing out that the problem of dealing with inconsistency
arises (and has been studied) in other fields of computer science. For example, our
approach to handling inconsistency is related to the approaches followed by the belief
revision community [GR95] in the field of artificial intelligence. The scenario typically
adopted in belief revision is more general in scope than ours, since (in our terms) they
allow the modification of not only the data but also the integrity constraints. As another
example, the problem of handling inconsistency has been studied in software engineering
[Bal91, NER00]. This body of work focuses not on data or query
answering, but on the reconciliation of inconsistent views of software requirements and
specifications.
Chapter 3
Rewritings for Conjunctive Queries
The problem of computing consistent answers for conjunctive queries over databases that
might violate a set of key constraints is known to be coNP-complete in general [CLR03a,
CM05]. This is the case even for queries with no repeated relation symbols, which is
the focus of this chapter. However, this does not necessarily preclude the existence of
classes of queries for which the problem is easier to compute. In fact, in this chapter we
characterize a large and practical class of conjunctive queries for which the problem of
computing consistent answers under key constraints is indeed tractable. Even more so,
we show that all queries in this class are first-order rewritable, and we give a linear-time
algorithm that computes the first-order rewriting. We introduce the class of queries in
Section 3.1, and we present the query rewriting algorithm in Section 3.2. The proof of
correctness of the algorithm is given in Section 3.3.
3.1 A Broad Class of First-Order Rewritable Queries
3.1.1 Notation for Conjunctive Queries
The results in this chapter concern a class of conjunctive queries. Conjunctive queries
[CM77, AHV95] are first-order formulas that may only have conjunctions of positive
literals and existential quantification. That is, they are formulas of the following form:
$q(\vec{z}) = \exists \vec{w}.\ R_1(\vec{x}_1, \vec{y}_1) \wedge \dots \wedge R_n(\vec{x}_n, \vec{y}_n)$
where the variables of $\vec{x}_1, \vec{y}_1, \dots, \vec{x}_n, \vec{y}_n$ appear in exactly one of $\vec{z}$ and $\vec{w}$. We will
say that the variables in $\vec{z}$ are the free variables of q, and that the variables in $\vec{w}$ are the
existentially-quantified variables of q. Even though there are no equality symbols in our
notation for conjunctive queries, their effect can be achieved by having variables appear
more than once in the queries.
Notice that in the formula above, we denote the literals as $R_i(\vec{x}_i, \vec{y}_i)$. Throughout
the thesis, we will use the convention of using capital letters (usually R, S and T) to
denote literals of a query. Notice that two distinct literals $R_i$ and $R_j$ may be on the same
relation symbol r (although most results in this thesis are for queries without repeated
relation symbols, in which each literal corresponds to a distinct relation).
We will adopt the convention of using $\vec{x}$ to denote variables and constants of a literal
that appear at a position corresponding to key attributes of the relation symbol of the
literal, and $\vec{y}$ for variables and constants that appear at the position of nonkey attributes
of the relation symbol of the literal.
We will say that there is a join on a variable w if w appears in two literals $R_i(\vec{x}_i, \vec{y}_i)$
and $R_j(\vec{x}_j, \vec{y}_j)$ such that $i \neq j$. If w occurs in $\vec{y}_i$ and $\vec{y}_j$, we say that there is a
nonkey-to-nonkey join on w; if w occurs in $\vec{y}_i$ and $\vec{x}_j$, we say that there is a nonkey-to-key join;
and if w occurs in $\vec{x}_i$ and $\vec{x}_j$, we say that there is a key-to-key join.
3.1.2 Join Graph
Before introducing the class of queries handled by our algorithm, let us get some insight
from queries that are not considered by our algorithm because (unless P=NP) there is
no first-order rewriting that computes the consistent answer (no matter what rewriting
algorithm is used). In particular, let us consider the following queries:
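$q_1 = \exists x, x', y.\ R_1(\underline{x}, y) \wedge R_2(\underline{x'}, y)$
$q_2 = \exists x, y.\ R_1(\underline{x}, y) \wedge R_2(\underline{y}, x)$
$q_3 = \exists x, x', w, w', z, z', m.\ R_1(\underline{x}, w) \wedge R_2(\underline{m, w}, z) \wedge R_3(\underline{x'}, w') \wedge R_4(\underline{m, w'}, z')$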
We will show in Chapter 5 that the problem of computing consistent answers for the
above queries is intractable. The first query consists of a join between nonkey attributes;
the second one involves a cycle of nonkey-to-key joins; and in the third, there are two
joins from nonkey variables to part, but not the entire key, of the corresponding relations.
In order to be more precise in specifying such conditions, we need the notion of the join
graph of a query, which has a node for each literal of a query. Notice that the conditions
that we just gave are concerned with joins where at least one nonkey variable is involved.
Therefore, the join graph will be a directed graph, where directionality is determined by
the nonkey variables involved in the join.
Definition 3.1 (join graph). Let q be a conjunctive query. The join graph G of q is a
directed graph such that:
• the vertices of G are the literals of q;
• there is an arc from $R_i$ to $R_j$ if $i \neq j$, and there is some variable w such that w is
  existentially-quantified in q, w occurs at the position of a nonkey attribute in $R_i$,
  and w occurs in $R_j$.
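Definition 3.1 translates almost directly into code. The Python sketch below is an illustration under an assumed encoding: each literal is a triple (name, key variables, nonkey variables), literal names are distinct, every term is a variable name (constants are not modelled), and the function name `join_graph` is hypothetical. It is applied to the queries $q_1$ and $q_2$ given earlier and to a simple chain query.

```python
def join_graph(literals, free_vars=frozenset()):
    """Arcs of the join graph of Definition 3.1, as pairs of literal names.

    Assumed encoding: each literal is (name, key_vars, nonkey_vars).
    """
    arcs = set()
    for name_i, _, nonkey_i in literals:
        # Only existentially-quantified nonkey variables create arcs.
        sources = set(nonkey_i) - set(free_vars)
        for name_j, key_j, nonkey_j in literals:
            if name_i != name_j and sources & (set(key_j) | set(nonkey_j)):
                arcs.add((name_i, name_j))
    return arcs


# q1 = exists x, x', y. R1(x, y) and R2(x', y)  (nonkey-to-nonkey join on y)
q1 = [("R1", ("x",), ("y",)), ("R2", ("x'",), ("y",))]
# q2 = exists x, y. R1(x, y) and R2(y, x)       (cycle of nonkey-to-key joins)
q2 = [("R1", ("x",), ("y",)), ("R2", ("y",), ("x",))]
# chain = exists x, y, z. R1(x, y) and R2(y, z) (one nonkey-to-key join)
chain = [("R1", ("x",), ("y",)), ("R2", ("y",), ("z",))]

print(sorted(join_graph(q1)))     # [('R1', 'R2'), ('R2', 'R1')]
print(sorted(join_graph(q2)))     # [('R1', 'R2'), ('R2', 'R1')]
print(sorted(join_graph(chain)))  # [('R1', 'R2')]
```

Passing `free_vars={"y"}` for the chain query yields no arcs, since free variables do not contribute to the join graph.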
Notice that key-to-key joins do not introduce any arcs to the join graph. Since the
class of first-order rewritable queries that we will present shortly is defined in terms of
the join graph, its queries can have arbitrary key-to-key joins. Further, the free variables
of a query do not introduce arcs to the join graph. As a special case, if all the variables
of a query are free, then its join graph has no arcs. Such queries correspond to the
class of quantifier-free queries, and have already been shown to be first-order rewritable
[ABC99]. If we think in terms of equivalent SQL queries, the fact that all variables are
free means that every attribute of every relation in the from clause must appear in the
select clause [footnote 1]. This is a strong condition which restricts the practical applicability of
the class. As an empirical observation, none of the queries in the TPC-H specification
[TPC03], the industry standard for decision support systems, satisfy this restriction. For
this reason, we will focus on a class of conjunctive queries that may have existential
quantification (in relational algebra terms, arbitrary projections). Handling queries with
existentially-quantified variables is a major challenge, which we address in this chapter.
In Figure 3.1, we show the join graphs for $q_1$ and $q_2$ (we label the arcs with the variable
involved in the joins for illustration purposes). Observe in the figure that both join graphs
have a cycle. For our rewriting algorithm, we will focus on queries that have an acyclic
join graph. Additionally, when we consider how two literals $R_i$ and $R_j$ are joined, we will
require that if any of the key attributes of $R_i$ are joined with a nonkey attribute of $R_j$,
then all of the key attributes of $R_i$ join with nonkey attributes of $R_j$. We will then say
that the query has only full nonkey-to-key joins. For example, in the query $q_3$ above, of
the form $\exists x, x', w, w', z, z', m.\ R_1(\underline{x}, w) \wedge R_2(\underline{m, w}, z) \wedge R_3(\underline{x'}, w') \wedge R_4(\underline{m, w'}, z')$, the joins
between $R_1$ and $R_2$, and between $R_3$ and $R_4$, are not full since they do not involve the
entire key of $R_2$ and $R_4$, respectively.

Footnote 1: The only exceptions are the attributes that are equated in the where clause. In that case,
only one of the equated attributes needs to appear in the select clause.
Definition 3.2. Let q be a conjunctive query. Let $R_i(\vec{x}_i, \vec{y}_i)$ and $R_j(\vec{x}_j, \vec{y}_j)$ be a pair of
literals of q. We say that there is a full nonkey-to-key join from $R_i$ to $R_j$ if every variable
of $\vec{x}_j$ appears in $\vec{y}_i$.
We observe that if G is an acyclic join graph for a query all of whose nonkey-to-key
joins are full, then G must be a forest. We show this with the following proposition.
Proposition 3.3. Let q be a query all of whose nonkey-to-key joins are full. Let G be
the join graph of q. If G is acyclic, then G is a forest.
Proof. Assume towards a contradiction that G is a directed acyclic graph that is not a
forest. Then, there is a node v in G that receives arcs from two different nodes $v_i$ and $v_j$
of G. Let $R(\vec{x}, \vec{y})$, $R_i(\vec{x}_i, \vec{y}_i)$, and $R_j(\vec{x}_j, \vec{y}_j)$ be the literals at the nodes of v, $v_i$, and $v_j$,
respectively. Since there are arcs from $v_i$ and $v_j$ to v, there are variables $w_i$ and $w_j$ in
$\vec{y}_i$ and $\vec{y}_j$, respectively, that appear in R. Since G is acyclic, $w_i$ and $w_j$ must appear in
$\vec{x}$. Also, $w_j$ cannot appear in a nonkey position of $R_i$ (or, otherwise, there would be a
cycle between the nodes $v_i$ and $v_j$). Since there is a nonkey-to-key join from $R_i$ to R on
variable $w_i$, and variable $w_j$ does not occur at a nonkey position of $R_i$, the join is not
full; contradiction.
3.1.3 The Class $\mathcal{C}_{forest}$ of First-Order Rewritable Queries
We will now characterize a broad class of conjunctive queries for which the problem of
computing consistent answers under key constraints is tractable and first-order rewritable.
The characterization is given in terms of the join graph of the queries. In particular, we
will require three conditions. First, all the nonkey-to-key joins of the query must be full.
Second, the join graph must be a forest. As we showed in Proposition 3.3, this includes
all queries with full nonkey-to-key joins with acyclic join graph. Finally, the query should
have no repeated relation symbols. We call this class $\mathcal{C}_{forest}$ since we require the join
graph of its queries to be a forest, and we give the formal definition next.
Definition 3.4. Let $q$ be a conjunctive query without repeated relation symbols and all of whose nonkey-to-key joins are full. Let $G$ be the join graph of $q$. We say that $q \in \mathcal{C}_{forest}$ if $G$ is a forest (i.e., every connected component of $G$ is a tree).
Figure 3.1: Cyclic join graphs of intractable queries
A fundamental observation about $\mathcal{C}_{forest}$ is that it is a very common, practical class of queries. Arguably, the most used form of join is from a set of nonkey attributes of one relation (which may be a foreign key; see footnote 2) to the key of another relation (which may be a primary key). Furthermore, such joins typically involve the entire primary key of the relation (and, hence, they are full joins in our terms). Finally, cycles are rarely present in the queries used in practice. Admittedly, the restriction not to have repeated relation symbols does rule out some common queries (those in which the same relation appears twice in the from clause of an SQL query). Still, many queries used in practice do not have repeated relation symbols.
As an empirical observation, only one out of 22 queries in the TPC-H specification
[TPC03], the industry standard for decision support queries, has a nonkey-to-nonkey
join. All the queries in the standard are acyclic, and all the nonkey-to-key joins of the
queries are full.
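The three conditions of Definition 3.4 can be checked mechanically. The following is a minimal sketch, not part of the system described in this thesis. It assumes that an arc goes from one literal to a different literal whenever a non-free variable at a nonkey position of the first occurs anywhere in the second (self-loops are ignored), and that constants are written with a leading quote. The names `Literal`, `join_graph`, `joins_full`, `is_forest`, and `in_c_forest` are hypothetical.

```python
from collections import namedtuple

# A literal R(key; nonkey). Terms are strings; constants start with a quote.
Literal = namedtuple("Literal", ["rel", "key", "nonkey"])

def variables(terms):
    """Return the variables among the given terms (constants are skipped)."""
    return {t for t in terms if not t.startswith("'")}

def join_graph(literals, free_vars=()):
    """Arcs (i, j): a non-free nonkey variable of literal i occurs in literal j."""
    arcs = set()
    for i, a in enumerate(literals):
        for j, b in enumerate(literals):
            if i != j:
                shared = variables(a.nonkey) - set(free_vars)
                if shared & variables(b.key + b.nonkey):
                    arcs.add((i, j))
    return arcs

def joins_full(literals, free_vars=()):
    """Every nonkey-to-key join must cover the whole key of its target."""
    for i, a in enumerate(literals):
        for j, b in enumerate(literals):
            if i == j:
                continue
            joined = (variables(a.nonkey) - set(free_vars)) & variables(b.key)
            if joined and not variables(b.key) <= variables(a.nonkey):
                return False
    return True

def is_forest(n, arcs):
    """Directed forest: in-degree at most one and no directed cycle."""
    parent = {}
    for i, j in arcs:
        if j in parent:
            return False          # node j receives two arcs
        parent[j] = i
    for start in range(n):
        seen, v = set(), start
        while v in parent:        # walk up; revisiting a node means a cycle
            if v in seen:
                return False
            seen.add(v)
            v = parent[v]
    return True

def in_c_forest(literals, free_vars=()):
    rels = [lit.rel for lit in literals]
    return (len(set(rels)) == len(rels)
            and joins_full(literals, free_vars)
            and is_forest(len(literals), join_graph(literals, free_vars)))

# employee(e; d, c), city(c; p), prov(p; 'Canada'), dept(d; 'Peter'), e free
q4 = [Literal("employee", ("e",), ("d", "c")),
      Literal("city", ("c",), ("p",)),
      Literal("prov", ("p",), ("'Canada",)),
      Literal("dept", ("d",), ("'Peter",))]

# r1(x; y), r2(x2; y): a nonkey-to-nonkey join, hence a cycle
q_nk = [Literal("r1", ("x",), ("y",)), Literal("r2", ("x2",), ("y",))]

print(in_c_forest(q4, free_vars=("e",)), in_c_forest(q_nk))
```

The first query has the shape of a tree rooted at `employee`; the second has arcs in both directions between its two literals, so its join graph is cyclic.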
3.2 Query Rewriting Algorithm
In this section, we present the query rewriting algorithm RewriteForest that works for the class of conjunctive queries $\mathcal{C}_{forest}$ introduced in the previous section. We start the presentation with a number of examples that highlight some of the intuition underlying the algorithm.
In the next example, we illustrate the rewriting for a query consisting of only one literal. We also show that even for such a simple query, the query itself is not a rewriting for the problem of computing its own consistent answers.
Footnote 2: Notice that we are not dealing with the problem of inconsistency with respect to foreign keys, but only with respect to key dependencies.
Example 3.1. As in Example 2.1, consider a schema $\mathbf{R}$ with one relation symbol employee, which has two attributes: emplKey (the name of the employee) and salary. Furthermore, consider a set $\Sigma$ consisting of only one constraint stating that the attribute emplKey is the key of relation employee.
Let $q_1$ be a query that retrieves all the employees from the database that make a salary of 1000, expressed as $q_1(e) = \mathit{employee}(e, 1000)$. First of all, notice that $q_1$ itself is not a query rewriting of CONSISTENT($q_1$, $\Sigma$). Consider a database instance $I_1 = \{\mathit{employee}(\mathit{John}, 1000), \mathit{employee}(\mathit{John}, 2000)\}$. It is easy to see that $(\mathit{John}) \in q_1(I_1)$. However, $(\mathit{John}) \notin \mathit{consistent}_\Sigma(q_1, I_1)$ because the repair $\mathcal{R} = \{\mathit{employee}(\mathit{John}, 2000)\}$ is such that $(\mathit{John}) \notin q_1(\mathcal{R})$.
Now, consider a database instance $I_2 = \{\mathit{employee}(\mathit{John}, 1000), \mathit{employee}(\mathit{John}, 2000), \mathit{employee}(\mathit{Mary}, 1000)\}$. It is easy to see that $(\mathit{Mary}) \in \mathit{consistent}_\Sigma(q_1, I_2)$. This is because employee Mary appears with a salary of 1000 as its nonkey value, and does not appear with any other $s'$ such that $s' \neq 1000$. This can be checked with a formula $Q_{consist}(e) = \forall s'.\, \mathit{employee}(e, s') \rightarrow s' = 1000$. In fact, we will show that a query rewriting $Q_1$ for $q_1$ can be obtained as the conjunction of $q_1$ and $Q_{consist}$:
$Q_1(e) = \mathit{employee}(e, 1000) \wedge \forall s'.\, \mathit{employee}(e, s') \rightarrow s' = 1000$
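The claim of Example 3.1 can be checked by brute force on the two instances. The sketch below is only illustrative: it assumes that, under a single key constraint, a repair keeps exactly one tuple per key value (as Propositions 3.6 and 3.7 later suggest), and the function names `repairs`, `q1`, `consistent_q1`, and `Q1` are hypothetical.

```python
from itertools import product

def repairs(instance):
    """All repairs of a set of (key, value) tuples: one tuple kept per key."""
    by_key = {}
    for key, value in instance:
        by_key.setdefault(key, []).append((key, value))
    return [set(choice) for choice in product(*by_key.values())]

def q1(db):
    """q1(e) = employee(e, 1000)"""
    return {e for (e, s) in db if s == 1000}

def consistent_q1(instance):
    """Answers of q1 that hold in every repair of the instance."""
    return set.intersection(*(q1(r) for r in repairs(instance)))

def Q1(db):
    """Rewriting: employee(e, 1000) and every salary stored for e is 1000."""
    return {e for (e, s) in db
            if s == 1000 and all(s2 == 1000 for (e2, s2) in db if e2 == e)}

I1 = {("John", 1000), ("John", 2000)}
I2 = I1 | {("Mary", 1000)}

for inst in (I1, I2):
    print(sorted(q1(inst)), sorted(Q1(inst)), sorted(consistent_q1(inst)))
```

On both instances the rewriting `Q1`, evaluated directly on the inconsistent data, agrees with the answers obtained by enumerating the repairs, while the original query `q1` does not.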
In the next example, we illustrate the rewriting for a conjunctive query that has a
nonkey-to-key join.
Example 3.2. Let $\mathbf{R}$ be a schema with two relation symbols: employee and dept. Assume that employee has two attributes: emplKey (employee name) and deptFKey (department name); and dept has two attributes: deptKey (department name) and mgrName (manager name). Assume that there are two key constraints in $\Sigma$, stating that emplKey is the key of the relation employee, and deptKey is the key of relation dept.
Consider the query $q_2$ that retrieves the names of all employees whose department appears in the dept relation:
$q_2(e) = \exists d, m.\, \mathit{employee}(e, d) \wedge \mathit{dept}(d, m)$
As in the previous example, $q_2$ itself is not a query rewriting of CONSISTENT($q_2$, $\Sigma$). Consider the database instance $I_1 = \{\mathit{employee}(\mathit{John}, \mathit{Sales}), \mathit{employee}(\mathit{John}, \mathit{Engineering}), \mathit{dept}(\mathit{Sales}, \mathit{Peter})\}$. It is easy to see that $(\mathit{John}) \in q_2(I_1)$. However, we have that $(\mathit{John}) \notin \mathit{consistent}_\Sigma(q_2, I_1)$ because the repair $\mathcal{R} = \{\mathit{employee}(\mathit{John}, \mathit{Engineering}), \mathit{dept}(\mathit{Sales}, \mathit{Peter})\}$ is such that $(\mathit{John}) \notin q_2(\mathcal{R})$.
Now, consider the following database instance $I_2 = \{\mathit{employee}(\mathit{John}, \mathit{Sales}), \mathit{employee}(\mathit{John}, \mathit{Engineering}), \mathit{dept}(\mathit{Sales}, \mathit{Peter}), \mathit{dept}(\mathit{Engineering}, \mathit{Tom})\}$. It is easy to see that $(\mathit{John}) \in \mathit{consistent}_\Sigma(q_2, I_2)$. This is because every nonkey value (department name) that appears together with John in some tuple (in this case, Sales and Engineering) joins with a tuple of dept. This can be checked with a formula $Q_{consist}(e) = \forall d.\, \mathit{employee}(e, d) \rightarrow \exists m.\, \mathit{dept}(d, m)$. We will soon show that a query rewriting $Q_2$ for $q_2$ can be obtained as the conjunction of $q_2$ and $Q_{consist}$, as follows:
$Q_2(e) = \exists d, m.\, \mathit{employee}(e, d) \wedge \mathit{dept}(d, m) \wedge \forall d.\, (\mathit{employee}(e, d) \rightarrow \exists m.\, \mathit{dept}(d, m))$
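The same kind of brute-force check applies to Example 3.2, now with one key constraint per relation. This is a sketch under the assumption that repairs of the two relations can be chosen independently (one tuple per key in each relation); `repairs`, `q2`, `consistent_q2`, and `Q2` are hypothetical helper names.

```python
from itertools import product

def repairs(relation):
    """All repairs of one relation: keep one tuple per value of the first column."""
    groups = {}
    for t in relation:
        groups.setdefault(t[0], []).append(t)
    return [set(choice) for choice in product(*groups.values())]

def q2(employee, dept):
    """q2(e): employees whose department occurs as a key of dept."""
    dept_keys = {d for (d, m) in dept}
    return {e for (e, d) in employee if d in dept_keys}

def consistent_q2(employee, dept):
    """Answers of q2 that hold in every combination of repairs."""
    results = [q2(re, rd) for re in repairs(employee) for rd in repairs(dept)]
    return set.intersection(*results)

def Q2(employee, dept):
    """Rewriting: q2 holds and every department stored for e joins with dept."""
    dept_keys = {d for (d, m) in dept}
    return {e for (e, d) in employee
            if d in dept_keys
            and all(d2 in dept_keys for (e2, d2) in employee if e2 == e)}

emp = {("John", "Sales"), ("John", "Engineering")}
dept1 = {("Sales", "Peter")}
dept2 = dept1 | {("Engineering", "Tom")}

print(Q2(emp, dept1), consistent_q2(emp, dept1))
print(Q2(emp, dept2), consistent_q2(emp, dept2))
```

With `dept1` John is not a consistent answer, because one repair places him in a department that has no tuple in dept; with `dept2` both of his possible departments join, so he is.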
We now proceed to present RewriteForest, the query rewriting algorithm for queries in $\mathcal{C}_{forest}$ (shown in Figures 3.2, 3.3, and 3.4). Given a query $q$ such that $q \in \mathcal{C}_{forest}$ and a set of key constraints $\Sigma$ (containing one key per relation), RewriteForest($q$, $\Sigma$) returns a first-order rewriting $Q$ for the problem of obtaining the consistent answers for $q$ with respect to $\Sigma$. The main procedure of the algorithm is shown in Figure 3.2. The first-order rewriting $Q$ that it returns is obtained as the conjunction of the input query $q$ and a new query called $Q_{consist}$. The query $Q_{consist}$ is used to ensure that $q$ is satisfied in every repair. It is important to notice that $Q_{consist}$ will be applied directly to the inconsistent database (i.e., we will never explicitly generate the repairs). The query $Q_{consist}$ is obtained by recursion on the tree structure of each of the components of the join graph of $q$ (recall that since $q$ is in $\mathcal{C}_{forest}$, the join graph is a forest). The recursive procedure is called RewriteTree, and is shown in Figure 3.3.
The first part of RewriteTree produces a rewriting $Q_{local}$ for the literal $R(\vec{x}, \vec{y})$ at the root of the input tree. This rewriting is done independently of the rest of the query, and it is produced by the procedure RewriteLocal (shown in Figure 3.4). The query $Q_{local}$ deals with the constants that appear in $\vec{y}$ in the same way as we illustrated in Example 3.1. It also deals with the free variables that appear at nonkey positions of the query in the way that we illustrate in the next example.
Example 3.3. Consider the query $q_3$ that retrieves all employees and their salaries from the database, expressed as $q_3(e, s) = \mathit{employee}(e, s)$. Notice that the only difference with the query $q_1$ of Example 3.1 is that the constant 1000 is replaced by the free variable $s$. The algorithm RewriteLocal creates a new, universally-quantified variable $s'$ for the free variable $s$, and equates $s'$ to $s$. The resulting query rewriting for $q_3$ is the following:
$Q_3(e, s) = \mathit{employee}(e, s) \wedge \forall s'.\, \mathit{employee}(e, s') \rightarrow s' = s$

Algorithm RewriteForest($q$, $\Sigma$)
Input: $q(\vec{z})$, a query of the form $\exists \vec{w}.\, \phi(\vec{w}, \vec{z})$
       $\Sigma$, a set of key constraints, one per relation used in $q$
Output: $Q$, a first-order query that computes $\mathit{consistent}_\Sigma(q, I)$ for every database $I$

Let $G$ be the join graph of $q$
Let $T_1, \ldots, T_m$ be the connected components of $G$
for $i := 1$ to $m$ do
    Let $R_i(\vec{x}_i, \vec{y}_i)$ be the literal at the root of $T_i$
    Let $\phi_i$ be the conjunction of literals of $T_i$
    Let $\vec{w}_i = \{w : w$ is a variable that occurs in $\phi_i$ and $\vec{w}$, and $w \notin \vec{x}_i\}$
    Let $\vec{z}_i = \{z : z$ is a variable that occurs in $\phi_i$ and $\vec{z}$, and $z \notin \vec{x}_i\}$
    Let $q_i(\vec{x}_i, \vec{z}_i) = \exists \vec{w}_i.\, \phi_i(\vec{x}_i, \vec{w}_i, \vec{z}_i)$
    Let $Q_i(\vec{x}_i, \vec{z}_i)$ = RewriteTree($q_i$, $\Sigma$)
end for
Let $Q_{consist}(\vec{w}, \vec{z}) = \bigwedge_{i=1 \ldots m} Q_i(\vec{x}_i, \vec{z}_i)$
Let $Q(\vec{z}) = \exists \vec{w}.\, (\phi(\vec{w}, \vec{z}) \wedge Q_{consist}(\vec{w}, \vec{z}))$
return $Q$
Figure 3.2: Query rewriting algorithm for conjunctive queries in $\mathcal{C}_{forest}$
The second part of RewriteTree recursively creates a query $Q_i$ for each subtree $T_i$ of the tree $T$ rooted at $R$. Let $\vec{y}_0$ be the variables at nonkey positions of $R$ (excluding those that also appear in $\vec{x}$). Then, one of the conjuncts of the rewritten query returned by RewriteTree is of the form $\forall \vec{y}_0.\, R(\vec{x}, \vec{y}) \rightarrow \bigwedge_{i=1 \ldots m} Q_i(\vec{x}_i, \vec{z}_i)$. Notice that the variables of $\vec{y}_0$ (i.e., the variables at nonkey positions of the root literal $R$) are universally quantified. The intuition behind this is that, as we illustrated in Example 3.2, the query must be satisfied by all the nonkey values of a given key (in that example, all the possible departments for the given employee).
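Algorithm RewriteTree($q$, $\Sigma$)
Input: $q(\vec{x}, \vec{z})$, a query in $\mathcal{C}_{forest}$ of the form $\exists \vec{w}.\, \phi(\vec{x}, \vec{w}, \vec{z})$,
       whose join graph $T$ is a tree with root literal $R(\vec{x}, \vec{y})$
       $\Sigma$, a set of key constraints, one per relation
Output: $Q$, a first-order query that computes $\mathit{consistent}_\Sigma(q, I)$ for every database $I$

Let $T$ be the join graph of $q$
Let $R(\vec{x}, \vec{y})$ be the literal at the root node of $T$
Let $q_{local}(\vec{x}, \vec{z}) = \exists \vec{w}.\, R(\vec{x}, \vec{y})$
Let $Q_{local}(\vec{x}, \vec{z})$ = RewriteLocal($q_{local}$, $\Sigma$)
if $\phi$ has exactly one literal then
    $Q = Q_{local}$
else
    Let $R_1(\vec{x}_1, \vec{y}_1), \ldots, R_m(\vec{x}_m, \vec{y}_m)$ be the children of $R$ in $T$
    for $i := 1$ to $m$ do
        Let $T_i$ be the subtree of $T$ rooted at $R_i$
        Let $\phi_i$ be the conjunction of literals of $T_i$
        Let $\vec{w}_i = \{w : w$ is a variable that occurs in $\phi_i$ and $\vec{w}$, and $w \notin \vec{x}_i\}$
        Let $\vec{z}_i = \{z : z$ is a variable that occurs in $\phi_i$ and $\vec{z}$, and $z \notin \vec{x}_i\}$
        Let $q_i(\vec{x}_i, \vec{z}_i) = \exists \vec{w}_i.\, \phi_i(\vec{x}_i, \vec{w}_i, \vec{z}_i)$
        Let $Q_i(\vec{x}_i, \vec{z}_i)$ = RewriteTree($q_i$, $\Sigma$)
    end for
    Let $\vec{y}_0 = \{y : y$ is a variable that occurs in $\vec{y}$ and $\vec{w}$, and $y \notin \vec{x}\}$
    Let $Q(\vec{x}, \vec{z}) = Q_{local}(\vec{x}, \vec{z}) \wedge \forall \vec{y}_0.\, R(\vec{x}, \vec{y}) \rightarrow \bigwedge_{i=1 \ldots m} Q_i(\vec{x}_i, \vec{z}_i)$
end if
return $Q$
Figure 3.3: Recursive algorithm on the tree structure of the join graph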
The next example illustrates an application of the algorithm.
Algorithm RewriteLocal($q$, $\Sigma$)
Input: $q(\vec{x}, \vec{z})$, a query of the form $\exists \vec{w}.\, R(\vec{x}, \vec{y})$, where
       none of the variables of $\vec{w}$ appear in $\vec{x}$
       $\Sigma$, a set of key constraints

Let $\nu$ be an injective function mapping natural numbers to variables not present in $R$
Initialize $Eq$ as an empty set
for each position $p$ of $\vec{y}$ do
    Let $w$ be the variable that appears at position $p$ of $\vec{y}$
    Let $z = \nu(p)$
    if there is a constant $d$ at position $p$ of $\vec{y}$ then
        Add the equality $z = d$ to $Eq$
    end if
    if $w$ appears in $\vec{x}$ or $w$ appears in $\vec{z}$ then
        Add the equality $z = w$ to $Eq$
    end if
    for every position $p'$ of $\vec{y}$ such that $p \neq p'$ and $w$ occurs in $\vec{y}$ at position $p'$ do
        Let $z' = \nu(p')$
        Add the equality $z = z'$ to $Eq$
    end for
end for
if $Eq \neq \emptyset$ then
    Let $\vec{y}'$ be a vector of variables of the same arity as $\vec{y}$, and
        such that if $z$ is at position $p$ of $\vec{y}'$, then $\nu(p) = z$
    Let $Q_{eq}$ be the conjunction of the equalities of $Eq$
    Let $Q_{local}(\vec{x}, \vec{z}) = \exists \vec{w}.\, R(\vec{x}, \vec{y}) \wedge \forall \vec{y}'.\, R(\vec{x}, \vec{y}') \rightarrow Q_{eq}$
else
    Let $Q_{local}(\vec{x}, \vec{z}) = \exists \vec{w}.\, R(\vec{x}, \vec{w})$
end if
return $Q_{local}$
Figure 3.4: Query rewriting for a given literal

Example 3.4. Let $\mathbf{R}$ be a schema with four relation symbols: employee, dept, city, and prov. Assume that employee has three attributes: emplKey (employee name), cityFKey (city name), and deptFKey (department name); dept has two attributes: deptKey (department name) and mgrName (manager's name); city has two attributes: cityKey and provFKey; and prov has two attributes: provKey (province name) and countryName (country name). Assume that there are four key constraints in $\Sigma$, stating that emplKey is the key of the relation employee; cityKey is the key of relation city; deptKey is the key of the relation dept; and provKey is the key of the relation prov.
Consider a query $q_4$ that retrieves the names of all employees that are located in Canada and whose manager is Peter:
$q_4(e) = \exists d, c, m, p.\, \mathit{employee}(e, d, c) \wedge \mathit{city}(c, p) \wedge \mathit{prov}(p, \mathit{Canada}) \wedge \mathit{dept}(d, \mathit{Peter})$
The join graph of $q_4$ is given in Figure 3.5. Notice that the join graph of $q_4$ is a tree. Furthermore, $q_4$ has full nonkey-to-key joins and no repeated relation symbols. Thus, $q_4$ is in $\mathcal{C}_{forest}$.
Figure 3.5: Join graph of query $q_4$.
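Let $q'$ be the query $q'(c) = \exists p.\, \mathit{city}(c, p) \wedge \mathit{prov}(p, \mathit{Canada})$; let $q''$ be the query $q''(p) = \mathit{prov}(p, \mathit{Canada})$; and let $q^{IV}(d) = \mathit{dept}(d, \mathit{Peter})$. The first-order query rewriting $Q_4$ of $q_4$ is obtained by applying the algorithm RewriteForest($q_4$, $\Sigma$) as follows.
$Q_4(e) = \exists d, c, m, p.\, \mathit{employee}(e, d, c) \wedge \mathit{dept}(d, m) \wedge \mathit{city}(c, p) \wedge \mathit{prov}(p, \mathit{Canada}) \wedge Q_{consist}(e)$
where:
$Q_{consist}(e)$ = RewriteTree($q_4$, $\Sigma$) = $\exists d, c.\, \mathit{employee}(e, d, c) \wedge \forall d, c.\, \mathit{employee}(e, d, c) \rightarrow (Q'(c) \wedge Q^{IV}(d))$
$Q'(c)$ = RewriteTree($q'$, $\Sigma$) = $\exists p.\, \mathit{city}(c, p) \wedge \forall p.\, \mathit{city}(c, p) \rightarrow Q''(p)$
$Q''(p)$ = RewriteTree($q''$, $\Sigma$) = $\mathit{prov}(p, \mathit{Canada}) \wedge \forall w'.\, (\mathit{prov}(p, w') \rightarrow w' = \mathit{Canada})$
$Q^{IV}(d)$ = RewriteTree($q^{IV}$, $\Sigma$) = $\mathit{dept}(d, \mathit{Peter}) \wedge \forall u'.\, (\mathit{dept}(d, u') \rightarrow u' = \mathit{Peter})$
Notice the reuse of variables in the rewritten queries. In particular, each existentially-quantified variable of $q_4$ that appears at a nonkey position in a literal of $q_4$ is first existentially quantified, and then universally quantified in the rewriting $Q_4$.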
Recall that queries with repeated relation symbols are not allowed in the class $\mathcal{C}_{forest}$. We now give an example of a query with repeated relation symbols for which our algorithm fails to give the consistent answers. Although not addressed in this work, it would be interesting to characterize the class of queries with repeated relation symbols for which our algorithm is indeed correct.
Example 3.5. Let $\mathbf{R}$ be a schema with one relation symbol $r$, which has three attributes: $A$, $B$, $C$. Assume that $A$ is the key of the relation $r$. Let $q$ be the Boolean query $q = \exists x, y, z.\, r(x, y, a) \wedge r(y, z, b)$, where $a$ and $b$ are constants. If we apply our query rewriting algorithm, we obtain the following:
$Q = \exists x, y, z.\, r(x, y, a) \wedge r(y, z, b) \wedge \forall y', z'.\, (r(x, y', z') \rightarrow z' = a) \wedge \forall y.\, (r(x, y, a) \rightarrow \exists z.\, r(y, z, b) \wedge \forall z', w'.\, (r(y, z', w') \rightarrow w' = b))$
Let $I$ be the database instance $I = \{r(c, d, a), r(d, e, b), r(d, f, a), r(f, g, b)\}$. In this case, there are two repairs of $I$ with respect to $\Sigma$: $\mathcal{R}_1 = \{r(c, d, a), r(d, e, b), r(f, g, b)\}$ and $\mathcal{R}_2 = \{r(c, d, a), r(d, f, a), r(f, g, b)\}$. Clearly, $\mathcal{R}_1 \models q$ and $\mathcal{R}_2 \models q$. However, $I \not\models Q$.
We nish this section by pointing out that the complexity of the query rewriting
algorithm is linear in the number of literals of the input query. To see this, notice that
the algorithm visits each node of the join graph exactly once.
3.3 Correctness of the Algorithm
In this section, we show that the algorithm RewriteForest presented in the previous section is correct for all queries in the class $\mathcal{C}_{forest}$. In particular, we prove the following theorem.
Theorem 3.5. Let $\mathbf{R}$ be a schema. Let $\Sigma$ be a set of integrity constraints, consisting of one key dependency per relation of $\mathbf{R}$. Let $q(\vec{z})$ be a conjunctive query over $\mathbf{R}$ such that $q \in \mathcal{C}_{forest}$. Let $Q(\vec{z})$ be the first-order query returned by RewriteForest($q$, $\Sigma$). Let $I$ be an instance over $\mathbf{R}$.
Then, $\vec{t} \in Q(I)$ iff $\vec{t} \in \mathit{consistent}_\Sigma(q, I)$.
Our proof relies on a few simple properties of repairs of inconsistent databases where the set of integrity constraints contains a single key dependency per relation. We establish these properties in Section 3.3.1. In Section 3.3.2, we show a structural property of the queries in $\mathcal{C}_{forest}$ that is important in order to guarantee the correctness of the algorithms RewriteTree and RewriteForest: the literals from distinct trees of the join graph may only share variables that appear as key attributes at the root of their trees.
In Section 3.3.3, we introduce the notion of a pessimistic repair. The name comes from the fact that, for a given query $q$ and database $I$, if a tuple fails to satisfy the query on some repair, then it also fails to satisfy the query on the pessimistic repair. More precisely, for any inconsistent database $I$, there is a repair $\mathcal{P}$ such that if $\mathcal{P} \models q(\vec{c})$, then $\mathit{consistent}_\Sigma(q(\vec{c}), I) = \mathit{true}$. This enables the algorithm to independently consider each instantiation of the variables for the key of the root literal.
We then proceed to prove the correctness of the building blocks of the rewriting algorithm. First, in Section 3.3.4, we prove the correctness of the module RewriteLocal for atomic queries, that is, queries with a single literal (and hence no joins). In Section 3.3.5, we prove the correctness of the recursive algorithm RewriteTree that works on queries whose join graph is a tree. Finally, in Section 3.3.6, this is generalized to the case of queries whose join graph is a forest, which gives the correctness proof for the rewriting algorithm RewriteForest for conjunctive queries in class $\mathcal{C}_{forest}$.
3.3.1 Properties of Repairs
We first show a few important properties of repairs when the set of integrity constraints
consists of one key dependency per relation. These properties will be used throughout
the proofs of this and the next chapter.
Proposition 3.6. Let $I$ be a database instance. Let $\mathcal{R}$ be a repair of $I$ wrt $\Sigma$. Then $\mathcal{R} \subseteq I$.
Proof. Let $I'$ be an instance such that $I' \models \Sigma$. Assume that there is a tuple $\vec{t}$ such that $\vec{t} \in I'$ and $\vec{t} \notin I$. Let $I'' = I' - \{\vec{t}\}$. It is easy to see that by removing tuples from an instance, we do not introduce violations with respect to a set of key dependencies. Hence, $I'' \models \Sigma$. Clearly, $\Delta(I, I'') \subset \Delta(I, I')$. Therefore, $I'$ is not a repair of $I$ wrt $\Sigma$.
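Proposition 3.7. Let $I$ be an instance. Let $\mathcal{R}$ be a repair of $I$ wrt $\Sigma$. Let $R(\vec{c}, \vec{d})$ be a tuple of $I$. Then, there exists some $\vec{d}'$ such that $R(\vec{c}, \vec{d}')$ is a tuple of $\mathcal{R}$.
Proof. Let $I'$ be an instance such that $I' \models \Sigma$ and $R(\vec{c}, \vec{d}') \notin I'$, for every $\vec{d}'$. Let $I'' = I' \cup \{R(\vec{c}, \vec{d})\}$. Since $R(\vec{c}, \vec{d}') \notin I'$ for every $\vec{d}'$, $I'' \models \Sigma$. Clearly, $\Delta(I, I'') = \Delta(I, I') - \{R(\vec{c}, \vec{d})\}$. Since $\Delta(I, I'') \subset \Delta(I, I')$, $I'$ is not a repair of $I$ wrt $\Sigma$.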
Proposition 3.8. Let $I$ be an instance. Let $R(\vec{c}, \vec{d})$ be a tuple of $I$. Then, there exists some repair $\mathcal{R}$ of $I$ such that $R(\vec{c}, \vec{d}) \in \mathcal{R}$.
Proof. Let 1
be a repair of I wrt . By Proposition 3.7, there exists

d
such that
R(c,

d
) 1
. Let I
= 1
R(c,

d
) R(c,

d). Since 1
is a repair, 1
[= . Since I
does not introduce any violation to the key dependencies of , I
[= . Assume that I
is not a repair of I. Then, there exists a repair 1
of I such that (I, 1
) (I, I
).
By Proposition 3.6, 1
I, and thus I
I. Furthermore, by Proposition 3.6, 1
I.
Thus, I 1
I I
. Therefore, I
. Let I
= 1
R(c,

d) R(c,

d
).
Clearly, 1
. Thus, 1
is not a repair; contradiction.

3.3.2 A Structural Property of $\mathcal{C}_{forest}$
In the next lemma, we show a structural property of the queries in $\mathcal{C}_{forest}$ that is important in order to guarantee the correctness of the algorithm. In particular, we show that distinct trees of the join graph may only share free variables (which do not contribute arcs to the join graph) or variables that appear as key attributes at the root of their trees.
Lemma 3.9. Let $q(\vec{z})$ be a query such that $q \in \mathcal{C}_{forest}$. Let $G$ be the join graph of $q$. Let $T_i$ and $T_j$ be distinct connected components of $G$. Let $R_i(\vec{x}_i, \vec{y}_i)$ and $R_j(\vec{x}_j, \vec{y}_j)$ be the literals at the roots of $T_i$ and $T_j$, respectively. Let $w$ be a variable that occurs in a literal of both $T_i$ and $T_j$. Then, either $w$ is free ($w \in \vec{z}$) or $w$ is in the key of the roots of both trees ($w \in \vec{x}_i \cap \vec{x}_j$).
Proof. Let $\vec{w}_i = \{w : w$ is a variable that occurs in some literal of $T_i$, $w \notin \vec{x}_i$ and $w \notin \vec{z}\}$. Let $\vec{w}_j = \{w : w$ is a variable that occurs in some literal of $T_j$, $w \notin \vec{x}_j$ and $w \notin \vec{z}\}$. Assume that there is some variable $w$ such that $w$ appears in $\vec{w}_i$ and $\vec{w}_j$. Let $S_1(\vec{u}_1, \vec{v}_1)$ and $S_2(\vec{u}_2, \vec{v}_2)$ be literals of $T_i$ and $T_j$, respectively, such that $w$ appears in $S_1$ and $S_2$.
We must now consider the next two cases. First, suppose that $w$ occurs in $\vec{v}_1$. Then, by definition of join graph, there is an arc from $S_1$ to $S_2$ in $G$. But $S_1$ and $S_2$ are in distinct connected components of $G$; contradiction. Second, suppose that $w$ occurs in $\vec{u}_1$. By definition of $\vec{w}_i$, $S_1$ is not at the root of $T_i$ (i.e., $S_1 \neq R_i$). Hence, there must be a nonkey-to-key join from another literal, $S_3(\vec{u}_3, \vec{v}_3)$, in $T_i$ to $S_1$. Since $q$ is in $\mathcal{C}_{forest}$, all the nonkey-to-key joins of $q$ are full. Thus, the variable $w$ also appears in a nonkey position in $\vec{v}_3$. Hence, there must be an arc in the join graph from $S_3$ to $S_2$. But $S_2$ and $S_3$ are in distinct connected components of $G$; contradiction.
3.3.3 A Pessimistic Repair
In this subsection, we introduce the notion of a pessimistic repair. The name comes from the fact that, for a given query $q$ (in a class that we will define shortly) and database $I$, if a tuple fails to satisfy the query on some repair, then it also fails to satisfy the query on the pessimistic repair. More precisely, for every inconsistent database $I$, there is a repair $\mathcal{P}$ such that if $\vec{c} \in q(\mathcal{P})$, then $\vec{c} \in \mathit{consistent}_\Sigma(q, I)$. This is a fundamental property for the following reason. Consider a Boolean query $q = \exists \vec{x}, \vec{w}.\, \phi(\vec{x}, \vec{w})$ and a query $q'(\vec{x}) = \exists \vec{w}.\, \phi(\vec{x}, \vec{w})$. That is, $q$ and $q'$ have the same literals, but some of the (existentially-quantified) variables of $q$ are free in $q'$. Suppose that we would like to check whether $\mathit{consistent}_\Sigma(q, I) = \mathit{true}$. This holds if, for every repair $\mathcal{R}$ of $I$, $\mathcal{R} \models q$. In particular, since $\mathcal{P}$ is a repair of $I$, $\mathcal{P} \models q$. Thus, there is some $\vec{c}$ such that $\vec{c} \in q'(\mathcal{P})$. By Lemma 3.10 below, it follows that $\vec{c} \in \mathit{consistent}_\Sigma(q', I)$. This property will be exploited in the design of our algorithms in order to check the consistency of each tuple of $\vec{x}$ independently. Notice that the property does not hold in general for conjunctive queries, as we show in the next example. However, it does hold for the queries that satisfy the conditions of Lemma 3.10.
Example 3.6. Consider a schema $\mathbf{R}$ with two binary relations $r_1$ and $r_2$. Consider a set $\Sigma$ that consists of a key dependency for $r_1$ and a key dependency for $r_2$ (the key dependencies will be obvious from the queries). Let $q_{nk}$ be the Boolean query $\exists x, x', y.\, r_1(x, y) \wedge r_2(x', y)$. Notice that $q_{nk}$ is not in $\mathcal{C}_{forest}$ because it contains a nonkey-to-nonkey join. Let $I$ be an instance such that $I = \{r_1(a_1, b_1), r_1(a_1, b_2), r_1(a_2, b_3), r_1(a_2, b_4), r_1(a_3, b_5), r_1(a_3, b_3), r_2(c_1, b_1), r_2(c_1, b_3), r_2(c_2, b_4), r_2(c_2, b_5), r_2(c_3, b_2), r_2(c_3, b_3)\}$. It can be checked that for every repair $\mathcal{R}$ of $I$, $\mathcal{R} \models q_{nk}$.
Now, consider the query $q'_{nk}(x) = \exists x', y.\, r_1(x, y) \wedge r_2(x', y)$. That is, $q_{nk}$ and $q'_{nk}$ differ only in the fact that $x$ is existentially quantified in the former, and free in the latter. Let $\mathcal{R}_1$ be a repair of $I$ such that $\mathcal{R}_1 = \{r_1(a_1, b_1), r_1(a_2, b_3), r_1(a_3, b_5), r_2(c_1, b_3), r_2(c_2, b_4), r_2(c_3, b_3)\}$. Let $\mathcal{R}_2$ be a repair of $I$ such that $\mathcal{R}_2 = \{r_1(a_1, b_1), r_1(a_2, b_3), r_1(a_3, b_5), r_2(c_1, b_1), r_2(c_2, b_4), r_2(c_3, b_2)\}$. Notice that $(a_1) \notin q'_{nk}(\mathcal{R}_1)$, $(a_2) \notin q'_{nk}(\mathcal{R}_2)$, and $(a_3) \notin q'_{nk}(\mathcal{R}_1)$. Thus, even though $\mathit{consistent}_\Sigma(q_{nk}, I) = \mathit{true}$, we have that $(a) \notin \mathit{consistent}_\Sigma(q'_{nk}, I)$, for every $a$. Therefore, it is not possible to check whether $\mathit{consistent}_\Sigma(q_{nk}, I) = \mathit{true}$ by independently checking each instantiation of the free variables of $q'_{nk}$.
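The two claims of Example 3.6 can be verified exhaustively, since the instance has only 64 repairs. This is a sketch under the assumption that each relation is keyed on its first attribute and that a repair keeps one tuple per key; `repairs` and `q_free` are hypothetical names, with `q_free` standing for $q'_{nk}$.

```python
from itertools import product

r1 = {("a1", "b1"), ("a1", "b2"), ("a2", "b3"),
      ("a2", "b4"), ("a3", "b5"), ("a3", "b3")}
r2 = {("c1", "b1"), ("c1", "b3"), ("c2", "b4"),
      ("c2", "b5"), ("c3", "b2"), ("c3", "b3")}

def repairs(rel):
    """All repairs of one relation: keep one tuple per key value."""
    groups = {}
    for t in rel:
        groups.setdefault(t[0], []).append(t)
    return [set(choice) for choice in product(*groups.values())]

def q_free(rep1, rep2):
    """Values of x with r1(x, y) and r2(x', y) for some x' and y."""
    ys = {y for (_, y) in rep2}
    return {x for (x, y) in rep1 if y in ys}

all_repairs = [(p1, p2) for p1 in repairs(r1) for p2 in repairs(r2)]

# The Boolean query holds in a repair iff the free-variable query has an answer.
boolean_consistent = all(q_free(p1, p2) for (p1, p2) in all_repairs)

# Consistent answers of the free-variable query: answers common to all repairs.
free_consistent = set.intersection(*(q_free(p1, p2) for (p1, p2) in all_repairs))

print(len(all_repairs), boolean_consistent, free_consistent)
```

Every repair satisfies the Boolean query, but no single value of $x$ works in all repairs, which is exactly the gap that the pessimistic repair closes for the queries covered by Lemma 3.10.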
The result that we give below assumes an input query $q(\vec{x})$ that is in $\mathcal{C}_{forest}$, whose join graph $T$ is a tree, and whose free variables $\vec{x}$ are exactly the variables of the key of $T$'s root. In the algorithm RewriteForest, the input query will be broken into subqueries that satisfy this condition.
Lemma 3.10. Let $q(\vec{x})$ be a query in $\mathcal{C}_{forest}$, whose join graph $T$ is a tree and where $R(\vec{x}, \vec{y})$ is the literal at the root of $T$. Let $I$ be an instance. Then, there is a repair $\mathcal{P}$ such that for all $\vec{c}$, if $\vec{c} \in q(\mathcal{P})$, then $\vec{c} \in \mathit{consistent}_\Sigma(q, I)$.
Proof. Let $\mathcal{P}$ be the instance built by invoking the procedure BuildPessimisticRepair($q$, $I$) given in Figure 3.6. Assume that $q$ is of the form $q(\vec{x}) = \exists \vec{w}.\, \phi(\vec{w}, \vec{x})$. We will prove the claim by induction on the number of literals of $\phi$.
Base case. Assume that consists of exactly one literal R(x, y). Let

t be the tuple
selected by the algorithm in the iteration for literal R and the vector of values c. Assume
towards a contradiction that consistent
( w.R(x, w)[x/c], I) = false. Then, there is

some repair 1 of I such that 1 ,[= w.R(x, y)[x/c]. Since

t I and 1 is a repair of I,
by Proposition 3.7, there is some tuple

t
in 1 and some

d
such that

t
= R(c,

d
). Since
1 ,[= w.R(x, y)[x/c], we have that
,[= w.R(x, y)[x/c].

Notice that

t and

t
can be added to / only during the iteration for the vector of

values c. Since
t [= w.R(x, y)[x/c] and
,[= w.R(x, y)[x/c], the algorithm never

selects tuple

t. But

t /; contradiction.
Inductive step. Assume that has more than one literal. Let T
1
, . . . , T
m
be the
subtrees of T such that the root of T
j
is a child of the root of T, for 1 j m. For each
1 j m, let S
j
(x
j
, y
j
) be the literal at the root of T
j
. Let
j
be the conjunction of
the literals of T
j
. Let w
j
= w : w is a variable of
j
, and w , x
j
. Let q
j
=
j
(x
j
, w
j
).
Let /
j
=BuildPessimisticRepair(
j
, I).
Assume that / [= q(x)[x/c]. Let

t be the tuple of I selected by the algorithm in
the iteration for literal R and the vector of values c. Then,

t /, and there is some
d such that

t = R(c,

d). Since / [= q(x)[x/c], we have that for every j such that
1 j m, there is some valuation for the variables of y, and some c
j
such that
(y) =

d, ( x
j
) =c
j
, and /
j
[= q
j
(x
j
)[x
j
/c
j
].
Algorithm BuildPessimisticRepair
Input: q(x), a query in c
forest
of the form w.( w, x),
I, an instance
Output: /, a repair of I
Initialize / as an empty instance
for each c such that there is some R(c,

d) in I do
if there is some

d such that R(c,

d) I,
and R(c,

d) ,[= w.R(x, y)[x/c] then
Let

t = R(c,

d)
else
Let

t be any tuple of I such that

t = R(c,

d), for some

d
end if
Add

t to /
end for
else
/* has more than one literal*/
Let S
1
, . . . , S
m
be the children of R in T
for j := 1 to m do
Let T
j
be the subtree of T whose root is S
j
Let
j
j
Let w
j
j
and w, and w , x
j
Let q
j
(x
j
) = w
j
.
j
(x
j
, w
j
)
Let /
j
= BuildPessimisticRepair(q
j
, I)
Add /
j
to /
end for

d) in I do
if there is some

d, some j, some valuation for the variables of y,
and some c
j
such that R(c,

d) I, (y) =

d, ( x
j
) =c
j
, and
/
j
,[= q
j
(x
j
)[x
j
/c
j
] then
Let

t = R(c,

d)
else
Let


t = R(c,

d), for some

d
end if
Add

t to /
end for
end if
Figure 3.6: Algorithm to construct a pessimistic repair
Assume towards a contradiction that consistent
(q(x)[x/c], I) = false. Then, there

is some repair 1 of I such that 1 ,[= q(x)[x/c]. Since

t I and 1 is a repair of I, by
Proposition 3.7, there is some tuple
in 1 and some

d
such that
= R(c,

d
). By Lemma
3.9, none of the variables of w
i
appear in w
j
, for every i and j such that i ,= j, 1 i m,
1 j m. Thus, there is some j, some valuation for the variables of y, and some tuple
of values c
j
such that 1 j m, 1 ,[= q
j
(x
j
)[x
j
/c
j
], (y) =

d
, and (x
j
) = c
j
. Thus,
consistent
(q
j
(x
j
)[x
j
/c
j
], I) = false. By inductive hypothesis /
j
,[= q
j
(x
j
)[x
j
/c
j
].
Since /
j
[= q
j
(x
j
)[x
j
/c
j
], the algorithm never selects

t in the construction of /. But
t /; contradiction.
3.3.4 Correctness of RewriteLocal
We now give a correctness proof of RewriteLocal, the module of the algorithm that handles atomic queries, that is, queries with a single literal (and hence no joins). These atomic queries may have arbitrary selections and projections on any subset of the nonkey attributes (more precisely, any of the nonkey attributes may be projected out of the query result). We consider here only equality selections, but it is quite easy to see how to extend the algorithm and the proof to more general selection conditions (including not only inequalities, but also arbitrary first-order expressions relating the variables of the literal).
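Lemma 3.11. Let $q(\vec{x}, \vec{z})$ be a query of the form $\exists \vec{w}.\, R(\vec{x}, \vec{y})$. Let $I$ be a database instance. Let $Q_{local}(\vec{x}, \vec{z})$ be the first-order query returned by RewriteLocal($q$, $\Sigma$).
Then, $(\vec{c}, \vec{t}) \in Q_{local}(I)$ iff $(\vec{c}, \vec{t}) \in \mathit{consistent}_\Sigma(q, I)$.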
Proof. () Assume that I [= Q
local
(x, z)[x/c][z/
t]. Then, there is a tuple R(c,

d) such
that R(c,

d) [= w.R(x, y)[x/c][z/
t]. Assume towards a contradiction that

consistent
( w.R(x, y)[x/c][z/
t], I) = false. Then, there is some repair 1 such that

1 ,[= w.R(x, y)[x/c][z/
t]. By Proposition 3.7, there is a tuple R(c,
) in 1.
Following the construction of Q
local
in RewriteLocal, let be an injective function
that maps natural numbers to variables not present in R. Let y
be a vector of variables
of the same arity as y and such that if z is at position p of y
, then (p) = z. Let and
be valuations for the variables of x and

y
such that (x) = c, (
) =

d,
(x) = c,
and
) =

d
.
Since R(c,

d) [= w.R(x, y)[x/c][z/
t] and R(c,
) ,[= w.R(x, y)[x/c][z/
t], there
is some variable z at some position p of

y
such that
1. (z) ,=
(z), and there is a constant at position p in y; or

2. (z) ,=
(z), and there is some variable w such that w occurs at position p of y,

and w occurs in either x or z; or
3. there are variables w and z
, and a position p
such that w occurs at position p of

y, w occurs at position p
of y, p ,= p
, z
= (p
), and
(z) ,=
(z
).
Assume (1) that there is a constant d at position p in y. Since
R(c,

d) [= w.R(x, y)[x/c][z/
t], (z) = d. Since (z) ,=
(z), there is a constant d
such that d ,= d
and
(z) = d
. Notice in the algorithm RewriteLocal that since I [=

Q
local
(x, z)[x/c][z/
t], we have that I [= y
.R(x, y
) z = d. Since 1 I, R(c,
) I.
Thus, R(c,
) [= y
.R(x, y
) z = d. Therefore,
(z) = d; contradiction.
Assume (2) that there is some variable w such that w occurs at position p of y,
and w occurs in either x or in z. Let c = (w). Since R(c,

d) [= w.R(x,

y
)[x/c][z/
t],
(z) = c. Since (z) ,=
(z),
(z) ,= c. Notice in the algorithm RewriteLocal that since

I [= Q
local
(x, z)[x/c][z/
t], we have that I [= y
.R(x, y
) z = w[w/c]. Since 1 I,
R(c,
) I. Thus, R(c,
) [= y
.R(x, y
) z = w[w/c]. Therefore,
(z) = c;
contradiction.
Assume (3) that there are variables w and z
, and a position p
such that w occurs

at position p of y, w occurs at position p
of y, p ,= p
, z
= (p
), and
(z) ,=
(z
).
Notice in the algorithm RewriteLocal that since I [= Q
local
(x, z)[x/c][z/
t], we have that

I [= y
.R(x, y
) z = z
. Since 1 I, R(c,
) I. Thus, R(c,
) [= y
.R(x, y
)
z = z
. Therefore,
(z) =
(z
); contradiction.
() Assume that consistent
(q(x, z)[x/c][z/
t], I) = true. Assume towards a con-

tradiction that I ,[= Q
local
(x, z)[x/c][z/
t]. Then, at least one of the following conditions

hold:
1. I ,[= w.R(x, y)[x/c][z/
t]; or
2. there is a constant d at position p in y and a variable z such that z = (p) and
I ,[= y
.R(x, y
) z = d[x/c][z/
t]; or
3. there is some variable w such that w occurs at position p of y, w occurs in either
x or z, and I ,[= y
.R(x, y
) z = w[x/c][z/
t]; or
4. there is some variable w that occurs at position p of y, and at a position p
of y
such that p ,= p
, (p) = z, (p
) = z
and I ,[= y
.R(x, y
) z = z
[x/c][z/
t].
Assume that I ,[= w.R(x, y)[x/c][z/
t]. Let 1 be an arbitrary repair of I. Since 1 I,

1 ,[= w.R(x, y)[x/c][z/
t]; contradiction.
Suppose that I [= w.R(x, y)[x/c][z/
t]. Furthermore, assume that there is a constant

d at position p in y and a variable z such that z = (p) and I ,[= y
.R(x, y
) z =
d[x/c][z/

d) in I such that R(c,

d) ,[= y
.R(x, y
)
z = d[x/c][z/
t]. This means that there is some constant e at position p of

d such that
d ,= e. Thus, R(c,

d) ,[= w.R(x, y)[x/c][z/
t]. By Proposition 3.8, there is a repair 1

of I such that R(c,

d) I. Assume that 1 [= w.R(x, y)[x/c][z/
t]. Let R(c,
) be a
tuple of 1 such that R(c,
) [= w.R(x, y)[x/c][z/
t]. Since 1 is a repair of I, 1 satises

the key constraints of . Thus,

d =

d
. Therefore, R(c,

d) [= w.R(x, y)[x/c][z/
t];
contradiction.
t]. Furthermore, assume that there is some

variable w such that w occurs at position p of y, w occurs in either x or z, and I ,[=
y
.R(x, y
) z = w[x/c][z/


d) ,[=
y
.R(x, y
) z = w[x/c][z/
t]. Let be a valuation for the variables of x and z such

that (x) =c and (z) =
t. Let c = (w). Then, there is some constant e at position p of
d such that c ,= e. Thus, R(c,

d) ,[= w.R(x, y)[x/c][z/
t]. By Proposition 3.8, there is a

repair 1 of I such that R(c,

t]. Let R(c,
) be
a tuple of 1 such that R(c,
) [= w.R(x, y)[x/c][z/
t]. Since 1 is a repair of I, 1 satises

the key constraints of . Thus,

d =

d
. Therefore, R(c,

d) [= w.R(x, y)[x/c][z/
t];
contradiction.
t]. Furthermore, assume that there is some

variable w that occurs at position p of y, and at a position p
of y such that p ,= p
,
(p) = z, (p
) = z
and I ,[= y
.R(x, y
) z = z
[x/c][z/
t]. Then, there is a

tuple R(c,


d) ,[= y
.R(x, y
) z = z
[x/c][z/
t]. Let
be a valuation for the variables of

y
such that (
) =

d. Then, there are con-
stants d and e at the respective positions p and p
of

d such that d ,= e. Thus,
R(c,

d) ,[= w.R(x, y)[x/c][z/
t]. By Proposition 3.8, there is a repair 1 of I such that

R(c,

t]. Let R(c,
) be a tuple of 1 such that

R(c,
) [= w.R(x, y)[x/c][z/
t]. Since 1 is a repair of I, 1 satises the key constraints

of . Thus,

d =

d
. Therefore, R(c,

d) [= w.R(x, y)[x/c][z/
t]; contradiction.
3.3.5 Correctness of RewriteTree
Consider a Boolean query $q = \exists \vec{x}, \vec{w}.\, \phi(\vec{x}, \vec{w})$ and a query
$q'(\vec{x}) = \exists \vec{w}.\, \phi(\vec{x}, \vec{w})$. That is, $q$ and $q'$ have the same literals, but some of the
(existentially-quantified) variables of $q$ are free in $q'$. In Lemma 3.10 above, we showed that if $q'$
is in a certain class of conjunctive queries, then there is a pessimistic repair such that every tuple
$\vec{c}$ in the result of $q'$ on that repair satisfies $\vec{c} \in \mathit{consistent}_\Sigma(q', I)$. We also argued
that this fact implies that, in order to check whether $\mathit{consistent}_\Sigma(q, I) = \mathit{true}$, it suffices to
find some instantiation $\vec{c}$ for the free variables of $q'$ such that $\vec{c} \in \mathit{consistent}_\Sigma(q', I)$.
The latter condition is fundamental in the design of our algorithm since it can be checked with a
first-order query directly on the inconsistent database $I$. In the next lemma, we show that the
algorithm RewriteTree, the main building block of RewriteForest, produces a first-order query that
checks precisely this condition.
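Lemma 3.12. Let $q(\vec{x}, \vec{z})$ be a query in $\mathcal{C}_{forest}$ whose join graph $T$ is a tree with root
literal $R(\vec{x}, \vec{y})$. Let $I$ be an instance. Let $Q(\vec{x}, \vec{z})$ be the first-order query returned by
RewriteTree($q$, $\Sigma$).
Then, $(\vec{c}, \vec{t}) \in Q(I)$ iff $(\vec{c}, \vec{t}) \in \mathit{consistent}_\Sigma(q, I)$.
Proof. The proof is by induction on the number of literals of $q$.
Base case. Assume that $q$ has exactly one literal. Then, $q(\vec{x}, \vec{z}) = \exists \vec{w}.\, R(\vec{x}, \vec{y})$,
and $Q$ = RewriteLocal($q$, $\Sigma$). By Lemma 3.11, we have that $I \models Q(\vec{x}, \vec{z})[\vec{x}/\vec{c}][\vec{z}/\vec{t}]$
iff $\mathit{consistent}_\Sigma(q(\vec{x}, \vec{z})[\vec{x}/\vec{c}][\vec{z}/\vec{t}], I) = \mathit{true}$.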
() Notice in the algorithm RewriteLocal that, since I [= Q
local
[],
I [= y
1
, . . . , y
m
.R(x, y)[]. Let c = (x). Then, there exists some

d such that R(c,

d) I.
Let 1 be a repair of I. By Proposition 3.7, there is some

d
such that R(c,

d
) 1.
Assume that there are no constants in y. Since all the variables of y are existentially
quantied in q
T
, R(c,

d
) [= q
T
[], and we are done.
Assume that there is some constant in y. Since all the variables of y are existentially
quantied in q
T
, in order to show that R(c,

d
) [= q
T
[], it suces to show that

d
and
y coincide in their constants. By Proposition 3.6, 1 I. Thus, R(c,

d
) I. Since
I [= Q
local
[] and R(c,

d
) I, we have that [= Q
const
[y
]. Therefore, it holds that if

there is a constant e at position i of

d
, then [= E
i
[w
i
/e], where w
i
is the variable created
in RewriteLocal for the i-th position of y. By construction of E
i
, this means that there
is a constant e at position i of y.
() Let 1 be a repair of I. Let c = (x). Since 1 [= q
T
[], there exists

d such that
R(c,

d) 1. By Proposition 3.6, 1 I. Therefore, there exists

d
such that R(c,

d
) I.
Thus, I [= y
1
, . . . , y
m
.R(x, y)[].
Assume that there is some constant in y. Let
y
be a valuation for the variables
of y
, where y
is the vector of variables created in RewriteLocal. Let

d be such that
d =
y
(y
). If R(c,

d) , I, then I [= R(x, y
) Q
const
[][
y
] because the left-hand side
of the implication is not satised. Assume R(c,

d) I. By Proposition 3.8, there exists
a repair 1 of I such that R(c,

d) 1. Since 1 [= , if R(c,

d
) 1, then

d
=

d. Since
1 [= q
T
[], R(c,

d) [= q
T
[]. Therefore, if d is a constant that appears at position i in
y, then d occurs at position i in

d. Thus, I [= Q
const
[][
y
].
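Inductive step. Let $R_1(\vec{x}_1, \vec{y}_1), \ldots, R_m(\vec{x}_m, \vec{y}_m)$ be the children of $R$ in $T$. Assume
that $q$ is of the form $\exists \vec{w}.\, \phi(\vec{w}, \vec{z})$, where $\phi$ is a conjunction of literals. For each
$1 \le i \le m$, let $T_i$ be the tree whose root is $R_i$. Let $\phi_i$ be the conjunction of the literals of $T_i$.
Let $\vec{w}_i = \{w : w$ is a variable that occurs in $\phi_i$ and $\vec{w}$, and $w \notin \vec{x}_i\}$.
Let $\vec{z}_i = \{z : z$ is a variable that occurs in $\phi_i$ and $\vec{z}$, and $z \notin \vec{x}_i\}$.
Let $q_i(\vec{x}_i, \vec{z}_i) = \exists \vec{w}_i.\, \phi_i(\vec{x}_i, \vec{w}_i, \vec{z}_i)$.
Let $Q_i(\vec{x}_i, \vec{z}_i)$ = RewriteTree($q_i$, $\Sigma$). Let $q_{local}(\vec{x}, \vec{z}) = \exists \vec{w}.\, R(\vec{x}, \vec{y})$.
Let $Q_{local}(\vec{x}, \vec{z})$ = RewriteLocal($q_{local}$, $\Sigma$).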
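($\Rightarrow$) Assume that $I \models Q(\vec{x}, \vec{z})[\vec{x}/\vec{c}][\vec{z}/\vec{t}]$. Then, there is a valuation $\theta$ for the
variables of $\phi$ such that:
1. $\theta(\vec{x}) = \vec{c}$, and
2. $\theta(\vec{z}) = \vec{t}$, and
3. $I \models Q_{local}(\vec{x}, \vec{z})[\theta]$, and
4. for every $i$ such that $1 \le i \le m$, there are $\vec{c}_i$ and $\vec{t}_i$ such that $\theta(\vec{x}_i) = \vec{c}_i$,
$\theta(\vec{z}_i) = \vec{t}_i$, and $I \models Q_i(\vec{x}_i, \vec{z}_i)[\vec{x}_i/\vec{c}_i][\vec{z}_i/\vec{t}_i]$.
Let $\mathcal{I}$ be a repair of $I$. Assume towards a contradiction that
$\mathcal{I} \not\models \exists \vec{w}.\, R(\vec{x}, \vec{y})[\vec{x}/\vec{c}][\vec{z}/\vec{t}]$.
Then, $\mathit{consistent}_\Sigma(\exists \vec{w}.\, R(\vec{x}, \vec{y})[\vec{x}/\vec{c}][\vec{z}/\vec{t}], I) = \mathit{false}$. By Lemma 3.11, we have
that $I \not\models Q_{local}(\vec{x}, \vec{z})[\theta]$; contradiction.
Assume that $\mathcal{I} \models \exists \vec{w}.\, R(\vec{x}, \vec{y})[\vec{x}/\vec{c}][\vec{z}/\vec{t}]$. By Lemma 3.9, none of the variables of
$\vec{w}_i$ appear in $\vec{w}_j$, for every $i$ and $j$ such that $i \ne j$, $1 \le i \le m$, $1 \le j \le m$. Then,
$\mathcal{I} \not\models q_i(\vec{c}_i, \vec{t}_i)$ for some $i$ such that $1 \le i \le m$. Thus,
$\mathit{consistent}_\Sigma(q_i(\vec{x}_i, \vec{z}_i)[\vec{x}_i/\vec{c}_i][\vec{z}_i/\vec{t}_i], I) = \mathit{false}$.
By inductive hypothesis, $I \not\models Q_i(\vec{x}_i, \vec{z}_i)[\vec{x}_i/\vec{c}_i][\vec{z}_i/\vec{t}_i]$; contradiction.
($\Leftarrow$) Assume that $\mathit{consistent}_\Sigma(q(\vec{x}, \vec{z})[\vec{x}/\vec{c}][\vec{z}/\vec{t}], I) = \mathit{true}$. Assume towards a
contradiction that $I \not\models Q(\vec{x}, \vec{z})[\vec{x}/\vec{c}][\vec{z}/\vec{t}]$. Let $\theta$ be a valuation for the variables of
$\phi$ such that $\theta(\vec{x}) = \vec{c}$ and $\theta(\vec{z}) = \vec{t}$. By Lemma 3.9, none of the variables of $\vec{w}_i$ appear in
$\vec{w}_j$, for every $i$ and $j$ such that $i \ne j$, $1 \le i \le m$, $1 \le j \le m$. Then, either
(1) $I \not\models Q_{local}(\vec{x}, \vec{z})[\theta]$; or (2) there is some $i$ such that $I \not\models Q_i(\vec{x}_i, \vec{z}_i)[\theta]$.
Assume that $I \not\models Q_{local}(\vec{x}, \vec{z})[\theta]$. By Lemma 3.11,
$\mathit{consistent}_\Sigma(\exists \vec{w}.\, R(\vec{x}, \vec{y})[\vec{x}/\vec{c}][\vec{z}/\vec{t}], I) = \mathit{false}$. Thus, it is the case that
$\mathit{consistent}_\Sigma(q(\vec{x}, \vec{z})[\vec{x}/\vec{c}][\vec{z}/\vec{t}], I) = \mathit{false}$; contradiction.
Assume that there is some $i$ such that $I \not\models Q_i(\vec{x}_i, \vec{z}_i)[\theta]$. By inductive hypothesis,
$\mathit{consistent}_\Sigma(q_i(\vec{x}_i, \vec{z}_i)[\vec{x}_i/\vec{c}_i][\vec{z}_i/\vec{t}_i], I) = \mathit{false}$. Thus, it is the case that
$\mathit{consistent}_\Sigma(q(\vec{x}, \vec{z})[\vec{x}/\vec{c}][\vec{z}/\vec{t}], I) = \mathit{false}$; contradiction.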

3.3.6 Correctness of RewriteForest
We are now ready to give the correctness proof of our rewriting algorithm, for all queries
in class $\mathcal{C}_{forest}$. The intuition of the proof is the following. Assume that we are given
a query $q$ in $\mathcal{C}_{forest}$. Then, each of the connected components of the join graph of $q$
is a tree. Recall that RewriteTree, the algorithm for which we proved correctness in
the above lemma, requires that the input query satisfies the following conditions. First,
the join graph of the query must be a tree. Second, the free variables of the query
must include all the variables at key positions of the literal at the root of this tree.
In order to be able to use RewriteTree, RewriteForest produces a subquery for each
tree of the join graph such that the variables at the key of the corresponding tree's
root are free. In this way, a first-order rewriting can be produced for each subquery by
invoking the algorithm RewriteTree. For each $i$, let $Q_i(\vec{x}_i, \vec{z}_i)$ be the rewriting obtained
by invoking RewriteTree($q_i$, $\Sigma$). The query returned by RewriteForest has the form
$Q(\vec{z}) = \exists \vec{w}.\, (\phi(\vec{w}, \vec{z}) \wedge \bigwedge_{i=1 \ldots m} Q_i(\vec{x}_i, \vec{z}_i))$, where $\phi(\vec{w}, \vec{z})$ is the
conjunction of literals of the original query $q$, and the variables of each $\vec{x}_i$ are in $\vec{w}$. The
correctness of this formula relies on the structural property of Section 3.3.2 and the notion of a
pessimistic repair of Section 3.3.3. First, by Lemma 3.10, it suffices to find one instantiation for
the variables of each $\vec{x}_i$. Thus, the variables of $\vec{x}_i$ can be free in $Q_i$. Second, the subqueries
do not share existentially-quantified variables. This is ensured by the structural property proved
in Lemma 3.9.
Theorem 3.5. Let $\mathbf{R}$ be a schema. Let $\Sigma$ be a set of integrity constraints, consisting
of one key dependency per relation of $\mathbf{R}$. Let $q(\vec{z})$ be a conjunctive query over $\mathbf{R}$ such
that $q \in \mathcal{C}_{forest}$. Let $Q(\vec{z})$ be the first-order query returned by RewriteForest($q$, $\Sigma$). Let
$I$ be an instance over $\mathbf{R}$.
Then, $\vec{t} \in Q(I)$ iff $\vec{t} \in \mathit{consistent}_\Sigma(q, I)$.
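The right-hand side of the theorem can be made concrete with a brute-force reference computation. The following Python sketch is not the rewriting algorithm: it enumerates every repair and intersects the query results, which takes exponential time but is handy for checking a rewriting on small instances. The relations `R(x, y)` and `S(y, z)`, both keyed on their first attribute, and the query $q(z) = \exists x, y.\, R(x, y) \wedge S(y, z)$ are hypothetical choices made for illustration.

```python
from itertools import product
from collections import defaultdict

def repairs(relation):
    """All repairs of one relation whose key is its first attribute:
    keep exactly one tuple per key value."""
    groups = defaultdict(list)
    for t in relation:
        groups[t[0]].append(t)
    return [set(choice) for choice in product(*groups.values())]

def consistent_answers(R, S):
    """Answers of q(z) = exists x, y. R(x, y) AND S(y, z) that hold in every repair.
    With one key per relation, a repair of the database is a pair of per-relation repairs."""
    answers = None
    for r, s in product(repairs(R), repairs(S)):
        result = {z for (_x, y) in r for (y2, z) in s if y == y2}
        answers = result if answers is None else answers & result
    return answers

R = [("a", "b1"), ("a", "b2")]          # key value "a" is inconsistent
S_same = [("b1", "c"), ("b2", "c")]     # both choices lead to z = "c"
S_diff = [("b1", "c"), ("b2", "d")]     # the choices disagree on z
print(consistent_answers(R, S_same), consistent_answers(R, S_diff))
```

A rewriting returned by RewriteForest would be expected to return the same answers when evaluated directly on the inconsistent instance, without enumerating repairs.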
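Proof. Let $G$ be the join graph of $q$. Since $q \in \mathcal{C}_{forest}$, $G$ is a forest. Let $T_1, \ldots, T_m$ be
the connected components (trees) of $G$. Assume that $q$ is of the form $\exists \vec{w}.\, \phi(\vec{w}, \vec{z})$, where
$\phi$ is a conjunction of literals. For each $1 \le i \le m$, let $R_i(\vec{x}_i, \vec{y}_i)$ be the literal at the root
of $T_i$. Let $\phi_i$ be the conjunction of the literals of $T_i$.
Let $\vec{w}_i = \{w : w$ is a variable that occurs in $\phi_i$ and $\vec{w}$, and $w \notin \vec{x}_i\}$.
Let $\vec{z}_i = \{z : z$ is a variable that occurs in $\phi_i$ and $\vec{z}$, and $z \notin \vec{x}_i\}$.
Let $q_i(\vec{x}_i, \vec{z}_i) = \exists \vec{w}_i.\, \phi_i(\vec{x}_i, \vec{w}_i, \vec{z}_i)$. Let $Q_i(\vec{x}_i, \vec{z}_i)$ = RewriteTree($q_i$, $\Sigma$).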
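($\Rightarrow$) Assume that $I \models Q(\vec{z})[\vec{z}/\vec{t}]$. Then, there is a valuation $\theta$ for the variables of
$\phi$ such that:
1. $\theta(\vec{z}) = \vec{t}$, and
2. $I \models \phi(\vec{w}, \vec{z})[\theta]$, and
3. for every $i$ such that $1 \le i \le m$, there are $\vec{c}_i$ and $\vec{t}_i$ such that $\theta(\vec{x}_i) = \vec{c}_i$,
$\theta(\vec{z}_i) = \vec{t}_i$, and $I \models Q_i(\vec{x}_i, \vec{z}_i)[\vec{x}_i/\vec{c}_i][\vec{z}_i/\vec{t}_i]$.
Let $\mathcal{I}$ be a repair of $I$. Assume towards a contradiction that $\mathcal{I} \not\models q[\vec{z}/\vec{t}]$. Thus,
$\mathcal{I} \not\models q[\theta]$. By Lemma 3.9, none of the variables of $\vec{w}_i$ appear in $\vec{w}_j$, for every $i$ and $j$
such that $i \ne j$, $1 \le i \le m$, $1 \le j \le m$. Then,
$\mathcal{I} \not\models q_i(\vec{x}_i, \vec{z}_i)[\vec{x}_i/\vec{c}_i][\vec{z}_i/\vec{t}_i]$ for some $i$ such that $1 \le i \le m$. Thus,
$\mathit{consistent}_\Sigma(q_i(\vec{x}_i, \vec{z}_i)[\vec{x}_i/\vec{c}_i][\vec{z}_i/\vec{t}_i], I) = \mathit{false}$. By Lemma 3.12,
$I \not\models Q_i(\vec{x}_i, \vec{z}_i)[\vec{x}_i/\vec{c}_i][\vec{z}_i/\vec{t}_i]$; contradiction.
($\Leftarrow$) Assume that $\vec{t} \in \mathit{consistent}_\Sigma(q, I)$. Assume towards a contradiction that
$I \not\models Q(\vec{z})[\vec{z}/\vec{t}]$. Let $\theta$ be a valuation for the variables of $\phi$ such that $\theta(\vec{z}) = \vec{t}$.
Then, either (1) $I \not\models q(\vec{z})[\theta]$; or (2) there is some $i$ such that $I \not\models Q_i(\vec{x}_i, \vec{z}_i)[\theta]$.
We will build a repair $\mathcal{M}$ of $I$ as follows. For each $i$, let $I_i$ be the projection of
$I$ on the relation symbols of $\phi_i$. By Lemma 3.10, there is a repair $\mathcal{M}_i$ such that if
$\mathcal{M}_i \models q_i(\vec{x}_i)[\vec{x}_i/\vec{c}_i]$, then $\mathit{consistent}_\Sigma(q_i(\vec{x}_i)[\vec{x}_i/\vec{c}_i], I_i) = \mathit{true}$. We add all
the tuples of $\mathcal{M}_i$ to $\mathcal{M}$.
We now show that $\mathcal{M} \not\models q(\vec{z})[\theta]$. Assume that $I \not\models q(\vec{z})[\theta]$. Since $\mathcal{M} \subseteq I$,
$\mathcal{M} \not\models q(\vec{z})[\theta]$. Now, assume that there is some $i$ such that $1 \le i \le m$ and
$I \not\models Q_i(\vec{x}_i, \vec{z}_i)[\theta]$. By Lemma 3.12, $\mathit{consistent}_\Sigma(q_i(\vec{x}_i, \vec{z}_i)[\theta], I) = \mathit{false}$.
By Lemma 3.10, $\mathcal{M}_i \not\models q_i(\vec{x}_i, \vec{z}_i)[\theta]$. Thus, $\mathcal{M} \not\models q(\vec{z})[\theta]$.
So, for every valuation $\theta$ such that $\theta(\vec{z}) = \vec{t}$, we have that $\mathcal{M} \not\models q(\vec{z})[\theta]$. Thus,
$\vec{t} \notin \mathit{consistent}_\Sigma(q, I)$; contradiction.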

3.4 Related Work
In their seminal paper on consistent query answering, Arenas, Bertossi and Chomicki
[ABC99] propose a first-order rewriting algorithm. The algorithm applies to a broad
class of constraints but a restricted class of queries, called quantifier-free conjunctive
queries. In these queries, all variables are free (i.e., there is no existential quantification).
If we think in terms of equivalent SQL queries, the fact that all variables are free means
that every attribute of every relation in the from clause must appear in the select
clause. This is a strong restriction that rules out many practical queries. As an empirical
observation, none of the queries in the TPC-H specification [TPC03], the industry standard
for decision support systems, satisfy this restriction. Chomicki and Marcinkowski
[CM05] propose a rewriting for another restricted class, where no variables are shared
between literals (and therefore, there are no joins). In this chapter, we focused on a class
of conjunctive queries that may have existential quantification, and we argued that the
class captures many queries that arise in practice.
Except for the aforementioned work [ABC99, CM05], to the best of our knowledge,
none of the work in the consistent query answering literature has focused on first-order
rewritings. Instead, existing approaches typically produce rewritings into disjunctive logic
programs [ABC00, CB00, GZ00, FPL+01, GGZ01, LLR02, ABC03a, BB03a, BB03b, EFGL03,
CB05]. Their focus is on obtaining correct disjunctive logic programs for (usually large)
classes of queries and constraints. However, given the high complexity of disjunctive
logic programming, none of these approaches focus on tractability issues. Tractability
results have been given in the context of databases with OR-objects [IvdMV95]. As
we mentioned in Section 2.5, OR-objects can be used in some (though not all) cases
to represent databases inconsistent with respect to key constraints. To the best of our
knowledge, query rewriting has not been studied in the context of OR-objects.
Our work on first-order query rewriting has been subsequently extended by other
authors. Grieco, Lembo, Rosati and Ruzzi [GLRR05] show a query rewriting algorithm
for our class $\mathcal{C}_{forest}$ under exclusion constraints (that is, constraints which restrict values
to appear in exactly one of two relations). In a recent paper [LRR06], Lembo, Rosati,
and Ruzzi extend the class $\mathcal{C}_{forest}$ to consider queries that may have the union operation.
Chapter 4
Rewritings for Queries with
Grouping and Aggregation
In the previous chapter, we presented query rewritings for queries with set semantics and
no aggregation. However, practical query languages like SQL have bag semantics (duplicates
are not eliminated unless explicitly requested), and support aggregation functions
and grouping of results. For this reason, in this chapter we present rewritings for queries
with bag semantics, grouping, and aggregation.
4.1 Formal Language
Despite extensive research on queries with bag semantics and aggregation [CV93, IR95,
LW97, GM96, GRT99, CNS99, HLNW01, CNS03], there is no commonly agreed formal
language for this kind of queries, with different researchers proposing different (but often
equivalent) languages. For this reason, in this section, we introduce languages for
first-order aggregate queries and conjunctive aggregate queries that are influenced by the
previous proposals. The former language will be used to express our query rewritings,
whereas the latter will be used for the input queries (i.e., the queries for which we compute
consistent answers). The language of first-order aggregate queries extends the language
of first-order logic with operators for grouping and aggregation. Aggregate conjunctive
queries are a subset of first-order aggregate queries.
Our language for first-order aggregate queries is based on the one given by Cohen,
Nutt and Sagiv [CNS03], except for the fact that we use a SQL-like syntax to specify
grouping and aggregation. The language can be shown to be a subset of the aggregate
logic $\mathcal{L}_{aggr}$ introduced by Hella, Libkin, Nurmonen, and Wong [HLNW01]. We do not
explicitly provide the bag manipulation operators (such as additive union, maximum
union, etc.) that are given in bag algebras [GM96, LW97].
Bags and aggregation functions. A bag (or multiset) is a collection of elements,
each of which occurs one or more times in the collection. We will denote the multiplicity
(number of occurrences) of each element $x$ of a bag $B$ as $|x|_B$. If $S$ is a domain, we
denote by $\mathcal{B}(S)$ the set of finite bags over $S$. A $k$-ary aggregation function is a function
$F : \mathcal{B}(C^k) \to \mathbb{R}$ that maps bags of $k$-tuples of constants from some underlying domain $C$
to real numbers. In particular, we will consider the functions sum, min, and max, which
return the sum, minimum, and maximum of a bag of tuples, and the function count(*),
which returns the cardinality of a bag of tuples.
We will consider a bag-set query semantics [CV93], where relations (and their repairs)
are assumed to be sets, but the aggregate queries manipulate bags. For example,
consider a database $I = \{\mathit{employee}(\mathit{John}, 1000), \mathit{employee}(\mathit{Mary}, 1000)\}$ and a query
$q$ that retrieves the salaries (the second attribute of relation employee), expressed as
$q(s) = \exists e.\, \mathit{employee}(e, s)$. Under bag-set semantics, the result of $q(I)$ is the bag $\{1000, 1000\}$
(that is, 1000 has multiplicity two in the result).
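As a small illustration of bag-set semantics, the following Python sketch evaluates the salary query of the example on the two-tuple instance. The input is a set of facts, while the output is a bag, modelled here with `collections.Counter`. The function names are illustrative only.

```python
from collections import Counter

# The instance I from the text: a *set* of employee(name, salary) facts.
employee = {("John", 1000), ("Mary", 1000)}

def salaries_bag(instance):
    """q(s) = exists e. employee(e, s) under bag-set semantics:
    one occurrence of s per satisfying assignment of e."""
    return Counter(salary for (_name, salary) in instance)

def salaries_set(instance):
    """The same query under set semantics: duplicates collapse."""
    return {salary for (_name, salary) in instance}

print(salaries_bag(employee)[1000])   # multiplicity of 1000 in the bag
print(salaries_set(employee))
```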
Language syntax. A first-order aggregate query $q$ may be either:
1. a first-order formula; or
2. a formula of the form
    select $\vec{z}$, $F_1(\vec{v}_1), \ldots, F_m(\vec{v}_m)$
    from $q'(\vec{w}, \vec{z})$
    group by $\vec{z}$
where $q'$ is a first-order aggregate query, $\vec{w}$ and $\vec{z}$ do not share variables, $\vec{v}_1, \ldots, \vec{v}_m$
are vectors of variables from $\vec{w}$, and $F_1, \ldots, F_m$ are aggregation functions with
arities $|\vec{v}_1|, \ldots, |\vec{v}_m|$. We will say that $\vec{z}$ are the grouping variables of the query, and
$\vec{v}_1, \ldots, \vec{v}_m$ are the aggregation variables.
Language semantics. We now define how to obtain a set of tuples by applying a
first-order aggregate query $q$ to a database $I$. (Even though aggregate functions take
bags as input, the final result of a query is always a set because it has one tuple for each
group.)
If the aggregate query is just a first-order formula (Case 1 above), its semantics
corresponds to the semantics of first-order queries. If the query is of the form of Case
2 above, the aggregate query $q$ is evaluated as follows. First, we retrieve the groups that
satisfy $q'$ (i.e., all the satisfying assignments for the grouping variables $\vec{z}$). Second, for
each group $\vec{a}$ (i.e., for each instantiation of the grouping variables $\vec{z}$), we obtain the bag
of tuples $\Gamma_{\vec{a}}$ that satisfy $q'$ and whose projection on $\vec{z}$ is $\vec{a}$ (the tuples of $\Gamma_{\vec{a}}$ are on both
the grouping variables $\vec{z}$ and the other free variables $\vec{w}$ of $q'$). Third, for each group $\vec{a}$
and aggregation function $F_i$, we create a bag $B_{i,\vec{a}}$ by taking each tuple $(\vec{c}, \vec{a})$ of $\Gamma_{\vec{a}}$ and
projecting on the aggregation variables $\vec{v}_i$. Finally, we apply every aggregation function
$F_i$ to the corresponding bag $B_{i,\vec{a}}$.
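More formally, for every database instance $I$, tuple $\vec{a}$ and real numbers $b_1, \ldots, b_m$, we
say that $(\vec{a}, b_1, \ldots, b_m) \in q(I)$ if there is a set $\Gamma_{\vec{a}}$ such that:
- $I \models \exists \vec{w}.\, q'(\vec{w}, \vec{a})$, and
- $\Gamma_{\vec{a}} = \{(\vec{c}, \vec{a}) : (\vec{c}, \vec{a}) \in q'(I)\}$, and
- for every $i$ such that $1 \le i \le m$, $b_i = F_i(B_{i,\vec{a}})$, where $B_{i,\vec{a}}$ is the bag obtained by
  taking each tuple $(\vec{c}, \vec{a})$ of $\Gamma_{\vec{a}}$ and projecting on the aggregation variables $\vec{v}_i$.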
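The evaluation steps above can be sketched in Python as follows. This is only an illustration of the semantics, under the assumption that the satisfying assignments of the inner query are already available as a list of distinct dictionaries. The relation `emp(e, d, s)` and all function names are hypothetical.

```python
from collections import defaultdict

def evaluate_aggregate(assignments, group_vars, aggregates):
    """assignments: distinct satisfying assignments of the inner query (dicts).
    group_vars: names of the grouping variables.
    aggregates: list of (function over a bag of tuples, aggregation variable names)."""
    groups = defaultdict(list)
    for a in assignments:                      # steps 1 and 2: one bag of tuples per group
        groups[tuple(a[v] for v in group_vars)].append(a)
    result = set()
    for key, members in groups.items():
        values = []
        for func, agg_vars in aggregates:      # step 3: project on the aggregation variables
            bag = [tuple(m[v] for v in agg_vars) for m in members]
            values.append(func(bag))           # step 4: apply the aggregation function
        result.add(key + tuple(values))
    return result

def sum_agg(bag):
    return sum(t[0] for t in bag)              # unary sum

def count_star(bag):
    return len(bag)                            # count(*): cardinality of the bag

emp = [
    {"e": "John", "d": "sales", "s": 1000},
    {"e": "Mary", "d": "sales", "s": 2000},
    {"e": "Ali", "d": "hr", "s": 500},
]
answer = evaluate_aggregate(emp, ("d",), [(sum_agg, ("s",)), (count_star, ())])
print(sorted(answer))
```

The result is a set with one tuple per group, even though each aggregation function receives a bag.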
We now define the language of conjunctive aggregate queries as a subset of first-order
aggregate queries. A conjunctive aggregate query is a formula of the form
    select $\vec{z}$, $F_1(\vec{v}_1), \ldots, F_m(\vec{v}_m)$
    from $q'(\vec{w}, \vec{z})$
    group by $\vec{z}$
where $q'(\vec{w}, \vec{z})$ is a conjunctive query, $\vec{v}_1, \ldots, \vec{v}_m$ are vectors of variables from $\vec{w}$, and
$F_1, \ldots, F_m$ are aggregation functions of the arities of $\vec{v}_1, \ldots, \vec{v}_m$. We will say that $\vec{z}$ are
the grouping variables, and $\vec{v}_1, \ldots, \vec{v}_m$ are the aggregation variables. The semantics is the
same as for first-order aggregate queries.
As with first-order aggregate queries, the language of conjunctive aggregate queries is
influenced by previous proposals. In particular, it corresponds closely to the language
presented by Cohen, Nutt and Serebrenik [CNS99], except that we use a SQL-like syntax
instead of a Datalog syntax. It is also related to the language of real conjunctive queries
(conjunctive queries with bag semantics) introduced by Chaudhuri and Vardi [CV93],
and the class of conjunctive queries with label systems representing multisets presented
by Ioannidis and Ramakrishnan [IR95]. In the latter two cases, tuples are returned together
with their multiplicity. This can be obtained in our conjunctive aggregate queries
by using the aggregation function count(*).
4.2 Algorithms
In this section, we present query rewriting algorithms under the $\mathit{aggconsistent}_\Sigma$
semantics for a class of queries that extends the class $\mathcal{C}_{forest}$ of the previous chapter with
operators for grouping and aggregation. In Section 4.2.1, we present the rewriting algorithm
for queries with bag semantics (i.e., the count(*) operator), and in Section 4.2.2
we present the algorithm for queries with the unary aggregation functions sum, min, and
max.
4.2.1 Queries with Bag Semantics
In this subsection, we give a query rewriting algorithm for conjunctive queries with bag
semantics (i.e., the count(*) operator). We start with an example, and then give the
general algorithm. The example illustrates how we can build upon the results of the previous
chapter for rewriting conjunctive queries under set-theoretic semantics.
Example 4.1. Let $\mathbf{R}$ be a schema with one relation symbol employee. Assume that employee
has two attributes: emplKey (the name of the employee) and salary. Let $\Sigma$ be a set that
consists of only one constraint stating that emplKey is the key of relation employee.
Consider the following query $q_1$, which counts the number of occurrences of each
salary (it corresponds to query $q_3$ of Example 2.1).
    $q_1(s, v)$: select s, count(*) as v
                 from employee(e, s)
                 group by s
Let $I$ be a database instance such that
$I = \{\mathit{employee}(\mathit{John}, 1000), \mathit{employee}(\mathit{John}, 2000), \mathit{employee}(\mathit{Mary}, 1000), \mathit{employee}(\mathit{Ali}, 1000)\}$.
There are two repairs of $I$ with respect to $\Sigma$:
$\mathcal{I}_1 = \{\mathit{employee}(\mathit{John}, 1000), \mathit{employee}(\mathit{Mary}, 1000), \mathit{employee}(\mathit{Ali}, 1000)\}$ and
$\mathcal{I}_2 = \{\mathit{employee}(\mathit{John}, 2000), \mathit{employee}(\mathit{Mary}, 1000), \mathit{employee}(\mathit{Ali}, 1000)\}$.
Furthermore, $q_1(\mathcal{I}_1) = \{(1000, 3)\}$ and $q_1(\mathcal{I}_2) = \{(1000, 2), (2000, 1)\}$. By Definition 2.4,
$\mathit{aggconsistent}_\Sigma(q_1, I) = \{(1000, 2, 3)\}$. That is, the salary 1000 is an answer that appears at
least twice and at most three times in the result of applying $q_1$ on the repairs.
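The numbers in the example can be reproduced with a brute-force computation that enumerates both repairs explicitly. The Python sketch below is a reference check of the range semantics on this instance, not the rewriting developed in this chapter. The function names are illustrative.

```python
from itertools import product
from collections import Counter, defaultdict

I = [("John", 1000), ("John", 2000), ("Mary", 1000), ("Ali", 1000)]

def repairs(instance):
    """Keep exactly one tuple per key value (emplKey is the first attribute)."""
    by_key = defaultdict(list)
    for name, salary in instance:
        by_key[name].append((name, salary))
    for choice in product(*by_key.values()):
        yield set(choice)

def count_per_salary(repair):
    """q_1 on a single repair: salary -> count(*)."""
    return Counter(salary for (_name, salary) in repair)

def agg_consistent_count(instance):
    """Groups present in every repair, with the smallest and largest count over all repairs."""
    results = [count_per_salary(r) for r in repairs(instance)]
    common = set.intersection(*(set(r) for r in results))
    return {(s, min(r[s] for r in results), max(r[s] for r in results)) for s in common}

print(agg_consistent_count(I))
```

Salary 2000 is dropped because it is missing from the repair in which John earns 1000.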
Let us focus on obtaining the greatest lower bound for $q_1$. From the previous chapter,
we know how to obtain consistent answers for conjunctive queries without aggregation
under set-theoretic semantics. We would like to reuse such results here. An obvious
strategy (shown to be incorrect shortly) is to first remove grouping and aggregation
from $q_1$, obtain the consistent answers under set-theoretic semantics, and finally apply
grouping and aggregation to the intermediate result. That is, first compute the consistent
answers for the following query $q'_1(s)$:
    select s
    from employee(e, s)
We can express $q'_1$ in conjunctive query notation as follows: $q'_1(s) = \exists e.\, \mathit{employee}(e, s)$.
Let QConsistent($s$) be the first-order query obtained by applying RewriteForest($q'_1$, $\Sigma$),
the algorithm introduced in the previous chapter. Suppose that we now apply the operator
count(*) to the result of QConsistent($s$) as follows:
    select s, count(*)
    from QConsistent(s)
    group by s
It is easy to see that this strategy leads to a wrong result. Since the result of the
consistent answers to $q'_1$ ($\mathit{consistent}_\Sigma(q'_1, I)$) is $\{(1000)\}$, we would incorrectly conclude
that the greatest lower bound for 1000 is one, when in fact it is two. Clearly, the cause
for the incorrect result is that cardinalities are lost in the set-theoretic consistent answers
that we computed as an intermediate step. But, is there any way of obtaining the correct
bounds for the aggregate query, and yet be able to reuse the notion of set-theoretic
consistent answers as an intermediate step? The answer is positive: we can use a "root
key value at a time" principle. In this case, this corresponds to making the variable $e$
(for employee name) free because it is at the key position of employee(e, s), the literal
at the root (and only node) of the join graph of $q'_1$. We will obtain the consistent answer
one employee at a time in the intermediate result, and then project out the employees (since
they are not retrieved by $q_1$). The intermediate result will be guaranteed to have the correct
cardinalities despite the fact that it is obtained using set semantics. The intuitive reason
is that repairs are sets of tuples that satisfy the key constraints, and hence every employee
name appears exactly once in each repair.
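The loss of cardinalities can be seen directly on the instance of the example. The following Python sketch computes the set-theoretic consistent answers by brute force over the repairs (as a stand-in for the first-order rewriting, which is an assumption made only to keep the sketch short), counts them, and compares the outcome with the true greatest lower bound.

```python
from itertools import product
from collections import Counter, defaultdict

I = [("John", 1000), ("John", 2000), ("Mary", 1000), ("Ali", 1000)]

def repairs(instance):
    by_key = defaultdict(list)
    for name, salary in instance:
        by_key[name].append((name, salary))
    return [set(choice) for choice in product(*by_key.values())]

def consistent_salaries(instance):
    """Set-theoretic consistent answers of the salary query with e quantified."""
    return set.intersection(*({s for (_n, s) in r} for r in repairs(instance)))

def naive_lower_bound(instance):
    """Group and count the set of consistent salaries: every salary is counted once."""
    return dict(Counter(consistent_salaries(instance)))

def true_lower_bound(instance):
    """Smallest count of each consistent salary over all repairs."""
    per_repair = [Counter(s for (_n, s) in r) for r in repairs(instance)]
    return {s: min(c[s] for c in per_repair) for s in consistent_salaries(instance)}

print(naive_lower_bound(I), true_lower_bound(I))
```

The naive strategy reports a lower bound of one for salary 1000, whereas two employees (Mary and Ali) earn 1000 in every repair.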
Following the previous discussion, let $q''_1$ be the query $q'_1$, where the variable $e$ is made
free. That is, let $q''_1(e, s) = \mathit{employee}(e, s)$. The set-theoretic consistent answers for $q''_1$ are
$\mathit{consistent}_\Sigma(q''_1, I) = \{(\mathit{Mary}, 1000), (\mathit{Ali}, 1000)\}$. We can now project out the employee
names and count the number of occurrences of salary 1000, arriving at the correct lower
bound for count(*) in $q_1$.
Let us now turn our attention to the computation of the lowest upper bound of $q_1$.
Since $\mathit{aggconsistent}_\Sigma(q_1, I) = \{(1000, 2, 3)\}$, the salary 1000 is an answer that appears
at most three times in the results of applying $q_1$ to the repairs. We can use
$q''_1(e, s) = \mathit{employee}(e, s)$ to obtain the lowest upper bound of salary 1000 as follows:
    select s, count(*) as lub
    from $q''_1(e, s)$
    group by s
However, this query also retrieves the tuple (2000, 1), which should not be in the result
of $\mathit{aggconsistent}_\Sigma(q_1, I)$ because the salary 2000 does not appear in $q_1(\mathcal{I}_1)$. This means
that we must make sure that the values for the grouping variables are in the consistent
answers for $q''_1$. We can do this by employing the first-order rewriting QConsistent($e$, $s$)
of query $q''_1$, which can be obtained by invoking the algorithm RewriteForest. Now, we
can rule out 2000 from the final result because there is no tuple for salary 2000 in the
result of QConsistent($e$, $s$). This can be achieved with the following query:
    select s, count(*) as lub
    from employee(e, s) $\wedge$ $\exists e'.\,$QConsistent($e'$, s)
    group by s
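The two bounds of the example can be tried out in SQL. The sketch below uses Python's built-in `sqlite3` module. The view `qconsistent` is an assumption: it encodes the condition that an employee key is associated with a single salary, which is what a first-order rewriting of the single-literal query with both variables free is expected to check. It is not claimed to be the literal output of RewriteForest. The table and column names are illustrative.

```python
import sqlite3

I = [("John", 1000), ("John", 2000), ("Mary", 1000), ("Ali", 1000)]

def range_count(rows):
    """Return {salary: (glb, lub)} for count(*) grouped by salary."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE employee (emplKey TEXT, salary INTEGER)")
    con.executemany("INSERT INTO employee VALUES (?, ?)", rows)
    # Assumed form of QConsistent(e, s): no other tuple with the same key and a different salary.
    con.execute("""
        CREATE VIEW qconsistent AS
        SELECT DISTINCT e1.emplKey, e1.salary
        FROM employee e1
        WHERE NOT EXISTS (SELECT 1 FROM employee e2
                          WHERE e2.emplKey = e1.emplKey AND e2.salary <> e1.salary)""")
    # Greatest lower bound: count the consistent (employee, salary) pairs per salary.
    glb = dict(con.execute(
        "SELECT salary, COUNT(*) FROM qconsistent GROUP BY salary"))
    # Lowest upper bound: count all pairs whose salary is consistent for some employee.
    lub = dict(con.execute("""
        SELECT t.salary, COUNT(*)
        FROM (SELECT DISTINCT emplKey, salary FROM employee) AS t
        WHERE t.salary IN (SELECT salary FROM qconsistent)
        GROUP BY t.salary"""))
    con.close()
    return {s: (glb[s], lub[s]) for s in glb}

print(range_count(I))
```

Both queries run on the inconsistent table itself, so no repair is ever materialized.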
Query Rewriting Algorithm
In Figure 4.1, we give the rewriting algorithm for aggregate conjunctive queries with
the count(*) aggregation function. The algorithm works for queries $q$ of the form
    select $\vec{z}$, count(*)
    from $q'(\vec{z})$
    group by $\vec{z}$
where $q'$ is a conjunctive query in $\mathcal{C}_{forest}$. The reason for requiring $q'$ to be in $\mathcal{C}_{forest}$ is
that, as we motivated in the previous example, we would like to build upon the results for
first-order rewriting of conjunctive queries under set-theoretic semantics. In the previous
chapter, we showed how to obtain such rewritings for the conjunctive queries in class
$\mathcal{C}_{forest}$.
By definition, the join graph of all queries in $\mathcal{C}_{forest}$ is a forest. We can then instantiate
the values for the key attributes at each root literal of the join graph of $q'$, using the
"root key value at a time" strategy that we illustrated in the previous example. More
precisely, let $G$ be the join graph of $q'$. We will construct a conjunctive query $q''$ that
has the same literals as $q'$, but all the variables that are at the key of some root of $G$ are
free in $q''$.
Following the algorithm, let $R_1(\vec{x}_1, \vec{y}_1), \ldots, R_m(\vec{x}_m, \vec{y}_m)$ be the literals at the roots
of all trees in $G$. Let $\vec{x} = \bigcup_{i=1 \ldots m} \vec{x}_i$, and let $\vec{z}' = \vec{z} - \vec{x}$. Let $\phi(\vec{w}, \vec{z})$ be the
conjunction of literals of $q'$, and let $\vec{w}' = \vec{w} - \vec{x}$. We define $q''$ as
$q''(\vec{x}, \vec{z}') = \exists \vec{w}'.\, \phi(\vec{x}, \vec{w}', \vec{z}')$. The advantage of query $q''$ is that since the variables at
the key of all root literals are free, each tuple appears exactly once in the answer to $q''$ in the
repairs (we will show this formally in Lemma 4.4). Thus, set and bag-set semantics coincide in
the answer to $q''$. We can exploit this fact by computing the set-theoretic consistent answers for
$q''$ as an intermediate result towards producing the consistent answers to the aggregate query $q$.
The first-order query rewriting QConsistent for $q''$ is obtained by invoking the algorithm
RewriteForest given in Figure 3.2 of Chapter 3.
The greatest lower bound is computed with the following query, which counts the
number of occurrences of tuples for $\vec{z}$ (the grouping variables) in the consistent answer
to $q''$.
    QGlb($\vec{z}$, low) = select $\vec{z}$, count(*)
                   from QConsistent($\vec{x}, \vec{z}'$)
                   group by $\vec{z}$
Notice that the free variables of QConsistent, $\vec{x}$ and $\vec{z}'$, contain the variables of $\vec{z}$, but
may have additional variables. In the final result, we are projecting out these additional
variables, since they are not in the select clause of the query $q$.
The lowest upper bound is obtained by counting the number of tuples that satisfy
$q''(\vec{x}, \vec{z}')$ and checking that some instantiation of the grouping variables of $\vec{z}$ appears in the
consistent answers of $q''$. This is obtained with the query $\exists \vec{x}'.\,$QConsistent($\vec{x}, \vec{z}'$), where
$\vec{x}'$ are the variables of $\vec{x}$ that are not free variables of $q$.
    QLub($\vec{z}$, up) = select $\vec{z}$, count(*)
                  from $q''(\vec{x}, \vec{z}')$ $\wedge$ ($\exists \vec{x}'.\,$QConsistent($\vec{x}, \vec{z}'$))
                  group by $\vec{z}$

RewriteCount($q$, $\Sigma$)
Input: A query $q$ of the form
    select $\vec{z}$, count(*)
    from $q'(\vec{z})$
    group by $\vec{z}$
  where $q' \in \mathcal{C}_{forest}$, and $\Sigma$, a set of key constraints (one per relation)
Output: $Q$, an aggregate first-order query that computes $\mathit{aggconsistent}_\Sigma(q, I)$
  for every database $I$
Let $R_1(\vec{x}_1, \vec{y}_1), \ldots, R_m(\vec{x}_m, \vec{y}_m)$ be the literals at the roots of all trees of the join graph $G$ of $q'$
Let $\vec{x} = \bigcup_{i=1 \ldots m} \vec{x}_i$
Let $\vec{z}' = \vec{z} - \vec{x}$
Let $\phi(\vec{w}, \vec{z})$ be the conjunction of literals of $q'$
Let $\vec{w}' = \vec{w} - \vec{x}$
Let $q''(\vec{x}, \vec{z}') = \exists \vec{w}'.\, \phi(\vec{x}, \vec{w}', \vec{z}')$
Let QConsistent($\vec{x}, \vec{z}'$) be the query obtained by invoking RewriteForest($q''$, $\Sigma$)
Let QGlb($\vec{z}$, low) = select $\vec{z}$, count(*)
                       from QConsistent($\vec{x}, \vec{z}'$)
                       group by $\vec{z}$
Let $\vec{x}' = \vec{x} - \vec{z}$
Let QLub($\vec{z}$, up) = select $\vec{z}$, count(*)
                      from $q''(\vec{x}, \vec{z}')$ $\wedge$ ($\exists \vec{x}'.\,$QConsistent($\vec{x}, \vec{z}'$))
                      group by $\vec{z}$
Let $Q(\vec{z}$, low, up) = QGlb($\vec{z}$, low) $\wedge$ QLub($\vec{z}$, up)
return $Q$
Figure 4.1: Query rewriting algorithm for queries with count(*).
4.2.2 Queries with the sum, min, and max Functions
In Figure 4.2, we present the query rewriting algorithm for queries with the sum, min,
and max aggregation functions. The main difference with the rewritings produced by
RewriteCount is that aggregation is performed here in two levels. At the inner level of
the rewriting, we aggregate the values for $u$ (the value that is aggregated in the original
query), and we group by the key-root attributes (vector $\vec{x}$ in the figure). We then project
out the key-root attributes that are not in the select clause of the input query, and
apply the aggregation function of the input query.
For example, the greatest lower bound of the max function is computed as follows:
    QGlb($\vec{z}$, low) =
      select $\vec{z}$, max(bottom)
      from (
        select $\vec{x}, \vec{z}'$, min($u$) as bottom
        from QConsistent($\vec{x}, \vec{z}'$) $\wedge$ $q'''(\vec{x}, \vec{z}', u)$
        group by $\vec{x}, \vec{z}'$ )
      group by $\vec{z}$
Notice that, as in RewriteCount, the lower bound is obtained by selecting tuples from
QConsistent($\vec{x}, \vec{z}'$). In addition, we now have a conjunct $q'''(\vec{x}, \vec{z}', u)$, which retrieves the
values for the aggregate attribute $u$. The inner level of aggregation consists in this case
of the computation of the bottom attribute, as the minimum for the values retrieved for
$u$. The outer level applies the max function (i.e., the function of the original query) to
the values of the bottom attribute.
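The two-level aggregation for max can be sketched in Python and checked against a brute-force enumeration of repairs. The relation `emp(name, dept, salary)` with key `name`, the query "select dept, max(salary) group by dept", and the set `consistent` (an assumed stand-in for QConsistent: a key value is consistent for a department if all of its tuples agree on that department) are hypothetical choices made for illustration.

```python
from itertools import product
from collections import defaultdict

I = [("N", "A", 100), ("N", "B", 5), ("M", "A", 50), ("P", "A", 70), ("P", "A", 20)]

def repairs(instance):
    by_key = defaultdict(list)
    for t in instance:
        by_key[t[0]].append(t)
    return [list(choice) for choice in product(*by_key.values())]

def brute_force_max(instance):
    """Range of max(salary) per department over all repairs (reference semantics)."""
    per_repair = []
    for r in repairs(instance):
        best = {}
        for _name, dept, sal in r:
            best[dept] = max(sal, best.get(dept, sal))
        per_repair.append(best)
    common = set.intersection(*(set(b) for b in per_repair))
    return {(d, min(b[d] for b in per_repair), max(b[d] for b in per_repair)) for d in common}

def rewritten_max(instance):
    """Two-level aggregation evaluated directly on the inconsistent instance."""
    depts_of = defaultdict(set)
    values = defaultdict(list)             # (name, dept) -> salaries (the query with u free)
    for name, dept, sal in instance:
        depts_of[name].add(dept)
        values[(name, dept)].append(sal)
    consistent = {(n, d) for (n, d) in values if depts_of[n] == {d}}
    consistent_depts = {d for (_n, d) in consistent}
    glb, lub = {}, {}
    for (n, d), sals in values.items():
        if (n, d) in consistent:           # inner min per key value, outer max
            glb[d] = max(glb.get(d, min(sals)), min(sals))
        if d in consistent_depts:          # inner max per key value, outer max
            lub[d] = max(lub.get(d, max(sals)), max(sals))
    return {(d, glb[d], lub[d]) for d in consistent_depts}

print(brute_force_max(I), rewritten_max(I))
```

On this instance both computations yield the range from 50 to 100 for department A, and department B is excluded because it does not appear in every repair.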
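RewriteAgg($q$, $\Sigma$)
Input: A query $q$ of the form
    select $\vec{z}$, [max($u$) | min($u$) | sum($u$)]
    from $q'(\vec{z}, u)$
    group by $\vec{z}$
  where $q' \in \mathcal{C}_{forest}$
Output: $Q$, an aggregate first-order query that computes $\mathit{aggconsistent}_\Sigma(q, I)$
  for every database $I$
Let $R_1(\vec{x}_1, \vec{y}_1), \ldots, R_m(\vec{x}_m, \vec{y}_m)$ be the literals at the roots of all trees of $G$
Let $\vec{x} = \bigcup_{i=1 \ldots m} \vec{x}_i$
Let $\vec{z}' = \vec{z} - \vec{x}$
Let $\phi(\vec{w}, \vec{z}, u)$ be the conjunction of literals of $q'$
Let $\vec{w}' = \vec{w} - \vec{x}$
Let $q''(\vec{x}, \vec{z}') = \exists \vec{w}', u.\, \phi(\vec{x}, \vec{w}', \vec{z}', u)$
Let QConsistent($\vec{x}, \vec{z}'$) be the query obtained by invoking RewriteForest($q''$, $\Sigma$)
Let $q'''(\vec{x}, \vec{z}', u) = \exists \vec{w}'.\, \phi(\vec{x}, \vec{w}', \vec{z}', u)$
Let $\vec{x}' = \vec{x} - \vec{z} - u$
if the aggregate function is max then
  QGlb($\vec{z}$, low) =
    select $\vec{z}$, max(bottom)
    from (
      select $\vec{x}, \vec{z}'$, min($u$) as bottom
      from QConsistent($\vec{x}, \vec{z}'$) $\wedge$ $q'''(\vec{x}, \vec{z}', u)$
      group by $\vec{x}, \vec{z}'$ )
    group by $\vec{z}$
  QLub($\vec{z}$, up) =
    select $\vec{z}$, max(top)
    from (
      select $\vec{x}, \vec{z}'$, max($u$) as top
      from $q'''(\vec{x}, \vec{z}', u)$ $\wedge$ ($\exists \vec{x}'.\,$QConsistent($\vec{x}, \vec{z}'$))
      group by $\vec{x}, \vec{z}'$ )
    group by $\vec{z}$
endif
if the aggregate function is sum then
  QGlb($\vec{z}$, low) =
    select $\vec{z}$, sum(bottom)
    from (
      select $\vec{x}, \vec{z}'$, min($u$) as bottom
      from QConsistent($\vec{x}, \vec{z}'$) $\wedge$ $q'''(\vec{x}, \vec{z}', u)$
      group by $\vec{x}, \vec{z}'$
      having bottom $\ge$ 0
      $\cup$
      select $\vec{x}, \vec{z}'$, min($u$) as bottom
      from $q'''(\vec{x}, \vec{z}', u)$ $\wedge$ ($\exists \vec{x}'.\,$QConsistent($\vec{x}, \vec{z}'$))
      group by $\vec{x}, \vec{z}'$
      having bottom < 0 )
    group by $\vec{z}$
  QLub($\vec{z}$, up) =
    select $\vec{z}$, sum(top)
    from (
      select $\vec{x}, \vec{z}'$, max($u$) as top
      from $q'''(\vec{x}, \vec{z}', u)$ $\wedge$ ($\exists \vec{x}'.\,$QConsistent($\vec{x}, \vec{z}'$))
      group by $\vec{x}, \vec{z}'$
      having top > 0
      $\cup$
      select $\vec{x}, \vec{z}'$, max($u$) as top
      from QConsistent($\vec{x}, \vec{z}'$) $\wedge$ $q'''(\vec{x}, \vec{z}', u)$
      group by $\vec{x}, \vec{z}'$
      having top $\le$ 0 )
    group by $\vec{z}$
endif
if the aggregate function is min then
  QGlb($\vec{z}$, low) =
    select $\vec{z}$, min(bottom)
    from (
      select $\vec{x}, \vec{z}'$, min($u$) as bottom
      from $q'''(\vec{x}, \vec{z}', u)$ $\wedge$ ($\exists \vec{x}'.\,$QConsistent($\vec{x}, \vec{z}'$))
      group by $\vec{x}, \vec{z}'$ )
    group by $\vec{z}$
  QLub($\vec{z}$, up) =
    select $\vec{z}$, min(top)
    from (
      select $\vec{x}, \vec{z}'$, max($u$) as top
      from QConsistent($\vec{x}, \vec{z}'$) $\wedge$ $q'''(\vec{x}, \vec{z}', u)$
      group by $\vec{x}, \vec{z}'$ )
    group by $\vec{z}$
endif
Let $Q(\vec{z}$, low, up) = QGlb($\vec{z}$, low) $\wedge$ QLub($\vec{z}$, up)
return $Q$
Figure 4.2: Query rewriting algorithm for queries with aggregation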
4.3 Correctness of the Algorithms
In this section, we prove the correctness of the query rewriting algorithms of this chapter.
We consider the following class of queries, which we call $\mathcal{C}_{aggforest}$.
Definition 4.1. Let $q$ be an aggregate conjunctive query. We say that $q$ is in class
$\mathcal{C}_{aggforest}$ if $q$ is of the form
    select $\vec{z}$, [count(*) | $F(u)$]
    from $q'(\vec{z}, u)$
    group by $\vec{z}$
where $q' \in \mathcal{C}_{forest}$, and $F$ is one of the aggregation functions
min, max or sum.
The main result of this section is the following theorem:
Theorem 4.2. Let $\mathbf{R}$ be a schema. Let $\Sigma$ be a set of integrity constraints, consisting of
one key dependency per relation of $\mathbf{R}$. Let $q(\vec{z}, v)$ be a query in $\mathcal{C}_{aggforest}$. Let $Q(\vec{z}, l, u)$
be the first-order aggregate query returned by RewriteCount($q$, $\Sigma$) or RewriteAgg($q$, $\Sigma$)
(depending on the aggregate function of the query).
Let $I$ be an instance over $\mathbf{R}$. If $q$ has the aggregate function sum, assume that the
aggregated attribute ranges over positive numbers on $I$.
Then, for every tuple $\vec{t}$, and pair of real numbers low and up, we have that
$(\vec{t}, \mathit{low}, \mathit{up}) \in \mathit{aggconsistent}_\Sigma(q, I)$ iff $(\vec{t}, \mathit{low}, \mathit{up}) \in Q(I)$.
Notice that for the sum operator we have an additional requirement: the aggregated variable must take only positive values. The rewriting for sum, however, does produce sound bounds for arbitrary numbers (positive or negative), as we prove in Section 4.3.3.
The algorithms use the first-order query rewritings of the previous chapter as a building block. The semantics of those rewritings is set-theoretic, whereas the aggregate functions we consider in this chapter take bags as input. In Section 4.3.1, we show that for a subclass of the conjunctive queries in $\mathcal{C}_{forest}$, the multiplicity of every tuple in the query result on every repair is exactly one. Thus, for this subclass, it is not necessary to keep track of tuple multiplicities in the intermediate results. Recall that in Chapter 3, we showed that for every query $q$ in a subclass of $\mathcal{C}_{forest}$, there is a pessimistic repair $\mathcal{M}$ such that $q(\mathcal{M})$ retrieves all the consistent answers to $q$. We will use the notion of pessimistic repair to prove that the bounds produced by the rewritings are tight. We will also need the dual notion of an optimistic repair, which we introduce in Section 4.3.2. In Section 4.3.3, we show that the ranges produced by the query rewritings are sound, in the sense that the value of the aggregation function falls within the range on every repair. In Section 4.3.4, we show that the ranges produced by the query rewritings are tight, in the sense that each bound is attained in at least one repair. Finally, in Section 4.3.5 we put it all together, and give the proof of correctness of the rewritings.
4.3.1 Building Upon First-Order Rewritings
The semantics of first-order rewritings is set-theoretic, whereas aggregate functions take bags as input. In this subsection, we show that for a class of conjunctive queries that is relevant in the query rewriting algorithms, the multiplicity of every tuple in the result of applying a query to a repair is always one. As a consequence, for such queries, it suffices to obtain a set-theoretic first-order rewriting. The result of applying the first-order rewriting to the inconsistent database can be used as an intermediate step towards obtaining the consistent answers for conjunctive queries with aggregation.
The queries with the aforementioned property are the conjunctive queries in $\mathcal{C}_{forest}$ where all the variables at key positions of the roots of the join graph are free. The proof is given in Lemma 4.4. The lemma makes use of an auxiliary result, which we give next, and which focuses on queries in $\mathcal{C}_{forest}$ that satisfy the additional condition that the join graph must be a tree (instead of a forest). Intuitively, we show that in each repair $\mathcal{R}$, each tuple $\vec{t}$ in the query result is obtained due to the same set of tuples in $\mathcal{R}$. More formally, we show that if $S$ and $S'$ are sets that contain exactly one tuple per relation of $\mathcal{R}$ and such that $\vec{t} \in q(S)$ and $\vec{t} \in q(S')$, then $S' = S$.
Lemma 4.3. Let $q(\vec{z})$ be a query in $\mathcal{C}_{forest}$. Assume that the join graph $T$ of $q$ is a tree, and that all the variables at key positions of the literal at the root of $T$ are free in $q$ (that is, there is a literal $R(\vec{x}, \vec{y})$ at the root of $T$ such that $\vec{x} \subseteq \vec{z}$). Let $I$ be a database instance over the schema of $q$, and $\Sigma$ be a set consisting of at most one key dependency per relation of $q$. Let $\mathcal{R}$ be a repair of $I$ wrt $\Sigma$. Let $S$ and $S'$ be sets that contain exactly one tuple per relation of $\mathcal{R}$ and such that $\vec{t} \in q(S)$, and $\vec{t} \in q(S')$. Then, $S' = S$.
Proof. The proof is by induction on the number of literals of $q$.
Base case. Assume that $q$ has exactly one literal. Assume towards a contradiction that $S \neq S'$. Then, there are distinct tuples $\vec{t}_0$ and $\vec{t}'_0$ in $\mathcal{R}$ such that $\vec{t} \in q(\{\vec{t}_0\})$ and $\vec{t} \in q(\{\vec{t}'_0\})$. Let $R(\vec{x}, \vec{y})$ be the only literal of $q$. Since all the variables at key positions of the root literal of $T$ are free, and $\vec{z}$ are the free variables of $q$, we have that $\vec{x} \subseteq \vec{z}$. Thus, there are vectors of values $\vec{c}$, $\vec{d}$ and $\vec{d}'$ such that $\vec{d} \neq \vec{d}'$, $\vec{t}_0 = R(\vec{c}, \vec{d})$, and $\vec{t}'_0 = R(\vec{c}, \vec{d}')$. Thus, $\mathcal{R} \not\models \Sigma$. But $\mathcal{R}$ is a repair of $I$ wrt $\Sigma$; contradiction.
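Inductive step. Assume that $q$ has more than one literal. Let $R$ be a literal of $q$ that appears at a leaf of $T$ (recall that $T$ is a tree). Let $\vec{t}_0$ and $\vec{t}'_0$ be tuples of $S$ and $S'$, respectively, such that $\vec{t}_0 = R(\vec{c}, \vec{d})$ and $\vec{t}'_0 = R(\vec{c}', \vec{d}')$.
Let $M$ be a set that consists of all the tuples of $S$, except the one for literal $R$. Let $M'$ be a set that consists of all the tuples of $S'$, except the one for literal $R$. By inductive hypothesis, $M = M'$. Notice that $M$ and $M'$ are the only subsets of $S$ and $S'$, respectively, that satisfy these conditions, since $S$ and $S'$ contain exactly one tuple per relation of $\mathcal{R}$.
Let $R'(\vec{x}', \vec{y}')$ be the parent of $R$ in $T$. Then, there is a tuple $\vec{t}_1$ in $R'$ and valuations $\theta$ and $\theta'$ such that $\vec{t}_1 \in S$, $\vec{t}_1 \in S'$, $\{\vec{t}_0, \vec{t}_1\} \models (R'(\vec{x}', \vec{y}') \wedge R(\vec{x}, \vec{y}))[\vec{z}/\vec{t}][\theta]$, and $\{\vec{t}'_0, \vec{t}_1\} \models (R'(\vec{x}', \vec{y}') \wedge R(\vec{x}, \vec{y}))[\vec{z}/\vec{t}][\theta']$. Notice that $\theta(\vec{y}') = \theta'(\vec{y}')$, because both valuations map $R'(\vec{x}', \vec{y}')$ to the same tuple $\vec{t}_1$. Since $q \in \mathcal{C}_{forest}$, there is a full nonkey-to-key join from $R'$ to $R$. Thus, all the variables of $\vec{x}$ appear in $\vec{y}'$. Therefore, $\theta(\vec{x}) = \theta'(\vec{x})$; and $\vec{c} = \vec{c}'$. Assume towards a contradiction that $\vec{t}_0 \neq \vec{t}'_0$. Then, there are tuples $R(\vec{c}, \vec{d})$ and $R(\vec{c}', \vec{d}')$ in $\mathcal{R}$ such that $\vec{c} = \vec{c}'$ and $\vec{d} \neq \vec{d}'$. This means that $\mathcal{R} \not\models \Sigma$. But $\mathcal{R}$ is a repair of $I$ wrt $\Sigma$; contradiction.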
In the next lemma, we show that for queries in $\mathcal{C}_{forest}$ such that the variables at key positions of all root literals are free, the multiplicity of each tuple in the query result is exactly one.
Lemma 4.4. Let R be a schema. Let $\Sigma$ be a set of integrity constraints, consisting of one key dependency per relation of R. Let $q(\vec{z})$ be a conjunctive query over R such that $q \in \mathcal{C}_{forest}$. Let $G$ be the join graph of $q$. Let $R_1(\vec{x}_1, \vec{y}_1), \ldots, R_m(\vec{x}_m, \vec{y}_m)$ be the literals at the root of each connected component (tree) of $G$. Assume that $\vec{x}_1, \ldots, \vec{x}_m$ are free variables in $q$ (i.e., they occur in $\vec{z}$).
Let $I$ be an instance over R. Let $\mathcal{R}$ be a repair of $I$ wrt $\Sigma$. Let $B$ be a bag such that $B = q(\mathcal{R})$ under bag semantics. Let $\vec{t}$ be such that $\vec{t} \in q(\mathcal{R})$. Then, $|\vec{t}|_B = 1$.
Proof. Assume towards a contradiction that $|\vec{t}|_B > 1$. Then, there are distinct sets $S$ and $S'$ that contain exactly one tuple per literal of $q$ and such that $\vec{t} \in q(S)$, and $\vec{t} \in q(S')$.
Since $q \in \mathcal{C}_{forest}$, $G$ is a forest. For each $1 \leq i \leq m$, let $T_i$ be the tree whose root is $R_i$. Let $\phi_i(\vec{w}, \vec{z})$ be the conjunction of the literals of $T_i$. Let $q_i(\vec{z}) = \exists \vec{w}.\, \phi_i(\vec{w}, \vec{z})$. Recall that $\vec{x}_i$ (the variables at the key of the root literal of $T_i$) are free, and therefore occur in $\vec{z}$. Thus, $q_i$ satisfies the conditions of Lemma 4.3.
Since $S \neq S'$, $\vec{t} \in q(S)$, and $\vec{t} \in q(S')$, there must be some $i$ and some sets $M$ and $M'$ such that $M \neq M'$, $M \subseteq S$, $M' \subseteq S'$, $M$ and $M'$ have one tuple for each relation symbol in $\phi_i$, $\vec{t} \in q_i(M)$, and $\vec{t} \in q_i(M')$. But this contradicts Lemma 4.3 above.
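The role of the condition on root keys in Lemma 4.4 can be seen on a toy repair. The sketch below is only an illustration under assumed data: it uses two hypothetical relations R1(x, y) with key x and R2(y, w) with key y, whose join graph is a tree rooted at R1, and evaluates the join under bag semantics with and without the root key among the free variables.

```python
from collections import Counter

# A repair (hence consistent): each key value occurs once per relation.
r1 = [(1, "a"), (2, "a"), (3, "b")]   # R1(x, y), key x (root of the join tree)
r2 = [("a", 10), ("b", 10)]           # R2(y, w), key y

def bag_answers(project):
    """Evaluate R1(x, y) join R2(y, w) under bag semantics, then project."""
    return Counter(project(x, y, w)
                   for (x, y) in r1
                   for (y2, w) in r2
                   if y == y2)

# q(x, w): the key variable x of the root literal is free.
root_key_free = bag_answers(lambda x, y, w: (x, w))
# q(w): x is existentially quantified, so the lemma does not apply.
root_key_bound = bag_answers(lambda x, y, w: (w,))
```

With the root key free, every answer has multiplicity one; once x is projected out, the single answer (10,) appears three times, which is why multiplicities would then have to be tracked.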

4.3.2 An Optimistic Repair
Recall that in Chapter 3 we showed that for every query $q$ in a subclass of $\mathcal{C}_{forest}$, there is a pessimistic repair $\mathcal{M}$ such that $q(\mathcal{M})$ retrieves all the consistent answers to $q$. In Section 4.3.4, we will use $\mathcal{M}$ to prove the tightness of the query rewritings. For example, if we apply an aggregate query on $\mathcal{M}$, the value that we get for the count(*) aggregate function corresponds to the greatest lower bound computed by the rewriting produced by RewriteCount($q$, $\Sigma$).
For the lowest upper bound, we will need the notion of an optimistic repair $\mathcal{N}$. The name optimistic comes from the fact that in this repair, if a tuple $\vec{t}$ can be obtained from some repair of the inconsistent database, then the tuple is also in $q(\mathcal{N})$. In Lemma 4.6, we show the existence of such a repair.
Before proving the existence of the optimistic repair, we formally define the notion of possible answers. This notion can be considered as dual to the notion of consistent answers. While a consistent answer is one that holds in the query results obtained from all the repairs, a possible answer is one that holds in the query result from at least one repair.
Definition 4.5 (Possible Answers). Let R be a schema. Let $\Sigma$ be a set of integrity constraints. Let $I$ be an instance over R (possibly inconsistent with respect to $\Sigma$). Let $q$ be a query over R. We say that a tuple $\vec{t}$ is a possible answer for $q$ with respect to $\Sigma$ if there exists a repair $\mathcal{R}$ of $I$ with respect to $\Sigma$ such that $\vec{t} \in q(\mathcal{R})$. We denote this as $\vec{t} \in \mathit{possible}_{\Sigma}(q, I)$.
For a Boolean query $q$ over R, we say that $\mathit{possible}_{\Sigma}(q, I) = true$ if there exists a repair $\mathcal{R}$ of $I$ with respect to $\Sigma$ such that $\mathcal{R} \models q$. We say that $\mathit{possible}_{\Sigma}(q, I) = false$ if for every repair $\mathcal{R}$ of $I$ with respect to $\Sigma$, $\mathcal{R} \not\models q$.
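The duality between consistent and possible answers can be made concrete by brute force. The following sketch assumes a single relation whose key is a prefix of each tuple and a query given as a Python function over a repair; the relation Emp(name, dept) and the helper names are hypothetical. It enumerates every repair, so it is exponential and meant only to illustrate the definitions.

```python
from collections import defaultdict
from itertools import product

def repairs(instance, key_len):
    """Yield every repair: exactly one tuple per key value
    (the key is assumed to be the first key_len attributes)."""
    groups = defaultdict(list)
    for t in instance:
        groups[t[:key_len]].append(t)
    for choice in product(*groups.values()):
        yield set(choice)

def consistent_and_possible(instance, key_len, query):
    """Consistent answers hold in all repairs; possible answers in some repair."""
    answers = [query(r) for r in repairs(instance, key_len)]
    return set.intersection(*answers), set.union(*answers)

# Hypothetical relation Emp(name, dept) with key name; ann violates the key.
emp = {("ann", "sales"), ("ann", "hr"), ("bob", "hr")}

# q(dept) = exists name . Emp(name, dept)
q = lambda repair: {dept for _, dept in repair}

consistent, possible = consistent_and_possible(emp, 1, q)
```

Here "hr" is a consistent answer because bob's tuple survives in both repairs, while "sales" is only a possible answer, since it depends on which of ann's tuples is kept.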
Lemma 4.6. Let $q(\vec{x})$ be a query in $\mathcal{C}_{forest}$, whose join graph $T$ is a tree and where $R(\vec{x}, \vec{y})$ is the literal at the root of $T$. Let $I$ be an instance. Then, there is a repair $\mathcal{N}$ such that for all $\vec{c}$, if $\vec{c} \in \mathit{possible}_{\Sigma}(q, I)$, then $\vec{c} \in q(\mathcal{N})$.

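Proof. Let $\mathcal{N}$ be the instance built by BuildOptimisticRepair($q$, $I$) (the algorithm given in Figure 4.3). We will prove the claim by induction on the number of literals of $q$.
Base case. Assume that $q$ consists of exactly one literal $R(\vec{x}, \vec{y})$. Let $\vec{t}$ be the tuple selected by the algorithm in the iteration for literal $R$ and the vector of values $\vec{c}$. Assume towards a contradiction that $\mathcal{N} \not\models \exists \vec{w}.\, R(\vec{c}, \vec{y})$. Then, $\vec{t} \not\models \exists \vec{w}.\, R(\vec{c}, \vec{y})$. Since $\mathit{possible}_{\Sigma}(\exists \vec{w}.\, R(\vec{c}, \vec{y}), I) = true$, there is some repair $\mathcal{R}$ of $I$ such that $\mathcal{R} \models \exists \vec{w}.\, R(\vec{c}, \vec{y})$. Thus, there is a tuple $\vec{t}'$ such that $\vec{t}' \models \exists \vec{w}.\, R(\vec{c}, \vec{y})$. Notice that $\vec{t}$ and $\vec{t}'$ can be added to $\mathcal{N}$ only during the iteration for the vector of values $\vec{c}$. Since $\vec{t} \not\models \exists \vec{w}.\, R(\vec{c}, \vec{y})$ and $\vec{t}' \models \exists \vec{w}.\, R(\vec{c}, \vec{y})$, the algorithm never selects tuple $\vec{t}$. But $\vec{t} \in \mathcal{N}$; contradiction.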
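Inductive step. Assume that $q$ has more than one literal. Let $\phi(\vec{w}, \vec{x})$ be the conjunction of literals of $q$. Let $T_1, \ldots, T_m$ be the subtrees of $T$ such that the root of $T_j$ is a child of the root of $T$, for $1 \leq j \leq m$. For each $1 \leq j \leq m$, let $S_j(\vec{x}_j, \vec{y}_j)$ be the literal at the root of $T_j$. Let $\phi_j$ be the conjunction of the literals of $T_j$. Let $\vec{w}_j = \{w : w$ is a variable of $\phi_j$, and $w \notin \vec{x}_j\}$. Let $q_j(\vec{x}_j) = \exists \vec{w}_j.\, \phi_j(\vec{x}_j, \vec{w}_j)$. Let $\mathcal{N}_j$ = BuildOptimisticRepair($q_j$, $I$).
Assume towards a contradiction that $\vec{c} \notin q(\mathcal{N})$. Let $\vec{t}$ be the tuple of $I$ selected by the algorithm in the iteration for literal $R$ and the vector of values $\vec{c}$. Then, $\vec{t} \in \mathcal{N}$, and there is some $\vec{d}$ such that $\vec{t} = R(\vec{c}, \vec{d})$. Since $\vec{c} \notin q(\mathcal{N})$, there must be some $j$, some valuation $\theta$ for the variables of $\vec{y}$, and some $\vec{c}_j$ such that $1 \leq j \leq m$, $\theta(\vec{y}) = \vec{d}$, $\theta(\vec{x}_j) = \vec{c}_j$, and $\vec{c}_j \notin q_j(\mathcal{N}_j)$.
Since $\mathit{possible}_{\Sigma}(q(\vec{c}), I) = true$, there is some repair $\mathcal{R}$ of $I$ such that $\vec{c} \in q(\mathcal{R})$. Thus, there is some tuple $\vec{t}'$ in $\mathcal{R}$, some $\vec{d}'$, and some valuation $\theta'$ for the variables of $\vec{y}$ such that $\vec{t}' = R(\vec{c}, \vec{d}')$, $\theta'(\vec{y}) = \vec{d}'$, and the following condition holds: for every $j$ and tuple of values $\vec{c}_j$ such that $1 \leq j \leq m$ and $\theta'(\vec{x}_j) = \vec{c}_j$, we have that $\vec{c}_j \in q_j(\mathcal{R})$. Thus, $\mathit{possible}_{\Sigma}(q_j(\vec{c}_j), I) = true$. By inductive hypothesis, $\vec{c}_j \in q_j(\mathcal{N}_j)$. Thus, the algorithm selects $\vec{t}'$ in the construction of $\mathcal{N}$, rather than $\vec{t}$. But $\vec{t} \in \mathcal{N}$; contradiction.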
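Algorithm BuildOptimisticRepair
Input: $q(\vec{x})$, a query in $\mathcal{C}_{forest}$ of the form $\exists \vec{w}.\, \phi(\vec{w}, \vec{x})$,
I, a database instance
Initialize $\mathcal{N}$ as an empty instance
if $\phi$ has exactly one literal then
  for each $\vec{c}$ such that there is a tuple $R(\vec{c}, \vec{d})$ in $I$ do
    if there is some $\vec{d}$ such that $R(\vec{c}, \vec{d}) \in I$, and $R(\vec{c}, \vec{d}) \models \exists \vec{w}.\, R(\vec{c}, \vec{y})$ then
      Let $\vec{t} = R(\vec{c}, \vec{d})$
    else
      Let $\vec{t} = R(\vec{c}, \vec{d})$, for some $\vec{d}$
    end if
    Add $\vec{t}$ to $\mathcal{N}$
  end for
else
  /* $\phi$ has more than one literal */
  Let $S_1, \ldots, S_m$ be the children of $R$ in $T$
  for j := 1 to m do
    Let $T_j$ be the subtree of $T$ whose root is $S_j$
    Let $\phi_j$ be the conjunction of the literals of $T_j$
    Let $\vec{w}_j = \{w : w$ is a variable of $\phi_j$, and $w \notin \vec{x}_j\}$
    Let $q_j(\vec{x}_j) = \exists \vec{w}_j.\, \phi_j(\vec{x}_j, \vec{w}_j)$
    Let $\mathcal{N}_j$ = BuildOptimisticRepair($q_j$, $I$)
    Add $\mathcal{N}_j$ to $\mathcal{N}$
  end for
  for each $\vec{c}$ such that there is a tuple $R(\vec{c}, \vec{d})$ in $I$ do
    if there is some $\vec{d}$ and some valuation $\theta$ for the variables of $\vec{y}$ such that $R(\vec{c}, \vec{d}) \in I$, $\theta(\vec{y}) = \vec{d}$, and there is no $j$ and $\vec{c}_j$ such that $\theta(\vec{x}_j) = \vec{c}_j$ and $\vec{c}_j \notin q_j(\mathcal{N}_j)$ then
      Let $\vec{t} = R(\vec{c}, \vec{d})$
    else
      Let $\vec{t} = R(\vec{c}, \vec{d})$, for some $\vec{d}$
    end if
    Add $\vec{t}$ to $\mathcal{N}$
  end for
end if
Figure 4.3: Algorithm to build the optimistic repair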
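To make the bottom-up construction concrete, here is a minimal Python sketch for one fixed, assumed query shape, $q(x) = \exists y, w.\, R(x, y) \wedge S(y, w)$, where the key of each relation is its first attribute and S is the only child of the root R. It is not the general recursive algorithm of Figure 4.3; the relation contents and function names are hypothetical. The brute-force function is included only to check that the repair built this way yields every possible answer.

```python
from collections import defaultdict
from itertools import product

# Hypothetical instance: R(x, y) with key x, S(y, w) with key y.
R = [("c1", "d1"), ("c1", "d2"), ("c2", "d3")]
S = [("d2", "w1"), ("d2", "w2")]

def by_key(rel):
    groups = defaultdict(list)
    for t in rel:
        groups[t[0]].append(t)
    return groups

def build_optimistic_repair(r_rel, s_rel):
    # Leaf literal S: every tuple satisfies the one-literal subquery,
    # so any tuple per key value can be kept.
    n_s = [tuples[0] for tuples in by_key(s_rel).values()]
    s_keys = {t[0] for t in n_s}
    # Root literal R: prefer a tuple whose y value joins with the repair of S.
    n_r = []
    for tuples in by_key(r_rel).values():
        joining = [t for t in tuples if t[1] in s_keys]
        n_r.append(joining[0] if joining else tuples[0])
    return n_r, n_s

def q(r_rel, s_rel):
    """q(x) = exists y, w . R(x, y) and S(y, w), under set semantics."""
    s_keys = {y for y, _ in s_rel}
    return {x for x, y in r_rel if y in s_keys}

def possible_answers(r_rel, s_rel):
    """Union of q over all repairs (exponential; for checking only)."""
    out = set()
    for r in product(*by_key(r_rel).values()):
        for s in product(*by_key(s_rel).values()):
            out |= q(r, s)
    return out

n_r, n_s = build_optimistic_repair(R, S)
```

For key c1 the sketch keeps ("c1", "d2") rather than ("c1", "d1"), because only the former joins with S; as a result q evaluated on the constructed repair coincides with the set of possible answers.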
4.3.3 Sound Ranges
In this subsection, we show that the ranges produced by the query rewritings are sound,
in the sense that the value of the aggregation function falls within the returned range on
every repair.
The next lemma shows that the rewritings produced by RewriteCount compute sound
ranges.
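Lemma 4.7. Let R be a schema. Let $\Sigma$ be a set of integrity constraints, consisting of one key dependency per relation of R. Let $q(\vec{z}, v)$ be a query of the following form:
select $\vec{z}$, count(*)
from $q'(\vec{z})$
group by $\vec{z}$
where $q'(\vec{z})$ is a conjunctive query in $\mathcal{C}_{forest}$.
Let $Q$ be the first-order aggregate query returned by RewriteCount($q$, $\Sigma$). Let $I$ be a database instance over R. Let $\mathcal{R}$ be a repair of $I$ wrt $\Sigma$. Let $\vec{t}$ be a tuple, and low and up be a pair of real numbers such that $(\vec{t}, low, up) \in Q(I)$ and $\vec{t} \in \mathit{consistent}_{\Sigma}(q', I)$. Let $d$ be such that $(\vec{t}, d) \in q(\mathcal{R})$. Then, $low \leq d \leq up$.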

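Proof. Let $G$ be the join graph of $q$. Let $R_1(\vec{x}_1, \vec{y}_1), \ldots, R_m(\vec{x}_m, \vec{y}_m)$ be the literals at the roots of all trees of $G$. Let $\phi(\vec{w}, \vec{z})$ be the conjunction of literals of $q'$. Let $\vec{x} = \bigcup_{i=1 \ldots m} \vec{x}_i$, let $\vec{z}' = \vec{z} - \vec{x}$, and let $\vec{w}' = \vec{w} - \vec{x}$. Let $q'(\vec{x}, \vec{z}') = \exists \vec{w}'.\, \phi(\vec{x}, \vec{z}', \vec{w}')$. Let $\vec{x}' = \vec{x} - \vec{z}$. Let $\mathit{QConsistent}(\vec{x}, \vec{z}')$ be the query obtained by invoking RewriteForest($q'$, $\Sigma$).
Lower Bound. Since $(\vec{t}, low, up) \in Q(I)$, the lower bound low of $\vec{t}$ is computed with the following query:
QGlb($\vec{z}$, glb) = select $\vec{z}$, count(*)
from $\mathit{QConsistent}(\vec{x}, \vec{z}')$
group by $\vec{z}$
Assume towards a contradiction that $d < low$. Then, there is a tuple $(\vec{c}, \vec{t}')$ such that $(\vec{c}, \vec{t}') \in \mathit{QConsistent}(I)$ and $(\vec{c}, \vec{t}') \notin q'(\mathcal{R})$. Then, $(\vec{c}, \vec{t}') \notin \mathit{consistent}_{\Sigma}(q', I)$. By Theorem 3.5, we conclude that $(\vec{c}, \vec{t}') \notin \mathit{QConsistent}(I)$; contradiction.
Upper Bound. Since $(\vec{t}, low, up) \in Q(I)$, the upper bound up of $\vec{t}$ is computed with the following query:
QLub($\vec{z}$, lub) = select $\vec{z}$, count(*)
from $q'(\vec{x}, \vec{z}') \wedge (\exists \vec{x}'.\, \mathit{QConsistent}(\vec{x}, \vec{z}'))$
group by $\vec{z}$
Assume towards a contradiction that $d > up$. Then, there is a valuation $\theta$ and a tuple $(\vec{c}, \vec{t}')$ such that $\theta(\vec{x}) = \vec{c}$, $\theta(\vec{z}') = \vec{t}'$, $\theta(\vec{z}) = \vec{t}$, $(\vec{c}, \vec{t}') \in q'(\mathcal{R})$, and either (1) $(\vec{c}, \vec{t}') \notin q'(I)$; or (2) $I \not\models (\exists \vec{x}'.\, \mathit{QConsistent}(\vec{x}, \vec{z}'))$.
Assume that (1) $(\vec{c}, \vec{t}') \notin q'(I)$. Since $\mathcal{R}$ is a repair of $I$, by Proposition 3.6, $\mathcal{R} \subseteq I$. Thus, $(\vec{c}, \vec{t}') \notin q'(\mathcal{R})$; contradiction. Assume that (2) $I \not\models (\exists \vec{x}'.\, \mathit{QConsistent}(\vec{x}, \vec{z}'))$. Recall that $\vec{x}' = \vec{x} - \vec{z}$. By Theorem 3.5, $(\vec{c}', \vec{t}') \notin \mathit{consistent}_{\Sigma}(q', I)$, for every $\vec{c}'$. In particular, $(\vec{c}, \vec{t}') \notin \mathit{consistent}_{\Sigma}(q', I)$. Recall that there is a valuation $\theta$ for the variables of $\vec{x}$ and $\vec{z}'$ such that $\theta(\vec{x}) = \vec{c}$, $\theta(\vec{z}') = \vec{t}'$ and $\theta(\vec{z}) = \vec{t}$. Thus, $\vec{t} \notin \mathit{consistent}_{\Sigma}(q', I)$; contradiction.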
The next lemma shows that the rewritings for queries with the sum operator compute
sound ranges.
select z, sum(u)
from q
(z, u)
group by z
where q
(z, u) is a conjunctive query in c

forest
.
Let Q be the rst-order aggregate query returned by RewriteAgg(q, ). Let I be a


t consistent
(q
, I). Let d
be such that (

1
(x
1
, y
1
), . . . , R
m
(x
m
, y
m
) be the literals at
the roots of all trees of G. Let ( w, z, u) be the conjunction of literals of q
. Let
x =
i=1...m
x
i
, let

z
= z x, and let

w
= w x. Let

x
= x z u. Let
q
(x,
) =
, u.(x,

w
, u). Let QConsistent(x,
) be the query obtained by in-

voking RewriteForest(q
, ). Let q
be the query q
(x,
, u) =
.(x,

w
, u).
Lower Bound. Since (
t, low, up) Q(I), the lower bound low of

t is computed with
QGlb(z, glb) = select z, sum(v)
from QContribConsistent(x,
, v) QContribNonConsistent(x,
, v)
group by z
where QContribConsistent is the following query:
QContribConsistent(x,
, bottom) =
select x,

z
, min(u) as bottom
from QConsistent(x,
) q
(x,
, u)
group by x,
having bottom 0
and QContribNonConsistent is the following query:
QContribNonConsistent(x,
, bottom) =
select x,

z
, min(u) as bottom
from q
(x,
, u) (
.QConsistent(x,
))
group by x,
having bottom < 0

Assume towards a contradiction that d < low. Since (
t, d) q(1), we must consider

the following cases.
First, assume that there is a valuation for the variables in z, x such that (z) =

t,
(x) =c , (
) =

t
, and
(c,
) , q
(1); and
there is some e such that e > 0; and
either (c,
, e) QContribConsistent QContribNonConsistent(I).
Since e > 0, (c,
, e) QContribConsistent(I). Since (c,
) , q
(1), (c,
) ,
consistent
(q
, I). By Theorem 3.5, we conclude that (c,
) , QConsistent(I). There-
fore, (c,
, e) , QContribConsistent(I); contradiction.
Second, assume that there is a valuation for the variables in z, x such that (z) =
t,
(x) =c , (
) =

t
, and
there is some e
such that and e
< 0; and
(c,
, e
) q
(1); and
for every e such that e < 0, we have that (c,
, e) , QContribConsistent
QContribNonConsistent(I).
Since 1 I and (c,
, e
) q
(1), we have that (c,
, e
) q
(I). Since by hy-

pothesis,

t consistent
(q
, I), (
) consistent
(q
, I) for some

c
. By Theorem
3.5, (
) QConsistent(I). Thus, I [=
.QConsistent(x,
)[z/
t]. Since e
< 0,
(c,
, e
) q
(I) and I [=
.QConsistent(x,
)[z/
t], we conclude that (c,
, e
)
QContribNonConsistent(I); contradiction.
Third, assume that there is a valuation for the variables in z, x such that (z) =
t,
(x) =c , (
) =

t
, and
there is some e such that (c,
, e) QContribConsistentQContribNonConsistent(I);
and
there is some e
such that e
< e; and
(c,
, e
) q
(1).
Assume that (c,
, e) QContribConsistent(I). Then, (c,
) QConsistent(I),
and (c,
, e) q
(I). Since 1 I, and (c,
, e
) q
(1), we have that (c,
, e
)
q
(I). Notice that e and e
correspond to the attribute bottom of QContribConsistent.

This attribute is computed as min(u), that is the minimum of the values of u for the
tuples of (c,
). Since (c,
, e) and (c,
, e
) satisfy the conditions of the from clause of

QContribConsistent, e < e
; contradiction.
Now, assume that (c,
, e) QContribNonConsistent(I). Since 1 I, (c,
, e
)
q
(I). Since e corresponds to the attribute bottom of QContribNonConsistent, e < e
;
contradiction.
Upper Bound The proof for the lowest upper bound is analogous to the proof for
the greatest lower bound.
The next lemma shows that the rewritings for queries with the min and max aggrega-
tion functions compute sound ranges.
select z, [min(u)[ max(u)]
from q
(z, u)
group by z
where q

forest
.
Let Q be the rst-order aggregate query returned by RewriteAgg(q, ). Let I be a


t consistent
(q
, I). Let d
be such that (

1
(x
1
, y
1
), . . . , R
m
(x
m
, y
m
the roots of all trees of G. Let ( w, z, u) be the conjunction of literals of q
. Let
x =
i=1...m
x
i
, let

z
= z x, and let

w
= w x. Let

x
= x z u. Let
q
(x,
) =
, u.(x,

w
, u). Let QConsistent(x,
) be the query obtained by in-

voking RewriteForest(q
, ). Let q
be the query q
(x,
, u) =
.(x,

w
, u).
Lower Bound. Suppose that the aggregate function of q is max. Since (
t, low, up)
Q(I), the lower bound low of

t is computed with the following query:
QGlb(z, glb) = select z, max(u)
, u)
group by z
, bottom) =
select x,

z
, min(u) as bottom
from QConsistent(x,
) q
(x,
, u)
group by x,
Assume towards a contradiction that d < low. Then, there is a valuation for the
variables in z, x such that (z) =
t, (x) =c , (
) =

t
, and
, e) QContribConsistent(I); and
there is some e
such that e
< e; and
(c,
, e
) q
(1).
We arrive to a contradiction by the same arguments used in Lemma 4.8 for queries
with the sum operator.
Now, suppose that the aggregate function of q is min. Since (
t, low, up) Q(I), the

lower bound low of

QGlb(x, z, bottom) =
select z, min(bottom)
from QContribNonConsistent(x,
, u)
group by z
where QContribNonConsistent is the following query:
select x,

z
, min(u) as bottom
from q
(x,
, u) (
.QConsistent(x,
))
group by x,
)
Assume towards a contradiction that d < low. Then, there is a valuation for the
variables in z, x such that (z) =
t, (x) =c , (
) =

t
, and
, e) QContribNonConsistent(I); and
there is some e
such that e
< e; and
(c,
, e
) q
(1).
We arrive to a contradiction by the same arguments used in Lemma 4.8 for queries
with the sum operator.
Upper Bound For the max operator, we can give an argument analogous to the
argument given for the lower bound of the min operator. For the min operator, we
can give an argument analogous to the argument given for the lower bound of the max
operator.
4.3.4 Tight Ranges
In this section, we show that the ranges produced by the query rewritings are tight. For this, we must exhibit two repairs, where the result of the aggregation function corresponds to the greatest lower bound in one repair, and to the lowest upper bound in the other. For example, if the query has the count(*) operator, the repair that we need for the greatest lower bound turns out to be the pessimistic repair $\mathcal{M}$ used in the correctness proof of the first-order rewritings of Section 3.3.3. For the lowest upper bound, the needed repair is the optimistic repair $\mathcal{N}$ that we introduced in Section 4.3.2.
We start by showing that the rewritings produced by RewriteCount give tight bounds. In the next lemma, we show that the greatest lower bound of count(*) can be obtained by executing the query on the pessimistic repair $\mathcal{M}$. We also show that the query rewriting that we obtain correctly returns this bound.
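Lemma 4.10. Let R be a schema. Let $\Sigma$ be a set of integrity constraints, consisting of one key dependency per relation of R. Let $q(\vec{z})$ be a query of the following form:
select $\vec{z}$, count(*)
from $q'(\vec{z})$
group by $\vec{z}$
where $q'(\vec{z})$ is a query in $\mathcal{C}_{forest}$.
Let $G$ be the join graph of $q$. Let $R_1(\vec{x}_1, \vec{y}_1), \ldots, R_m(\vec{x}_m, \vec{y}_m)$ be the literals at the root of each tree of $G$. Let $\phi(\vec{w}, \vec{z})$ be the conjunction of literals of $q'$. Let $\vec{x} = \bigcup_{i=1 \ldots m} \vec{x}_i$, let $\vec{z}' = \vec{z} - \vec{x}$, and let $\vec{w}' = \vec{w} - \vec{x}$. Let $q'(\vec{x}, \vec{z}') = \exists \vec{w}'.\, \phi(\vec{x}, \vec{z}', \vec{w}')$. Let $Q(\vec{z}, l, u)$ be the first-order aggregate query returned by RewriteCount($q$, $\Sigma$). Let $I$ be an instance over R. Let $\vec{t}$ be a tuple and low and up be a pair of real numbers.
Then, there is a repair $\mathcal{M}$ of $I$ wrt $\Sigma$ and a bag $B$ such that $B = q(\mathcal{M})$, and the following conditions hold:
1. for every valuation $\theta$ such that $\theta(\vec{x}) = \vec{c}$, $\theta(\vec{z}') = \vec{t}'$, and $\theta(\vec{z}) = \vec{t}$, if $(\vec{c}, \vec{t}') \in q'(\mathcal{M})$, then $\vec{c} \in \mathit{consistent}_{\Sigma}(q'[\vec{z}/\vec{t}], I)$, and
2. if $(\vec{t}, low, up) \in \mathit{aggconsistent}_{\Sigma}(q, I)$, then $|\vec{t}|_B = low$, and
3. if $(\vec{t}, low, up) \in Q(I)$, then $|\vec{t}|_B = low$.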
Proof. Let / be the pessimistic repair obtained by invoking the algorithm BuildPess-
imisticRepair(q, , I). Condition (1) holds by Lemma 3.10. We must now prove Con-
ditions (2) and (3).
In order to prove Condition 2, let

t be a tuple, and low, and up be a pair of real
numbers such that (
(q, I). Then, there is a repair 1 of I

wrt and a bag B
such that B
= q(1) and [
t[
B
= low. Furthermore, by Lemma
4.7, since / is a repair of I wrt , [
t[
B
low. Assume towards a contradiction that
[
t[
B
> low. Then, there is a valuation for the variables of x and z such that (x) = c,
(z) =
t and (
) =

t
, and one of the following conditions holds:

(c,
) q
(/) and [(c,
)[
B
> 1; or
(c,
) q
[z/
t](/) and (c,
) , q
[z/
t](1).
Assume that (c,
) q
(/) and [(c,
)[
B
> 1. This contradicts Lemma 4.4. Now, as-
sume that (c,
) q
[z/
t](/) and (c,
) , q
[z/
t](1). Then, (c,
) , consistent
(q
[z/
t], I).
By Condition 1, we have that (c,
) , q
[z/
t](/); contradiction.

t, low, and up be such that (
t, low, up) Q(I).

Since / is a repair of I, by Lemma 4.7, [
t[
B
low. Let QConsistent(x,
) be the query
obtained by invoking RewriteForest(q
, ). Then, the lower bound low of
t is computed
with the following query:
QGlb(z, low) = select z, count(*)
from QConsistent(x,
)
group by z
Assume towards a contradiction that [
t[
B
> low. Then, there is a valuation for the
variables of x and z such that (x) =c, (z) =
t and (
) =

t
, and one of the following

conditions holds:
(c,
) q
(/) and [(c,
)[
B
> 1; or
(c,
) q
(/) and (c,
) , QConsistent(I).
Assume that (c,
) q
(/) and [(c,
)[
B
> 1. This contradicts Lemma 4.4. Now,
assume that (c,
) q
(/) and (c,
) , QConsistent(I). Since (c,
) , QConsistent(I),
by Theorem 3.5, (c,
) , consistent
(q
, I). Then, by Condition 1, we have that (c,
) ,
q
(/); contradiction.
In the next lemma, we show that the lowest upper bound of count(*) can be obtained by executing $q$ on the optimistic repair $\mathcal{N}$. We also show that the query rewriting of $q$ correctly returns this bound.
Lemma 4.11. Let R be a schema. Let be a set of integrity constraints, consisting
of one key dependency per relation of R. Let q(z) be a query in c
forest
of the following
form:
select z, count(*)
from q
(z)
group by z
where q
(z) is a query in c
forest
.
1
(x
1
, y
1
), . . . , R
m
(x
m
, y
m
of each tree of G. Let ( w, z) be the conjunction of literals of q
. Let x =
i=1...m
x
i
,
let

z
= z x, and let

w
= w x. Let q
(x,
) =
.(x,

w
). Let Q(z, l, u) be the

rst-order aggregate query returned by RewriteCount(q, ). Let I be an instance over
R. Let

t be a tuple and low and up be a pair of real numbers.
Then, there is a repair ^ of I wrt and a bag B such that B = q(^), and the
following conditions hold:
1. for every valuation such that (x) = c and (z) =
t, if c possible
(q
[z/
t], I),
then c q
[z/
t](^), and
2. if (
(q, I), then [
t[
B
= up, and
3. if (
t, low, up) Q(I), then [
t[
B
= up.
Proof. Let ^ be the optimistic repair obtained by invoking the algorithm BuildOpti-
misticRepair(q, , I). Condition (1) holds by Lemma 4.6. We must now prove Condi-
tions (2) and (3).

t be a tuple, and low and up be real numbers such
that (
(q, I). Then, there is a repair 1 of I wrt and a bag

B
such that B
= q(1) and [
t[
B
= up. Furthermore, since ^ is a repair of I wrt , by
Lemma 4.7, [
t[
B
up. Assume towards a contradiction that [
t[
B
< up. Then, there is a
valuation for the variables of x and z such that (x) = c, (z) =

t and (
) =

t
, and
one of the following conditions holds:
(c,
) q
(1) and [(c,
)[
B
> 1; or
(c,
) , q
(^) and (c,
) q
(1).
Assume that (c,
) q
(1) and [(c,
)[
B
assume that (c,
) , q
(^) and (c,
) q
(1). Then, c possible
(q
[z/
t], I). By
Condition 1, we have that c q
[z/
t](^); contradiction.

t, low, up) Q(I).

Since ^ is a repair of I, by Lemma 4.7, [
t[
B
up. Let

x
= xz. Let QConsistent(x,
)
be the query obtained by invoking RewriteForest(q
, ). Since (
t, low, up) Q(I), the

upper bound up of

Let QLub(z, up) = select z, count(*)
from q
(x,
) (
.QConsistent(x,
))
group by z
Assume towards a contradiction that [
t[
B
< up. Then, there is a valuation for the
t and (
) =

t
, and either:
(c,
) is accounted for more than once in the from clause of QLub; or

(c,
) , q
(^), (c,
) q
(I), and I [= (
.QConsistent[z/
t]).
Assume that (c,
) is accounted for more than once in the from clause of QLub. This
is a contradiction since by denition the from clause of a rst-order aggregate query is
computed using set semantics. Now, assume that (c,
) , q
(^), (c,
) q
(I), and
I [= (
.QConsistent[z/
t]). Since (c,
) q
(I), we have that c possible
(q
[z/
t], I).
Thus, by Condition 1, c q
[z/
t](^); contradiction.
For the unary operators, the proof of tightness proceeds in an analogous way, except that the optimistic and pessimistic repairs have to be modified to ensure that every tuple has the minimum (or maximum, depending on the case) value for attribute $u$. We next show how to obtain a pessimistic repair for queries with the sum operator.
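Algorithm BuildPessimisticRepairForSum($q$, $I$, $\mathcal{M}'$)
Input: $q$, a query of the form
    select $\vec{z}$, sum(u)
    from $q'(\vec{z})$
    group by $\vec{z}$
  where $q' \in \mathcal{C}_{forest}$
  $I$, an instance
  $\mathcal{M}'$, a pessimistic repair
Output: $\mathcal{M}$, a pessimistic repair
Initialize $\mathcal{M}$ as $\mathcal{M}'$
Let $R(\vec{x}, \vec{y})$ be the literal of $q$ where $u$ appears
for each tuple $R(\vec{c}, \vec{d})$ of $\mathcal{M}$ do
  Let $\theta$ be a valuation for the variables of $R$ such that $\theta(\vec{x}) = \vec{c}$ and $\theta(\vec{y}) = \vec{d}$
  for every valuation $\theta'$ for the variables of $R$ such that $\theta'(\vec{x}) = \vec{c}'$, $\theta'(\vec{y}) = \vec{d}'$, $R(\vec{c}', \vec{d}') \in I$, and $\theta'(z) = \theta(z)$ for every $z$ such that $z \neq u$ do
    if $\theta'(u) < \theta(u)$ then
      Replace $R(\vec{c}, \vec{d})$ with $R(\vec{c}', \vec{d}')$ in $\mathcal{M}$
    end if
  end for
end for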
Notice in the algorithm that a tuple $R(\vec{c}, \vec{d})$ is replaced only if there is another tuple with the same values, except for the attribute $u$, and the other tuple has a smaller value on $u$ (condition $\theta'(u) < \theta(u)$ in the algorithm). In the rewriting for the lower bound of the sum operator, this corresponds to the fact that for positive values we aggregate over the minimum value of $u$ for all tuples in the intermediate result. In contrast, for the upper bound, we aggregate over the maximum value of $u$. Thus, for the upper bound, a similar algorithm can be used, where we replace tuples for which the condition $\theta'(u) > \theta(u)$ is satisfied. Since we choose the conditions that correspond to positive numbers in the rewriting given in RewriteAgg, the tightness results for the sum operator need to restrict the domain of the aggregated value to range over positive numbers (for min and max we do not have this restriction). In Figure 4.4, we summarize the repairs that must be modified in order to obtain the tight bounds of each aggregation function, and which condition must be checked.
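The replacement step can be sketched in a few lines of Python. The sketch assumes a single relation whose last attribute is the aggregated attribute $u$, tuples represented as Python tuples, and an already available pessimistic repair; the function name and data are hypothetical. Taking the minimum over all matching tuples has the same effect as repeatedly replacing a tuple whenever a matching tuple with a smaller $u$ is found.

```python
def pessimistic_repair_for_sum(instance, repair):
    """For each tuple of `repair`, switch to the tuple of `instance` that
    agrees on every attribute except u (the last one) and has the smallest u."""
    result = []
    for t in repair:
        # Tuples of the instance that differ from t at most on u.
        candidates = [s for s in instance if s[:-1] == t[:-1]]
        result.append(min(candidates + [t], key=lambda s: s[-1]))
    return result

# Hypothetical instance of R(k, g, u) with key k, and an assumed pessimistic repair.
I = [("k1", "a", 5), ("k1", "a", 3), ("k2", "a", 4), ("k2", "b", 7)]
M0 = [("k1", "a", 5), ("k2", "b", 7)]

M = pessimistic_repair_for_sum(I, M0)
```

The tuple for k1 is switched to the one with u = 3, while the tuple for k2 is left alone because no other tuple agrees with it on all attributes but u. For the upper bound, the analogous sketch would use max instead of min.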
The following lemma shows that the greatest lower bound computed for the sum
operator can be obtained from the pessimistic repair computed with the procedure given
above. We also show that our query rewriting correctly returns such bound.
Function   Bound   Repair        Condition
max        glb     pessimistic   $\theta'(u) < \theta(u)$
max        lub     optimistic    $\theta'(u) > \theta(u)$
sum        glb     pessimistic   $\theta'(u) < \theta(u)$
sum        lub     optimistic    $\theta'(u) > \theta(u)$
min        glb     optimistic    $\theta'(u) < \theta(u)$
min        lub     pessimistic   $\theta'(u) > \theta(u)$
Figure 4.4: Repairs that must be used to obtain the tight bounds of unary operators
one key dependency per relation of R. Let q(z) be a query of the following form:
select z, sum(u)
from q
(z, u)
group by z
where q

forest
and u ranges over the positive numbers.
1
(x
1
, y
1
), . . . , R
m
(x
m
, y
m
of each tree of G. Let ( w, z, u) be the conjunction of literals of q
. Let x =
i=1...m
x
i
,
let

z
= z x, and let

w
= w x. Let q
(x,
) =
, u.(x,

w
, u). Let Q(z, l, u)

be the rst-order aggregate query returned by RewriteAgg(q, ). Let I be an instance
over R. Let

t be a tuple and low and up be a pair of real numbers. Let q
(x,
, u) =
.(x,

w
, u).
Then, there is a repair / of I wrt and some value d such that (
t, d) q(/), and
the following conditions hold:
) =

t
, and (z) =
t, if (c,
) q
(/),
then c consistent
(q
[z/
t], I), and

2. if (
(q, I), then d = low, and

3. if (
t, low, up) Q(I), then d = low.

Proof. Let /
be the repair obtained by invoking the algorithm BuildPessimistic-

Repair(q, , I). Let / be the repair obtained by invoking the algorithm BuildPess-
imisticRepairForSum(q, I, /
). Condition (1) holds by Lemma 3.10. We must now

prove Conditions (2) and (3).

t be a tuple, and low, and up be a pair of real
numbers such that (
(q, I). Then, there is a repair 1 of I

wrt such that (
t, low) q(1). Furthermore, by Lemma 4.8, since /is a repair of I wrt

, d low. Assume towards a contradiction that d > low. Let B = q
(/). Then, there

is a valuation for the variables of x and z such that (x) = c, (z) =

t and (
) =

t
,
and one of the following conditions holds:
(c,
) q
(/) and [(c,
)[
B
> 1; or
there are e and e
such that e > e
, (c,
, e) q
(/) and (c,
, e
) q
(1); or
(c,
) q
[z/
t](/) and (c,
) , q
[z/
t](1).
Assume that (c,
) q
(/) and [(c,
)[
B
assume that there are e and e
such that e > e
, (c,
, e) q
(/) and (c,
, e
)
q
(1). Let
and
be valuations such that for every w ,= u, (w) =
(w) and
(w) =
(w);
(w) = e; and
(w) = e
. Since / is constructed using the algo-

rithm BuildPessimisticRepairForSum and 1 I,
(w) <
(w). Thus, e < e
;
contradiction. Finally, assume that (c,
) q
[z/
t](/) and (c,
) , q
[z/
t](1). Then,
(c,
) , consistent
(q
[z/
t], I). By Condition 1, we have that (c,
) , q
[z/
t](/); con-
tradiction.

t, low, up) Q(I).

Since / is a repair of I, by Lemma 4.8, d low. Let QConsistent(x,
) be the query
obtained by invoking RewriteForest(q
, ). Since u ranges only over positive numbers,

the lower bound low of

QGlb(z, glb) = select z, sum(v)
, v)
group by z
, bottom) =
select x,

z
, min(u) as bottom
from QConsistent(x,
) q
(x,
, u)
group by x,

having bottom 0
Assume towards a contradiction that d > low. Then, there is a valuation for the
t and (
) =

t
, and one of the following

conditions holds:
(c,
) q
(/) and [(c,
)[
B
> 1; or
there are e and e
such that e > e
, (c,
, e) q
(/) and
(c,
, e
) QContribConsistent(I); or
(c,
) q
(/) and (c,
) , QConsistent(I).
Assume that (c,
) q
(/) and [(c,
)[
B
assume that there are e and e
such that e > e
, (c,
, e) q
(/) and (c,
, e
)
QContribConsistent(I). Since e
is computed as min(u) in QContribConsistent,

and / I, e
< e; contradiction. Finally, assume that (c,
) q
(/) and (c,
) ,
QConsistent(I). Since (c,
) , QConsistent(I), by Theorem 3.5, we have that (c,
) ,
consistent
(q
, I). Then, by Condition 1, we have that (c,
) , q
(/); contradic-
tion.
Notice that the proof above is similar to the one for Lemma 4.10, except that we need
to account for the fact that each tuple may contribute a value greater than one. A proof
similar to Lemma 4.11 can be given for the lowest upper bound.
4.3.5 Putting It All Together
The next lemma states the correctness of the algorithm RewriteCount. The correctness
for the unary operators can be obtained analogously by employing the optimistic and
pessimistic repairs as shown in Figure 4.4.
Lemma 4.13. Let R be a schema. Let $\Sigma$ be a set of integrity constraints, consisting
of one key dependency per relation of R. Let q(z) be a query in $\mathcal{C}_{forest}$ of the following
form:
select z, count(*)
from q
( w, z)
group by z
Let Q(z, l, u) be the first-order aggregate query returned by RewriteCount(q, $\Sigma$). Let
I be an instance over R. Then, for every tuple $\vec{t}$, and pair of real numbers low and up,
we have that $(\vec{t}, low, up) \in \mathrm{aggconsistent}_\Sigma(q, I)$ iff $(\vec{t}, low, up) \in Q(I)$.

1
(x
1
, y
1
), . . . , R
m
(x
m
, y
m
the roots of all trees of G. Let ( w, z) be the conjunction of literals of q
. Following the
algorithm RewriteCount, let x =
i=1...m
x
i
, let

z
= z x, and let

w
= w x. Let
= xz. Let q
(x,
) =
.(x,

w
). Let QConsistent(x,
) be the query obtained

by invoking RewriteForest(q
, ).
() Let

t be a tuple and low and up be real numbers such that (
t, low, up)
aggconsistent
(q, I). By Lemma 4.10, there is a pessimistic repair / of I wrt

and a bag B such that B = q(/), and the following conditions hold:
) =

t
, and (z) =
t, if (c,
) q
(/),
then c consistent
(q
[z/
t], I), and

2. if (
(q, I), then [
t[
B
= low.
Since, (
(q, I), by item (2) above, [
t[
B
= low. Assume
towards a contradiction that (
t, low, up) , Q(I). Let low
be a value computed as follows:

QGlb(z, low
) = select z, count(*)
from QConsistent(x,
)
group by z
Assume that low
< low. Then, there is a valuation for the variables of x and z

such that (x) =c, (z) =
t, (
) =

t
, and one of the following conditions holds:

(c,
) q
(/) and [(c,
)[
B
> 1; or
(c,
) q
(/) and (c,
) , QConsistent(I).
Assume that (c,
) q
(/) and [(c,
)[
B
assume that (c,
) q
(/) and (c,
) , QConsistent(I). By Theorem 3.5, (c,
) ,
consistent
(q
, I). By Condition 1 above, (c,
) , q
(/); contradiction.
Assume towards a contradiction that low
> low. Then, there is a valuation for

the variables of x and z such that (x) = c, (z) =

t, (
) =

t
, (c,
) , q
(/) and
(c,
) QConsistent(I). Since (c,
) QConsistent(I), by Theorem 3.5, (c,
)
consistent
(q
, I). Then, since / is a repair of I wrt , we have that (c,
) q
(/);
contradiction.
By Lemma 4.11, there is an optimistic repair ^ of I wrt and a bag B such that
B = q(^), and the following conditions hold:
1. for every valuation such that (x) = c and (z) =

t, if c possible
(q
[z/
t], I),
then c q
[z/
t](^), and
2. if (
(q, I), then [
t[
B
= up.
Since, (
(q, I), by item (2) above, [
t[
B
= up. Assume
towards a contradiction that (
t, low, up) , Q(I). Let up
be a value computed as follows:

Let QLub(z, up
) = select z, count(*)
from q
(x,
) (
.QConsistent(x,
))
group by z
Assume that up
< up. Then, there is a valuation for the variables of x and

z such that (x) = c, (z) =

t, (
) =

t
, (c,
) , q
(^), (c,
) q
(I), and I [=
.QConsistent(x,
). Since (c,
) q
(I), (c,
) possible
(q
, I). Thus, by Lemma

4.6, (c,
) q
(^); contradiction.
Assume that up
< up. Then, there is a valuation for the variables of x and z such
that (x) = c, (z) =

t, (
) =

t
, and one of the following two cases holds. First,

(c,
) q
(^) and [(c,
)[
B
> 1. But this contradicts Lemma 4.4. Second, (c,
) q
(^)
and either (1) (c,
) , q
(I), or (2) I ,[=
.QConsistent(x,
). Assume that (1) (c,
) ,
q
(I). Since ^ is a repair of I wrt , ^ I. Thus, (c,
) , q
(^); contradiction.
Assume that (2) I ,[=
.QConsistent(x,
). Recall that

x
= x z. By Theorem 3.5,
(
) , consistent
(q
, I), for every

c
. In particular, (c,
) , consistent
(q
, I). Thus,
(c,
) , q
(^); contradiction.
($\Leftarrow$) Let $\vec{t}$ be a tuple and low and up be real numbers such that $(\vec{t}, low, up) \in Q(I)$. In
order to prove that $(\vec{t}, low, up) \in \mathrm{aggconsistent}_\Sigma(q, I)$, we must show that:
1. For every repair $\mathcal{R}$ of I wrt $\Sigma$, if $B = q(\mathcal{R})$, then $low \le |\vec{t}|_B \le up$.
2. There is a repair $\mathcal{R}$ of I wrt $\Sigma$, and a bag B such that $B = q(\mathcal{R})$ and $|\vec{t}|_B = low$.
3. There is a repair $\mathcal{R}$ of I wrt $\Sigma$, and a bag B such that $B = q(\mathcal{R})$ and $|\vec{t}|_B = up$.
Claim 1 follows by Lemma 4.7. Claim 2 follows by Lemma 4.10. Claim 3 follows by
Lemma 4.11.
4.4 Related Work
Our work on aggregation is inspired by Arenas et al. [ABC+03b], who were the first to
propose the use of ranges in a semantics for consistent query answering. The work of
Arenas et al. is restricted to queries of the following form:
select F(A)
from r
where F is an aggregation function, r is a single relation, and A is an attribute from
r. Notice that such queries have no grouping and no selection or join conditions (i.e., no
where clause). In this chapter, we consider a much richer class of queries. For the class
of queries considered by Arenas et al., the semantics proposed in their paper and our
semantics for aggregate queries coincide. However, we need to extend their semantics in
order to be able to deal with queries that perform grouping.
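To make the range semantics concrete for the simplest query form above (select sum(A) from r), the following is a minimal sketch in Python. It assumes a hypothetical relation r(K, A) with K as its key, enumerates every repair by brute force, and compares the resulting bounds with the direct computation that sums the per-key minimum and maximum. The function names and the sample data are illustrative assumptions; this is not the rewriting algorithm of this chapter.

```python
from collections import defaultdict
from itertools import product


def group_by_key(rows):
    """Group the A-values of r(K, A) by key value K."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return groups


def repairs(rows):
    """Yield every repair: exactly one tuple per key value."""
    groups = group_by_key(rows)
    keys = sorted(groups)
    for choice in product(*(groups[k] for k in keys)):
        yield list(zip(keys, choice))


def sum_range_bruteforce(rows):
    """Greatest lower and lowest upper bound of sum(A) over all repairs."""
    totals = [sum(v for _, v in repair) for repair in repairs(rows)]
    return min(totals), max(totals)


def sum_range_direct(rows):
    """Same bounds without enumerating repairs: each key contributes
    its smallest value to the lower bound and its largest to the upper."""
    groups = group_by_key(rows)
    return (sum(min(vs) for vs in groups.values()),
            sum(max(vs) for vs in groups.values()))


# Hypothetical inconsistent instance: keys k1 and k3 each have two tuples.
r = [("k1", 10), ("k1", 30), ("k2", 5), ("k3", 7), ("k3", 2)]
print(sum_range_bruteforce(r), sum_range_direct(r))
```

The brute-force version is exponential in the number of conflicting keys and is only meant as a reference point; the direct version is the kind of per-key reasoning that a rewriting can push into SQL.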
In their paper, Arenas et al. [ABC+03b] consider functional dependencies. If there
is exactly one functional dependency on the (only) relation of the query, they show that
the problem of obtaining the lowest upper and greatest lower bounds is tractable for the
count(*), min, max, sum, and avg functions. Except for avg, we considered all these
functions in our class $\mathcal{C}_{aggforest}$. Arenas et al. also show the intractability of queries with
the count(distinct) operator and exactly one functional dependency. If the relation
of the query has more than one functional dependency, they show that the problem
of obtaining tight bounds is intractable for all the aggregate functions they consider
(count(*), min, max, sum, avg, and count(distinct)). This gives further evidence of
the maximality of the class considered in this chapter: going from one to two functional
dependencies may lead to intractability even for queries on just one relation and with no
grouping.
Chapter 5
Complexity-Theoretic Analysis
In the previous chapters, we presented query rewriting algorithms that work on a broad
class of queries. In this chapter, we show the maximality of this class based on complexity-
theoretic arguments. In Section 5.1, we show that minimal relaxations of the conditions of
the class lead to intractability. Then, in Section 5.2, we embark on a more ambitious goal:
for a large class of conjunctive queries, we show that the conditions of the class $\mathcal{C}_{forest}$
presented in Chapter 3 are not only sufficient, but they are also necessary conditions for
a query to be first-order rewritable.
5.1 Minimal Relaxations of $\mathcal{C}_{forest}$
In this section, we show that minimal relaxations of the conditions of $\mathcal{C}_{forest}$ lead to
intractability. In particular, we show the intractability of the problem of computing
consistent answers for: (1) a conjunctive query whose join graph is a cycle of length
two; and (2) a conjunctive query whose join graph is a forest, but the query has some
nonkey-to-key joins that are not full.
Chomicki and Marcinkowski [CM05] proved that the problem of computing consistent
answers for a query with a single nonkey-to-nonkey join is coNP-complete. Their result
used a query with repeated relation symbols (specifically, a query with only two literals
both for a single relation R). We can use their insight to show that the problem of
computing consistent answers for the following query without repeated relation symbols,
but with a single nonkey-to-nonkey join is also coNP-complete.
$q_{nk} = \exists x, x', y.\; S_1(x, y) \wedge S_2(x', y)$
Notice that $q_{nk}$ has a cycle of length two (actually, a nonkey-to-nonkey join), and
no nonkey-to-key joins. Our proof of hardness is a simple modification to the results
of Chomicki and Marcinkowski [CM05] and uses a reduction from the problem
MONOTONE-3SAT, which is well known to be NP-complete. The only difference between
the MONOTONE-3SAT and 3SAT problems is that the former assumes that the input 3CNF
propositional formula is monotone. That is, each clause $\phi_i$ contains either positive or
negative atoms, but not both. We shall say that a clause that contains only positive
(negative) atoms is a positive (negative) clause.
Lemma 5.1. Let q be the query $\exists x, x', y.\; S_1(x, y) \wedge S_2(x', y)$. Then, CONSISTENT(q, $\Sigma$) is
coNP-hard.
Proof. We will prove hardness by reduction from MONOTONE-3SAT. Let $\phi = \phi_1 \wedge \dots \wedge \phi_m$
be a 3CNF formula such that each clause $\phi_i$ contains either positive or negative atoms,
but not both. We shall build an instance I as follows:
- For each positive clause $\phi_i$ and each atom z that occurs in $\phi_i$, we add a tuple
  $S_1(i, z)$ to I.
- For each negative clause $\phi_i$ and each atom z that occurs in $\phi_i$, we add a tuple
  $S_2(i, z)$ to I.
We now show that $\mathrm{consistent}_\Sigma(q, I) = false$ iff $\phi$ is satisfiable.
($\Rightarrow$) Since $\mathrm{consistent}_\Sigma(q, I) = false$, there exists a repair $\mathcal{R}$ of I such that $\mathcal{R} \not\models q$.
We now build a valuation v for the variables of $\phi$ as follows. For each variable z, we let
v(z) = true if there is some i such that $S_1(i, z) \in \mathcal{R}$; and we let v(z) = false if there is
some i such that $S_2(i, z) \in \mathcal{R}$. The valuation is well defined: since $\mathcal{R} \not\models q$, no atom z
occurs both in a tuple of $S_1$ and in a tuple of $S_2$ in $\mathcal{R}$. It is easy to see that v is a truth
valuation that satisfies $\phi$.
($\Leftarrow$) Assume that $\phi$ is satisfiable. Let v be a truth assignment for the variables of $\phi$
that satisfies $\phi$. We shall build a repair $\mathcal{R}$ as follows. For each positive clause $\phi_i$, select a
variable z that appears in $\phi_i$ and such that v(z) = true. Let $S_1(i, z) \in \mathcal{R}$. For each negative
clause $\phi_i$, select a variable z that appears in $\phi_i$ and such that v(z) = false. Let $S_2(i, z) \in \mathcal{R}$.
It is easy to see that $\mathcal{R} \not\models q$.
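The reduction in the proof of Lemma 5.1 can be checked on small formulas. The sketch below, in Python, builds the instance I from a monotone formula, decides the consistent answer by enumerating all repairs, and compares the outcome with a brute-force satisfiability test. The encoding of a formula as a list of (sign, atoms) pairs and all function names are assumptions made for this illustration; both checks are exponential and only meant for tiny inputs.

```python
from itertools import product


def build_instance(clauses):
    """clauses: list of (sign, atoms); sign is '+' for a positive clause
    and '-' for a negative one. Returns relations S1 and S2 with tuples
    (clause index, atom); the clause index is the key."""
    s1, s2 = [], []
    for i, (sign, atoms) in enumerate(clauses):
        target = s1 if sign == "+" else s2
        for z in atoms:
            target.append((i, z))
    return s1, s2


def repairs_of(relation):
    """Yield every repair: one tuple per key value (key = first attribute)."""
    groups = {}
    for key, val in relation:
        groups.setdefault(key, []).append((key, val))
    for choice in product(*groups.values()):
        yield set(choice)


def consistent_answer(s1, s2):
    """True iff q = exists x, x', y. S1(x, y) and S2(x', y) holds in every repair."""
    for rep1 in repairs_of(s1):
        ys1 = {y for _, y in rep1}
        for rep2 in repairs_of(s2):
            if not ys1 & {y for _, y in rep2}:
                return False  # found a repair that falsifies q
    return True


def satisfiable(clauses):
    """Brute-force satisfiability test for a monotone formula."""
    atoms = sorted({z for _, zs in clauses for z in zs})
    for bits in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, bits))
        if all(any(v[z] == (sign == "+") for z in zs) for sign, zs in clauses):
            return True
    return False


# (a or b) and (not a or not c) is satisfiable, so q is not consistently true.
phi = [("+", ["a", "b"]), ("-", ["a", "c"])]
print(consistent_answer(*build_instance(phi)), satisfiable(phi))
```

On every input, the consistent answer should be false exactly when the formula is satisfiable, which is the equivalence the proof establishes.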
Now, we show the intractability of the problem for a conjunctive query whose join
graph is a forest, but the query has nonkey-to-key joins that are not full. In particular,
we focus on the following query:
$\exists x, x', w, w', z, z', m.\; R_1(x, w) \wedge R_2(m, w, z) \wedge R_3(x', w') \wedge R_4(m, w', z')$
We prove hardness by showing a reduction from the problem of computing the consistent
answers for the query $q_{nk}$ shown to be coNP-hard in Lemma 5.1.
Lemma 5.2. Let q be the query $\exists x, x', w, w', z, z', m.\; R_1(x, w) \wedge R_2(m, w, z) \wedge R_3(x', w') \wedge R_4(m, w', z')$.
Let $q'$ be the query $\exists x, x', y.\; S_1(x, y) \wedge S_2(x', y)$. Then, there is a polynomial
time reduction from the problem CONSISTENT($q'$, $\Sigma'$) to the problem CONSISTENT(q, $\Sigma$).
Proof. Let $I'$ be an instance over the schema of $q'$. We shall build an instance I over the
schema of q as follows:
Initialize I as the empty instance
for each tuple $S_1(c_1, d_1) \in I'$ do
    Add $R_1(c_1, d_1)$ to I
end for
for each tuple $S_2(c_2, d_2) \in I'$ do
    Add $R_3(c_2, d_2)$ to I
end for
Let $c_z$, $c_{z'}$ be some constants
for each valuation $\nu_{q'}$ such that $I' \models S_1(x, y) \wedge S_2(x', y)[\nu_{q'}]$ do
    Let $\nu_q(x) = \nu_{q'}(x)$
    Let $\nu_q(x') = \nu_{q'}(x')$
    Let $\nu_q(w) = \nu_{q'}(y)$
    Let $\nu_q(w') = \nu_{q'}(y)$
    Let $c_m$ be a newly-created constant
    Let $\nu_q(m) = c_m$
    Let $\nu_q(z) = c_z$
    Let $\nu_q(z') = c_{z'}$
    Add tuple $R_2(m, w, z)[\nu_q]$ to I
    Add tuple $R_4(m, w', z')[\nu_q]$ to I
end for
We claim that $\mathrm{consistent}_{\Sigma'}(q', I') = true$ iff $\mathrm{consistent}_\Sigma(q, I) = true$.
($\Rightarrow$) Let $\mathcal{R}$ be a repair of I. We shall build an instance $\mathcal{R}'$ as follows:
for each tuple $R_1(c_1, d_1)$ of $\mathcal{R}$ do
    Add a tuple $S_1(c_1, d_1)$ to $\mathcal{R}'$
end for
for each tuple $R_3(c_2, d_2)$ of $\mathcal{R}$ do
    Add a tuple $S_2(c_2, d_2)$ to $\mathcal{R}'$
end for
Notice that $R_1$ and $S_1$ (and, similarly, $R_3$ and $S_2$) have the same extensions in I and $I'$,
respectively. Thus, since $\mathcal{R}$ is a repair of I, $\mathcal{R}'$ is a repair of $I'$. Since
$\mathrm{consistent}_{\Sigma'}(q', I') = true$, $\mathcal{R}' \models q'$. Thus, there is a valuation $\nu_{q'}$ such that
$\mathcal{R}' \models S_1(x, y) \wedge S_2(x', y)[\nu_{q'}]$. Let $c_1 = \nu_{q'}(x)$, $c_2 = \nu_{q'}(x')$, $d = \nu_{q'}(y)$.
Let $c_z$ and $c_{z'}$ be the constants used in the algorithm that constructs I. Let $c_m$ be the
constant created in the algorithm for the iteration corresponding to $\nu_{q'}$. Let $\nu_q$ be a
valuation for the variables of q such that:
$\nu_q(x) = c_1$
$\nu_q(x') = c_2$
$\nu_q(w) = d$
$\nu_q(w') = d$
$\nu_q(m) = c_m$
$\nu_q(z) = c_z$
$\nu_q(z') = c_{z'}$
Since $S_1(c_1, d) \in \mathcal{R}'$, $R_1(c_1, d) \in \mathcal{R}$. Since $S_2(c_2, d) \in \mathcal{R}'$, $R_3(c_2, d) \in \mathcal{R}$. By Proposition
3.6, $\mathcal{R}' \subseteq I'$. Thus, $S_1(c_1, d) \in I'$ and $S_2(c_2, d) \in I'$. Since $c_m$ is the constant chosen in the
iteration for $\nu_{q'}$ in the algorithm that constructs I, $R_2(c_m, d, c_z) \in I$ and $R_4(c_m, d, c_{z'}) \in I$.
By Proposition 3.7, $R_2(c_m, d, e) \in \mathcal{R}$ and $R_4(c_m, d, e') \in \mathcal{R}$, for some e, $e'$. Thus, $\mathcal{R} \models q[\nu_q]$.
($\Leftarrow$) Let $\mathcal{R}'$ be a repair of $I'$. We shall build an instance $\mathcal{R}$ as follows.
for each tuple $S_1(c_1, d_1) \in \mathcal{R}'$ do
    Add $R_1(c_1, d_1)$ to $\mathcal{R}$
end for
for each tuple $S_2(c_2, d_2) \in \mathcal{R}'$ do
    Add $R_3(c_2, d_2)$ to $\mathcal{R}$
end for
for each tuple $R_2(c_1, c_2, d) \in I$ do
    Add $R_2(c_1, c_2, d)$ to $\mathcal{R}$
end for
for each tuple $R_4(c_1, c_2, d) \in I$ do
    Add $R_4(c_1, c_2, d)$ to $\mathcal{R}$
end for
We now show that $\mathcal{R}$ is a repair of I. First, notice that $R_1$ and $S_1$ (and, similarly, $R_3$
and $S_2$) have the same extensions in I and $I'$, respectively. Second, in the construction
of I, every tuple of $R_2$ and $R_4$ is given a distinct key value. Then, by Propositions 3.6
and 3.7, every tuple in the extension of $R_2$ in I is in the extension of $R_2$ in $\mathcal{R}$; and every
tuple in the extension of $R_4$ in I is in the extension of $R_4$ in $\mathcal{R}$.
Since $\mathrm{consistent}_\Sigma(q, I) = true$, $\mathcal{R} \models q$. Thus, there exists some valuation $\nu_q$ such
that $\mathcal{R} \models R_1(x, w) \wedge R_2(m, w, z) \wedge R_3(x', w') \wedge R_4(m, w', z')[\nu_q]$. By construction of I, if
$R_2$ and $R_4$ join on m, then $\nu_q(w) = \nu_q(w')$. Let $\nu_{q'}$ be such that:
$\nu_{q'}(x) = \nu_q(x)$
$\nu_{q'}(x') = \nu_q(x')$
$\nu_{q'}(y) = \nu_q(w) = \nu_q(w')$
It is easy to see that $\mathcal{R}' \models S_1(x, y) \wedge S_2(x', y)[\nu_{q'}]$. Thus, $\mathcal{R}' \models q'$.
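The instance construction used in the proof of Lemma 5.2 can be sketched and sanity-checked in Python as follows. The sketch assumes that relations are sets of tuples, that the first attribute of every relation acts as its key (so each fresh constant for m gives every tuple of R2 and R4 a distinct key value, as the proof requires), and that the constants for z and z' are the hypothetical strings "cz" and "cz'". The brute-force checkers enumerate all repairs and are only meant for tiny instances.

```python
from itertools import count, product


def reduce_instance(s1, s2, cz="cz", cz_prime="cz'"):
    """Build (R1, R2, R3, R4) from an instance (S1, S2) of q_nk."""
    r1, r3 = set(s1), set(s2)
    r2, r4 = set(), set()
    fresh = count()
    # One iteration per valuation that satisfies S1(x, y) and S2(x', y).
    for (_, y), (_, y2) in product(sorted(s1), sorted(s2)):
        if y == y2:
            cm = f"m{next(fresh)}"       # newly created constant for m
            r2.add((cm, y, cz))          # R2(m, w, z)
            r4.add((cm, y, cz_prime))    # R4(m, w', z')
    return r1, r2, r3, r4


def repairs_of(relation):
    """Yield every repair: one tuple per key value (key = first attribute)."""
    groups = {}
    for row in sorted(relation):
        groups.setdefault(row[0], []).append(row)
    for choice in product(*groups.values()):
        yield set(choice)


def consistent_qnk(s1, s2):
    """q' = exists x, x', y. S1(x, y) and S2(x', y) holds in every repair."""
    return all({y for _, y in a} & {y for _, y in b}
               for a in repairs_of(s1) for b in repairs_of(s2))


def holds_q(r1, r2, r3, r4):
    """q = exists ... R1(x, w), R2(m, w, z), R3(x', w'), R4(m, w', z')."""
    w1 = {w for _, w in r1}
    w3 = {w for _, w in r3}
    return any(m == m2 and w in w1 and w2 in w3
               for (m, w, _), (m2, w2, _) in product(r2, r4))


def consistent_q(r1, r2, r3, r4):
    """q holds in every repair of the constructed instance."""
    return all(holds_q(a, b, c, d)
               for a in repairs_of(r1) for b in repairs_of(r2)
               for c in repairs_of(r3) for d in repairs_of(r4))


s1 = {(1, "a"), (1, "b")}
s2 = {(2, "a")}
print(consistent_qnk(s1, s2), consistent_q(*reduce_instance(s1, s2)))
```

The number of iterations of the main loop is at most the product of the sizes of S1 and S2, which is why the reduction is polynomial.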
5.2 A Dichotomy Result
5.2.1 The Class $\mathcal{C}^*$
In Chapter 3, we presented a query rewriting algorithm which works on a class of queries
that we call $\mathcal{C}_{forest}$. Clearly, $\mathcal{C}_{forest}$ gives sufficient conditions for a query to be first-order
rewritable. In this section, we address the following question: for which class of
queries does $\mathcal{C}_{forest}$ also give necessary conditions? That is, we show a class of queries
such that the problem of computing the consistent answers is coNP-complete for every
query of the class which does not satisfy the conditions of $\mathcal{C}_{forest}$. Notice that this
establishes a dichotomy between first-order rewritability and coNP-completeness, and
is therefore much stronger than the complexity results that we presented in Section
5.1 (and, in fact, all the complexity results present in the consistent query answering
literature [CLR03a, CM05]). In the literature, a class $\mathcal{C}$ is said to be coNP-hard if there
is at least one query $q \in \mathcal{C}$ such that CONSISTENT(q, $\Sigma$) is a coNP-hard problem. Under
such a definition, it suffices to exhibit just one intractable query in order to conclude
that the entire class is coNP-complete. In contrast, in this section we will present a class
of queries such that for every query q in the class, CONSISTENT(q, $\Sigma$) is coNP-complete.
We will focus on conjunctive queries without repeated relation symbols and all of
whose nonkey-to-key joins are full. Within this class, there are some queries for which
the existence of a cycle is not a sufficient condition for intractability. Consider, for
example, the query $q = \exists x, y.\; R_1(x, y) \wedge R_2(x, y)$. The join graph of this query is not a
forest; yet, it can be rewritten as follows:
$\exists x, y.\; R_1(x, y) \wedge R_2(x, y) \wedge \forall y'.(R_1(x, y') \rightarrow y' = y) \wedge \forall y'.(R_2(x, y') \rightarrow y' = y)$
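The first-order rewriting displayed above can be compared against the definition of consistent answers on small instances. The following Python sketch assumes that R1 and R2 are sets of (x, y) pairs with x as the key of both relations; the function names are illustrative. It evaluates the rewriting directly on the possibly inconsistent instance and, separately, checks the query in every repair.

```python
from itertools import product


def repairs_of(relation):
    """Yield every repair: one tuple per key value (key = first attribute)."""
    groups = {}
    for key, val in sorted(relation):
        groups.setdefault(key, []).append((key, val))
    for choice in product(*groups.values()):
        yield set(choice)


def consistent_bruteforce(r1, r2):
    """q = exists x, y. R1(x, y) and R2(x, y) holds in every repair."""
    return all(a & b for a in repairs_of(r1) for b in repairs_of(r2))


def consistent_rewritten(r1, r2):
    """Evaluate the rewriting on the inconsistent instance itself:
    some (x, y) is in both relations, and y is the only value that
    either relation associates with key x."""
    for x, y in r1:
        if ((x, y) in r2
                and all(y2 == y for x2, y2 in r1 if x2 == x)
                and all(y2 == y for x2, y2 in r2 if x2 == x)):
            return True
    return False


r1 = {(1, "a"), (2, "b"), (2, "c")}
r2 = {(1, "a"), (2, "b")}
print(consistent_bruteforce(r1, r2), consistent_rewritten(r1, r2))
```

The rewritten check touches each relation a polynomial number of times, whereas the brute-force check may enumerate exponentially many repairs; the two agree because a repair can avoid a common tuple for key x unless both relations pin x to the same single value.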
Recall that the problem of computing consistent answers is intractable for the query
$q_{nk} = \exists x, x', y.\; R_1(x, y) \wedge R_2(x', y)$. Notice that $q_{nk}$ and q have exactly the same join graph.
The only difference between them is that in $q_{nk}$, the two literals are related exclusively
by a nonkey-to-nonkey join; whereas in q, they are related by both a key-to-key and a
nonkey-to-nonkey join. Our intuition is that a query with a cyclic join graph may be
tractable only if there are literals related by more than one type of join (e.g., nonkey-
to-nonkey and key-to-key). We formalize this intuition with the definition of a class $\mathcal{C}^*$,
which essentially separates the different types of joins of the query. In $\mathcal{C}^*$, every pair of
literals can be related by at most one type of join (i.e., key-to-key, nonkey-to-nonkey,
and nonkey-to-key).
Definition 5.3. Let q be a conjunctive query without repeated relation symbols and all
of whose nonkey-to-key joins are full. We say that q is in class $\mathcal{C}^*$ if for every pair R
and $R'$ of literals of q at most one of the following conditions holds:
- there is a key-to-key join between R and $R'$.
- there is a nonkey-to-nonkey join between R and $R'$.
- there are literals $R_1, \dots, R_m$ in q such that there is a nonkey-to-key join from R to
  $R_1$, from $R_m$ to $R'$, and from $R_i$ to $R_{i+1}$, for every i such that $1 \le i < m$.
Notice that $\mathcal{C}^*$ is a fairly broad class of queries. For example, it includes the class
of queries that have exclusively nonkey-to-key joins. In general, the only queries that
are outside $\mathcal{C}^*$ are the ones that have a pair of literals related by more than one type of
join. As anecdotal evidence of the practicality of the class, the only query in the TPC-H
benchmark [TPC03] that has nonkey-to-nonkey joins (Query 5) is in $\mathcal{C}^*$. From the results
of this chapter, we can immediately conclude that the problem of computing consistent
answers for this query is not first-order rewritable.
We will consider a class, called $\mathcal{C}_{hard}$, of all queries of $\mathcal{C}^*$ that are not in $\mathcal{C}_{forest}$. The
main result of this chapter, Theorem 5.5, proves that the problem of computing the
consistent answers for every query of $\mathcal{C}_{hard}$ is coNP-complete.
Definition 5.4. We say that a query q is in class $\mathcal{C}_{hard}$ if $q \in \mathcal{C}^*$ and $q \notin \mathcal{C}_{forest}$.
Theorem 5.5. Let q be a query such that $q \in \mathcal{C}_{hard}$. Then, CONSISTENT(q, $\Sigma$) is coNP-
complete in data complexity.
Our motivation to provide a dichotomy for $\mathcal{C}^*$ is the following. First, for a fairly broad
class of queries we can test in polynomial time if the problem of computing consistent
answers is tractable. Second, our results are an initial step towards proving a dichotomy
for the larger class of all conjunctive queries. Indeed, as a result of our work, future
efforts for finding dichotomy results for conjunctive queries need to focus only on queries
whose literals are related by more than one type of join.¹
In general, by Ladner's Theorem [Lad75], there are classes of coNP problems for
which there is no dichotomy between P and coNP-complete problems. However, this
is not the case for the class of queries that is the focus of this section. In fact, as a
corollary of Theorems 3.5 and 5.5, we get a dichotomy between membership in P and
coNP-completeness. Notice that, given a query q such that $q \in \mathcal{C}^*$, it can be decided in
polynomial time on which side of the dichotomy the query q falls.
Corollary 5.6. Let q be a query such that $q \in \mathcal{C}^*$. Then, CONSISTENT(q, $\Sigma$) is either in
P, or it is coNP-complete.
Under a complexity-theoretic assumption, we also get a dichotomy between first-order
rewritability and first-order inexpressibility for the class $\mathcal{C}^*$. That is, for all the queries
of $\mathcal{C}^*$ that are not in $\mathcal{C}_{hard}$, we can produce a first-order rewriting using our algorithm
RewriteForest. For the queries of $\mathcal{C}_{hard}$, since the problem of obtaining consistent answers
is coNP-complete, there is no first-order rewriting, unless P=NP (which is unlikely).
¹ Since $\mathcal{C}^*$ intersects, but does not contain $\mathcal{C}_{forest}$, we know that there are queries outside $\mathcal{C}^*$ for which
the problem of computing consistent answers is tractable.
Corollary 5.7. Let q be a query such that $q \in \mathcal{C}^*$. Assuming P $\neq$ NP, the problem
CONSISTENT(q, $\Sigma$) is first-order rewritable iff $q \in \mathcal{C}_{forest}$.
Tractable but not First-Order Rewritable Queries
An interesting question is whether there are queries for which the problem of computing
consistent answers is tractable, yet not first-order rewritable. Although this remains
open for conjunctive queries without inequalities, we now show that there are tractable
conjunctive queries with inequalities that are not first-order rewritable.
Consider a schema with one binary relation R(E, S). Assume that E is the key of
the relation. Consider the following query q:
$q = \exists e_1, e_2, s : R(e_1, s) \wedge R(e_2, s) \wedge e_1 \neq e_2$
In order to find the consistent answers for q, we construct a graph of the inconsistent
database instance as follows.² Let I be a database instance with one binary relation
R(E, S). The graph G of I is a bipartite graph G, with partitions E and S. Partitions
E and S have one vertex for each value in the active domain of attributes E and S,
respectively. The set of edges of G consists of all tuples (e, s) of R.
We use the graph of I to introduce the following necessary and sufficient condition
for $\mathrm{consistent}_\Sigma(q, I) = false$.
Lemma 5.8. Let I be a database with one binary relation R(E, S), possibly inconsistent
wrt a functional dependency $\Sigma = E \rightarrow S$. Then, $\mathrm{consistent}_\Sigma(q, I) = false$ iff the
graph G of I has a perfect matching.
Proof. ($\Leftarrow$) Assume that G has a perfect matching M. We can build an instance $\mathcal{R}$ by
creating a tuple in $\mathcal{R}$ for each edge in M. Since M is a matching, each vertex from
partition S is incident to at most one edge. Therefore, $\mathcal{R} \not\models q$. Also, since the matching
is perfect, every key appears in $\mathcal{R}$. Consequently, $\mathcal{R}$ is minimal, and therefore it is a repair
of I wrt $\Sigma$.
($\Rightarrow$) Assume that $\mathrm{consistent}_\Sigma(q, I) = false$. Then, there must exist a repair $\mathcal{R}$ of I
wrt $\Sigma$ such that $\mathcal{R} \not\models q$. We can construct a graph $G'$ by selecting the edges of G that
correspond to tuples of $\mathcal{R}$. It is easy to see that $G'$ is a perfect matching of G.
² Notice that unlike the join graph of a query, this graph is constructed from a database instance, not
a query.
There are a number of algorithms in the literature for deciding the existence of a
perfect bipartite matching. For example, one of the best known is given by Hopcroft and
Karp [HK75], and runs in $O(n^{2.5})$ time. Therefore, q is a tractable query. We now show
that no approach based on query-rewriting works for q.
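The matching-based test can be sketched in Python as follows. The sketch reads "perfect matching" the way the proof of Lemma 5.8 uses it, namely a matching in which every key value of E is matched and no value of S is used twice; this reading is an assumption of the example. For brevity it uses a simple augmenting-path search rather than the Hopcroft-Karp algorithm, and it compares the result with a brute-force enumeration of repairs. All function names are illustrative.

```python
from itertools import product


def has_key_saturating_matching(rows):
    """Decide whether the bipartite graph of R(E, S) has a matching that
    covers every E-vertex, using augmenting paths (not Hopcroft-Karp)."""
    adj = {}
    for e, s in rows:
        adj.setdefault(e, []).append(s)
    match_of_s = {}  # S-vertex -> E-vertex currently matched to it

    def try_assign(e, seen):
        for s in adj[e]:
            if s in seen:
                continue
            seen.add(s)
            # Use s if it is free, or if its current partner can move elsewhere.
            if s not in match_of_s or try_assign(match_of_s[s], seen):
                match_of_s[s] = e
                return True
        return False

    return all(try_assign(e, set()) for e in adj)


def repairs_of(relation):
    """Yield every repair: one tuple per key value (key = first attribute)."""
    groups = {}
    for key, val in sorted(relation):
        groups.setdefault(key, []).append((key, val))
    for choice in product(*groups.values()):
        yield set(choice)


def consistent_q(rows):
    """q = exists e1, e2, s. R(e1, s), R(e2, s), e1 != e2 holds in every repair."""
    for repair in repairs_of(rows):
        values = [s for _, s in repair]
        if len(values) == len(set(values)):
            return False  # this repair uses pairwise distinct S-values
    return True


rows = {(1, "a"), (1, "b"), (2, "a")}
print(consistent_q(rows), has_key_saturating_matching(rows))
```

On any instance, the consistent answer should be false exactly when such a matching exists, which mirrors the statement of Lemma 5.8 under the reading stated above.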
Theorem 5.9. There is no first-order rewriting Q of q such that $\mathrm{consistent}_\Sigma(q, I) = Q(I)$
for every instance I.
Proof. Let $A_1, \dots, A_n$ be a family of sets. A system of distinct representatives
[Ost70] of $A_1, \dots, A_n$ is a sequence of n distinct elements $a_1, \dots, a_n$ with
$a_i \in A_i$, $1 \le i \le n$. Let R be a binary relation that encodes $A_1, \dots, A_n$ as follows:
R(i, x) iff $x \in A_i$. Let G be the graph of R as constructed above. Clearly, G has a
perfect matching iff $A_1, \dots, A_n$ has a system of distinct representatives. By Lemma 5.8,
$\mathrm{consistent}_\Sigma(q, I) = false$ iff G has a perfect matching.
Let I be the database instance that consists of relation R. Assume that there is
a first-order query Q such that $I \not\models Q$ iff $\mathrm{consistent}_\Sigma(q, I) = false$. Then, Q can
test whether $A_1, \dots, A_n$ has a system of distinct representatives. But it is known in the
literature [LW95] that relational algebra, with an appropriate encoding of sets, cannot
test whether a family of sets has a system of distinct representatives; contradiction.
5.2.2 Basic Intractable Cases
The intractability of all queries in $\mathcal{C}_{hard}$ will be shown as follows. First, we show in
Lemma 5.10 that the problem of computing consistent answers for conjunctive queries
is in coNP. This is a result known in the literature, but we briefly give a proof for our
setting. For hardness, we will use a reduction from the problem of computing consistent
answers for one of two particular queries to the problem of computing consistent answers
for q. One of these specific queries is the query $q_{nk} = \exists x, x', y.\; S_1(x, y) \wedge S_2(x', y)$. This
query has a nonkey-to-nonkey join, and was shown to be intractable in Lemma 5.1. The
other query has a cycle of nonkey-to-key joins, and is shown to be intractable in Lemma
5.11.
The next lemma shows that the problem of computing consistent answers for
conjunctive queries is in coNP.
Lemma 5.10. Let q be a conjunctive query. The problem CONSISTENT(q, $\Sigma$) is in coNP.
Proof. Let I be an instance. In order to decide whether $\vec{t} \notin \mathrm{consistent}_\Sigma(q, I)$, it suffices
to show a repair $\mathcal{R}$ of I such that $\mathcal{R} \not\models q[\vec{t}]$. The size of $\mathcal{R}$ is polynomially bounded by the
size of I. In particular, by Proposition 3.6, $\mathcal{R} \subseteq I$. Furthermore, $\mathcal{R} \not\models q[\vec{t}]$ can be checked
in polynomial time, since q is a conjunctive query.
In the next lemma, we show the coNP-hardness of computing consistent answers for
one of the two particular queries that will be used in Lemma 5.14. The coNP-hardness
of the other query was proven in Lemma 5.1.
Lemma 5.11. Let $q = \exists x, y.\; T_1(x, y) \wedge T_2(y, x)$. Then, the problem CONSISTENT(q, $\Sigma$) is
coNP-hard.
Proof. We will prove hardness by reduction from MONOTONE-3SAT. Let $\phi = \phi_1 \wedge \dots \wedge \phi_m$
be a monotone 3CNF formula. We shall build an instance I as follows:
- For each atom z, let $\phi_{i_1}, \dots, \phi_{i_n}$ be the positive clauses where z occurs. Add tuples
  $T_1(\langle \phi_{i_1}, \dots, \phi_{i_n} \rangle, z)$ and $T_2(z, \langle \phi_{i_1}, \dots, \phi_{i_n} \rangle)$ to I.
- For each atom z, let $\phi_{i_1}, \dots, \phi_{i_n}$ be the negative clauses where z occurs. Add tuples
  $T_1(\langle \phi_{i_1}, \dots, \phi_{i_n} \rangle, z)$ and $T_2(z, \langle \phi_{i_1}, \dots, \phi_{i_n} \rangle)$ to I.
We now show that $\mathrm{consistent}_\Sigma(q, I) = false$ iff $\phi$ is satisfiable.
($\Rightarrow$) Since $\mathrm{consistent}_\Sigma(q, I) = false$, there exists a repair $\mathcal{R}$ of I such that $\mathcal{R} \not\models q$.
Assume towards a contradiction that there are tuples $T_1(c, z) \in \mathcal{R}$ and $T_1(c', z) \in \mathcal{R}$ such
that $c \neq c'$. By construction of I, if $T_2(z, d) \in I$, then d = c or d = $c'$. By Propositions
3.6 and 3.7, either $T_2(z, c) \in \mathcal{R}$ or $T_2(z, c') \in \mathcal{R}$. Thus, $\mathcal{R} \models q$; contradiction.
We now build a valuation v for the variables of $\phi$ as follows. For each variable z,
we let v(z) = true if there is some c such that $T_1(c, z) \in \mathcal{R}$ and c is a list of positive
clauses; and we let v(z) = false if there is some c such that $T_1(c, z) \in \mathcal{R}$, and c is a list
of negative clauses. It is easy to see that v is a truth valuation that satisfies $\phi$.
($\Leftarrow$) Assume that $\phi$ is satisfiable. Let v be a truth assignment for the variables of $\phi$.
We shall build a repair $\mathcal{R}$ as follows. For each positive clause $\phi_i$, select a variable z that
appears in $\phi_i$ and such that v(z) = true. Add $T_1(c, z)$ to $\mathcal{R}$, where c is a list of positive
clauses. For each negative clause $\phi_i$, select a variable z that appears in $\phi_i$ and such that
v(z) = false. Add $T_1(c, z)$ to $\mathcal{R}$, where c is a list of negative clauses. For each variable
z, if v(z) = false, add $T_2(z, c)$ to $\mathcal{R}$, where c is a list of positive clauses; if v(z) = true,
add $T_2(z, c)$ to $\mathcal{R}$, where c is a list of negative clauses. It is easy to see that $\mathcal{R} \not\models q$.
We now give some auxiliary results before proving Lemma 5.14. The next lemma
generalizes Lemma 5.11 from cycles of length two to the case of cycles of arbitrary length.
Lemma 5.12. Let q be the query $\exists w_1, \dots, w_m.\; S_1(w_m, w_1) \wedge S_2(w_1, w_2) \wedge \dots \wedge S_m(w_{m-1}, w_m)$.
Let $q' = \exists x, y.\; T_1(x, y) \wedge T_2(y, x)$. Then, there is a polynomial time reduction from the
problem CONSISTENT($q'$, $\Sigma'$) to the problem CONSISTENT(q, $\Sigma$).

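Proof. Let $I'$ be an instance over the schema of $q'$. We shall build an instance I over the
schema of q as follows:
for each valuation $\nu_{q'}$ for the variables of $q'$ such that $I' \models T_1(x, y) \wedge T_2(y, x)[\nu_{q'}]$ do
    Let $\nu_q(w_m) = \nu_{q'}(x)$
    Let $\nu_q(w_1) = \nu_{q'}(y)$
    Create a new constant $c_{new}$
    for i := 2 to m-1 do
        Let $\nu_q(w_i) = c_{new}$
    end for
    Add the tuples of $S_1(w_m, w_1) \wedge S_2(w_1, w_2) \wedge \dots \wedge S_m(w_{m-1}, w_m)[\nu_q]$ to I
end for
We claim that $\mathrm{consistent}_{\Sigma'}(q', I') = true$ iff $\mathrm{consistent}_\Sigma(q, I) = true$.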
($\Rightarrow$) Let $\mathcal{R}$ be a repair of I over the schema of q. We shall build a repair $\mathcal{R}'$ over the
schema of $q'$ as follows:
for each tuple $S_1(c_m, c_1)$ of $\mathcal{R}$ do
    Add a tuple $T_1(c_m, c_1)$ to $\mathcal{R}'$
    for each $c_{new}$ such that $S_2(c_1, c_{new}) \in \mathcal{R}$ and $S_m(c_{new}, c_m) \in \mathcal{R}$ do
        Add a tuple $T_2(c_1, c_m)$ to $\mathcal{R}'$
    end for
end for
Since $\mathrm{consistent}_{\Sigma'}(q', I') = true$, $\mathcal{R}' \models q'$. Thus, there is a valuation $\nu_{q'}$ such that
$\mathcal{R}' \models T_1(x, y) \wedge T_2(y, x)[\nu_{q'}]$. Let $c_m = \nu_{q'}(x)$, $c_1 = \nu_{q'}(y)$. Since $T_2(c_1, c_m) \in \mathcal{R}'$, there
exists $c_{new}$ such that $S_2(c_1, c_{new}) \in \mathcal{R}$ and $S_m(c_{new}, c_m) \in \mathcal{R}$. Let $\nu_q$ be a valuation for the
variables of q such that:
$\nu_q(w_m) = c_m$
$\nu_q(w_1) = c_1$
$\nu_q(w_i) = c_{new}$, for $1 < i < m$
Since $T_1(c_m, c_1) \in \mathcal{R}'$, $S_1(c_m, c_1) \in \mathcal{R}$. By construction of $\nu_q$, $S_2(c_1, c_{new}) \in \mathcal{R}$ and
$S_m(c_{new}, c_m) \in \mathcal{R}$. For $2 < i \le m$, notice that by construction of I, there are no tuples
$S_i(c_i, d_i)$ and $S_i(c_i, d'_i)$ in I such that $d_i \neq d'_i$. Therefore, by Propositions 3.6 and 3.7,
every tuple in the extension of $S_i$ in I appears in the extension of $S_i$ in $\mathcal{R}$. By construction
of I, $S_i(c_{new}, c_{new}) \in I$, for $3 \le i \le m-1$. Thus, $S_i(c_{new}, c_{new}) \in \mathcal{R}$. We conclude that
$\mathcal{R} \models S_1(w_m, w_1) \wedge S_2(w_1, w_2) \wedge \dots \wedge S_m(w_{m-1}, w_m)[\nu_q]$. Thus, $\mathcal{R} \models q$.
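($\Leftarrow$) Let $\mathcal{R}'$ be a repair of $I'$. We shall build an instance $\mathcal{R}$ over the schema of q as follows:
for each tuple $T_1(c_m, c_1)$ of $\mathcal{R}'$ do
    Add a tuple $S_1(c_m, c_1)$ to $\mathcal{R}$
    Let $c_{new}$ be a constant such that $S_2(c_1, c_{new}) \in I$ and $S_m(c_{new}, c_m) \in I$
    Add a tuple $S_2(c_1, c_{new})$ to $\mathcal{R}$
    for i := 3 to m-1 do
        Add a tuple $S_i(c_{new}, c_{new})$ to $\mathcal{R}$
    end for
    Add a tuple $S_m(c_{new}, c_m)$ to $\mathcal{R}$
end for
It is easy to see that $\mathcal{R}$ is a repair of I. Since $\mathrm{consistent}_\Sigma(q, I) = true$, $\mathcal{R} \models q$.
Thus, there exists some valuation $\nu_q$ such that
$\mathcal{R} \models S_1(w_m, w_1) \wedge S_2(w_1, w_2) \wedge \dots \wedge S_m(w_{m-1}, w_m)[\nu_q]$. Let $\nu_{q'}$ be such that:
$\nu_{q'}(x) = \nu_q(w_m)$
$\nu_{q'}(y) = \nu_q(w_1)$
It is easy to see that $\mathcal{R}' \models T_1(x, y) \wedge T_2(y, x)[\nu_{q'}]$. Thus, $\mathcal{R}' \models q'$.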
5.2.3 Generalizing the Basic Cases
Our strategy for proving the dichotomy will be to show that if q has a subquery $q'$ that
is known to be intractable (in particular, a cycle), then q is not tractable. This does not
hold in general, but as we show with the next auxiliary result, it holds for the queries in
$\mathcal{C}^*$.
Lemma 5.13. Let q be a Boolean query such that $q \in \mathcal{C}^*$. Let $R_1(\vec{x}_1, \vec{y}_1), \dots, R_n(\vec{x}_n, \vec{y}_n)$
be the literals of q. Let $q'$ be a Boolean query. Let $S_1(x_1, y_1), \dots, S_m(x_m, y_m)$ be the
literals of $q'$, where $m \le n$. Assume that the join graph of $q'$ is a cycle.
Let $L = \{x_1, y_1, \dots, x_m, y_m\}$. Assume that:
- $x_i$ occurs in $\vec{x}_i$, for $1 \le i \le m$, and
- $y_i$ occurs in $\vec{y}_i$, for $1 \le i \le m$, and
- for $1 \le i \le m$, if $w \in L$ and w occurs in $R_i$, then w occurs in $S_i$.
Then, there is a polynomial-time reduction from the problem CONSISTENT($q'$, $\Sigma'$) to
CONSISTENT(q, $\Sigma$).
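Proof. Let $F = \{w : w \text{ occurs in } R_i, \text{ and } 1 \le i \le m\} - L$. Let $U = \{w : w \text{ occurs in } q\} - F - L$.
Let $I'$ be an instance over the schema of $q'$. We shall build an instance I over the schema
of q as follows:
for each variable w such that $w \in F$ do
    Create a new constant $c_{new}$
    Let $\nu_F(w) = c_{new}$
end for
for each valuation $\nu_{q'}$ for the variables of $q'$ such that
        $I' \models S_1(x_1, y_1) \wedge \dots \wedge S_m(x_m, y_m)[\nu_{q'}]$ do
    for each variable w such that $w \in F$ do
        Let $\nu_q(w) = \nu_F(w)$
    end for
    for each variable w such that $w \in U$ do
        Create a new constant $c_{new}$
        Let $\nu_q(w) = c_{new}$
    end for
    for i := 1 to m do
        Let $\nu_q(x_i) = \nu_{q'}(x_i)$
        Let $\nu_q(y_i) = \nu_{q'}(y_i)$
    end for
    Add the tuples of $R_1(\vec{x}_1, \vec{y}_1) \wedge \dots \wedge R_n(\vec{x}_n, \vec{y}_n)[\nu_q]$ to I
end for
We claim that $\mathrm{consistent}_{\Sigma'}(q', I') = true$ iff $\mathrm{consistent}_\Sigma(q, I) = true$.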
($\Rightarrow$) Let $\mathcal{R}$ be a repair of I over the schema of q. We shall build an instance $\mathcal{R}'$ over
the schema of $q'$ as follows.
for i := 1 to m do
    for each tuple $R_i(\vec{c}_i, \vec{d}_i)$ of $\mathcal{R}$ do
        Let $c_i$ be the constant that appears in $\vec{c}_i$ at the position of one of the occurrences
        of $x_i$ in $\vec{x}_i$.
        Let $d_i$ be the constant that appears in $\vec{d}_i$ at the position of $y_i$ in $\vec{y}_i$
        Add $S_i(c_i, d_i)$ to $\mathcal{R}'$
    end for
end for
We make the following observations with respect to the construction of $\mathcal{R}'$. By
construction of I, if $R_i(\vec{c}_i, \vec{d}_i) \in I$, the same constant appears in $\vec{c}_i$ at all the positions where
$x_i$ appears in $\vec{x}_i$. By Proposition 3.6, $\mathcal{R} \subseteq I$. Thus, in the construction of $\mathcal{R}'$, it suffices
to choose the constant that occurs in $\vec{c}_i$ at any of the positions where $x_i$ occurs in $\vec{x}_i$.
Assume that $\mathcal{R}'$ is not a repair of $I'$. Then, there are constants $c_i$, $d_i$ and $d'_i$ such
that $d_i \neq d'_i$, $S_i(c_i, d_i) \in \mathcal{R}'$ and $S_i(c_i, d'_i) \in \mathcal{R}'$. By construction of $\mathcal{R}'$, there are tuples
$R_i(\vec{c}_i, \vec{d}_i) \in \mathcal{R}$ and $R_i(\vec{c}\,'_i, \vec{d}\,'_i) \in \mathcal{R}$ such that $c_i$ appears in $\vec{c}_i$ and $\vec{c}\,'_i$ at all the positions
where $x_i$ appears in $\vec{x}_i$; and $d_i$ and $d'_i$ appear in $\vec{d}_i$ and $\vec{d}\,'_i$, respectively, at the position
of $y_i$ in $\vec{y}_i$. Clearly, $\vec{d}_i \neq \vec{d}\,'_i$. By construction of I, if w is a variable such that $w \notin L$,
w is assigned the value $\nu_F(w)$ in every tuple of I. By Proposition 3.6, $\mathcal{R} \subseteq I$. Thus,
$\vec{c}_i = \vec{c}\,'_i$. Since $\vec{d}_i \neq \vec{d}\,'_i$, $\mathcal{R}$ does not satisfy the key constraints of $\Sigma$. Thus $\mathcal{R}$ is not a repair;
contradiction. We conclude that $\mathcal{R}'$ is a repair of $I'$.
Since $\mathrm{consistent}_{\Sigma'}(q', I') = true$, $\mathcal{R}' \models q'$. Thus, there is some valuation $\nu_{q'}$ such
that $\mathcal{R}' \models S_1(x_1, y_1) \wedge \dots \wedge S_m(x_m, y_m)[\nu_{q'}]$. Let $\nu_m$ be a valuation for the variables of
$R_1, \dots, R_m$ such that:
$\nu_m(x_i) = \nu_{q'}(x_i)$, for $1 \le i \le m$
$\nu_m(y_i) = \nu_{q'}(y_i)$, for $1 \le i \le m$
$\nu_m(w) = \nu_F(w)$ if $w \in F$
Let w be a variable that appears in $R_i$, for $1 \le i \le m$. If $w \in L$ and w occurs in
$R_i$, by hypothesis, w occurs in $S_i$. If $w \notin L$, then $w \in F$, by definition of F. Since
$\mathcal{R}' \models S_1(x_1, y_1) \wedge \dots \wedge S_m(x_m, y_m)[\nu_{q'}]$, and $\nu_m(w) = \nu_F(w)$ if $w \in F$, we conclude that
$\mathcal{R} \models R_1(\vec{x}_1, \vec{y}_1) \wedge \dots \wedge R_m(\vec{x}_m, \vec{y}_m)[\nu_m]$.
By construction of I, there is a valuation $\nu_q$ for the variables of q such that:
- $\nu_m(w) = \nu_q(w)$ if w appears in $R_i$, for $1 \le i \le m$; and
- $I \models R_1(\vec{x}_1, \vec{y}_1) \wedge \dots \wedge R_n(\vec{x}_n, \vec{y}_n)[\nu_q]$.
Let $R_i(\vec{x}_i, \vec{y}_i)$ be a literal of q such that i > m. Notice that we assume that the join
graph of $q'$ is a cycle. Since q is in $\mathcal{C}^*$, there exists some variable w such that w occurs in
$\vec{x}_i$ and w does not occur in any of $R_1, \dots, R_m$. Thus, $w \in U$. Since the variables of U are
assigned a distinct constant in every iteration of the algorithm that constructs I, if two
tuples $R_i(\vec{c}_i, \vec{d}_i)$ and $R_i(\vec{c}\,'_i, \vec{d}\,'_i)$ are added at different iterations, then $\vec{c}_i \neq \vec{c}\,'_i$. Therefore,
by Propositions 3.6 and 3.7, every tuple in the extension of $R_i$ in I is in the extension of
$R_i$ in $\mathcal{R}$. Therefore, $\mathcal{R} \models R_1(\vec{x}_1, \vec{y}_1) \wedge \dots \wedge R_n(\vec{x}_n, \vec{y}_n)[\nu_q]$.
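($\Leftarrow$) Let $\mathcal{I}'$ be a repair of $I'$. We build an instance $\mathcal{I}$ over the schema of $q$ as follows.
for $i := 1$ to $m$ do
    for each tuple $S_i(c_i, d_i)$ of $\mathcal{I}'$ do
        Let $R_i(\vec{c}_i, \vec{d}_i)$ be a tuple of $I$ such that $c_i$ appears in $\vec{c}_i$ at all the positions of $x_i$ in
        $\vec{x}_i$, and $d_i$ appears in $\vec{d}_i$ at the position of $y_i$ in $\vec{y}_i$
        Add $R_i(\vec{c}_i, \vec{d}_i)$ to $\mathcal{I}$
    end for
end for
for $i := m + 1$ to $n$ do
    for each tuple $R_i(\vec{c}_i, \vec{d}_i)$ in $I$ do
        Add $R_i(\vec{c}_i, \vec{d}_i)$ to $\mathcal{I}$
    end for
end for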
We will now show that $\mathcal{I}$ is a repair of $I$. Towards a contradiction, assume that $\mathcal{I}$ is
not a repair of $I$. Then, there are values $\vec{c}_i$, $\vec{d}_i$, and $\vec{d}'_i$ such that $\vec{d}_i \neq \vec{d}'_i$, $R_i(\vec{c}_i, \vec{d}_i) \in \mathcal{I}$,
and $R_i(\vec{c}_i, \vec{d}'_i) \in \mathcal{I}$.
First, assume that $1 \le i \le m$. For every variable $w$ such that $w \notin L$ and $w$ occurs
in $R_i$, $w \in F$. Thus, $w$ is assigned the same constant $\mu_F(w)$ in every tuple of $I$. By
Proposition 3.6, $\mathcal{I} \subseteq I$. Therefore, there are constants $c_i$, $d_i$ and $d'_i$ such that $d_i \neq d'_i$, $c_i$
appears in $\vec{c}_i$ at the positions of $x_i$ in $\vec{x}_i$, and $d_i$ and $d'_i$ appear in $\vec{d}_i$ and $\vec{d}'_i$, respectively,
at the position of $y_i$ in $\vec{y}_i$. By construction of $\mathcal{I}$, there are tuples $S_i(c_i, d_i)$ and $S_i(c_i, d'_i)$
in $\mathcal{I}'$. Since $d_i \neq d'_i$, $\mathcal{I}'$ does not satisfy the key constraints of $\Sigma'$. Thus, $\mathcal{I}'$ is not a repair;
contradiction.
Now, assume that $m < i \le n$. Notice that we assume that the join graph of $q'$ is
a cycle. Since $q$ is in $\mathcal{C}^*$, there exists some variable $w$ such that $w$ occurs in $\vec{x}_i$ and
$w$ does not occur in any of $R_1, \dots, R_m$. Thus, $w \in U$. Since the variables of $U$ are
assigned a different constant in every iteration of the algorithm that constructs $I$, if two
tuples $R_i(\vec{c}_i, \vec{d}_i)$ and $R_i(\vec{c}'_i, \vec{d}'_i)$ are added at different iterations, then $\vec{c}_i \neq \vec{c}'_i$. Therefore,
the extension of $R_i$ in $I$ satisfies the key dependencies of $\Sigma$. Thus, by construction of
$\mathcal{I}$, the extension of $R_i$ in $\mathcal{I}$ satisfies the key constraints of $\Sigma$. Thus, $\mathcal{I}$ is a repair of $I$;
contradiction.
We conclude that $\mathcal{I}$ is a repair of $I$. Since $\mathit{consistent}_{\Sigma}(q, I) = true$, $\mathcal{I} \models q$. Thus,
there exists some valuation $\mu_q$ such that $\mathcal{I} \models R_1(\vec{x}_1, \vec{y}_1) \wedge \dots \wedge R_n(\vec{x}_n, \vec{y}_n)[\mu_q]$. Let $\mu_{q'}$ be
a valuation for the variables of $q'$ such that, for $1 \le i \le m$:
$\mu_{q'}(x_i) = \mu_q(x_i)$
$\mu_{q'}(y_i) = \mu_q(y_i)$
Then, $\mathcal{I}' \models S_1(x_1, y_1) \wedge \dots \wedge S_m(x_m, y_m)[\mu_{q'}]$. Thus, $\mathcal{I}' \models q'$.
We are now ready to prove Lemma 5.14, which gives a polynomial-time reduction
from the problem of computing consistent answers for the queries of Lemmas 5.1 or 5.11
to every query in $\mathcal{C}_{hard}$. From this, Theorem 5.5 follows directly.
Lemma 5.14. Let $q$ be a query such that $q \in \mathcal{C}_{hard}$. Then, there is a polynomial-time
reduction from CONSISTENT($q'$) to CONSISTENT($q$, $\Sigma$), where $q'$ is one of the following
queries:
$\exists x, x', y.\ S_1(x, y) \wedge S_2(x', y)$
$\exists x, y.\ T_1(x, y) \wedge T_2(y, x)$
Proof. Let $G$ be the join graph of $q$. Let $G'$ be an induced subgraph of $G$ such that:
$G'$ is connected, and
$G'$ is not a tree, and
if $G''$ is a proper induced subgraph of $G'$, and $G''$ is connected, then $G''$ is a tree.
Let $P = (R_1, R_2, R_1)$ be a cycle of $G'$. Let $R_1(\vec{x}_1, \vec{y}_1)$ and $R_2(\vec{x}_2, \vec{y}_2)$ be the literals in
$G'$. Assume that there is some variable $y$ such that $y$ occurs in $\vec{y}_1$ and $\vec{y}_2$. By Definition
of $\mathcal{C}^*$, there is no key-to-key join between $R_1$ and $R_2$. Therefore, there exists a variable
$x$ such that $x$ occurs in $\vec{x}_1$, and $x$ does not occur in $\vec{x}_2$; and a variable $x'$ such that $x'$
occurs in $\vec{x}_2$ and $x'$ does not occur in $\vec{x}_1$. Let $q' = S_1(x, y) \wedge S_2(x', y)$. By Lemma 5.13,
there is a polynomial-time reduction from CONSISTENT($q'$) to CONSISTENT($q$, $\Sigma$).
Let $P = (R_1, \dots, R_m, R_1)$ be a cycle of $G'$. Let $R_1(\vec{x}_1, \vec{y}_1), \dots, R_m(\vec{x}_m, \vec{y}_m)$ be the
literals of $P$. Let $w_1, w_2, \dots, w_m$ be variables such that $w_i$ occurs in $\vec{y}_i$ and in $R_{(i \bmod m)+1}$,
for every $1 \le i \le m$. Assume that there is some $w_i$ such that $1 \le i \le m$ and $w_i$ occurs in
some literal $R_j$ of $q$ such that $j \neq i$ and $j \neq (i \bmod m)+1$. Then $R_1, \dots, R_i, R_j, \dots, R_1$
is a cycle. Therefore $G'$ contains a proper induced subgraph $G''$ such that $G''$ is connected,
and $G''$ is not a tree; contradiction. Let $q' = S_1(w_m, w_1) \wedge S_2(w_1, w_2) \wedge \dots \wedge S_m(w_{m-1}, w_m)$.
It can be checked that $q$ and $q'$ satisfy the conditions of Lemma 5.13. Consequently,
there is a polynomial-time reduction from CONSISTENT($q'$) to CONSISTENT($q$, $\Sigma$). Let
$q'' = \exists x, y.\ T_1(x, y) \wedge T_2(y, x)$. By Lemma 5.12, there is a polynomial-time reduction from
CONSISTENT($q''$) to CONSISTENT($q'$).
Finally, we give the proof for Theorem 5.5, the main result of this chapter.
Theorem 5.5. Let $q$ be a query such that $q \in \mathcal{C}_{hard}$. Then, CONSISTENT($q$, $\Sigma$) is
coNP-complete in data complexity.
Proof. By Lemma 5.10, CONSISTENT($q$, $\Sigma$) is in coNP. In order to prove hardness, let $q'$
be one of the following queries:
$\exists x, x', y.\ S_1(x, y) \wedge S_2(x', y)$
$\exists x, y.\ T_1(x, y) \wedge T_2(y, x)$
By Lemma 5.14, there is a polynomial-time reduction from CONSISTENT($q'$) to
CONSISTENT($q$, $\Sigma$). By Lemmas 5.1 and 5.11, CONSISTENT($q'$) is coNP-hard. Thus,
CONSISTENT($q$, $\Sigma$) is coNP-hard.
5.3 Related Work
Chomicki and Marcinkowski [CM05] and Calì, Lembo and Rosati [CLR03a] thoroughly
study the decidability and complexity of consistent query answering for several classes
of queries and integrity constraints. In order to show intractability of a class, they
take the usual approach of exhibiting one query of the class for which the problem is
intractable. To the best of our knowledge, the result that we present in Section 5.2 is the
first dichotomy result in the area of consistent query answering.
Both Chomicki and Marcinkowski and Calì, Lembo and Rosati show that the problem
of obtaining consistent answers for conjunctive queries under primary key constraints is
coNP-complete. Chomicki and Marcinkowski also show an example of a query with just
one literal but two key dependencies for which the problem is coNP-complete. This gives
further support for our decision to consider exactly one key dependency per relation.
Calì, Lembo and Rosati show the undecidability of the problem of obtaining consistent
answers when the set of constraints contains primary keys and arbitrary inclusion
dependencies. They also show that the problem becomes decidable for foreign key constraints
(it is coNP-complete). Chomicki and Marcinkowski study the same problem but under
a semantics where only tuple deletion is allowed (i.e., repairs are always subsets of the
inconsistent database). In this case, the problem is $\Pi_2^p$-complete, and becomes
coNP-complete if the inclusion dependencies are restricted to be acyclic.
Chapter 6
ConQuer: System Implementation
and SQL Rewritings
In this chapter, we present ConQuer, a system for querying inconsistent databases.
We demonstrated this system at the International Conference on Very Large Databases
(VLDB) [FFM05b]. In Section 6.1, we describe the system implementation and a typical
scenario where it can be used. Then, in Sections 6.2 and 6.3, we present the SQL rewritings
that are at the core of ConQuer's approach. In Section 6.4, we show how, if desired,
ConQuer can process the database offline in order to improve the performance of the
queries. Finally, in Section 6.5, we review other systems that are related to ConQuer.
6.1 System Implementation
ConQuer is implemented in Java and follows a modular architecture. It consists of the
following components:
Query Rewriting Module. It rewrites an input SQL query into another SQL
query that computes the consistent answers. The details of the rewritings are
presented in Sections 6.2 to 6.4. The SQL queries are parsed using javacc.
Query Execution Engine. The rewritten queries are executed using IBM DB2
UDB Version 8.2. The connection with the database is done through JDBC.
Conflict Resolution Module. Provides a tracing facility to find the data that
leads to differences between the answer to the original query and the consistent
answer. This module also permits a user to update the database to correct errors.
Figure 6.1: Interface for entering hypothetical primary key constraints in ConQuer
User Interface. Query results are displayed using a Web-accessible interface that
is implemented in PHP.
We illustrate a typical use case of ConQuer on a database with information about
airports. The user first specifies a set of primary key constraints using the interface shown
in Figure 6.1. These are the constraints that should hold on a consistent database, but
may be violated by the actual database that is being queried. Notice that for the same
schema and database, there is the flexibility of running queries under different sets of
potentially violated primary key constraints. Then, the user writes a SQL query within
the interface. In Figure 6.2, we show a query where the user is asking for all the countries
that have airports located north of parallel 63N. The result of the query is shown in Figure
6.3. The consistent answers are shown in bold, and the potential answers (i.e., possible
answers that are not consistent answers) are shown in italics. For example, in this case
Italy is a potential answer.
While consistent answers are best suited for decision making, potential answers can be
used to understand the reasons why a database is inconsistent. In this case, the user could
click on Italy and obtain an explanation, which is shown in Figure 6.4. The explanation
is the lineage (or why-provenance) [BKT01, CW03] of the result, i.e., the tuples in the
database that contribute to the answer. According to the explanation, Italy is a potential
answer because it has one airport that appears as satisfying the query (parallel 63) in
one tuple, and violating it (parallel 45) in another. Notice that in the comment to the
query, the user wrote "select countries that are located north of Trondheim". Trondheim
is a Norwegian city, and the user may have background knowledge telling that all Italian
cities are south of Norwegian cities. Thus, the user could use the explanation obtained
from ConQuer in order to remove the tuple for the Italian airport located on parallel 63.
Figure 6.2: Interface for entering queries in ConQuer
6.2 ConQuer Rewritings for Queries without Aggregation
In this section, we present the SQL rewritings produced by ConQuer for a class of
Select-Project-Join (SPJ) queries with set semantics. We delay the treatment of conjunctive
queries that return duplicates until the next section, where the number of duplicates
returned by the queries can be counted with the count(*) aggregate function. We first
give the query rewriting algorithm, and then we illustrate it with a number of examples.
6.2.1 Rewriting Algorithm
We now present a SQL rewriting algorithm for SPJ queries that are equivalent to a
conjunctive query in the class $\mathcal{C}_{forest}$, introduced in Definition 3.4, which we repeat next.
Figure 6.3: Query results in ConQuer
Figure 6.4: Query explanation in ConQuer
Definition 3.4. Let $q$ be a conjunctive query without repeated relation symbols and all of
whose nonkey-to-key joins are full. Let $G$ be the join graph of $q$. We say that $q \in \mathcal{C}_{forest}$
if $G$ is a forest (i.e., every connected component of $G$ is a tree).
The above definition requires three conditions on the conjunctive query. First, that
the query has no repeated relation symbols. For an SPJ SQL query, this means that each
relation can be used at most once in the from clause. Second, that all its nonkey-to-key
joins must be full. For an SPJ query, this means that if an attribute of a key of a relation
$r_1$ is equated in the where clause with a nonkey attribute of another relation $r_2$, then all
the attributes of the key of $r_1$ are equated to nonkey attributes of $r_2$. Finally, the join
graph of $q$ must be a forest. The notion of a join graph is introduced in Definition 3.1,
and we repeat it next.
Definition 3.1 (join graph). Let $q$ be a conjunctive query. The join graph $G$ of $q$ is a
directed graph such that:
the vertices of $G$ are the literals of $q$;
there is an arc from literal $R_i$ to literal $R_j$ if $i \neq j$, and there is some variable $w$
such that $w$ is existentially-quantified in $q$, $w$ occurs at the position of a nonkey
attribute in $R_i$, and $w$ occurs in $R_j$.
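As an illustration of Definition 3.1, the following Python sketch builds the arcs of a join graph and tests whether the graph is a forest. The encoding of a literal as a triple (relation name, key variables, nonkey variables), the function names, and the reading of "forest" as "every vertex has at most one incoming arc and the underlying undirected graph is acyclic" are assumptions made for this sketch; constants are simply left out of the variable lists.

```python
def join_graph(literals, free_vars):
    """Return the arcs (i, j) of the join graph as a set of index pairs.

    literals: list of (name, key_vars, nonkey_vars) triples (hypothetical encoding).
    free_vars: variables that are not existentially quantified.
    """
    arcs = set()
    for i, (_, _, nonkey_i) in enumerate(literals):
        for j, (_, key_j, nonkey_j) in enumerate(literals):
            if i == j:
                continue
            vars_j = set(key_j) | set(nonkey_j)
            # Arc R_i -> R_j: an existentially quantified variable at a
            # nonkey position of R_i also occurs somewhere in R_j.
            if any(w not in free_vars and w in vars_j for w in nonkey_i):
                arcs.add((i, j))
    return arcs


def is_forest(n, arcs):
    """Check (under the assumption stated above) that the graph is a forest."""
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    indegree = [0] * n
    for i, j in arcs:
        indegree[j] += 1
        ri, rj = find(i), find(j)
        if ri == rj:  # this arc closes an undirected cycle
            return False
        parent[ri] = rj
    return all(d <= 1 for d in indegree)


# employee(e, d, c), dept(d, 'Peter'), city(c, p), prov(p, 'Canada'), with e free
tree_query = [
    ("employee", ("e",), ("d", "c")),
    ("dept", ("d",), ()),
    ("city", ("c",), ("p",)),
    ("prov", ("p",), ()),
]
# T1(x, y), T2(y, x), with no free variables
cycle_query = [("T1", ("x",), ("y",)), ("T2", ("y",), ("x",))]

tree_arcs = join_graph(tree_query, {"e"})
cycle_arcs = join_graph(cycle_query, set())
```

The first query has the arcs employee to dept, employee to city, and city to prov, which form a tree; the second has arcs in both directions between its two literals, so its join graph is a cycle.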
An analogous definition can be given for the join graph of an SPJ SQL query. The
vertices of the graph will be the relation symbols in the from clause of the query.
Furthermore, there will be an arc from relation $r_i$ to relation $r_j$ if there is an attribute A
in $r_i$ such that (1) A is not in the key of $r_i$ (it is a nonkey attribute); (2) A does not
appear in the select clause of the query, and A is not equated to any attribute B such
that B appears in the select clause of the query (this corresponds to the notion of
an existentially-quantified variable for conjunctive queries); and (3) there is some equality
in the where clause relating A to some attribute B of $r_j$ (i.e., a nonkey-to-key or
nonkey-to-nonkey join).$^1$
We can now give a definition analogous to $\mathcal{C}_{forest}$ for SPJ SQL queries. A query $q$ is
in class $\mathcal{C}^{sql}_{forest}$ if no relation appears twice in the from clause of $q$, all the nonkey-to-key
joins of $q$ are full, and the join graph of $q$ is a forest.
$^1$ This definition works for repeated relation symbols as well. In such a case, we assume that if a relation
appears more than once in the from clause, then it is aliased to a new name using the as operator.
We are now ready to give ConQuer's rewriting algorithm for SPJ queries in $\mathcal{C}^{sql}_{forest}$.
The algorithm is called RewriteForestSQL and is shown in Figure 6.5. The algorithm
takes as input a SQL query $q$ in $\mathcal{C}^{sql}_{forest}$ and a set of key constraints $\Sigma$ (one per relation of
the schema), and returns a SQL rewriting $Q$ of $q$.
In the rewriting $Q$, the attributes of the relations in $q$ play different roles. In
particular, we will distinguish the attributes that the query projects on (i.e., that appear
in the select clause), and the attributes that appear in the key of a relation that is
at the root of some tree in the join graph of $q$. In the rest of the discussion, we will
call these attributes projecting attributes, and key-root attributes, respectively. The
former are denoted in Figure 6.5 with the symbols $S_1, \dots, S_l$; the latter are denoted with
$K_1, \dots, K_n$.
The rewriting $Q$ has three subqueries, specified using a with clause: candidatesSubQuery,
countViolSubQuery and countProjSubQuery. The purpose of candidatesSubQuery
is to prune the number of values for the key-root attributes that should be
considered by the other subqueries. In particular, candidatesSubQuery applies the
selection conditions of the original query $q$, and projects on its key-root attributes. These
attributes are used to perform an inner join in the next subquery (countViolSubQuery).
If the selectivity of $q$ is low (i.e., few tuples satisfy its conditions), and the query optimizer
pushes down the selection conditions of candidatesSubQuery in the query plan, we would
expect the rewriting to have a low overhead with respect to the original query. We validate
this conjecture in Section 7.2.
Let COND be the list of conditions in the where clause of $q$. In the select clause
of countViolSubQuery, we count the number of tuples that violate the conditions of
COND, we group by the key-root attributes, and keep the result in an attribute called
countViol as follows:
    sum(case when COND then 0 else 1 end)
        over (partition by K_1, ..., K_n)
        as countViol
Notice the use of the partition by clause. This clause (introduced in the OLAP
Amendment to SQL [ISO01]) differs from the typical group by clause in that it permits
grouping by a set of attributes that may not include all the attributes in the select
clause. This is useful here because we partition by the root-key attributes, but the
select clause of countViolSubQuery also includes the projecting attributes of the query.
In the main body of the query, we filter out the tuples whose key-root attributes are
involved in a violation of COND by checking the condition countViol=0.
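To make the effect of partition by concrete, here is a minimal sketch that runs a countViol expression of this shape on a toy employee table. It uses Python's built-in sqlite3 module rather than DB2, and assumes an SQLite build with window-function support (version 3.25 or later); the table contents and the condition salary <= 1000 are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employee (emplKey text, salary integer);
    -- 'alice' violates the key: she appears with two different salaries
    insert into employee values ('alice', 900), ('alice', 1200), ('bob', 800);
""")

# Every row keeps its own salary (a projecting attribute), while countViol
# is computed once per emplKey partition and repeated on each row of it.
rows = conn.execute("""
    select emplKey, salary,
           sum(case when salary <= 1000 then 0 else 1 end)
               over (partition by emplKey) as countViol
    from employee
    order by emplKey, salary
""").fetchall()
```

Unlike a group by on emplKey, the window version does not collapse the two rows for alice: both survive, each carrying countViol = 1, so the projecting attributes remain available to later subqueries.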
The from clause of subquery countViolSubQuery is obtained by calling a procedure
called GetJoinsExpression (shown in Figure 6.6), with the join graph of $q$ and the list
of conditions COND as parameters. This procedure consists of two parts. In the first
part, an inner join is computed for the key-to-key joins of relations that are at the root
of some tree of the join graph. Notice that since these relations are in distinct connected
components of the join graph, they are not related by a nonkey-to-key join. In the second
part, the procedure produces a left outer join expression for each tree of the join graph.
This is done by recursively calling the procedure GetTreeJoinsExpression for the nodes
of each tree (also shown in Figure 6.6). The expression returned by GetTreeJoinsExpression
is a left outer join of all relations in the input tree, listed in an order corresponding
to a preorder traversal of the trees.
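The recursive part of this procedure can be sketched in a few lines of Python. This is an illustrative reading of GetTreeJoinsExpression, not ConQuer's Java implementation: the function name, the dictionary-based tree encoding, and the relation and attribute names in the usage example are assumptions.

```python
def tree_joins_expression(root, children, join_cond):
    """Return the left-outer-join suffix for the tree rooted at `root`.

    children:  dict mapping a relation to the list of its child relations.
    join_cond: dict mapping (parent, child) to the SQL equality joining them.
    """
    parts = []
    kids = children.get(root, [])
    # First join every child of the root to the root itself ...
    for child in kids:
        parts.append(f"left outer join {child} on {join_cond[(root, child)]}")
    # ... then recurse into the subtree rooted at each child.
    for child in kids:
        sub = tree_joins_expression(child, children, join_cond)
        if sub:
            parts.append(sub)
    return " ".join(parts)


# A tree with root employee, children dept and city, and prov below city.
children = {"employee": ["dept", "city"], "city": ["prov"]}
join_cond = {
    ("employee", "dept"): "employee.deptFKey=dept.deptKey",
    ("employee", "city"): "employee.cityFKey=city.cityKey",
    ("city", "prov"): "city.provFKey=prov.provKey",
}
from_clause = "employee " + tree_joins_expression("employee", children, join_cond)
```

Because a parent is always joined before its descendants, every on condition refers only to relations that already appear earlier in the from clause.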
We will illustrate shortly (in Example 6.4) the rewriting for queries where some of
the root-key attributes do not appear in the select clause (that is, some root-key
attributes are not projecting attributes). We will argue that in such cases, we would
like to count the number of distinct values for the projecting attributes, grouping by
the root-key attributes. We will also show how to do this by using the max aggregate
function (with a partition by clause) and the rank OLAP function. In the algorithm
RewriteForestSQL of Figure 6.5, the rank function is used in the subquery
countViolSubQuery, and the max function is used in the subquery countProjSubQuery.
The result of this aggregation is kept in an attribute called countProjection, which
keeps the count of distinct values for each instantiation of the root-key variables. This
attribute is used in the main body of the rewriting, where we check countProjection=1.
In the subqueries, we project not only on the projecting attributes $S_1, \dots, S_l$, but
also on the root-key attributes $K_1, \dots, K_n$. However, in the main query of the rewriting
we project only on the attributes $S_1, \dots, S_l$. In this way, the rewritten query $Q$ and the
input query $q$ return tuples for the same set of attributes.
For the sake of clarity, we omitted the order by clause in the query q. However,
dealing with ordering in the rewriting is quite easy. We just need to add the attributes
of the order by of q to the select clause of the subqueries, and include them in the
order by clause of the main body of the rewriting.
Algorithm RewriteForestSQL(q, Σ)
Input: q, a SQL query in $\mathcal{C}^{sql}_{forest}$ of the form
           select <list of attributes>
           from <list of relations>
           where <list of conditions>
Output: Q, a SQL query that computes $\mathit{consistent}_{\Sigma}(q, I)$, for every database I

Let S_1, ..., S_l be the attributes in the select clause of q
Let G be the join graph (forest) of q
Let r_1, ..., r_m be the relations at the root of all trees of G
Let K_1, ..., K_n be the attributes in the keys of r_1, ..., r_m
Let COND be the list of conditions in the where clause of q
Let JOINS be the expression obtained by calling the procedure
    GetJoinsExpression(G, COND) of Figure 6.6
Let Q be the following SQL query:
    with candidatesSubQuery as (
        select K_1 as cK_1, ..., K_n as cK_n
        from <list of relations in q>
        where COND ),
    countViolSubQuery as (
        select K_1, ..., K_n,
               S_1, ..., S_l,
               rank() over (partition by K_1, ..., K_n
                            order by S_1, ..., S_l) as rankProjection,
               sum(case when COND then 0 else 1 end)
                   over (partition by K_1, ..., K_n) as countViol
        from JOINS
        where exists (select * from candidatesSubQuery
                      where K_1 = cK_1 and ... and K_n = cK_n) ),
    countProjSubQuery as (
        select K_1, ..., K_n,
               S_1, ..., S_l,
               max(rankProjection) over (partition by K_1, ..., K_n)
                   as countProjection,
               countViol
        from countViolSubQuery )
    select distinct S_1, ..., S_l
    from countProjSubQuery
    where countProjection = 1 and countViol = 0
return Q
Figure 6.5: SQL query rewriting algorithm for SPJ queries in $\mathcal{C}^{sql}_{forest}$
6.2.2 Examples
We now present some examples to illustrate the use of the RewriteForestSQL algorithm.
In the examples, we first show the first-order rewriting that we obtain with the algorithms
of Chapter 3, and then we present the actual SQL query produced by ConQuer.
Selection
In the next example, we illustrate ConQuer's SQL rewritings with a simple query that
has one selection condition.
Example 6.1. Let R be a schema with our standard employee(emplKey, salary) relation.
Consider a SQL query $q_1$ that retrieves the names of all employees
whose salary is less than or equal to 1000.
$q_1$: select emplKey
       from employee
       where salary <= 1000
Using the notation for conjunctive queries, $q_1$ can be written as follows:
$q_1(e) = \exists s.\ employee(e, s) \wedge s \le 1000$
A first-order query rewriting that computes the consistent answers to $q_1$ can be
obtained with the algorithms of Chapter 3. In particular, the rewriting returned by
RewriteForest($q_1$, $\Sigma$) is the following:
$Q_1(e) = \exists s.\ employee(e, s) \wedge s \le 1000 \wedge \forall s'.\ (employee(e, s') \rightarrow s' \le 1000)$
Notice that the first and second conjuncts of the first-order rewriting $Q_1$ actually
correspond to the original query $q_1$. Thus, the rewriting starts with a subquery called
candidatesSubQuery that retrieves the employee names that satisfy $q_1$ (and are thus
candidates to be consistent answers).
Algorithm GetJoinsExpression(G, COND)
Input: G, a join graph that forms a forest
       COND, a list of conditions of the form x θ y,
       where θ is some binary comparison operator such as =, ≠, <, etc.
Output: a subexpression of a SQL query
Let r_1, ..., r_m be the relations at the root of all trees of G
Initialize KJOIN as the string "r_1"
for i := 2 to m do
    Let JCOND be the conjunction of all join conditions (i.e., equalities) between attributes
        of r_{i-1} and r_i
    Concatenate "join r_i on JCOND" to KJOIN
end for
Initialize TJOIN as an empty expression
Let T_1, ..., T_m be the trees of G rooted at r_1, ..., r_m
for i := 1 to m do
    Concatenate the expression returned by GetTreeJoinsExpression(T_i, COND) to TJOIN
end for
return KJOIN followed by TJOIN

Algorithm GetTreeJoinsExpression(T, COND)
Input: T, a join graph that forms a tree
       COND, a list of conditions of the form x θ y,
       where θ is some binary comparison operator such as =, ≠, <, etc.
Let r be the relation at the root of T
Initialize LOJOIN as an empty string
if T consists of more than one node then
    Let r_1, ..., r_m be the relations that are children of r in T
    for i := 1 to m do
        Let JCOND be the conjunction of all join conditions (i.e., equalities) between
            attributes of r and r_i
        Concatenate "left outer join r_i on JCOND" to LOJOIN
    end for
    for i := 1 to m do
        Let T_i be the subtree of T rooted at r_i
        Concatenate the expression returned by GetTreeJoinsExpression(T_i, COND) to LOJOIN
    end for
end if
return LOJOIN
Figure 6.6: Procedures to obtain an expression for the joins of a query
with candidatesSubQuery as (
    select emplKey
    from employee
    where salary <= 1000 )
Since emplKey is a key of the relation employee, in the repairs, each employee name
will be associated with exactly one salary. However, in the inconsistent database, an
employee name may appear with several different salaries. Thus, the rewriting must
ensure that the employee names in the consistent answers are associated with salaries
satisfying the selection condition of the input query $q_1$ (i.e., that the salary is less than or
equal to 1000) in every tuple of the inconsistent relation employee where the employee
name appears. This is done in $Q_1$ with the expression $\forall s'.\ (employee(e, s') \rightarrow s' \le 1000)$.
It is straightforward to translate this expression into SQL using nested queries and the
not exists construct. However, from our empirical observations in the context of DB2,
we have noticed that such constructs lead in many cases to inefficient queries. Thus,
for the sake of efficiency, the rewritings produced by ConQuer avoid the not exists
construct. One way of doing this is to count, for each employee, the number of salaries
in the inconsistent database that violate the selection condition of $q_1$. If there are no
violations (i.e., the number of salaries violating the condition for the employee is zero),
then the employee name satisfies the selection condition in every tuple of the inconsistent
relation. This can be achieved with the following subquery.
with countViolSubQuery as (
    select emplKey,
           sum(case
               when salary <= 1000 then 0 else 1 end) as countViol
    from employee
    where exists (select *
                  from candidatesSubQuery C
                  where C.emplKey = employee.emplKey)
    group by emplKey )
In the above subquery, we count the number of violations for each employee. We keep
this count in an attribute called countViol. The final result of the query consists of the
employee names for which there are no violations (countViol = 0). In the subquery,
for each tuple of employee, we compute a case statement. If the salary in the tuple
is less than or equal to 1000 (i.e., it satisfies the selection condition of $q_1$) we output
a value of zero (meaning no violation). Otherwise, we output 1 (meaning a violating
tuple). The query aggregates these values, summing them up for each employee name.
If the sum for an employee name is zero, that means that there are no violating tuples
involving that employee name. Otherwise, we get the number of violating tuples (hence
the name, countViol). In the main body of the query (which we give below), we return
all employee names that are not involved in any violation.
select emplKey
from countViolSubQuery
where countViol = 0
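The pieces of Example 6.1 can be assembled into a single statement and checked against a brute-force enumeration of repairs. The sketch below is only an illustration: it runs on SQLite through Python's sqlite3 module rather than on DB2, and the sample tuples are invented.

```python
import sqlite3
from itertools import product

data = [("alice", 900), ("alice", 1200), ("bob", 800), ("carol", 2000)]

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (emplKey text, salary integer)")
conn.executemany("insert into employee values (?, ?)", data)

rewritten = conn.execute("""
    with candidatesSubQuery as (
        select emplKey from employee where salary <= 1000 ),
    countViolSubQuery as (
        select emplKey,
               sum(case when salary <= 1000 then 0 else 1 end) as countViol
        from employee
        where exists (select * from candidatesSubQuery C
                      where C.emplKey = employee.emplKey)
        group by emplKey )
    select emplKey from countViolSubQuery where countViol = 0
""").fetchall()

# Brute force: a repair keeps exactly one tuple per key value; a consistent
# answer is an employee that satisfies the query in every repair.
groups = {}
for key, salary in data:
    groups.setdefault(key, []).append(salary)
repairs = product(*[[(k, s) for s in salaries] for k, salaries in groups.items()])
consistent = set.intersection(
    *[{k for k, s in repair if s <= 1000} for repair in repairs]
)
```

Here alice is a candidate (one of her tuples has salary 900) but is rejected because her other tuple violates the condition, and carol is never a candidate, so both methods return only bob.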
Join
We now present two examples to illustrate the rewriting of queries that contain join
conditions. In the first example, we show the rewriting for a query that has one join
condition. In the second example, we show the rewriting for a query with a more complex
join graph.
Example 6.2. Let R be a schema with relations employee(emplKey, deptFKey), and
dept(deptKey, mgrName). Consider a SQL query $q_2$ that retrieves the names of all
employees whose department appears in the dept relation:
$q_2$: select emplKey
       from employee, dept
       where employee.deptFKey = dept.deptKey
Notice that $q_2$ has an inner join specified with the condition employee.deptFKey =
dept.deptKey of its where clause. In conjunctive query notation, $q_2$ can be written as
follows.
$q_2(e) = \exists d, m.\ employee(e, d) \wedge dept(d, m)$
It can be easily checked that $q_2$ is in the class $\mathcal{C}_{forest}$ of conjunctive queries. The
first-order query rewriting obtained by applying the algorithm RewriteForest($q_2$, $\Sigma$) is
the following:
$Q_2(e) = \exists d, m.\ employee(e, d) \wedge dept(d, m) \wedge \forall d.\ (employee(e, d) \rightarrow \exists m.\ dept(d, m))$
We could translate $Q_2$ to SQL using a not exists construct to achieve the effect of
the universal quantifier. Although this may be a reasonable strategy for a simple query
like $q_2$, we will show in the next example that it leads to deeply nested rewritings when
the original queries have several joins.
We now illustrate how to avoid the not exists construct in the rewritings. As in
the previous example, we can count, for each employee, the number of tuples violating
the conditions of the input query (in this case, the join condition). In order to detect
violations of the join condition employee.deptFKey = dept.deptKey, we need to check
whether there is a tuple in the employee relation whose department is not in the dept
relation. This can be achieved by performing a left outer join between the relations as
follows:
with candidatesSubQuery as (
    select emplKey
    from employee, dept
    where employee.deptFKey = dept.deptKey ),
countViolSubQuery as (
    select emplKey,
           sum(case
               when employee.deptFKey = dept.deptKey then 0 else 1 end)
               as countViol
    from employee left outer join dept
         on employee.deptFKey = dept.deptKey
    group by emplKey )
select emplKey
from countViolSubQuery
where countViol = 0
Notice that there is a subquery called countViolSubQuery, specified using a with
clause. In this subquery, we count the number of violations for each employee. We keep
this count in an attribute called countViol. The final result of the query consists of the
employee names for which there are no violations (countViol = 0). In the computation
of countViol, we use a case statement. If there is a join with some tuple of the
dept relation, we output a value of zero (meaning no violation). Otherwise, we output 1
(meaning a violating tuple). Notice that we can detect the violations of the (inner) join
of the input query $q_2$ because we are performing a left outer join in the rewritten query
$Q_2$. Had we performed an inner join in $Q_2$, the tuples that do not join on the department
would have never been seen by the case statement.
As in the previous example, the query aggregates the values for countViol, summing
them up for each employee name. If the sum for an employee name is zero, there are no
violating tuples involving that employee name. Otherwise, we get the number of violating
tuples.
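The difference between the inner join of the original query and the left outer join of the rewriting can be observed on a small instance. The following sketch is illustrative only: it uses SQLite via Python's sqlite3 module instead of DB2, keeps just the countViolSubQuery part of the rewriting, and the sample tuples are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employee (emplKey text, deptFKey text);
    create table dept (deptKey text, mgrName text);
    -- 'bob' violates the key of employee: he appears in departments d1 and d9
    insert into employee values ('alice', 'd1'), ('bob', 'd1'),
                                ('bob', 'd9'), ('carol', 'd9');
    insert into dept values ('d1', 'peter');
""")

# Original query: bob is returned because one of his tuples joins with dept.
possible = conn.execute("""
    select distinct emplKey
    from employee, dept
    where employee.deptFKey = dept.deptKey
    order by emplKey
""").fetchall()

# Rewriting: the left outer join keeps the non-joining tuples, padded with
# nulls, so the case expression evaluates to 1 for them.
consistent = conn.execute("""
    with countViolSubQuery as (
        select emplKey,
               sum(case when employee.deptFKey = dept.deptKey
                        then 0 else 1 end) as countViol
        from employee left outer join dept
             on employee.deptFKey = dept.deptKey
        group by emplKey )
    select emplKey from countViolSubQuery where countViol = 0
""").fetchall()
```

In the repair that keeps the tuple (bob, d9), bob's department is missing from dept, so bob is a possible answer but not a consistent one; only alice survives the countViol = 0 filter.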
We just illustrated how we can avoid the use of not exists in the SQL rewritings
by performing a left outer join. In the next example, we show why we adopt this strategy
in ConQuer: a naive translation may lead to a deeply nested query, where the level of
nesting may be as large as the number of relations in the from clause of the query.
Example 6.3. Let R be a schema with relations employee(emplKey, cityFKey, deptFKey),
dept(deptKey, mgrName), city(cityKey, provFKey), and prov(provKey, countryName).
Consider a SQL query $q_3$ that retrieves the names of all employees that are located in
Canada and whose manager is Peter:
$q_3$: select emplKey
       from employee, city, prov, dept
       where employee.cityFKey=city.cityKey
       and city.provFKey=prov.provKey
       and employee.deptFKey=dept.deptKey
       and prov.countryName= "Canada"
       and dept.mgrName="Peter"
In conjunctive query notation, $q_3$ can be written as follows.
$q_3(e) = \exists d, c, m, p.\ employee(e, d, c) \wedge city(c, p) \wedge prov(p, Canada) \wedge dept(d, Peter)$
Figure 6.7: Join graph of query $q_3$.
It can be checked that $q_3$ is in class $\mathcal{C}_{forest}$. In particular, notice that the join graph of $q_3$
(given in Figure 6.7) is a tree. As shown in Chapter 3, a first-order rewriting of $q_3$ can
be obtained by recursively traversing its join graph. The first-order query rewriting $Q_3$
obtained by applying RewriteForest($q_3$, $\Sigma$) is the following:
$Q_3(e) = \exists d, c, m, p.\ employee(e, d, c) \wedge dept(d, m) \wedge city(c, p) \wedge prov(p, Canada) \wedge Q'(e)$
where:
$Q'(e) = \exists d, c.\ employee(e, d, c) \wedge \forall d, c.\ (employee(e, d, c) \rightarrow (Q''(c) \wedge Q^{IV}(d)))$
$Q''(c) = \exists p.\ city(c, p) \wedge \forall p.\ (city(c, p) \rightarrow Q'''(p))$
$Q'''(p) = prov(p, Canada) \wedge \forall w'.\ (prov(p, w') \rightarrow w' = Canada)$
$Q^{IV}(d) = dept(d, Peter) \wedge \forall u'.\ (dept(d, u') \rightarrow u' = Peter)$
The universal quantifiers can be translated to SQL using the not exists construct.
However, this may lead to an inefficient query. First, because it would have four self
joins (since each relation appears twice in the rewriting). Second, because each recursive
invocation of the algorithm produces a new universal quantifier, and a new subquery
within its scope. For example, $Q''$ is under the scope of a universal quantifier for variable
$d$ in $Q'$, and $Q'''$ is under the scope of another universal quantifier (for variable $p$) in $Q''$.
As a consequence, the level of nesting of the SQL rewriting $Q_3$ would be three, which
corresponds to the height of the join graph.
As we showed in the previous example, in ConQuer we avoid using the not exists
construct by performing a left outer join of the relations in each tree of the join graph.
The SQL rewriting produced by ConQuer in this case is the following:
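with candidatesSubQuery as (
    select emplKey
    from employee, city, prov, dept
    where employee.cityFKey=city.cityKey
      and city.provFKey=prov.provKey
      and employee.deptFKey=dept.deptKey
      and prov.countryName="Canada"
      and dept.mgrName="Peter" ),
countViolSubQuery as (
    select emplKey,
           sum(case
               when employee.cityFKey=city.cityKey
                and city.provFKey=prov.provKey
                and employee.deptFKey=dept.deptKey
                and prov.countryName="Canada"
                and dept.mgrName="Peter"
               then 0 else 1 end) as countViol
    from employee left outer join dept on employee.deptFKey=dept.deptKey
         left outer join city on employee.cityFKey=city.cityKey
         left outer join prov on city.provFKey=prov.provKey
    group by emplKey )
select emplKey
from countViolSubQuery
where countViol = 0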
It is important to note that the SQL rewriting has only two subqueries, even though
$q_3$ has four relations, and a join graph with a tree of depth three.
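The behaviour of the left-outer-join rewriting can be tried out on a small instance. The following is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for the DB2 back-end; the function name consistent_q3 and the sample tuples are illustrative assumptions and do not come from the thesis.

```python
import sqlite3

# Left-outer-join rewriting of q_3, with a candidates subquery and a
# violation-counting subquery (single quotes are used for string literals).
REWRITING_Q3 = """
with candidatesSubQuery as (
    select distinct emplKey
    from employee, city, prov, dept
    where employee.cityFKey = city.cityKey
      and city.provFKey = prov.provKey
      and employee.deptFKey = dept.deptKey
      and prov.countryName = 'Canada'
      and dept.mgrName = 'Peter' ),
countViolSubQuery as (
    select emplKey,
           sum(case when employee.cityFKey = city.cityKey
                     and city.provFKey = prov.provKey
                     and employee.deptFKey = dept.deptKey
                     and prov.countryName = 'Canada'
                     and dept.mgrName = 'Peter'
                    then 0 else 1 end) as countViol
    from employee
         left outer join dept on employee.deptFKey = dept.deptKey
         left outer join city on employee.cityFKey = city.cityKey
         left outer join prov on city.provFKey = prov.provKey
    where exists (select * from candidatesSubQuery C
                  where C.emplKey = employee.emplKey)
    group by emplKey )
select emplKey from countViolSubQuery where countViol = 0 order by emplKey
"""

def consistent_q3(rows):
    """Load `rows` (table name -> list of tuples) and run the rewriting."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table employee(emplKey, deptFKey, cityFKey);
        create table city(cityKey, provFKey);
        create table prov(provKey, countryName);
        create table dept(deptKey, mgrName);
    """)
    for table, tuples in rows.items():
        marks = ",".join("?" * len(tuples[0]))
        con.executemany(f"insert into {table} values ({marks})", tuples)
    return [r[0] for r in con.execute(REWRITING_Q3)]

# Hypothetical inconsistent instance: d2 has two managers, c2 has two provinces.
SAMPLE = {
    "employee": [("e1", "d1", "c1"), ("e2", "d2", "c1"), ("e3", "d1", "c2")],
    "city": [("c1", "p1"), ("c2", "p1"), ("c2", "p2")],
    "prov": [("p1", "Canada"), ("p2", "USA")],
    "dept": [("d1", "Peter"), ("d2", "Peter"), ("d2", "Tom")],
}

if __name__ == "__main__":
    print(consistent_q3(SAMPLE))  # ['e1']
```

In this instance only e1 is returned: e2 joins with a dept tuple managed by Tom, and e3 joins with a province outside Canada, so both accumulate a nonzero countViol.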
Projection and the Need for OLAP Functions
In Example 6.1, we dealt with a query that projects on the key attribute of the relation
employee. If a query does not project on the key attribute, then special care must be
taken in the rewriting. We illustrate this with the next example.
Example 6.4. Let R be a schema with our standard employee(emplKey, salary) relation.
Let $q_4$ be a query that retrieves all salaries (regardless of the employee name).
$q_4$: select distinct salary
       from employee
Comparing $q_4$ to $q_1$, the former query does not project on the key attribute emplKey,
and it has no where clause. In conjunctive query notation, $q_4$ can be written as follows.
$$q_4(s) = \exists e.\ employee(e, s)$$
The first-order query rewriting obtained by invoking RewriteForest($q_4$, $\Sigma$) is the
following.
$$Q_4(s) = \exists e.\ employee(e, s) \wedge \forall s'.\ (employee(e, s') \rightarrow s' = s)$$
Again, we would like to avoid the naive (but inefficient) translation of $Q_4$ into SQL
that uses the not exists construct. Intuitively, $Q_4$ returns the salaries s for which there
is at least one employee name that is associated to s and only to s in the tuples of the
inconsistent relation employee. In this way, we ensure that salary s will appear in every
repair. One way of writing $Q_4$ in SQL is the following:
select salary
from employee
where emplKey in (
    select emplKey
    from employee
    group by emplKey
    having count(distinct salary)=1 )
In our empirical observations, the self join of the above query sometimes leads to
inefficient execution. The self join is needed because we are not including the salary
attribute in the select clause of the subquery. This is not an arbitrary decision. Rather,
it is forced by the syntax of SQL. In SQL, all the non-aggregated attributes of the select
clause must appear in the group by clause. If we include salary in the select clause of
the subquery, we must also group by it, and hence we are unable to count the number of
distinct salaries per employee name. We will show shortly how we overcome this problem
in ConQuer's rewritings.
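To make the self-join formulation concrete, here is a small sketch that runs it on an in-memory SQLite database. SQLite, the function name consistent_salaries, and the added distinct and order by (used only to get a deterministic result) are assumptions of this illustration.

```python
import sqlite3

def consistent_salaries(tuples):
    """Return the salaries that are consistent answers to q_4, using the
    group by / having formulation with a self join on employee."""
    con = sqlite3.connect(":memory:")
    con.execute("create table employee(emplKey, salary)")
    con.executemany("insert into employee values (?, ?)", tuples)
    sql = """
        select distinct salary
        from employee
        where emplKey in (
            select emplKey
            from employee
            group by emplKey
            having count(distinct salary) = 1 )
        order by salary
    """
    return [row[0] for row in con.execute(sql)]

if __name__ == "__main__":
    instance = [("John", 1000), ("John", 2000), ("Mary", 1000)]
    print(consistent_salaries(instance))  # [1000]
```

Only Mary has a single salary, so 1000 is returned; 2000 is dropped because John's tuples disagree on the salary.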
We just argued that there are some query rewritings for which there is no obvious way
of avoiding self joins, and that this is caused by the syntax of the group by clause. This
problem was addressed in the OLAP Amendment to the SQL standard [ISO01], which
introduces aggregate functions with a partition by clause. The OLAP Amendment to
the standard has been implemented by all major database vendors. In particular, for
DB2, the standard has been supported since Version 7 (we are using Version 8.2).
The partition by clause is more flexible than group by for two reasons. First, there
can be one partition by clause for each aggregate function, whereas there can only be
one group by for the entire query. Second, unlike group by, the attributes of the select
clause are not required to appear in the partition by clauses of the query. We illustrate
the use of the partition by clause with the next example.
Example 6.5. Consider the following SQL query:
select emplKey,salary,
sum(salary) over (partition by emplKey)
as countProjection
from employee
The query returns triples of values. The first two values of each triple correspond to
employee names and salaries in the relation employee. The last attribute is the sum of
the salaries for the employee name in the tuple. Notice that the attribute emplKey is
in the partition by clause, but the salary attribute is not. So we are projecting on
two attributes (emplKey and salary), but considering only one of them for grouping the
results of the aggregate function. This cannot be done with a group by clause.
Let us finish this example by showing an application of the query to an actual
database. Consider the database I = {employee(John, 1000), employee(John, 2000),
employee(Mary, 1000)}. The result of applying the SQL query above to I is the following:
{(John, 1000, 3000), (John, 2000, 3000), (Mary, 1000, 1000)}.
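The result above can be reproduced with a short script. This is a sketch that assumes an SQLite build with window-function support (version 3.25 or later) in place of DB2; the function name partition_sum and the order by clause are additions for this illustration.

```python
import sqlite3

def partition_sum(tuples):
    """Evaluate the query of Example 6.5: each (emplKey, salary) tuple is
    returned together with the sum of the salaries of its emplKey partition."""
    con = sqlite3.connect(":memory:")
    con.execute("create table employee(emplKey, salary)")
    con.executemany("insert into employee values (?, ?)", tuples)
    sql = """
        select emplKey, salary,
               sum(salary) over (partition by emplKey) as countProjection
        from employee
        order by emplKey, salary
    """
    return con.execute(sql).fetchall()

if __name__ == "__main__":
    instance = [("John", 1000), ("John", 2000), ("Mary", 1000)]
    for row in partition_sum(instance):
        print(row)
    # ('John', 1000, 3000)
    # ('John', 2000, 3000)
    # ('Mary', 1000, 1000)
```

Note that no row is collapsed: unlike group by, the window aggregate keeps one output row per input tuple.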
In the next example, we show how the partition by clause could be used in order
to avoid self joins in the rewritings.
Example 6.4. (continued) Recall that we had obtained a rewriting of query $q_4$ that
performs a self join on the employee relation. We can write an equivalent query without
a self join by taking advantage of the partition by clause.
with countProjSubQuery as (
    select emplKey,
           salary,
           count(distinct salary) over (partition by emplKey) as countProj
    from employee )
select salary
from countProjSubQuery
where countProj = 1
In the subquery countProjSubQuery, we obtain the number of distinct salaries for
each employee name (which we keep in an attribute called countProj). The rewriting then
returns the salaries of employees for which there is exactly one salary in the database
(countProj = 1).
The query rewriting that we just obtained avoids the use of a self join by using the
partition by clause. Unfortunately, though, this is not the end of the story. The
version of DB2 that we use in ConQuer currently supports the partition by clause for
a variety of aggregate functions (such as sum, min, max, count(*), and avg), but it does
not support the count(distinct) function. Nevertheless, the effect of count(distinct)
can be obtained by combining the use of the max aggregation function (with a partition
by clause) and an OLAP function called rank() as follows.
with rankProjSubQuery as (
    select emplKey, salary,
           rank() over (partition by emplKey order by salary)
               as rankProjection
    from employee ),
countProjSubQuery as (
    select emplKey, salary,
           max(rankProjection) over (partition by emplKey)
               as countProjection
    from rankProjSubQuery )
select distinct salary
from countProjSubQuery
where countProjection = 1
First, let us explain the use of the rank() function. The syntax of rank() is the
following:
rank() over
(partition by <partition attributes> order by <order attributes>)
The function creates groups for each tuple of values (instantiation) of the attributes
in the partition by clause, as we discussed before for other functions. The tuples of
each group are ordered according to the attributes in the order by clause, and assigned
a number according to their position in this ordering. If there is a tie (in our example,
two tuples with the same employee name and salary), the tuples are mapped to the same
number.
Let us illustrate the semantics of the rank() function in the context of our example
rewriting. Consider a database I = {employee(John, 1000), employee(John, 2000),
employee(Mary, 1000)}. Then, the function rank() over (partition by emplKey
order by salary) would map (John, 1000) to 1, (John, 2000) to 2, and (Mary, 1000)
to 1.
Now consider the instance I as an inconsistent database with respect to $\Sigma$ (which
contains a constraint stating that emplKey is the key of the employee relation). In
the subquery rankProjSubQuery of the rewritten query, we compute the ranking function
for each tuple and keep the value in an attribute called rankProjection. Then,
in the subquery countProjSubQuery, we obtain the maximum value of the attribute
rankProjection for each employee name, and keep it in an attribute called
countProjection. Notice that the grouping is done by employee names since the attribute
emplKey is in the partition by clause of the max aggregate function. In our example,
we would obtain {(John, 1000, 2), (John, 2000, 2), (Mary, 1000, 1)}. In the final result,
we would like to get salary 1000 because it appears associated with Mary in every
repair, but not 2000 because it does not appear in all repairs. We obtain this in the query
rewriting by checking the condition countProjection=1.
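The rank()/max combination can be checked on the same instance. The sketch below again assumes SQLite 3.25 or later rather than DB2; the function name consistent_salaries_rank and the order by clause are illustrative. Since tied tuples share a rank, max(rankProjection) equals 1 exactly when all tuples of a key agree on the salary, which is the only property the condition countProjection = 1 relies on.

```python
import sqlite3

def consistent_salaries_rank(tuples):
    """Self-join-free rewriting of q_4 using rank() and max() over partitions."""
    con = sqlite3.connect(":memory:")
    con.execute("create table employee(emplKey, salary)")
    con.executemany("insert into employee values (?, ?)", tuples)
    sql = """
        with rankProjSubQuery as (
            select emplKey, salary,
                   rank() over (partition by emplKey order by salary)
                       as rankProjection
            from employee ),
        countProjSubQuery as (
            select emplKey, salary,
                   max(rankProjection) over (partition by emplKey)
                       as countProjection
            from rankProjSubQuery )
        select distinct salary
        from countProjSubQuery
        where countProjection = 1
        order by salary
    """
    return [row[0] for row in con.execute(sql)]

if __name__ == "__main__":
    instance = [("John", 1000), ("John", 2000), ("Mary", 1000)]
    print(consistent_salaries_rank(instance))  # [1000]
```

With duplicated tuples such as (John, 1000), (John, 1000), (John, 2000), rank() yields 1, 1, 3, so the maximum is not the exact number of distinct salaries, but it is still greater than 1 and John is correctly excluded.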
6.3 ConQuer Rewritings for SPJ Queries with Aggregation
In this section, we present the SQL query rewritings produced by ConQuer for queries
with grouping and aggregation. We first present the algorithm and then illustrate it with
some examples.
6.3.1 Rewriting algorithm
We now present the SQL rewriting algorithm for SPJ queries with aggregation that are
equivalent to the aggregate conjunctive queries in class $\mathcal{C}_{aggforest}$, introduced in
Definition 4.1, which we repeat next.
Definition 4.1. Let q be an aggregate conjunctive query. We say that q is in class
$\mathcal{C}_{aggforest}$ if q is of the form
    select z, [count(*) | F(u)]
    from q'(z, u)
    group by z
where $q' \in \mathcal{C}_{forest}$, and F is one of the aggregation functions
min, max, or sum.
We can now give a definition analogous to $\mathcal{C}_{aggforest}$ for SPJ SQL queries with
aggregation.
Definition 6.1. We say that query q is in class $\mathcal{C}^{sql}_{aggforest}$ if q is of the form
    select $S_1, \ldots, S_l$, [count(*)], $F_1(A_1), \ldots, F_u(A_u)$
    group by $S_1, \ldots, S_l$
where $S_1, \ldots, S_l, A_1, \ldots, A_u$ are attributes of the relations in the from clause, and
$F_1, \ldots, F_u$ may be any of the aggregation functions min, max, and sum.
We are now ready to give ConQuer's rewriting for queries in $\mathcal{C}^{sql}_{aggforest}$. The
algorithm is called RewriteAggSQL, and is shown in Figure 6.8. It takes as input a SQL
query q in class $\mathcal{C}^{sql}_{aggforest}$ and a set of key constraints $\Sigma$ (one per relation of
the schema), and returns a SQL rewriting Q of q.
In the rewriting Q, the attributes of the relations in q play different roles. As in
the algorithm RewriteForestSQL for queries without aggregation, we have projecting
and key-root attributes. The former are the attributes that q projects on (i.e., that
appear in its select clause), and the latter are the attributes that appear in the key
of a relation that is at the root of some tree in the join graph of q. In addition, in
RewriteAggSQL, we have aggregation attributes, that is, the attributes that appear as
arguments of some aggregation function of q. In Figure 6.8, we denote the projecting
attributes with the symbols $S_1, \ldots, S_l$; the key-root attributes with $K_1, \ldots, K_n$;
and the aggregation attributes with $A_1, \ldots, A_u$.
We denote the aggregation functions of q with $F_1, \ldots, F_u$. In the figure, we assume
that the 0-ary function count(*) is present in the query (but during the explanation it
will be easy to see what can be dropped if count(*) is not present).
The rewriting Q has five subqueries, specified using a with clause: candidatesSubQuery,
countViolSubQuery, contribAllSubQuery, contribConsistentSubQuery, and
contribNonConsistentSubQuery.
As in the algorithm RewriteForestSQL, the purpose of candidatesSubQuery is to
determine the values for the key-root attributes that should be considered by the other
subqueries. The subquery countViolSubQuery has the same purpose (counting the number
of violations per key-root value) as the subquery of the same name in the rewriting
RewriteForestSQL. One difference is that here we need to compute the attribute
satConds, which keeps track of whether each tuple satisfies the conditions of the query
(denoted as Conds). The other difference is that in the select clause of the subquery,
we must project on the aggregation attributes since their values are needed to perform
aggregation in the rest of the rewriting.
The other three subqueries are used to compute the contributions to the lower and
upper bounds of each aggregate result. The subquery contribAllSubQuery computes,
for each instantiation of the key-root and projecting attributes, the minimum and
maximum value for each aggregation attribute. In particular, in the subquery we group by
$K_1, \ldots, K_n, S_1, \ldots, S_l$ (the key-root and projecting attributes), and for each
aggregation $F_i(A_i)$ in the select clause of q, we compute attributes bottomA_i and
topA_i as min($A_i$) and max($A_i$), respectively. We also compute an attribute
countProjection, to keep track of the projection on nonkey attributes.
The subqueries contribConsistentSubQuery and contribNonConsistentSubQuery
compute the contribution of the "consistent" and "nonconsistent" tuples to the
aggregation. The former are the tuples whose key-root values satisfy the following two
conditions. First, they have the same value for the projecting attributes in every tuple
where they appear (checked with condition countProjection = 1). Second, they are
not involved in a violation of the selection conditions Conds in any of the tuples where
they appear (checked with condition countViol=0). The tuples that violate at least
one of these conditions are considered "nonconsistent" and dealt with in the subquery
contribNonConsistentSubQuery.
For the "consistent" tuples, the contributions computed in contribConsistentSubQuery
correspond to the bottom and top values from contribAllSubQuery. That is,
the attributes bottomA_i and topA_i of contribAllSubQuery appear in the select clause
of contribConsistentSubQuery. The computation of the contributions of the
"nonconsistent" tuples is more involved. In contribNonConsistentSubQuery, the expression
of the select clause that handles the contributions is obtained by calling the procedure
GetBoundsNonConsistent given in Figure 6.9. Notice in the figure that the contributions
are different depending on the aggregation function. The rationale and correctness proof
for these contributions were given in Chapter 4. In the figure, we do not include the 0-ary
operator count(*). For this operator, we need to return the attributes bottomCount and
topCount with values of zero and one, respectively.
In the subqueries, we project not only on the projecting attributes $S_1, \ldots, S_l$ but
also on the key-root attributes $K_1, \ldots, K_n$. However, in the main query of the
rewriting we project and group by only the attributes $S_1, \ldots, S_l$ (i.e., we project out
the key-root attributes). In this way, the rewritten query Q and the input query q return
tuples for the same set of attributes. We also compute the greatest lower bound (glbA_i)
and lowest upper bound (lubA_i) for each tuple of values for the projecting attributes.
This is obtained by performing the corresponding aggregation function (min, max, or sum)
on the bottom and top values computed in the previous subqueries. For the 0-ary function
count(*), the bounds are computed by summing up the values of the attributes
bottomCount and topCount from the previous subqueries. Notice that there is also a
condition having sum(bottomCount) > 0. This is included in order to ensure that the
tuples for the projecting attributes are consistent answers.
For the sake of clarity, we omitted the order by clause in the query q. However,
dealing with ordering in the rewriting is quite easy. We just need to add the attributes
of the order by clause of q to the select clause of the subqueries, and finally add an
order by clause to the main query. The only special case that must be considered
is when an aggregate attribute appears in the order by clause. Since for each aggregate
attribute of q we have two attributes in the rewritten query (one for each bound), we
must (arbitrarily) decide whether the ordering will be by the greatest lower bound or by
the lowest upper bound.
Algorithm RewriteAggSQL(q, $\Sigma$)
Input: q, a SQL query in $\mathcal{C}^{sql}_{aggforest}$ of the form
    select <list of attributes>,<list of aggregation functions>
    group by <list of attributes>
Output: Q, a SQL query that computes aggconsistent

Let F_1(A_1), ..., F_u(A_u) be the aggregation function applications in the select clause
    of the query, where each F_i is an aggregation function, and each A_i is an attribute
    from a relation that appears in the from clause
Let S_1, ..., S_l be the attributes in the select clause of q (by definition of
    $\mathcal{C}^{sql}_{aggforest}$, these are the attributes in the group by clause as well)
Let G be the join graph (forest) of q
Let r_1, ..., r_m be the relations at the root of some tree of G
Let K_1, ..., K_n be the attributes in the keys of r_1, ..., r_m
Let Conds be the list of conditions in the where clause
Let Joins be the expression obtained by calling the procedure
    GetJoinsExpression(G, Conds) of Figure 6.6
Let Q be the following SQL query:
with candidatesSubQuery as (
    select K_1 as cK_1, ..., K_n as cK_n
    from <relations in the from clause of q>
    where Conds ),
countViolSubQuery as (
    select K_1, ..., K_n, S_1, ..., S_l, A_1, ..., A_u,
           rank() over (partition by K_1, ..., K_n
                        order by S_1, ..., S_l) as rankProjection,
           sum(case when Conds then 0 else 1 end)
               over (partition by K_1, ..., K_n) as countViol,
           case when Conds then 'yes' else 'no' end as satConds
    from Joins
    where exists (select * from candidatesSubQuery
                  where K_1 = cK_1 and ... and K_n = cK_n) ),
contribAllSubQuery as (
    select K_1, ..., K_n, S_1, ..., S_l,
           min(A_1) as bottomA_1, max(A_1) as topA_1, ...,
           min(A_u) as bottomA_u, max(A_u) as topA_u,
           max(rankProjection) over (partition by K_1, ..., K_n)
               as countProjection,
           countViol
    from countViolSubQuery
    where satConds='yes'
    group by K_1, ..., K_n, S_1, ..., S_l, countViol, rankProjection ),
contribConsistentSubQuery as (
    select K_1, ..., K_n, S_1, ..., S_l,
           bottomA_1, topA_1, ...,
           bottomA_u, topA_u,
           1 as bottomCount,
           1 as topCount
    from contribAllSubQuery
    where countProjection = 1 and countViol=0 ),
contribNonConsistentSubQuery as (
    select K_1, ..., K_n, S_1, ..., S_l,
           GetBoundsNonConsistent(F_1, A_1), ...,
           GetBoundsNonConsistent(F_u, A_u),
           0 as bottomCount,
           1 as topCount
    from contribAllSubQuery
    where countProjection > 1 or countViol >= 1 )
select S_1, ..., S_l,
       F_1(bottomA_1) as glbA_1, F_1(topA_1) as lubA_1, ...,
       F_u(bottomA_u) as glbA_u, F_u(topA_u) as lubA_u,
       sum(bottomCount) as glbCount, sum(topCount) as lubCount
from
    ( select * from contribConsistentSubQuery
      union all
      select * from contribNonConsistentSubQuery ) q
group by S_1, ..., S_l
having sum(bottomCount)>0
return Q
Figure 6.8: Algorithm RewriteAggSQL for queries in $\mathcal{C}^{sql}_{aggforest}$
Algorithm GetBoundsNonConsistent
Input: F_i, one of the aggregation functions sum, min, max
       A_i, an attribute
if F_i = sum then
    return case when bottomA_i < 0 then bottomA_i else 0 end as bottomA_i,
           case when topA_i > 0 then topA_i else 0 end as topA_i
end if
if F_i = min then
    return bottomA_i, 0 as topA_i
end if
if F_i = max then
    return 0 as bottomA_i, topA_i
end if
Figure 6.9: Algorithm to obtain the bottom and top contributions of "nonconsistent"
tuples
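The procedure of Figure 6.9 is essentially a template that emits a fragment of a select clause. A minimal sketch of such a generator follows; the function name get_bounds_non_consistent and the string-based representation of SQL are assumptions of this illustration and are not taken from the ConQuer implementation.

```python
def get_bounds_non_consistent(func, attr):
    """Return the select-clause fragment with the bottom and top contributions
    of "nonconsistent" tuples, following the case analysis of Figure 6.9.

    func: one of "sum", "min", "max"; attr: name of the aggregation attribute.
    """
    bottom, top = f"bottom{attr}", f"top{attr}"
    if func == "sum":
        # A nonconsistent tuple may be absent from a repair, so it can only
        # lower the sum through negative values and raise it through positive ones.
        return (f"case when {bottom} < 0 then {bottom} else 0 end as {bottom}, "
                f"case when {top} > 0 then {top} else 0 end as {top}")
    if func == "min":
        return f"{bottom}, 0 as {top}"
    if func == "max":
        return f"0 as {bottom}, {top}"
    raise ValueError(f"unsupported aggregation function: {func}")

if __name__ == "__main__":
    print(get_bounds_non_consistent("max", "Salary"))
    # 0 as bottomSalary, topSalary
```

The fragment would be spliced into the select clause of contribNonConsistentSubQuery, once per aggregation function application of the input query.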
6.3.2 Examples
We next illustrate the rewriting for a query that uses the count aggregation function.
Example 6.6. Let R be a schema with relation employee(emplKey, salary, age).
Consider a SQL query $q_5$ that, for each age in the database, gives the number of
occurrences of the age on tuples for employees whose salary is less than or equal to 1000.
$q_5$: select age, count(*)
       from employee
       where salary <= 1000
       group by age
In the aggregate conjunctive query notation introduced in Chapter 4, $q_5$ can be written
as follows.
$q_5$(a, cnt) = select a, count(*)
                from $employee(e, s, a) \wedge s \leq 1000$
                group by a
The above query is in the class $\mathcal{C}_{aggforest}$ for which we gave a query rewriting
algorithm in Chapter 4. A key idea of that algorithm is to first produce a first-order
rewriting for a conjunctive query, and then perform aggregation on the result of the
first-order query. For our example, this conjunctive query is
$q'(e, a) = \exists s.\ employee(e, s, a) \wedge s \leq 1000$. Let us call QConsistent(e, a) the
result of invoking RewriteForest($q'$, $\Sigma$) (the algorithm introduced in Chapter 3).
Let $Q_5$ be the query rewriting for $q_5$ obtained by invoking RewriteCount($q_5$, $\Sigma$)
(the algorithm of Figure 4.1 of Chapter 4). In that rewriting, the greatest lower bound is
obtained as follows:
QGlb(a, glb) = select a, count(*)
               from QConsistent(e, a)
               group by a
Notice that aggregation is performed on the result of the first-order query
QConsistent(e, a). Thus, for computing the greatest lower bound in the SQL rewriting, we
can reuse the algorithm RewriteForestSQL introduced in Section 6.2. In particular, we will
use the next subqueries, which are similar to those that would be produced by
RewriteForestSQL($q'$, $\Sigma$) (we will show the differences next).
with candidatesSubQuery as (
    select emplKey
    from employee
    where salary <= 1000 ),
countViolSubQuery as (
    select emplKey, age,
           rank() over (partition by emplKey
                        order by age) as rankProjection,
           sum(case when salary <= 1000 then 0 else 1 end)
               over (partition by emplKey) as countViol,
           case when salary <= 1000 then 'yes' else 'no' end
               as satConds
    from employee
    where exists (select * from candidatesSubQuery C
                  where C.emplKey=employee.emplKey) ),
contribAllSubQuery as (
    select emplKey, age,
           max(rankProjection) over (partition by emplKey)
               as countProjection,
           countViol
    from countViolSubQuery
    where satConds='yes'
    group by emplKey, age, countViol, rankProjection )
The above subqueries differ from the ones that would be produced by
RewriteForestSQL in the following aspects. In countViolSubQuery, we compute an attribute
satConds that keeps track of whether each tuple satisfies or violates the selection
condition of $q_5$ (i.e., that the salary is less than or equal to 1000). This is different from
the attribute countViol because countViol counts the violations for all tuples where a
key value (employee name, in this case) appears, whereas satConds may take different
values on different tuples of the same employee, depending on the salary that appears in
the tuple. The third subquery corresponds to the subquery countProjSubQuery of the
algorithm RewriteForestSQL, but it has a different name here (contribAllSubQuery)
because, as we will show shortly, it is used to compute the contribution of each tuple
to the lower and upper bounds of count(*). In this subquery, we check the condition
satConds='yes'. The intuitive reason is that the tuples that do not satisfy the
conditions of $q_5$ (and hence have satConds = 'no') contribute neither to the lower nor
to the upper bound of count(*), and should thus be filtered out.
Let us now consider the computation of the lowest upper bound. In the query $Q_5$
returned by RewriteCount, this bound is obtained as follows:
QLub(a, lub) = select a, count(*)
               from $q'(e, a) \wedge (\exists e.\ QConsistent(e, a))$
               group by a
In this case, aggregation is done on the result of the following first-order expression:
$q'(e, a) \wedge (\exists e.\ QConsistent(e, a))$. The naive way of writing this expression in SQL
may be inefficient because QConsistent already contains $q'$ as a subexpression. A more
efficient way of writing $Q_5$ in SQL involves computing the contributions of each tuple to
the value of count(*), with the two subqueries shown next.
One of the subqueries (called contribConsistentSubQuery) computes the contribution
of the "consistent" tuples. These are the tuples for employees that (1) have the
same age (the attribute in the select clause of $q_5$) in every tuple where they appear;
and (2) are not involved in a violation of the conditions of $q_5$ in any of the tuples where
they appear (i.e., their salary is always less than or equal to 1000). This can be checked
with the condition countProjection = 1 and countViol=0. In addition, the subquery
has attributes bottomCount and topCount that are used in the main body of the query
to combine the contributions of the "consistent" and "nonconsistent" tuples. For the
"consistent" tuples, the contribution is one to both the lower and upper bounds.
with contribConsistentSubQuery as (
    select emplKey, age,
           1 as bottomCount,
           1 as topCount
    from contribAllSubQuery
    where countProjection = 1 and countViol=0 )
The other subquery (called contribNonConsistentSubQuery) computes the contributions
of the "nonconsistent" tuples. We give this name to the tuples that are not
in the consistent answer of $q'$, but do satisfy $q'$. These tuples do not contribute to
the greatest lower bound of count(*), but they may contribute to the lowest upper
bound. In the SQL rewriting, the "nonconsistent" tuples are captured with the condition
countProjection > 1 or countViol >= 1. As before, the attributes bottomCount and
topCount carry the contributions to the main body of the query. For the "nonconsistent"
tuples, the contribution is zero to the lower bound and one to the upper bound (compare
this to the "consistent" tuples, which contribute one to both bounds).
with contribNonConsistentSubQuery as (
    select emplKey, age,
           0 as bottomCount,
           1 as topCount
    from contribAllSubQuery
    where countProjection > 1 or countViol >= 1 )
Finally, the main body of the rewriting sums up the contributions of each tuple to the
lower and upper bounds, and projects out the attribute emplKey. The condition having
sum(bottomCount)>0 is used to ensure that we return ages that are consistent answers.
As we mentioned before, this corresponds to checking the condition
$\exists e.\ QConsistent(e, a)$.
select age,
       sum(bottomCount) as glb,
       sum(topCount) as lub
from
    ( select * from contribConsistentSubQuery
      union all
      select * from contribNonConsistentSubQuery ) q
group by age
having sum(bottomCount)>0
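The five subqueries and the main body can be assembled into a single statement and run on a toy instance. The sketch below assumes SQLite 3.25 or later as a stand-in for DB2; the sample tuples, the function name count_ranges, and the final order by are illustrative assumptions.

```python
import sqlite3

REWRITING_Q5 = """
with candidatesSubQuery as (
    select distinct emplKey from employee where salary <= 1000 ),
countViolSubQuery as (
    select emplKey, age,
           rank() over (partition by emplKey order by age) as rankProjection,
           sum(case when salary <= 1000 then 0 else 1 end)
               over (partition by emplKey) as countViol,
           case when salary <= 1000 then 'yes' else 'no' end as satConds
    from employee
    where exists (select * from candidatesSubQuery C
                  where C.emplKey = employee.emplKey) ),
contribAllSubQuery as (
    select emplKey, age,
           max(rankProjection) over (partition by emplKey) as countProjection,
           countViol
    from countViolSubQuery
    where satConds = 'yes'
    group by emplKey, age, countViol, rankProjection ),
contribConsistentSubQuery as (
    select emplKey, age, 1 as bottomCount, 1 as topCount
    from contribAllSubQuery
    where countProjection = 1 and countViol = 0 ),
contribNonConsistentSubQuery as (
    select emplKey, age, 0 as bottomCount, 1 as topCount
    from contribAllSubQuery
    where countProjection > 1 or countViol >= 1 )
select age, sum(bottomCount) as glb, sum(topCount) as lub
from ( select * from contribConsistentSubQuery
       union all
       select * from contribNonConsistentSubQuery ) q
group by age
having sum(bottomCount) > 0
order by age
"""

def count_ranges(tuples):
    """Return (age, glb, lub) triples for q_5 over employee(emplKey, salary, age)."""
    con = sqlite3.connect(":memory:")
    con.execute("create table employee(emplKey, salary, age)")
    con.executemany("insert into employee values (?, ?, ?)", tuples)
    return con.execute(REWRITING_Q5).fetchall()

# Hypothetical instance: John violates the salary condition in one tuple,
# Ann has two different ages, Bob never satisfies the condition.
SAMPLE = [("John", 900, 30), ("John", 1500, 30), ("Mary", 800, 30),
          ("Ann", 700, 40), ("Ann", 600, 45), ("Bob", 2000, 50)]

if __name__ == "__main__":
    print(count_ranges(SAMPLE))  # [(30, 1, 2)]
```

Age 30 is reported with the range [1, 2]: Mary counts in every repair, while John counts only in the repairs that keep his 900 tuple. Ages 40 and 45 receive contributions only from Ann's "nonconsistent" tuples, so the having clause removes them.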
In the next example, we illustrate the rewriting for a query that has the sum
aggregation function. The rewritings for the min and max aggregation functions are similar.
Example 6.7. Consider the same schema as in the previous example. Let $q_6$ be a SQL
query that, for each age in the database, gives the sum of all salaries in the database
that are less than or equal to 1000.
$q_6$: select age, sum(salary)
       from employee
       where salary <= 1000
       group by age
The SQL rewriting of $q_6$ is computed by ConQuer along the same lines as the rewriting
for query $q_5$ of the previous example. As in that example, the rewriting starts with three
subqueries: candidatesSubQuery, countViolSubQuery and contribAllSubQuery. The
subquery countViolSubQuery counts the number of violations of the selection condition
for each key value (emplKey), and is the same as in the previous example, except that it
includes the attribute salary in its select clause. The subquery contribAllSubQuery
computes the contribution of all key values to the final result. The only difference with
the previous example is that here we compute the minimum and maximum salary for
each employee (attributes bottomSalary and topSalary). This was not necessary in
the previous example since count(*) is a 0-ary function, whereas sum is a unary function
(in this case, taking the argument salary).
with candidatesSubQuery as (
    select emplKey
    from employee
    where salary <= 1000 ),
countViolSubQuery as (
    select emplKey, age, salary,
           rank() over (partition by emplKey
                        order by age) as rankProjection,
           sum(case when salary <= 1000 then 0 else 1 end)
               over (partition by emplKey) as countViol,
           case when salary <= 1000 then 'yes' else 'no' end
               as satConds
    from employee
    where exists (select * from candidatesSubQuery C
                  where C.emplKey=employee.emplKey) ),
contribAllSubQuery as (
    select emplKey, age,
           min(salary) as bottomSalary,
           max(salary) as topSalary,
           max(rankProjection) over (partition by emplKey)
               as countProjection,
           countViol
    from countViolSubQuery
    where satConds='yes'
    group by emplKey, age, countViol, rankProjection )
Then, as in the previous example, the rewriting computes the contributions from the
"consistent" and "nonconsistent" tuples. For clarity of presentation, we will assume that
all salaries are positive values (but in the general algorithm we deal with the case of
negative values as well). For the "consistent" tuples (whose contributions are computed
in contribConsistentSubQuery), the bottom and top salaries computed in
contribAllSubQuery contribute to the greatest lower bounds and lowest upper bounds,
respectively. The top salary also contributes to the lowest upper bound of the
"nonconsistent" tuples (whose contributions are computed in
contribNonConsistentSubQuery). However, as we explained in Chapter 4, the bottom salary
does not contribute to the greatest lower bound. Therefore, the attribute bottomSalary of
contribNonConsistentSubQuery gets a value of zero.
with contribConsistentSubQuery as (
    select emplKey, age,
           bottomSalary,
           topSalary,
           1 as bottomCount
    from contribAllSubQuery
    where countProjection = 1 and countViol=0 )
with contribNonConsistentSubQuery as (
    select emplKey, age,
           0 as bottomSalary,
           topSalary,
           0 as bottomCount
    from contribAllSubQuery
    where countProjection > 1 or countViol >= 1 )
Finally, the main body of the rewriting sums up the contributions of each tuple
to the lower and upper bounds, and projects out the emplKey attribute. Notice that,
as in the rewriting for query $q_5$ of the previous example, we have a condition having
sum(bottomCount)>0. This is done because, again, we want to report only the ages that
appear for sure in every repair.
select age,
       sum(bottomSalary) as glbSalary,
       sum(topSalary) as lubSalary
from
    ( select * from contribConsistentSubQuery
      union all
      select * from contribNonConsistentSubQuery ) q
group by age
having sum(bottomCount)>0
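The bounds produced by such a rewriting can be cross-checked on small instances against the definition itself, by enumerating every repair and evaluating $q_6$ on each one. The following brute-force sketch does that; the function name sum_range_by_age and the sample tuples are illustrative assumptions, and the enumeration is exponential in the number of conflicting keys, so it only serves as a reference for toy data.

```python
from itertools import product

def sum_range_by_age(tuples, max_salary=1000):
    """Enumerate all repairs of employee(emplKey, salary, age) under the key
    emplKey and return {age: (glb, lub)} for sum(salary) with salary <= max_salary,
    restricted to the ages that appear in the answer on every repair."""
    by_key = {}
    for key, salary, age in tuples:
        by_key.setdefault(key, []).append((salary, age))
    per_repair = []
    # A repair keeps exactly one tuple per key value.
    for choice in product(*by_key.values()):
        sums = {}
        for salary, age in choice:
            if salary <= max_salary:
                sums[age] = sums.get(age, 0) + salary
        per_repair.append(sums)
    # Keep only the groups (ages) present in the result on all repairs.
    ages = set(per_repair[0])
    for sums in per_repair[1:]:
        ages &= set(sums)
    return {age: (min(s[age] for s in per_repair),
                  max(s[age] for s in per_repair))
            for age in sorted(ages)}

SAMPLE = [("John", 900, 30), ("John", 1500, 30), ("Mary", 800, 30),
          ("Ann", 700, 40), ("Ann", 600, 45), ("Bob", 2000, 50)]

if __name__ == "__main__":
    print(sum_range_by_age(SAMPLE))  # {30: (800, 1700)}
```

On this instance, Mary contributes 800 to both bounds, while John's satisfying tuple contributes 0 to the lower bound and 900 to the upper bound, which agrees with the bottomSalary and topSalary contributions described above.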
6.4 Exploiting Precomputed Annotations
The main focus of the thesis is on query processing directly on the inconsistent database.
However, in some circumstances, it may be advantageous to process the database offline
in order to materialize data structures with information about constraint violations. This
precomputed data could then be exploited during online query answering to improve the
performance of the queries.
In this section, we will present a simple offline precomputation scheme, and show the
rewritings that ConQuer produces in order to exploit it. The scheme is based on
annotations attached to each tuple. The annotation consists of just one bit that states
whether the tuple satisfies or violates a given key constraint. If annotations are present,
then ConQuer can produce a rewriting that exploits them. We call such a rewriting
annotation-aware. In the next example, we illustrate the annotation-aware rewritings. In
the next section, we will identify the scenarios where it is desirable to exploit the
annotations, and we will empirically validate the effectiveness of the annotation-aware
rewritings.
Example 6.8. Let R be a schema with relations employee(emplKey, deptFKey) and
dept(deptKey, mgrName). We will give an example based on an SPJ query without
aggregation. However, the example shows all the ingredients of the rewritings on annotated
databases, and extending it to the case of queries with aggregation is straightforward.
Consider a SQL query $q_7$ that retrieves the names of all employees whose department
manager is Peter:
$q_7$: select emplKey
       from employee, dept
       where employee.deptFKey=dept.deptKey and dept.mgrName='Peter'
Consider the database I = {employee(John, Sales), employee(Mary, Engineering),
dept(Sales, Peter), dept(Sales, Tom), dept(Engineering, Peter)}. Suppose that we
instruct ConQuer to process the database offline and annotate each tuple with a bit stating
whether it satisfies or violates the constraints of $\Sigma$. Assume that ConQuer augments the
set of attributes of each relation with an attribute called cons that stores the annotation.
The annotated database produced by ConQuer would then be the following.

employee
emplKey   deptFKey      cons
John      Sales         y
Mary      Engineering   y

dept
deptKey       mgrName   cons
Sales         Peter     n
Sales         Tom       n
Engineering   Peter     y
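The thesis does not spell out how the cons bit is computed at this point. One simple possibility, sketched here purely as an assumption, is to mark a tuple with 'y' when its key value occurs in exactly one tuple of its relation and with 'n' otherwise. The function name annotate and the use of SQLite are hypothetical choices of this illustration, and relations are assumed to contain no duplicate tuples.

```python
import sqlite3

def annotate(con, table, key):
    """Add a cons column to `table` and set it to 'y' for tuples whose key
    value is unique in the relation, and to 'n' for all other tuples."""
    con.execute(f"alter table {table} add column cons")
    con.execute(f"""
        update {table}
        set cons = case
            when (select count(*) from {table} t2
                  where t2.{key} = {table}.{key}) = 1
            then 'y' else 'n' end
    """)

def annotated_example():
    """Build the instance I of Example 6.8 and return its annotated relations."""
    con = sqlite3.connect(":memory:")
    con.execute("create table employee(emplKey, deptFKey)")
    con.execute("create table dept(deptKey, mgrName)")
    con.executemany("insert into employee values (?, ?)",
                    [("John", "Sales"), ("Mary", "Engineering")])
    con.executemany("insert into dept values (?, ?)",
                    [("Sales", "Peter"), ("Sales", "Tom"), ("Engineering", "Peter")])
    annotate(con, "employee", "emplKey")
    annotate(con, "dept", "deptKey")
    employee = con.execute("select * from employee order by emplKey").fetchall()
    dept = con.execute("select * from dept order by deptKey, mgrName").fetchall()
    return employee, dept

if __name__ == "__main__":
    for relation in annotated_example():
        print(relation)
```

Running the script yields the same annotations as the table above: both employee tuples and the Engineering tuple are marked 'y', and the two Sales tuples are marked 'n'.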
Note that the tuple for Mary in relation employee, and the tuple for Engineering in
relation dept have a value of 'y' in their cons attributes, meaning that they do not
violate any constraint. If we join these tuples, we get a tuple that satisfies query $q_7$.
Furthermore, it is easy to see that this will be the only tuple in the result for Mary.
Thus, it must be a consistent answer.
In general, the join of "consistent" tuples (i.e., tuples where cons = 'y') produces
a consistent answer. For such tuples, it suffices to check whether the conditions of the
original query are satisfied (in this example, check that they satisfy $q_7$). In this way, we
can avoid the possibly costly operations of the rewritings produced by the algorithms
RewriteForestSQL and RewriteAggSQL. In the rewriting, we capture these tuples in a
subquery called allConsistentSubQuery ("allConsistent" because they come from the
join of tuples all of which are consistent). The subquery consists of the input query and a
filter that requires every tuple in the join to have a value of 'y' in the cons attribute.
with allConsistentSubQuery as (
    select distinct emplKey
    from employee, dept
    where employee.deptFKey=dept.deptKey and dept.mgrName='Peter'
      and employee.cons='y' and dept.cons='y' )
Now, note that the tuple for John also satisfies the constraints and has a value of
'y' in its cons attribute. However, this tuple joins with the tuples for the Sales
department, which violate the key constraint of their relation (they are annotated with
'n'). If we join the tuple for John with the tuple dept(Sales, Peter), the result satisfies
$q_7$. But if we join with dept(Sales, Tom), the result does not satisfy the query. Thus,
John is not a consistent answer to $q_7$.
To keep track of the join of tuples that may violate a constraint, we produce a rewriting
that is similar to the one that would be produced by RewriteForestSQL, the only
difference being that we augment the candidatesSubQuery subquery of the rewriting
with a condition checking whether the cons attribute of at least one of the joined tuples
is set to 'n'. In our example, we check the condition employee.cons='n' or
dept.cons='n'. The result obtained from these tuples is kept in a subquery called
someNonConsistentSubQuery (the name comes from the fact that some of the tuples of
the join may not be consistent).
with candidatesSubQuery as (
select distinct emplKey
from employee,dept
where employee.deptFKey=dept.deptKey and dept.mgrName='Peter'
and (employee.cons='n' or dept.cons='n') )
with countViolSubQuery as (
select emplKey,
sum(case
when employee.deptFKey=dept.deptKey
and dept.mgrName='Peter' then 0 else 1 end)
as countViol
from employee left outer join dept
on employee.deptFKey=dept.deptKey
where exists (select * from candidatesSubQuery C where C.emplKey=employee.emplKey)
group by emplKey )
with someNonConsistentSubQuery as (
select emplKey
from countViolSubQuery
where countViol = 0)
Finally, the main body of the query takes the union of the tuples obtained with the
subqueries allConsistentSubQuery and someNonConsistentSubQuery.
select emplKey from
(select emplKey from allConsistentSubQuery)
union all
(select emplKey from someNonConsistentSubQuery)
Notice that this rewriting is correct even if annotations incorrectly mark a consistent
tuple as inconsistent. Hence, when deleting or updating a tuple, it is not mandatory to
update annotations.
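The annotation-aware rewriting of this example can be tried end to end with the following minimal sketch. It uses Python's sqlite3 module rather than DB2, combines the subqueries into a single WITH clause, and assumes that q_7 selects the employees whose department is managed by Peter. The subquery name countViolSubQuery, the aliases e, d and c, and the helper functions are assumptions of the sketch, not names fixed by the text above.

```python
import sqlite3

# Annotation-aware rewriting of q_7 for the running example (a sketch).
# Assumed names: countViolSubQuery and the aliases e, d, c.
REWRITING = """
with allConsistentSubQuery as (
    select distinct e.emplKey as emplKey
    from employee e, dept d
    where e.deptFKey = d.deptKey and d.mgrName = 'Peter'
      and e.cons = 'y' and d.cons = 'y'),
candidatesSubQuery as (
    select distinct e.emplKey as emplKey
    from employee e, dept d
    where e.deptFKey = d.deptKey and d.mgrName = 'Peter'
      and (e.cons = 'n' or d.cons = 'n')),
countViolSubQuery as (
    select e.emplKey as emplKey,
           sum(case when d.deptKey is not null and d.mgrName = 'Peter'
                    then 0 else 1 end) as countViol
    from employee e left outer join dept d on e.deptFKey = d.deptKey
    where exists (select * from candidatesSubQuery c
                  where c.emplKey = e.emplKey)
    group by e.emplKey),
someNonConsistentSubQuery as (
    select emplKey from countViolSubQuery where countViol = 0)
select emplKey from allConsistentSubQuery
union all
select emplKey from someNonConsistentSubQuery
"""


def build_example():
    """Load the employee and dept instances of the running example."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table employee (emplKey text, deptFKey text, cons text);
        create table dept (deptKey text, mgrName text, cons text);
        insert into employee values ('John', 'Sales', 'y');
        insert into employee values ('Mary', 'Engineering', 'y');
        insert into dept values ('Sales', 'Peter', 'n');
        insert into dept values ('Sales', 'Tom', 'n');
        insert into dept values ('Engineering', 'Peter', 'y');
    """)
    return conn


def consistent_answers(conn):
    """Run the rewriting and return the answers in sorted order."""
    return sorted(row[0] for row in conn.execute(REWRITING))
```

On the instance above, Mary is returned through allConsistentSubQuery, while John is a candidate that is filtered out because one of the two Sales tuples has a manager other than Peter.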
6.5 Related Work
In this section, we review systems for managing inconsistent databases that are related
to ConQuer. Hippo [CMS04b, CMS04a] is a system that produces consistent answers
for unions of quantifier-free conjunctive queries (that is, unions of queries in the class
presented by Arenas, Bertossi, and Chomicki [ABC99]). Hippo does not consider queries
with aggregation, grouping or bag semantics. Apart from the class of queries that it can
handle, Hippo differs from ConQuer in that it is not based on query rewriting.
Rather, Hippo takes the more procedural approach of producing a Java program which
computes the consistent answers. Although the program does interact with an RDBMS
back-end, most of the processing is done on an (in-memory) conflict graph
data structure that contains all the tuples that violate the constraints. The system may
not be able to operate on databases where this data structure does not fit in memory.
Hippo has been shown to scale to databases of up to 300,000 tuples [CMS04b].
There are a number of systems for consistent query answering that rewrite queries into
powerful logics [CB00, LLR02, EFGL03, CB05]. Infomix [EFGL03] is a notable example
of such an approach. In Infomix, queries are rewritten into disjunctive logic programs.
Such programs are computationally more expensive than SQL, but also more expressive
and permit rewritings over a very rich class of query constraints. For example, Infomix
considers general functional, inclusion, and exclusion query constraints. These systems
focus on expressiveness more than on efficiency and scalability, and therefore address a
different design point than the one we are considering. To give an idea of the scale of
the difference, one of the few experimental studies available in the literature [EFGL03]
reports results for databases with at most 100 tuples violating primary key constraints
(over a database of 50,000 tuples). In contrast, the largest database that we used in the
experiments reported in the next chapter has 8.6 million inconsistent tuples (over a total
of 172 million tuples).
Chapter 7
Experimental Analysis
In this chapter, we validate the efficiency of ConQuer's rewritings using IBM DB2 UDB
Version 8.2 (from now on, referred to as just DB2). In Section 7.1, we give a detailed
description of the experimental framework. Then, in Section 7.2, we report and analyze
the experimental results obtained within this framework.
7.1 Experimental Framework
7.1.1 System and Database Manager Configuration
The experiments were performed on a Sun v40z server class computer with 4 processors
and 8 GB of RAM, running RedHat Linux AS 4 kernel Version 2.6.9. The relational
database management system used to run the queries was IBM DB2 UDB Version 8.2.
We now describe some important parameters in the database configuration. The
buffer pool size was deliberately kept considerably below the system's available memory.
This is because our aim is to test the overhead of the queries in environments where the
amount of primary memory is small compared to database size. In particular, the buffer
pool size was restricted to 400 MB (whereas the size of the largest database reported
here is 20 GB).
In order to reduce the number of variables to consider when comparing running
times, the query optimizer was set to use a degree of intra-parallelism (parameter
DFT_DEGREE) of 1, meaning that every query plan uses a single processor, even
though there are four available in the system. The query optimization level, which
dictates the amount of time that the query optimizer may spend to produce a query plan,
was set to its highest value (parameter DFT_QUERYOPT was set to 9) since the time to
produce a plan is always negligible with respect to the time to execute the fairly complex
queries that we use in our experiments.
For all databases, statistics were created by running the DB2 RUNSTATS command.
The parameters for statistics gathering were set as follows: the number of most frequent
values to be collected from each table (parameter NUM_FREQVALUES) was set to 10;
and the number of quantiles for the distributions (parameter NUM_QUANTILES) was
set to 20.
We created clustered indices for the (potentially violated) primary key attributes.
Notice that these indices cannot be declared as unique since the database may be
inconsistent. With respect to the annotations introduced in Section 6.4, we added an
attribute called cons to each table, and used it to keep track of whether each tuple satisfies
or violates the primary key constraints. For each relation, we declared a secondary index
on the attributes of the key plus the cons attribute. The values for the cons attributes
are computed offline. However, it is important to point out that in the experimental
results that we report here, this attribute is used only where we explicitly say that the
rewritings are annotation-aware. By default, we assume that the rewritings work on the
inconsistent database without exploiting precomputed information.
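One way the offline computation of the cons attribute could be done is sketched below, again using Python's sqlite3 module as a stand-in for DB2. The function name annotate and its parameters are hypothetical; the sketch simply marks a tuple with 'n' when some other tuple of the same table shares its key value, and with 'y' otherwise.

```python
import sqlite3


def annotate(conn, table, key_cols):
    """Set cons to 'n' for tuples whose key value occurs more than once
    in the table, and to 'y' for all other tuples.

    table and key_cols are trusted identifiers in this sketch; they are
    interpolated into the SQL text and must not come from user input.
    """
    match = " and ".join(f"t2.{col} = {table}.{col}" for col in key_cols)
    conn.execute(
        f"update {table} set cons = case when "
        f"(select count(*) from {table} as t2 where {match}) > 1 "
        f"then 'n' else 'y' end"
    )
```

For instance, annotate(conn, "dept", ["deptKey"]) would mark both Sales tuples of the example in Section 6.4 with 'n' and the Engineering tuple with 'y'.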
Regarding the indices of the database, we considered a worst-case and a typical
scenario. In the worst-case scenario, the only indices in the database are those for the key
attributes and the annotations. We also considered a more typical scenario, where several
indices are declared. In particular, we created all indices suggested by DB2's
Configuration Advisor. In each database, the size of the indices proposed by the Configuration
Advisor corresponds to a third of the size of the database. The indices are shown in
Appendix B.
7.1.2 Inconsistent Database Instances
For the inconsistent databases, we employed the schema and data of TPC-H, the standard
benchmark for decision support systems. The schema is shown in Figure 7.1. The sizes
of the tables are also shown in Figure 7.1 (under their names), and are given in number
of tuples for a 1 GB instance. For example, the relation lineitem has 6 million tuples on
a 1 GB instance. As per the TPC-H standard, all tables except nation and region are
scaled proportionally to the size of the database (this is indicated with SF in the figure).
Figure 7.1: Schema specified in the TPC-H standard (taken from [TPC03])
The parameters used to build the databases are the following:
- The size s of the database. We considered databases of various sizes, up to 20
  GB (172 million tuples). Notice that this size is 50 times larger than the size of the
  buffer pool of the database (whose size is 400 MB).
- The percentage p of the database that is inconsistent. For example, on a 1 GB
  instance (8.6 million tuples) where p is 25%, there are 2.15 million tuples that
  violate the key constraints of the schema. We created the databases in such a way
  that every relation has the same value of p as the entire database. We experimented
  with values of p ranging from 0% (totally consistent database) to 25%.
- The number of tuples n that share a common key value (and hence violate a key
  constraint), for every key value in the inconsistent portion of the database. For
  example, if n = 2, then every key value in the inconsistent portion of the database
  appears in exactly two tuples. The value is fixed for every tuple of the inconsistent
  portion (i.e., every key value of the database appears either exactly once or exactly
  n times). We experimented with values of n ranging from 2 to 7.
The TPC Consortium provides a data generator called dbgen that produces database
instances compliant with the standard (see footnote 1). Since the TPC-H standard does
not consider inconsistent databases, dbgen creates instances that do not violate the
primary key constraints of the schema. For this reason, we modified the source code of
dbgen in order to produce a generator that creates inconsistent databases. The database
generator creates each table as follows. Let $l$ be the total number of tuples to be
generated in the table. First, we generate $l \cdot (1 - \frac{p}{100} + \frac{p}{100 n})$
tuples. Second, we randomly select $\frac{l \cdot p}{100 n}$ tuples from them. Third, for
each selected tuple $t$, we generate $n - 1$ additional tuples by invoking the tuple
generation functions of dbgen. We replace the key values of the $n - 1$ generated tuples
with the key value of $t$.
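The three generation steps can be illustrated with a small sketch that produces only key values (the real generator is a modified dbgen, written in C, and also fills in the remaining attributes). The function name inconsistent_keys and the use of consecutive integers as key values are assumptions made for illustration; the sketch also assumes that l, p and n are chosen so that the tuple counts are whole numbers.

```python
import random


def inconsistent_keys(l, p, n, seed=0):
    """Return l key values such that p percent of the tuples violate the
    key constraint, with exactly n tuples per violated key value."""
    rng = random.Random(seed)
    # Step 1: generate l * (1 - p/100 + p/(100 n)) tuples with distinct keys.
    base = round(l * (1 - p / 100 + p / (100 * n)))
    keys = list(range(base))
    # Step 2: randomly select l * p / (100 n) of those tuples.
    selected = rng.sample(keys, round(l * p / (100 * n)))
    # Step 3: for each selected tuple, add n - 1 tuples that reuse its key.
    for key in selected:
        keys.extend([key] * (n - 1))
    return keys
```

With l = 1000, p = 25 and n = 5, the sketch generates 800 tuples in the first step, selects 50 of them, and adds 4 duplicates for each, which gives 1000 tuples in total, 250 of which (25%) share their key value with other tuples.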
7.1.3 Workload
The experiments were performed using queries specified in the TPC-H standard. There
are twenty-two queries in the standard, twelve of which are aggregate conjunctive queries,
the type of queries that we handle in this work. The other ten queries have features
that are beyond aggregate conjunctive queries, such as aggregation in nested subqueries
(Queries 2, 11, 15, 17, 18 and 20 of the specification), left outer joins (Query 13), and
negation (Queries 16, 21, and 22).
Footnote 1: The database generator can be obtained from the TPC Consortium's website at
http://www.tpc.org
In our experiments, we will focus on eleven queries from the TPC-H specification
(Queries 1, 3, 4, 6, 7, 8, 9, 10, 12, 14, and 19). The original TPC-H queries together with
their rewritings are given in Appendix A. Notice that, of the twelve aggregate conjunctive
queries, we rule out only one query. This is Query 5 of the specification, which contains
a nonkey-to-nonkey join, which we cannot handle with our query rewriting algorithm.
(Following the results of Chapter 5, Query 5 is in class c
and thus has no query rewriting

into SQL). Of the eleven queries that we consider, six are strictly in class c
sql
aggforest
(Queries 3, 4, 6, 9, 10, and 12), and the other five can be handled with our rewriting
algorithm RewriteAggSQL with little or no modification for the following reasons. First,
Queries 7 and 8 have repeated relation symbols that appear at leaf nodes of the join
graph. The algorithm RewriteAggSQL can handle this case, since the nonkey variables of
these repeated relation symbols are not involved in any join. Second, Queries 7 and 19
have disjunctions of equalities between attributes and constants. We showed in Chapter
3 that it is quite easy to extend the algorithm that produces a first-order rewriting to
handle this case, and the SQL rewriting algorithm RewriteAggSQL of this chapter can
be used for such cases without modification (the disjunction is considered part of the
selection conditions in the expression cO^To of Figure 6.8). Finally, Queries 8 and
14 perform an arithmetic operation (division) on the result of two aggregate operators,
and Query 1 computes an average. In such cases, we give bounds that are sound, but
not tight (see footnote 2).
In Figure 7.2, we summarize the main characteristics of the eleven queries used in
the experiments. For each query, we give the number of relations in the from clause,
the number of selection conditions in the where clause (this excludes join conditions),
the selectivity (as the percentage of joined tuples that satisfy the selection conditions of
the query), the number of projecting attributes in the select clause, and the number of
aggregate functions in the select clause. The queries in the TPC-H specification are
parameterized, and the standard suggests values for these parameters. In the experiments,
we used the suggested values in all the queries. The selectivities reported in Figure 7.2
are based on these parameters.
Footnote 2: For the queries with the sum operator, all ranges are tight since the queries in
the TPC-H standard only aggregate over attributes with positive values.
       relations   selection    selectivity   projecting   aggregation
                   conditions   (in %)        attrs        functions
Q1     1           1            98.56         2            8
Q3     3           3            0.51          3            1
Q4     2           3            2.35          1            1
Q6     1           4            1.91          0            1
Q7     5           4            0.10          3            1
Q8     7           4            0.04          1            2
Q9     6           1            5.13          2            1
Q10    4           3            1.87          7            1
Q12    2           5            0.51          2            2
Q14    2           2            1.23          0            2
Q19    2           24           0.001         0            1
Figure 7.2: TPC-H queries used in the experiments
7.2 Experimental Results
In this section, we report the results of the experiments that we performed in order to
quantify the overhead of the rewritings produced by ConQuer.
7.2.1 Scalability
In this subsection, we study the scalability of ConQuer's approach. In particular, we
show the effect of the size of the inconsistent databases on the overhead of the rewritten
queries. In Figure 7.3, we report the overhead of the eleven rewritten queries on a number
of databases where we fix the degree of inconsistency to 5% of the database (p = 5%), and
the number of conflicts per inconsistent key value to 2 (n = 2). The size of the databases
(reported on the x-axis) ranges from 1 GB to 20 GB (that is, from 8.6 million tuples to
172 million tuples). The databases are generated independently of each other, and
correspond to the scenario where indices are created only for the key attributes. On the
y-axis, we report the overhead of the rewritten queries, computed as the ratio of the
running time of the rewritten query to the running time of the original (non-rewritten)
query. The rewritings reported in the figure do not exploit annotations (i.e., they are
unaware of annotations, if any, computed as explained in Section 6.4).
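The overhead measure used on the y-axis of the figures amounts to the following small computation. The function name and the timings are hypothetical and serve only to illustrate the ratio; they are not measurements from the experiments.

```python
def overhead(rewritten_seconds, original_seconds):
    """Running time of the rewritten query over that of the original query."""
    if original_seconds <= 0:
        raise ValueError("original running time must be positive")
    return rewritten_seconds / original_seconds


# Hypothetical (rewritten, original) timings in seconds, for illustration only.
timings = {"query A": (30.0, 10.0), "query B": (5.0, 10.0)}
overheads = {name: overhead(r, o) for name, (r, o) in timings.items()}
```

An overhead above 1 means that the rewritten query is slower than the original one, and an overhead below 1 means that it is faster.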
For presentation purposes, we split the queries into three graphs. The queries are
grouped based on the behaviour of the overhead as the size of the databases increases.
The graph at the top shows queries where the overhead initially increases, but then
remains constant or decreases (Queries 1, 7, 12, 14). The graph in the middle shows
queries where the overhead increases monotonically with the size of the database (Queries
3, 8, 10). The rest of the queries are shown in the graph at the bottom (Queries 4, 6, 9,
19).
We identified two factors that have a significant impact on the overhead of the
rewritings: the selectivity of the original queries, and the query plans chosen by DB2's
optimizer. Let us start with the selectivity of the queries. To understand its effect, recall
that in the SQL rewriting algorithm RewriteAggSQL of Figure 6.8, there is a subquery
called candidatesSubQuery that is designed to exploit the selectivity of the original
queries. In particular, this subquery returns only the values for the root-key attributes
that satisfy the conditions of the original query. More specifically, let q be a query,
K_1, ..., K_n be the attributes that appear at some root of the join graph of q, and cO^To
be the selection conditions of q. Then, the rewriting produced by RewriteAggSQL for q
has a subquery of the following form:
select K_1 as cK_1, ..., K_n as cK_n
where cO^To )
Clearly, the lower the selectivity of the original query q, the fewer tuples are returned
by candidatesSubQuery. The rest of the rewriting operates on the result of the following
subquery called countViolSubQuery.
select K_1, ..., K_n, S_1, ..., S_l, A_1, ..., A_u
... K_1, ..., K_n order by S_1, ..., S_l
... K_1, ..., K_n) as countViol,
case when cO^To then 'yes' else 'no' end as satConds
from O1ô
[Figure 7.3, three panels. x-axis: Size (GB), 0 to 20; y-axis: Overhead. Top: Queries 1, 7,
12, 14. Middle: Queries 3, 8, 10. Bottom: Queries 4, 6, 9, 19.]
Figure 7.3: Size of the inconsistent database vs. overhead (running time of rewritten
query over running time of original query) for p = 5% and n = 2
where K_1 = cK_1 and ... and K_n = cK_n )
Notice that the where clause of the subquery restricts the focus to the tuples that
join with those returned by candidatesSubQuery. Since all further processing in the
rewriting is done on the result of countViolSubQuery, the selectivity of the original
query q significantly affects the running time of the rewriting.
We can see in Figure 7.2 that the selectivity of Query 1 is much higher than the
selectivity of all the other queries. More specifically, Query 1 has a selectivity of 98.5%,
whereas the highest selectivity of the other ten queries is 5.1% (Query 9). This explains
the high overhead of the rewriting of Query 1, which goes up to 5.8 times the running
time of the original query on the 20 GB database. The selectivity also explains the low
overhead of Query 19. In this case, the overhead of the rewriting goes up to just 1.2
times the running time of the original query on the 20 GB instance. Notice in Figure 7.2
that this query has a considerably lower selectivity than all other queries: 0.001%. Thus,
in effect, the computation of candidatesSubQuery accounts for most of the running time
of the rewriting, with the computation of the other subqueries having a negligible cost.
We also observed that the query plans selected by DB2 have an effect on the overhead.
For example, all queries involve lineitem, the largest relation of the TPC-H
database, which contains 70% of all tuples in the database. Except for Queries 4 and
10, the running time of all queries (and their rewritings) is dominated by the size of the
lineitem relation. In particular, for all those queries, DB2 selects plans that involve a
costly table scan of the lineitem relation. In contrast, for Queries 4 and 10 (and their
rewritings), the running time is dominated by the size of the smaller relation orders
for the following reasons. First, the plans involve a table scan of relation orders, with
the access to lineitem being done through its clustered index. Second, a low-selectivity
predicate is applied on the tuples retrieved from orders, which are then joined with
those coming from lineitem. Thus, only a very small fraction of the tuples of lineitem
are actually accessed. We conjecture that for this reason most of the processing of both
the original and rewritten queries can be done in main memory, hence the low overhead
of the rewritings of Queries 4 and 10.
The low overhead of Query 6 (with a maximum of 2.1 on the 10 GB instance) can be
explained in terms of the shape of its rewriting. Notice in Figure 7.2 that this is a query
on one relation (hence no joins), and with a relatively low selectivity. Furthermore, it
does not perform any grouping (it has no projecting attributes) and computes just one
aggregate function. This results in a simpler and more efficient rewriting. In particular,
the attributes countProjection and rankProjection of the rewriting do not need to
be computed.
In Figure 7.3, we can observe three trends in the growth of the overhead as we increase
the size of the instances. For some queries, the overhead increases slowly with the size
of the instances (Queries 4, 6, 10, and 19). These are the low-overhead rewritings, and
thus the processing of both the original queries and their rewritings can be done mostly
in main memory. For others, the overhead increases monotonically at a relatively high
rate (Queries 3 and 8). A possible explanation for this behaviour is that the original
queries can do most of their processing in main memory, whereas this is not the case for
the more costly rewritings. Finally, for another group of queries (Queries 1, 7, 9, 12, and
14), the overhead grows initially, and then either remains constant or decreases. The
reason is that as the size of the databases grows, the amount of available main memory
becomes small not only for the rewritten queries but also for the original queries. Hence,
the rate of growth of the ratio between them diminishes.
For Query 9, we slightly modified the query rewriting produced by RewriteAggSQL (the
modified rewriting is equivalent to the one produced by RewriteAggSQL). The reason for
this is that for the rewriting obtained with RewriteAggSQL, DB2 was producing a very
inefficient query plan. For example, on a 2 GB database, the running time of the rewriting
was 28 times the running time of the original query.
We detected that the problem of the rewriting produced by RewriteAggSQL was in
the subquery candidatesSubQuery. To understand the reason, let us show a simplified
version of Query 9:
select n_name as nation,
l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity
from part, supplier, lineitem, partsupp, orders, nation
where s_suppkey = l_suppkey
and ps_suppkey = l_suppkey
and ps_partkey = l_partkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and s_nationkey = n_nationkey
and p_name like '%green%'
The subquery candidatesSubQuery produced by RewriteAggSQL is the following:
select l_orderkey, l_linenumber
from part, supplier, lineitem, partsupp, orders, nation
where s_suppkey = l_suppkey
and ps_suppkey = l_suppkey
and ps_partkey = l_partkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and s_nationkey = n_nationkey
and p_name like '%green%'
An important observation is that if we modify candidatesSubQuery, the rewriting
will still be correct (i.e., compute the consistent answers of Query 9) as long as
candidatesSubQuery still returns the tuples that are candidates to be consistent answers,
i.e., the tuples that satisfy the selection conditions of Query 9. Based on this observation,
we modified the candidatesSubQuery subquery produced by RewriteAggSQL, and
found that DB2 would produce a more efficient query plan. In particular, we removed
the relation partsupp from the from clause of candidatesSubQuery and the conditions
ps_suppkey = l_suppkey and ps_partkey = l_partkey from its where clause.
The overhead reported in Figure 7.3 corresponds to the modified rewriting. Notice
that we do not provide a value for the 20 GB database. The reason is that the execution
of the original Query 9 on the 20 GB database timed out in our experiments because
DB2 came up with a particularly inefficient plan, different from the one chosen for the
other instances.
Besides the peculiarities of each query, an important conclusion of these experiments
is that the query rewritings can scale to large database instances. Even for an instance
of 20 GB (172 million tuples) the overhead of the queries ranges from 1.2 (Query 19) to
5.8 (Query 1). This is remarkable if we take into account that the semantics of consistent
query answering is much more involved than the semantics of traditional query answering.
Let us now consider the rewritings that exploit annotations, as explained in Section
6.4. In our experiments, the only rewriting that benefited substantially from the
annotations was the one on Query 1. The other queries do not benefit from annotations due to
[Figure 7.4, one panel. x-axis: Size (GB), 0 to 20; y-axis: Overhead. Series: Query 1 with
annotations, Query 1 without annotations.]
Figure 7.4: Size of the inconsistent database vs. overhead of the rewritings that exploit
and do not exploit annotations for Query 1 (for an instance where p = 5% and n = 2).
their low selectivity. Recall that Query 1 has a high selectivity of 98.5%, as opposed to
all other queries, whose selectivity is at most 5.1% (Query 9). Since the annotations (in
particular the cons attribute) are used in the where clause of one of the subqueries of the
annotation-aware rewriting, they are in effect reducing the selectivity of the rewriting,
thereby having a more significant impact on the queries with high selectivity.
In Figure 7.4, we focus on Query 1, and we compare the overhead of the annotation-aware
rewriting with the rewriting which does not exploit annotations. As in the previous
figure, we fix the degree of inconsistency to 5% of the database (p = 5%) and the number
of conflicts per inconsistent key value to 2 (n = 2). The size of the databases (reported on
the x-axis) ranges, as before, from 1 to 20 GB. The databases correspond to the scenario
where indices are created for the key attributes and the annotations. On the y-axis, we
report the overhead of the queries, computed as we explained above.
It can be observed that we get a substantial gain by exploiting the annotations. For
example, on the 20 GB instance, the overhead of the rewriting which does not exploit
annotations is 5.8, whereas the overhead of the annotation-aware rewriting is 3.3. That
is, by exploiting the annotations, the running time of the rewriting drops to about 57%
of its value without annotations (3.3/5.8), a reduction of about 43%.
Finally, we performed experiments on databases where indices are created by following
the suggestions of DB2's Configuration Advisor, in addition to the indices on the
key attributes. In Figure 7.5, we report the overhead of the eleven rewritten queries on
a number of databases where we fix the degree of inconsistency to 5% of the database
(p = 5%), and the number of conflicts per inconsistent key value to 2 (n = 2). The size of
the databases (reported on the x-axis) ranges from 1 to 20 GB. On the y-axis, we report
the overhead of the rewritten queries, computed as explained above. The rewritings do not
exploit annotations. The indices suggested by the Configuration Advisor are shown in
Appendix B.
For presentation purposes, we present the queries in three graphs. Notice the different
(linear) scales of the graphs. The graphs at the top and center show queries with low
overhead, whereas the one at the bottom shows queries where the overhead is much
higher.
In the graph at the bottom of Figure 7.5, we can observe a sharp spike in the overhead
of Query 14 on the 5 GB database. The overhead jumps from 2.1 on the 3 GB database
to 25.5 on the 5 GB database, and then decreases to 3.2 on the 10 GB database. This is
due to an index of the 5 GB database that is particularly beneficial to the original query.
This is an index on the lineitem relation and on attributes (l_shipdate, l_discount,
l_extendedprice, l_partkey). The index is not present on any of the other databases.
There is a similar situation for Query 6. In this case, the overhead jumps from 5.1 on the
2 GB database to 31.2 on the 3 GB database. The overhead stays high on the 5 and 10
GB databases (28.3 and 33.5, respectively) and finally decreases sharply on the 20 GB
database to a value of 2.8.
For Queries 8, 9, and 19 we observe the opposite behavior: the overhead lies below
one. That is, the indices benefit the rewritten query considerably more than the
original query. This is most noticeable on Query 9, whose overhead is 0.05 on the 2 GB
database, and 0.04 on the 5 GB database. Notice that the overhead behaves differently
on the 1, 3, and 10 GB databases, where the original query runs faster than the rewritten
query (the overhead is above 1). We do not report the overhead for the 20 GB database
since, as occurred in the scenario with only key constraints, the original query times out.
Excluding the above exceptions, the overhead of all queries is comparable with the
overhead in the scenario where there are indices only for the key constraints. For example,
on the 20 GB database, the overhead of all queries ranges from 1.2 (Query 19) to 5.8
(Query 1) on the databases with just key constraints; and from 1.06 (Query 19) to 5.4
(Query 1) on the databases with indices suggested by the Configuration Advisor.
[Figure 7.5 panels. x-axis: Size (GB), 0 to 20; y-axis: Overhead. Recoverable panel
legends: Queries 3, 4, 7, 8; Queries 1, 6, 14.]
Figure 7.5: Size of the inconsistent database vs. overhead (running time of rewritten
query over running time of original query) for p = 5% and n = 2 using indices suggested
by the Configuration Advisor
7.2.2 Effect of Degree of Inconsistency
In this subsection, we study the effect of the degree of inconsistency of the databases on
the performance of ConQuer's rewritings. We consider the two parameters that determine
the degree of inconsistency: the percentage of the database being inconsistent (p), and
the number of conflicts per inconsistent key value (n).
In Figure 7.6, we report the overhead of the eleven queries on a number of databases
where we fix the size to 3 GB and the number of inconsistencies per key value to 2 (n = 2).
The percentage of inconsistency of the databases (reported on the x-axis) ranges from
0 (totally consistent database) to 25% (a quarter of the database being inconsistent).
On the y-axis, we report the overhead of the rewritten queries, computed as the ratio
of the running time of the rewritten query to the running time of the original
(non-rewritten) query. The rewritings reported in the figure do not exploit annotations
(i.e., they are unaware of annotations, if any). All the databases correspond to the
scenario where indices are created only for the key attributes.
We observed that the overhead is not considerably influenced by the percentage of
inconsistency. This is reasonable since in the rewriting we do not make a distinction
between tuples that violate or satisfy the constraints. In the figure, we can see an
anomaly for Query 14, with its overhead sharply decreasing from 0 to 1%, and then
sharply increasing from 1 to 5%. The reason for this is that, for the rewritten query
and the 1% inconsistent database, DB2 chooses a different plan. In particular, for the
rewritten query on all databases except the 1% inconsistent one, DB2 chooses a plan that
includes one table scan of the lineitem relation and a join that accesses lineitem
through its clustered index. For the 1% inconsistent database, DB2 chooses a different
plan that involves two table scans of lineitem and the application of a low-selectivity
predicate in each case. In this case, the alternative plan turns out to be a good choice:
the overhead becomes lower than in the other cases.
In Figure 7.7, we turn our attention to the number of conflicts per inconsistent key
value. In particular, we report the overhead of the eleven queries on a number of databases
where we fix the size to 1 GB and the percentage of inconsistency to 5% (p = 5%). The
number of conflicts per inconsistent key value (reported on the x-axis) ranges from 1
(totally consistent database) to 7. On the y-axis, we report the overhead of the rewritten
queries, computed as in the other figures. The rewritings considered in the figure do not
exploit annotations.
[Figure 7.6, three panels. x-axis: Percentage of inconsistency, 0 to 25; y-axis: Overhead.
Panels: Queries 1, 3, 4, 6; Queries 7, 8, 9, 10; Queries 12, 14, 19.]
Figure 7.6: Percentage of inconsistency vs. overhead (running time of rewritten query
over running time of original query) for instances of 3 GB and n = 2
As with the percentage of inconsistency, we observed that the number of conflicts per
key value does not have a considerable effect on the overhead of the rewritten queries.
The only exception is Query 9, where the overhead decreases significantly as the number
of conflicts increases. We detected that this is because DB2's optimizer was choosing
different plans on different instances. In particular, the plan chosen for the original
query on the database where n = 7 is so inefficient that it runs more slowly than the
corresponding rewritten query (and, hence, the overhead falls below 1).
[Figure 7.7, three panels. x-axis: Number of inconsistent tuples per key value, 1 to 7;
y-axis: Overhead. Panels: Queries 1, 3, 4, 6; Queries 7, 8, 9, 10; Queries 12, 14, 19.]
Figure 7.7: Number of conicts per inconsistent key value (n) vs. overhead (running
time of rewritten query over running time of original query) for instances of 1 GB and
p = 5%
Chapter 8
Conclusions and Future Work
In this thesis, we presented ConQuer, a system for query answering over inconsistent
databases. We showed the correctness of ConQuer's rewritings for a broad class of
Select-Project-Join queries with set and bag semantics, and with grouping and aggregation. We
also showed the maximality of the class of queries from a complexity-theoretic point of
view. The efficiency and scalability of the approach were empirically validated with an
extensive set of experiments on a commercial database system.
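To make the notion of a rewriting concrete, the following minimal sketch runs a hand-written rewriting for one very simple selection query and checks it against a brute-force enumeration of all repairs. It is written in Python with SQLite rather than the DB2 setup of our experiments. The table customer(custkey, acctbal), its data, and the helper names are hypothetical, and the rewritten SQL only illustrates the idea for this one query; it is not the output of ConQuer's algorithm.

```python
import sqlite3
from itertools import product

# Hypothetical toy relation: customer(custkey, acctbal), key = custkey.
rows = [
    (1, 500),   # key 1: consistent, satisfies acctbal > 100
    (2, 300),   # key 2: inconsistent, but both tuples satisfy the condition
    (2, 900),
    (3, 200),   # key 3: inconsistent, and one tuple violates the condition
    (3, 50),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table customer (custkey int, acctbal int)")
conn.executemany("insert into customer values (?, ?)", rows)

# Original query: select custkey from customer where acctbal > 100
# Illustrative rewriting: keep a key only if no tuple with that key
# violates the selection condition.
REWRITTEN = """
select distinct c.custkey
from customer c
where c.acctbal > 100
  and not exists (select 1 from customer c2
                  where c2.custkey = c.custkey
                    and not (c2.acctbal > 100))
order by c.custkey
"""
rewritten = [key for (key,) in conn.execute(REWRITTEN)]

def brute_force(rows):
    """Consistent answers by definition: intersect the answers over all repairs."""
    groups = {}
    for key, bal in rows:
        groups.setdefault(key, []).append((key, bal))
    answers = [
        {key for key, bal in repair if bal > 100}
        for repair in product(*groups.values())  # one tuple per key value
    ]
    return sorted(set.intersection(*answers))

print(rewritten)          # [1, 2]
print(brute_force(rows))  # [1, 2]
```

The brute-force check enumerates a number of repairs that is exponential in the number of inconsistent key values, whereas the rewritten query is a single SQL statement evaluated over the inconsistent table.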
The assumptions of our work can be relaxed in different directions. For example,
we assumed that the set of constraints that might be violated consists exclusively of
key dependencies. It would be interesting to consider foreign key dependencies as well.
In this way, we would be covering the most common constraints that are supported by
commercial database systems. We are also interested in other constraints, for example
constraints arising from business rules (e.g., a rule saying that a car insurance policy
cannot be held by people who are younger than 18 years old). Regarding the data
model, ConQuer currently works on relational databases. An obvious extension is to
provide support for semi-structured data, such as XML documents. With respect to
queries, we would like to support more expressive query languages, where queries may
have disjunction and negation. We note that this direction of research has been started
recently by Lembo, Rosati and Ruzzi [LRR06], who extend our class C_forest to consider
unions of conjunctive queries.
In this work, we provide exact algorithms that compute all the consistent answers
to a query. We would also like to explore approximation algorithms [Vaz01]. For
example, we could compute results where some consistent answers may be missing. For
Select-Project-Join queries, we could give a formal guarantee on the number of
potentially missing tuples. For queries with aggregation, we could also give formal guarantees
about the ranges for the aggregate functions. An interesting question is whether the
query rewriting algorithms used by ConQuer can be used as a building block of the
approximation algorithms.
It is easy to see that, in general, queries under the consistent answers semantics do
not compose. That is, the consistent answers of a first query cannot be used to compute
the consistent answers of other queries. However, it may be possible to produce auxiliary
data when executing the first query that could be used in turn to obtain the result of other
queries. We would like to characterize what kind of auxiliary information is necessary for
the composition of different classes of queries. One application of these results would be
for OLAP queries [CD97], where the computation of, e.g., roll-up operations is usually
done by composing queries.
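The following toy example, a sketch in Python with hypothetical names and data, illustrates why composition fails. A projection on the key attribute is evaluated in two ways: directly under the consistent answers semantics, and on top of the consistent answers of a first query that returns whole tuples. The second way loses the key value whose tuples are in conflict.

```python
from itertools import product

# Hypothetical toy relation emp(name, dept) with key `name`.
# 'ann' violates the key: she appears with two departments.
emp = [("ann", "sales"), ("ann", "hr"), ("bob", "it")]

def repairs(rows):
    """Yield every repair: keep exactly one tuple per key value."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    for choice in product(*groups.values()):
        yield set(choice)

def consistent_answers(query, rows):
    """Answers that `query` returns on every repair of `rows`."""
    return set.intersection(*(query(r) for r in repairs(rows)))

def q1(db):
    # select name, dept from emp
    return set(db)

def q2(db):
    # select name from emp
    return {name for name, _ in db}

direct = consistent_answers(q2, emp)         # q2 under consistent answers semantics
composed = q2(consistent_answers(q1, emp))   # q2 over the consistent answers of q1

print(sorted(direct))    # ['ann', 'bob']
print(sorted(composed))  # ['bob']
```

Every repair contains some tuple for 'ann', so 'ann' is a consistent answer of the projection, but no single 'ann' tuple survives in the consistent answers of the first query.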
ConQuer currently deals with inconsistencies that occur after the source data has
been transformed to conform to the schema of the integrated database. The problem of
creating the integrated database is called data exchange, and has recently been formalized
by Fagin, Kolaitis, Miller, and Popa [FKMP05]. In this framework, we are given a
source schema, a target schema, and a mapping, which is a declarative specification of
a transformation. Mappings are unidirectional in the data exchange framework, going
from the source to the target schema. The goal is, given a source database, to materialize
a target database that satisfies the mapping. We, together with other authors, have
proposed a generalization of the data exchange framework, called peer data exchange
[FKMT06], where the mapping may be bidirectional (source-to-target and
target-to-source). An important problem in the context of peer data exchange is the
existence-of-solutions problem, which consists of deciding whether it is actually possible to obtain
a target database that satisfies the mapping. Interestingly, the problem of computing
consistent answers under key constraints can be reduced to the existence-of-solutions
problem in the context of peer data exchange, where the key constraints are encoded in
the mapping [Fux04]. This reduction may contribute to the potential application of the
techniques presented in this thesis to the context of peer data exchange.
ConQuer provides an interface that enables the user to gradually clean the database.
In particular, when a query is submitted, the system shows the clean answers together
with a query explanation. The explanation can be extremely valuable, since it often
points to underlying errors in the database that require attention from the user. For key
constraints, the only actions that a user may perform are deleting or modifying tuples.
However, if other constraints are covered in the future, the explanations could trigger
more complex transformations on the database. There are interesting questions as to
how to specify such transformations using, for example, Extract-Transform-Load tools.
Nowadays, there are mature data integration and database management products
on the market. In our opinion, these products should be tightly coupled, with data
integration tools producing databases that are potentially inconsistent, and precisely
characterizing the inconsistency; and database systems exploiting the knowledge about the
inconsistencies to produce better answers. We expect the results in this thesis to be an
initial step in this direction.
Bibliography
[ABC99] M. Arenas, L. Bertossi, and J. Chomicki. Consistent query answers in
inconsistent databases. In Symposium on Principles of Database Systems (PODS),
pages 68–79, 1999.
[ABC00] M. Arenas, L. Bertossi, and J. Chomicki. Specifying and querying database
repairs using logic programs with exceptions. In International Conference
on Flexible Query Answering Systems, pages 27–41, 2000.
[ABC03a] M. Arenas, L. Bertossi, and J. Chomicki. Answer sets for consistent query
answering in inconsistent databases. Theory and Practice of Logic Programming,
3(4-5):392–424, 2003.
[ABC+03b] M. Arenas, L. Bertossi, J. Chomicki, X. He, V. Raghavan, and J. Spinrad.
Scalar Aggregation in Inconsistent Databases. Theoretical Computer Science,
296:405–434, 2003.
[AD98] S. Abiteboul and O. M. Duschka. Complexity of answering queries using
materialized views. In Symposium on Principles of Database Systems (PODS),
pages 254–263, 1998.
[AFM06] P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty
databases: a probabilistic approach. In International Conference on Data
Engineering (ICDE), 2006. Paper 30.
[AHV95] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases.
Addison-Wesley, 1995.
[AKG87] S. Abiteboul, P. Kanellakis, and G. Grahne. On the representation and
querying of sets of possible worlds. In ACM International Conference on the
Management of Data (SIGMOD), pages 34–48, 1987.
[AKWS95] S. Agarwal, A. Keller, G. Wiederhold, and K. Saraswat. Flexible relation:
An approach for the integration of data from multiple, possible inconsistent
databases. In International Conference on Data Engineering (ICDE), pages
495–504, 1995.
[ATMS04] P. Andritsos, P. Tsaparas, R. J. Miller, and K. C. Sevcik. Limbo: Scalable
clustering of categorical data. In International Conference on Extending
Database Technology (EDBT), pages 123–146, 2004.
[Bal91] B. Balzer. Tolerating inconsistency. In International Conference on Software
Engineering (ICSE), pages 158–165, 1991.
[BB03a] P. Barcelo and L. Bertossi. Logic programs for querying inconsistent
databases. In International Symposium on Practical Aspects of Declarative
Languages, pages 208–222, 2003.
[BB03b] L. Bravo and L. Bertossi. Logic programs for consistently querying data
integration systems. In International Joint Conference on Artificial Intelligence
(IJCAI), pages 10–15, 2003.
[BBFL05] L. Bertossi, L. Bravo, E. Franconi, and A. Lopatenko. Fixing numerical
attributes under integrity constraints. In International Symposium on Database
Programming Languages (DBPL), pages 262–278, 2005.
[BC03] L. Bertossi and J. Chomicki. Logics for Emerging Applications of Databases,
chapter Query Answering in Inconsistent Databases, pages 43–83. Springer,
2003.
[Ber06] L. Bertossi. Consistent query answering in databases. ACM SIGMOD
Record, 35(2):68–76, 2006. Database Principles column.
[BKT01] P. Buneman, S. Khanna, and W. Tan. Why and where: A characterization of
data provenance. In International Conference on Database Theory (ICDT),
pages 316–330, 2001.
[BMFR05] P. Bohannon, F. Michael, W. Fan, and R. Rastogi. A cost-based model and
effective heuristic for repairing constraints by value modification. In ACM
International Conference on the Management of Data (SIGMOD), pages
143–154, 2005.
[BMP92] D. Barbara, H. Garcia Molina, and D. Porter. The management of
probabilistic data. IEEE Transactions on Knowledge and Data Engineering (TKDE),
4:487–502, 1992.
[CB00] A. Celle and L. Bertossi. Querying inconsistent databases: Algorithms and
implementation. In Computational Logic (CL), pages 942–956, 2000.
[CB05] M. Caniupan and L. Bertossi. Optimizing repair programs for consistent
query answering. In International Conference of the Chilean Computer
Science Society, pages 3–12, 2005.
[CD97] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP
technology. ACM SIGMOD Record, 26(1):65–74, 1997.
[CLR03a] A. Calì, D. Lembo, and R. Rosati. On the decidability and complexity of
query answering over inconsistent and incomplete databases. In Symposium
on Principles of Database Systems (PODS), pages 260–271, 2003.
[CLR03b] A. Calì, D. Lembo, and R. Rosati. Query rewriting and answering under
constraints in data integration systems. In International Joint Conference
on Artificial Intelligence (IJCAI), pages 16–21, 2003.
[CM77] A. Chandra and P. Merlin. Computable queries for relational databases. In
ACM Symposium on the Theory of Computing (STOC), pages 77–90, 1977.
[CM05] J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance
using tuple deletions. Information and Computation, 197(1-2):90–121, 2005.
[CMS04a] J. Chomicki, J. Marcinkowski, and S. Staworko. Computing Consistent
Query Answers using Conflict Hypergraphs. In International Conference
on Information and Knowledge Management (CIKM), pages 417–426, 2004.
[CMS04b] J. Chomicki, J. Marcinkowski, and S. Staworko. Hippo: A System for
Computing Consistent Answers to a Class of SQL Queries. In International
Conference on Extending Database Technology (EDBT), pages 841–844, 2004.
[CNS99] S. Cohen, W. Nutt, and A. Serebrenik. Rewriting aggregate queries using
views. In Symposium on Principles of Database Systems (PODS), pages
155–166, 1999.
[CNS03] S. Cohen, W. Nutt, and Y. Sagiv. Containment of aggregate queries. In
International Conference on Database Theory (ICDT), pages 111–125, 2003.
[CP87] R. Cavallo and M. Pittarelli. The theory of probabilistic databases. In
International Conference on Very Large Databases (VLDB), pages 71–81,
1987.
[CV93] S. Chaudhuri and M. Vardi. Optimization of real conjunctive queries. In
Symposium on Principles of Database Systems (PODS), pages 59–70, 1993.
[CW03] Y. Cui and J. Widom. Lineage tracing for general data warehouse
transformations. Very Large Databases (VLDB) Journal, 12(1):41–58, 2003.
[DeM89] L. DeMichiel. Resolving database incompatibility: An approach to
performing relational operations over mismatched domains. In IEEE Transactions
on Knowledge and Data Engineering (TKDE), pages 485–493, 1989.
[DJ03] T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John
Wiley, 2003.
[DS04] N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases.
In International Conference on Very Large Databases (VLDB), pages
864–875, 2004.
[EFGL03] T. Eiter, M. Fink, G. Greco, and D. Lembo. Efficient Evaluation of Logic
Programs for Querying Data Integration Systems. In International
Conference on Logic Programming (ICLP), pages 163–177, 2003.
[FFM05a] A. Fuxman, E. Fazli, and R. J. Miller. ConQuer: Efficient management of
inconsistent databases. In ACM International Conference on the Management
of Data (SIGMOD), pages 155–166, 2005.
[FFM05b] A. Fuxman, D. Fuxman, and R. J. Miller. ConQuer: A system for
efficient querying over inconsistent databases. International Conference on Very
Large Databases (VLDB), pages 1354–1357, 2005.
[FFP05] S. Flesca, F. Furfaro, and F. Parisi. Consistent query answers on
numerical databases under aggregate constraints. In International Symposium on
Database Programming Languages (DBPL), pages 279–294, 2005.
[FKMP05] R. Fagin, P. Kolaitis, R. Miller, and L. Popa. Data exchange: semantics and
query answering. Theoretical Computer Science, 336(1):89–124, 2005.
[FKMT06] A. Fuxman, P. Kolaitis, R. J. Miller, and W. Tan. Peer data exchange. ACM
Transactions on Database Systems, 2006. To appear in a special issue with
selected papers from PODS 2005.
[FM05] A. Fuxman and R. J. Miller. First-order query rewriting for inconsistent
databases. In International Conference on Database Theory (ICDT), pages
337–351, 2005.
[FM06] A. Fuxman and R. J. Miller. First-order query rewriting for inconsistent
databases. Journal of Computer and System Sciences (JCSS), 2006. To
appear.
[FPL+01] E. Franconi, A. Laureti Palma, N. Leone, S. Perri, and F. Scarcello. Census
data repair: a challenging application of disjunctive logic programming. In
Logic for Programming, Artificial Intelligence, and Reasoning (LPAR), pages
561–578, 2001.
[FR97] N. Fuhr and T. Rolleke. A probabilistic relational algebra for the
integration of information retrieval and database systems. ACM Transactions on
Information Systems, 15(1):32–66, 1997.
[Fux04] A. Fuxman. A survey of the applications of schema mapping and the certain
answers semantics. Technical Report CSRG-541, University of Toronto, 2004.
Available at ftp://ftp.cs.toronto.edu/cs/ftp/pub/reports/csrg/541.
[GGZ01] G. Greco, S. Greco, and E. Zumpano. A logic programming approach to
the integration, repairing and querying of inconsistent databases. In
International Conference on Logic Programming (ICLP), pages 348–364, 2001.
[GLRR05] L. Grieco, D. Lembo, R. Rosati, and M. Ruzzi. Consistent query
answering under key and exclusion dependencies: Algorithms and experiments.
In International Conference on Information and Knowledge Management
(CIKM), pages 792–799, 2005.
[GM96] S. Grumbach and T. Milo. Towards tractable algebras for bags. In Journal
of Computer and System Sciences (JCSS), volume 52, pages 570–588, 1996.
[GR95] P. Gardenfors and H. Rott. Handbook of Logic in Artificial Intelligence and
Logic Programming, volume 4, chapter Belief Revision, pages 35–132. Oxford
University Press, 1995.
[GRT99] S. Grumbach, M. Rafanelli, and L. Tininini. Querying aggregate data.
In Symposium on Principles of Database Systems (PODS), pages 174–184,
1999.
[GZ00] S. Greco and E. Zumpano. Querying inconsistent databases. In Logic for
Programming, Artificial Intelligence, and Reasoning (LPAR), pages 308–325,
2000.
[HK75] J. Hopcroft and R. M. Karp. An O(n^2.5) algorithm for maximum matching
in bipartite graphs. SIAM Journal of Computing, 2:225–231, 1975.
[HLNW01] L. Hella, L. Libkin, J. Nurmonen, and L. Wong. Logics with aggregate
operators. Journal of the ACM, 48(4):880–907, 2001.
[IR95] Y. Ioannidis and R. Ramakrishnan. Containment of conjunctive queries:
Beyond relations as sets. ACM Transactions on Database Systems,
20(3):288–324, 1995.
[ISO01] ISO. SQL - part 2: Foundation (SQL/Foundation) - amendment 1: On-line
analytical processing (SQL/OLAP). Technical Report 9075-2-1999/Amd1-
2001, INCITS/ISO/IEC, 2001.
[IvdMV95] T. Imielinski, R. van der Meyden, and K. Vadaparty. Complexity tailored
design: A new design methodology for databases with incomplete information.
Journal of Computer and System Sciences (JCSS), 51(3):405–432, 1995.
[Lad75] R. E. Ladner. On the structure of polynomial time reducibility. Journal of
the ACM, 22(1):155–171, 1975.
[Lev81] H. Levesque. A Formal Treatment of Incomplete Knowledge Bases. PhD
thesis, University of Toronto, 1981.
[Lip79] W. Lipski. On semantic issues connected with incomplete information
databases. ACM Transactions on Database Systems, 4(3):262–296, 1979.
[Lip81] W. Lipski. On databases with incomplete information. Journal of the ACM,
28(1):41–70, 1981.
[LLR02] D. Lembo, M. Lenzerini, and R. Rosati. Source inconsistency and
incompleteness in data integration. In International Workshop on Knowledge
Representation meets Databases (KRDB), 2002.
[LLRS97] L. Lakshmanan, N. Leone, R. Ross, and V. Subrahmanian. Probview: A
flexible probabilistic database system. ACM Transactions on Database Systems,
22(3):419–469, 1997.
[LM96] J. Lin and A. Mendelzon. Merging databases under constraints. International
Journal of Cooperative Information Systems, 7(1):55–76, 1996.
[LRR06] D. Lembo, R. Rosati, and M. Ruzzi. On the first-order reducibility of unions
of conjunctive queries over inconsistent databases. In International
Workshop on Inconsistency and Incompleteness in Databases, pages 17–32, 2006.
[LW95] L. Libkin and L. Wong. On representation and querying incomplete
information in databases with bags. Information Processing Letters, 56(4):209–214,
1995.
[LW97] L. Libkin and L. Wong. Query languages for bags and aggregate functions.
Journal of Computer and System Sciences (JCSS), 55(2):241–272, 1997.
[Moo85] R. Moore. Formal Theories of the Commonsense World, chapter A Formal
Theory of Knowledge and Action, pages 319–358. 1985.
[NER00] B. Nuseibeh, S. Easterbrook, and A. Russo. Leveraging inconsistency in
software development. IEEE Computer, 33(4):24–29, 2000.
[Ost70] P. Ostrand. Systems of distinct representatives. Journal of Mathematical
Analysis and Applications, 32:1–4, 1970.
[TPC03] Transaction Processing Performance Council: TPC. TPC Benchmark H
(Decision Support). Standard Specification Revision 2.1.0, 2003.
[Vaz01] V. Vazirani. Approximation Algorithms. Springer, 2001.
[vdM98] R. van der Meyden. Logical approaches to incomplete information: A survey.
In Logics for Databases and Information Systems, pages 307–356. Kluwer,
1998.
[Wij05] J. Wijsen. Database repairing using updates. ACM Transactions on
Database Systems, 30(3):722–768, 2005.
Appendix A
TPC-H Queries and their Rewritings
The following are the queries from the TPC-H standard [TPC03] that we employed in
our experiments, together with their rewritings.
TPC-H Query 1
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date('1998-12-01') - 90 DAYS
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;
Rewritten Query 1
with candidatesSubQuery as (
select
l_orderkey,
l_linenumber
from
lineitem
where
l_shipdate <= date('1998-12-01') - 90 DAYS
),
contribAllSubQuery as (
select
l_returnflag,
l_linestatus,
max(l_quantity) as max_qty,
max(l_extendedprice) as max_extendedprice,
max(l_extendedprice * (1 - l_discount)) as max_disc_price,
max(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as max_charge,
max(l_discount) as max_disc,
min(l_quantity) as min_qty,
min(l_extendedprice) as min_extendedprice,
min(l_extendedprice * (1 - l_discount)) as min_disc_price,
min(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as min_charge,
min(l_discount) as min_disc,
condWhereViol,
condWhereSat,
max(rankProj) over (partition by l_orderkey,l_linenumber) as countProj
from
(select
l_orderkey,
l_linenumber,
l_returnflag,
l_linestatus,
l_quantity,
l_extendedprice,
l_discount,
l_tax,
rank() over (partition by l_orderkey,l_linenumber
order by l_returnflag,l_linestatus)
as rankProj,
sum(case
when l_shipdate <= date('1998-12-01') - 90 days then 0 else 1 end)
over (partition by l_orderkey,l_linenumber) as condWhereViol,
case
when l_shipdate <= date('1998-12-01') - 90 days then 1 else 0 end
as condWhereSat
from lineitem li
where
exists (select * from candidatesSubQuery sc
where li.l_orderkey=sc.l_orderkey and
li.l_linenumber=sc.l_linenumber)
) q
where condWhereSat = 1
group by l_orderkey,l_linenumber,l_returnflag,l_linestatus,condWhereViol,
condWhereSat,rankProj),
contribConsistentSubQuery as (
select
l_returnflag,
l_linestatus,
max_qty,
max_extendedprice,
max_disc_price,
max_charge,
max_disc,
min_qty,
min_extendedprice,
min_disc_price,
min_charge,
min_disc,
1 as countConsistent
from
contribAllSubQuery Cand
where condWhereViol = 0 and countProj=1),
contribNonConsistentSubQuery as (
select
l_returnflag,
l_linestatus,
max_qty,
max_extendedprice,
max_disc_price,
max_charge,
max_disc,
0 as min_qty,
0 as min_extendedprice,
0 as min_disc_price,
0 as min_charge,
0 as min_disc,
0 as countConsistent
from
contribAllSubQuery Cand
where condWhereViol >= 1 or countProj > 1)
select
l_returnflag,
l_linestatus,
sum(max_qty) as max_sum_qty,
sum(max_extendedprice) as max_sum_base_price,
sum(max_disc_price) as max_sum_disc_price,
sum(max_charge) as max_sum_charge,
sum(max_qty)/sum(countConsistent) as max_avg_qty,
sum(max_extendedprice)/sum(countConsistent) as max_avg_price,
sum(max_disc)/sum(countConsistent) as max_avg_disc,
count(*) as max_count_order,
sum(min_qty) as min_sum_qty,
sum(min_extendedprice) as min_sum_base_price,
sum(min_disc_price) as min_sum_disc_price,
sum(min_charge) as min_sum_charge,
sum(min_qty)/sum(countConsistent) as min_avg_qty,
sum(min_extendedprice)/sum(countConsistent) as min_avg_price,
sum(min_disc)/sum(countConsistent) as min_avg_disc,
sum(countConsistent) as min_count_order
from
(select * from contribConsistentSubQuery
union all
select * from contribNonConsistentSubQuery) q
group by
l_returnflag,
l_linestatus
having sum(countConsistent)>0
order by
l_returnflag,
l_linestatus;
TPC-H Query 3
select
l_orderkey,
sum(l_extendedprice * (1 - l_discount)) as revenue,
o_orderdate,
o_shippriority
from
customer,
orders,
lineitem
where
c_mktsegment = 'BUILDING'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < '1995-03-15'
and l_shipdate > '1995-03-15'
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate
fetch first 10 rows only;
Rewritten Query 3
select
l_orderkey,
l_linenumber
from
customer,
orders,
lineitem
where
c_mktsegment = 'BUILDING'
),
select
l_orderkey,
l_linenumber,
o_orderdate,
o_shippriority,
min(l_extendedprice * (1 - l_discount)) as min_revenue,
max(l_extendedprice * (1 - l_discount)) as max_revenue,
1 as min_count,
max(rankProj) over (partition by l_orderkey,l_linenumber) as countProj,
cond_viol,
cond_sat
from
(select
l_orderkey,
l_linenumber,
o_orderdate,
o_shippriority,
l_extendedprice,
l_discount,
order by o_orderdate,o_shippriority)
as rankProj,
sum(case
when c_mktsegment = 'BUILDING'
then 0 else 1 end)
over (partition by l_orderkey,l_linenumber) as cond_viol,
case when c_mktsegment = 'BUILDING'
then 1 else 0 end as cond_sat
from orders o1 JOIN lineitem l ON l_orderkey = o1.o_orderkey
LEFT OUTER JOIN customer ON c_custkey=o1.o_custkey
where
where l.l_orderkey=sc.l_orderkey and l.l_linenumber=sc.l_linenumber)
) q
where cond_sat=1
group by
l_orderkey,
l_linenumber,
o_orderdate,
o_shippriority,
cond_viol,cond_sat,rankProj
),
select l_orderkey,
l_linenumber,
o_orderdate,
o_shippriority,
min_revenue,
max_revenue,
min_count
from
where
countProj = 1 and cond_viol=0),
select
l_orderkey,
l_linenumber,
o_orderdate,
o_shippriority,
0 as min_revenue,
max_revenue,
0 as min_count
from
where
countProj > 1 or cond_viol >= 1)
select
l_orderkey,
o_orderdate,
o_shippriority,
sum(min_revenue) as sum_min_revenue,
sum(max_revenue) as sum_max_revenue
from
(select * from contribNonConsistentSubQuery
union all
select * from contribConsistentSubQuery) as q
group by
l_orderkey,
o_orderdate,
o_shippriority
having sum(min_count)>0
order by
sum_min_revenue desc,
o_orderdate
TPC-H Query 4
select
o_orderpriority,
count(*) as order_count
from
orders
where
o_orderdate >= '1993-07-01'
and o_orderdate < date('1993-07-01') + 3 MONTHS
and exists (
select *
from
lineitem
where
l_orderkey = o_orderkey
and l_commitdate < l_receiptdate
)
group by
o_orderpriority
order by
o_orderpriority;
Rewritten Query 4
select
l_orderkey,
l_linenumber
from
orders, lineitem
where
),
select
l_orderkey,
l_linenumber,
o_orderpriority,
1 as min_count,
cond_viol,
cond_sat
from
(select
l_orderkey,
l_linenumber,
o_orderpriority,
order by o_orderpriority) as rankProj,
sum(case
when l_commitdate < l_receiptdate and
then 0 else 1 end)
case when l_commitdate < l_receiptdate and
from orders, lineitem li
where
l_orderkey = o_orderkey
and
where li.l_orderkey=sc.l_orderkey and li.l_linenumber=sc.l_linenumber)
) q
where cond_sat=1
group by
l_orderkey,
l_linenumber,
o_orderpriority,
cond_viol,cond_sat,rankProj),
select l_orderkey,
l_linenumber,
o_orderpriority,
min_count
from
where
countProj =1 and cond_viol=0),
select
l_orderkey,
l_linenumber,
o_orderpriority,
0 as min_count
from
where
select
o_orderpriority,
count(*) as max_order_count,
sum(min_count) as min_order_count
from
union all
group by
o_orderpriority
order by
o_orderpriority;
TPC-H Query 6
select
sum(l_extendedprice * l_discount) as revenue
from
lineitem
where
l_shipdate >= '1994-01-01'
and l_shipdate < date('1994-01-01') + 1 YEAR
and l_discount >= 0.06 - 0.01
and l_discount <= 0.06 + 0.01
and l_quantity < 24;
Rewritten Query 6
select
l_orderkey,
l_linenumber
from
lineitem
where
l_shipdate >= '1994-01-01'
and l_quantity < 24
),
select l_orderkey,
l_linenumber,
min(l_extendedprice * l_discount) as min_revenue,
max(l_extendedprice * l_discount) as max_revenue,
cond_viol,
cond_sat
from (
select
l_orderkey,
l_linenumber,
l_extendedprice,
l_discount,
sum(case
when l_shipdate >= '1994-01-01'
and l_quantity < 24
then 0 else 1 end)
case when l_shipdate >= '1994-01-01'
and l_quantity < 24
from
lineitem li
where
where li.l_orderkey=sc.l_orderkey
and li.l_linenumber=sc.l_linenumber) ) q
where cond_sat=1
group by l_orderkey,
l_linenumber, cond_viol,cond_sat),
select l_orderkey,
l_linenumber,
min_revenue,
max_revenue
from
where
cond_viol=0),
select l_orderkey,
l_linenumber,
0 as min_revenue,
max_revenue
from
where cond_viol >= 1
)
select
sum(min_revenue) as min_sum_revenue,
sum(max_revenue) as max_sum_revenue
from
union all
select * from contribConsistentSubQuery) as q;
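The way Rewritten Query 6 bounds the sum (a key whose tuples all satisfy the selection contributes its minimum to the lower bound, a key with at least one violating tuple contributes 0, and every key with a satisfying tuple contributes its maximum to the upper bound) can be sketched on a toy table. The sketch below is in Python with SQLite rather than DB2. The table item(okey, price, qty), its data, and the helper names are hypothetical; the SQL is a simplified illustration that assumes non-negative values and is not the rewriting produced by ConQuer. A brute-force enumeration of repairs is included as a cross-check.

```python
import sqlite3
from itertools import product

# Hypothetical toy table: item(okey, price, qty), key = okey.
rows = [
    (1, 10.0, 5),    # key 1: consistent, satisfies qty < 24
    (2, 20.0, 3),    # key 2: two conflicting tuples, both satisfy
    (2, 30.0, 7),
    (3, 40.0, 2),    # key 3: one tuple satisfies, one violates
    (3, 50.0, 30),
    (4, 60.0, 99),   # key 4: no tuple satisfies
]

conn = sqlite3.connect(":memory:")
conn.execute("create table item (okey int, price real, qty int)")
conn.executemany("insert into item values (?, ?, ?)", rows)

# Original query: select sum(price) from item where qty < 24
BOUNDS_SQL = """
with per_key as (
    select okey,
           min(case when qty < 24 then price end) as min_price,
           max(case when qty < 24 then price end) as max_price,
           sum(case when qty < 24 then 0 else 1 end) as cond_viol
    from item
    group by okey
    having sum(case when qty < 24 then 1 else 0 end) > 0
)
select sum(case when cond_viol = 0 then min_price else 0 end),
       sum(max_price)
from per_key
"""
low, high = conn.execute(BOUNDS_SQL).fetchone()

def brute_force(rows):
    """Smallest and largest value of the sum over all repairs."""
    groups = {}
    for okey, price, qty in rows:
        groups.setdefault(okey, []).append((price, qty))
    totals = [
        sum(price for price, qty in repair if qty < 24)
        for repair in product(*groups.values())  # one tuple per key value
    ]
    return min(totals), max(totals)

print((low, high))        # (30.0, 80.0)
print(brute_force(rows))  # (30.0, 80.0)
```

The per-key grouping plays the role of the cond_viol and cond_sat columns in the rewritten query: key 3 has a violating tuple, so it adds nothing to the lower bound but still adds its satisfying price to the upper bound.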
TPC-H Query 7
select
supp_nation,
cust_nation,
l_year,
sum(volume) as revenue
from
(
select
n1.n_name as supp_nation,
n2.n_name as cust_nation,
year(l_shipdate) as l_year,
l_extendedprice * (1 - l_discount) as volume
from
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2
where
s_suppkey = l_suppkey
and o_orderkey = l_orderkey
and c_custkey = o_custkey
and s_nationkey = n1.n_nationkey
and c_nationkey = n2.n_nationkey
and (
(n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')
or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')
)
and l_shipdate >= '1995-01-01'
and l_shipdate <= '1996-12-31'
) as shipping
group by
supp_nation,
cust_nation,
l_year
order by
supp_nation,
cust_nation,
l_year;
Rewritten Query 7
select
l_orderkey,
l_linenumber
from
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2
where
and (
)
and l_shipdate <= '1996-12-31'
),
select
supp_nation,
cust_nation,
l_year,
min(volume) as low_revenue,
max(volume) as up_revenue,
condWhereSat,
condWhereViol,
max(rankProj) as countProj
from
(
select
l_orderkey,
l_linenumber,
n1.n_name as supp_nation,
n2.n_name as cust_nation,
year(l_shipdate) as l_year,
l_extendedprice * (1 - l_discount) as volume,
order by n1.n_name,n2.n_name,year(l_shipdate)) as rankProj,
sum(case
when (
and (
)
and l_shipdate <= '1996-12-31')
then 0 else 1 end)
case
when (
and (
)
and l_shipdate <= '1996-12-31')
then 1 else 0 end as condWhereSat
from
lineitem li JOIN orders ON o_orderkey = l_orderkey
LEFT OUTER JOIN supplier ON s_suppkey = l_suppkey
LEFT OUTER JOIN nation n1 ON s_nationkey = n1.n_nationkey
LEFT OUTER JOIN customer ON c_custkey = o_custkey
LEFT OUTER JOIN nation n2 ON c_nationkey = n2.n_nationkey
where
exists
(select * from candidatesSubQuery sc
and li.l_linenumber=sc.l_linenumber)
) q
where condWhereSat=1
group by
l_orderkey,
l_linenumber,
supp_nation,
cust_nation,
l_year,
condWhereSat,condWhereViol),
select
supp_nation,
cust_nation,
l_year,
low_revenue,
up_revenue,
from
where condWhereViol = 0 and countProj = 1),
select
supp_nation,
cust_nation,
l_year,
low_revenue,
0 as up_revenue,
from
where condWhereViol >= 1 or countProj >1)
select
supp_nation,
cust_nation,
l_year,
sum(low_revenue) as low_sum_revenue,
sum(up_revenue) as up_sum_revenue
from
union all
group by
supp_nation,
cust_nation,
l_year
order by
supp_nation,
cust_nation,
l_year;
TPC-H Query 8
select
YEAR(o_orderdate) as o_year,
sum(case
when n2.n_name = 'BRAZIL' then l_extendedprice * (1 - l_discount)
else 0
end) / sum(l_extendedprice * (1 - l_discount)) as mkt_share
from
part,
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2,
region
where
p_partkey = l_partkey
and s_suppkey = l_suppkey
and l_orderkey = o_orderkey
and o_custkey = c_custkey
and c_nationkey = n1.n_nationkey
and n1.n_regionkey = r_regionkey
and r_name = 'AMERICA'
and s_nationkey = n2.n_nationkey
and o_orderdate >= '1995-01-01'
and o_orderdate <= '1996-12-31'
and p_type = 'ECONOMY ANODIZED STEEL'
group by
YEAR(o_orderdate)
order by
YEAR(o_orderdate);
Rewritten Query 8
select
l_orderkey,
l_linenumber
from
part,
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2,
region
where
and p_type = 'ECONOMY ANODIZED STEEL'
),
select
o_year,
min(dividend) as low_dividend,
max(dividend) as up_dividend,
min(divisor) as low_divisor,
max(divisor) as up_divisor,
condWhereSat,
condWhereViol,
max(rankProj) over (partition by l_orderkey,l_linenumber) as countProj
from
(
select
l_orderkey,
l_linenumber,
case
when n2.n_name = 'BRAZIL' then l_extendedprice * (1 - l_discount)
else 0
end as dividend,
l_extendedprice * (1 - l_discount) as divisor,
order by YEAR(o_orderdate)) as rankProj,
sum(case when
(p_partkey = l_partkey
and p_type = 'ECONOMY ANODIZED STEEL')
then 0 else 1 end)
case when
(p_partkey = l_partkey
and p_type = 'ECONOMY ANODIZED STEEL')
from
lineitem li JOIN orders ON l_orderkey = o_orderkey
LEFT OUTER JOIN supplier ON s_suppkey = l_suppkey
LEFT OUTER JOIN nation n2 ON s_nationkey = n2.n_nationkey
LEFT OUTER JOIN part ON p_partkey = l_partkey
LEFT OUTER JOIN customer ON o_custkey = c_custkey
LEFT OUTER JOIN nation n1 ON c_nationkey = n1.n_nationkey
LEFT OUTER JOIN region ON n1.n_regionkey = r_regionkey
where
) q
where
condWhereSat=1
group by
l_orderkey,l_linenumber,o_year,condWhereSat,condWhereViol,rankProj),
select
o_year,
low_dividend,
up_dividend,
low_divisor,
up_divisor,
from
where condWhereViol = 0 and countProj=1),
select
o_year,
0 as low_dividend,
up_dividend,
0 as low_divisor,
up_divisor,
from
where condWhereViol >= 1 or countProj > 1)
select o_year,
sum(low_dividend)/sum(up_divisor) as low_mkt_share,
sum(up_dividend)/sum(low_divisor) as up_mktshare
from
union all
group by o_year
order by o_year;
TPC-H Query 9
select
n_name,
sum(l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity) as sum_profit
from
part,
supplier,
lineitem,
partsupp,
orders,
nation
where
s_suppkey = l_suppkey
and ps_suppkey = l_suppkey
and ps_partkey = l_partkey
and p_partkey = l_partkey
and o_orderkey = l_orderkey
and s_nationkey = n_nationkey
and p_name like '%green%'
group by
n_name,
o_orderdate
order by
n_name,
o_orderdate desc;
Rewritten Query 9
select
l_orderkey,
l_linenumber
from
part,
supplier,
lineitem,
orders,
nation
where
),
select
l_orderkey,
l_linenumber,
n_name as nation,
min(l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity) as min_profit,
max(l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity) as max_profit,
1 as min_count,
cond_viol,
cond_sat
from
(select l_orderkey,
l_linenumber,
n_name,
o_orderdate,
l_extendedprice,
l_discount,
ps_supplycost,
l_quantity,
order by n_name,YEAR(o_orderdate)) as rankProj,
sum(case
when s_suppkey = l_suppkey
then 0 else 1 end)
case when
from
lineitem l JOIN orders o1 ON l_orderkey=o1.o_orderkey
LEFT OUTER JOIN part ON p_partkey=l_partkey
LEFT OUTER JOIN supplier ON s_suppkey=l_suppkey
LEFT OUTER JOIN nation n1 ON n1.n_nationkey=s_nationkey
LEFT OUTER JOIN partsupp ON ps_partkey=l_partkey and ps_suppkey=l_suppkey
where
where l.l_orderkey=sc.l_orderkey
and l.l_linenumber=sc.l_linenumber)
) q
where cond_sat=1
group by
l_orderkey,
l_linenumber,
n_name,
o_orderdate,
select l_orderkey,
l_linenumber,
nation,
o_year,
min_profit,
max_profit,
min_count
from
where
countProj = 1 and cond_viol=0),
select
l_orderkey,
l_linenumber,
nation,
o_year,
0 as min_profit,
max_profit,
0 as min_count
from
where
select
nation,
o_year,
sum(min_profit) as min_sum_profit,
sum(max_profit) as max_sum_profit
from
union all
group by
nation,
o_year
order by
nation,
o_year desc;
TPC-H Query 10
select
c_custkey,
c_name,
sum(l_extendedprice * (1 - l_discount)) as revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
from
customer,
orders,
lineitem,
nation
where
c_custkey = o_custkey
and l_returnflag = 'R'
and c_nationkey = n_nationkey
group by
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
order by
revenue desc
Rewritten Query 10
select
l_orderkey,
l_linenumber
from
customer,
orders,
lineitem,
nation
where
c_custkey = o_custkey
),
select
l_orderkey,
l_linenumber,
c_custkey,
c_name,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment,
min(l_extendedprice * (1 - l_discount)) as min_revenue,
max(l_extendedprice * (1 - l_discount)) as max_revenue,
1 as min_count,
max (rankProj) over (partition by l_orderkey,l_linenumber) as countProj,
cond_viol,
cond_sat
from
(select l_orderkey,
l_linenumber,
c_custkey,
c_name,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment,
l_extendedprice,
l_discount,
order by c_custkey,c_name,
c_acctbal,n_name, c_address, c_phone,
c_comment) as rankProj,
sum(case
when c_custkey = o_custkey
then 0 else 1 end)
case when c_custkey = o_custkey
from
lineitem l JOIN orders on l_orderkey=o_orderkey
LEFT OUTER JOIN customer c1 ON c1.c_custkey=o_custkey
LEFT OUTER JOIN nation n1 ON n1.n_nationkey=c1.c_nationkey
where
) q
where cond_sat=1
group by
l_orderkey,
l_linenumber,
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment,
select l_orderkey,
l_linenumber,
c_custkey,
c_name,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment,
min_revenue,
max_revenue,
min_count
from
where
countProj=1 and cond_viol=0),
select
l_orderkey,
l_linenumber,
c_custkey,
c_name,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment,
0 as min_revenue,
max_revenue,
0 as min_count
from
where
select
c_custkey,
c_name,
sum(min_revenue) as min_sum_revenue,
sum(max_revenue) as max_sum_revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
from
union all
group by
c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
order by
min_sum_revenue desc
TPC-H Query 12
select
l_shipmode,
sum(case
when o_orderpriority = '1-URGENT'
or o_orderpriority = '2-HIGH'
then 1
else 0
end) as high_line_count,
sum(case
when o_orderpriority <> '1-URGENT'
and o_orderpriority <> '2-HIGH'
then 1
else 0
end) as low_line_count
from
orders,
lineitem
where
o_orderkey = l_orderkey
and l_shipmode in ('MAIL', 'SHIP')
and l_shipdate < l_commitdate
and l_receiptdate >= '1994-01-01'
and l_receiptdate < date('1994-01-01') + 1 YEAR
group by
l_shipmode
order by
l_shipmode;
Rewritten Query 12
select
l_orderkey,
l_linenumber
from
orders,
lineitem
where
o_orderkey = l_orderkey
and l_shipmode in ('MAIL', 'SHIP')
),
select l_orderkey,
l_linenumber,
l_shipmode,
min(case
then 1
else 0
end) as min_high_line_count,
max(case
then 1
else 0
end) as max_high_line_count,
min(case
then 1
else 0
end) as min_low_line_count,
max(case
then 1
else 0
end) as max_low_line_count,
1 as min_count,
cond_viol,
cond_sat
from
(select l_orderkey,
l_linenumber,
l_shipmode,
o_orderpriority,
order by l_shipmode) as rankProj,
sum(case
when l_shipmode in ('MAIL', 'SHIP')
then 0 else 1 end)
case when l_shipmode in ('MAIL', 'SHIP')
from orders JOIN lineitem l ON l_orderkey = o_orderkey
where
) q
where cond_sat=1
group by
l_shipmode,
l_orderkey,
l_linenumber,
select l_orderkey,
l_linenumber,
l_shipmode,
min_high_line_count,
max_high_line_count,
min_low_line_count,
max_low_line_count,
min_count
from
where
countProj=1 and cond_viol=0),
select
l_orderkey,
l_linenumber,
l_shipmode,
0 as min_high_line_count,
max_high_line_count,
0 as min_low_line_count,
max_low_line_count,
0 as min_count
from
where
select
l_shipmode,
sum(min_high_line_count) as sum_min_high_line_count,
sum(max_high_line_count) as sum_max_high_line_count,
sum(min_low_line_count) as sum_min_low_line_count,
sum(max_low_line_count) as sum_max_low_line_count
from
union all
group by
l_shipmode
order by
l_shipmode;
TPC-H Query 14
select
100.00 * sum(case
when p_type like 'PROMO%'
then l_extendedprice * (1 - l_discount)
else 0
end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
from
lineitem,
part
where
l_partkey = p_partkey
and l_shipdate < date('1995-09-01') + 30 DAYS;
Rewritten Query 14
select
l_orderkey,
l_linenumber
from
lineitem,
part
where
l_partkey = p_partkey
and l_shipdate < date('1995-09-01') + 30 DAYS
),
select
min(dividend) as low_dividend,
max(dividend) as up_dividend,
min(divisor) as low_divisor,
max(divisor) as up_divisor,
condWhereSat,
condWhereViol
from
(
select
l_orderkey,
l_linenumber,
100.00 * case
when p_type like 'PROMO%'
then l_extendedprice * (1 - l_discount)
else 0
end as dividend,
l_extendedprice * (1 - l_discount) as divisor,
sum(case when
(l_partkey = p_partkey
and l_shipdate < date('1995-09-01') + 30 DAYS)
then 0 else 1 end)
case when
(l_partkey = p_partkey
and l_shipdate < date('1995-09-01') + 30 DAYS)
from
lineitem li LEFT OUTER JOIN part ON l_partkey = p_partkey
where
) q
where
condWhereSat=1
group by l_orderkey,l_linenumber,condWhereSat,condWhereViol),
select
low_dividend,
up_dividend,
low_divisor,
up_divisor,
from
where condWhereViol = 0),
select
0 as low_dividend,
up_dividend,
0 as low_divisor,
up_divisor,
from
where condWhereViol >= 1)
select sum(low_dividend)/sum(up_divisor) as low_promo_revenue,
sum(up_dividend)/sum(low_divisor) as up_promo_revenue
from
union all
having sum(countConsistent)>0;
TPC-H Query 19
select
sum(l_extendedprice* (1 - l_discount)) as revenue
from
lineitem,
part
where
(
and p_brand = 'Brand#12'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')
and l_quantity >= 1 and l_quantity <= 1 + 10
and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG')
and l_shipinstruct = 'DELIVER IN PERSON'
)
or
(
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')
)
or
(
and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')
);
Rewritten Query 19
select
l_orderkey,
l_linenumber
from
lineitem,
part
where
(
)
or
(
)
or
(
)
),
select min(revenue) as low_revenue,
max(revenue) as up_revenue,
condWhereViol,
condWhereSat
from
(select
l_orderkey,
l_linenumber,
l_extendedprice* (1 - l_discount) as revenue,
sum (case when (
(
)
or
(
)
or
(
)
) then 0 else 1 end)
case when (
(
)
or
(
)
or
(
)
) then 1 else 0 end as condWhereSat
from
lineitem li
LEFT OUTER JOIN part ON p_partkey = l_partkey
where
) q
where condWhereSat = 1
group by l_orderkey,l_linenumber,condWhereViol,condWhereSat),
select low_revenue,
up_revenue,
from
where condWhereViol = 0),
select
0 as low_revenue,
up_revenue,
from
where condWhereViol >= 1)
select sum(low_revenue) as sum_low_revenue,
sum(up_revenue) as sum_up_revenue
from
union all
having sum(countConsistent)>0;
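As far as can be read from the listings in this appendix, the rewritten aggregate queries share one pattern: tuples are grouped by the key of the fact table, a lower and an upper contribution are computed per key group with min and max, the lower contribution is replaced by 0 when some tuple of the group violates the selection condition, and the per-group bounds are then summed. The following is a minimal sketch of that pattern, not the thesis's rewriting itself. It assumes a hypothetical four-column `lineitem` table with key `(orderkey, linenumber)`, non-negative prices, and a single selection on `shipmode`; it uses plain grouping in SQLite instead of the window functions and `union all` branches of the DB2 listings.

```python
import sqlite3

# Hypothetical toy data: (orderkey, linenumber) is the intended key,
# and the rows below violate it on purpose.
ROWS = [
    (1, 1, 100.0, "MAIL"),  # consistent key group that satisfies the filter
    (2, 1, 50.0, "MAIL"),   # key group (2, 1): two conflicting tuples,
    (2, 1, 70.0, "MAIL"),   #   both satisfy the filter
    (3, 1, 30.0, "MAIL"),   # key group (3, 1): one tuple satisfies the filter,
    (3, 1, 40.0, "RAIL"),   #   the other does not
]

# Bounds for: select sum(price) from lineitem where shipmode = 'MAIL'
RANGE_SUM = """
with per_key as (
    select orderkey,
           linenumber,
           -- min/max over the tuples of the group that satisfy the filter
           min(case when shipmode = 'MAIL' then price end) as min_price,
           max(case when shipmode = 'MAIL' then price end) as max_price,
           -- number of tuples in the group that violate the filter
           sum(case when shipmode = 'MAIL' then 0 else 1 end) as cond_viol,
           -- number of tuples in the group that satisfy the filter
           sum(case when shipmode = 'MAIL' then 1 else 0 end) as cond_sat
    from lineitem
    group by orderkey, linenumber
)
select
    -- a group with a violating tuple may contribute nothing in some repair
    sum(case when cond_viol = 0 then min_price else 0 end) as low_revenue,
    sum(max_price) as up_revenue
from per_key
where cond_sat >= 1
"""


def range_sum(rows):
    """Return (lower bound, upper bound) of the filtered sum over all repairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table lineitem ("
        "orderkey integer, linenumber integer, price real, shipmode text)"
    )
    conn.executemany("insert into lineitem values (?, ?, ?, ?)", rows)
    result = conn.execute(RANGE_SUM).fetchone()
    conn.close()
    return result


if __name__ == "__main__":
    print(range_sum(ROWS))
```

On this toy instance a repair keeps one tuple per key group, so the filtered sum ranges from 100 + 50 + 0 to 100 + 70 + 30, which is what the sketch computes. The case where no key group is consistent, which the listings guard with `having sum(countConsistent)>0`, is left out of the sketch.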
Appendix B
Design Advisor Indices
The following are the indices suggested by DB2's Design Advisor for the inconsistent databases of the
scalability experiment.
Inconsistent database size 1 GB, p = 5%, n = 2
-- index[1], 4.118MB
CREATE INDEX "DB2ADMIN"."IDX609161412250000" ON "DB2ADMIN"."CUSTOMER" ("C_MKTSEGMENT" ASC, "C_CUSTKEY" ASC)
ALLOW REVERSE SCANS ;
-- index[2], 2.884MB
CREATE INDEX "DB2ADMIN"."IDX609161413270000" ON "DB2ADMIN"."CUSTOMER" ("C_NATIONKEY" ASC, "C_CUSTKEY" ASC)
-- index[3], 28.802MB
CREATE INDEX "DB2ADMIN"."IDX609161413220000" ON "DB2ADMIN"."ORDERS" ("O_CUSTKEY" ASC, "O_ORDERKEY" ASC)
-- index[4], 0.196MB
CREATE INDEX "DB2ADMIN"."IDX609161412310000" ON "DB2ADMIN"."SUPPLIER" ("S_NATIONKEY" ASC, "S_SUPPKEY" ASC)
-- index[5], 10.751MB
CREATE INDEX "DB2ADMIN"."IDX609161415040000" ON "DB2ADMIN"."PART" ("P_TYPE" ASC, "P_PARTKEY" ASC)
-- index[6], 0.013MB
CREATE INDEX "DB2ADMIN"."IDX609161413420000" ON "DB2ADMIN"."REGION" ("R_NAME" ASC, "R_REGIONKEY" ASC)
-- index[7], 0.013MB
CREATE INDEX "DB2ADMIN"."IDX609161413510000" ON "DB2ADMIN"."NATION" ("N_REGIONKEY" ASC, "N_NATIONKEY" ASC)
-- index[8], 28.282MB
CREATE INDEX "DB2ADMIN"."IDX609161415260000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC, "O_ORDERDATE" ASC)
-- index[9], 39.142MB
CREATE INDEX "DB2ADMIN"."IDX609161417270000" ON "DB2ADMIN"."ORDERS" ("O_ORDERDATE" ASC,
"O_ORDERKEY" ASC, "O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[10], 63.013MB
CREATE INDEX "DB2ADMIN"."IDX609161421380000" ON "DB2ADMIN"."LINEITEM" ("L_SHIPDATE" ASC, "L_LINENUMBER" ASC,
"L_ORDERKEY" ASC) ALLOW REVERSE SCANS ;
-- index[11], 4.118MB
CREATE INDEX "DB2ADMIN"."IDX609161422320000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC, "C_MKTSEGMENT" ASC)
-- index[12], 2.884MB
CREATE INDEX "DB2ADMIN"."IDX609161424190000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC, "C_NATIONKEY" ASC)
-- index[13], 28.802MB
CREATE INDEX "DB2ADMIN"."IDX609161424240000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC, "O_CUSTKEY" ASC)
-- index[14], 39.142MB
CREATE INDEX "DB2ADMIN"."IDX609161425360000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC, "O_ORDERDATE" ASC,
"O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[15], 131.692MB
CREATE INDEX "DB2ADMIN"."IDX609161431410000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC, "L_QUANTITY" ASC)
-- index[16], 9.919MB
CREATE INDEX "DB2ADMIN"."IDX609161432260000" ON "DB2ADMIN"."PART" ("P_PARTKEY" ASC, "P_SIZE" ASC,
"P_BRAND" ASC, "P_CONTAINER" ASC) ALLOW REVERSE SCANS ;
-- index[1], 103.747MB
CREATE INDEX "DB2ADMIN"."IDX609161434540000" ON "DB2ADMIN"."LINEITEM" ("L_DISCOUNT" ASC,
"L_SHIPDATE" ASC, "L_QUANTITY" ASC, "L_EXTENDEDPRICE" ASC) ALLOW REVERSE SCANS ;
-- index[2], 210.294MB
CREATE INDEX "DB2ADMIN"."IDX609161446330000" ON "DB2ADMIN"."LINEITEM" ("L_QUANTITY" ASC, "L_SHIPDATE" ASC,
"L_DISCOUNT" ASC, "L_EXTENDEDPRICE" ASC, "L_LINENUMBER" ASC, "L_ORDERKEY" ASC) ALLOW REVERSE SCANS ;
-- index[3], 396.028MB
CREATE INDEX "DB2ADMIN"."IDX609161447560000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC, "L_SUPPKEY" ASC,
"L_ORDERKEY" ASC, "L_LINENUMBER" ASC) ALLOW REVERSE SCANS ;
-- index[4], 263.388MB
CREATE INDEX "DB2ADMIN"."IDX609161453480000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC, "L_QUANTITY" ASC)
-- index[1], 12.341MB
CREATE INDEX "DB2ADMIN"."IDX609161458010000" ON "DB2ADMIN"."CUSTOMER" ("C_MKTSEGMENT" ASC,
"C_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[2], 135.380MB
CREATE INDEX "DB2ADMIN"."IDX609161458080000" ON "DB2ADMIN"."LINEITEM" ("L_DISCOUNT" ASC, "L_SHIPDATE" ASC,
"L_QUANTITY" ASC, "L_EXTENDEDPRICE" ASC) ALLOW REVERSE SCANS ;
-- index[3], 8.646MB
CREATE INDEX "DB2ADMIN"."IDX609161459360000" ON "DB2ADMIN"."CUSTOMER" ("C_CUSTKEY" ASC,
"C_NATIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[4], 84.853MB
CREATE INDEX "DB2ADMIN"."IDX609161459280000" ON "DB2ADMIN"."ORDERS" ("O_CUSTKEY" ASC,
"O_ORDERKEY" ASC) ALLOW REVERSE SCANS ;
-- index[5], 0.583MB
CREATE INDEX "DB2ADMIN"."IDX609161458370000" ON "DB2ADMIN"."SUPPLIER" ("S_NATIONKEY" ASC,
"S_SUPPKEY" ASC) ALLOW REVERSE SCANS ;
-- index[6], 0.013MB
CREATE INDEX "DB2ADMIN"."IDX609161459480000" ON "DB2ADMIN"."REGION" ("R_NAME" ASC,
"R_REGIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[7], 0.013MB
CREATE INDEX "DB2ADMIN"."IDX609161459570000" ON "DB2ADMIN"."NATION" ("N_REGIONKEY" ASC,
"N_NATIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[8], 8.646MB
CREATE INDEX "DB2ADMIN"."IDX609161459580000" ON "DB2ADMIN"."CUSTOMER" ("C_NATIONKEY" ASC,
-- index[9], 32.243MB
CREATE INDEX "DB2ADMIN"."IDX609161501100000" ON "DB2ADMIN"."PART" ("P_TYPE" ASC,
"P_PARTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[10], 84.853MB
CREATE INDEX "DB2ADMIN"."IDX609161501320000" ON "DB2ADMIN"."ORDERS" ("O_ORDERKEY" ASC,
"O_ORDERDATE" ASC) ALLOW REVERSE SCANS ;
-- index[11], 26.794MB
CREATE INDEX "DB2ADMIN"."IDX609161502520000" ON "DB2ADMIN"."PARTSUPP" ("PS_PARTKEY" ASC, "PS_SUPPKEY" ASC,
"PS_SUPPLYCOST" ASC) ALLOW REVERSE SCANS ;
-- index[12], 12.341MB
"C_MKTSEGMENT" ASC) ALLOW REVERSE SCANS ;
-- index[13], 84.853MB
-- index[14], 115.099MB
-- index[15], 395.075MB
CREATE INDEX "DB2ADMIN"."IDX609161515590000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC,
"L_QUANTITY" ASC) ALLOW REVERSE SCANS ;
-- index[16], 29.751MB
"P_BRAND" ASC, "P_CONTAINER" ASC) ALLOW REVERSE SCANS ;
-- mqt[1], 1430.329MB
CREATE SUMMARY TABLE "DB2ADMIN"."MQT609161518000000" AS (SELECT Q4.C0 AS "C0", Q4.C1 AS "C1",
Q4.C2 AS "C2", Q4.C5 AS "C3", Q4.C4 AS "C4", Q4.C3 AS "C5", Q4.C6 AS "C6"
FROM TABLE(SELECT Q3.C0 AS "C0", SUM(Q3.C1) AS "C1", SUM(Q3.C2) AS "C2", Q3.C5 AS "C3", Q3.C4 AS "C4",
Q3.C3 AS "C5", COUNT(*) AS "C6"
FROM TABLE(SELECT Q1.L_SHIPMODE AS "C0",
CASE WHEN ((Q2.O_ORDERPRIORITY = '1-URGENT') OR (Q2.O_ORDERPRIORITY = '2-HIGH')) THEN 1 ELSE 0 END AS "C1",
CASE WHEN ((Q2.O_ORDERPRIORITY <> '1-URGENT') AND (Q2.O_ORDERPRIORITY <> '2-HIGH')) THEN 1 ELSE 0 END AS "C2",
Q1.L_RECEIPTDATE AS "C3", Q1.L_SHIPDATE AS "C4", Q1.L_COMMITDATE AS "C5"
FROM DB2ADMIN.LINEITEM AS Q1, DB2ADMIN.ORDERS AS Q2 WHERE (Q2.O_ORDERKEY = Q1.L_ORDERKEY)) AS Q3
GROUP BY Q3.C3, Q3.C4, Q3.C5, Q3.C0) AS Q4) DATA INITIALLY DEFERRED REFRESH IMMEDIATE IN USERSPACE1 ;
-- index[1], 990.099MB
"L_SUPPKEY" ASC, "L_ORDERKEY" ASC, "L_LINENUMBER" ASC) ALLOW REVERSE SCANS ;
-- index[2], 49.587MB
"P_CONTAINER" ASC, "P_BRAND" ASC) ALLOW REVERSE SCANS ;
-- index[3], 164.485MB
CREATE INDEX "DB2ADMIN"."IDX609161539380000" ON "DB2ADMIN"."MQT609161518000000"
("C3" ASC, "C0" ASC, "C5" ASC, "C4" ASC) ALLOW REVERSE SCANS ;
-- index[1], 41.126MB
-- index[2], 373.618MB
CREATE INDEX "DB2ADMIN"."IDX609161543080000" ON "DB2ADMIN"."LINEITEM" ("L_DISCOUNT" ASC,
"L_SHIPDATE" ASC, "L_QUANTITY" ASC, "L_EXTENDEDPRICE" ASC) ALLOW REVERSE SCANS ;
-- index[3], 1.923MB
-- index[4], 282.829MB
-- index[5], 28.802MB
-- index[6], 0.013MB
-- index[7], 0.013MB
-- index[8], 28.802MB
-- index[9], 107.474MB
-- index[10], 230.290MB
CREATE INDEX "DB2ADMIN"."IDX609161547520000" ON "DB2ADMIN"."PARTSUPP" ("PS_PARTKEY" ASC, "PS_SUPPKEY" ASC,
"PS_SUPPLYCOST" ASC) ALLOW REVERSE SCANS ;
-- index[11], 282.829MB
-- index[12], 1565.091MB
CREATE INDEX "DB2ADMIN"."IDX609161550570000" ON "DB2ADMIN"."LINEITEM" ("L_ORDERKEY" ASC,
"L_LINENUMBER" ASC, "L_SHIPDATE" ASC) ALLOW REVERSE SCANS ;
-- index[13], 41.126MB
-- index[14], 282.829MB
"O_ORDERKEY" DESC) ALLOW REVERSE SCANS ;
-- index[15], 1.923MB
CREATE INDEX "DB2ADMIN"."IDX609161554170000" ON "DB2ADMIN"."SUPPLIER" ("S_SUPPKEY" ASC,
"S_NATIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[16], 383.646MB
-- index[17], 99.169MB
CREATE INDEX "DB2ADMIN"."IDX609161601440000" ON "DB2ADMIN"."PART" ("P_PARTKEY" ASC,
"P_SIZE" ASC, "P_BRAND" ASC, "P_CONTAINER" ASC) ALLOW REVERSE SCANS ;
-- index[1], 990.228MB
"O_SHIPPRIORITY" ASC, "O_ORDERKEY" ASC, "O_CUSTKEY" ASC) ALLOW REVERSE SCANS ;
-- index[2], 82.243MB
-- index[3], 575.935MB
CREATE INDEX "DB2ADMIN"."IDX609161606040000" ON "DB2ADMIN"."ORDERS" ("O_CUSTKEY" ASC,
"O_ORDERKEY" ASC) ALLOW REVERSE SCANS ;
-- index[4], 3.841MB
-- index[5], 57.599MB
-- index[6], 214.939MB
-- index[7], 0.013MB
-- index[8], 0.013MB
-- index[9], 178.575MB
CREATE INDEX "DB2ADMIN"."IDX609161609310000" ON "DB2ADMIN"."PARTSUPP" ("PS_PARTKEY" ASC,
"PS_SUPPKEY" ASC, "PS_SUPPLYCOST" ASC) ALLOW REVERSE SCANS ;
-- index[10], 575.935MB
-- index[11], 782.731MB
-- index[12], 919.810MB
"L_SHIPINSTRUCT" ASC, "L_SHIPMODE" ASC, "L_QUANTITY" ASC) ALLOW REVERSE SCANS ;
-- index[13], 82.243MB
-- index[14], 1462.771MB
CREATE INDEX "DB2ADMIN"."IDX609161615450000" ON "DB2ADMIN"."LINEITEM" ("L_LINENUMBER" ASC,
"L_RECEIPTDATE" ASC, "L_COMMITDATE" ASC, "L_ORDERKEY" ASC) ALLOW REVERSE SCANS ;
-- index[15], 575.935MB
"O_ORDERKEY" DESC) ALLOW REVERSE SCANS ;
-- index[16], 57.599MB
-- index[17], 3.841MB
CREATE INDEX "DB2ADMIN"."IDX609161617540000" ON "DB2ADMIN"."SUPPLIER" ("S_SUPPKEY" ASC,
"S_NATIONKEY" ASC) ALLOW REVERSE SCANS ;
-- index[18], 0.013MB
CREATE INDEX "DB2ADMIN"."IDX609161622150000" ON "DB2ADMIN"."NATION" ("N_NATIONKEY" ASC,
"N_NAME" ASC) ALLOW REVERSE SCANS ;
-- index[19], 198.333MB
CREATE INDEX "DB2ADMIN"."IDX609161625390000" ON "DB2ADMIN"."PART" ("P_PARTKEY" ASC,
"P_SIZE" ASC, "P_BRAND" ASC, "P_CONTAINER" ASC) ALLOW REVERSE SCANS ;
