Efficient Query Processing Over Inconsistent Databases
by
Ariel Damian Fuxman
A thesis submitted in conformity with the requirements
for the degree of Ph.D. in Computer Science
Graduate Department of Computer Science
University of Toronto
Copyright © 2007 by Ariel Damian Fuxman
Abstract
Efficient Query Processing Over Inconsistent Databases
Ariel Damian Fuxman
Ph.D. in Computer Science
Graduate Department of Computer Science
University of Toronto
2007
Although integrity constraints have long been used to maintain data consistency, there
are situations in which they may not be enforced or satisfied. In this thesis, we present
ConQuer, a system for efficient and scalable answering of SQL queries on databases
that may violate a set of constraints. ConQuer permits users to postulate a set of key
constraints together with their queries. The system rewrites the queries to retrieve all
(and only) data that is consistent with respect to the constraints. The rewriting is into
SQL, so the rewritten queries can be efficiently optimized and executed by commercial
database systems.
The problem of obtaining consistent answers for primary key constraints and Select-
Project-Join (SPJ) queries is known to be intractable in general. However, we identify
a large and practical class of SPJ queries for which the problem is tractable. For this
class of queries, we provide a query rewriting algorithm that can be executed in linear
time in the size of the query. We consider SPJ queries that may have either set or bag
semantics. For the latter case, the queries may also have grouping and aggregation. We
show the maximality of the class of queries, in the sense that minimal relaxations of its
conditions may lead to intractability. Finally, we study the efficiency and scalability of the
query rewritings on a commercial database system. The study shows that the overhead
of the rewritings is reasonable, when we consider the original (non-rewritten) queries
as a baseline. The experiments use representative queries from TPC-H (the standard
benchmark for decision support systems) and databases of up to 20 GB.
A mis padres Silvia y Miguel
Acknowledgements
First and foremost, I would like to thank my supervisor, Renee J. Miller, for her constant
encouragement and support. During these years, I have benefited tremendously from her
remarkable vision and experience. She has been the greatest mentor, always available for
discussion and guidance. I will always be grateful for the endless hours she devoted to
reading and correcting my drafts, and for the numerous times she stayed at the university
until very late to help me out before conference deadlines.
I am grateful to the members of my committee (John Mylopoulos, Mariano Consens,
and Thodoros Topaloglou) for thoroughly reading my thesis and for their valuable feed-
back. I also thank Leopoldo Bertossi for serving as the external reviewer of the thesis, and
for coming to Canada during his sabbatical in Chile with the sole purpose of attending
my thesis defense.
I am indebted to Alberto Mendelzon, who sadly passed away the year before I com-
pleted my thesis. Alberto was not only an outstanding researcher, but also the warmest
and most generous person. At the beginning of my stay in Canada, I needed a job
offer in order to obtain permanent resident status. Alberto hardly knew me at that time
(I was then not even a member of the Database Group), but as soon as he heard about
my situation, he offered me a position as Research Associate in his group.
In 2004, I had the opportunity of visiting Phokion Kolaitis and Wang-Chiew Tan at
University of California at Santa Cruz. It was a joy to work with both of them. They
were also wonderful hosts, and I thank them for their hospitality. During the summer of
2005, I did an internship with the Clio group at IBM Almaden, working with Mauricio
Hernandez, Lucian Popa, and Howard Ho. I very much enjoyed my time at Almaden,
where I had an opportunity to learn how research is done at an industrial lab. Special
thanks go to Mauricio for his unwavering support during the internship.
For the implementation of the ConQuer system, I received invaluable help from my
brother Diego. I convinced him to do his final undergraduate project on the topic of
consistent query answering, and his contribution was fundamental for the demo that we
gave at VLDB in Trondheim. Diego, I am proud of your work! I also thank Jiang Du for
his help in building up the experimental framework used in Chapter 7.
Many people helped to make these years in Toronto a very enjoyable experience. I
especially thank the Latin American gang (Sebastian Sardiña, Andres Lagar-Cavilla,
Carlos Hurtado, Blas Melissari, Flavio Rizzolo, Pablo Sala, and many others) for their
friendship. I will always remember our long, heated debates at the Graduate Lounge,
which gained us the reputation of being the loudest group of people in the Department.
I am also grateful to Patricia Rodriguez Gianolli for her support during the last year of
my Ph.D.
And last, but definitely not least, I would like to thank my parents, Silvia and Miguel,
and my brothers, Adrian and Diego, for always being there, despite the distance: without
their love and support none of this would have ever been possible.
Contents
1 Introduction
1.1 Motivation
1.2 Consistent Query Answering
1.3 Contributions
1.4 Organization of the Document
2 Formal Framework
2.1 Repairs
2.2 Query Answering Semantics
2.3 Query Rewritings
2.4 Constraints
2.5 Related Work
3 Rewritings for Conjunctive Queries
3.1 A Broad Class of First-Order Rewritable Queries
3.1.1 Notation for Conjunctive Queries
3.1.2 Join Graph
3.1.3 The Class C_forest of First-Order Rewritable Queries
3.2 Query Rewriting Algorithm
3.3 Correctness of the Algorithm
3.3.1 Properties of Repairs
3.3.2 A Structural Property of C_forest
3.3.3 A Pessimistic Repair
3.3.4 Correctness of RewriteLocal
3.3.5 Correctness of RewriteTree
3.3.6 Correctness of RewriteForest
3.4 Related Work
4 Rewritings for Queries with Grouping and Aggregation
4.1 Formal Language
4.2 Algorithms
4.2.1 Queries with Bag Semantics
4.2.2 Queries with the sum, min, and max Functions
4.3 Correctness of the Algorithms
4.3.1 Building Upon First-Order Rewritings
4.3.2 An Optimistic Repair
4.3.3 Sound Ranges
4.3.4 Tight Ranges
4.3.5 Putting It All Together
4.4 Related Work
5 Complexity-Theoretic Analysis
5.1 Minimal Relaxations of C_forest
5.2 A Dichotomy Result
5.2.1 The Class C
5.2.2 Basic Intractable Cases
5.2.3 Generalizing the Basic Cases
5.3 Related Work
6 ConQuer: System Implementation and SQL Rewritings
6.1 System Implementation
6.2 ConQuer Rewritings for Queries without Aggregation
6.2.1 Rewriting Algorithm
6.2.2 Examples
6.3 ConQuer Rewritings for SPJ Queries with Aggregation
6.3.1 Rewriting Algorithm
6.3.2 Examples
6.4 Exploiting Precomputed Annotations
6.5 Related Work
7 Experimental Analysis
7.1 Experimental Framework
7.1.1 System and Database Manager Configuration
7.1.2 Inconsistent Database Instances
7.1.3 Workload
7.2 Experimental Results
7.2.1 Scalability
7.2.2 Effect of Degree of Inconsistency
8 Conclusions and Future Work
Bibliography
A TPC-H Queries and their Rewritings
B Design Advisor Indices
Chapter 1
Introduction
1.1 Motivation
The presence of inconsistent data is known to be a major problem in enterprises. How-
ever, data analysts often make business decisions based on inconsistent data; and their
database systems rarely give any warning or indication about this situation. In fact,
current database management systems are largely unable to give such a warning because
they rely upon the fundamental assumption that the underlying data is consistent. In
this thesis, we tackle this problem by providing a set of tools that enable users to obtain
meaningful answers from databases even if they are partially inconsistent.
Integrity constraints have long been used by database management systems in order
to maintain data consistency. The typical data design process focuses on developing a set
of constraints that ensure that every possible database reflects a valid, consistent state
of the world. However, integrity constraints may not always be enforced or satisfied for
a number of reasons. For example, when data is integrated from multiple sources, each
source may satisfy a constraint (for example, a key constraint), but the merged data may
not (for example, if the same key value exists in multiple sources). More generally, when
data is exchanged between independently designed sources with different constraints, the
exchanged data may not satisfy the constraints of the destination schema. As another
example, in some environments, checking the consistency of constraints may be too ex-
pensive, particularly for workloads with high update rates. Hence, the database may
become inconsistent with respect to the (unenforced) integrity constraints. In addition
to these long-standing problems, the trend toward autonomous computing is making the
need to manage inconsistent data more acute. In autonomous environments, we can no
longer assume that data are married with a single set of constraints that define their
semantics. As constraints are used in an increasing number of roles (from modelling
the query capabilities of a system, to defining mappings between independent sources),
there is an increasing number of applications in which data must be used with a set
of independently designed constraints. In such applications, a static approach where
consistency (with respect to a fixed set of constraints) is enforced on the database may
not be appropriate. Rather, a dynamic approach in which inconsistent data is tolerated,
but consistency is taken into account at query time, permits the constraints to evolve
independently from the data.
One strategy for managing inconsistent databases is data cleaning [DJ03]. Data
cleaning techniques seek to identify and correct errors in the data, and can be used to
restore an inconsistent database to a consistent state. Data cleaning, when applicable,
can be very successful. However, it is necessarily a semiautomatic process, which makes
it infeasible or unaffordable for some applications. Furthermore, committing to a single
cleaning strategy may not always be appropriate. A user may wish to experiment with
different cleaning strategies, or may desire to retain all data, even inconsistent data,
for tasks such as lineage tracing. Finally, data cleaning is only applicable to data that
contains errors. However, the violation of a constraint may also indicate that the data
contains exceptions, that is, clean data which simply does not satisfy a constraint.
In this thesis, we consider inconsistent databases that may violate a set of primary
key constraints. This type of constraint (together with foreign key constraints) is the
most commonly used in commercial database systems. Furthermore, databases that
violate primary key constraints are ubiquitous in enterprises. For example, in the domain
of Customer Relationship Management (CRM), data sources often contain conflicting
information about the same customer. Notably, commercial CRM tools provide limited
support for merging tuples corresponding to the same customer into one tuple in the
integrated database. Although they typically support some form of conflict resolution
rules (e.g., rules that take the average between two conflicting incomes of the same
customer), these rules may be difficult to design. In the absence of conflict resolution
rules, some CRM tools transfer all conflicting tuples to the integrated database. Thus,
even if the sources satisfy the key constraints, the integrated database may not.
1.2 Consistent Query Answering
While it is well known how to answer queries over consistent databases, we must give
a clear and precise semantics to the notion of a meaningful answer obtained from an
inconsistent database. In this thesis, we make use of a semantics based upon the notions
of possible worlds and certain answers, concepts that are widely used not only in the
context of database theory and data integration [Lip79, Lip81, AKG87, AD98], but also
in the field of knowledge representation [Lev81, Moo85]. These notions were first adapted
to the context of inconsistent databases by Arenas, Bertossi and Chomicki [ABC99], who
defined the semantics of consistent query answers.
The semantics of consistent query answers relies on the intuition that an inconsistent
database can be "cleaned" (or "repaired") by adding or deleting tuples in such a way that
the resulting database satisfies some given integrity constraints. The semantics is agnostic
about which tuples should be added or removed. Therefore, each inconsistent database
may be associated with more than one clean, consistent database. A consistent answer is
then an answer that is obtained from every possible consistent database. Intuitively, this
means that the consistent answers are obtained no matter how the database is cleaned.
The semantics of consistent query answers provides a sound and elegant basis for the
study of the problem of query answering over inconsistent databases. However, despite
considerable work on its theoretical underpinnings [ABC99, CB00, ABC+03b, CLR03a,
CLR03b, BB03a, BB03b, CM05], to the best of our knowledge, little work has been
done on its practical applications. A key contribution of this thesis is to bridge the
gap between theory and practice by providing an efficient and scalable system to obtain
consistent query answers from inconsistent databases. In particular, we report the design
and evaluation of ConQuer, a system for managing inconsistent data.¹ In ConQuer, a
user may postulate a set of integrity constraints, possibly at query time, and the system
automatically retrieves all (and only) the query answers that are consistent with respect
to the constraints. ConQuer also helps users take advantage of the query results in order
to interactively clean the inconsistent database.
¹ ConQuer stands for "Consistent Querying". ConQuer's web page can be found at www.cs.toronto.edu/db/conquer.
The major challenge in consistent query answering is the potentially huge number
of consistent databases that can be associated with a given inconsistent database. In
the case of primary key constraints, which are the focus of this thesis, the number of
consistent databases can be exponential in the size of the inconsistent database. This problem
is tackled in ConQuer by implementing a query rewriting approach. Given a query q,
ConQuer rewrites q into another query Q that has the following property: for every
inconsistent database, the rewritten query Q retrieves the consistent answers for the original
query q. The rewriting is done independently of the data, and works on every inconsistent
database. This approach has two fundamental advantages. First, it avoids constructing
the (potentially huge number of) consistent databases associated with the inconsistent
database. Second, the rewritten query is a SQL query that can be executed using any
commercial relational database management system (in ConQuer, we use IBM's DB2).
In an extensive set of experiments, reported in Chapter 7, we show that the overhead
in the execution of the rewritten queries is reasonable, when compared to the original
(non-rewritten) ones.
      emplKey   salary
t1    John      1000
t2    John      2000
t3    Mary      1000
Figure 1.1: An inconsistent database
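The growth in the number of consistent databases can be illustrated with a small sketch. The Python code below is not part of ConQuer; it is a minimal illustration that assumes a single relation whose first attribute is the key, and the names `repairs` and `conflicting_instance` are hypothetical. It builds every consistent database by keeping exactly one tuple per key value, which is precisely the work that the query rewriting approach avoids.

```python
from collections import defaultdict
from itertools import product


def repairs(tuples):
    """Yield every consistent database obtained by keeping exactly one
    tuple per key value. `tuples` is an iterable of (key, value) pairs
    (an assumed layout, chosen for illustration)."""
    groups = defaultdict(list)
    for key, value in tuples:
        groups[key].append((key, value))
    # One choice per key group; the Cartesian product gives all repairs.
    for choice in product(*groups.values()):
        yield frozenset(choice)


# The inconsistent database of Figure 1.1.
employee = [("John", 1000), ("John", 2000), ("Mary", 1000)]
figure_repairs = list(repairs(employee))


def conflicting_instance(n):
    """n key values, each with two conflicting tuples (2n tuples in total)."""
    return [(f"k{i}", v) for i in range(n) for v in (0, 1)]


# The number of repairs doubles with every additional conflicting key value.
counts = [sum(1 for _ in repairs(conflicting_instance(n))) for n in range(1, 6)]
print(len(figure_repairs), counts)
```

With two conflicting tuples per key value, an instance of 2n tuples has 2^n consistent databases, so explicit enumeration quickly becomes infeasible.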
In the next example, we illustrate the semantics of consistent answers and the query
rewriting approach.
Example 1.1. Consider the database of Figure 1.1, which contains information about
employees and their salaries. In particular, the schema of the database has one relation
called employee, with two attributes: emplKey (the name of the employee) and salary.
Assume that a user specifies that the key of the relation should be the attribute
emplKey. Note that the database violates this key constraint, perhaps because its data
has been integrated from many operational sources. In particular, there are two tuples
for employee John, one stating that he makes a salary of 1000, and the other stating that
he makes a salary of 2000. Suppose that we do not know which one of these alternatives is
correct, but we still want to be able to draw meaningful answers from the database. Let
us consider the consistent databases (i.e., databases that satisfy the key constraint) that
can be built from the inconsistent database. We would like these databases to be not
only consistent, but also as close as possible to the inconsistent database. This leaves
us with two possible consistent databases (shown in Figure 1.2), obtained by deleting
exactly one tuple for John in each of them.
Consistent database 1          Consistent database 2
      emplKey   salary               emplKey   salary
t1    John      1000           t2    John      2000
t3    Mary      1000           t3    Mary      1000
Figure 1.2: Consistent databases for the inconsistent database of Figure 1.1
Consider a query $q_1$ that retrieves the employees whose salary is less than
or equal to 1000.
$q_1$: select distinct emplKey
       from employee
       where salary <= 1000
If we execute this query directly over the inconsistent database, we obtain {John, Mary}.
Intuitively, this is not a consistent answer because it may be the case that John has a
salary over 1000. In fact, if the consistent database turns out to be the database on the
right-hand side of Figure 1.2, then John would not appear in the answer.
One strategy to obtain the consistent answer would be to apply query $q_1$ to each
of the consistent databases of Figure 1.2. While this may be feasible in this simple
example, it is clearly impractical when the number of tuples violating the constraint
grows. In particular, even for the schema and single constraint of this example, the
number of consistent databases can be exponential in the size of the inconsistent database.
For this reason, in ConQuer, we never build the consistent databases explicitly. Instead,
we follow a query rewriting approach, where we rewrite the original query ($q_1$ in this
case) into another query that can be executed directly on the inconsistent database and
is guaranteed to always return the consistent answers for the original query.
In this case, it is quite simple to obtain a rewriting of $q_1$. Notice that John appears
associated with two different salaries in the inconsistent database: one satisfying the
query, the other not. This suggests that in the rewriting we should return the employees
that satisfy $q_1$ (i.e., have a salary less than or equal to 1000) in every tuple of the
inconsistent database where they appear. This can be obtained using the following
query:
$Q_1$: select distinct e.emplKey
       from employee e
       where e.salary <= 1000
         and not exists (select *
                         from employee e2
                         where e2.emplKey = e.emplKey
                           and e2.salary > 1000)
Notice the use of a nested subquery related by not exists. The purpose of this
subquery is to filter out those key values that satisfy $q_1$ in some tuples, but violate it in
others. In our example, this subquery filters John out of the answer because he appears
in tuple t2 with a salary above 1000.
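The difference between the original query and its rewriting can be checked directly on the data of Figure 1.1. The following is a minimal sketch that assumes SQLite (through Python's standard sqlite3 module) in place of the DB2 system used by ConQuer; the table is created without a key so that the inconsistent data can be loaded, and the inner alias e2 is an illustrative choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No primary key is declared, so the conflicting tuples for John can coexist.
conn.execute("create table employee (emplKey text, salary integer)")
conn.executemany(
    "insert into employee values (?, ?)",
    [("John", 1000), ("John", 2000), ("Mary", 1000)],
)

# Original query: evaluated naively on the inconsistent database.
q1 = "select distinct emplKey from employee where salary <= 1000"

# Rewritten query: keeps a key value only if no tuple with that key
# violates the selection condition.
Q1 = """
select distinct e.emplKey
from employee e
where e.salary <= 1000
  and not exists (select *
                  from employee e2
                  where e2.emplKey = e.emplKey
                    and e2.salary > 1000)
"""

direct = sorted(row[0] for row in conn.execute(q1))
consistent = sorted(row[0] for row in conn.execute(Q1))
print(direct)      # ['John', 'Mary']
print(consistent)  # ['Mary']
```

The rewritten query touches only the inconsistent table; no consistent database is materialized at any point.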
Despite the simplicity of the previous example, it has been shown in the literature
[CLR03a, CM05] that there are Select-Project-Join queries for which there is no rewriting
into SQL (under a very likely complexity-theoretic assumption). However, we observe
that the presence of these negative results does not necessarily preclude the existence of
classes of queries for which there is a SQL rewriting. In fact, in Chapter 3, we show a
large and practical class of Select-Project-Join queries for which there is a SQL rewriting.
In Chapter 5, we show that this is a maximal class of queries, in the sense that minimal
relaxations of its conditions lead to queries for which there is no SQL rewriting.
Most of the previous work on consistent query answering (except [ABC+03b]) focuses
on queries with set semantics and no aggregation. However, practical query languages
like SQL have bag semantics (duplicates are not eliminated unless explicitly requested),
and support aggregation functions and grouping of results. In Chapter 2, we present
a generalization of the semantics of consistent answers for queries with bag semantics,
grouping and aggregation. In Chapter 4, we provide query rewritings that work under
this semantics.
In the thesis, we are concerned not only with the correctness of the rewritings (i.e.,
ensuring that they retrieve all and only the consistent answers), but also with their
efficiency when executed using existing database technology. We address efficiency issues
and their empirical validation in Chapters 6 and 7.
1.3 Contributions
The main contributions of this thesis are the following:
• We identify a large and practical class of Select-Project-Join queries for which the
  problem of computing consistent answers is tractable. The class consists of queries
  that can have two kinds of joins. First, they can have joins between key attributes.
  Second, they can have joins from non-key attributes of a relation (possibly a foreign
  key) to the primary key of another relation. Arguably, these two types of joins are
  the most commonly used in practice (and certainly the most common in industry
  standard benchmarks like TPC-H). (Chapter 3)
• For the class of tractable queries that we identify, we provide a query rewriting
  algorithm that produces a query in first-order logic that returns the consistent answers.
  The algorithm runs in polynomial time in the size of the query. The rewritings
  are sound and complete, in the sense that they return all (and only) the consistent
  answers. Since first-order queries can be written in SQL, the rewritings in first-order
  logic are a first step towards reusing existing commercial database technology.
  This work was first published at the International Conference on Database Theory
  (ICDT) [FM05], and an extended journal version has been invited to the Journal
  of Computer and System Sciences (JCSS) [FM06]. (Chapter 3)
• We consider not only Select-Project-Join queries with set semantics, but also queries
  with bag semantics, grouping and aggregation. These extensions are needed to
  enable practical use in decision support applications. For this purpose, we extend
  the semantics of consistent answers originally proposed by Arenas, Bertossi and
  Chomicki [ABC99, ABC+03b]. We provide sound and complete algorithms under
  this semantics for the most common SQL aggregation functions (count, min,
  max, sum). This work has been published at the ACM International Conference
  on the Management of Data (SIGMOD) [FFM05a]. (Chapters 2 and 4)
• We show a large class of Select-Project-Join queries for which the conditions of
  applicability of our rewriting algorithm are not only sufficient but also necessary.
  In particular, we show a class in which the problem of computing the consistent
  answers is coNP-complete (and, assuming $P \neq NP$, inexpressible in first-order logic)
  for every query of the class that violates the conditions of the class of queries for
  which we give a rewriting algorithm. This type of result is stronger than the
  complexity results given in the consistent query answering literature [CLR03a, CM05],
  which consist of showing intractability of a class by exhibiting at least one query for
  which the problem is intractable. As a corollary of our result, we get a dichotomy
  for this class of queries: given a query q in our class, either the problem of
  computing the consistent answers for q is first-order rewritable (and thus it is in PTIME),
  or it is a coNP-complete problem. (Chapter 5)
• We present the implementation of ConQuer, a system for querying inconsistent
  databases. We also explain in detail the SQL rewritings produced by the system.
  ConQuer has been demonstrated at the International Conference on Very Large
  Data Bases (VLDB) [FFM05b]. (Chapter 6)
• We study the running time of ConQuer's SQL rewritings on a commercial database
  system, in particular IBM DB2. To this end, we present a detailed performance
  study using the data and queries of the TPC-H decision support benchmark. The
  study focuses on the overhead of the rewritings, using the original (non-rewritten)
  queries as a baseline. We study the scalability of the approach (with databases of
  up to 172 million tuples), and the effect of the degree of inconsistency (in terms
  of the percentage of tuples that are inconsistent and the number of conflicting
  tuples per key value). The experiments show that our approach can be applied to
  large databases, several orders of magnitude larger than those considered in other
  approaches for querying inconsistent databases. (Chapter 7)
1.4 Organization of the Document
The rest of this document is organized as follows. In Chapter 2, we present the formal
framework for querying inconsistent databases that will be used throughout the thesis.
In Chapters 3 and 4, we present query rewritings and focus on proving their correctness.
In Chapter 3, we consider a large and practical class of conjunctive queries (that is,
Select-Project-Join queries) and present rewritings in first-order logic. In Chapter 4, we
consider queries with bag semantics, grouping and aggregation, and present rewritings
in an extension of first-order logic with grouping and aggregation functions. In Chapter
5, we show the maximality of the class of queries that is the input to the rewriting
algorithms.
In Chapter 6, we present ConQuer, a system for efficiently querying inconsistent
databases. We present in detail the SQL query rewritings produced by ConQuer for
queries with and without aggregation. The efficiency of these rewritings is empirically
validated in Chapter 7 with an extensive set of experiments. We present related work in
separate sections at the end of each of the chapters. In Chapter 8, we finish the document
with conclusions and directions for future work.
Chapter 2
Formal Framework
In this chapter, we present the formal framework that will be used throughout the thesis.
In this framework, an inconsistent database is associated with a space of consistent
databases called repairs. In Section 2.1, we formally define the notion of repair. Then, in
Section 2.2, we introduce the semantics for query answering over inconsistent databases.
This semantics involves the exploration of all repairs of an inconsistent database. Since
the number of repairs can be very large, in this thesis we advocate a query rewriting
approach, where queries are rewritten in such a way that their consistent answer can be
obtained by posing another query directly on the inconsistent database, without explicitly
building any repair. In Section 2.3, we formally define the notion of a query rewriting.
Finally, in Section 2.4, we introduce the integrity constraints that are the focus of this
thesis.
2.1 Repairs
A schema R is a finite collection of relation symbols, each of which has an associated
arity. A database instance (or database) I over R is a function that associates each
relation symbol r of R to a relation I(r). A relation I(r) of arity k is a set of k-tuples
whose elements belong to some underlying fixed domain.¹ Whenever it is clear from
context, we will abuse notation and use the same symbol r to denote both a relation
symbol and a relation. Given a tuple $\vec{t}$ occurring in relation I(r), we denote by
$r(\vec{t})$ the association between $\vec{t}$ and r.
¹ Although we will consider both set and bag semantics for queries, we always assume the relations of
a database instance (including inconsistent databases) to be sets.
A database instance I is consistent with respect to a set of integrity constraints $\Sigma$ if
I satisfies $\Sigma$ in the standard model-theoretic sense, that is, $I \models \Sigma$. (As customary, an
integrity constraint may be any first-order formula [AHV95].) Throughout this thesis,
we will consider databases that may violate a given set of integrity constraints. That is,
given R and a set of integrity constraints $\Sigma$ over R, a database I may be inconsistent with
respect to $\Sigma$, that is, $I \not\models \Sigma$.
Intuitively, we will assume that an inconsistent database can be "cleaned" (or
"repaired") by adding or deleting tuples in such a way that the resulting database satisfies
the given integrity constraints. We will be agnostic about which tuples should be added
or removed. Therefore, each inconsistent database may be associated with more than one
possible clean, consistent database. Furthermore, no matter how the clean databases are
obtained, we would like them to be "as close as possible" to the original, inconsistent
database (that is, to minimize the number of tuples that are added or removed). We will
call each consistent database a repair.
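The notion of repair was originally introduced by Arenas, Bertossi and Chomicki
[ABC99]. A repair is a database instance that satisfies the given integrity constraints,
and which has a minimal distance to the inconsistent database. The distance between
two database instances $I$ and $I'$ is their symmetric difference,
$\Delta(I, I') = (I - I') \cup (I' - I)$.
Accordingly, a database instance $I'$ over R is a repair of $I$ with respect to $\Sigma$ if it is an
instance such that $I' \models \Sigma$ and $\Delta(I, I')$ is minimal under set inclusion among the
instances that satisfy $\Sigma$.
[...]
A tuple $\vec{t}$ is a consistent answer for a query $q$ on $I$ with respect to $\Sigma$ if $\vec{t} \in q(I')$
for every repair $I'$ of $I$ with respect to $\Sigma$; we denote this by $\vec{t} \in consistent_\Sigma(q, I)$.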
This definition was originally given by Arenas, Bertossi and Chomicki [ABC99]. It is
based on the semantics of certain answers [Lip79, Lip81, AKG87] that has been used in
database theory, and possible worlds, which is well-known in knowledge representation
[Lev81]. In the case of consistent answers, the space of possible worlds corresponds to
the repairs of the inconsistent database.
Example 2.1. (continued) Consider a query that retrieves all the employees from
the database, expressed as $q_1(e) = \exists s.\, \mathit{employee}(e, s)$. Recall that there are two
repairs of I wrt $\Sigma$: $\mathcal{I}_1 = \{\mathit{employee}(\mathit{John}, 1000), \mathit{employee}(\mathit{Mary}, 1000)\}$ and
$\mathcal{I}_2 = \{\mathit{employee}(\mathit{John}, 2000), \mathit{employee}(\mathit{Mary}, 1000)\}$. The result of applying $q_1$ on both $\mathcal{I}_1$
and $\mathcal{I}_2$ is $\{(\mathit{John}), (\mathit{Mary})\}$. Thus, the consistent answers for $q_1$ on I are the tuples
(John) and (Mary).
Now, consider a query that retrieves employees together with their salaries, expressed
as $q_2(e, s) = \mathit{employee}(e, s)$. Notice that $q_2$ is the identity on the repairs. Thus, the
consistent answer can be obtained as the intersection of $\mathcal{I}_1$ and $\mathcal{I}_2$. In consequence, the only
consistent answer for $q_2$ on I is (Mary, 1000). Notice that the tuples (John, 1000) and
(John, 2000) are not consistent answers. The reason is that neither of them is present
in both repairs. Intuitively, this reflects the fact that John's salaries are inconsistent data,
and we do not want to retrieve possibly erroneous results.
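The intersection-over-repairs semantics of this example can be checked by brute force. The following is a minimal Python sketch, not part of ConQuer: it assumes a single relation stored as a set of tuples whose first `key_len` positions form the key, and it assumes the inconsistent instance is the one implied by the two repairs above. The helper names `repairs` and `consistent_answers` are hypothetical. Enumerating repairs is exponential in general, which is exactly what the query rewriting approach of this thesis avoids.

```python
from itertools import product

def repairs(instance, key_len=1):
    """Enumerate the repairs of one relation under a single key constraint.

    Assumption: each repair keeps exactly one tuple per key value.
    """
    groups = {}
    for t in sorted(instance):
        groups.setdefault(t[:key_len], []).append(t)
    for choice in product(*groups.values()):
        yield set(choice)

def consistent_answers(query, instance, key_len=1):
    """Keep only the answers that the query returns on every repair."""
    results = [query(r) for r in repairs(instance, key_len)]
    return set.intersection(*results)

# Instance implied by the two repairs of Example 2.1 (an assumption).
employee = {("John", 1000), ("John", 2000), ("Mary", 1000)}

def q1(rel):
    # q1(e) = exists s. employee(e, s)
    return {(e,) for (e, s) in rel}

def q2(rel):
    # q2(e, s) = employee(e, s)
    return set(rel)

print(consistent_answers(q1, employee))
print(consistent_answers(q2, employee))
```

On this instance the sketch returns both employees for q1 and only the tuple for Mary for q2, in agreement with the example.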
For convenience, we will use the following notation for the consistent answers of
Boolean queries.
Definition 2.3. Let R be a schema. Let $\Sigma$ be a set of integrity constraints. Let
I be a database instance over R. Let q be a Boolean query over R. We say that
[...] $\mathit{consistent}_\Sigma(q, I) = \mathit{false}$. While for the former, every repair must satisfy the query,
for the latter it suffices to have just one non-satisfying repair. This is not intrinsic to
Boolean queries: by Definition 2.2, it is also the case that $\vec{t} \notin \mathit{consistent}_\Sigma(q, I)$ if there
exists at least one repair $\mathcal{I}$ such that $\vec{t} \notin q(\mathcal{I})$.
The definition of consistent answers is independent of the language used to express
the input query q, and it makes perfect sense for queries that, for example, return tuples
from the active domain of the database. However, for queries that compute aggregates
over groups of tuples, it may be useful to relax this definition, as we motivate next.
Example 2.1. (continued) Let $q_3(s, v)$ be a SQL query that counts the number of
occurrences of each salary in the database:
select salary as s, count(*) as v
from employee
group by salary
Recall that there are two repairs of I with respect to $\Sigma$: $\mathcal{I}_1 = \{\mathit{employee}(\mathit{John}, 1000),
\mathit{employee}(\mathit{Mary}, 1000)\}$ and $\mathcal{I}_2 = \{\mathit{employee}(\mathit{John}, 2000), \mathit{employee}(\mathit{Mary}, 1000)\}$. The
result of applying query $q_3$ to the repairs is the following: $q_3(\mathcal{I}_1) = \{(1000, 2)\}$, and
$q_3(\mathcal{I}_2) = \{(1000, 1), (2000, 1)\}$. Since the intersection of these results is empty, according
to Definition 2.2, the set of consistent answers for $q_3$ is empty. However, notice that the
salary 1000 appears in every query result (but together with a different number for the
count of occurrences). Intuitively, it would be desirable to report this salary in the result.
In the previous example, the value 1000 appears in every query result. However, it
is associated with a different count in each of them. How do we report this
count? In the semantics that we define next, we employ tight bounds
for this purpose. In this particular example, we will say that the minimum (greatest
lower bound) is one, since the salary 1000 is counted exactly once in $q_3(\mathcal{I}_2)$; and that the
maximum (lowest upper bound) is two, since salary 1000 is counted exactly twice in $q_3(\mathcal{I}_1)$.
In the following definition, we formalize this notion. The definition applies to any query
that computes an aggregate over a group (in our example, the aggregate is the count
of occurrences of each salary). We will denote with $\mathit{aggconsistent}_\Sigma$ [...] $(\vec{t}, glb, lub) \in
\mathit{aggconsistent}_\Sigma$ [...] $(\vec{t}, lub) \in q(\mathcal{I})$.
We also say that glb is the greatest lower bound of $\vec{t}$ in q, and that lub is the lowest
upper bound of $\vec{t}$ in q.
This definition is particularly well suited to the case of queries with bag semantics,
grouping and aggregation, which are prevalent in practice. For instance, consider the
query $q_3(s, v)$ of Example 2.1:
select salary as s, count(*) as v
from employee
group by salary
In this case, $q_3$ has free variables s and v. The variable s corresponds to the attribute
salary, on which there is a grouping condition; the numerical argument v, for which we
give tight ranges, corresponds to the result of count(*). Essentially, for a query $q(\vec{z}, v)$,
$\mathit{aggconsistent}_\Sigma(q, I)$ gives the consistent answers on I with respect to $\Sigma$ for each value
of $\vec{z}$ (the salary in our example), together with a tight range for the possible associated
numerical values.
Example 2.1. (continued) Let us obtain the $\mathit{aggconsistent}_\Sigma$ answers for $q_3$ on I.
Recall that the result of applying $q_3$ to the repairs of the inconsistent database is: $q_3(\mathcal{I}_1) =
\{(1000, 2)\}$, and $q_3(\mathcal{I}_2) = \{(1000, 1), (2000, 1)\}$. Then, we have that $\mathit{aggconsistent}_\Sigma(q_3, I) =
\{(1000, 1, 2)\}$. This means that the salary 1000 appears in every query result, and the
value of count(*) for 1000 has a greatest lower bound of one and a lowest upper bound
of two. Notice that the salary 2000 does not appear in $\mathit{aggconsistent}_\Sigma(q_3, I)$. The
intuitive reason is that 2000 is not a consistent answer, since it does not occur in repair $\mathcal{I}_1$.
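The range semantics of this example can also be reproduced by brute force. The sketch below is illustrative only: it assumes the same instance as before (the one implied by the two repairs), hard-codes the count-per-salary query $q_3$, and uses hypothetical helper names. It keeps the groups that occur in every repair and reports the minimum and maximum count observed for each of them.

```python
from collections import Counter
from itertools import product

def repairs(instance, key_len=1):
    """Enumerate repairs: one tuple per key value (assumed key = first positions)."""
    groups = {}
    for t in sorted(instance):
        groups.setdefault(t[:key_len], []).append(t)
    for choice in product(*groups.values()):
        yield set(choice)

def count_by_salary(rel):
    """q3: select salary, count(*) from employee group by salary."""
    return dict(Counter(s for (_e, s) in rel))

def agg_consistent(instance):
    """Return (salary, glb, lub) for every salary present in all repairs."""
    results = [count_by_salary(r) for r in repairs(instance)]
    common = set.intersection(*(set(r) for r in results))
    return {
        (s, min(r[s] for r in results), max(r[s] for r in results))
        for s in common
    }

employee = {("John", 1000), ("John", 2000), ("Mary", 1000)}
print(agg_consistent(employee))  # {(1000, 1, 2)}
```

Salary 2000 is dropped because it is missing from the result on one repair, and salary 1000 is reported with the range from one to two.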
According to the definition of $\mathit{aggconsistent}_\Sigma$ [...] and
$\mathit{aggconsistent}_\Sigma$ operators (the latter for the case in which the query computes numerical
values over a group of tuples).
Definition 2.5. Let R be a schema. Let q be a query over R. Let $\Sigma$ be a set of integrity
constraints.
The problem CONSISTENT(q, $\Sigma$) is the following: given an instance I over R, and
tuple $\vec{t}$, is it the case that $\vec{t} \in \mathit{consistent}_\Sigma(q, I)$?
The problem AGGCONSISTENT(q, $\Sigma$) is the following: given an instance I over R, tuple
$\vec{t}$, and real numbers glb and lub, is it the case that $(\vec{t}, glb, lub) \in \mathit{aggconsistent}_\Sigma(q, I)$?
We can now define the notion of query rewriting for the problems CONSISTENT(q, $\Sigma$)
and AGGCONSISTENT(q, $\Sigma$). The definition is given for a fixed (but unspecified) query
language.
Definition 2.6 ($\mathcal{L}$-query rewriting). Let R be a schema. Let $\Sigma$ be a set of integrity
constraints. Let q be a query over R. Let Q be a query expressed in a query language $\mathcal{L}$
(possibly different from the language used to express q).
We say that Q is an $\mathcal{L}$-rewriting of CONSISTENT(q, $\Sigma$) if for every instance I over R
and tuple $\vec{t}$, $\vec{t} \in Q(I)$ iff $\vec{t} \in \mathit{consistent}_\Sigma(q, I)$.
We say that Q is an $\mathcal{L}$-rewriting of AGGCONSISTENT(q, $\Sigma$) if for every instance I
over R, tuple $\vec{t}$ and real numbers glb and lub, $(\vec{t}, glb, lub) \in Q(I)$ iff $(\vec{t}, glb, lub) \in
\mathit{aggconsistent}_\Sigma(q, I)$.
We also define the rewritability of a problem in a language $\mathcal{L}$ as follows. We say that
CONSISTENT(q, $\Sigma$) is $\mathcal{L}$-rewritable if there exists a query Q expressed in language $\mathcal{L}$ such
that Q is a query rewriting for CONSISTENT(q, $\Sigma$). A similar definition can be given for
AGGCONSISTENT(q, $\Sigma$).
In Chapter 3, we will consider classes of conjunctive queries, and present query
rewritings in first-order logic. Notice that if CONSISTENT(q, $\Sigma$) is first-order rewritable, then
it is tractable. This is because the data complexity of first-order logic is in PTIME (in
fact, in AC$^0$, which is a subset of PTIME). Thus, the query rewriting Q can be executed
on the inconsistent database in polynomial time. Besides this, an approach based on
first-order query rewriting is attractive because first-order queries can be written in SQL. In
Chapter 4, we will focus on classes of conjunctive queries with bag semantics, grouping,
and aggregation. We will give query rewritings for the problem AGGCONSISTENT(q, $\Sigma$) in
a language that extends first-order logic with operators for grouping and aggregation. In
Chapter 5, we will study the computational complexity of the problem CONSISTENT(q, $\Sigma$).
Finally, in Chapters 6 and 7, we will present SQL query rewritings and show
experimentally that they can be run efficiently and scalably on a commercial relational database
system.
2.4 Constraints
The most commonly used types of constraints in database systems are keys and foreign
keys. Of these, keys pose a particular challenge since databases that are inconsistent
with respect to a set of key dependencies admit an exponential number of repairs in the
worst case. This potentially large number of repairs leads to the question of whether it is
possible to compute consistent answers efficiently. The answer to this question is known
to be negative in general [CLR03a, CM05]. However, this does not necessarily preclude
the existence of classes of queries for which the problem is easier to compute. Hence, we
consider the following question: for what queries is the problem of computing consistent
answers under key constraints in polynomial time (in data complexity)? And, can these
rewritings be executed efficiently in practice? We address the first question in Chapters
3 and 4, and the second question in Chapter 6.
A key constraint is an integrity constraint of the form
$$\forall \vec{x}, \vec{y}, \vec{z}.\ (r(\vec{x}, \vec{y}) \wedge r(\vec{x}, \vec{z})) \rightarrow \vec{y} = \vec{z}$$
In the above constraint, we say that $\vec{x}$ is a key of relation r. Notice that a key may
consist of many attributes. Throughout the thesis, we will assume that $\Sigma$ is a set of key
constraints that includes one key constraint per relation of the schema. This corresponds
to the notion of primary keys in database systems.
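To make the earlier remark about the exponential number of repairs concrete, the following Python sketch counts the repairs of a single relation under one key constraint. It assumes, as an illustration, that a repair keeps exactly one tuple for each key value, so the number of repairs is the product of the sizes of the groups of tuples that share a key. The function name `count_repairs` and the synthetic relation are hypothetical.

```python
from collections import Counter
from math import prod

def count_repairs(relation, key_len=1):
    """Number of repairs of one relation under a single key constraint.

    Assumption: the first key_len positions of each tuple form the key,
    and a repair picks exactly one tuple per key value.
    """
    group_sizes = Counter(t[:key_len] for t in relation)
    return prod(group_sizes.values())

# n key values, each with two conflicting tuples, give 2**n repairs.
n = 10
relation = {(k, v) for k in range(n) for v in ("a", "b")}
print(count_repairs(relation))  # 1024
```

A relation with only 2n tuples can therefore have $2^n$ repairs, which is why materializing the repairs is not a practical evaluation strategy.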
To facilitate specifying the key constraints each time that we give a query, we will
underline the positions in each literal that correspond to key attributes. Furthermore,
by convention, the key attributes will be given first. For example, the query $q =
\exists x, y, z.\, r_1(\underline{x}, y) \wedge r_2(\underline{y}, z)$ indicates that the first and second literals correspond to
binary relations whose first attribute is the key. We will use vector notation (e.g., $\vec{x}$, $\vec{y}$) to
denote vectors of variables or constants from a query or tuple. In addition, when we give
a tuple, we will underline the values that appear at the position of key attributes. For
instance, for a tuple $r(\underline{\vec{c}}, \vec{d})$, we will say that $\vec{c}$ is a key value, and $\vec{d}$ is a nonkey value.
Using this notation, the key constraints of $\Sigma$ that are relevant to the query are denoted
directly in the query expression.
2.5 Related Work
In this section, we survey work on related formal frameworks for managing inconsistent
data. For two excellent surveys of the area of consistent query answering, we refer the
reader to Bertossi and Chomicki [BC03] and Bertossi [Ber06].
Intuitively, a repair is a consistent database that is as close as possible to the given
inconsistent database. To formalize this intuition, it is necessary to define a notion of
distance between databases. The notion of distance that we employ in this thesis (and
which was initially proposed by Arenas, Bertossi, and Chomicki [ABC99]) is defined in
terms of the symmetric difference between sets. Other notions of distance have been
explored in the literature, which we review next.
Some proposals adopt a cardinality-based notion of distance between database
instances, instead of set-theoretic. For example, Lin and Mendelzon [LM96] propose a
semantics where conflicts are resolved according to a majority criterion. Their
framework is presented in the context of belief revision for first-order theories, and is therefore
broader in scope than consistent query answering. However, the complexity of query
answering under this semantics has not been studied. Other approaches [FPL+01, BBFL05,
FFP05, BMFR05] consider cost-based notions of distance, where each operation that can
be used to restore consistency is given a cost. Then, repairs are defined as the consistent
databases that can be obtained from the inconsistent database with a minimum cost.
These operations include not only insertion and deletion of tuples, but also modification
of values. While a cost-based notion of distance is attractive from a semantic point of
view, it can be computationally more expensive than the set-theoretic semantics. For
example, in the case of inconsistencies with respect to primary key dependencies, the
problem of obtaining a repair of an inconsistent database is NP-complete [BMFR05],
whereas it can be obtained in linear time under the set-theoretic semantics.
In some of the cost-based approaches mentioned above [FPL+01, BBFL05, FFP05],
tuples can be modified to contain values that are not in the active domain of the
inconsistent database. Thus, the domain of the attributes that can be modified must have
an intrinsic distance metric. In particular, these approaches consider only numerical
attributes (it is not clear how their techniques could be extended to categorical values).
An approach based on tuple modification which allows arbitrary attribute domains is
given by Wijsen [Wij05]. In his work, the repaired databases may contain variables, and
the semantics is given in terms of homomorphisms to the inconsistent database. Instead
of answering queries directly on the inconsistent database (as we do in ConQuer), his
approach requires the offline processing of the inconsistent databases to construct
condensed representations. The consistent answers to certain classes of queries can then be
obtained by directly executing the original query on the condensed representation.
In contrast to consistent answers, we could also consider possible answers, where
we retrieve answers that appear in at least one repair. This notion has received
less attention than consistent answers, perhaps because it is less challenging from a
computational point of view. In fact, for broad classes of queries and constraints for which
obtaining consistent answers is intractable, the problem of obtaining possible answers
is tractable (and it usually suffices to compute the original query on the inconsistent
database). Although they are easier to obtain, possible answers are as important as
consistent answers in the context of inconsistent databases. While consistent answers are
best suited for decision making, possible answers can be used to understand the reasons
why a database is inconsistent. For example, in ConQuer, we give the option of retrieving
not only the consistent answers but also the possible answers (see Chapter 6). If the user
decides that a possible answer should have been a consistent answer, he or she can request
an explanation from the system in terms of the underlying database. This explanation
often helps the user to detect incorrect data and to (interactively) correct it.
The notions of possible and consistent answers are two opposite ends of a spectrum:
the former being the most aggressive, and the latter the most cautious. In some
scenarios, it is desirable to give preference to (or rank) tuples in the answer according to the
number of repairs where they appear. Furthermore, some repairs may be preferable
to others. To formalize this intuition, it is natural to appeal to a semantics based on
probabilities, where each repair is assigned a probability of being the consistent database
that the user has in mind. There has been considerable research on the topic of
probabilistic databases [CP87, BMP92, LLRS97, FR97, DS04]. Recently, Dalvi and Suciu
[DS04] presented a framework for query rewriting over probabilistic databases. Their
rewriting algorithms rely on the fundamental assumption that each tuple has an
independent probability of being in the (in our terms) consistent database. In the context
of databases that violate primary key constraints, which is the focus of this thesis, we
cannot assume that all tuples are independent. In fact, tuples that share the same key
value are mutually exclusive. In recent work (which is not covered in this thesis), we
and other authors [AFM06] presented query rewriting algorithms that work under the
probabilistic semantics for databases that may violate primary key constraints. In that
paper, we also considered the important problem of obtaining the probabilities. In
particular, we explored the use of a clustering-based technique that works particularly well
on categorical values [ATMS04]. The non-probabilistic semantics that we consider in this
thesis is a special case of the probabilistic semantics. However, the class of rewritable
queries that we can handle under the probabilistic semantics [AFM06] is considerably
more restricted than the classes considered in Chapters 3 and 4 of this thesis for the
non-probabilistic case.
Databases that are inconsistent with respect to primary key constraints can be
modelled as disjunctive databases [vdM98]. In particular, if $\Sigma$ is a set of key dependencies, the
set of all repairs of an inconsistent database can be represented as a disjunctive database
D in such a way that each repair corresponds to a minimal model of D. However, to
the best of our knowledge, there are no results in the literature for query rewritings over
disjunctive databases. A relevant special case of disjunctive databases are databases with
OR-objects [IvdMV95]. If an inconsistent relation has two attributes (a key and a nonkey
attribute), then it can be modelled with OR-objects. However, this is no longer the case
for relations whose arity is greater than two.
To the best of our knowledge, DeMichiel [DeM89] and Agarwal et al. [AKWS95] are
the first authors to recognize the need to manage inconsistent databases. They propose
semantics analogous to the one for OR-objects. DeMichiel proposes algorithms that are
sound but not necessarily complete with respect to the semantics. Agarwal et al. do not
discuss the implementation of the projection and join operations which, as we will see in
Chapter 3, are particularly challenging under the consistent query answering semantics,
and an important contribution of this thesis.
We conclude this section by pointing out that the problem of dealing with
inconsistency arises (and has been studied) in other fields of computer science. For example, our
approach to handling inconsistency is related to the approaches followed by the belief
revision community [GR95] in the field of artificial intelligence. The scenario typically
adopted in belief revision is more general in scope than ours, since (in our terms) they
allow the modification of not only the data but also the integrity constraints. As another
example, the problem of handling inconsistency has been studied in software
engineering [Bal91, NER00]. The focus of this body of work is not centered on data or query
answering, but on the reconciliation of inconsistent views of software requirements and
specifications.
Chapter 3
Rewritings for Conjunctive Queries
The problem of computing consistent answers for conjunctive queries over databases that
might violate a set of key constraints is known to be coNP-complete in general [CLR03a,
CM05]. This is the case even for queries with no repeated relation symbols, which is
the focus of this chapter. However, this does not necessarily preclude the existence of
classes of queries for which the problem is easier to compute. In fact, in this chapter we
characterize a large and practical class of conjunctive queries for which the problem of
computing consistent answers under key constraints is indeed tractable. Even more so,
we show that all queries in this class are first-order rewritable, and we give a linear-time
algorithm that computes the first-order rewriting. We introduce the class of queries in
Section 3.1, and we present the query rewriting algorithm in Section 3.2. The proof of
correctness of the algorithm is given in Section 3.3.
3.1 A Broad Class of First-Order Rewritable Queries
3.1.1 Notation for Conjunctive Queries
The results in this chapter concern a class of conjunctive queries. Conjunctive queries
[CM77, AHV95] are first-order formulas that may only have conjunctions of positive
literals and existential quantification. That is, they are formulas of the following form:
$$q(\vec{z}) = \exists \vec{w}.\ R_1(\vec{x}_1, \vec{y}_1) \wedge \dots \wedge R_n(\vec{x}_n, \vec{y}_n)$$
where the variables of $\vec{x}_1, \vec{y}_1, \dots, \vec{x}_n, \vec{y}_n$ appear in exactly one of $\vec{z}$ and $\vec{w}$. We will
say that the variables in $\vec{z}$ are the free variables of q, and that the variables in $\vec{w}$ are the
existentially-quantified variables of q. Even though there are no equality symbols in our
notation for conjunctive queries, their effect can be achieved by having variables appear
more than once in the queries.
Notice that in the formula above, we denote the literals as $R_i(\vec{x}_i, \vec{y}_i)$. Throughout
the thesis, we will use the convention of using capital letters (usually R, S and T) to
denote literals of a query. Notice that two distinct literals $R_i$ and $R_j$ may be on the same
relation symbol r (although most results in this thesis are for queries without repeated
relation symbols, in which each literal corresponds to a distinct relation).
We will adopt the convention of using x to denote variables and constants of a literal
that appear at a position corresponding to key attributes of the relation symbol of the
literal, and y for variables and constants that appear at the position of nonkey attributes
of the relation symbol of the literal.
We will say that there is a join on a variable w if w appears in two literals $R_i(\vec{x}_i, \vec{y}_i)$
and $R_j(\vec{x}_j, \vec{y}_j)$ such that $i \neq j$. If w occurs in $\vec{y}_i$ and $\vec{y}_j$, we say that there is a
nonkey-to-nonkey join on w; if w occurs in $\vec{y}_i$ and $\vec{x}_j$, we say that there is a nonkey-to-key join;
and if w occurs in $\vec{x}_i$ and $\vec{x}_j$, we say that there is a key-to-key join.
3.1.2 Join Graph
Before introducing the class of queries handled by our algorithm, let us get some insight
from queries that are not considered by our algorithm because (unless P=NP) there is
no first-order rewriting that computes the consistent answer (no matter what rewriting
algorithm is used). In particular, let us consider the following queries:
$$q_1 = \exists x, x', y.\ R_1(\underline{x}, y) \wedge R_2(\underline{x'}, y)$$
$$q_2 = \exists x, y.\ R_1(\underline{x}, y) \wedge R_2(\underline{y}, x)$$
$$q_3 = \exists x, x', w, w', z, z', m.\ R_1(\underline{x}, w) \wedge R_2(\underline{m, w}, z) \wedge R_3(\underline{x'}, w') \wedge R_4(\underline{m, w'}, z')$$
We will show in Chapter 5 that the problem of computing consistent answers for the
above queries is intractable. The first query consists of a join between nonkey attributes;
the second one involves a cycle of nonkey-to-key joins; and in the third, there are two
joins from nonkey variables to part, but not the entire key, of the corresponding relations.
In order to be more precise in specifying such conditions, we need the notion of the join
graph of a query, which has a node for each literal of a query. Notice that the conditions
that we just gave are concerned with joins where at least one nonkey variable is involved.
Therefore, the join graph will be a directed graph, where directionality is determined by
the nonkey variables involved in the join.
Definition 3.1 (join graph). Let q be a conjunctive query. The join graph G of q is a
directed graph such that:
- the vertices of G are the literals of q;
- there is an arc from $R_i$ to $R_j$ if $i \neq j$, and there is some variable w such that w is
  existentially-quantified in q, w occurs at the position of a nonkey attribute in $R_i$,
  and w occurs in $R_j$.
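Definition 3.1 translates directly into a few lines of code. The sketch below is an illustration under stated assumptions: a literal is encoded as a hypothetical triple `(name, key_terms, nonkey_terms)`, all terms are variables (constants are not modelled), and any variable not listed in `free_vars` is taken to be existentially quantified. It is applied to the queries $q_1$ and $q_2$ given earlier in this section.

```python
def join_graph(literals, free_vars=()):
    """Return the arcs of the join graph as a set of (source, target) names.

    There is an arc from literal i to literal j (i != j) when some
    existentially quantified variable occurs at a nonkey position of i
    and anywhere in j.
    """
    free = set(free_vars)
    arcs = set()
    for i, (name_i, _key_i, nonkey_i) in enumerate(literals):
        for j, (name_j, key_j, nonkey_j) in enumerate(literals):
            if i == j:
                continue
            shared = (set(nonkey_i) - free) & (set(key_j) | set(nonkey_j))
            if shared:
                arcs.add((name_i, name_j))
    return arcs

# q1: nonkey-to-nonkey join on y.  q2: a cycle of nonkey-to-key joins.
q1 = [("R1", ("x",), ("y",)), ("R2", ("x'",), ("y",))]
q2 = [("R1", ("x",), ("y",)), ("R2", ("y",), ("x",))]
# A pure key-to-key join introduces no arcs.
key_to_key = [("R1", ("x",), ("y",)), ("R2", ("x",), ("z",))]

print(join_graph(q1))
print(join_graph(q2))
print(join_graph(key_to_key))  # set()
```

For both $q_1$ and $q_2$ the sketch produces an arc in each direction between the two literals, that is, a cycle.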
Notice that key-to-key joins do not introduce any arcs to the join graph. Since the
class of first-order rewritable queries that we will present shortly is defined in terms of
the join graph, its queries can have arbitrary key-to-key joins. Further, the free variables
of a query do not introduce arcs to the join graph. As a special case, if all the variables
of a query are free, then its join graph has no arcs. Such queries correspond to the
class of quantifier-free queries, and have already been shown to be first-order rewritable
[ABC99]. If we think in terms of equivalent SQL queries, the fact that all variables are
free means that every attribute of every relation in the from clause must appear in the
select clause.¹ This is a strong condition which restricts the practical applicability of
the class. As an empirical observation, none of the queries in the TPC-H specification
[TPC03], the industry standard for decision support systems, satisfy this restriction. For
this reason, we will focus on a class of conjunctive queries that may have existential
quantification (in relational algebra terms, arbitrary projections). Handling queries with
existentially-quantified variables is a major challenge, which we address in this chapter.
¹ The only exceptions are the attributes that are equated in the where clause. In that case, only one
of the equated attributes needs to appear in the select clause.
In Figure 3.1, we show the join graphs for $q_1$ and $q_2$ (we label the arcs with the variable
involved in the joins for illustration purposes). Observe in the figure that both join graphs
have a cycle. For our rewriting algorithm, we will focus on queries that have an acyclic
join graph. Additionally, when we consider how two literals $R_i$ and $R_j$ are joined, we will
require that if any of the key attributes of $R_i$ are joined with a nonkey attribute of $R_j$,
then all of the key attributes of $R_i$ join with nonkey attributes of $R_j$. We will then say
that the query has only full nonkey-to-key joins. For example, in the query $q_3$ above, of
the form $\exists x, x', w, w', z, z', m.\ R_1(\underline{x}, w) \wedge R_2(\underline{m, w}, z) \wedge R_3(\underline{x'}, w') \wedge R_4(\underline{m, w'}, z')$, the joins
between $R_1$ and $R_2$, and between $R_3$ and $R_4$, are not full since they do not involve the
entire key of $R_2$ and $R_4$, respectively.
Definition 3.2. Let q be a conjunctive query. Let $R_i(\vec{x}_i, \vec{y}_i)$ and $R_j(\vec{x}_j, \vec{y}_j)$ be a pair of
literals of q. We say that there is a full nonkey-to-key join from $R_i$ to $R_j$ if every variable
of $\vec{x}_j$ appears in $\vec{y}_i$.
We observe that if G is an acyclic join graph for a query all of whose nonkey-to-key
joins are full, then G must be a forest. We show this with the following proposition.
Proposition 3.3. Let q be a query all of whose nonkey-to-key joins are full. Let G be
the join graph of q. If G is acyclic, then G is a forest.
Proof. Assume towards a contradiction that G is a directed acyclic graph that is not a
forest. Then, there is a node $v$ in G that receives arcs from two different nodes $v_i$ and $v_j$
of G. Let $R(\vec{x}, \vec{y})$, $R_i(\vec{x}_i, \vec{y}_i)$, and $R_j(\vec{x}_j, \vec{y}_j)$ be the literals at the nodes $v$, $v_i$, and $v_j$,
respectively. Since there are arcs from $v_i$ and $v_j$ to $v$, there are variables $w_i$ and $w_j$ in
$\vec{y}_i$ and $\vec{y}_j$, respectively, that appear in R. Since G is acyclic, $w_i$ and $w_j$ must appear in
$\vec{x}$. Also, $w_j$ cannot appear in a nonkey position of $R_i$ (or, otherwise, there would be a
cycle between the nodes $v_i$ and $v_j$). Since there is a nonkey-to-key join from $R_i$ to R on
variable $w_i$, and variable $w_j$ does not occur at a nonkey position of $R_i$, the join is not
full; contradiction.
3.1.3 The Class $\mathcal{C}_{forest}$ of First-Order Rewritable Queries
We will now characterize a broad class of conjunctive queries for which the problem of
computing consistent answers under key constraints is tractable and first-order rewritable.
The characterization is given in terms of the join graph of the queries. In particular, we
will require three conditions. First, all the nonkey-to-key joins of the query must be full.
Second, the join graph must be a forest. As we showed in Proposition 3.3, this includes
all queries with full nonkey-to-key joins with acyclic join graph. Finally, the query should
have no repeated relation symbols. We call this class $\mathcal{C}_{forest}$ since we require the join
graph of its queries to be a forest, and we give the formal definition next.
Definition 3.4. Let q be a conjunctive query without repeated relation symbols and all of
whose nonkey-to-key joins are full. Let G be the join graph of q. We say that $q \in \mathcal{C}_{forest}$
if G is a forest (i.e., every connected component of G is a tree).
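The three conditions of this class can be tested mechanically. The following Python sketch is one possible reading of Definitions 3.1, 3.2 and 3.4, not the thesis's own procedure. It assumes literals are encoded as hypothetical triples `(relation_name, key_vars, nonkey_vars)`, that all terms are variables, and that a directed graph counts as a forest when no node has two incoming arcs and there is no directed cycle.

```python
def join_arcs(literals, free_vars=()):
    """Arcs i -> j: a quantified nonkey variable of literal i occurs in j."""
    free = set(free_vars)
    arcs = set()
    for i, (_, _, nonkey_i) in enumerate(literals):
        for j, (_, key_j, nonkey_j) in enumerate(literals):
            if i != j and (set(nonkey_i) - free) & (set(key_j) | set(nonkey_j)):
                arcs.add((i, j))
    return arcs

def joins_are_full(literals, free_vars=()):
    """Every nonkey-to-key join must cover the entire key of its target."""
    free = set(free_vars)
    for i, (_, _, nonkey_i) in enumerate(literals):
        for j, (_, key_j, _) in enumerate(literals):
            if i != j and (set(nonkey_i) - free) & set(key_j):
                if not set(key_j) <= set(nonkey_i):
                    return False
    return True

def is_forest(n, arcs):
    """At most one incoming arc per node, and parent links never loop."""
    parent = {}
    for u, v in arcs:
        if v in parent:
            return False
        parent[v] = u
    for start in range(n):
        seen, node = set(), start
        while node in parent:
            if node in seen:
                return False
            seen.add(node)
            node = parent[node]
    return True

def in_c_forest(literals, free_vars=()):
    names = [name for name, _, _ in literals]
    return (len(set(names)) == len(names)
            and joins_are_full(literals, free_vars)
            and is_forest(len(literals), join_arcs(literals, free_vars)))

# employee(e, d) joined with dept(d, m), with e free: a full nonkey-to-key join.
emp_dept = [("employee", ("e",), ("d",)), ("dept", ("d",), ("m",))]
# q1 of Section 3.1.2: nonkey-to-nonkey join, cyclic join graph.
q1 = [("R1", ("x",), ("y",)), ("R2", ("x'",), ("y",))]
# q3 of Section 3.1.2: joins that reach only part of a key.
q3 = [("R1", ("x",), ("w",)), ("R2", ("m", "w"), ("z",)),
      ("R3", ("x'",), ("w'",)), ("R4", ("m", "w'"), ("z'",))]

print(in_c_forest(emp_dept, free_vars=("e",)))  # True
print(in_c_forest(q1))                          # False
print(in_c_forest(q3))                          # False
```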
Figure 3.1: Cyclic join graphs of intractable queries
A fundamental observation about $\mathcal{C}_{forest}$ is that it is a very common, practical class
of queries. Arguably, the most used form of join is from a set of nonkey attributes of
one relation (which may be a foreign key)² to the key of another relation (which may be
a primary key). Furthermore, such joins typically involve the entire primary key of the
relation (and, hence, they are full joins in our terms). Finally, cycles are rarely present
in the queries used in practice. Admittedly, the restriction not to have repeated relation
symbols does rule out some common queries (those in which the same relation appears
twice in the from clause of an SQL query). Still, many queries used in practice do not
have repeated relation symbols.
² Notice that we are not dealing with the problem of inconsistency with respect to foreign keys, but
only with respect to key dependencies.
As an empirical observation, only one out of 22 queries in the TPC-H specification
[TPC03], the industry standard for decision support queries, has a nonkey-to-nonkey
join. All the queries in the standard are acyclic, and all the nonkey-to-key joins of the
queries are full.
3.2 Query Rewriting Algorithm
In this section, we present the query rewriting algorithm RewriteForest that works for
the class of conjunctive queries $\mathcal{C}_{forest}$ introduced in the previous section. We start the
presentation with a number of examples that highlight some of the intuition underlying
the algorithm.
In the next example, we illustrate the rewriting for a query consisting of only one
literal. We also show that even for such a simple query, the query itself is not a rewriting
for the problem of computing its own consistent answers.
Example 3.1. As in Example 2.1, consider a schema R with one relation symbol
employee, which has two attributes: emplKey (the name of the employee) and salary.
Furthermore, consider a set $\Sigma$ consisting of only one constraint stating that the attribute
emplKey is the key of relation employee.
Let $q_1$ be a query that retrieves all the employees from the database that make
a salary of 1000, expressed as $q_1(e) = \mathit{employee}(e, 1000)$. First of all, notice that $q_1$
itself is not a query rewriting of CONSISTENT($q_1$, $\Sigma$). Consider a database instance $I_1 =
\{\mathit{employee}(\mathit{John}, 1000), \mathit{employee}(\mathit{John}, 2000)\}$. It is easy to see that $(\mathit{John}) \in q_1(I_1)$.
However, $(\mathit{John}) \notin \mathit{consistent}_\Sigma(q_1, I_1)$ because the repair $\mathcal{I} = \{\mathit{employee}(\mathit{John}, 2000)\}$
is such that $(\mathit{John}) \notin q_1(\mathcal{I})$.
Now, consider a database instance $I_2 = \{\mathit{employee}(\mathit{John}, 1000), \mathit{employee}(\mathit{John}, 2000),
\mathit{employee}(\mathit{Mary}, 1000)\}$. It is easy to see that $(\mathit{Mary}) \in \mathit{consistent}_\Sigma(q_1, I_2)$. This is
because employee Mary appears with a salary of 1000 as its nonkey value, and does not
appear with any other [...] $s'$ such that [...]
$$\exists s'.\, \mathit{employee}(e, s') \wedge \forall s'.\, (\mathit{employee}(e, s') \rightarrow s' = 1000)$$
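The condition described in this example (the employee appears with salary 1000 and with no other salary) can be evaluated directly on the inconsistent instance, without generating repairs. The following Python sketch is illustrative only: relations are plain sets of `(emplKey, salary)` pairs, and the function names are hypothetical.

```python
def q1(employee):
    """Original query: employees that have some tuple with salary 1000."""
    return {e for (e, s) in employee if s == 1000}

def q1_consistent(employee):
    """Employees with salary 1000 and no tuple carrying a different salary."""
    return {
        e
        for (e, s) in employee
        if s == 1000
        and all(s2 == 1000 for (e2, s2) in employee if e2 == e)
    }

I1 = {("John", 1000), ("John", 2000)}
I2 = {("John", 1000), ("John", 2000), ("Mary", 1000)}

print(q1(I1))             # {'John'}
print(q1_consistent(I1))  # set()
print(q1_consistent(I2))  # {'Mary'}
```

On $I_1$ the original query returns John while the stricter check returns nothing; on $I_2$ only Mary passes the check, matching the discussion above.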
In the next example, we illustrate the rewriting for a conjunctive query that has a
nonkey-to-key join.
Example 3.2. Let R be a schema with two relation symbols: employee and dept.
Assume that employee has two attributes: emplKey (employee name), and deptFKey
(department name); and dept has two attributes deptKey (department name) and mgrName
(manager name). Assume that there are two key constraints in $\Sigma$, stating that emplKey is
the key of the relation employee, and deptKey is the key of relation dept.
Consider the query q
2
that retrieves the names of all employees whose department
appears in the dept relation:
q
2
(e) = d, m.employee(e, d) dept(d, m)
As in the previous example, q
2
itself is not a query rewriting of CONSISTENT(q
2
, ).
Consider the database instance I
1
= employee(John, Sales), employee(John, Engineering),
Chapter 3. Rewritings for Conjunctive Queries 28
dept(Sales, Peter). It is easy to see that (John) q
2
(I
1
). However, we have that
(John) , consistent
(q
2
, I
1
) because the repair 1 = employee(John, Engineering),
dept(Sales, Peter) is such that (John) , q
2
(1).
Now, consider the following database instance $I_2 = \{employee(John, Sales), employee(John, Engineering), dept(Sales, Peter), dept(Engineering, Tom)\}$.
It is easy to see that $(John) \in consistent_\Sigma(q_2, I_2)$. This is because every nonkey value
(department name) that appears together with John in some tuple (in this case, Sales
and Engineering) joins with a tuple of dept. This can be checked with a formula
$Q_{consist}(e) = \forall d.\, employee(e, d) \rightarrow \exists m.\, dept(d, m)$. We will soon show that a query
rewriting $Q_2$ for $q_2$ can be obtained as the conjunction of $q_2$ and $Q_{consist}$, as follows:
$Q_2(e) = \exists d, m.\, employee(e, d) \wedge dept(d, m) \wedge \forall d.\,(employee(e, d) \rightarrow \exists m.\, dept(d, m))$
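Since the rewriting is a first-order formula, it can be written in SQL by turning each universal quantifier into a `NOT EXISTS` over its negation. The following is a hedged sketch of $Q_2$ using Python's standard `sqlite3` module; the table layout, the constant `Q2_SQL`, and the helper `run_q2` are assumptions made for illustration, and this is not necessarily the SQL that ConQuer generates. The tables are deliberately declared without key constraints, because the instance may violate them.

```python
import sqlite3

# Q2(e): e joins with some department, and every department recorded
# for e (in any tuple with the same key) appears in dept.
Q2_SQL = """
SELECT DISTINCT e.emplKey
FROM employee e JOIN dept d ON e.deptFKey = d.deptKey
WHERE NOT EXISTS (
    SELECT 1 FROM employee e2
    WHERE e2.emplKey = e.emplKey
      AND NOT EXISTS (SELECT 1 FROM dept d2 WHERE d2.deptKey = e2.deptFKey)
)
"""


def run_q2(employees, depts):
    """Load a (possibly inconsistent) instance and evaluate Q2 on it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (emplKey TEXT, deptFKey TEXT)")
    conn.execute("CREATE TABLE dept (deptKey TEXT, mgrName TEXT)")
    conn.executemany("INSERT INTO employee VALUES (?, ?)", employees)
    conn.executemany("INSERT INTO dept VALUES (?, ?)", depts)
    rows = conn.execute(Q2_SQL).fetchall()
    conn.close()
    return {row[0] for row in rows}


emp = [("John", "Sales"), ("John", "Engineering")]
answers_I1 = run_q2(emp, [("Sales", "Peter")])
answers_I2 = run_q2(emp, [("Sales", "Peter"), ("Engineering", "Tom")])
print(answers_I1, answers_I2)
```

On $I_1$ the query returns no employee, because the Engineering tuple for John has no matching department; on $I_2$ it returns John, in agreement with the example.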
We now proceed to present RewriteForest, the query rewriting algorithm for queries
in $\mathcal{C}_{forest}$ (shown in Figures 3.2, 3.3, and 3.4). Given a query $q$ such that $q \in \mathcal{C}_{forest}$
and a set of key constraints $\Sigma$ (containing one key per relation), RewriteForest($q$, $\Sigma$)
returns a first-order rewriting $Q$ for the problem of obtaining the consistent answers
for $q$ with respect to $\Sigma$. The main procedure of the algorithm is shown in Figure 3.2.
The first-order rewriting $Q$ that it returns is obtained as the conjunction of the input
query $q$ and a new query called $Q_{consist}$. The query $Q_{consist}$ is used to ensure that $q$ is
satisfied in every repair. It is important to notice that $Q_{consist}$ will be applied directly to
the inconsistent database (i.e., we will never explicitly generate the repairs). The query
$Q_{consist}$ is obtained by recursion on the tree structure of each of the components of the
join graph of $q$ (recall that since $q$ is in $\mathcal{C}_{forest}$, the join graph is a forest). The recursive
procedure is called RewriteTree, and is shown in Figure 3.3.
The first part of RewriteTree produces a rewriting $Q_{local}$ for the literal $R(\vec{x}, \vec{y})$ at the
root of the input tree. This rewriting is done independently of the rest of the query, and
it is produced by the procedure RewriteLocal (shown in Figure 3.4). The query $Q_{local}$
deals with the constants that appear in $\vec{y}$ in the same way as we illustrated in Example
3.1. It also deals with the free variables that appear at nonkey positions of the query in
the way that we illustrate in the next example.
Example 3.3. Consider the query $q_3$ that retrieves all employees and their salaries from
the database, expressed as $q_3(e, s) = employee(e, s)$. Notice that the only difference with
the query $q_1$ of Example 3.1 is that the constant 1000 is replaced by the free variable
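Algorithm RewriteForest($q$, $\Sigma$)
Input: $q(\vec{z})$, a query of the form $\exists \vec{w}.\, \phi(\vec{w}, \vec{z})$
       $\Sigma$, a set of key constraints, one per relation used in $q$
Output: $Q$, a first-order query that computes $consistent_\Sigma(q, I)$
Let $G$ be the join graph of $q$
Let $T_1, \ldots, T_m$ be the connected components (trees) of $G$
for $i := 1$ to $m$ do
  Let $R_i(\vec{x}_i, \vec{y}_i)$ be the literal at the root of $T_i$
  Let $\phi_i$ be the conjunction of the literals of $T_i$
  Let $\vec{w}_i = \{w : w$ is a variable that occurs in $\phi_i$ and $\vec{w}$, and $w \notin \vec{x}_i\}$
  Let $\vec{z}_i = \{z : z$ is a variable that occurs in $\phi_i$ and $\vec{z}$, and $z \notin \vec{x}_i\}$
  Let $q_i(\vec{x}_i, \vec{z}_i) = \exists \vec{w}_i.\, \phi_i(\vec{x}_i, \vec{w}_i, \vec{z}_i)$
  Let $Q_i(\vec{x}_i, \vec{z}_i)$ = RewriteTree($q_i$, $\Sigma$)
end for
Let $Q_{consist}(\vec{w}, \vec{z}) = \bigwedge_{i=1 \ldots m} Q_i(\vec{x}_i, \vec{z}_i)$
Let $Q(\vec{z}) = \exists \vec{w}.\,(\phi(\vec{w}, \vec{z}) \wedge Q_{consist}(\vec{w}, \vec{z}))$
return $Q$
Figure 3.2: Query rewriting algorithm for conjunctive queries in $\mathcal{C}_{forest}$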
$s$. The algorithm RewriteLocal creates a new, universally quantified variable $s'$ for the
free variable $s$, and equates $s'$ to $s$, which yields a conjunct of the form
$\forall s'.\,(employee(e, s') \rightarrow s' = s)$
The second part of RewriteTree recursively creates a query $Q_i$ for each subtree $T_i$
of the tree $T$ rooted at $R$. Let $\vec{y}_0$ be the variables at nonkey positions of $R$ (excluding those
that also appear in $\vec{x}$). Then, one of the conjuncts of the rewritten query returned by
RewriteTree is of the form $\forall \vec{y}_0.\,(R(\vec{x}, \vec{y}) \rightarrow \bigwedge_{i=1 \ldots m} Q_i(\vec{x}_i, \vec{z}_i))$. Notice that the variables of
$\vec{y}_0$ (i.e., the variables at nonkey positions of the root literal $R$) are universally quantified.
The intuition behind this is that, as we illustrated in Example 3.2, the query must
be satisfied by all the nonkey values of a given key (in that example, all the possible
departments for the given employee).
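Algorithm RewriteTree($q$, $\Sigma$)
Input: $q(\vec{x}, \vec{z})$, a query in $\mathcal{C}_{forest}$ of the form $\exists \vec{w}.\, \phi(\vec{x}, \vec{w}, \vec{z})$,
       whose join graph $T$ is a tree with root literal $R(\vec{x}, \vec{y})$
       $\Sigma$, a set of key constraints, one per relation
Output: $Q$, a first-order query that computes $consistent_\Sigma(q, I)$
Let $q_{local}(\vec{x}, \vec{z}) = \exists \vec{w}.\, R(\vec{x}, \vec{y})$
Let $Q_{local}(\vec{x}, \vec{z})$ = RewriteLocal($q_{local}$, $\Sigma$)
if $T$ has exactly one node then
  Let $Q(\vec{x}, \vec{z}) = Q_{local}(\vec{x}, \vec{z})$
else
  Let $R_1(\vec{x}_1, \vec{y}_1), \ldots, R_m(\vec{x}_m, \vec{y}_m)$ be the children of $R$ in $T$
  for $i := 1$ to $m$ do
    Let $T_i$ be the subtree of $T$ whose root is $R_i$
    Let $\phi_i$ be the conjunction of the literals of $T_i$
    Let $\vec{w}_i = \{w : w$ is a variable that occurs in $\phi_i$ and $\vec{w}$, and $w \notin \vec{x}_i\}$
    Let $\vec{z}_i = \{z : z$ is a variable that occurs in $\phi_i$ and $\vec{z}$, and $z \notin \vec{x}_i\}$
    Let $q_i(\vec{x}_i, \vec{z}_i) = \exists \vec{w}_i.\, \phi_i(\vec{x}_i, \vec{w}_i, \vec{z}_i)$
    Let $Q_i(\vec{x}_i, \vec{z}_i)$ = RewriteTree($q_i$, $\Sigma$)
  end for
  Let $\vec{y}_0 = \{y : y$ is a variable that occurs in $\vec{y}$ and $\vec{w}$, and $y \notin \vec{x}\}$
  Let $Q(\vec{x}, \vec{z}) = Q_{local}(\vec{x}, \vec{z}) \wedge \forall \vec{y}_0.\,(R(\vec{x}, \vec{y}) \rightarrow \bigwedge_{i=1 \ldots m} Q_i(\vec{x}_i, \vec{z}_i))$
end if
return $Q$
Figure 3.3: Recursive algorithm on the tree structure of the join graph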
The next example illustrates an application of the algorithm.
Example 3.4. Let R be a schema with four relation symbols: employee, dept, city,
and prov. Assume that employee has three attributes: emplKey (employee name),
cityFKey (city name), and deptFKey (department name); dept has two attributes:
deptKey (department name) and mgrName (manager's name); city has two attributes:
cityKey and provFKey; and prov has two attributes: provKey (province name) and
countryName (country name). Assume that there are four key constraints in $\Sigma$, stating
that emplKey is the key of the relation employee; cityKey is the key of relation city;
deptKey is the key of the relation dept; and provKey is the key of the relation prov.
Consider a query $q_4$ that retrieves the names of all employees that are located in
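Algorithm RewriteLocal($q$, $\Sigma$)
Input: $q(\vec{x}, \vec{z})$, a query of the form $\exists \vec{w}.\, R(\vec{x}, \vec{y})$, where
       none of the variables of $\vec{w}$ appear in $\vec{x}$
       $\Sigma$, a set of key constraints
Let $\nu$ be an injective function mapping natural numbers to variables not present in $R$
Initialize $Eq$ as an empty set
for each position $p$ of $\vec{y}$ do
  Let $w$ be the variable that appears at position $p$ of $\vec{y}$
  Let $z = \nu(p)$
  if there is a constant $d$ at position $p$ of $\vec{y}$ then
    Add the equality $z = d$ to $Eq$
  end if
  if $w$ appears in $\vec{x}$ or $w$ appears in $\vec{z}$ then
    Add the equality $z = w$ to $Eq$
  end if
  for every position $p'$ of $\vec{y}$ such that $p \neq p'$ and $w$ also appears at position $p'$ do
    Let $z' = \nu(p')$
    Add the equality $z = z'$ to $Eq$
  end for
end for
if $Eq \neq \emptyset$ then
  Let $\vec{y}'$ be a vector of variables of the same arity as $\vec{y}$ such that if $z$ is at position $p$ of $\vec{y}'$, then $\nu(p) = z$
  Let $Q_{eq}$ be the conjunction of the equalities of $Eq$
  Let $Q_{local}(\vec{x}, \vec{z}) = \exists \vec{w}.\, R(\vec{x}, \vec{y}) \wedge \forall \vec{y}'.\,(R(\vec{x}, \vec{y}') \rightarrow Q_{eq})$
else
  Let $Q_{local}(\vec{x}, \vec{z}) = \exists \vec{w}.\, R(\vec{x}, \vec{w})$
end if
return $Q_{local}$
Figure 3.4: Query rewriting for a given literal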
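The local rewriting of a single literal can be sketched in a few lines of Python that produce the formula as a string. This is an illustrative approximation under stated assumptions, not the thesis's implementation: terms are encoded as `("var", name)` or `("const", value)` tuples, the fresh variables are named `v0`, `v1`, ... (assumed not to clash with the query's own variables), and each equality between repeated variables is emitted once rather than in both directions. The function name `rewrite_local` and its signature are hypothetical.

```python
def rewrite_local(rel, key_vars, nonkey_terms, free_vars):
    """Return a string for Q_local of the literal rel(key_vars, nonkey_terms).

    nonkey_terms: list of ("var", name) or ("const", value) tuples.
    free_vars: names of the free variables of the query.
    """
    # One fresh, universally quantified variable per nonkey position.
    fresh = [f"v{p}" for p in range(len(nonkey_terms))]
    equalities = []
    for p, (kind, term) in enumerate(nonkey_terms):
        if kind == "const":
            # A constant at position p must hold for every tuple with this key.
            equalities.append(f"{fresh[p]} = {term}")
            continue
        if term in key_vars or term in free_vars:
            # Key or free variables at nonkey positions are equated as well.
            equalities.append(f"{fresh[p]} = {term}")
        for p2 in range(p + 1, len(nonkey_terms)):
            # The same variable repeated at two nonkey positions.
            if nonkey_terms[p2] == ("var", term):
                equalities.append(f"{fresh[p]} = {fresh[p2]}")

    # Existentially quantified variables: nonkey variables that are
    # neither part of the key nor free.
    exist = []
    for kind, term in nonkey_terms:
        if kind == "var" and term not in key_vars and term not in free_vars:
            if term not in exist:
                exist.append(term)

    args = list(key_vars) + [str(term) for _, term in nonkey_terms]
    atom = rel + "(" + ", ".join(args) + ")"
    prefix = "exists " + ", ".join(exist) + ". " if exist else ""
    if not equalities:
        return prefix + atom
    primed = rel + "(" + ", ".join(list(key_vars) + fresh) + ")"
    condition = " and ".join(equalities)
    return f"{prefix}{atom} and forall {', '.join(fresh)}. ({primed} -> {condition})"


print(rewrite_local("employee", ["e"], [("const", 1000)], []))
print(rewrite_local("employee", ["e"], [("var", "s")], ["s"]))
print(rewrite_local("employee", ["e"], [("var", "d")], []))
```

The first two calls correspond to the queries $q_1$ and $q_3$ of Examples 3.1 and 3.3; the third has no equalities, so only the existential part is produced.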
Figure 3.5: Join graph of query $q_4$.
Canada and whose manager is Peter:
$q_4(e) = \exists d, c, m, p.\; employee(e, d, c) \wedge city(c, p) \wedge prov(p, Canada) \wedge dept(d, Peter)$
The join graph of $q_4$ is given in Figure 3.5. Notice that the join graph of $q_4$ is a tree.
Furthermore, $q_4$ has full nonkey-to-key joins and no repeated relation symbols. Thus, $q_4$
is in $\mathcal{C}_{forest}$.
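Let $q'$ be the query $q_4$ itself, whose join graph is a tree rooted at the literal employee. Let $q''$
be the query $q''(c) = \exists p.\, city(c, p) \wedge prov(p, Canada)$, let $q'''$ be the query
$q'''(p) = prov(p, Canada)$, and let $q^{IV}$ be the query $q^{IV}(d) = dept(d, Peter)$. Following the
algorithm, the rewriting of $q_4$ is $Q_4(e) = q_4(e) \wedge Q'(e)$, where
$Q'(e)$ = RewriteTree($q'$, $\Sigma$) =
$\exists d, c.\, employee(e, d, c) \wedge \forall d, c.\,(employee(e, d, c) \rightarrow Q''(c) \wedge Q^{IV}(d))$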
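$Q''(c)$ = RewriteTree($q''$, $\Sigma$) =
$\exists p.\, city(c, p) \wedge \forall p.\,(city(c, p) \rightarrow Q'''(p))$
$Q'''(p)$ = RewriteTree($q'''$, $\Sigma$) =
$prov(p, Canada) \wedge \forall w'.\,(prov(p, w') \rightarrow w' = Canada)$
$Q^{IV}(d)$ = RewriteTree($q^{IV}$, $\Sigma$) =
$dept(d, Peter) \wedge \forall u'.\,(dept(d, u') \rightarrow u' = Peter)$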
Notice the reuse of variables in the rewritten queries. In particular, each existentially
quantified variable of $q_4$ that appears at a nonkey position in a literal of $q_4$ is first
existentially quantified, and then universally quantified in the rewriting $Q_4$.
Recall that queries with repeated relation symbols are not allowed in the class $\mathcal{C}_{forest}$.
We now give an example of a query with repeated relation symbols for which our
algorithm fails to give the consistent answers. Although not addressed in this work, it
would be interesting to characterize the class of queries with repeated relation symbols
for which our algorithm is indeed correct.
Example 3.5. Let R be a schema with one relation symbol $r$, which has three attributes:
A, B, C. Assume that A is the key of the relation $r$. Let $q$ be the Boolean query
$q = \exists x, y, z.\, r(x, y, a) \wedge r(y, z, b)$, where $a$ and $b$ are constants. If we apply our query
rewriting algorithm, we obtain the following:
$Q = \exists x, y, z.\, r(x, y, a) \wedge r(y, z, b) \wedge \forall y', z'.\,(r(x, y', z') \rightarrow z' = a)$
$\wedge\ \forall y.\,(r(x, y, a) \rightarrow \exists z.\, r(y, z, b) \wedge \forall z', w'.\,(r(y, z', w') \rightarrow w' = b))$
Let $I$ be the database instance $I = \{r(c, d, a), r(d, e, b), r(d, f, a), r(f, g, b)\}$. In this
case, there are two repairs of $I$ with respect to $\Sigma$: $\mathcal{R}_1 = \{r(c, d, a), r(d, e, b), r(f, g, b)\}$
and $\mathcal{R}_2 = \{r(c, d, a), r(d, f, a), r(f, g, b)\}$. Clearly, $\mathcal{R}_1 \models q$ and $\mathcal{R}_2 \models q$. However, $I \not\models Q$.
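The claim of this example can be checked mechanically. The sketch below is an illustration only: it hard-codes the instance $I$, enumerates its repairs by choosing one tuple per key value, evaluates $q$ on each repair, and evaluates the rewritten formula $Q$ directly on $I$. The function names and the encoding of tuples as Python triples are assumptions of the sketch.

```python
from itertools import product

A, B = "a", "b"  # the two constants of the query
I = {("c", "d", A), ("d", "e", B), ("d", "f", A), ("f", "g", B)}


def repairs(instance):
    """One tuple per key value (first attribute)."""
    groups = {}
    for t in sorted(instance):
        groups.setdefault(t[0], []).append(t)
    return [set(choice) for choice in product(*groups.values())]


def q(db):
    """q = exists x, y, z. r(x, y, a) and r(y, z, b)."""
    return any(t[2] == A and u[0] == t[1] and u[2] == B for t in db for u in db)


def Q(db):
    """The rewritten formula of the example, evaluated directly on db."""
    for (x, y, third) in db:
        if third != A:
            continue
        if not any(u[0] == y and u[2] == B for u in db):
            continue
        # forall y', z'. r(x, y', z') -> z' = a
        if any(t[0] == x and t[2] != A for t in db):
            continue
        # forall y. r(x, y, a) -> (exists z. r(y, z, b)
        #                          and forall z', w'. r(y, z', w') -> w' = b)
        ok = all(
            any(u[0] == y2 and u[2] == B for u in db)
            and all(u[2] == B for u in db if u[0] == y2)
            for (x2, y2, c2) in db
            if x2 == x and c2 == A
        )
        if ok:
            return True
    return False


every_repair_satisfies_q = all(q(r) for r in repairs(I))
rewriting_holds = Q(I)
print(every_repair_satisfies_q, rewriting_holds)
```

Every repair satisfies $q$, yet the rewritten formula is false on $I$, which is the mismatch that the example describes.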
We finish this section by pointing out that the complexity of the query rewriting
algorithm is linear in the number of literals of the input query. To see this, notice that
the algorithm visits each node of the join graph exactly once.
3.3 Correctness of the Algorithm
In this section, we show that the algorithm RewriteForest presented in the previous
section is correct for all queries in the class $\mathcal{C}_{forest}$. In particular, we prove the following
theorem.
Theorem 3.5. Let R be a schema. Let $\Sigma$ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let $q(\vec{z})$ be a conjunctive query over R such that
$q \in \mathcal{C}_{forest}$. Let $Q(\vec{z})$ be the first-order query returned by RewriteForest($q$, $\Sigma$). Let $I$ be
an instance over R.
Then, $\vec{t} \in Q(I)$ iff $\vec{t} \in consistent_\Sigma(q, I)$.
Our proof relies on a few simple properties of repairs of inconsistent databases where
the set of integrity constraints contains a single key dependency per relation. We establish
these properties in Section 3.3.1. In Section 3.3.2, we show a structural property of the
queries in $\mathcal{C}_{forest}$ that is important in order to guarantee the correctness of the algorithms
RewriteTree and RewriteForest: the literals from distinct trees of the join graph may
only share variables that appear as key attributes at the root of their trees.
In Section 3.3.3, we introduce the notion of a pessimistic repair. The name comes
from the fact that, for a given query $q$ and database $I$, if a tuple fails to satisfy the query
on some repair, then it also fails to satisfy the query on the pessimistic repair. More
precisely, for any inconsistent database $I$, there is a repair $\mathcal{M}$ such that if $\mathcal{M} \models q(\vec{c})$,
then consistent
t I
and
t , I. Let I
= I
[= . Clearly, (I, I
) (I, I
). Therefore, I
) is a tuple of 1.
Proof. Let I
[= and R(c,
d
) , I
, for every
d
. Let
I
= I
R(c,
d). Since R(c,
d
) , I
for every
d
, I
[= . Clearly, (I, I
) =
(I, I
) R(c,
d). Since (I, I
) (I, I
), I
such that
R(c,
d
) 1
. Let I
= 1
R(c,
d
) R(c,
d). Since 1
is a repair, 1
[= . Since I
[= . Assume that I
) (I, I
).
By Proposition 3.6, 1
I, and thus I
I.
Thus, I 1
I I
. Therefore, I
. Let I
= 1
R(c,
d) R(c,
d
).
Clearly, 1
. Thus, 1
(/).
By Lemma 3.10 below, it follows that c consistent
(q
, y.r
1
(x, y)r
2
(x
, y).
Notice that $q_{nk}$ is not in $\mathcal{C}_{forest}$ because it contains a nonkey-to-nonkey join. Let $I$ be an
instance such that $I = \{r_1(a_1, b_1), r_1(a_1, b_2), r_1(a_2, b_3), r_1(a_2, b_4), r_1(a_3, b_5), r_1(a_3, b_3),$
$r_2(c_1, b_1), r_2(c_1, b_3), r_2(c_2, b_4), r_2(c_2, b_5), r_2(c_3, b_2), r_2(c_3, b_3)\}$. It can be checked
that for every repair $\mathcal{R}$ of $I$, $\mathcal{R} \models q_{nk}$.
Now, consider the query $q'_{nk}(x) = \exists x', y.\, r_1(x, y) \wedge r_2(x', y)$. Notice that $q_{nk}$ and $q'_{nk}$ differ
only in the fact that $x$ is existentially quantified in the former, and free in the latter. Let
$\mathcal{R}_1$ be a repair of $I$ such that $\mathcal{R}_1 = \{r_1(a_1, b_1), r_1(a_2, b_3), r_1(a_3, b_5), r_2(c_1, b_3), r_2(c_2, b_4), r_2(c_3, b_3)\}$.
Let $\mathcal{R}_2$ be a repair of $I$ such that $\mathcal{R}_2 = \{r_1(a_1, b_1), r_1(a_2, b_3), r_1(a_3, b_5), r_2(c_1, b_1), r_2(c_2, b_4), r_2(c_3, b_2)\}$.
Notice that $(a_1) \notin q'_{nk}(\mathcal{R}_1)$, $(a_2) \notin q'_{nk}(\mathcal{R}_2)$, and $(a_3) \notin q'_{nk}(\mathcal{R}_1)$. Thus, even
though $consistent_\Sigma(q_{nk}, I) = true$, we have that $(a) \notin consistent_\Sigma(q'_{nk}, I)$
for every $a$. Therefore, it is not possible to check whether $consistent_\Sigma(q_{nk}, I) = true$
by independently checking each instantiation of the free variables of $q'_{nk}$.
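The two claims about this instance (every repair satisfies the Boolean query, yet no single value of $x$ is a consistent answer) can be confirmed by enumerating the 64 repairs. The sketch below is only a brute-force check; the list names `R1` and `R2` stand for the relations $r_1$ and $r_2$, and the helper names are illustrative.

```python
from itertools import product

# The instance of the example: (key, nonkey) pairs of r1 and r2.
R1 = [("a1", "b1"), ("a1", "b2"), ("a2", "b3"), ("a2", "b4"), ("a3", "b5"), ("a3", "b3")]
R2 = [("c1", "b1"), ("c1", "b3"), ("c2", "b4"), ("c2", "b5"), ("c3", "b2"), ("c3", "b3")]


def repairs_of(relation):
    """All ways of keeping exactly one tuple per key value."""
    groups = {}
    for key, val in relation:
        groups.setdefault(key, []).append((key, val))
    return [set(choice) for choice in product(*groups.values())]


# Key constraints are per relation, so a repair of the instance is a pair of repairs.
all_repairs = [(r1, r2) for r1 in repairs_of(R1) for r2 in repairs_of(R2)]


def answers(r1, r2):
    """Values x such that r1(x, y) and r2(x', y) hold for some x', y."""
    joinable = {y for (_, y) in r2}
    return {x for (x, y) in r1 if y in joinable}


# The Boolean query holds in a repair iff the query with x free has some answer.
boolean_consistent = all(len(answers(r1, r2)) > 0 for r1, r2 in all_repairs)
consistent_tuples = set.intersection(*(answers(r1, r2) for r1, r2 in all_repairs))
print(len(all_repairs), boolean_consistent, consistent_tuples)
```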
The result that we give below assumes an input query $q(\vec{x})$ that is in $\mathcal{C}_{forest}$, whose
join graph $T$ is a tree, and whose free variables $\vec{x}$ are exactly the variables of the key of $T$'s
root. In the algorithm RewriteForest, the input query will be broken into subqueries
that satisfy this condition.
Lemma 3.10. Let $q(\vec{x})$ be a query in $\mathcal{C}_{forest}$, whose join graph $T$ is a tree and where
$R(\vec{x}, \vec{y})$ is the literal at the root of $T$. Let $I$ be an instance. Then, there is a repair $\mathcal{M}$
such that for all $\vec{c}$, if $\vec{c} \in q(\mathcal{M})$, then $\vec{c} \in consistent_\Sigma(q, I)$.
Proof. Let $\mathcal{M}$ be the instance built by invoking the procedure
BuildPessimisticRepair($q$, $I$) given in Figure 3.6. Assume that $q$ is of the form
$q(\vec{x}) = \exists \vec{w}.\, \phi(\vec{w}, \vec{x})$. We will prove the claim by induction on the number of literals of $\phi$.
Base case. Assume that $\phi$ consists of exactly one literal $R(\vec{x}, \vec{y})$. Let
$t$ be the tuple
selected by the algorithm in the iteration for literal $R$ and the vector of values $\vec{c}$. Assume
towards a contradiction that consistent
in 1 and some
d
such that
t
= R(c,
d
). Since
1 ,[= w.R(x, y)[x/c], we have that
d such that
t = R(c,
d). Since / [= q(x)[x/c], we have that for every j such that
1 j m, there is some valuation for the variables of y, and some c
j
such that
(y) =
d, ( x
j
) =c
j
, and /
j
[= q
j
(x
j
)[x
j
/c
j
].
Algorithm BuildPessimisticRepair
Input: $q(\vec{x})$, a query in $\mathcal{C}_{forest}$ of the form $\exists \vec{w}.\, \phi(\vec{w}, \vec{x})$,
       whose join graph $T$ is a tree with root literal $R(\vec{x}, \vec{y})$
       $\Sigma$, a set of key constraints, one per relation
       $I$, an instance
Output: $\mathcal{M}$, a repair of $I$
Initialize $\mathcal{M}$ as an empty instance
if $\phi$ has exactly one literal then
  for each $\vec{c}$ such that there is some $R(\vec{c}, \vec{d})$ in $I$ do
    if there is some $\vec{d}$ such that $R(\vec{c}, \vec{d}) \in I$,
       and $R(\vec{c}, \vec{d}) \not\models \exists \vec{w}.\, R(\vec{x}, \vec{y})[\vec{x}/\vec{c}]$ then
      Let $t = R(\vec{c}, \vec{d})$
    else
      Let $t$ be any tuple of $I$ such that $t = R(\vec{c}, \vec{d})$, for some $\vec{d}$
    end if
    Add $t$ to $\mathcal{M}$
  end for
else
  /* $\phi$ has more than one literal */
  Let $S_1, \ldots, S_m$ be the children of $R$ in $T$
  for $j := 1$ to $m$ do
    Let $T_j$ be the subtree of $T$ whose root is $S_j$
    Let $\phi_j$ be the conjunction of literals of $T_j$
    Let $\vec{w}_j = \{w : w$ is a variable that occurs in $\phi_j$ and $\vec{w}$, and $w \notin \vec{x}_j\}$
    Let $q_j(\vec{x}_j) = \exists \vec{w}_j.\, \phi_j(\vec{x}_j, \vec{w}_j)$
    Let $\mathcal{M}_j$ = BuildPessimisticRepair($q_j$, $I$)
    Add $\mathcal{M}_j$ to $\mathcal{M}$
  end for
  for each $\vec{c}$ such that there is some $R(\vec{c}, \vec{d})$ in $I$ do
    if there is some $\vec{d}$, some $j$, some valuation $\theta$ for the variables of $\vec{y}$,
       and some $\vec{c}_j$ such that $R(\vec{c}, \vec{d}) \in I$, $\theta(\vec{y}) = \vec{d}$, $\theta(\vec{x}_j) = \vec{c}_j$, and
       $\mathcal{M}_j \not\models q_j(\vec{x}_j)[\vec{x}_j/\vec{c}_j]$ then
      Let $t = R(\vec{c}, \vec{d})$
    else
      Let $t$ be any tuple of $I$ such that $t = R(\vec{c}, \vec{d})$, for some $\vec{d}$
    end if
    Add $t$ to $\mathcal{M}$
  end for
end if
Figure 3.6: Algorithm to construct a pessimistic repair
Assume towards a contradiction that consistent
in 1 and some
d
such that
= R(c,
d
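). By Lemma
3.9, none of the variables of $\vec{w}_i$ appear in $\vec{w}_j$, for every $i$ and $j$ such that $i \neq j$, $1 \leq i \leq m$,
$1 \leq j \leq m$. Thus, there is some $j$, some valuation $\theta$ for the variables of $\vec{y}$, and some tuple
of values $\vec{c}_j$ such that $1 \leq j \leq m$, $\mathcal{R} \not\models q_j(\vec{x}_j)[\vec{x}_j/\vec{c}_j]$, $\theta(\vec{y}) = \vec{d}'$, and $\theta(\vec{x}_j) = \vec{c}_j$. Thus,
$consistent_\Sigma(q_j(\vec{x}_j)[\vec{x}_j/\vec{c}_j], I) = false$. By inductive hypothesis, $\mathcal{M}_j \not\models q_j(\vec{x}_j)[\vec{x}_j/\vec{c}_j]$.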
Since $\mathcal{M}_j \models q_j(\vec{x}_j)[\vec{x}_j/\vec{c}_j]$, the algorithm never selects
$t$ in the construction of $\mathcal{M}$. But
$t \in \mathcal{M}$; contradiction.
3.3.4 Correctness of RewriteLocal
We now give a correctness proof of RewriteLocal, the module of the algorithm that
handles atomic queries, that is, queries with a single literal (and hence no joins). These
atomic queries may have arbitrary selections and projections on any subset of the nonkey
attributes (more precisely, any of the nonkey attributes may be projected out of the
query result). We consider here only equality selections, but it is quite easy to see how to
extend the algorithm and the proof to more general selection conditions (including not
only inequalities, but also arbitrary first-order expressions relating the variables of the
literal).
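Lemma 3.11. Let $q(\vec{x}, \vec{z})$ be a query of the form $\exists \vec{w}.\, R(\vec{x}, \vec{y})$. Let $I$ be a database instance.
Let $Q_{local}(\vec{x}, \vec{z})$ be the first-order query returned by RewriteLocal($q$, $\Sigma$).
Then, $(\vec{c}, \vec{t}) \in Q_{local}(I)$ iff $(\vec{c}, \vec{t}) \in consistent_\Sigma(q, I)$.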
Proof. () Assume that I [= Q
local
(x, z)[x/c][z/
( w.R(x, y)[x/c][z/
) in 1.
Following the construction of Q
local
in RewriteLocal, let be an injective function
that maps natural numbers to variables not present in R. Let y
be a vector of variables
of the same arity as y and such that if z is at position p of y
) =
d,
(x) = c,
and
) =
d
.
Since R(c,
d) [= w.R(x, y)[x/c][z/
t] and R(c,
t], there
is some variable z at some position p of
y
such that
1. (z) ,=
, and a position p
of y, p ,= p
, z
= (p
), and
(z) ,=
(z
).
Assume (1) that there is a constant d at position p in y. Since
R(c,
d) [= w.R(x, y)[x/c][z/
such that d ,= d
and
(z) = d
.R(x, y
) z = d. Since 1 I, R(c,
) I.
Thus, R(c,
) [= y
.R(x, y
) z = d. Therefore,
(z) = d; contradiction.
Assume (2) that there is some variable w such that w occurs at position p of y,
and w occurs in either x or in z. Let c = (w). Since R(c,
d) [= w.R(x,
y
)[x/c][z/
t],
(z) = c. Since (z) ,=
(z),
.R(x, y
) z = w[w/c]. Since 1 I,
R(c,
) I. Thus, R(c,
) [= y
.R(x, y
) z = w[w/c]. Therefore,
(z) = c;
contradiction.
Assume (3) that there are variables w and z
, and a position p
of y, p ,= p
, z
= (p
), and
(z) ,=
(z
).
Notice in the algorithm RewriteLocal that since I [= Q
local
(x, z)[x/c][z/
.R(x, y
) z = z
. Since 1 I, R(c,
) I. Thus, R(c,
) [= y
.R(x, y
)
z = z
. Therefore,
(z) =
(z
); contradiction.
() Assume that consistent
(q(x, z)[x/c][z/
t]; or
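2. there is a constant $d$ at position $p$ in $\vec{y}$ and a variable $z$ such that $z = \nu(p)$ and
$I \not\models \forall \vec{y}'.\, R(\vec{x}, \vec{y}') \rightarrow z = d\ [\vec{x}/\vec{c}][\vec{z}/\vec{t}]$; or
3. there is some variable $w$ such that $w$ occurs at position $p$ of $\vec{y}$, $w$ occurs in either
$\vec{x}$ or $\vec{z}$, and $I \not\models \forall \vec{y}'.\, R(\vec{x}, \vec{y}') \rightarrow z = w\ [\vec{x}/\vec{c}][\vec{z}/\vec{t}]$; or
4. there is some variable $w$ that occurs at position $p$ of $\vec{y}$, and at a position $p'$ of $\vec{y}$
such that $p \neq p'$, $\nu(p) = z$, $\nu(p') = z'$ and $I \not\models \forall \vec{y}'.\, R(\vec{x}, \vec{y}') \rightarrow z = z'\ [\vec{x}/\vec{c}][\vec{z}/\vec{t}]$.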
Assume that I ,[= w.R(x, y)[x/c][z/
t]; contradiction.
Suppose that I [= w.R(x, y)[x/c][z/
.R(x, y
) z =
d[x/c][z/
.R(x, y
)
z = d[x/c][z/
) be a
tuple of 1 such that R(c,
) [= w.R(x, y)[x/c][z/
. Therefore, R(c,
d) [= w.R(x, y)[x/c][z/
t];
contradiction.
Suppose that I [= w.R(x, y)[x/c][z/
.R(x, y
) z = w[x/c][z/
.R(x, y
) z = w[x/c][z/
) be
a tuple of 1 such that R(c,
) [= w.R(x, y)[x/c][z/
. Therefore, R(c,
d) [= w.R(x, y)[x/c][z/
t];
contradiction.
Suppose that I [= w.R(x, y)[x/c][z/
of y such that p ,= p
,
(p) = z, (p
) = z
and I ,[= y
.R(x, y
) z = z
[x/c][z/
.R(x, y
) z = z
[x/c][z/
t]. Let
be a valuation for the variables of
y
such that (
) =
d. Then, there are con-
stants d and e at the respective positions p and p
of
d such that d ,= e. Thus,
R(c,
d) ,[= w.R(x, y)[x/c][z/
) [= w.R(x, y)[x/c][z/
. Therefore, R(c,
d) [= w.R(x, y)[x/c][z/
t]; contradiction.
3.3.5 Correctness of RewriteTree
Consider a Boolean query q = x, w.(x, w) and a query q
have the same literals, but some of the (existentially-quantied) variables of q are
free in q
(/), then
(c) consistent
(q
, I). We also argued that this fact implies that, in order to check
whether consistent
(q
t) Q(I) i (c,
t) consistent
(q, I).
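Proof. The proof is by induction on the number of literals of $q$.
Base case. Assume that $q$ has exactly one literal. Then, $q(\vec{x}, \vec{z}) = \exists \vec{w}.\, R(\vec{x}, \vec{y})$,
and $Q$ = RewriteLocal($q$, $\Sigma$). By Lemma 3.11, we have that $I \models Q(\vec{x}, \vec{z})[\vec{x}/\vec{c}][\vec{z}/\vec{t}]$
iff $consistent_\Sigma(q(\vec{x}, \vec{z})[\vec{x}/\vec{c}][\vec{z}/\vec{t}], I) = true$.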
() Notice in the algorithm RewriteLocal that, since I [= Q
local
[],
I [= y
1
, . . . , y
m
.R(x, y)[]. Let c = (x). Then, there exists some
d such that R(c,
d) I.
Let 1 be a repair of I. By Proposition 3.7, there is some
d
) 1.
Assume that there are no constants in y. Since all the variables of y are existentially
quantied in q
T
, R(c,
d
) [= q
T
[], and we are done.
Assume that there is some constant in y. Since all the variables of y are existentially
quantied in q
T
, in order to show that R(c,
d
) [= q
T
[], it suces to show that
d
and
y coincide in their constants. By Proposition 3.6, 1 I. Thus, R(c,
d
) I. Since
I [= Q
local
[] and R(c,
d
) I, we have that [= Q
const
[y
, then [= E
i
[w
i
/e], where w
i
is the variable created
in RewriteLocal for the i-th position of y. By construction of E
i
, this means that there
is a constant e at position i of y.
() Let 1 be a repair of I. Let c = (x). Since 1 [= q
T
[], there exists
d such that
R(c,
d) 1. By Proposition 3.6, 1 I. Therefore, there exists
d
) I.
Thus, I [= y
1
, . . . , y
m
.R(x, y)[].
Assume that there is some constant in y. Let
y
be a valuation for the variables
of y
, where y
d =
y
(y
). If R(c,
d) , I, then I [= R(x, y
) Q
const
[][
y
] because the left-hand side
of the implication is not satised. Assume R(c,
d) I. By Proposition 3.8, there exists
a repair 1 of I such that R(c,
d) 1. Since 1 [= , if R(c,
d
) 1, then
d
=
d. Since
1 [= q
T
[], R(c,
d) [= q
T
[]. Therefore, if d is a constant that appears at position i in
y, then d occurs at position i in
d. Thus, I [= Q
const
[][
y
].
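Inductive step. Let $R_1(\vec{x}_1, \vec{y}_1), \ldots, R_m(\vec{x}_m, \vec{y}_m)$ be the children of $R$ in $T$. Assume
that $q$ is of the form $\exists \vec{w}.\, \phi(\vec{w}, \vec{z})$, where $\phi$ is a conjunction of literals. For each $1 \leq i \leq m$,
let $T_i$ be the tree whose root is $R_i$. Let $\phi_i$ be the conjunction of the literals of $T_i$. Let
$\vec{w}_i = \{w : w$ is a variable that occurs in $\phi_i$ and $\vec{w}$, and $w \notin \vec{x}_i\}$. Let $\vec{z}_i = \{z : z$
is a variable that occurs in $\phi_i$ and $\vec{z}$, and $z \notin \vec{x}_i\}$. Let $q_i(\vec{x}_i, \vec{z}_i) = \exists \vec{w}_i.\, \phi_i(\vec{x}_i, \vec{w}_i, \vec{z}_i)$.
Let $Q_i(\vec{x}_i, \vec{z}_i)$ = RewriteTree($q_i$, $\Sigma$). Let $q_{local}(\vec{x}, \vec{z}) = \exists \vec{w}.\, R(\vec{x}, \vec{y})$. Let $Q_{local}(\vec{x}, \vec{z})$ =
RewriteLocal($q_{local}$, $\Sigma$).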
() Assume that I [= Q(x, z)[x/c][z/
t, and
3. I [= Q
local
(x, z)[], and
4. for every i such that 1 i m, there are c
i
and
t
i
such that (x
i
) =c
i
, (z
i
) =
t
i
,
and I [= Q
i
(x
i
, z
i
)[x
i
/c
i
][z
i
/
t
i
]
Let 1 be a repair of I. Assume towards a contradiction that 1 ,[= w.R(x, y)[x/c][z/
t].
Then, consistent
( w.R(x, y)[x/c][z/
t
i
)
for some i such that 1 i m. Thus, consistent
(q
i
(x
i
, z
i
)[x
i
/c
i
][z
i
/
t
i
], I) = false.
By inductive hypothesis, I ,[= Q
i
(x
i
, z
i
)[x
i
/c
i
][z
i
/
t
i
]; contradiction.
() Assume that consistent
(q(x, z)[x/c][z/
( w.R(x, y)[x/c][z/
t], I) =
false. Thus, it is the case that consistent
(q(x, z)[x/c][z/
(q
i
(x
i
, z
i
)[x
i
/c
i
][z
i
/
t
i
], I) = false. Thus, it is the case that
consistent
(q(x, z)[x/c][z/
i=1...m
Q
i
(x
i
, z
i
)), where $\phi(\vec{w}, \vec{z})$ is the conjunction of literals of
the original query $q$, and the variables of each $\vec{x}_i$ are in $\vec{w}$. The correctness of this formula
relies on the structural property of Section 3.3.2 and the notion of a pessimistic repair of
Section 3.3.3. First, by Lemma 3.10, it suffices to find one instantiation for the variables
of each $\vec{x}_i$. Thus, the variables of $\vec{x}_i$ can be free in $Q_i$. Second, the subqueries do not
share existentially quantified variables. This is ensured by the structural property proved
in Lemma 3.9.
Theorem 3.5. Let R be a schema. Let $\Sigma$ be a set of integrity constraints, consisting
of one key dependency per relation of R. Let $q(\vec{z})$ be a conjunctive query over R such
that $q \in \mathcal{C}_{forest}$. Let $Q(\vec{z})$ be the first-order query returned by RewriteForest($q$, $\Sigma$). Let
$I$ be an instance over R.
Then, $\vec{t} \in Q(I)$ iff $\vec{t} \in consistent_\Sigma(q, I)$.
Proof. Let $G$ be the join graph of $q$. Since $q \in \mathcal{C}_{forest}$, $G$ is a forest. Let $T_1, \ldots, T_m$ be
the connected components (trees) of $G$. Assume that $q$ is of the form $\exists \vec{w}.\, \phi(\vec{w}, \vec{z})$, where
$\phi$ is a conjunction of literals. For each $1 \leq i \leq m$, let $R_i(\vec{x}_i, \vec{y}_i)$ be the literal at the root
of $T_i$. Let $\phi_i$ be the conjunction of the literals of $T_i$. Let $\vec{w}_i = \{w : w$ is a variable that
occurs in $\phi_i$ and $\vec{w}$, and $w \notin \vec{x}_i\}$. Let $\vec{z}_i = \{z : z$ is a variable that occurs in $\phi_i$ and $\vec{z}$,
and $z \notin \vec{x}_i\}$. Let $q_i(\vec{x}_i, \vec{z}_i) = \exists \vec{w}_i.\, \phi_i(\vec{x}_i, \vec{w}_i, \vec{z}_i)$. Let $Q_i(\vec{x}_i, \vec{z}_i)$ = RewriteTree($q_i$, $\Sigma$).
() Assume that I [= Q(z)[z/
t, and
2. I [= ( w, z)[], and
3. for every i such that 1 i m, there are c
i
and
t
i
such that (x
i
) =c
i
, (z
i
) =
t
i
,
and I [= Q
i
(x
i
, z
i
)[x
i
/c
i
][z
i
/
t
i
]
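Let $\mathcal{R}$ be a repair of $I$. Assume towards a contradiction that $\mathcal{R} \not\models q[\vec{z}/\vec{t}]$. Thus,
$\mathcal{R} \not\models q[\theta]$. By Lemma 3.9, none of the variables of $\vec{w}_i$ appear in $\vec{w}_j$, for every $i$ and $j$
such that $i \neq j$, $1 \leq i \leq m$, $1 \leq j \leq m$. Then, $\mathcal{R} \not\models q_i(\vec{x}_i, \vec{z}_i)[\vec{x}_i/\vec{c}_i][\vec{z}_i/\vec{t}_i]$ for some $i$ such
that $1 \leq i \leq m$. Thus, $consistent_\Sigma(q_i(\vec{x}_i, \vec{z}_i)[\vec{x}_i/\vec{c}_i][\vec{z}_i/\vec{t}_i], I) = false$. By Lemma 3.12,
$I \not\models Q_i(\vec{x}_i, \vec{z}_i)[\vec{x}_i/\vec{c}_i][\vec{z}_i/\vec{t}_i]$; contradiction.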
() Assume that
t consistent
(q
i
(x
i
)[x
i
/c
i
], I
i
) = true. We add all the tuples of
/
i
to /.
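We now show that $\mathcal{M} \not\models q(\vec{z})[\theta]$. Assume that $I \not\models q(\vec{z})[\theta]$. Since $\mathcal{M} \subseteq I$, $\mathcal{M} \not\models q(\vec{z})[\theta]$.
Now, assume that there is some $i$ such that $1 \leq i \leq m$ and $I \not\models Q_i(\vec{x}_i, \vec{z}_i)[\theta]$. By
Lemma 3.12, $consistent_\Sigma(q_i(\vec{x}_i, \vec{z}_i)[\theta], I) = false$. By Lemma 3.10, $\mathcal{M}_i \not\models q_i(\vec{x}_i, \vec{z}_i)[\theta]$.
Thus, $\mathcal{M} \not\models q(\vec{z})[\theta]$.
So, for every valuation $\theta$ such that $\theta(\vec{z}) = \vec{t}$, we have that $\mathcal{M} \not\models q(\vec{z})[\theta]$. Thus,
$\vec{t} \notin consistent_\Sigma(q, I)$.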
( w, z)
group by z
where q
(i.e., all the satisfying assignments for the grouping variables z). Second, for
each group a (i.e., for each instantiation of the grouping variables z), we obtain the bag
of tuples
a
that satisfy q
( w, a), and
a
= (c, a) : (c, a) q
(I), and
for every i such that 1 i m, b
i
= F
i
(B
i,a
), where B
i,a
is the bag obtained by
taking each tuple (c, a) of
a
and projecting on the aggregation variables v
i
.
We now define the language of conjunctive aggregate queries as a subset of first-order
aggregate queries. A conjunctive aggregate query is a formula of the form
select $\vec{z}$, $F_1(\vec{v}_1), \ldots, F_m(\vec{v}_m)$
from $q'(\vec{w}, \vec{z})$
group by $\vec{z}$
where $q'(\vec{w}, \vec{z})$ is a conjunctive query, $\vec{v}_1, \ldots, \vec{v}_m$ are vectors of variables from $\vec{w}$, and
$F_1, \ldots, F_m$ are aggregation functions of the arities of $\vec{v}_1, \ldots, \vec{v}_m$. We will say that $\vec{z}$ are
the grouping variables, and $\vec{v}_1, \ldots, \vec{v}_m$ are the aggregation variables. The semantics is the
same as for first-order aggregate queries.
As with first-order aggregate queries, the language of conjunctive aggregate queries is
influenced by previous proposals. In particular, it corresponds closely to the language
presented by Cohen, Nutt and Serebrenik [CNS99], except that we use a SQL-like syntax
instead of a Datalog syntax. It is also related to the language of real conjunctive queries
(conjunctive queries with bag semantics) introduced by Chaudhuri and Vardi [CV93],
and the class of conjunctive queries with label systems representing multisets presented
by Ioannidis and Ramakrishnan [IR95]. In the latter two cases, tuples are returned
together with their multiplicity. This can be obtained in our conjunctive aggregate queries
by using the aggregation function count(*).
4.2 Algorithms
In this section, we present query rewriting algorithms under the $aggconsistent_\Sigma$
semantics for a class of queries that extends the class $\mathcal{C}_{forest}$ of the previous chapter with
operators for grouping and aggregation. In Section 4.2.1, we present the rewriting
algorithm for queries with bag semantics (i.e., the count(*) operator), and in Section 4.2.2
we present the algorithm for queries with the unary aggregation functions sum, min, and
max.
4.2.1 Queries with Bag Semantics
In this subsection, we give a query rewriting algorithm for conjunctive queries with bag
semantics (i.e., the count(*) operator). We start with an example, and then give the
general algorithm. The example illustrates how we can build upon the results of the
previous chapter for rewriting conjunctive queries under set-theoretic semantics.
Example 4.1. Let R be a schema with one relation symbol employee. Assume that employee
has two attributes: emplKey (the name of the employee) and salary. Let $\Sigma$ be a set that
consists of only one constraint stating that emplKey is the key of relation employee.
Consider the following query $q_1$, which counts the number of occurrences of each
salary (it corresponds to query $q_3$ of Example 2.1).
$q_1(s, v)$: select s, count(*) as v
from employee(e, s)
group by s
Let $I$ be a database instance such that $I = \{employee(John, 1000), employee(John, 2000),$
$employee(Mary, 1000), employee(Ali, 1000)\}$. There are two repairs of $I$ with respect to
$\Sigma$: $\mathcal{R}_1 = \{employee(John, 1000), employee(Mary, 1000), employee(Ali, 1000)\}$ and
$\mathcal{R}_2 = \{employee(John, 2000), employee(Mary, 1000), employee(Ali, 1000)\}$. Furthermore,
$q_1(\mathcal{R}_1) = \{(1000, 3)\}$ and $q_1(\mathcal{R}_2) = \{(1000, 2), (2000, 1)\}$. By Definition 2.4,
$aggconsistent_\Sigma(q_1, I) = \{(1000, 2, 3)\}$. That is, the salary 1000 is an answer that appears
at least twice and at most three times in the result of applying $q_1$ on the repairs.
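The bounds in this example can be reproduced by brute force. The sketch below assumes the reading of the range semantics used in the example: a group is reported only if it occurs in the result on every repair, together with the smallest and largest count observed across repairs. The function names are illustrative, and enumerating repairs is feasible only for toy instances.

```python
from collections import Counter
from itertools import product

I = [("John", 1000), ("John", 2000), ("Mary", 1000), ("Ali", 1000)]


def repairs(instance):
    """One tuple per employee name (the key)."""
    groups = {}
    for name, salary in instance:
        groups.setdefault(name, []).append((name, salary))
    return [list(choice) for choice in product(*groups.values())]


def q1(db):
    """select s, count(*) from employee(e, s) group by s"""
    return Counter(salary for (_, salary) in db)


def range_consistent(instance):
    """Map each salary present in every repair to its (glb, lub) of counts."""
    results = [q1(r) for r in repairs(instance)]
    common = set.intersection(*(set(r) for r in results))
    return {
        s: (min(r[s] for r in results), max(r[s] for r in results))
        for s in common
    }


bounds = range_consistent(I)
print(bounds)
```

Salary 2000 is dropped because it is missing from the result on the repair that keeps John's 1000 tuple.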
Let us focus on obtaining the greatest lower bound for $q_1$. From the previous chapter,
we know how to obtain consistent answers for conjunctive queries without aggregation
under set-theoretic semantics. We would like to reuse such results here. An obvious
strategy (shown to be incorrect shortly) is to first remove grouping and aggregation
from $q_1$, obtain the consistent answers under set-theoretic semantics, and finally apply
grouping and aggregation to the intermediate result. That is, first compute the consistent
answers for the following query $q'_1(s)$:
select s
from employee(e, s)
We can express $q'_1$ in conjunctive query notation as follows: $q'_1(s) = \exists e.\, employee(e, s)$.
Let QConsistent(s) be the rewriting returned by RewriteForest($q'_1$, $\Sigma$),
the algorithm introduced in the previous chapter. Suppose that we now apply the operator
count(*) to the result of QConsistent(s) as follows:
select s, count(*)
from QConsistent(s)
group by s
It is easy to see that this strategy leads to a wrong result. Since the result of the
consistent answers to $q'_1$ ($consistent_\Sigma(q'_1, I)$) is $\{(1000)\}$, we would incorrectly conclude
that the greatest lower bound for 1000 is one, when in fact it is two. Clearly, the cause
for the incorrect result is that cardinalities are lost in the set-theoretic consistent answers
that we computed as an intermediate step. But is there any way of obtaining the correct
bounds for the aggregate query, and yet be able to reuse the notion of set-theoretic
consistent answers as an intermediate step? The answer is positive: we can use a "root
key value at a time" principle. In this case, this corresponds to making the variable $e$
(for employee name) free because it is at the key position of employee(e, s), the literal
at the root (and only node) of $q_1$. We will obtain the consistent answer one employee
at a time in the intermediate result, and then project out the employees (since they
are not retrieved by $q_1$). The intermediate result will be guaranteed to have the correct
cardinalities despite the fact that it is obtained using set semantics. The intuitive reason
is that repairs are sets of tuples that satisfy the key constraints, and hence every employee
name appears exactly once in each repair.
Following the previous discussion, let $q''_1$ be the query $q'_1$, where the variable $e$ is made
free. That is, let $q''_1(e, s) = employee(e, s)$. The set-theoretic consistent answers for $q''_1$ are
$consistent_\Sigma(q''_1, I) = \{(Mary, 1000), (Ali, 1000)\}$. We can now project out the employee
names and count the number of occurrences of salary 1000, arriving at the correct lower
bound for count(*) in $q_1$.
Let us now turn our attention to the computation of the lowest upper bound of $q_1$.
Since $aggconsistent_\Sigma(q_1, I) = \{(1000, 2, 3)\}$, the salary 1000 is an answer that appears
at most three times in the results of applying $q_1$ to the repairs. We can use $q''_1(e, s) =
employee(e, s)$ to obtain the lowest upper bound of salary 1000 as follows:
select s, count(*) as lub
from $q''_1(e, s)$
group by s
However, this query also retrieves the tuple (2000, 1), which should not be in the result
of $aggconsistent_\Sigma(q_1, I)$ because the salary 2000 does not appear in $q_1(\mathcal{R}_1)$. This means
that we must make sure that the values for the grouping variables are in the consistent
answers for $q'_1$. We can do this by employing the first-order rewriting QConsistent(e, s)
of query $\hat{q}_1$, which can be obtained by invoking the algorithm RewriteForest. Now, we
can rule out 2000 from the final result because there is no tuple for salary 2000 in the
result of QConsistent(e, s). This can be achieved with the following query:
    select s, count(*) as lub
    from $employee(e, s) \wedge \exists e'.\, QConsistent(e', s)$
    group by s
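The two bounds of this example can be reproduced with a small script. The sketch below is illustrative only: it uses Python's sqlite3 module, a four-tuple employee instance chosen to agree with the numbers quoted in the text (the instance itself is an assumption), and a hand-written QConsistent for this one-literal query standing in for the output of RewriteForest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (name text, salary integer)")
# Hypothetical instance: 'name' is the key, and John violates it.
conn.executemany(
    "insert into employee values (?, ?)",
    [("Mary", 1000), ("Ali", 1000), ("John", 1000), ("John", 2000)],
)

# QConsistent(e, s): tuples whose key value is associated with a single salary.
Q_CONSISTENT = """
select distinct e1.name, e1.salary
from employee e1
where not exists (select 1 from employee e2
                  where e2.name = e1.name and e2.salary <> e1.salary)
"""

# Greatest lower bound: count over the consistent answers only.
glb = dict(conn.execute(
    f"select salary, count(*) from ({Q_CONSISTENT}) group by salary"))

# Lowest upper bound: count over all (name, salary) pairs, keeping only
# salaries that occur in some consistent answer (exists e'. QConsistent(e', s)).
lub = dict(conn.execute(f"""
select salary, count(*)
from (select distinct name, salary from employee) e
where exists (select 1 from ({Q_CONSISTENT}) c where c.salary = e.salary)
group by salary"""))

ranges = {s: (glb[s], lub[s]) for s in glb}
print(ranges)
```

On this instance the salary 2000 is filtered out of the upper-bound query because no consistent answer carries it, which is the role of the existential check in the rewriting.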
Query Rewriting Algorithm
In Figure 4.1, we give the rewriting algorithm for aggregate conjunctive queries with
the count(*) aggregation function. The algorithm works for queries q of the form
    select $\vec{z}$, count(*)
    from $q'(\vec{z})$
    group by $\vec{z}$
where $q'$ is a conjunctive query in $\mathcal{C}_{forest}$. The reason for requiring $q'$ to be in
$\mathcal{C}_{forest}$ is that, as we motivated in the previous example, we would like to build upon
the results for first-order rewriting of conjunctive queries under set-theoretic semantics.
In the previous chapter, we showed how to obtain such rewritings for the conjunctive
queries in class $\mathcal{C}_{forest}$.
By definition, the join graph of all queries in $\mathcal{C}_{forest}$ is a forest. We can then instantiate
the values for the key attributes at each root literal of the join graph of $q'$, using the
"root key value at a time" strategy that we illustrated in the previous example. More
precisely, let G be the join graph of $q'$. We construct a query $\hat{q}$ that has the same
literals as $q'$, but all the variables that are at the key of some root of G are
free in $\hat{q}$.
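Following the algorithm, let $R_1(\vec{x}_1, \vec{y}_1), \ldots, R_m(\vec{x}_m, \vec{y}_m)$ be the literals at the roots
of all trees in G, and let $\phi(\vec{w}, \vec{z})$ be the conjunction of literals of $q'$. Let
$\vec{x} = \bigcup_{i=1 \ldots m} \vec{x}_i$, let $\vec{z}' = \vec{z} \setminus \vec{x}$, and let $\vec{w}' = \vec{w} \setminus \vec{x}$. We define $\hat{q}$ as
$\hat{q}(\vec{x}, \vec{z}') = \exists \vec{w}'.\, \phi(\vec{x}, \vec{w}', \vec{z}')$. The advantage of query $\hat{q}$ is that since the variables
at the key of all root literals are free, each tuple appears exactly once in the answer to $\hat{q}$.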
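We can exploit this fact by computing the set-theoretic consistent answers for $\hat{q}$ as an
intermediate result towards producing the consistent answers to the aggregate query q.
The first-order query rewriting QConsistent for $\hat{q}$ is obtained by invoking the algorithm
RewriteForest. The greatest lower bound is then computed by the following query:
    QGlb($\vec{z}$, low) = select $\vec{z}$, count(*)
                      from QConsistent($\vec{x}, \vec{z}'$)
                      group by $\vec{z}$
Notice that the free variables of QConsistent, $\vec{x}$ and $\vec{z}'$, cover all the grouping
variables $\vec{z}$. The lowest upper bound is obtained by counting the tuples of
$\hat{q}(\vec{x}, \vec{z}')$ and checking that some instantiation of the grouping variables of $\vec{z}$ appears in the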
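RewriteCount(q, $\Sigma$)
Input: A query q of the form
    select $\vec{z}$, count(*)
    from $q'(\vec{z})$
    group by $\vec{z}$
  where $q'$ is a conjunctive query in $\mathcal{C}_{forest}$;
  $\Sigma$, a set of key constraints (one per relation)
Output: Q, an aggregate first-order query that computes $aggconsistent_\Sigma(q, I)$
  for every database I
Let G be the join graph of q
Let $R_1(\vec{x}_1, \vec{y}_1), \ldots, R_m(\vec{x}_m, \vec{y}_m)$ be the literals at the roots of all trees of G
Let $\vec{x} = \bigcup_{i=1 \ldots m} \vec{x}_i$
Let $\vec{z}' = \vec{z} \setminus \vec{x}$
Let $\phi(\vec{w}, \vec{z})$ be the conjunction of literals of $q'$
Let $\vec{w}' = \vec{w} \setminus \vec{x}$
Let $\hat{q}(\vec{x}, \vec{z}') = \exists \vec{w}'.\, \phi(\vec{x}, \vec{w}', \vec{z}')$
Let QConsistent($\vec{x}, \vec{z}'$) = RewriteForest($\hat{q}$, $\Sigma$)
Let QGlb($\vec{z}$, low) = select $\vec{z}$, count(*)
                      from QConsistent($\vec{x}, \vec{z}'$)
                      group by $\vec{z}$
Let $\vec{x}' = \vec{x} \setminus \vec{z}$
Let QLub($\vec{z}$, up) = select $\vec{z}$, count(*)
                     from $\hat{q}(\vec{x}, \vec{z}') \wedge (\exists \vec{x}'.\, QConsistent(\vec{x}, \vec{z}'))$
                     group by $\vec{z}$
Let Q($\vec{z}$, low, up) = QGlb($\vec{z}$, low) $\wedge$ QLub($\vec{z}$, up)
return Q
Figure 4.1: Query rewriting algorithm for queries with count(*).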
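consistent answers of $\hat{q}$. This check is expressed by the formula
$\exists \vec{x}'.\, QConsistent(\vec{x}, \vec{z}')$, where $\vec{x}' = \vec{x} \setminus \vec{z}$. The lowest upper bound is then
computed by the following query:
    QLub($\vec{z}$, up) = select $\vec{z}$, count(*)
                     from $\hat{q}(\vec{x}, \vec{z}') \wedge (\exists \vec{x}'.\, QConsistent(\vec{x}, \vec{z}'))$
                     group by $\vec{z}$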
4.2.2 Queries with the sum, min, and max Functions
In Figure 4.2, we present the query rewriting algorithm for queries with the sum, min,
and max aggregation functions. The main difference with the rewritings produced by
RewriteCount is that aggregation is performed here in two levels. At the inner level of
the rewriting, we aggregate the values for u (the value that is aggregated in the original
query), and we group by the key-root attributes (vector $\vec{x}$ in the figure). We then project
out the key-root attributes that are not in the select clause of the input query, and
apply the aggregation function of the input query.
For example, the greatest lower bound of the max function is computed as follows:
    QGlb($\vec{z}$, low) =
      select $\vec{z}$, max(bottom)
      from
        select $\vec{x}, \vec{z}'$, min(u) as bottom
        from $QConsistent(\vec{x}, \vec{z}') \wedge \tilde{q}(\vec{x}, \vec{z}', u)$
        group by $\vec{x}, \vec{z}'$
      group by $\vec{z}$
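The two-level aggregation can be tried out on a toy instance. The following sketch is an illustration under stated assumptions, not the thesis implementation: the relation emp(name, dept, salary) with key name, the five tuples, and the hand-written QConsistent (which plays the role of RewriteForest for this single-literal query) are all hypothetical. The query being rewritten is "select dept, max(salary) from emp group by dept".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (name text, dept text, salary integer)")
# Hypothetical instance with key 'name': Mary has two salaries in the same
# department, John appears in two departments, Ali is consistent.
conn.executemany("insert into emp values (?, ?, ?)", [
    ("Mary", "A", 1000), ("Mary", "A", 1500),
    ("John", "A", 2000), ("John", "B", 900),
    ("Ali", "B", 800),
])

# QConsistent(name, dept): the department of 'name' is the same in every repair.
Q_CONSISTENT = """
select distinct e1.name, e1.dept
from emp e1
where not exists (select 1 from emp e2
                  where e2.name = e1.name and e2.dept <> e1.dept)
"""

# Lower bound of max: inner level takes min(u) per consistent key-root value,
# outer level applies max per group.
glb = dict(conn.execute(f"""
select dept, max(bottom) from (
    select e.name, e.dept, min(e.salary) as bottom
    from emp e join ({Q_CONSISTENT}) c
      on c.name = e.name and c.dept = e.dept
    group by e.name, e.dept)
group by dept"""))

# Upper bound of max: inner level takes max(u) per possible key-root value,
# restricted to groups that have at least one consistent answer.
lub = dict(conn.execute(f"""
select dept, max(top) from (
    select e.name, e.dept, max(e.salary) as top
    from emp e
    where exists (select 1 from ({Q_CONSISTENT}) c where c.dept = e.dept)
    group by e.name, e.dept)
group by dept"""))

max_ranges = {d: (glb[d], lub[d]) for d in glb}
print(max_ranges)
```

For department A, for instance, the lower bound comes from the repair that keeps (Mary, A, 1000) and (John, B, 900), and the upper bound from any repair that keeps (John, A, 2000).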
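Notice that, as in RewriteCount, the lower bound is obtained by selecting tuples from
QConsistent($\vec{x}, \vec{z}'$).
RewriteAgg(q, $\Sigma$)
Input: A query q of the form
    select $\vec{z}$, F(u)
    from $q'(\vec{z}, u)$
    group by $\vec{z}$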
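  where $q'$ is a conjunctive query in $\mathcal{C}_{forest}$;
  $\Sigma$, a set of key constraints (one per relation)
Output: Q, an aggregate first-order query that computes $aggconsistent_\Sigma(q, I)$
  for every database I
Let G be the join graph of q
Let $R_1(\vec{x}_1, \vec{y}_1), \ldots, R_m(\vec{x}_m, \vec{y}_m)$ be the literals at the roots of all trees of G
Let $\vec{x} = \bigcup_{i=1 \ldots m} \vec{x}_i$
Let $\vec{z}' = \vec{z} \setminus \vec{x}$
Let $\phi(\vec{w}, \vec{z}, u)$ be the conjunction of literals of $q'$
Let $\vec{w}' = \vec{w} \setminus \vec{x}$
Let $\hat{q}(\vec{x}, \vec{z}') = \exists \vec{w}', u.\, \phi(\vec{x}, \vec{w}', \vec{z}', u)$
Let QConsistent($\vec{x}, \vec{z}'$) = RewriteForest($\hat{q}$, $\Sigma$)
Let $\tilde{q}(\vec{x}, \vec{z}', u) = \exists \vec{w}'.\, \phi(\vec{x}, \vec{w}', \vec{z}', u)$
Let $\vec{x}' = \vec{x} \setminus (\vec{z} \cup \{u\})$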
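if the aggregate function is max then
    QGlb($\vec{z}$, low) =
      select $\vec{z}$, max(bottom)
      from
        select $\vec{x}, \vec{z}'$, min(u) as bottom
        from $QConsistent(\vec{x}, \vec{z}') \wedge \tilde{q}(\vec{x}, \vec{z}', u)$
        group by $\vec{x}, \vec{z}'$
      group by $\vec{z}$
    QLub($\vec{z}$, up) =
      select $\vec{z}$, max(top)
      from
        select $\vec{x}, \vec{z}'$, max(u) as top
        from $\tilde{q}(\vec{x}, \vec{z}', u) \wedge (\exists \vec{x}'.\, QConsistent(\vec{x}, \vec{z}'))$
        group by $\vec{x}, \vec{z}'$
      group by $\vec{z}$
endif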
if the aggregate function is sum then
QGlb(z, low) =
select z, sum(bottom)
from
select x,
z
, min(u) as bottom
from QConsistent(x,
z
) q
(x,
z
, u)
group by x,
z
having bottom 0
select x,
z
, min(u) as bottom
from q
(x,
z
, u) (
.QConsistent(x,
z
))
group by x,
z
, max(u) as top
from q
(x,
z
, u) (
.QConsistent(x,
z
))
group by x,
z
select x,
z
, max(u) as top
from QConsistent(x,
z
) q
(x,
z
, u)
group by x,
z
having top 0
group by z
endif
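if the aggregate function is min then
    QGlb($\vec{z}$, low) =
      select $\vec{z}$, min(bottom)
      from
        select $\vec{x}, \vec{z}'$, min(u) as bottom
        from $\tilde{q}(\vec{x}, \vec{z}', u) \wedge (\exists \vec{x}'.\, QConsistent(\vec{x}, \vec{z}'))$
        group by $\vec{x}, \vec{z}'$
      group by $\vec{z}$
    QLub($\vec{z}$, up) =
      select $\vec{z}$, min(top)
      from
        select $\vec{x}, \vec{z}'$, max(u) as top
        from $QConsistent(\vec{x}, \vec{z}') \wedge \tilde{q}(\vec{x}, \vec{z}', u)$
        group by $\vec{x}, \vec{z}'$
      group by $\vec{z}$
endif
Let Q($\vec{z}$, low, up) = QGlb($\vec{z}$, low) $\wedge$ QLub($\vec{z}$, up)
return Q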
Figure 4.2: Query rewriting algorithm for queries with aggregation
4.3 Correctness of the Algorithms
In this section, we prove the correctness of the query rewriting algorithms of this chapter.
We consider the following class of queries, which we call $\mathcal{C}_{aggforest}$.
Definition 4.1. Let q be an aggregate conjunctive query. We say that q is in class
$\mathcal{C}_{aggforest}$ if q is of the form
    select $\vec{z}$, [count(*) | F(u)]
    from $q'(\vec{z}, u)$
    group by $\vec{z}$
where $q'$ is a conjunctive query in $\mathcal{C}_{forest}$, and F is one of the aggregation functions
min, max or sum.
The main result of this section is the following theorem:
Theorem 4.2. Let R be a schema. Let $\Sigma$ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let $q(\vec{z}, v)$ be a query in $\mathcal{C}_{aggforest}$. Let $Q(\vec{z}, l, u)$
be the first-order aggregate query returned by RewriteCount(q, $\Sigma$) or RewriteAgg(q, $\Sigma$)
(depending on the aggregate function of the query).
Let I be an instance over R. If q has the aggregate function sum, assume that the
aggregated attribute ranges over positive numbers on I.
Then, for every tuple $\vec{t}$ and pair of real numbers low and up,
$(\vec{t}, low, up) \in aggconsistent_\Sigma(q, I)$ iff $(\vec{t}, low, up) \in Q(I)$.
), then S
= S.
Lemma 4.3. Let $q(\vec{z})$ be a query in $\mathcal{C}_{forest}$. Assume that the join graph T of q is a
tree, and that all the variables at key positions of the literal at the root of T are free in q
(that is, there is a literal $R(\vec{x}, \vec{y})$ at the root of T such that $\vec{x} \subseteq \vec{z}$). Let I be a database
instance over the schema of q, and $\Sigma$ be a set consisting of at most one key dependency
per relation of q. Let $\mathcal{R}$ be a repair of I wrt $\Sigma$. Let S and S
). Then, S
= S.
Proof. The proof is by induction on the number of literals of q.
Base case. Assume that q has exactly one literal. Assume towards a contradiction
that S ,= S
0
in 1 such that
t q(
t
0
) and
t q(
0
). Let R(x, y) be the only literal of q. Since all the variables at key positions of
the root literal of T are free, and z are the free variables of q, we have that x z. Thus,
there are vectors of values c,
d and
d
such that
d ,=
d
,
t
0
= R(c,
d), and
t
0
= R(c,
).
Thus, $\mathcal{R} \not\models \Sigma$. But $\mathcal{R}$ is a repair of I wrt $\Sigma$; contradiction.
Inductive step. Assume that q has more than one literal. Let R be a literal of q
that appears at a leaf of T (recall that T is a tree). Let
t
0
and
t
0
be tuples of S and S
,
respectively, such that
t
0
= R(c,
d) and
t
0
= R(
).
Let M be a set that consists of all the tuples of S, except the one for literal R.
Let M
,
respectively, that satisfy these conditions since S and S
and valuations
and
such that
t
1
S,
t
1
S
t
0
,
t
1
[= R
) R(x, y)[z/
t][], and
0
,
t
1
[=
R
) R(x, y)[z/
t][
]. Notice that (
) =
). Since q c
forest
, there is a full
nonkey-to-key join from R
appear in x. Therefore,
(x) =
(x); and c =
c
0
. Then, there are
tuples R(c,
d) and R(
) in 1 such that c =
c
and
d ,=
d
t[
B
= 1.
Proof. Assume towards a contradiction that [
t[
B
> 1. Then, there are distinct sets S and
S
that contain exactly one tuple per literal of q and such that
t q(S), and
t q(S
).
Since $q \in \mathcal{C}_{forest}$, G is a forest. For each $1 \le i \le m$, let $T_i$ be the tree whose root is $R_i$.
Let $\phi_i(\vec{w}, \vec{z})$ be the conjunction of the literals of $T_i$. Let $q_i(\vec{z}) = \exists \vec{w}.\, \phi_i(\vec{w}, \vec{z})$. Recall that
$\vec{x}_i$ (the variables at the key of the root literal of $T_i$) are free, and therefore occur in $\vec{z}$.
Thus, $q_i$ satisfies the conditions of Lemma 4.3.
Since S ,= S
,
t q(S), and
t q(S
such that M ,= M
, M S, M
, M and M
t possible
(q, I).
For a Boolean query q over R, we say that $possible_\Sigma(q, I) = false$ if
for every repair $\mathcal{R}$ of I with respect to $\Sigma$, $\mathcal{R} \not\models q$.
Lemma 4.6. Let $q(\vec{x})$ be a query in $\mathcal{C}_{forest}$, whose join graph T is a tree and where
$R(\vec{x}, \vec{y})$ is the literal at the root of T. Let I be an instance. Then, there is a repair $\mathcal{N}$
such that for all c if c possible
( w.R(c, y), I) = true, there is some repair 1 of I such that 1 [= w.R(c, y).
Thus, there is a tuple
t
such that
can be added
to ^ only during the iteration for the vector of values c. Since
in 1, some
d
= R(c,
d
), (y) =
d
j
such that 1 j m and (x
j
) = c
j
, we have that c
j
q
j
(1). Thus,
possible
(q
j
(c
j
), I) = true. By inductive hypothesis c
j
q
j
(^
j
). Thus, the algorithm
selects
t
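    Let $q_j(\vec{x}_j) = \exists \vec{w}_j.\, \phi_j(\vec{x}_j, \vec{w}_j)$
    Let $\mathcal{N}_j$ = BuildOptimisticRepair($q_j$, I)
    Add $\mathcal{N}_j$ to $\mathcal{N}$
  end for
  for each $\vec{c}$ such that there is some $R(\vec{c}, \vec{d})$ in I do
    if there is some $\vec{d}$ and some valuation $\nu$ for the variables of $\vec{y}$ such that $R(\vec{c}, \vec{d}) \in I$,
       $\nu(\vec{y}) = \vec{d}$, and there is no j and $\vec{c}_j$ such that $\nu(\vec{x}_j) = \vec{c}_j$ and $\vec{c}_j \notin q_j(\mathcal{N}_j)$ then
      Let $\vec{t} = R(\vec{c}, \vec{d})$
    else
      Let $\vec{t}$ be any tuple of I such that $\vec{t} = R(\vec{c}, \vec{d})$, for some $\vec{d}$
    end if
    Add $\vec{t}$ to $\mathcal{N}$
  end for
end if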
Figure 4.3: Algorithm to build the optimistic repair
4.3.3 Sound Ranges
In this subsection, we show that the ranges produced by the query rewritings are sound,
in the sense that the value of the aggregation function falls within the returned range on
every repair.
The next lemma shows that the rewritings produced by RewriteCount compute sound
ranges.
Lemma 4.7. Let R be a schema. Let $\Sigma$ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let $q(\vec{z}, v)$ be a query of the following form:
    select $\vec{z}$, count(*)
    from $q'(\vec{z})$
    group by $\vec{z}$
where q
(q
, I). Let d
be such that (
. Let x =
i=1...m
x
i
,
let
z
= z x, and let
w
= w x. Let q
(x,
) =
.(x,
w
). Let
x
= x z. Let
QConsistent(x,
, ).
Lower Bound. Since (
)
group by z
Assume towards a contradiction that d < low. Then, there is a tuple (c,
) such
that (c,
) , q
) , consistent
(q
, I). By
Theorem 3.5, we conclude that (c,
) , QConsistent(I); contradiction.
Upper Bound. Since (
(x,
) (
.QConsistent(x,
))
group by z
Assume towards a contradiction that d > up. Then, there is a valuation and a tuple
(c,
) =
t
, (z) =
t, (c,
) q
) , q
(I);
or (2) I ,[= (
.QConsistent(x,
)).
Assume that (1) (c,
) , q
) , q
.QConsistent(x,
)).
Recall that
x
= x z. By Theorem 3.5, (
) , consistent
(q
. In
particular, (c,
) , consistent
(q
) =
t
and (z) =
t. Thus,
t , consistent
(q
, I);
contradiction.
The next lemma shows that the rewritings for queries with the sum operator compute
sound ranges.
Lemma 4.8. Let R be a schema. Let $\Sigma$ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let $q(\vec{z}, v)$ be a query of the following form:
    select $\vec{z}$, sum(u)
    from $q'(\vec{z}, u)$
    group by $\vec{z}$
where q
(q
, I). Let d
be such that (
. Let
x =
i=1...m
x
i
, let
z
= z x, and let
w
= w x. Let
x
= x z u. Let
q
(x,
) =
, u.(x,
w
, ). Let q
be the query q
(x,
, u) =
.(x,
w
, u).
Lower Bound. Since (
, v) QContribNonConsistent(x,
, v)
group by z
where QContribConsistent is the following query:
QContribConsistent(x,
, bottom) =
select x,
z
, min(u) as bottom
from QConsistent(x,
) q
(x,
, u)
group by x,
having bottom 0
and QContribNonConsistent is the following query:
QContribNonConsistent(x,
, bottom) =
select x,
z
, min(u) as bottom
from q
(x,
, u) (
.QConsistent(x,
))
group by x,
) =
t
, and
(c,
) , q
(1); and
there is some e such that e > 0; and
either (c,
, e) QContribConsistent QContribNonConsistent(I).
Since e > 0, (c,
) , q
(1), (c,
) ,
consistent
(q
) , QConsistent(I). There-
fore, (c,
, e) , QContribConsistent(I); contradiction.
Second, assume that there is a valuation for the variables in z, x such that (z) =
t,
(x) =c , (
) =
t
, and
there is some e
< 0; and
(c,
, e
) q
(1); and
for every e such that e < 0, we have that (c,
, e) , QContribConsistent
QContribNonConsistent(I).
Since 1 I and (c,
, e
) q
, e
) q
(q
, I), (
) consistent
(q
, I) for some
c
. By Theorem
3.5, (
) QConsistent(I). Thus, I [=
.QConsistent(x,
)[z/
t]. Since e
< 0,
(c,
, e
) q
(I) and I [=
.QConsistent(x,
)[z/
, e
)
QContribNonConsistent(I); contradiction.
Third, assume that there is a valuation for the variables in z, x such that (z) =
t,
(x) =c , (
) =
t
, and
there is some e such that (c,
, e) QContribConsistentQContribNonConsistent(I);
and
there is some e
such that e
< e; and
(c,
, e
) q
(1).
Assume that (c,
) QConsistent(I),
and (c,
, e) q
, e
) q
, e
)
q
). Since (c,
, e) and (c,
, e
; contradiction.
Now, assume that (c,
, e
)
q
;
contradiction.
Upper Bound The proof for the lowest upper bound is analogous to the proof for
the greatest lower bound.
The next lemma shows that the rewritings for queries with the min and max aggregation
functions compute sound ranges.
Lemma 4.9. Let R be a schema. Let $\Sigma$ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let $q(\vec{z}, v)$ be a query of the following form:
    select $\vec{z}$, [min(u) | max(u)]
    from $q'(\vec{z}, u)$
    group by $\vec{z}$
where q
(q
, I). Let d
be such that (
. Let
x =
i=1...m
x
i
, let
z
= z x, and let
w
= w x. Let
x
= x z u. Let
q
(x,
) =
, u.(x,
w
, ). Let q
be the query q
(x,
, u) =
.(x,
w
, u).
Lower Bound. Suppose that the aggregate function of q is max. Since (
t, low, up)
Q(I), the lower bound low of
t is computed with the following query:
QGlb(z, glb) = select z, max(u)
from QContribConsistent(x,
, u)
group by z
where QContribConsistent is the following query:
QContribConsistent(x,
, bottom) =
select x,
z
, min(u) as bottom
from QConsistent(x,
) q
(x,
, u)
group by x,
Assume towards a contradiction that d < low. Then, there is a valuation for the
variables in z, x such that (z) =
t, (x) =c , (
) =
t
, and
there is some e such that (c,
, e) QContribConsistent(I); and
there is some e
such that e
< e; and
(c,
, e
) q
(1).
We arrive at a contradiction by the same arguments used in Lemma 4.8 for queries
with the sum operator.
Now, suppose that the aggregate function of q is min. Since (
, u)
group by z
where QContribNonConsistent is the following query:
select x,
z
, min(u) as bottom
from q
(x,
, u) (
.QConsistent(x,
))
group by x,
)
Assume towards a contradiction that d < low. Then, there is a valuation for the
variables in z, x such that (z) =
t, (x) =c , (
) =
t
, and
there is some e such that (c,
, e) QContribNonConsistent(I); and
there is some e
such that e
< e; and
(c,
, e
) q
(1).
We arrive at a contradiction by the same arguments used in Lemma 4.8 for queries
with the sum operator.
Upper Bound For the max operator, we can give an argument analogous to the
argument given for the lower bound of the min operator. For the min operator, we
can give an argument analogous to the argument given for the lower bound of the max
operator.
4.3.4 Tight Ranges
In this section, we show that the ranges produced by the query rewritings are tight. For
this, we must exhibit two repairs, where the result of the aggregation function corresponds
to the greatest lower bound in one repair, and to the lowest upper bound in the other. For
example, if the query has the count(*) operator, the repair that we need for the greatest
lower bound turns out to be the pessimistic repair $\mathcal{M}$ used in the correctness proof of
the first-order rewritings of Section 3.3.3. For the lowest upper bound, the needed repair
is the optimistic repair $\mathcal{N}$ that we introduced in Section 4.3.2.
We start by showing that the rewritings produced by RewriteCount give tight bounds.
In the next lemma, we show that the greatest lower bound of count(*) can be obtained
by executing the query on the pessimistic repair $\mathcal{M}$. We also show that the query
rewriting that we obtain correctly returns such bound.
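The notion of tightness can be checked by brute force on a small instance. The sketch below is only an illustration: it enumerates every repair of a hypothetical employee(name, salary) relation keyed on name (the instance is an assumption, not taken from the thesis) and reports, for each salary that occurs in every repair, the smallest and largest count(*) observed. The repairs attaining the two extremes play the roles of the pessimistic and optimistic repairs.

```python
from collections import Counter
from itertools import product


def repairs(instance):
    """Yield every repair of a relation whose key is the first attribute."""
    groups = {}
    for tup in instance:
        groups.setdefault(tup[0], []).append(tup)
    # A repair keeps exactly one tuple per key value.
    for choice in product(*groups.values()):
        yield list(choice)


def count_ranges(instance):
    """Range of count(*) per salary, for salaries present in every repair."""
    per_repair = [Counter(salary for _, salary in r) for r in repairs(instance)]
    always = set.intersection(*(set(c) for c in per_repair))
    return {s: (min(c[s] for c in per_repair), max(c[s] for c in per_repair))
            for s in always}


# Hypothetical inconsistent instance: John has two salaries.
I = [("Mary", 1000), ("Ali", 1000), ("John", 1000), ("John", 2000)]
ranges = count_ranges(I)
print(ranges)
```

Here the repair that keeps (John, 2000) attains the lower bound for salary 1000, and the repair that keeps (John, 1000) attains the upper bound. Enumeration is exponential in the number of violated keys, which is why the rewritings avoid it.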
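Lemma 4.10. Let R be a schema. Let $\Sigma$ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let $q(\vec{z})$ be a query of the following form:
    select $\vec{z}$, count(*)
    from $q'(\vec{z})$
    group by $\vec{z}$
where $q'(\vec{z})$ is a query in $\mathcal{C}_{forest}$.
Let G be the join graph of q. Let $R_1(\vec{x}_1, \vec{y}_1), \ldots, R_m(\vec{x}_m, \vec{y}_m)$ be the literals at the roots
of each tree of G. Let $\phi(\vec{w}, \vec{z})$ be the conjunction of literals of $q'$. Let $\vec{x} = \bigcup_{i=1 \ldots m} \vec{x}_i$,
let $\vec{z}' = \vec{z} \setminus \vec{x}$, and let $\vec{w}' = \vec{w} \setminus \vec{x}$. Let $\hat{q}(\vec{x}, \vec{z}') = \exists \vec{w}'.\, \phi(\vec{x}, \vec{w}', \vec{z}')$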
) =
t
, and (z) =
t, if (c,
) q
(/),
then c consistent
(q
[z/
t[
B
= low, and
3. if (
t[
B
= low.
Proof. Let $\mathcal{M}$ be the pessimistic repair obtained by invoking the algorithm
BuildPessimisticRepair(q, $\Sigma$, I). Condition (1) holds by Lemma 3.10. We must now prove
Conditions (2) and (3).
In order to prove Condition 2, let
t be a tuple, and low, and up be a pair of real
numbers such that (
such that B
= q(1) and [
t[
B
= low. Furthermore, by Lemma
4.7, since / is a repair of I wrt , [
t[
B
low. Assume towards a contradiction that
[
t[
B
> low. Then, there is a valuation for the variables of x and z such that (x) = c,
(z) =
t and (
) =
t
) q
)[
B
> 1; or
(c,
) q
[z/
) , q
[z/
t](1).
Assume that (c,
) q
)[
B
> 1. This contradicts Lemma 4.4. Now, as-
sume that (c,
) q
[z/
) , q
[z/
) , consistent
(q
[z/
t], I).
By Condition 1, we have that (c,
) , q
[z/
t](/); contradiction.
In order to prove Condition 3, let
t, low, and up be such that (
t[
B
low. Let QConsistent(x,
) be the query
obtained by invoking RewriteForest(q
t is computed
with the following query:
QGlb(z, low) = select z, count(*)
from QConsistent(x,
)
group by z
Assume towards a contradiction that [
t[
B
> low. Then, there is a valuation for the
variables of x and z such that (x) =c, (z) =
t and (
) =
t
) q
)[
B
> 1; or
(c,
) q
) , QConsistent(I).
Assume that (c,
) q
)[
B
> 1. This contradicts Lemma 4.4. Now,
assume that (c,
) q
) , QConsistent(I),
by Theorem 3.5, (c,
) , consistent
(q
) ,
q
(/); contradiction.
In the next lemma, we show that the lowest upper bound of count(*) can be obtained
by executing q on the optimistic repair ^. We also show that the query rewriting of q
correctly returns such bound.
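Lemma 4.11. Let R be a schema. Let $\Sigma$ be a set of integrity constraints, consisting
of one key dependency per relation of R. Let $q(\vec{z})$ be a query in $\mathcal{C}_{forest}$ of the following
form:
    select $\vec{z}$, count(*)
    from $q'(\vec{z})$
    group by $\vec{z}$
where $q'(\vec{z})$ is a query in $\mathcal{C}_{forest}$.
Let G be the join graph of q. Let $R_1(\vec{x}_1, \vec{y}_1), \ldots, R_m(\vec{x}_m, \vec{y}_m)$ be the literals at the roots
of each tree of G. Let $\phi(\vec{w}, \vec{z})$ be the conjunction of literals of $q'$. Let $\vec{x} = \bigcup_{i=1 \ldots m} \vec{x}_i$,
let $\vec{z}' = \vec{z} \setminus \vec{x}$, and let $\vec{w}' = \vec{w} \setminus \vec{x}$. Let $\hat{q}(\vec{x}, \vec{z}') = \exists \vec{w}'.\, \phi(\vec{x}, \vec{w}', \vec{z}')$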
t, if c possible
(q
[z/
t], I),
then c q
[z/
t](^), and
2. if (
t[
B
= up, and
3. if (
t[
B
= up.
Proof. Let $\mathcal{N}$ be the optimistic repair obtained by invoking the algorithm
BuildOptimisticRepair(q, $\Sigma$, I). Condition (1) holds by Lemma 4.6. We must now prove
Conditions (2) and (3).
In order to prove Condition 2, let
t be a tuple, and low and up be real numbers such
that (
such that B
= q(1) and [
t[
B
= up. Furthermore, since ^ is a repair of I wrt , by
Lemma 4.7, [
t[
B
up. Assume towards a contradiction that [
t[
B
< up. Then, there is a
valuation for the variables of x and z such that (x) = c, (z) =
t and (
) =
t
, and
one of the following conditions holds:
(c,
) q
)[
B
> 1; or
(c,
) , q
) q
(1).
Assume that (c,
) q
)[
B
> 1. This contradicts Lemma 4.4. Now,
assume that (c,
) , q
) q
(q
[z/
t], I). By
Condition 1, we have that c q
[z/
t](^); contradiction.
In order to prove Condition 3, let
t, low, and up be such that (
t[
B
up. Let
x
)
be the query obtained by invoking RewriteForest(q
, ). Since (
(x,
) (
.QConsistent(x,
))
group by z
Assume towards a contradiction that [
t[
B
< up. Then, there is a valuation for the
variables of x and z such that (x) =c, (z) =
t and (
) =
t
, and either:
(c,
) , q
(^), (c,
) q
(I), and I [= (
.QConsistent[z/
t]).
Assume that (c,
) is accounted for more than once in the from clause of QLub. This
is a contradiction since by definition the from clause of a first-order aggregate query is
computed using set semantics. Now, assume that (c,
) , q
(^), (c,
) q
(I), and
I [= (
.QConsistent[z/
) q
(q
[z/
t], I).
Thus, by Condition 1, c q
[z/
t](^); contradiction.
For the unary operators, the proof of tightness proceeds in an analogous way, except
that the optimistic and pessimistic repairs have to be modified to ensure every tuple has
the minimum (or maximum, depending on the case) for attribute u. We next show how
to obtain a pessimistic repair for queries with the sum operator.
Algorithm BuildPessimisticRepairForSum (q, I, /
)
Input: A query q of the form
select z, sum(u)
from q
(z)
group by z
where q
is a conjunctive query in c
forest
I, an instance
$\mathcal{M}'$, a pessimistic repair
Output: $\mathcal{M}$, a pessimistic repair
Initialize / as /
(x) =
c
(y) =
d
,
R(
) I, and (z) =
) in /
end if
end for
end for
Notice in the algorithm that a tuple $R(\vec{c}, \vec{d})$ is replaced only if there is another tuple
with the same values, except for the attribute u, and the other tuple has a smaller value
on u (condition $\nu'(u) < \nu(u)$ in the algorithm). In the rewriting for the lower bound of
the sum operator, this corresponds to the fact that for positive values we aggregate over
the minimum value of u for all tuples in the intermediate result. In contrast, for the upper
bound, we aggregate over the maximum value of u. Thus, for the upper bound, a similar
algorithm can be used, where we replace tuples for which the condition
(z, u)
group by z
where q
. Let x =
i=1...m
x
i
,
let
z
= z x, and let
w
= w x. Let q
(x,
) =
, u.(x,
w
(x,
, u) =
.(x,
w
, u).
Then, there is a repair / of I wrt and some value d such that (
t, d) q(/), and
the following conditions hold:
1. for every valuation such that (x) =c, (
) =
t
, and (z) =
t, if (c,
) q
(/),
then c consistent
(q
[z/
) =
t
,
and one of the following conditions holds:
(c,
) q
)[
B
> 1; or
there are e and e
, (c,
, e) q
, e
) q
(1); or
(c,
) q
[z/
) , q
[z/
t](1).
Assume that (c,
) q
)[
B
> 1. This contradicts Lemma 4.4. Now,
assume that there are e and e
, (c,
, e) q
, e
)
q
(1). Let
and
(w) and
(w) =
(w);
(w) = e; and
(w) = e
(w) <
;
contradiction. Finally, assume that (c,
) q
[z/
) , q
[z/
t](1). Then,
(c,
) , consistent
(q
[z/
) , q
[z/
t](/); con-
tradiction.
In order to prove Condition 3, let
t, low, and up be such that (
) be the query
obtained by invoking RewriteForest(q
, v)
group by z
where QContribConsistent is the following query:
QContribConsistent(x,
, bottom) =
select x,
z
, min(u) as bottom
from QConsistent(x,
) q
(x,
, u)
group by x,
t and (
) =
t
) q
)[
B
> 1; or
there are e and e
, (c,
, e) q
(/) and
(c,
, e
) QContribConsistent(I); or
(c,
) q
) , QConsistent(I).
Assume that (c,
) q
)[
B
> 1. This contradicts Lemma 4.4. Now,
assume that there are e and e
, (c,
, e) q
, e
)
QContribConsistent(I). Since e
) q
) ,
QConsistent(I). Since (c,
) ,
consistent
(q
) , q
(/); contradic-
tion.
Notice that the proof above is similar to the one for Lemma 4.10, except that we need
to account for the fact that each tuple may contribute a value greater than one. A proof
similar to Lemma 4.11 can be given for the lowest upper bound.
4.3.5 Putting It All Together
The next lemma states the correctness of the algorithm RewriteCount. The correctness
for the unary operators can be obtained analogously by employing the optimistic and
pessimistic repairs as shown in Figure 4.4.
Lemma 4.13. Let R be a schema. Let $\Sigma$ be a set of integrity constraints, consisting
of one key dependency per relation of R. Let $q(\vec{z})$ be a query in $\mathcal{C}_{forest}$ of the following
form:
    select $\vec{z}$, count(*)
    from $q'(\vec{w}, \vec{z})$
    group by $\vec{z}$
Let $Q(\vec{z}, l, u)$ be the first-order aggregate query returned by RewriteCount(q, $\Sigma$). Let
I be an instance over R. Then, for every tuple $\vec{t}$, and pair of real numbers low and up,
we have that $(\vec{t}, low, up) \in aggconsistent_\Sigma(q, I)$ iff $(\vec{t}, low, up) \in Q(I)$
. Following the
algorithm RewriteCount, let x =
i=1...m
x
i
, let
z
= z x, and let
w
= w x. Let
= xz. Let q
(x,
) =
.(x,
w
). Let QConsistent(x,
, ).
() Let
t be a tuple and low and up be real numbers such that (
t, low, up)
aggconsistent
) =
t
, and (z) =
t, if (c,
) q
(/),
then c consistent
(q
[z/
t[
B
= low.
Since, (
t[
B
= low. Assume
towards a contradiction that (
) = select z, count(*)
from QConsistent(x,
)
group by z
Assume that low
t, (
) =
t
) q
)[
B
> 1; or
(c,
) q
) , QConsistent(I).
Assume that (c,
) q
)[
B
> 1. This contradicts Lemma 4.4. Now,
assume that (c,
) q
) ,
consistent
(q
) , q
(/); contradiction.
Assume towards a contradiction that low
) =
t
, (c,
) , q
(/) and
(c,
)
consistent
(q
) q
(/);
contradiction.
By Lemma 4.11, there is an optimistic repair ^ of I wrt and a bag B such that
B = q(^), and the following conditions hold:
1. for every valuation such that (x) = c and (z) =
t, if c possible
(q
[z/
t], I),
then c q
[z/
t](^), and
2. if (
t[
B
= up.
Since, (
t[
B
= up. Assume
towards a contradiction that (
) = select z, count(*)
from q
(x,
) (
.QConsistent(x,
))
group by z
Assume that up
) =
t
, (c,
) , q
(^), (c,
) q
(I), and I [=
.QConsistent(x,
). Since (c,
) q
(I), (c,
) possible
(q
) q
(^); contradiction.
Assume that up
< up. Then, there is a valuation for the variables of x and z such
that (x) = c, (z) =
t, (
) =
t
) q
)[
B
> 1. But this contradicts Lemma 4.4. Second, (c,
) q
(^)
and either (1) (c,
) , q
.QConsistent(x,
) ,
q
) , q
(^); contradiction.
Assume that (2) I ,[=
.QConsistent(x,
). Recall that
x
= x z. By Theorem 3.5,
(
) , consistent
(q
. In particular, (c,
) , consistent
(q
, I). Thus,
(c,
) , q
(^); contradiction.
() Let
t be a tuple and low and up be real numbers such that (
t[
B
up.
2. There is a repair 1 of I wrt , and a bag B such that B = q(1) and [
t[
B
= low.
3. There is a repair 1 of I wrt , and a bag B such that B = q(1) and [
t[
B
= up.
Claim 1 follows by Lemma 4.7. Claim 2 follows by Lemma 4.10. Claim 3 follows by
Lemma 4.11.
4.4 Related Work
Our work on aggregation is inspired by Arenas et al. [ABC+03b], who were the first to
propose the use of ranges in a semantics for consistent query answering. The work of
Arenas et al. is restricted to queries of the following form:
    select F(A)
    from r
where F is an aggregation function, r is a single relation, and A is an attribute from
r. Notice that such queries have no grouping and no selection or join conditions (i.e., no
where clause). In this chapter, we consider a much richer class of queries. For the class
of queries considered by Arenas et al., the semantics proposed in their paper and our
semantics for aggregate queries coincide. However, we need to extend their semantics in
order to be able to deal with queries that perform grouping.
In their paper, Arenas et al. [ABC+03b] consider functional dependencies. If there
is exactly one functional dependency on the (only) relation of the query, they show that
the problem of obtaining the lowest upper and greatest lower bounds is tractable for the
count(*), min, max, sum, and avg functions. Except for avg, we considered all these
functions in our class $\mathcal{C}_{aggforest}$. Arenas et al. also show the intractability of queries with
the count(distinct) operator and exactly one functional dependency. If the relation
of the query has more than one functional dependency, they show that the problem
of obtaining tight bounds is intractable for all the aggregate functions they consider
(count(*), min, max, sum, avg, and count(distinct)). This gives further evidence of
the maximality of the class considered in this chapter: going from one to two functional
dependencies may lead to intractability even for queries on just one relation and with no
grouping.
Chapter 5
Complexity-Theoretic Analysis
In the previous chapters, we presented query rewriting algorithms that work on a broad
class of queries. In this chapter, we show the maximality of this class based on
complexity-theoretic arguments. In Section 5.1, we show that minimal relaxations of the conditions of
the class lead to intractability. Then, in Section 5.2, we embark on a more ambitious goal:
for a large class of conjunctive queries, we show that the conditions of the class $\mathcal{C}_{forest}$
presented in Chapter 3 are not only sufficient, but they are also necessary conditions for
a query to be first-order rewritable.
5.1 Minimal Relaxations of $\mathcal{C}_{forest}$
In this section, we show that minimal relaxations of the conditions of $\mathcal{C}_{forest}$ lead to
intractability. In particular, we show the intractability of the problem of computing
consistent answers for: (1) a conjunctive query whose join graph is a cycle of length
two; and (2) a conjunctive query whose join graph is a forest, but the query has some
nonkey-to-key joins that are not full.
Chomicki and Marcinkowski [CM05] proved that the problem of computing consistent
answers for a query with a single nonkey-to-nonkey join is coNP-complete. Their result
used a query with repeated relation symbols (specifically, a query with only two literals
both for a single relation R). We can use their insight to show that the problem of
computing consistent answers for the following query without repeated relation symbols,
but with a single nonkey-to-nonkey join is also coNP-complete.
    $q_{nk} = \exists x, x', y.\, S_1(x, y) \wedge S_2(x', y)$
Notice that $q_{nk}$ has a cycle of length two (actually, a nonkey-to-nonkey join), and no nonkey-to-key joins. Our proof of hardness is a simple modification of the results of Chomicki and Marcinkowski [CM05] and uses a reduction from the problem MONOTONE-3SAT, which is well known to be NP-complete. The only difference between the MONOTONE-3SAT and 3SAT problems is that the former assumes that the input 3CNF propositional formula is monotone. That is, each clause $\phi_i$ contains either positive or negative atoms, but not both. We shall say that a clause that contains only positive (negative) atoms is a positive (negative) clause.
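To make the notion of a consistent answer concrete, the following is a minimal brute-force sketch, assuming that a repair keeps exactly one tuple per key value of each relation. The function names, the data layout, and the two toy instances are illustrative assumptions, not part of the thesis. The enumeration is exponential in the number of key conflicts, which is in line with the intractability discussed in this section.

```python
from itertools import product


def repairs(instance, key_len):
    """Yield every repair: exactly one tuple is kept per (relation, key value)."""
    groups = {}
    for rel, tup in sorted(instance):
        key = (rel, tup[:key_len[rel]])
        groups.setdefault(key, []).append((rel, tup))
    for choice in product(*groups.values()):
        yield set(choice)


def q_nk(db):
    """Boolean query: there are x, x', y such that S1(x, y) and S2(x', y)."""
    s1_values = {tup[1] for rel, tup in db if rel == "S1"}
    s2_values = {tup[1] for rel, tup in db if rel == "S2"}
    return bool(s1_values & s2_values)


def consistent(query, instance, key_len):
    """True iff the Boolean query holds in every repair of the instance."""
    return all(query(repair) for repair in repairs(instance, key_len))


# Hypothetical instances; the first attribute of S1 and S2 is the key.
KEYS = {"S1": 1, "S2": 1}
conflicting = {("S1", ("a", 1)), ("S1", ("a", 2)),
               ("S2", ("b", 1)), ("S2", ("b", 2))}
clean = {("S1", ("a", 1)), ("S2", ("b", 1)), ("S2", ("c", 2))}
```

In the `conflicting` instance, the repair that keeps S1(a, 1) and S2(b, 2) falsifies the query, so the query is not consistently true there.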
Lemma 5.1. Let $q$ be the query $\exists x, x', y.\ S_1(x, y) \wedge S_2(x', y)$ [...]
[...] $\exists x, x', w, w', z, z', m.\ R_1(x, w) \wedge R_2(m, w, z) \wedge R_3(x', w') \wedge R_4(m, w', z')$
We prove hardness by showing a reduction from the problem of computing the consistent answers for the query $q_{nk}$, shown to be coNP-hard in Lemma 5.1.
Lemma 5.2. Let $q$ be the query $\exists x, x', w, w', z, z', m.\ R_1(x, w) \wedge R_2(m, w, z) \wedge R_3(x', w') \wedge R_4(m, w', z')$. Let $q'$ be the query $\exists x, x', y.\ S_1(x, y) \wedge S_2(x', y)$ [...]
for each tuple $S_1(c_1, d_1) \in I'$ do
  Add $R_1(c_1, d_1)$ to $I$
end for
for each tuple $S_2(c_2, d_2) \in I'$ do
  Add $R_3(c_2, d_2)$ to $I$
end for
Let $c_z$, $c_{z'}$ be some constants
for each valuation $\theta_{q'}$ such that $I' \models S_1(x, y) \wedge S_2(x', y)[\theta_{q'}]$ do
  Let $\theta_q(x) = \theta_{q'}(x)$
  Let $\theta_q(x') = \theta_{q'}(x')$
  Let $\theta_q(w) = \theta_{q'}(y)$
  Let $\theta_q(w') = \theta_{q'}(y)$
  Let $c_m$ be a newly-created constant
  Let $\theta_q(m) = c_m$
  Let $\theta_q(z) = c_z$
  Let $\theta_q(z') = c_{z'}$
  Add tuple $R_2(m, w, z)[\theta_q]$ to $I$
  Add tuple $R_4(m, w', z')[\theta_q]$ to $I$
end for
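The construction of $I$ from $I'$ can be sketched in a few lines of Python. This is only an illustration under assumptions: facts are encoded as tuples whose first component is the relation name, the constants `"cz"` and `"cz'"` stand for the two fixed constants, and fresh constants for $m$ are generated as `m0`, `m1`, and so on. None of these names come from the thesis.

```python
from itertools import count


def build_instance(i_prime):
    """Map facts over S1, S2 to facts over R1..R4, following the construction above."""
    inst = set()
    # Copy S1 into R1 and S2 into R3.
    for rel, c, d in i_prime:
        inst.add(("R1" if rel == "S1" else "R3", c, d))
    fresh = count()
    c_z, c_z_prime = "cz", "cz'"  # the two fixed constants (assumed names)
    # One iteration per valuation satisfying S1(x, y) and S2(x', y).
    for rel1, _x, y in sorted(i_prime):
        if rel1 != "S1":
            continue
        for rel2, _x2, y2 in sorted(i_prime):
            if rel2 == "S2" and y2 == y:
                c_m = f"m{next(fresh)}"  # newly created constant for m
                inst.add(("R2", c_m, y, c_z))
                inst.add(("R4", c_m, y, c_z_prime))
    return inst


example = {("S1", "a", 1), ("S2", "b", 1), ("S2", "c", 2)}
result = build_instance(example)
```

For the `example` instance there is a single satisfying valuation (joining S1(a, 1) with S2(b, 1)), so one R2 tuple and one R4 tuple are created, both sharing the fresh constant for $m$.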
We claim that $consistent_{\Sigma'}(q', I') = true$ iff $consistent_\Sigma(q, I) = true$.
($\Rightarrow$) Let $\mathcal{I}$ be a repair of $I$. We shall build an instance $\mathcal{I}'$ as follows:
for each tuple $R_1(c_1, d_1)$ of $\mathcal{I}$ do
  Add a tuple $S_1(c_1, d_1)$ to $\mathcal{I}'$
end for
for each tuple $R_3(c_2, d_2)$ of $\mathcal{I}$ do
  Add a tuple $S_2(c_2, d_2)$ to $\mathcal{I}'$
end for
Notice that $R_1$ and $S_1$ (and, similarly, $R_3$ and $S_2$) have the same extensions in $I$ and $I'$, respectively. Thus, since $\mathcal{I}$ is a repair of $I$, $\mathcal{I}'$ is a repair of $I'$. Since $consistent_{\Sigma'}(q', I') = true$, $\mathcal{I}' \models q'$. Thus, there exists a valuation $\theta_{q'}$ such that $\mathcal{I}' \models S_1(x, y) \wedge S_2(x', y)[\theta_{q'}]$. Let $c_1 = \theta_{q'}(x)$, $c_2 = \theta_{q'}(x')$, $d = \theta_{q'}(y)$. Let $c_z$ and $c_{z'}$ be the constants used in the algorithm that constructs $I$. Let $c_m$ be the constant created in the algorithm for the iteration corresponding to $\theta_{q'}$. Let $\theta_q$ be a valuation for the variables of $q$ such that:
  $\theta_q(x) = c_1$
  $\theta_q(x') = c_2$
  $\theta_q(w) = d$
  $\theta_q(w') = d$
  $\theta_q(m) = c_m$
  $\theta_q(z) = c_z$
  $\theta_q(z') = c_{z'}$
Since $S_1(c_1, d) \in \mathcal{I}'$, $R_1(c_1, d) \in \mathcal{I}$. Since $S_2(c_2, d) \in \mathcal{I}'$, $R_3(c_2, d) \in \mathcal{I}$. By Proposition 3.6, $\mathcal{I}' \subseteq I'$. Thus, $S_1(c_1, d) \in I'$ and $S_2(c_2, d) \in I'$. Since $c_m$ is the constant chosen in the iteration for $\theta_{q'}$ in the algorithm that constructs $I$, $R_2(c_m, d, c_z) \in I$ and $R_4(c_m, d, c_{z'}) \in I$. By Proposition 3.7, $R_2(c_m, d, e) \in \mathcal{I}$ and $R_4(c_m, d, e') \in \mathcal{I}$, for some $e$, $e'$. Thus, $\mathcal{I} \models q[\theta_q]$.
($\Leftarrow$) Let $\mathcal{I}'$ be a repair of $I'$. [...]
for each tuple $S_1(c_1, d_1) \in \mathcal{I}'$ do
  Add $R_1(c_1, d_1)$ to $\mathcal{I}$
end for
for each tuple $S_2(c_2, d_2) \in \mathcal{I}'$ do
  Add $R_3(c_2, d_2)$ to $\mathcal{I}$
end for
for each tuple $R_2(c_1, c_2, d) \in I$ do
  Add $R_2(c_1, c_2, d)$ to $\mathcal{I}$
end for
for each tuple $R_4(c_1, c_2, d) \in I$ do
  Add $R_4(c_1, c_2, d)$ to $\mathcal{I}$
end for
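We now show that $\mathcal{I}$ is a repair of $I$. First, notice that $R_1$ and $S_1$ (and, similarly, $R_3$ and $S_2$) have the same extensions in $I$ and $I'$ [...] $R_3(x', w') \wedge R_4(m, w', z')[\theta_q]$. By construction of $I$, if $R_2$ and $R_4$ join on $m$, then $\theta_q(w) = \theta_q(w')$. Let $\theta_{q'}$ be such that:
  $\theta_{q'}(x) = \theta_q(x)$
  $\theta_{q'}(x') = \theta_q(x')$
  $\theta_{q'}(y) = \theta_q(w) = \theta_q(w')$
It is easy to see that $\mathcal{I}' \models S_1(x, y) \wedge S_2(x', y)[\theta_{q'}]$. Thus, $\mathcal{I}' \models q'$.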
5.2 A Dichotomy Result
5.2.1 The Class $\mathcal{C}^*$
[...] $\forall y'.(R_1(x, y') \rightarrow y' = y) \wedge \forall y'.(R_2(x, y') \rightarrow y' = y)$
Recall that the problem of computing consistent answers is intractable for the query $q_{nk} = \exists x, x', y.\ R_1(x, y) \wedge R_2(x', y)$ [...] which essentially separates the different types of joins of the query. In $\mathcal{C}^*$, every pair of literals can be related by at most one type of join (i.e., key-to-key, nonkey-to-nonkey, and nonkey-to-key).
Definition 5.3. Let $q$ be a conjunctive query without repeated relation symbols and all of whose nonkey-to-key joins are full. We say that $q$ is in class $\mathcal{C}^*$ [...]
• there is a nonkey-to-nonkey join between $R$ and $R'$.
• there are literals $R_1, \ldots, R_m$ in $q$ such that there is a nonkey-to-key join from $R$ to $R_1$, from $R_m$ to $R'$, and from $R_i$ to $R_{i+1}$, for every $i$ such that $1 \leq i < m$.
Notice that [...] are the ones that have a pair of literals related by more than one type of join. As anecdotal evidence of the practicality of the class, the only query in the TPC-H benchmark [TPC03] that has nonkey-to-nonkey joins (Query 5) is in [...] and $q \notin \mathcal{C}_{forest}$.
Theorem 5.5. Let $q$ be a query such that $q \in \mathcal{C}_{hard}$. Then, CONSISTENT($q$, $\Sigma$) is coNP-complete in data complexity.
Our motivation to provide a dichotomy for [...], it can be decided in polynomial time on which side of the dichotomy the query $q$ falls.
Corollary 5.6. Let $q$ be a query such that $q \in$ [...] for which the problem of computing consistent answers is tractable.
[...] RewriteForest. For the queries of $\mathcal{C}_{hard}$, since the problem of obtaining consistent answers is coNP-complete, there is no first-order rewriting, unless P=NP (which is unlikely).
Corollary 5.7. Let $q$ be a query such that $q \in$ [...] $(q, I) = false$.
Lemma 5.8. Let $I$ be a database with one binary relation $R(E, S)$, possibly inconsistent with respect to a functional dependency $E \rightarrow S$. Then, [...] is a perfect matching of $G$.
There are a number of algorithms in the literature for deciding the existence of a perfect bipartite matching. For example, one of the best known is given by Hopcroft and Karp [HK75], and runs in $O(n^{2.5})$ time. Therefore, $q$ is a tractable query. We now show that no approach based on query rewriting works for $q$.
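As an illustration of why such matching problems are tractable, here is a minimal sketch that decides whether every set in a family can be assigned its own distinct element, which amounts to finding a matching that covers one side of a bipartite graph. It uses plain augmenting paths, not the Hopcroft and Karp algorithm, so it is simpler but asymptotically slower. The function name and the sample inputs are assumptions made for this example.

```python
def has_distinct_representatives(sets):
    """Return True iff each sets[i] can be given its own distinct element."""
    match = {}  # element -> index of the set it currently represents

    def try_assign(i, seen):
        # Try to give set i an element, re-assigning earlier sets if needed.
        for element in sets[i]:
            if element in seen:
                continue
            seen.add(element)
            if element not in match or try_assign(match[element], seen):
                match[element] = i
                return True
        return False

    return all(try_assign(i, set()) for i in range(len(sets)))


solvable = has_distinct_representatives([[1, 2], [1], [2, 3]])
unsolvable = has_distinct_representatives([[1], [1]])
```

Each call to `try_assign` explores at most every edge once, so the whole procedure runs in polynomial time in the size of the input.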
Theorem 5.9. There is no first-order rewriting $Q$ of $q$ such that $consistent_\Sigma(q, I) = Q(I)$ for every instance $I$.
Proof. Let $A_1, \ldots, A_n$ be a family of sets. A system of distinct representatives [Ost70] of $A_1, \ldots, A_n$ is a sequence of $n$ distinct elements $a_1, \ldots, a_n$ with $a_i \in A_i$, $1 \leq i \leq n$. Let $R$ be a binary relation that encodes $A_1, \ldots, A_n$ as follows: $R(i, x)$ iff $x \in A_i$. Let $G$ be the graph of $R$ as constructed above. Clearly, $G$ has a perfect matching iff $A_1, \ldots, A_n$ has a system of distinct representatives. By Lemma 5.8, [...] $\exists x, x', y.\ S_1(x, y) \wedge S_2(x', y)$. This query has a nonkey-to-nonkey join, and was shown to be intractable in Lemma 5.1. The other query has a cycle of nonkey-to-key joins, and is shown to be intractable in Lemma 5.11.
The next lemma shows that the problem of computing consistent answers for conjunctive queries is in coNP.
Lemma 5.10. Let $q$ be a conjunctive query. The problem CONSISTENT($q$, $\Sigma$) is in coNP.
Proof. Let $I$ be an instance. In order to decide whether $t \notin consistent_\Sigma$ [...] $t]$ can be checked in polynomial time, since $q$ is a conjunctive query.
In the next lemma, we show the coNP-hardness of computing consistent answers for one of the two particular queries that will be used in Lemma 5.14. The coNP-hardness of the other query was proven in Lemma 5.1.
Lemma 5.11. Let $q = \exists x, y.\ T_1(x, y) \wedge T_2(y, x)$. Then, the problem CONSISTENT($q$, $\Sigma$) is coNP-hard.
Proof. We will prove hardness by reduction from MONOTONE-3SAT. Let $\phi = \phi_1 \wedge \cdots \wedge \phi_m$ be a monotone 3CNF formula. We shall build an instance $I$ as follows:
• For each atom $z$, let $\phi_{i_1}, \ldots, \phi_{i_n}$ be the positive clauses where $z$ occurs. Add tuples $T_1(\langle \phi_{i_1}, \ldots, \phi_{i_n} \rangle, z)$ and $T_2(z, \langle \phi_{i_1}, \ldots, \phi_{i_n} \rangle)$ to $I$.
• For each atom $z$, let $\phi_{i_1}, \ldots, \phi_{i_n}$ be the negative clauses where $z$ occurs. Add tuples $T_1(\langle \phi_{i_1}, \ldots, \phi_{i_n} \rangle, z)$ and $T_2(z, \langle \phi_{i_1}, \ldots, \phi_{i_n} \rangle)$ to $I$.
We now show that [...] $, z) \in \mathcal{I}$ such that $c \neq c'$. By construction of $I$, if $T_2(z, d) \in I$, then $d = c$ or $d = c'$. By Propositions 3.6 and 3.7, either $T_2(z, c) \in \mathcal{I}$ or $T_2(z, c') \in \mathcal{I}$. Thus, $\mathcal{I} \models q$; contradiction.
We now build a valuation $v$ for the variables of $\phi$ as follows. For each variable $z$, we let $v(z) = true$ if there is some $c$ such that $T_1(c, z) \in \mathcal{I}$ and $c$ is a list of positive clauses; and we let $v(z) = false$ if there is some $c$ such that $T_1(c, z) \in \mathcal{I}$ and $c$ is a list of negative clauses. It is easy to see that $v$ is a truth valuation that satisfies $\phi$.
($\Leftarrow$) Assume that $\phi$ is satisfiable. Let $v$ be a satisfying truth assignment for the variables of $\phi$. We shall build a repair $\mathcal{I}$ as follows. For each positive clause $\phi_i$, select a variable $z$ that appears in $\phi_i$ and such that $v(z) = true$. Add $T_1(c, z)$ to $\mathcal{I}$, where $c$ is a list of positive clauses. For each negative clause $\phi_i$, select a variable $z$ that appears in $\phi_i$ and such that $v(z) = false$. Add $T_1(c, z)$ to $\mathcal{I}$, where $c$ is a list of negative clauses. For each variable $z$, if $v(z) = false$, add $T_2(z, c)$ to $\mathcal{I}$, where $c$ is a list of positive clauses; if $v(z) = true$, add $T_2(z, c)$ to $\mathcal{I}$, where $c$ is a list of negative clauses. It is easy to see that $\mathcal{I} \not\models q$.
We now give some auxiliary results before proving Lemma 5.14. The next lemma generalizes Lemma 5.11 from cycles of length two to the case of cycles of arbitrary length.
Lemma 5.12. Let $q$ be the query $\exists w_1, \ldots, w_m.\ S_1(w_m, w_1) \wedge S_2(w_1, w_2) \wedge \cdots \wedge S_m(w_{m-1}, w_m)$. Let $q' = \exists x, y.\ T_1(x, y) \wedge T_2(y, x)$. Then, there is a polynomial-time reduction from the problem CONSISTENT($q'$ [...]
for each valuation $\theta_{q'}$ such that $I' \models T_1(x, y) \wedge T_2(y, x)[\theta_{q'}]$ do
  Let $\theta_q(w_m) = \theta_{q'}(x)$
  Let $\theta_q(w_1) = \theta_{q'}(y)$
  Create a new constant $c_{new}$
  for $i := 2$ to $m-1$ do
    Let $\theta_q(w_i) = c_{new}$
  end for
  Add the tuples of $S_1(w_m, w_1) \wedge S_2(w_1, w_2) \wedge \cdots \wedge S_m(w_{m-1}, w_m)[\theta_q]$ to $I$
end for
We claim that $consistent_{\Sigma'}(q', I') = true$ iff $consistent_\Sigma(q, I) = true$.
($\Rightarrow$) Let $\mathcal{I}$ be a repair of $I$ over the schema of $q$. We shall build a repair $\mathcal{I}'$ over the schema of $q'$ as follows:
for each tuple $S_1(c_m, c_1)$ of $\mathcal{I}$ do
  Add a tuple $T_1(c_m, c_1)$ to $\mathcal{I}'$
  for each $c_{new}$ such that $S_2(c_1, c_{new}) \in \mathcal{I}$ and $S_m(c_{new}, c_m) \in \mathcal{I}$ do
    Add a tuple $T_2(c_1, c_m)$ to $\mathcal{I}'$
  end for
end for
Since $consistent_{\Sigma'}(q', I') = true$, $\mathcal{I}' \models q'$. Thus, there exists a valuation $\theta_{q'}$ such that $\mathcal{I}' \models T_1(x, y) \wedge T_2(y, x)[\theta_{q'}]$. Let $c_m = \theta_{q'}(x)$, $c_1 = \theta_{q'}(y)$. Since $T_2(c_1, c_m) \in \mathcal{I}'$, there exists $c_{new}$ such that $S_2(c_1, c_{new}) \in \mathcal{I}$ and $S_m(c_{new}, c_m) \in \mathcal{I}$. Let $\theta_q$ be a valuation for the variables of $q$ such that:
  $\theta_q(w_m) = c_m$
  $\theta_q(w_1) = c_1$
  $\theta_q(w_i) = c_{new}$, for $1 < i < m$
Since $T_1(c_m, c_1) \in \mathcal{I}'$, $S_1(c_m, c_1) \in \mathcal{I}$. By construction of $\theta_q$, $S_2(c_1, c_{new}) \in \mathcal{I}$ and $S_m(c_{new}, c_m) \in \mathcal{I}$. For $2 < i \leq m$, notice that by construction of $I$, there are no tuples $S_i(c_i, d_i)$ and $S_i(c_i, d_i')$ in $I$ such that $d_i \neq d_i'$. Therefore, by Propositions 3.6 and 3.7, every tuple in the extension of $S_i$ in $I$ appears in the extension of $S_i$ in $\mathcal{I}$. By construction of $I$, $S_i(c_{new}, c_{new}) \in I$, for $3 \leq i \leq m-1$. Thus, $S_i(c_{new}, c_{new}) \in \mathcal{I}$. We conclude that $\mathcal{I} \models S_1(w_m, w_1) \wedge S_2(w_1, w_2) \wedge \cdots \wedge S_m(w_{m-1}, w_m)[\theta_q]$. Thus, $\mathcal{I} \models q$.
($\Leftarrow$) Let $\mathcal{I}'$ be a repair of $I'$ [...] do
  Add a tuple $S_1(c_m, c_1)$ to $\mathcal{I}$
  Let $c_{new}$ be a constant such that $S_2(c_1, c_{new}) \in I$ and $S_m(c_{new}, c_m) \in I$
  Add a tuple $S_2(c_1, c_{new})$ to $\mathcal{I}$
  for $i := 3$ to $m-1$ do
    Add a tuple $S_i(c_{new}, c_{new})$ to $\mathcal{I}$
  end for
  Add a tuple $S_m(c_{new}, c_m)$ to $\mathcal{I}$
end for
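It is easy to see that $\mathcal{I}$ is a repair of $I$. Since $consistent_\Sigma(q, I) = true$, $\mathcal{I} \models q$. Thus, there exists some valuation $\theta_q$ such that $\mathcal{I} \models S_1(w_m, w_1) \wedge S_2(w_1, w_2) \wedge \cdots \wedge S_m(w_{m-1}, w_m)[\theta_q]$. Let $\theta_{q'}$ be such that:
  $\theta_{q'}(x) = \theta_q(w_m)$
  $\theta_{q'}(y) = \theta_q(w_1)$
It is easy to see that $\mathcal{I}' \models T_1(x, y) \wedge T_2(y, x)[\theta_{q'}]$. Thus, $\mathcal{I}' \models q'$.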
5.2.3 Generalizing the Basic Cases
Our strategy for proving the dichotomy will be to show that if $q$ has a subquery $q'$ that is known to be intractable (in particular, a cycle), then $q$ is not tractable. This does not hold in general, but as we show with the next auxiliary result, it holds for the queries in $\mathcal{C}^*$.
Lemma 5.13. Let $q$ be a Boolean query such that $q \in \mathcal{C}^*$. Let $R_1(\bar{x}_1, \bar{y}_1), \ldots, R_n(\bar{x}_n, \bar{y}_n)$ be the literals of $q$. Let $q'$ [...] is a cycle. Let $L = \{x_1, y_1, \ldots, x_m, y_m\}$. Assume that:
• $x_i$ occurs in $\bar{x}_i$, for $1 \leq i \leq m$, and
• $y_i$ occurs in $\bar{y}_i$, for $1 \leq i \leq m$, and
• for $1 \leq i \leq m$, if $w \in L$ and $w$ occurs in $R_i$, then $w$ occurs in $S_i$.
Then, there is a polynomial-time reduction from the problem CONSISTENT($q'$, $\Sigma'$) to CONSISTENT($q$, $\Sigma$).
Proof. Let $F = \{w : w \text{ occurs in } R_i, \text{ and } 1 \leq i \leq m\} \setminus L$. Let $U = \{w : w \text{ occurs in } q\} \setminus F \setminus L$. Let $I'$ [...]
for each valuation $\theta_{q'}$ such that $I' \models S_1(x_1, y_1) \wedge \cdots \wedge S_m(x_m, y_m)[\theta_{q'}]$ do
  for each variable $w$ such that $w \in F$ do
    Let $\theta_q(w) = \theta_F(w)$
  end for
  for each variable $w$ such that $w \in U$ do
    Create a new constant $c_{new}$
    Let $\theta_q(w) = c_{new}$
  end for
  for $i := 1$ to $m$ do
    Let $\theta_q(x_i) = \theta_{q'}(x_i)$
    Let $\theta_q(y_i) = \theta_{q'}(y_i)$
  end for
  Add the tuples of $R_1(\bar{x}_1, \bar{y}_1) \wedge \cdots \wedge R_n(\bar{x}_n, \bar{y}_n)[\theta_q]$ to $I$
end for
We claim that $consistent_{\Sigma'}(q', I') = true$ iff $consistent_\Sigma(q, I) = true$.
($\Rightarrow$) Let $\mathcal{I}$ be a repair of $I$ over the schema of $q$. We shall build an instance $\mathcal{I}'$ over the schema of $q'$ as follows.
for $i := 1$ to $m$ do
  for each tuple $R_i(\bar{c}_i, \bar{d}_i)$ of $\mathcal{I}$ do
    Let $c_i$ be the constant that appears in $\bar{c}_i$ at the position of one of the occurrences of $x_i$ in $\bar{x}_i$.
    Let $d_i$ be the constant that appears in $\bar{d}_i$ at the position of $y_i$ in $\bar{y}_i$
    Add $S_i(c_i, d_i)$ to $\mathcal{I}'$
  end for
end for
We make the following observations with respect to the construction of $\mathcal{I}'$. By construction of $I$, if $R_i(\bar{c}_i, \bar{d}_i) \in I$, the same constant appears in $\bar{c}_i$ at all the positions where $x_i$ appears in $\bar{x}_i$. By Proposition 3.6, $\mathcal{I} \subseteq I$. Thus, in the construction of $\mathcal{I}'$, it suffices to choose the constant that occurs in $\bar{c}_i$ at any of the positions where $x_i$ occurs in $\bar{x}_i$.
Assume that $\mathcal{I}'$ is not a repair of $I'$ [...] such that $d_i \neq d_i'$, $S_i(c_i, d_i) \in \mathcal{I}'$ and $S_i(c_i, d_i') \in \mathcal{I}'$. By construction of $\mathcal{I}'$, [...] $R_i(\bar{c}_i', \bar{d}_i') \in \mathcal{I}$ such that $c_i$ appears in $\bar{c}_i$ and $\bar{c}_i'$ at all the positions where $x_i$ appears in $\bar{x}_i$; and $d_i$ and $d_i'$ appear in $\bar{d}_i$ and $\bar{d}_i'$, respectively, at the position of $y_i$ in $\bar{y}_i$. Clearly, $\bar{d}_i \neq \bar{d}_i'$. By construction of $I$, if $w$ is a variable such that $w \notin L$, $w$ is assigned the value $\theta_F(w)$ in every tuple of $I$. By Proposition 3.6, $\mathcal{I} \subseteq I$. Thus, $\bar{c}_i = \bar{c}_i'$. Since $\bar{d}_i \neq \bar{d}_i'$, $\mathcal{I}$ does not satisfy the key constraints of $\Sigma$. Thus $\mathcal{I}$ is not a repair; contradiction. We conclude that $\mathcal{I}'$ is a repair of $I'$.
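Since $consistent_{\Sigma'}(q', I') = true$, $\mathcal{I}' \models q'$. Thus, there exists a valuation $\theta_{q'}$ such that $\mathcal{I}' \models S_1(x_1, y_1) \wedge \cdots \wedge S_m(x_m, y_m)[\theta_{q'}]$. Let $\theta_m$ be a valuation for the variables of $R_1, \ldots, R_m$ such that:
  $\theta_m(x_i) = \theta_{q'}(x_i)$, for $1 \leq i \leq m$
  $\theta_m(y_i) = \theta_{q'}(y_i)$, for $1 \leq i \leq m$
  $\theta_m(w) = \theta_F(w)$ if $w \in F$
Let $w$ be a variable that appears in $R_i$, for $1 \leq i \leq m$. If $w \in L$ and $w$ occurs in $R_i$, by hypothesis, $w$ occurs in $S_i$. If $w \notin L$, then $w \in F$, by definition of $F$. Since $\mathcal{I}' \models S_1(x_1, y_1) \wedge \cdots \wedge S_m(x_m, y_m)[\theta_{q'}]$, and $\theta_m(w) = \theta_F(w)$ if $w \in F$, we conclude that $\mathcal{I} \models R_1(\bar{x}_1, \bar{y}_1) \wedge \cdots \wedge R_m(\bar{x}_m, \bar{y}_m)[\theta_m]$.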
By construction of $I$, there is a valuation $\theta_q$ for the variables of $q$ such that:
• $\theta_m(w) = \theta_q(w)$ if $w$ appears in $R_i$, for $1 \leq i \leq m$; and
• $I \models R_1(\bar{x}_1, \bar{y}_1) \wedge \cdots \wedge R_n(\bar{x}_n, \bar{y}_n)[\theta_q]$.
Let $R_i(\bar{x}_i, \bar{y}_i)$ be a literal of $q$ such that $i > m$. Notice that we assume that the join graph of $q'$ is a cycle. Since $q$ is in $\mathcal{C}^*$ [...] $R_i(\bar{c}_i', \bar{d}_i')$ are added at different iterations, then $\bar{c}_i \neq \bar{c}_i'$. Therefore, by Propositions 3.6 and 3.7, every tuple in the extension of $R_i$ in $I$ is in the extension of $R_i$ in $\mathcal{I}$. Therefore, $\mathcal{I} \models R_1(\bar{x}_1, \bar{y}_1) \wedge \cdots \wedge R_n(\bar{x}_n, \bar{y}_n)[\theta_q]$.
($\Leftarrow$) Let $\mathcal{I}'$ be a repair of $I'$ [...] do
    Let $R_i(\bar{c}_i, \bar{d}_i)$ be a tuple of $I$ such that $c_i$ appears in $\bar{c}_i$ at all the positions of $x_i$ in $\bar{x}_i$, and $d_i$ appears in $\bar{d}_i$ at the position of $y_i$ in $\bar{y}_i$
    Add $R_i(\bar{c}_i, \bar{d}_i)$ to $\mathcal{I}$
  end for
end for
for $i := m + 1$ to $n$ do
  for each tuple $R_i(\bar{c}_i, \bar{d}_i)$ in $I$ do
    Add $R_i(\bar{c}_i, \bar{d}_i)$ to $\mathcal{I}$
  end for
end for
We will now show that $\mathcal{I}$ is a repair of $I$. Towards a contradiction, assume that $\mathcal{I}$ is not a repair of $I$. Then, there are values $\bar{c}_i$, $\bar{d}_i$, and $\bar{d}_i'$ such that $\bar{d}_i \neq \bar{d}_i'$, $R_i(\bar{c}_i, \bar{d}_i) \in \mathcal{I}$, and $R_i(\bar{c}_i, \bar{d}_i') \in \mathcal{I}$.
First, assume that $1 \leq i \leq m$. For every variable $w$ such that $w \notin L$ and $w$ occurs in $R_i$, $w \in F$. Thus, $w$ is assigned the same constant $\theta_F(w)$ in every tuple of $I$. By Proposition 3.6, $\mathcal{I} \subseteq I$. Therefore, there are constants $c_i$, $d_i$ and $d_i'$ such that $d_i \neq d_i'$, $c_i$ appears in $\bar{c}_i$ at the positions of $x_i$ in $\bar{x}_i$, and $d_i$ and $d_i'$ appear in $\bar{d}_i$ and $\bar{d}_i'$, respectively, at the position of $y_i$ in $\bar{y}_i$. By construction of $\mathcal{I}$, there are tuples $S_i(c_i, d_i)$ and $S_i(c_i, d_i')$ in $\mathcal{I}'$. Since $d_i \neq d_i'$, $\mathcal{I}'$ [...]. Thus, $\mathcal{I}'$ is not a repair; contradiction.
Now, assume that $m < i \leq n$. Notice that we assume that the join graph of $q'$ is a cycle. Since $q$ is in $\mathcal{C}^*$ [...] $R_i(\bar{c}_i', \bar{d}_i')$ are added at different iterations, then $\bar{c}_i \neq \bar{c}_i'$. Therefore, the extension of $R_i$ in $I$ satisfies the key dependencies of $\Sigma$. Thus, by construction of $\mathcal{I}$, the extension of $R_i$ in $\mathcal{I}$ satisfies the key constraints of $\Sigma$. Thus, $\mathcal{I}$ is a repair of $I$; contradiction.
We conclude that $\mathcal{I}$ is a repair of $I$. Since [...] $\models S_1(x_1, y_1) \wedge \cdots \wedge S_m(x_m, y_m)[\theta_{q'}]$. Thus, $\mathcal{I}' \models q'$.
We are now ready to prove Lemma 5.14, which gives a polynomial-time reduction from the problem of computing consistent answers for the queries of Lemmas 5.1 or 5.11 to every query in $\mathcal{C}_{hard}$. From this, Theorem 5.5 follows directly.
Lemma 5.14. Let $q$ be a query such that $q \in \mathcal{C}_{hard}$. Then, there is a polynomial-time reduction from CONSISTENT($q'$, $\Sigma'$) to CONSISTENT($q$, $\Sigma$), where $q'$ [...]
• $\exists x, x', y.\ S_1(x, y) \wedge S_2(x', y)$
• $\exists x, y.\ T_1(x, y) \wedge T_2(y, x)$
Proof. Let $G$ be the join graph of $q$. Let $G'$ [...] is connected, and $G'$ [...], and $G'$ is connected, then $G'$ is a tree.
Let $P = \langle R_1, R_2, R_1 \rangle$ be a cycle of $G'$. Let $R_1(\bar{x}_1, \bar{y}_1)$ and $R_2(\bar{x}_2, \bar{y}_2)$ be the literals in $G'$ such that $x'$ occurs in $\bar{x}_2$ and $x$ [...] $= S_1(x, y) \wedge S_2(x'$ [...] ) to CONSISTENT($q$, $\Sigma$).
Let $P = \langle R_1, \ldots, R_m, R_1 \rangle$ be a cycle of $G'$. Let $R_1(\bar{x}_1, \bar{y}_1), \ldots, R_m(\bar{x}_m, \bar{y}_m)$ be the literals of $P$. Let $w_1, w_2, \ldots, w_m$ be variables such that $w_i$ occurs in $\bar{y}_i$ and in $R_{(i \bmod m)+1}$, for every $1 \leq i \leq m$. Assume that there is some $w_i$ such that $1 \leq i \leq m$ and $w_i$ occurs in some literal $R_j$ of $q$ such that $j \neq i$ and $j \neq (i \bmod m)+1$. Then $\langle R_1, \ldots, R_i, R_j, \ldots, R_1 \rangle$ is a cycle. Therefore $G'$ [...] such that $G'$ is connected, and $G'$ [...] $= S_1(w_m, w_1) \wedge S_2(w_1, w_2) \wedge \cdots \wedge S_m(w_{m-1}, w_m)$.
It can be checked that $q$ and $q'$ [...] ) to CONSISTENT($q$, $\Sigma$). Let $q'' = \exists x, y.\ T_1(x, y) \wedge T_2(y, x)$. By Lemma 5.12, there is a polynomial-time reduction from CONSISTENT($q''$ [...] ) to CONSISTENT($q'$ [...] ).
Finally, we give the proof for Theorem 5.5, the main result of this chapter.
Theorem 5.5. Let $q$ be a query such that $q \in \mathcal{C}_{hard}$. Then, CONSISTENT($q$, $\Sigma$) is coNP-complete in data complexity.
Proof. By Lemma 5.10, CONSISTENT($q$, $\Sigma$) is in coNP. In order to prove hardness, let $q'$ [...]
• $\exists x, x', y.\ S_1(x, y) \wedge S_2(x', y)$
• $\exists x, y.\ T_1(x, y) \wedge T_2(y, x)$
By Lemma 5.14, there is a polynomial-time reduction from CONSISTENT($q'$, $\Sigma'$) to CONSISTENT($q$, $\Sigma$). By Lemmas 5.1 and 5.11, CONSISTENT($q'$, $\Sigma'$) is coNP-hard. Thus, CONSISTENT($q$, $\Sigma$) is coNP-hard.
5.3 Related Work
Chomicki and Marcinkowski [CM05] and Calì, Lembo and Rosati [CLR03a] thoroughly study the decidability and complexity of consistent query answering for several classes of queries and integrity constraints. In order to show intractability of a class, they take the usual approach of exhibiting one query of the class for which the problem is intractable. To the best of our knowledge, the result that we present in Section 5.2 is the first dichotomy result in the area of consistent query answering.
Both Chomicki and Marcinkowski and Calì, Lembo and Rosati show that the problem of obtaining consistent answers for conjunctive queries under primary key constraints is coNP-complete. Chomicki and Marcinkowski also show an example of a query with just one literal but two key dependencies for which the problem is coNP-complete. This gives further support for our decision of considering exactly one key dependency per relation.
Calì, Lembo and Rosati show the undecidability of the problem of obtaining consistent answers when the set of constraints contains primary keys and arbitrary inclusion dependencies. They also show that the problem becomes decidable for foreign key constraints (it is coNP-complete). Chomicki and Marcinkowski study the same problem but under a semantics where only tuple deletion is allowed (i.e., repairs are always subsets of the inconsistent database). In this case, the problem is $\Pi_2^p$-complete, and becomes coNP-complete if the inclusion dependencies are restricted to be acyclic.
Chapter 6
ConQuer: System Implementation
and SQL Rewritings
In this chapter, we present ConQuer, a system for querying inconsistent databases. We demonstrated this system at the International Conference on Very Large Databases (VLDB) [FFM05b]. In Section 6.1, we describe the system implementation and a typical scenario where it can be used. Then, in Sections 6.2 and 6.3, we present the SQL rewritings that are at the core of ConQuer's approach. In Section 6.4, we show how, if desired, ConQuer can process the database offline in order to improve the performance of the queries. Finally, in Section 6.5, we review other systems that are related to ConQuer.
6.1 System Implementation
ConQuer is implemented in Java and follows a modular architecture. It consists of the following components:
• Query Rewriting Module. It rewrites an input SQL query into another SQL query that computes the consistent answers. The details of the rewritings are presented in Sections 6.2 to 6.4. The SQL queries are parsed using javacc.
• Query Execution Engine. The rewritten queries are executed using IBM DB2 UDB Version 8.2. The connection with the database is done through JDBC.
• Conflict Resolution Module. Provides a tracing facility to find the data that leads to differences between the answer to the original query and the consistent answer. This module also permits a user to update the database to correct errors.
• User Interface. Query results are displayed using a Web-accessible interface that is implemented in PHP.
Figure 6.1: Interface for entering hypothetical primary key constraints in ConQuer
We illustrate a typical use case of ConQuer on a database with information about airports. The user first specifies a set of primary key constraints using the interface shown in Figure 6.1. These are the constraints that should hold on a consistent database, but may be violated by the actual database that is being queried. Notice that for the same schema and database, there is the flexibility of running queries under different sets of potentially violated primary key constraints. Then, the user writes a SQL query within the interface. In Figure 6.2, we show a query where the user is asking for all the countries that have airports located north of parallel 63N. The result of the query is shown in Figure 6.3. The consistent answers are shown in bold, and the potential answers (i.e., possible answers that are not consistent answers) are shown in italics. For example, in this case Italy is a potential answer.
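The distinction between consistent and potential answers can be illustrated with a small brute-force sketch. The airport rows below (codes, countries, latitudes) are made-up data for illustration only, and the helper names are assumptions. ConQuer itself obtains these answers through SQL rewriting rather than by enumerating repairs.

```python
from itertools import product


def repairs(rows, key):
    """All repairs: one row is kept for each key value."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    return [list(choice) for choice in product(*groups.values())]


def classify(rows, key, query):
    """Return (consistent answers, potential answers that are not consistent)."""
    answers = [set(query(repair)) for repair in repairs(rows, key)]
    consistent = set.intersection(*answers)
    possible = set.union(*answers)
    return consistent, possible - consistent


# Hypothetical data: (airport code, country, latitude); the code is the key.
AIRPORTS = [
    ("AAA", "Norway", 69.7),
    ("BBB", "Italy", 64.0),   # conflicts with the next row on the key "BBB"
    ("BBB", "Italy", 45.0),
]


def north_of_63(rows):
    """Countries with at least one airport north of parallel 63N."""
    return {country for _code, country, latitude in rows if latitude > 63}


consistent_answers, potential_answers = classify(
    AIRPORTS, key=lambda row: row[0], query=north_of_63)
```

Here Norway is returned in every repair, while Italy is returned only in the repair that keeps the row with latitude 64.0, so it is a potential answer.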
While consistent answers are best suited for decision making, potential answers can be used to understand the reasons why a database is inconsistent. In this case, the user could click on Italy and obtain an explanation, which is shown in Figure 6.4. The explanation is the lineage (or why-provenance) [BKT01, CW03] of the result, i.e., the tuples in the database that contribute to the answer. According to the explanation, Italy is a potential answer because it has one airport that appears as satisfying the query (parallel 63) in one tuple, and violating it (parallel 45) in another. Notice that in the comment to the query, the user wrote "select countries that are located north of Trondheim". Trondheim is a Norwegian city, and the user may have background knowledge telling that all Italian cities are south of Norwegian cities. Thus, the user could use the explanation obtained from ConQuer in order to remove the tuple for the Italian airport located on parallel 63.
Figure 6.2: Interface for entering queries in ConQuer
6.2 ConQuer Rewritings for Queries without Aggregation
In this section, we present the SQL rewritings produced by ConQuer for a class of Select-Project-Join (SPJ) queries with set semantics. We delay the treatment of conjunctive queries that return duplicates until the next section, where the number of duplicates returned by the queries can be counted with the count(*) aggregate function. We first give the query rewriting algorithm, and then we illustrate it with a number of examples.
6.2.1 Rewriting Algorithm
We now present a SQL rewriting algorithm for SPJ queries that are equivalent to a conjunctive query in the class $\mathcal{C}_{forest}$, introduced in Definition 3.4, which we repeat next.
Figure 6.3: Query results in ConQuer
Figure 6.4: Query explanation in ConQuer
Definition 3.4. Let $q$ be a conjunctive query without repeated relation symbols and all of whose nonkey-to-key joins are full. Let $G$ be the join graph of $q$. We say that $q \in \mathcal{C}_{forest}$ if $G$ is a forest (i.e., every connected component of $G$ is a tree).
The above definition requires three conditions on the conjunctive query. First, the query must have no repeated relation symbols. For an SPJ SQL query, this means that each relation can be used at most once in the from clause. Second, all its nonkey-to-key joins must be full. For an SPJ query, this means that if an attribute of a key of a relation $r_1$ is equated in the where clause with a nonkey attribute of another relation $r_2$, then all the attributes of the key of $r_1$ are equated to nonkey attributes of $r_2$. Finally, the join graph of $q$ must be a forest. The notion of a join graph is introduced in Definition 3.1, and we repeat it next.
Definition 3.1 (join graph). Let $q$ be a conjunctive query. The join graph $G$ of $q$ is a directed graph such that:
• the vertices of $G$ are the literals of $q$;
• there is an arc from literal $R_i$ to literal $R_j$ if $i \neq j$, and there is some variable $w$ such that $w$ is existentially-quantified in $q$, $w$ occurs at the position of a nonkey attribute in $R_i$, and $w$ occurs in $R_j$.
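The definition of the join graph can be sketched as follows. The encoding of a literal as a pair (key variables, nonkey variables) and the function names are assumptions made for this example. The forest test also rests on an assumption: it treats a directed graph as a forest when every vertex has at most one incoming arc and there is no directed cycle.

```python
def join_graph(literals, free_vars=()):
    """Arcs (Ri, Rj): an existential variable at a nonkey position of Ri occurs in Rj.

    literals maps a relation name to (key_vars, nonkey_vars).
    """
    arcs = set()
    for ri, (_key_i, nonkey_i) in literals.items():
        for rj, (key_j, nonkey_j) in literals.items():
            if ri == rj:
                continue
            vars_j = set(key_j) | set(nonkey_j)
            shared = (set(nonkey_i) & vars_j) - set(free_vars)
            if shared:
                arcs.add((ri, rj))
    return arcs


def is_forest(nodes, arcs):
    """Assumed reading of 'forest': in-degree at most one and no directed cycle."""
    parent = {}
    for src, dst in arcs:
        if dst in parent:  # a second incoming arc
            return False
        parent[dst] = src
    for node in nodes:
        seen = {node}
        while node in parent:  # walk up towards a root
            node = parent[node]
            if node in seen:
                return False
            seen.add(node)
    return True


# S1(x, y) and S2(x2, y): a nonkey-to-nonkey join on y.
cycle_query = {"S1": (("x",), ("y",)), "S2": (("x2",), ("y",))}
# R1(x, y) and R2(y, z): a nonkey-to-key join from R1 to R2.
chain_query = {"R1": (("x",), ("y",)), "R2": (("y",), ("z",))}
```

For `cycle_query` the graph has arcs in both directions between S1 and S2, which is a cycle of length two, whereas `chain_query` yields a single arc from R1 to R2.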
An analogous definition can be given for the join graph of an SPJ SQL query. The vertices of the graph will be the relation symbols in the from clause of the query. Furthermore, there will be an arc from relation $r_i$ to relation $r_j$ if there is an attribute $A$ in $r_i$ such that (1) $A$ is not in the key of $r_i$ (it is a nonkey attribute); (2) $A$ does not appear in the select clause of the query, and $A$ is not equated to any attribute $B$ such that $B$ appears in the select clause of the query (this corresponds to the notion of an existentially-quantified variable for conjunctive queries); and (3) there is some equality in the where clause relating $A$ to some attribute $B$ of $r_j$ (i.e., a nonkey-to-key or nonkey-to-nonkey join). [1]
We can now give a definition analogous to $\mathcal{C}_{forest}$ for SPJ SQL queries. A query $q$ is in class $\mathcal{C}^{sql}_{forest}$ if no relation appears twice in the from clause of $q$, all the nonkey-to-key joins of $q$ are full, and the join graph of $q$ is a forest.
[1] This definition works for repeated relation symbols as well. In such a case, we assume that if a relation appears more than once in the from clause, then it is aliased to a new name using the as operator.
We are now ready to give ConQuer's rewriting algorithm for SPJ queries in $\mathcal{C}^{sql}_{forest}$. The algorithm is called RewriteForestSQL and is shown in Figure 6.5. The algorithm takes as input a SQL query $q$ in $\mathcal{C}^{sql}_{forest}$ and a set of key constraints (one per relation of the schema), and returns a SQL rewriting $Q$ of $q$.
In the rewriting $Q$, the attributes of the relations in $q$ play different roles. In particular, we will distinguish the attributes that the query projects on (i.e., that appear in the select clause), and the attributes that appear in the key of a relation that is at the root of some tree in the join graph of $q$. In the rest of the discussion, we will call these attributes projecting attributes and key-root attributes, respectively. The former are denoted in Figure 6.5 with the symbols $S_1, \ldots, S_l$; the latter are denoted with $K_1, \ldots, K_n$.
The rewriting $Q$ has three subqueries, specified using a with clause: candidatesSubQuery, countViolSubQuery and countProjSubQuery. The purpose of candidatesSubQuery is to prune the number of values for the key-root attributes that should be considered by the other subqueries. In particular, candidatesSubQuery applies the selection conditions of the original query $q$, and projects on its key-root attributes. These attributes are used to perform an inner join in the next subquery (countViolSubQuery). If the selectivity of $q$ is low (i.e., few tuples satisfy its conditions), and the query optimizer pushes down the selection conditions of candidatesSubQuery in the query plan, we would expect the rewriting to have a low overhead with respect to the original query. We validate this conjecture in Section 7.2.
Let $\mathcal{CONDS}$ be the list of conditions in the where clause of $q$. In the from clause of countViolSubQuery, we count the number of tuples that violate the conditions of $\mathcal{CONDS}$, we group by the key-root attributes, and keep the result in an attribute called countViol as follows:
  sum(case when $\mathcal{CONDS}$ then 0 else 1 end)
    over (partition by $K_1, \ldots, K_n$)
    as countViol
Notice the use of the partition by clause. This clause (introduced in the OLAP Amendment to SQL [ISO01]) differs from the typical group by clause in that it permits grouping by a set of attributes that may not include all the attributes in the select clause. This is useful here because we partition by the root-key attributes, but the select clause of countViolSubQuery also includes the projecting attributes of the query. In the main body of the query, we filter out the tuples whose key-root attributes are involved in a violation of $\mathcal{CONDS}$ by checking the condition countViol=0.
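The effect of the countViol computation followed by the countViol=0 filter can be mimicked outside SQL. The sketch below is only an approximation under assumptions: it works on a single in-memory list of rows rather than on the joined relations of the rewriting, and the employee rows and the salary condition are hypothetical.

```python
from collections import defaultdict


def filter_by_count_viol(rows, key_root, conds):
    """Keep the rows whose key-root group contains no row violating conds."""
    count_viol = defaultdict(int)
    for row in rows:
        # Mirrors: sum(case when CONDS then 0 else 1 end) over (partition by key)
        count_viol[key_root(row)] += 0 if conds(row) else 1
    # Mirrors the final check countViol = 0
    return [row for row in rows if count_viol[key_root(row)] == 0]


# Hypothetical employee(name, salary) rows; name is the key-root attribute.
EMPLOYEES = [("ann", 1500), ("ann", 900), ("bob", 2000)]

kept = filter_by_count_viol(
    EMPLOYEES,
    key_root=lambda row: row[0],
    conds=lambda row: row[1] > 1000,
)
```

The key value "ann" has one tuple that violates the condition, so both of its rows are discarded and only the row for "bob" survives.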
The from clause of subquery countViolSubQuery is obtained by calling a procedure called GetJoinsExpression (shown in Figure 6.6), with the join graph of $q$ and the list of conditions $\mathcal{CONDS}$ as parameters. This procedure consists of two parts. In the first part, an inner join is computed for the key-to-key joins of relations that are at the root of some tree of the join graph. Notice that since these relations are in distinct connected components of the join graph, they are not related by a nonkey-to-key join. In the second part, the procedure produces a left outer join expression for each tree of the join graph. This is done by recursively calling the procedure GetTreeJoinsExpression for the nodes of each tree (also shown in Figure 6.6). The expression returned by GetTreeJoinsExpression is a left outer join of all relations in the input tree, listed in an order corresponding to a preorder traversal of the trees.
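The preorder listing of left outer joins can be sketched as below. This is not the thesis's GetTreeJoinsExpression procedure: the tree encoding, the `join_cond` lookup, the function name, and the relation and attribute names in the example are all assumptions, and the handling of the selection conditions is left out.

```python
def tree_joins_expression(tree, root, join_cond):
    """Build 'root left outer join child on cond ...' in preorder.

    tree maps a relation to the list of its children in the join graph;
    join_cond maps (parent, child) to the join condition between them.
    """
    parts = []

    def visit(node):
        for child in tree.get(node, []):
            parts.append(
                f"left outer join {child} on {join_cond[(node, child)]}")
            visit(child)  # descend before moving on to the next sibling

    visit(root)
    return " ".join([root] + parts)


# Hypothetical tree: orders -> customer -> nation.
TREE = {"orders": ["customer"], "customer": ["nation"]}
CONDS = {
    ("orders", "customer"): "orders.custkey = customer.custkey",
    ("customer", "nation"): "customer.nationkey = nation.nationkey",
}
expression = tree_joins_expression(TREE, "orders", CONDS)
```

Using left outer joins keeps the tuples of the root relation even when no matching child tuple exists, which is what allows violations to be counted later.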
We will illustrate shortly (in Example 6.4) the rewriting for queries where some of the root-key attributes do not appear in the select clause (that is, some root-key attributes are not projecting attributes). We will argue that in such cases, we would like to count the number of distinct values for the projecting attributes, grouping by the root-key attributes. We will also show how to do this by using the max aggregate function (with a partition by clause) and the rank OLAP function. In the algorithm RewriteForestSQL of Figure 6.5, the rank function is used in the subquery countViolSubQuery, and the max function is used in the subquery countProjSubQuery. The result of this aggregation is kept in an attribute called countProjection, which keeps the count of distinct values for each instantiation of the root-key variables. This attribute is used in the main body of the rewriting, where we check countProjection=1.
In the subqueries, we project not only on the projecting attributes $S_1, \ldots, S_l$, but also on the root-key attributes $K_1, \ldots, K_n$. However, in the main query of the rewriting we project only on the attributes $S_1, \ldots, S_l$. In this way, the rewritten query $Q$ and the input query $q$ return tuples for the same set of attributes.
For the sake of clarity, we omitted the order by clause in the query $q$. However, dealing with ordering in the rewriting is quite easy. We just need to add the attributes of the order by of $q$ to the select clause of the subqueries, and include them in the order by clause of the main body of the rewriting.
Algorithm RewriteForestSQL(q, $\Sigma$)
Input: q, a SQL query in $\mathcal{C}^{sql}_{forest}$ of the form
select <list of attributes>
from <list of relations>
where <list of conditions>
$\Sigma$, a set of key constraints (one per relation)
Output: Q, a SQL query that computes consistent
[...] $\forall s'.(\mathit{employee}(e, s') \rightarrow s' \leq 1000)$
Notice that the first and second conjuncts of the first-order rewriting $Q_1$ actually
correspond to the original query $q_1$. Thus, the rewriting starts with a subquery called
candidatesSubQuery that retrieves the employee names that satisfy $q_1$ (and are thus
candidates to be consistent answers).
Algorithm GetJoinsExpression(G, CONDS)
Input: G, a join graph that forms a forest
       CONDS, a list of conditions of the form x op y,
       where op is some binary comparison operator such as =, ≠, <, etc.
Output: a subexpression of a SQL query
Let $r_1, \ldots, r_m$ be the relations at the root of all trees of G
Initialize RJOINS as the string $r_1$
for i := 2 to m do
    Let JCONDS be the conjunction of all join conditions (i.e., equalities) between attributes
    of $r_{i-1}$ and $r_i$
    Concatenate "join $r_i$ on JCONDS" to RJOINS
end for
Initialize TJOINS as an empty expression
Let $T_1, \ldots, T_m$ be the trees of G rooted at $r_1, \ldots, r_m$
for i := 1 to m do
    Concatenate the expression returned by GetTreeJoinsExpression($T_i$, CONDS) to TJOINS
end for
return RJOINS and TJOINS

Algorithm GetTreeJoinsExpression(T, CONDS)
Input: T, a join graph that forms a tree
       CONDS, a list of conditions of the form x op y,
       where op is some binary comparison operator such as =, ≠, <, etc.
Output: a subexpression of a SQL query
Initialize LOJOINS as an empty string
Let r be the relation at the root of T
if T consists of more than one node then
    Let $r_1, \ldots, r_m$ be the relations that are children of r in T
    for i := 1 to m do
        Let JCONDS be the conjunction of all join conditions (i.e., equalities) between
        attributes of r and $r_i$
        Concatenate "left outer join $r_i$ on JCONDS" to LOJOINS
    end for
    for i := 1 to m do
        Let $T_i$ be the subtree of T rooted at $r_i$
        Concatenate the expression returned by GetTreeJoinsExpression($T_i$, CONDS) to LOJOINS
    end for
end if
return LOJOINS
Figure 6.6: Procedures to obtain an expression for the joins of a query
with candidatesSubQuery as (
select emplKey
from employee
where salary <= 1000)
Since emplKey is a key of the relation employee, in the repairs, each employee name
will be associated with exactly one salary. However, in the inconsistent database, an
employee name may appear with several different salaries. Thus, the rewriting must
ensure that the employee names in the consistent answers are associated with salaries
satisfying the selection condition of the input query $q_1$ (i.e., that the salary is less than
or equal to 1000) in every tuple of the inconsistent relation employee where the employee
name appears. This is done in $Q_1$ with the expression
$\forall s'.(\mathit{employee}(e, s') \rightarrow s' \leq 1000)$.
It is straightforward to translate this expression into SQL using nested queries and the
not exists construct. However, from our empirical observations in the context of DB2,
we have noticed that such constructs lead in many cases to inefficient queries. Thus,
for the sake of efficiency, the rewritings produced by ConQuer avoid the not exists
construct. One way of doing this is to count, for each employee, the number of salaries
in the inconsistent database that violate the selection condition of $q_1$. If there are no
violations (i.e., the number of salaries violating the condition for the employee is zero),
then the employee name satisfies the selection condition in every tuple of the inconsistent
relation. This can be achieved with the following subquery.
with countViolSubQuery as (
select emplKey,
sum(case
when salary <= 1000 then 0 else 1 end) as countViol
from employee
where exists (select *
from candidatesSubQuery C where
C.emplKey=employee.emplKey)
group by emplKey)
In the above subquery, we count the number of violations for each employee. We keep
this count in an attribute called countViol. The final result of the query consists of the
employee names for which there are no violations (countViol = 0). In the subquery,
for each tuple of employee, we compute a case statement. If the salary in the tuple
is less than or equal to 1000 (i.e., it satisfies the selection condition of $q_1$) we output
a value of zero (meaning no violation). Otherwise, we output 1 (meaning a violating
tuple). The query aggregates these values, summing them up for each employee name.
If the sum for an employee name is zero, that means that there are no violating tuples
involving that employee name. Otherwise, we get the number of violating tuples (hence
the name, countViol). In the main body of the query (which we give below), we return
all employee names that are not involved in any violation.
select emplKey
from countViolSubQuery
where countViol = 0
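The rewriting of the selection query can be tried out on a small inconsistent instance. The following sketch is an illustration under stated assumptions: it runs on SQLite through Python's standard `sqlite3` module (not on DB2, the system used in this thesis), the sample tuples are invented, and an `order by` clause is added only to make the printed result deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employee (emplKey text, salary integer);
    -- John and Anna violate the key constraint on emplKey (invented sample data)
    insert into employee values
        ('John', 900), ('John', 2000),
        ('Mary', 1000),
        ('Anna', 500), ('Anna', 800),
        ('Bob', 3000);
""")

REWRITING_Q1 = """
with candidatesSubQuery as (
    select emplKey
    from employee
    where salary <= 1000),
countViolSubQuery as (
    select emplKey,
           sum(case when salary <= 1000 then 0 else 1 end) as countViol
    from employee
    where exists (select *
                  from candidatesSubQuery C
                  where C.emplKey = employee.emplKey)
    group by emplKey)
select emplKey
from countViolSubQuery
where countViol = 0
order by emplKey
"""

consistent_q1 = [row[0] for row in conn.execute(REWRITING_Q1)]
print(consistent_q1)  # ['Anna', 'Mary']
```

John is a candidate but has one violating tuple (salary 2000), so he is dropped. Anna is inconsistent with respect to the key, yet both of her salaries satisfy the condition, so she appears in every repair's answer.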
Join
We now present two examples to illustrate the rewriting of queries that contain join
conditions. In the first example, we show the rewriting for a query that has one join
condition. In the second example, we show the rewriting for a query with a more complex
join graph.
Example 6.2. Let R be a schema with relations employee(emplKey, deptFKey) and
dept(deptKey, mgrName). Consider a SQL query $q_2$ that retrieves the names of all
employees whose department appears in the dept relation:
$q_2$: select distinct emplKey
from employee,dept
where employee.deptFKey= dept.deptKey
Notice that $q_2$ has an inner join specified with the condition employee.deptFKey=
dept.deptKey of its where clause. In conjunctive query notation, $q_2$ can be written as
follows.
$q_2(e) = \exists d, m.\ \mathit{employee}(e, d) \wedge \mathit{dept}(d, m)$
It can be easily checked that $q_2$ is in the class $\mathcal{C}_{forest}$ of conjunctive queries. The
first-order query rewriting obtained by applying the algorithm RewriteForest($q_2$, $\Sigma$) is
the following:
$Q_2(e) = \exists d, m.\ \mathit{employee}(e, d) \wedge \mathit{dept}(d, m) \wedge \forall d.(\mathit{employee}(e, d) \rightarrow \exists m.\ \mathit{dept}(d, m))$
We could translate $Q_2$ to SQL using a not exists construct to achieve the effect of
the universal quantifier. Although this may be a reasonable strategy for a simple query
like $q_2$, we will show in the next example that it leads to deeply nested rewritings when
the original queries have several joins.
We now illustrate how to avoid the not exists construct in the rewritings. As in
the previous example, we can count, for each employee, the number of tuples violating
the conditions of the input query (in this case, the join condition). In order to detect
violations of the join condition employee.deptFKey=dept.deptKey, we need to check
whether there is a tuple in the employee relation whose department is not in the dept
relation. This can be achieved by performing a left outer join between the relations as
follows:
with candidatesSubQuery as (
select emplKey
from employee,dept
where employee.deptFKey= dept.deptKey ),
countViolSubQuery as (
select emplKey,
sum(case
when employee.deptFKey=dept.deptKey then 0 else 1 end)
as countViol
from employee left outer join dept
on employee.deptFKey=dept.deptKey
where exists (select *
from candidatesSubQuery C where
C.emplKey=employee.emplKey)
group by emplKey )
select emplKey
from countViolSubQuery
where countViol = 0
Notice that there is a subquery called countViolSubQuery, specified using a with
clause. In this subquery, we count the number of violations for each employee. We keep
this count in an attribute called countViol. The final result of the query consists of the
employee names for which there are no violations (countViol = 0). In the computation
of countViol, we use a case statement. If there is a join with some tuple of the
dept relation, we output a value of zero (meaning no violation). Otherwise, we output 1
(meaning a violating tuple). Notice that we can detect the violations of the (inner) join
of the input query $q_2$ because we are performing a left outer join in the rewritten query
$Q_2$. Had we performed an inner join in $Q_2$, the tuples that do not join on the department
would have never been seen by the case statement.
As in the previous example, the query aggregates the values for countViol, summing
them up for each employee name. If the sum for an employee name is zero, there are no
violating tuples involving that employee name. Otherwise, we get the number of violating
tuples.
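As a sanity check of the left outer join strategy, the sketch below compares the rewriting with a brute-force computation that enumerates every repair and intersects the answers of the original join query. The sample tuples, the helper names `repairs` and `q2`, and the use of SQLite (via Python's `sqlite3`) in place of DB2 are all assumptions made for illustration.

```python
import itertools
import sqlite3

# Invented sample data: John, Sue and department D1 violate their key constraints.
EMPLOYEE = [("John", "D1"), ("John", "D9"), ("Mary", "D1"), ("Sue", "D1"), ("Sue", "D2")]
DEPT = [("D1", "Ann"), ("D1", "Bob"), ("D2", "Carl")]


def repairs(rows):
    """Yield every repair: exactly one tuple is kept per key value (first column)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    for choice in itertools.product(*groups.values()):
        yield list(choice)


def q2(employee, dept):
    """The original join query: employees whose department appears in dept."""
    dept_keys = {d for d, _ in dept}
    return {e for e, d in employee if d in dept_keys}


# Consistent answers by definition: answers returned in every repair.
brute_force = set.intersection(
    *(q2(e, d) for e in repairs(EMPLOYEE) for d in repairs(DEPT)))

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (emplKey text, deptFKey text)")
conn.execute("create table dept (deptKey text, mgrName text)")
conn.executemany("insert into employee values (?, ?)", EMPLOYEE)
conn.executemany("insert into dept values (?, ?)", DEPT)

REWRITING_Q2 = """
with candidatesSubQuery as (
    select emplKey
    from employee, dept
    where employee.deptFKey = dept.deptKey),
countViolSubQuery as (
    select emplKey,
           sum(case when employee.deptFKey = dept.deptKey then 0 else 1 end) as countViol
    from employee left outer join dept on employee.deptFKey = dept.deptKey
    where exists (select *
                  from candidatesSubQuery C
                  where C.emplKey = employee.emplKey)
    group by emplKey)
select emplKey
from countViolSubQuery
where countViol = 0
"""
rewritten = {row[0] for row in conn.execute(REWRITING_Q2)}
print(sorted(rewritten), rewritten == brute_force)  # ['Mary', 'Sue'] True
```

John is excluded because the tuple (John, D9) has no matching department: the left outer join pads it with nulls, the case expression falls through to 1, and countViol becomes nonzero.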
We just illustrated how we can avoid the use of not exists in the SQL rewritings
by performing a left outer join. In the next example, we show why we adopt this strategy
in ConQuer: a naive translation may lead to a deeply nested query, where the level of
nesting may be as large as the number of relations in the from clause of the query.
Example 6.3. Let R be a schema with relations employee(emplKey, cityFKey, deptFKey),
dept(deptKey, mgrName), city(cityKey, provFKey), and prov(provKey, countryName).
Consider a SQL query $q_3$ that retrieves the names of all employees that are located in
Canada and whose manager is Peter:
$q_3$: select distinct emplKey
from employee, city, prov, dept
where employee.cityFKey=city.cityKey
and city.provFKey=prov.provKey
and employee.deptFKey=dept.deptKey
and prov.countryName='Canada'
and dept.mgrName='Peter'
In conjunctive query notation, $q_3$ can be written as follows.
$q_3(e) = \exists d, c, m, p.\ \mathit{employee}(e, d, c) \wedge \mathit{city}(c, p) \wedge \mathit{prov}(p, \text{Canada}) \wedge \mathit{dept}(d, \text{Peter})$
Figure 6.7: Join graph of query $q_3$.
It can be checked that $q_3$ is in class $\mathcal{C}_{forest}$. In particular, notice that the join graph
of $q_3$ (given in Figure 6.7) is a tree. As shown in Chapter 3, a first-order rewriting of $q_3$ can
be obtained by recursively traversing its join graph. The first-order query rewriting $Q_3$
obtained by applying RewriteForest($q_3$, $\Sigma$) is the following:
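$Q_3(e) = \exists d, c, m, p.\ \mathit{employee}(e, d, c) \wedge \mathit{dept}(d, m) \wedge \mathit{city}(c, p) \wedge \mathit{prov}(p, \text{Canada}) \wedge Q'(e)$
where:
$Q'(e) = \forall d, c.(\mathit{employee}(e, d, c) \rightarrow Q''(c) \wedge Q^{IV}(d))$
$Q''(c) = [\ldots]\ Q'''(p)$
$Q'''(p) = \mathit{prov}(p, \text{Canada}) \wedge \forall w'.(\mathit{prov}(p, w') \rightarrow w' = \text{Canada})$
$Q^{IV}(d) = \mathit{dept}(d, \text{Peter}) \wedge \forall u'.(\mathit{dept}(d, u') \rightarrow u' = \text{Peter})$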
The universal quantifiers can be translated to SQL using the not exists construct.
However, this may lead to an inefficient query. First, because it would have four self
joins (since each relation appears twice in the rewriting). Second, because each recursive
invocation of the algorithm produces a new universal quantifier, and a new subquery
within its scope. For example, Q [...], and Q [...].
As a consequence, the level of nesting of the SQL rewriting $Q_3$ would be three, which
corresponds to the height of the join graph.
As we showed in the previous example, in ConQuer we avoid using the not exists
construct by performing a left outer join of the relations in each tree of the join graph.
The SQL rewriting produced by ConQuer in this case is the following:
with candidatesSubQuery as (
select emplKey
from employee,city, prov,dept
where employee.cityFKey=city.cityKey
and city.provFKey=prov.provKey
and employee.deptFKey=dept.deptKey
and prov.countryName='Canada'
and dept.mgrName='Peter' ),
countViolSubQuery as (
select emplKey,
sum(case
when employee.cityFKey=city.cityKey
and city.provFKey=prov.provKey
and employee.deptFKey=dept.deptKey
and prov.countryName='Canada'
and dept.mgrName='Peter'
then 0 else 1 end) as countViol
from employee left outer join dept on employee.deptFKey=dept.deptKey
left outer join city on employee.cityFKey=city.cityKey
left outer join prov on city.provFKey=prov.provKey
where exists (select *
from candidatesSubQuery C where
C.emplKey=employee.emplKey)
group by emplKey )
select emplKey
from countViolSubQuery
where countViol = 0
It is important to note that the SQL rewriting has only two subqueries, even though
$q_3$ has four relations, and a join graph with a tree of depth three.
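The flat, two-subquery shape of this rewriting can be exercised on a toy instance. The sketch below is an illustration only: it uses SQLite through Python's `sqlite3` module rather than DB2, the sample tuples are invented, and an `order by` is added to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table employee (emplKey text, cityFKey text, deptFKey text);
    create table city (cityKey text, provFKey text);
    create table prov (provKey text, countryName text);
    create table dept (deptKey text, mgrName text);
    -- Invented sample data: John (two cities) and D2 (two managers) violate their keys
    insert into employee values
        ('John', 'Toronto', 'D1'), ('John', 'Boston', 'D1'),
        ('Mary', 'Toronto', 'D1'),
        ('Sue', 'Ottawa', 'D2'),
        ('Tom', 'Toronto', 'D3');
    insert into city values ('Toronto', 'ON'), ('Boston', 'MA'), ('Ottawa', 'ON');
    insert into prov values ('ON', 'Canada'), ('MA', 'USA');
    insert into dept values ('D1', 'Peter'), ('D2', 'Peter'), ('D2', 'Paul'), ('D3', 'Paul');
""")

REWRITING_Q3 = """
with candidatesSubQuery as (
    select emplKey
    from employee, city, prov, dept
    where employee.cityFKey = city.cityKey
      and city.provFKey = prov.provKey
      and employee.deptFKey = dept.deptKey
      and prov.countryName = 'Canada'
      and dept.mgrName = 'Peter'),
countViolSubQuery as (
    select emplKey,
           sum(case when employee.cityFKey = city.cityKey
                     and city.provFKey = prov.provKey
                     and employee.deptFKey = dept.deptKey
                     and prov.countryName = 'Canada'
                     and dept.mgrName = 'Peter'
                    then 0 else 1 end) as countViol
    from employee left outer join dept on employee.deptFKey = dept.deptKey
                  left outer join city on employee.cityFKey = city.cityKey
                  left outer join prov on city.provFKey = prov.provKey
    where exists (select *
                  from candidatesSubQuery C
                  where C.emplKey = employee.emplKey)
    group by emplKey)
select emplKey
from countViolSubQuery
where countViol = 0
order by emplKey
"""

consistent_q3 = [row[0] for row in conn.execute(REWRITING_Q3)]
print(consistent_q3)  # ['Mary']
```

John fails because one of his tuples places him in Boston, Sue fails because department D2 has a second manager, and Tom is never a candidate.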
Projection and the Need for OLAP Functions
In Example 6.1, we dealt with a query that projects on the key attribute of the relation
employee. If a query does not project on the key attribute, then special care must be
taken in the rewriting. We illustrate this with the next example.
Example 6.4. Let R be a schema with our standard employee(emplKey, salary) relation.
Let $q_4$ be a query that retrieves all salaries (regardless of the employee name).
$q_4$: select distinct salary
from employee
Comparing $q_4$ to $q_1$, the former query does not project on the key attribute emplKey,
and it has no where clause. In conjunctive query notation, $q_4$ can be written as follows.
$q_4(s) = \exists e.\ \mathit{employee}(e, s)$
The first-order query rewriting obtained by invoking RewriteForest($q_4$, $\Sigma$) is the
following.
$Q_4(s) = \exists e.\ \mathit{employee}(e, s) \wedge \forall s'.(\mathit{employee}(e, s') \rightarrow s' = s)$
Again, we would like to avoid the naive (but inefficient) translation of $Q_4$ into SQL
that uses the not exists construct. Intuitively, $Q_4$ returns the salaries s for which there
is at least one employee name that is associated with s and only with s in the tuples of the
inconsistent relation employee. In this way, we ensure that salary s will appear in every
repair. One way of writing $Q_4$ in SQL is the following:
select salary
from employee
where emplKey in (
select emplKey
from employee
group by emplKey
having count(distinct salary)=1)
In our empirical observations, the self join of the above query sometimes leads to
inefficient queries. The self join is needed because we are not including the salary
attribute in the select clause of the subquery. This is not an arbitrary decision. Rather,
it is forced by the syntax of SQL. In SQL, all the non-aggregated attributes of the select
clause must appear in the group by clause. If we include salary in the select clause of the
subquery, we must also group by it, and hence we are unable to count the number of
distinct salaries per employee name. We will show shortly how we overcome this problem
in ConQuer's rewritings.
We just argued that there are some query rewritings for which there is no obvious way
of avoiding self joins, and that this is caused by the syntax of the group by clause. This
problem was addressed in the OLAP Amendment to the SQL standard [ISO01], which
introduces aggregate functions with a partition by clause. The OLAP Amendment to
the standard has been implemented by all major database vendors. In particular, for
DB2, the standard has been supported since Version 7 (we are using Version 8.2).
The partition by clause is more flexible than group by for two reasons. First, there
can be one partition by clause for each aggregate function, whereas there can only be
one group by for the entire query. Second, unlike group by, the attributes of the select
clause are not required to appear in the partition by clauses of the query. We illustrate
the use of the partition by clause with the next example.
Example 6.5. Consider the following SQL query:
select emplKey,salary,
sum(salary) over (partition by emplKey)
as countProjection
from employee
The query returns triples of values. The first two values of each triple correspond to
employee names and salaries in the relation employee. The last attribute is the sum of
the salaries for the employee name in the tuple. Notice that the attribute emplKey is
in the partition by clause, but the salary attribute is not. So we are projecting on
two attributes (emplKey and salary), but considering only one of them for grouping the
results of the aggregate function. This cannot be done with a group by clause.
Let us finish this example by showing an application of the query to an actual
database. Consider the database I = {employee(John, 1000), employee(John, 2000),
employee(Mary, 1000)}. The result of applying the SQL query above to I is the following:
{(John, 1000, 3000), (John, 2000, 3000), (Mary, 1000, 1000)}.
In the next example, we show how the partition by clause could be used in order
to avoid self joins in the rewritings.
Example 6.4. (continued) Recall that we had obtained a rewriting of query $q_4$ that
performs a self join on the employee relation. We can write an equivalent query without
a self join by taking advantage of the partition by clause.
with countProjSubQuery as (
select emplKey,
salary,
count(distinct salary) over (partition by emplKey) as countProj
from employee )
select salary
from countProjSubQuery
where countProj = 1
In the subquery countProjSubQuery, we obtain the number of distinct salaries for
each employee name (which we keep in a variable called countProj). The rewriting then
returns the salaries of employees for which there is exactly one salary in the database
(countProj = 1).
The query rewriting that we just obtained avoids the use of a self join by using the
partition by clause. Unfortunately, though, this is not the end of the story. The
version of DB2 that we use in ConQuer currently supports the partition by clause for
a variety of aggregate functions (such as sum, min, max, count(*), and avg), but it does
not support the count(distinct) function. Nevertheless, the effect of count(distinct)
can be obtained by combining the use of the max aggregation function (with a partition
by clause) and an OLAP function called rank() as follows.
with rankProjSubQuery as (
select emplKey, salary,
rank() over (partition by emplKey order by salary)
as rankProjection
from employee ),
countProjSubQuery as (
select emplKey, salary,
max(rankProjection) over (partition by emplKey)
as countProjection
from rankProjSubQuery )
select distinct salary
from countProjSubQuery
where countProjection = 1
First, let us explain the use of the rank() function. The syntax of rank() is the
following:
rank() over
(partition by <partition attributes> order by <order attributes>)
The function creates groups for each tuple of values (instantiation) of the attributes
in the partition by clause, as we discussed before for other functions. The tuples of
each group are ordered according to the attributes in the order by clause, and assigned
a number according to their position in this ordering. If there is a tie (in our example,
two tuples with the same employee name and salary), the tuples are mapped to the same
number.
Let us illustrate the semantics of the rank() function in the context of our example
rewriting. Consider a database I = {employee(John, 1000), employee(John, 2000),
employee(Mary, 1000)}. Then, the function rank() over (partition by emplKey
order by salary) would map (John, 1000) to 1, (John, 2000) to 2, and (Mary, 1000)
to 1.
Now consider the instance I as an inconsistent database with respect to $\Sigma$ (which
contains a constraint stating that emplKey is the key of the employee relation). In
the subquery rankProjSubQuery of the rewritten query, we compute the ranking function
for each tuple and keep the value in an attribute called rankProjection. Then,
in the subquery countProjSubQuery, we obtain the maximum value of the attribute
rankProjection for each employee name, and keep it in an attribute called
countProjection. Notice that the grouping is done by employee names since the attribute
emplKey is in the partition by clause of the max aggregate function. In our example,
we would obtain {(John, 1000, 2), (John, 2000, 2), (Mary, 1000, 1)}. In the final result,
we would like to get salary 1000 because it appears associated with Mary in every repair,
but not 2000 because it does not appear in all repairs. We obtain this in the query
rewriting by checking the condition countProjection=1.
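The combination of rank() and max can be reproduced on the instance I above. The sketch is an assumption-laden illustration: it runs on SQLite (window functions require SQLite 3.25 or later) through Python's `sqlite3` module instead of DB2, and the `order by` in the first query is added only for deterministic output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (emplKey text, salary integer)")
conn.executemany("insert into employee values (?, ?)",
                 [("John", 1000), ("John", 2000), ("Mary", 1000)])

# The two subqueries of the rewriting, shared by both queries below.
COUNT_PROJ = """
with rankProjSubQuery as (
    select emplKey, salary,
           rank() over (partition by emplKey order by salary) as rankProjection
    from employee),
countProjSubQuery as (
    select emplKey, salary,
           max(rankProjection) over (partition by emplKey) as countProjection
    from rankProjSubQuery)
"""

rows = conn.execute(
    COUNT_PROJ + "select * from countProjSubQuery order by emplKey, salary"
).fetchall()
salaries = [row[0] for row in conn.execute(
    COUNT_PROJ + "select distinct salary from countProjSubQuery where countProjection = 1"
)]

print(rows)      # [('John', 1000, 2), ('John', 2000, 2), ('Mary', 1000, 1)]
print(salaries)  # [1000]
```

Since rank() leaves gaps after ties, the maximum rank can exceed the number of distinct values when duplicates are present. The test countProjection = 1 is unaffected, because the maximum rank is 1 exactly when all tuples of the partition tie.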
6.3 ConQuer Rewritings for SPJ Queries with Aggregation
In this section, we present the SQL query rewritings produced by ConQuer for queries
with grouping and aggregation. We first present the algorithm and then illustrate it with
some examples.
6.3.1 Rewriting algorithm
We now present the SQL rewriting algorithm for SPJ queries with aggregation that are
equivalent to the aggregate conjunctive queries in class $\mathcal{C}_{aggforest}$, introduced in
Definition 4.1, which we repeat next.
Definition 4.1. Let q be an aggregate conjunctive query. We say that q is in class
$\mathcal{C}_{aggforest}$ if q is of the form
select z, [count(*) | F(u)]
from $q'$(z, u)
group by z
where $q'$ is a conjunctive query in $\mathcal{C}_{forest}$, and F is one of the aggregation functions
min, max, or sum.
We can now give a definition analogous to $\mathcal{C}_{aggforest}$ for SPJ SQL queries with
aggregation.
Definition 6.1. We say that query q is in class $\mathcal{C}^{sql}_{aggforest}$ if q is of the form
select $S_1, \ldots, S_l$, [count(*)], $F_1(A_1), \ldots, F_u(A_u)$
from <list of relations>
where <list of conditions>
group by $S_1, \ldots, S_l$
where $S_1, \ldots, S_l, A_1, \ldots, A_u$ are attributes of the relations in the from clause, and
$F_1, \ldots, F_u$ may be any of the aggregation functions min, max, and sum.
We are now ready to give ConQuer's rewriting for queries in $\mathcal{C}^{sql}_{aggforest}$. The algorithm
is called RewriteAggSQL, and is shown in Figure 6.8. It takes as input a SQL query q in
class $\mathcal{C}^{sql}_{aggforest}$ and a set of key constraints (one per relation of the schema), and returns
a SQL rewriting Q of q.
In the rewriting Q, the attributes of the relations in q play different roles. As in
the algorithm RewriteForestSQL for queries without aggregation, we have projecting
and key-root attributes. The former are the attributes that q projects on (i.e., that
appear in its select clause), and the latter are the attributes that appear in the key
of a relation that is at the root of some tree in the join graph of q. In addition, in
RewriteAggSQL, we have aggregation attributes, that is, the attributes that appear as
arguments of some aggregation function of q. In Figure 6.8, we denote the projecting
attributes with the symbols $S_1, \ldots, S_l$; the key-root attributes with $K_1, \ldots, K_n$; and the
aggregation attributes with $A_1, \ldots, A_u$.
We denote the aggregation functions of q with $F_1, \ldots, F_u$. In the figure, we assume
that the 0-ary function count(*) is present in the query (but during the explanation it
will be easy to see what can be dropped if count(*) is not present).
The rewriting Q has five subqueries, specified using a with clause: candidatesSubQuery,
countViolSubQuery, contribAllSubQuery, contribConsistentSubQuery, and
contribNonConsistentSubQuery.
As in the algorithm RewriteForestSQL, the purpose of candidatesSubQuery is to
determine the values for the key-root attributes that should be considered by the other
subqueries. The subquery countViolSubQuery has the same purpose (counting the number
of violations per key-root value) as the subquery of the same name in the rewriting
RewriteForestSQL. One difference is that here we need to compute the attribute
satConds, which keeps track of whether each tuple satisfies the conditions of the query
(denoted as CONDS). The other difference is that in the select clause of the subquery,
we must project on the aggregation attributes since their values are needed to perform
aggregation in the rest of the rewriting.
The other three subqueries are used to compute the contributions to the lower and
upper bounds of each aggregate result. The subquery contribAllSubQuery computes,
for each instantiation of the key-root and projecting attributes, the minimum and maximum
value for each aggregation attribute. In particular, in the subquery we group by
$K_1, \ldots, K_n, S_1, \ldots, S_l$ (the key-root and projecting attributes), and for each aggregation
$F_i(A_i)$ in the select clause of q, we compute attributes bottomA_i and topA_i as min($A_i$)
and max($A_i$), respectively. We also compute an attribute countProjection, to keep
track of the projection on nonkey attributes.
The subqueries contribConsistentSubQuery and contribNonConsistentSubQuery
compute the contribution of the "consistent" and "nonconsistent" tuples to the
aggregation. The former are the tuples whose key-root values satisfy the following two
conditions. First, they have the same value for the projecting attributes in every tuple
where they appear (checked with condition countProjection = 1). Second, they are
not involved in a violation of the selection conditions CONDS in any of the tuples where
they appear (checked with condition countViol=0). The tuples that violate at least
one of these conditions are considered nonconsistent and dealt with in the subquery
contribNonConsistentSubQuery.
For the consistent tuples, the contributions computed in contribConsistentSubQuery
correspond to the bottom and top values from contribAllSubQuery. That is,
the attributes bottomA_i and topA_i of contribAllSubQuery appear in the select clause
of contribConsistentSubQuery. The computation of the contributions of the nonconsistent
tuples is more involved. In contribNonConsistentSubQuery, the expression of
the select clause that handles the contributions is obtained by calling the procedure
GetBoundsNonConsistent given in Figure 6.9. Notice in the figure that the contributions
are different depending on the aggregation function. The rationale and correctness proof
for these contributions were given in Chapter 4. In the figure, we do not include the 0-ary
operator count(*). For this operator, we need to return the attributes bottomCount and
topCount with values of zero and one, respectively.
In the subqueries, we project not only on the projecting attributes $S_1, \ldots, S_l$ but
also on the root-key attributes $K_1, \ldots, K_n$. However, in the main query of the rewriting
we project and group by only the attributes $S_1, \ldots, S_l$ (i.e., we project out the key-root
attributes). In this way, the rewritten query Q and the input query q return tuples
for the same set of attributes. We also compute the greatest lower bound (glbA_i) and
lowest upper bound (lubA_i) for each tuple of values for the projecting attributes. This
is obtained by performing the corresponding aggregation function (min, max, or sum) on
the top and bottom values computed in the previous subqueries. For the 0-ary function
count(*), the bounds are computed by summing up the values of the attributes
bottomCount and topCount from the previous subqueries. Notice that there is also a
condition having sum(bottomCount) > 0. This is included in order to ensure that the
tuples for the projecting attributes are consistent answers.
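The contribution rule for count(*) can be checked against the definition of the bounds on a small instance. The sketch below is a hedged illustration in plain Python, not ConQuer's SQL: the relation employee(emplKey, salary, age), the condition salary <= 1000, the grouping by age, the sample tuples, and all function names are assumptions chosen for the example. It computes the bounds once by enumerating every repair, and once by letting each consistent key contribute (1, 1) and each nonconsistent key contribute (0, 1).

```python
import itertools
from collections import defaultdict

# Invented sample data for employee(emplKey, salary, age); emplKey is the key.
EMPLOYEE = [
    ("John", 900, 30), ("John", 2000, 30),  # one tuple violates salary <= 1000
    ("Mary", 1000, 30),                     # consistent
    ("Sue", 500, 30), ("Sue", 700, 40),     # two different ages
    ("Tom", 800, 40), ("Tom", 600, 40),     # same age, both tuples satisfy the condition
]


def by_key(rows):
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return groups


def count_per_age(rows):
    """The aggregate query: select age, count(*) ... where salary <= 1000 group by age."""
    counts = defaultdict(int)
    for _, salary, age in rows:
        if salary <= 1000:
            counts[age] += 1
    return dict(counts)


def bounds_by_enumeration(rows):
    """Evaluate the query on every repair; keep groups that appear in all of them."""
    results = [count_per_age(list(choice))
               for choice in itertools.product(*by_key(rows).values())]
    ages = set.intersection(*(set(r) for r in results))
    return {a: (min(r[a] for r in results), max(r[a] for r in results)) for a in ages}


def bounds_by_contributions(rows):
    """Sum per-key bottom/top contributions, keeping groups with a positive lower bound."""
    bottom, top = defaultdict(int), defaultdict(int)
    for tuples in by_key(rows).values():
        count_viol = sum(1 for _, s, _ in tuples if s > 1000)
        all_ages = {a for _, _, a in tuples}
        consistent = count_viol == 0 and len(all_ages) == 1
        for age in {a for _, s, a in tuples if s <= 1000}:
            bottom[age] += 1 if consistent else 0
            top[age] += 1
    return {a: (bottom[a], top[a]) for a in top if bottom[a] > 0}


enumerated = bounds_by_enumeration(EMPLOYEE)
contributed = bounds_by_contributions(EMPLOYEE)
print(contributed)  # {30: (1, 3), 40: (1, 2)}
```

On this instance both computations agree. The filter on a positive lower bound plays the role of the condition having sum(bottomCount) > 0: a group is reported only if some consistent key guarantees its presence in every repair.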
For the sake of clarity, we omitted the order by clause in the query q. However,
dealing with ordering in the rewriting is quite easy. We just need to add the attributes
of the order by clause of q to the select clause of the subqueries, and finally add an
order by clause to the main query. The only special case that must be considered
is when an aggregate attribute appears in the order by clause. Since for each aggregate
attribute of q we have two attributes in the rewritten query (one for each bound), we
must (arbitrarily) decide whether the ordering will be by either the greatest lower or the
lowest upper bound.
Algorithm RewriteAggSQL(q, $\Sigma$)
Input: q, a SQL query in $\mathcal{C}^{sql}_{aggforest}$ of the form
select <list of attributes>,<list of aggregation functions>
from <list of relations>
where <list of conditions>
group by <list of attributes>
$\Sigma$, a set of key constraints (one per relation)
Output: Q, a SQL query that computes aggconsistent
end if
if $F_i$ = min
    return bottomA_i, 0 as topA_i
end if
if $F_i$ = max
    return 0 as bottomA_i, topA_i
end if
Figure 6.9: Algorithm to obtain the bottom and top contributions of nonconsistent tuples
6.3.2 Examples
We next illustrate the rewriting for a query that uses the count aggregation function.
Example 6.6. Let R be a schema with relation employee(emplKey, salary, age). Consider
a SQL query $q_5$ that, for each age in the database, gives the number of occurrences
of the age on tuples for employees whose salary is less than or equal to 1000.
$q_5$: select age, count(*)
from employee
where salary <= 1000
group by age
In the aggregate conjunctive query notation introduced in Chapter 4, $q_5$ can be written
as follows.
$q_5$(a, cnt) = select a, count(*)
from employee(e, s, a) $\wedge$ s $\leq$ 1000
group by a
The above query is in the class $\mathcal{C}_{aggforest}$ for which we gave a query rewriting algorithm
in Chapter 4. A key idea of that algorithm is to first produce a first-order rewriting for
a conjunctive query, and then perform aggregation on the result of the first-order query.
For our example, this conjunctive query is $q'$ [...], $\Sigma$) (the algorithm
introduced in Chapter 3).
Let $Q_5$ be the query rewriting for $q_5$ obtained by invoking RewriteCount($q_5$, $\Sigma$) (the
algorithm of Figure 4.1 of Chapter 4). In that rewriting, the greatest lower bound is
obtained as follows:
QGlb(a, glb) = select a, count(*)
from QConsistent(e, a)
group by a
Notice that aggregation is performed on the result of the first-order query QConsistent(e, a).
Thus, for computing the greatest lower bound in the SQL rewriting, we can reuse the
algorithm RewriteForestSQL introduced in Section 6.2. In particular, we will use the following
subqueries, which are similar to those that would be produced by RewriteForestSQL($q'$, $\Sigma$)
(we will show the differences next).
with candidatesSubQuery as (
select emplKey
from employee
where salary <= 1000 )
with countViolSubQuery as (
select emplKey,age,
rank() over (partition by emplKey
order by age) as rankProjection,
sum(case when salary <= 1000 then 0 else 1 end)
over (partition by emplKey) as countViol,
case when salary <= 1000 then 'yes' else 'no' end
as satConds
from employee
where exists (select *
from candidatesSubQuery C where
C.emplKey=employee.emplKey) )
with contribAllSubQuery as (
select emplKey,age,
max(rankProjection) over (partition by emplKey)
as countProjection,
countViol
from countViolSubQuery
where satConds='yes'
group by emplKey,age,countViol,rankProjection )
The above subqueries differ from the ones that would be produced by RewriteForestSQL
in the following aspects. In countViolSubQuery, we compute an attribute
satConds that keeps track of whether each tuple satisfies or violates the selection
condition of $q_5$ (i.e., that the salary is less than or equal to 1000). This is different from
the attribute countViol because countViol counts the violations for all tuples where a
key value (employee name, in this case) appears, whereas satConds may take different
values on different tuples of the same employee, depending on the salary that appears in
the tuple. The third subquery corresponds to the subquery countProjSubQuery of the
algorithm RewriteForestSQL, but it has a different name here (contribAllSubQuery)
because, as we will show shortly, it is used to compute the contribution of each tuple
to the lower and upper bounds of count(*). In this subquery, we check the condition
satConds='yes'. The intuitive reason is that the tuples that do not satisfy the
conditions of $q_5$ (and hence satConds = 'no') contribute neither to the lower nor
to the upper bound of count(*), and should thus be filtered out.
Let us now consider the computation of the lowest upper bound. In the query $Q_5$
returned by RewriteCount, this bound is obtained as follows:
QLub(a, lub) = select a, count(*)
from $q'(e, a) \wedge (\exists e.\ \mathit{QConsistent}(e, a))$
[...]
The naive way of writing this expression in SQL may be
inefficient because QConsistent already contains $q'$ [...]
, but do satisfy q