Algorithms For Acyclic Database Schemes
Mihalis Yannakakis
ABSTRACT: Many real-world situations can be captured by a set of functional dependencies and a single join dependency of a particular form called acyclic [B..]. The join dependency corresponds to a natural decomposition into meaningful objects (an acyclic database scheme). Our purpose in this paper is to describe efficient algorithms in this setting for various problems, such as computing projections, minimizing joins, inferring dependencies, and testing for dependency satisfaction.
1. INTRODUCTION

An important part in the design of relational database schemes is the specification of constraints satisfied by the data, called dependencies. The first dependencies to be introduced were the functional dependencies [C]. Their properties are well understood and efficient algorithms have been developed for inferring new dependencies and designing database schemes [BE, Ber]. Multivalued dependencies [F, Z] were introduced to describe those cases where a relation can be decomposed into two of its projections, and join dependencies [R] for a decomposition into several projections without loss of information; i.e. the original relation can be reconstructed by joining the projections. There are also efficient algorithms for the inference of multivalued dependencies [Bee] and their use in the design process. However, in general they are harder to grasp and deal with than functional dependencies; e.g. the best known algorithm to infer a join dependency from multivalued dependencies takes exponential time and space [ABU]. Join dependencies are studied in [BV, MSY, Y].

Recently, [FMU] advanced the hypothesis that most real-world situations have a particularly simple structure: they can be captured by some functional dependencies and a single join dependency that describes a natural decomposition into meaningful objects. The join dependency has a particular form, called acyclic [B..]. An acyclic join dependency is equivalent to some set of multivalued dependencies. Such sets of multivalued dependencies are conflict-free, a notion introduced by [L] in the study of the relation between the network and the relational model. [S] argues also that most real-world sets of mvds fall into this category or can be put in such a form. The class of acyclic database schemes contains the class of loop-free Bachman diagram schemes of [L], the class of simply connected schemes of [Z], and bears a close resemblance to the class of tree queries of [BG].

In this paper we shall give efficient algorithms for several problems on acyclic database schemes, such as computing projections, testing satisfaction of the dependencies by a database, and inferring other dependencies. In Section 2 we review the basic terminology. Sections 3-6 deal with a database scheme and its associated join dependency (no functional dependencies). In Section 3 we assume a general database scheme Q. We examine the complexity of computing a projection of the join of the relations in a database over Q, of determining if the projection can be computed by joining only some of the relations, and of inferring dependencies from the join dependency associated with Q. In Section 4 we examine the same problems when Q is an acyclic database scheme. In Section 5 we define the association D[X] between the attributes of a set X represented by a database D and show how to compute it. In Section 6 we examine loop-free Bachman diagram schemes and give a characterization of them in terms of lossless joins. Section 7 assumes a set of functional dependencies and an acyclic join dependency; it examines the inference of other dependencies, and testing if a given database satisfies the dependencies.

CH1701-2/81/0000/008~.75 © 1981 IEEE
2. TERMINOLOGY

In this Section we will go briefly over the basic relational theory terminology. For more details the reader is referred to [U]. The universe is a finite set $U$ of attributes. A relation scheme $\mathbf{R}$ is a subset of $U$. A database scheme $Q$ (over $U$) is a set of relation schemes with union $U$. Every attribute $A$ has an associated set of values, its domain $Dom(A)$. If $X$ is a set of attributes, an $X$-tuple (or $X$-value) is a mapping $\mu$ on $X$ such that $\mu(A) \in Dom(A)$ for each $A$ in $X$. A relation $R$ over relation scheme $\mathbf{R}$ is a finite set of $\mathbf{R}$-tuples. A database $D$ over a database scheme $Q$ is a set of relations containing one relation over each relation scheme of $Q$.

The projection $t[Y]$ of an $X$-tuple $t$ onto a subset $Y$ of $X$ is the restriction of $t$ to $Y$. Let $R_1, \ldots, R_k$ be relations over schemes $\mathbf{R}_1, \ldots, \mathbf{R}_k$ respectively and $\mathbf{R} = \bigcup_i \mathbf{R}_i$. The join of $R_1, \ldots, R_k$, denoted $R_1 \bowtie \cdots \bowtie R_k$ (or $\bowtie_i R_i$), is the set of $\mathbf{R}$-tuples $t$ with $t[\mathbf{R}_i] \in R_i$ for $i = 1, \ldots, k$.

Using the projection operator $\pi_X$ ($X \subseteq U$) and the join operator we can build project-join relational expressions. An expression $\phi$ defines a mapping from relations over $U$ to relations over a certain set of attributes, the target relation scheme of $\phi$, $trs(\phi)$. An expression $\phi$ is contained in an expression $\psi$, or $\phi \subseteq \psi$, if $trs(\phi) = trs(\psi)$ and $\phi(R) \subseteq \psi(R)$ for every relation $R$ over $U$; $\phi$ and $\psi$ are equivalent, denoted $\phi \equiv \psi$, if $\phi \subseteq \psi$ and $\psi \subseteq \phi$.

A useful tool for comparing expressions is the tableau [ASU1]. Each attribute $A$ of $U$ has an associated symbol set $S(A) = \{a, a_1, a_2, \ldots\}$; $a$ is called a distinguished symbol and the $a_i$ nondistinguished symbols. A tableau $T$ is a relation over $U$ with the symbol sets as the domains of the attributes. The target relation scheme $trs(T)$ of $T$ is the set of attributes whose distinguished symbol appears in $T$; the summary of $T$ is the $trs(T)$-tuple of distinguished symbols. A tableau $T$ defines a mapping $f_T$ from relations over $U$ (or universal relations) to relations over $trs(T)$ as follows. A valuation $\rho$ is a mapping that maps, for each attribute $A$, $S(A)$ into $Dom(A)$. A valuation is extended to tuples componentwise and to relations elementwise. With $s$ the summary of $T$, $f_T(R) = \{\rho(s) \mid \rho \text{ is a valuation with } \rho(T) \subseteq R\}$. A tableau $T_1$ is contained in another tableau $T_2$ (denoted $T_1 \subseteq_T T_2$) if they both have the same target relation scheme and $f_{T_1}(R) \subseteq f_{T_2}(R)$ for every universal relation $R$; $T_1$ and $T_2$ are equivalent if each is contained in the other. If $T_1 \subseteq T_2$ (where $\subseteq$ is set-inclusion) and $trs(T_1) = trs(T_2)$ then $T_2 \subseteq_T T_1$.

We now list some of the basic results of the theory of tableaux [ASU1, ASU2].

(1) Every expression $\phi$ has a tableau $T$ with $f_T = \phi$, constructed recursively from $\phi$ as follows. If $\phi = \pi_X \sigma$ and $S$ is the tableau for $\sigma$, $T$ is obtained from $S$ by changing each distinguished symbol $a$ with $A \notin X$ into a new nondistinguished symbol. If $\phi = \sigma_1 \bowtie \sigma_2 \bowtie \cdots \bowtie \sigma_k$ and $T_1, T_2, \ldots, T_k$ are the tableaux for $\sigma_1, \ldots, \sigma_k$, then $T$ is the union of the $T_i$'s.

(2) Let $T_1$, $T_2$ be two tableaux with the same target relation scheme. A homomorphism $h$ from $T_1$ to $T_2$ is a mapping from $S(A)$ to $S(A)$ for each $A$, such that $h(a) = a$ for each distinguished symbol $a$, and the mapping from the tuples of $T_1$ to the tuples of $T_2$ induced by $h$ is a containment mapping. Such a homomorphism exists if and only if $T_2 \subseteq_T T_1$.

(3) Each tableau $T$ has a minimal subset $T'$ equivalent to $T$; $T'$ is unique up to renaming of nondistinguished symbols and is called the minimal equivalent tableau of $T$. If $T$ is the tableau of an expression $\phi$, then $T'$ is the tableau of an expression $\psi$ equivalent to $\phi$ that contains the minimum number of (binary) joins.

A functional dependency (or fd) is a statement of the form $X \to Y$ where $X, Y \subseteq U$. It is satisfied by a universal relation $R$ if for all tuples $t_1, t_2$ of $R$ with $t_1[X] = t_2[X]$, also $t_1[Y] = t_2[Y]$ holds. A join dependency (or jd) is a statement of the form $*Q$ where $Q$ is a database scheme over $U$; it is satisfied by a universal relation $R$ if $\bowtie\{\pi_{\mathbf{R}_i}(R) \mid \mathbf{R}_i \in Q\} = R$. A multivalued dependency (or mvd) $X \twoheadrightarrow Y$ is the join dependency $*\{XY, X(U - Y)\}$.
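The projection and join operators defined above can be made concrete with a small sketch. It assumes one particular encoding that is not part of the paper: a tuple is a frozenset of (attribute, value) pairs and a relation is a Python set of such tuples. The function names `relation`, `project` and `join`, and the film data, are illustrative only.

```python
from itertools import product

def relation(attrs, rows):
    """Build a relation: a set of tuples, each a frozenset of (attribute, value) pairs."""
    return {frozenset(zip(attrs, row)) for row in rows}

def project(rel, attrs):
    """pi_X(R): restrict every tuple of rel to the attributes in attrs."""
    attrs = set(attrs)
    return {frozenset((a, v) for a, v in t if a in attrs) for t in rel}

def join(r, s):
    """Natural join: combine pairs of tuples that agree on their common attributes."""
    out = set()
    for t, u in product(r, s):
        td = dict(t)
        # A pair is compatible if every attribute of u is absent from t or has the same value.
        if all(td.get(a, v) == v for a, v in u):
            out.add(t | u)
    return out

# Hypothetical film data: film-director (FD) and film-producer (FP).
FD = relation("FD", [("f1", "d1"), ("f2", "d2")])
FP = relation("FP", [("f1", "p1"), ("f2", "p2"), ("f3", "p3")])
R = join(FD, FP)          # universal relation over F, D, P
DP = project(R, "DP")     # directors associated with producers
```

Since R is itself the join of its projections on FD and FP, joining those two projections again reproduces R, which is what it means for R to satisfy the join dependency over {FD, FP}.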
An
The procedure
(i.e. are
embedded join dependency (ejd) is like a join dependency the union attributes; satisfied We will of the relation
U S(A)
In this case, if of
say that
an element S(A)
with an element
and is denoted
If two
(idemit fol-
constants
potency).
the idempotency
arises. either
the order in which the rules are will arise or else the
lows that the join of the relations the jd *Q. A remplate dependency dependency; it is a statement
a contradiction
final relation (or rd) [sU] is a very general kind of With hypergraph.
chase,(R) a database
A hypergraph
It is satisfied
an arbitrary etc.
of nodes. over
another
like
path,
to hypergraphs with
P I= u) if o holds in every
relation
satisfying MMS]).
A procedure
straightforward universe
hypergraph
associated
Q has the
The tableau
T, of a
schemes of Q as its
The tableau T,, of a functional the one has distinguished in the rest, and of X and The a tuple
set of edges.
dependency
for the rest of this paper that Q is connected; then the attributes of each other, considering in the different components
of XI and nondistinguished
the other tuple has distinguished new nondistinguished chase procedure with projection symbols
in a straightforward
modifies
the tableau
*Q is defined associated
sr in it if o is a template
dependency,
or to eliminate
hypergraph.
the definition
the nondistinguished
characterizations.
and is implied
by (or is equivalent
dependencies. FD-rule.
A database
scheme Q is acyclic
T, are as follows.
are going to use the following schemes Q[B..]. (1) Q can be reduced attribute relation
characterizations
database
there are two tuples rt, rr that agree on X but disagreeon bute B of other,
r,
replace
of one of r,(B),
I@]
to the empty
deleting
an a
keeping
a distinguished
or the nondistinguished
and deleting
JD-rule.
(2) There
the relation
the subgraph
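Characterization (1) of acyclic schemes, reduction to the empty scheme, lends itself to a short sketch. The version below assumes the two usual reduction rules (delete an attribute that occurs in a single relation scheme; delete a relation scheme that is contained in another one). The function name `is_acyclic` and the sample schemes are illustrative and not taken from the paper.

```python
def is_acyclic(schemes):
    """Return True if the schemes reduce to nothing under the two rules below."""
    edges = [set(s) for s in schemes]
    changed = True
    while changed:
        changed = False
        # Rule (a): delete an attribute that occurs in exactly one scheme.
        for e in edges:
            for a in list(e):
                if sum(a in f for f in edges) == 1:
                    e.discard(a)
                    changed = True
        # Rule (b): delete a scheme that is empty or contained in another scheme.
        for i, e in enumerate(edges):
            if not e or any(i != j and e <= f for j, f in enumerate(edges)):
                del edges[i]
                changed = True
                break
    return not edges

film = [set("FD"), set("FA"), set("FP")]        # three schemes sharing attribute F
triangle = [set("AB"), set("BC"), set("AC")]    # the standard cyclic example
```

On the film scheme the reduction succeeds, while the triangle is stuck immediately: every attribute occurs twice and no scheme contains another.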
[...] represents $Q$.

Acyclic schemes $Q$ have also the following properties.

(3) We say that a database $D = \{R_1, \ldots, R_k\}$ over $Q$ is (fully) reduced if there is a universal relation $R$ such that $R_i = \pi_{\mathbf{R}_i}(R)$ for each $\mathbf{R}_i \in Q$. If $D$ is a database over an acyclic scheme $Q$, then $D$ is reduced iff $R_i = \pi_{\mathbf{R}_i}(R_i \bowtie R_j)$ for each pair of relations $R_i, R_j$ in $D$. There is an efficient algorithm using a particular kind of joins (called semijoins) which computes the reduction of $D$: $D' = \{R'_1, \ldots, R'_k\}$ where $R'_i = \pi_{\mathbf{R}_i}(\bowtie_j R_j)$ [BG].

(4) If $Z$ is a set of attributes, let $Q(Z)$ be the family of nonempty sets $\mathbf{R}_i \cap Z$ where $\mathbf{R}_i \in Q$. $Q(Z)$ is an acyclic scheme over the universe $Z$, called the scheme generated by $Z$.

Given a database $D$, it is easy to test if a given tuple is in the join of the relations in $D$. If however the tuple is only partially specified (in some set of attributes $X$) then the problem is much harder:

Theorem 3.1 It is NP-complete to test if an $X$-tuple $t$ is in $\pi_X(\bowtie_i R_i)$, where $D = \{R_1, \ldots, R_k\}$ is a database over a general database scheme, even if $D$ is the projection of a universal instance.

Proof Membership in NP is obvious. For the NP-hardness part we use a result of [SY], that it is NP-complete to test if an expression $\phi_1 = \pi_X(\bowtie_i \pi_{Y_i})$ is contained in another expression $\phi_2 = \pi_X(\bowtie_j \pi_{Z_j})$. Let $T_1$ be the tableau of $\phi_1$ and $s_1$ its summary. From the theory of tableaux, $\phi_1 \subseteq \phi_2$ if and only if $s_1 \in \phi_2(T_1)$. Thus, we can decide if $\phi_1 \subseteq \phi_2$ by testing whether the $X$-tuple $s_1$ is in the projection on $X$ of the join of the projections $\pi_{Z_j}(T_1)$ of the universal instance $T_1$. □

As a consequence of Theorem 3.1, it is NP-complete to test if an arbitrary join dependency implies a template dependency of the form $\pi_X m_S = \pi_X$. (Membership in NP follows from the idempotency of the project-join mapping.) There are however some cases of particular interest which can be efficiently decided even for general join dependencies. Let $J = *\{X_1, \ldots, X_n\}$ be a join dependency.

(1) Inferring lossless joins [BeV]. Let $S = \{Y_j\}$ be a family of sets of attributes and $Y = \bigcup_j Y_j$. We can decide if $J$ implies that $S$ has a lossless join (i.e. if $\bowtie_j \pi_{Y_j}(R) = \pi_Y(R)$ for every universal relation $R$ satisfying $J$) as follows. Form a bipartite graph $G$ with nodes the $X_j$'s and the attributes in $U - Y$, and an edge between an attribute and a set if the attribute is in the set. Let $K_1, \ldots, K_l$ be the connected components of $G$. Let $Z_i$ be all the attributes of $Y$ that belong to a set $X_j$ in $K_i$, for $i = 1, \ldots, l$. [BeV] shows that $\bowtie_j \pi_{Y_j}$ is lossless iff each $Z_i$ is contained in some $Y_j$. □

(2) Let $S \subseteq \{X_1, \ldots, X_n\}$ and $X \subseteq \bigcup_{X_i \in S} X_i$. Consider the equality

$\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) = \pi_X$   (*).

We claim that $J$ implies (*) if and only if $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) \equiv \pi_X(\bowtie_{i=1}^{n} \pi_{X_i})$ is a tautology (i.e. holds for all universal relations). First note that for all universal relations $R$, $\pi_X[m_J(R)] \subseteq \pi_X[m_S(R)]$. Suppose that there is a relation $R$ with $\pi_X[m_J(R)] \subsetneq \pi_X[m_S(R)]$. Then $m_J(R)$ satisfies the join dependency $J$ but violates (*), since $\pi_{X_i}(R) = \pi_{X_i}[m_J(R)]$ for each $X_i$ (from the idempotency of $m_J$). Conversely, if $\pi_X[m_J(R)] = \pi_X[m_S(R)]$ for all $R$, then every $R$ satisfying $J$ satisfies (*), since $m_J(R) = R$. Thus, it suffices to test if $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) \equiv \pi_X(\bowtie_{i=1}^{n} \pi_{X_i})$. Both expressions are simple [ASU1], and therefore their equivalence can be tested in polynomial time using tableaux techniques [ASU1]. □

The significance of (2) is the following: Suppose that a universal instance $R$ (satisfying $J$) is decomposed into its projections on the $X_i$'s. Then we can test efficiently if a projection of it (on some set $X$) can be recovered by joining only some of the relations in the database. Moreover, we can find efficiently the minimum number of such relations whose join gives the projection of $R$ on $X$: we just have to minimize the (simple) tableau of $\pi_X m_J$ [ASU1, ASU2]. From the theory of tableaux, all minimal subsets $S$ of $J$ with $\pi_X m_S = \pi_X$ have the same minimum cardinality, and the expressions $\pi_X m_S$ have identical tableaux $T$ up to renaming of nondistinguished symbols. Therefore, for every minimal $S = \{X_{i_1}, \ldots, X_{i_m}\}$, $\pi_X m_S(R) = \pi_X[\bowtie_j \pi_{Y_j}(R)]$, where $Y_j$ is the set of attributes in which the $j$-th row of the minimal tableau $T$ has a distinguished or repeated nondistinguished symbol ($Y_j \subseteq X_{i_j}$). In other words, every minimal $S$ involves the join of exactly the same relations ($\pi_{Y_j}(R)$) - the best choice of $S$ depends on how fast $\pi_{Y_j}(R)$ can be obtained from $R_{i_j} = \pi_{X_{i_j}}(R)$ (organization of the various relations, supporting data structures, etc.).

Note that we might have $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) = \pi_X$ even though the join is lossy, i.e. $\bowtie_{X_i \in S} \pi_{X_i} \neq \pi_Y$ where $Y = \bigcup_{X_i \in S} X_i$. However, any nonredundant join for computing the projection on $X$ is lossless:

Theorem 3.2 Let $*J$ be a jd and suppose that $S$ is a minimal subset of $J$ such that $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) = \pi_X$ is implied by $*J$. Then $\bowtie_{X_i \in S} \pi_{X_i}$ is a lossless join. □

Regarding the actual computation of $\pi_X(\bowtie_i R_i)$, the size of the output relation can be exponential in the size of the input database. Thus, the best we can hope for is an algorithm polynomial in the size of the input and the output. However, we know that it is NP-complete to determine if $\bowtie_i R_i$ is empty, or if there is a (universal) relation $R$ with $R_i = \pi_{\mathbf{R}_i}(R)$ [WY]. Therefore (unless P = NP) there is no efficient (polynomial in the sizes of the input and the output) algorithm to carry out either the reduction of the database or the join of a reduced database. Also, it is NP-complete to test if a universal relation satisfies a join dependency [WY].

4. ACYCLIC DATABASE SCHEMES

Let $D = \{R_1, \ldots, R_k\}$ be a database over the acyclic scheme $Q = \{\mathbf{R}_1, \ldots, \mathbf{R}_k\}$. Let $X$ be a set of attributes. We shall show how to compute $\pi_X(\bowtie_i R_i)$ in time polynomial in the size of the input and the output. The algorithm is based on the relation between acyclic database schemes and tree queries. Let $T$ be a tree representing the scheme $Q$ as explained in Section 2. At first we compute a full reduction $D'$ of $D$ as in [BG] using semijoins. Then we root the tree at an arbitrary node, say $\mathbf{R}_1$, and prune it leaf-to-root. Let $T_i$ be the subtree of $T$ rooted at node $\mathbf{R}_i$, $Z_i$ the set of attributes labelling the edge from $\mathbf{R}_i$ to its father $\mathbf{R}_j$ ($Z_i = \mathbf{R}_i \cap \mathbf{R}_j$), and $X_i$ the set of attributes of $X$ that are contained in some node of $T_i$. When the turn of $\mathbf{R}_i$ comes to be deleted, $R_i$ has been replaced by a relation $R'_i$ over the set of attributes $X_i \cup \mathbf{R}_i$. The deletion of node $\mathbf{R}_i$ is carried out by deleting the relation $R'_i$ and replacing the current relation $R'_j$ of its father $\mathbf{R}_j$ by $R'_j \bowtie \pi_{X_i Z_i}(R'_i)$. It is easy to show by an induction that when node $\mathbf{R}_i$ is deleted, the current relation $R'_i$ is equal to $\pi_{X_i \mathbf{R}_i}(\bowtie_{\mathbf{R}_l \in T_i} R_l)$. Thus, when we are left with the root, the relation that is stored there is $R'_1 = \pi_{X \mathbf{R}_1}(\bowtie_i R_i)$, and $\pi_X(R'_1) = \pi_X(\bowtie_i R_i)$.

We had to reduce first the database so that intermediate results will not get exponential in the input and the output:

Lemma 4.1 Throughout the computation, the size of the current version $R'_j$ of $R_j$ is bounded by $|R_j| \cdot |\pi_X(\bowtie_i R_i)|$. □

Theorem 4.1 If $\{R_1, \ldots, R_k\}$ is a database over an acyclic scheme, then $\pi_X(\bowtie_i R_i)$ can be computed in time polynomial in the input and the output. □

Simple variations of the algorithm can be used to

(1) Test if a universal relation $R$ satisfies an acyclic join dependency $*J$. Apply the algorithm to $D = \{\pi_{X_i}(R) \mid X_i \in J\}$ with $X = U$ while making sure that the various relations along the way don't become different from the corresponding projections of $R$. Another way for doing this is to take a set $M$ of mvds equivalent to $*J$ with $|M| \le |J|$ (there is always such a set [B..]) and check if $R$ satisfies $M$.

(2) Include selections. Let $Y$ be a set of attributes and $Sl(A)$ a subset of $Dom(A)$ for each $A \in Y$. To restrict the computation to tuples which have $A$-value in $Sl(A)$ for each $A \in Y$, we first select those tuples from each relation $R_i$ that can contribute to the result. That is, we remove from $R_i$ those tuples $t$ with $t[A] \notin Sl(A)$ for some $A$ in $Y \cap \mathbf{R}_i$. Then we can apply the algorithm to the remaining relations.
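The two phases of the Section 4 algorithm (semijoin reduction, then leaf-to-root pruning with joins and projections) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: tuples are encoded as frozensets of (attribute, value) pairs, the representing tree is supplied by the caller as a hypothetical `parent` map plus a leaves-first `order`, and all function names and the film data are invented for the example.

```python
def relation(attrs, rows):
    """A relation as a set of tuples; each tuple is a frozenset of (attribute, value) pairs."""
    return {frozenset(zip(attrs, row)) for row in rows}

def project(rel, attrs):
    return {frozenset(p for p in t if p[0] in attrs) for t in rel}

def join(r, s):
    out = set()
    for t in r:
        td = dict(t)
        for u in s:
            if all(td.get(a, v) == v for a, v in u):
                out.add(t | u)
    return out

def semijoin(r, s, shared):
    """Tuples of r that agree with at least one tuple of s on the shared attributes."""
    keys = project(s, shared)
    return {t for t in r if frozenset(p for p in t if p[0] in shared) in keys}

def project_join(schemes, rels, parent, order, X):
    """Compute pi_X of the join of all relations.

    schemes: name -> set of attributes; rels: name -> relation;
    parent:  name -> father in the representing tree (None for the root);
    order:   node names with every node listed before its father (root last).
    """
    X = set(X)
    rels = dict(rels)
    # Phase 1: full reduction with semijoins, leaf-to-root and then root-to-leaf.
    for n in order:
        if parent[n] is not None:
            p = parent[n]
            rels[p] = semijoin(rels[p], rels[n], schemes[p] & schemes[n])
    for n in reversed(order):
        if parent[n] is not None:
            p = parent[n]
            rels[n] = semijoin(rels[n], rels[p], schemes[p] & schemes[n])
    # Phase 2: prune leaf-to-root; covered[n] plays the role of X_i in the text.
    covered = {n: set(schemes[n]) & X for n in order}
    for n in order:
        p = parent[n]
        if p is None:
            continue
        z = schemes[n] & schemes[p]                  # Z_i, the label of the edge to the father
        rels[p] = join(rels[p], project(rels[n], z | covered[n]))
        covered[p] |= covered[n]
    return project(rels[order[-1]], X)

# Hypothetical film database arranged as a tree with FA in the middle.
schemes = {"FD": set("FD"), "FA": set("FA"), "FP": set("FP")}
rels = {
    "FD": relation("FD", [("f1", "d1"), ("f2", "d2")]),
    "FA": relation("FA", [("f1", "a1"), ("f1", "a2"), ("f2", "a3"), ("f4", "a4")]),
    "FP": relation("FP", [("f1", "p1"), ("f2", "p2"), ("f3", "p3")]),
}
parent = {"FD": "FA", "FP": "FA", "FA": None}
order = ["FD", "FP", "FA"]
result = project_join(schemes, rels, parent, order, "DP")
```

The reduction phase is not needed for correctness, only to keep intermediate relations within the bound of Lemma 4.1; in the example it discards the dangling films f3 and f4 before any join is taken.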
The following now is immediate.

Corollary 4.1 (1) Given a database $\{R_1, \ldots, R_k\}$ and an $X$-tuple $t$, we can decide in polynomial time if $t \in \pi_X(\bowtie_i R_i)$. (2) We can decide in polynomial time if an acyclic join dependency implies a template dependency. □

Returning to the computation of $\pi_X(\bowtie_i R_i)$, we note that we do not have to carry out the second phase using the whole tree. Since the join of the relations in the database satisfies the jd $*Q$ we can use dependencies that are implied by it.

Lemma 4.2 The join of a subset of the relations, whose schemes form a subtree of $T$, is lossless. □

Thus, it suffices to join a set of relations whose schemes form a subtree $T'$ of $T$ and have a union containing $X$. We will say that $T'$ covers $X$. If $X$ is contained in some relation scheme $\mathbf{R}_i$, then we can obtain $\pi_X(\bowtie_i R_i)$ by projecting the corresponding relation $R_i$ onto $X$.

Lemma 4.3 If $X$ is not contained in two or more relation schemes, then there is a unique minimal subtree $T_X$ of $T$ covering $X$.

Proof (sketch) The subtree $T_X$ is defined as follows. Let $\mathbf{R}_i$ be a node of $T$ and $T_1, T_2, \ldots, T_l$ the subtrees hanging from it (see Figure 1). If one of the $T_j$'s covers $X$ then $\mathbf{R}_i \notin T_X$. Otherwise we include $\mathbf{R}_i$ into $T_X$. □

Even though $T_X$ is a minimal subtree covering $X$, it might still contain redundant relations. For example, suppose that we have a film database with relations FD (film-director), FA (film-actor), FP (film-producer), arranged as in the tree of Figure 2.

[Figure 2]

The minimal subtree that relates directors to producers is the whole tree. However, we could change the arrangement so that FD and FP become adjacent. This is true in general:

Theorem 4.2 Let $S$ be a subset of schemes from $Q$ that join losslessly. There is a tree $T$ representing $Q$ such that the schemes in $S$ form a subtree of $T$.

Proof (sketch) Let $S = \{\mathbf{R}_{i_1}, \ldots, \mathbf{R}_{i_m}\}$ and $Y = \bigcup_j \mathbf{R}_{i_j}$. From our discussion in Section 3 (see Property (1) of general join dependencies) the schemes $S$ join losslessly if and only if the schemes in $Q - S$ can be partitioned into sets $K_1, \ldots, K_m$ such that (a) if $\mathbf{R}_a \in K_j$ then $\mathbf{R}_a \cap Y \subseteq \mathbf{R}_{i_j}$, and (b) if $\mathbf{R}_a$ and $\mathbf{R}_b$ belong to different $K_j$'s then $\mathbf{R}_a \cap \mathbf{R}_b \cap (U - Y) = \emptyset$. The schemes $K_j \cup \{\mathbf{R}_{i_j}\}$ are then acyclic schemes. We can construct thus trees $T_0, T_1, \ldots, T_m$ representing respectively $S$, $K_1 \cup \{\mathbf{R}_{i_1}\}$, ..., $K_m \cup \{\mathbf{R}_{i_m}\}$. We attach the trees
&I,
. . . . Bi, with Yj c hj (for j = 1, . . . . m). Parts (1) and (2) hold even if
lQ
is an arbitrary compute
4.2 All 0
can be
the correct
efficiently.
Note that a lossless join might give a different superset) database than the projection is not the projection
result (a proper if the there if 5. ASSOCIATIONS WHEN THE DATABASE INSTANCE of Figure producers 2, and suppose we and directors. The (p,d) IS NOT THE
of the join of all the relations, of a universal this particular engineer) instance. And
PROJECTION Consider
might be good reason for wanting we have also a relation base, the way to compute FS. By contrast
FS (film-sound directors
data-
between
of talkies
is by joining
FD with
to the Corollary,
lossy joins can be hard to comto an acyclic one thus, the NP-
by) pro-
pute: Every n by
However,
adding
of documentaries is inapplicable
in computing
the join
Actor
unknown,
unspecified
these cases. 3.2 we can conclude the projection of R. We Let D = {RI, set of X-tuples ., R,} be a database. We denote by D[X] of each universal Q with the
Combining
join computing
relation for
satisfying
R, G T&, (R)
use this fact now to devise an easy method minimal S = {Y,, *R, nxms and join for X. Tbat is, we
i = 1,
, Y,,,} of sets such that nxm,. every subset S of n, for Sets &,,
tuple [Ml).
= mo R(D).
A tuple of X.
= nr.
contains
m distinct
with
Y, C E,, (for
D[X] is
j = 1, . . . . m). We compute Q be a family S using the following schemes, initially elimination algorithm. Let
in the previous
of relation
set equal to n.
1) Elim-
D[X] or determine
are hard even if that if D is the
D is the projection
projection now that
of a universal instance
steps 1)
of a universal
Suppose
(2) after
Q is an acyclic
scheme.
Consider
Remark
nrmr.
= TX.
(2) If S is
Theorem
schemes of
any subset of Q that has for each j = 1, .__, m a scheme &, containing Y, then the j.d. *Q implies xxms n with nxms = xx implied by *Q, = nx. (3) If S is any subset of distinct sets
Q in place of the database D there, the set X in place of Y and with S/(A) the set of nonnull A-values for each A in X. From Remark
then S contains
(2) we have :
an $X$-tuple is in $D[X]$ and compute $D[X]$ (or a projection of it) in time polynomial in the size of the input and the output. □

Returning to the representation of $Q$ by a tree $T$, we noted before that navigating through $T$ (joining the relations on the way) produces valid associations (the joins are lossless). In other words, if the database is the projection of a universal relation $R$ (satisfying the join dependency $*Q$), then the computed relation is the projection of $R$ on the corresponding set $Y$ of attributes - and is therefore independent of the choice of the tree $T$ that represents $Q$. If $D$ is not the projection of a universal instance, then again only valid associations will be produced (i.e. every produced $Y$-tuple is in $D[Y]$); however, some valid associations might be lost, and the result might be sensitive to the tree $T$ chosen. We shall show that $D[X]$ is the set of associations derived by considering all trees representing $Q$.

Lemma 5.1 Let $Q$ be a general database scheme, $D = \{R_1, \ldots, R_k\}$ a database over it. Let $t$ be a tuple of $R^*(D)$ defined on the set of attributes $Y$. Then there is a subset $\mathcal{S}$ of $Q$ of relation schemes with a lossless join with $t[\mathbf{R}_i] \in R_i$ for each $\mathbf{R}_i \in \mathcal{S}$ and $Y = \bigcup\{\mathbf{R}_i \mid \mathbf{R}_i \in \mathcal{S}\}$. □

Theorem 5.2 Let $Q$ be an acyclic database scheme, $D = \{R_1, \ldots, R_k\}$ a database over it, and $X$ a set of attributes.

(1) If $X$ is contained in a relation scheme, then $D[X] = \bigcup\{\pi_X R_i \mid X \subseteq \mathbf{R}_i\}$.

(2) If $X$ is not contained in any relation scheme, then $D[X] = \bigcup\{\pi_X(\bowtie_{\mathbf{R}_i \in T_X} R_i) \mid T \text{ a tree representing } Q\}$, where $T_X$ is the minimal subtree of $T$ covering $X$ as in Lemma 4.3.

Proof (1) From the construction of $R(D)$, $\bigcup\{\pi_X R_i \mid X \subseteq \mathbf{R}_i\} \subseteq \pi_X R(D) \subseteq \pi_X R^*(D)$. For the opposite inclusion, it is easy to see that any lossless join of a family $\mathcal{S}$ of schemes covering $X$ must contain at least one scheme containing $X$. The conclusion then follows from Lemma 5.1.

(2) Follows from Lemma 5.1 and Theorem 4.2. □

In the next Section we shall see that there are some schemes which have a representing tree such that the union can be dropped from Theorem 5.2, and thus $D[X]$ can be computed by just joining some relations from $D$. Even for a general $Q$ however, we can compute $D[X]$ without introducing null values - and clearly faster than the general method described at the beginning of the Section (as shown also in [Sa]). Let $S = \{Y_1, \ldots, Y_m\}$ be the collection of sets in which the rows of the tableau of $\pi_X m_Q$ have distinguished or repeated nondistinguished symbols. Let $R'_j = \bigcup_{Y_j \subseteq \mathbf{R}_i} \pi_{Y_j}(R_i)$.

Theorem 5.3 [...] $D = \{R_1, \ldots, R_k\}$ a database over it. [...]

Proof [...] Therefore, the set of $X$-total tuples of the left-hand side ($= \pi_X(\bowtie_j R'_j)$) is contained in the set of $X$-total tuples of the right-hand side ($= D[X]$).

(2) $D[X] \subseteq \pi_X(\bowtie_j R'_j)$. Let $t$ be an $X$-total tuple of $R^*(D)$ defined on $Z \supseteq X$. From Lemma 5.1 there is a subfamily $\mathcal{S}$ of $Q$ of relation schemes with $t[\mathbf{R}_i] \in R_i$ for each $\mathbf{R}_i \in \mathcal{S}$, $Z = \bigcup\{\mathbf{R}_i \mid \mathbf{R}_i \in \mathcal{S}\}$ and $m_{\mathcal{S}} = \pi_Z m_Q$. Thus, $\pi_X m_{\mathcal{S}} = \pi_X m_Q$ and therefore $\mathcal{S}$ contains distinct sets $\mathbf{R}_{i_1}, \ldots, \mathbf{R}_{i_m}$ with $Y_j \subseteq \mathbf{R}_{i_j}$ for $j = 1, \ldots, m$. Since $t[\mathbf{R}_{i_j}] \in R_{i_j}$, $t[Y_j] \in R'_j$ for each $j$. Hence $t[X] \in \pi_X(\bowtie_j R'_j)$.
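Part (1) of Theorem 5.2 is simple enough to illustrate directly: when $X$ fits inside some relation scheme, $D[X]$ is the union of the projections on $X$ of every relation whose scheme contains $X$. The sketch below assumes the tuple encoding used for illustration here (frozensets of (attribute, value) pairs); the function name `association_within_scheme` and the FD/FS data are hypothetical.

```python
def relation(attrs, rows):
    """A relation as a set of tuples; each tuple is a frozenset of (attribute, value) pairs."""
    return {frozenset(zip(attrs, row)) for row in rows}

def association_within_scheme(schemes, rels, X):
    """D[X] for X contained in at least one relation scheme (Theorem 5.2, part (1))."""
    X = frozenset(X)
    holders = [name for name, attrs in schemes.items() if X <= attrs]
    if not holders:
        raise ValueError("X is not contained in any relation scheme")
    out = set()
    for name in holders:
        # Union of the projections on X of all relations whose scheme contains X.
        out |= {frozenset(p for p in t if p[0] in X) for t in rels[name]}
    return out

# Hypothetical database that is not the projection of a universal instance:
# film f1 has no sound engineer recorded, film f3 has no director recorded.
schemes = {"FD": set("FD"), "FS": set("FS")}
rels = {
    "FD": relation("FD", [("f1", "d1"), ("f2", "d2")]),
    "FS": relation("FS", [("f2", "s2"), ("f3", "s3")]),
}
films = association_within_scheme(schemes, rels, "F")
```

Here the films known to the database are collected from both relations, whereas projecting the join FD ⋈ FS on F would keep only f2.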
6. LOOP-FREE BACHMAN DIAGRAMS

A loop-free Bachman diagram (lfBd) [Ba, Li] is a tree $T$ with directions on its edges, whose nodes are distinct sets of attributes, satisfying the following conditions. (1) Every attribute is a node, (2) If $X \to Y$, then $X \subset Y$, and (3) If $X \subseteq Y$ are nodes, then $X \to^* Y$ ($\to^*$ is the transitive closure of $\to$). Let $Q$ be the database scheme containing the nodes of the diagram; we will call $Q$ a lfBd scheme. Clearly, the diagram $T$ is a tree representation of $Q$; thus $Q$ is an acyclic scheme. From the definition of a loop-free Bachman diagram, it follows easily that (a) for every set $X$, the set of nodes that contain $X$ form a rooted tree, (b) if $X$ and $Y$ are two nodes with a nonempty intersection, then $X \cap Y$ is also a node: their lowest common ancestor.

Loop-free Bachman diagram schemes have the following nice property.

(CJ): The jd $*Q$ implies that every subset of $Q$ with a connected hypergraph has a lossless join.

Theorem 6.1 Every lfBd scheme has the (CJ) property.

Proof Let $Q$ be the database scheme of a loop-free Bachman diagram $T$, and let $S$ be a subset of $Q$ with a connected hypergraph. Let $\mathbf{R}_i$, $\mathbf{R}_j$ be two relation schemes in $S$ with a nonempty intersection. From property (b) above there is a path from $\mathbf{R}_i$ to $\mathbf{R}_j$ in $T$ that goes from $\mathbf{R}_i$ up to $\mathbf{R}_i \cap \mathbf{R}_j$ meeting nodes that are subsets of $\mathbf{R}_i$ and then down to $\mathbf{R}_j$ through nodes that are subsets of $\mathbf{R}_j$ (see Figure 3). Let $p$ be this path. We have $\pi_{\mathbf{R}_i} \bowtie \pi_{\mathbf{R}_j} = \bowtie\{\pi_{\mathbf{R}_l} \mid \mathbf{R}_l \in p\}$. Let $S'$ be the set of nodes that lie in such a path $p$ connecting two sets in $S$ with a nonempty intersection. We have $m_S = m_{S'}$. Since $S$ has a connected hypergraph, $S'$ induces a subtree of $T$. The conclusion now follows from Lemma 4.2. □

[Figure 3]

Corollary 6.1 If $D$ is a database over a lfBd scheme $Q$, then all joins (and their projections) can be computed efficiently.

Proof Let $S$ be a subset of $Q$. The join of the relations in each connected component of (the hypergraph of) $S$ is lossless and thus can be computed efficiently. The join of the relations in $S$ is just the Cartesian product of these joins. □

Note that an arbitrary acyclic scheme may have a connected subset with a lossy join. In fact, lfBd schemes are the only ones with the (CJ) property:

Theorem 6.2 Let $Q$ be a database scheme with the (CJ) property. Then $Q$ is contained in a lfBd scheme $Q'$ with the same maximal sets (and thus with an equivalent jd).

The proof of Theorem 6.2 is based on the following two Lemmas whose proof we omit.

Lemma 6.1 If $Q$ satisfies (CJ), then the closure under intersection of $Q$ has also the (CJ) property. □

Let $Q'$ now be the union of the closure under intersection of $Q$ and the set of all singletons. Let us define a directed graph $G$ on the elements of $Q'$ by having an arc $X \to Y$ if $X, Y \in Q'$, $X \subset Y$, and there is no $Z$ in $Q'$ with $X \subset Z \subset Y$ (i.e. $\to$ is the transitive reduction of $\subset$).

Lemma 6.2 If $Q$ has the (CJ) property then $G$ contains no (undirected) cycles. □

The closure under intersection of an arbitrary family of sets can have size exponential in the size of the original family. (Just consider the family of subsets of cardinality $n - 1$ of a set of size $n$.) This is not true for families with the (CJ) property:

Lemma 6.3 If $Q$ is a database scheme with the (CJ) property and $Q'$ is as above, then $|Q'| \le |Q| + 2|U|$. □
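The construction of $Q'$ and of the graph $G$ can be tried out on small examples. The sketch below is illustrative only: the function names are invented, sets of attributes are Python frozensets, and the closure keeps only nonempty intersections, which is an assumption made here to match the convention used for generated schemes.

```python
from itertools import combinations

def intersection_closure(family):
    """Smallest family containing `family` and closed under nonempty pairwise intersection."""
    closed = {frozenset(s) for s in family}
    frontier = set(closed)
    while frontier:
        new = set()
        for a in frontier:
            for b in closed:
                c = a & b
                if c and c not in closed:
                    new.add(c)
        closed |= new
        frontier = new
    return closed

def covering_arcs(family):
    """Arcs X -> Y with X a proper subset of Y and no member of the family strictly between."""
    fam = list(family)
    return {(x, y) for x in fam for y in fam
            if x < y and not any(x < z < y for z in fam)}

# The film scheme: Q' adds the intersection F and the singletons D, A, P.
Q = [set("FD"), set("FA"), set("FP")]
U = set().union(*Q)
Q_prime = intersection_closure(Q) | {frozenset(a) for a in U}
arcs = covering_arcs(Q_prime)

# The family of 3-element subsets of a 4-element set: the closure contains
# every nonempty proper subset, illustrating the blow-up for arbitrary families.
big = intersection_closure(combinations("abcd", 3))
```

For the film scheme the graph has seven nodes and six arcs, i.e. it is a tree, and the size of $Q'$ stays within the bound of Lemma 6.3; for the second family the closure already has 14 members from 4 initial sets.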
Thus, we can represent a database scheme with the (CJ) property by a loop-free Bachman diagram without having to introduce too many new relation schemes. These relations can facilitate the computation of the associations $D[X]$ by having essentially ready the relations $R'_j$ of Theorem 5.3. A database $D$ over $Q$ is legal if it satisfies the following existence constraints. For all nodes $X, Y$ with $X \to Y$ the corresponding relations satisfy $R_X \supseteq \pi_X(R_Y)$.

Theorem 6.3 Let $D$ be a legal database over the database scheme $Q$ of a loop-free Bachman diagram $T$, and $X$ a set of attributes.

(1) If $X$ is contained in at least two relation schemes, then $D[X] = $ [...].

(2) If $X$ is not contained in two or more relation schemes, then $D[X] = \pi_X(\bowtie_{\mathbf{R}_i \in T_X} R_i)$, where $T_X$ is as in Lemma 4.3. □

We leave it as an open problem to determine those acyclic schemes $Q$ that have a representing tree $T$ which can produce all $D[X]$ (by joining connected subsets), even with the addition of existence constraints. (It is easy to see that some kind of such constraints is necessary for this to hold.) Such a tree would keep, in some sense, attributes as closely connected as possible. For example, the scheme of Figure 2 has such a tree: the Bachman diagram of Figure 4 (nodes A, P, D are unnecessary).

[Figure 4]

7. AN ACYCLIC JOIN DEPENDENCY AND FUNCTIONAL DEPENDENCIES

As in Section 5, a database $D = \{R_1, \ldots, R_k\}$ represents the information common to all containing universal instances (i.e. instances $R$ with $\pi_{\mathbf{R}_i}(R) \supseteq R_i$) which satisfy the dependencies, if there is such a containing universal instance [H]. We can check the dependencies and find the information represented by $D$, by forming a universal relation $R(D)$ as in Section 5 and then chasing the dependencies on $R(D)$. In this Section we will see how to do this efficiently in the presence of an acyclic join dependency $*Q$ and a set of functional dependencies.

7.1. Testing a functional dependency.

Let $D = \{R_1, \ldots, R_k\}$ be a database over the acyclic scheme $Q$ and $f: X \to A$ a functional dependency. We shall show how to check if $\bowtie_i R_i$ satisfies $f$, and list all violations if it does not, i.e. list all pairs of distinct $A$-values associated with the same $X$-value, in time polynomial in the size of $D$. Note that we cannot simply compute $\pi_{XA}(\bowtie_i R_i)$ and then check $f$, since its size might not be polynomially bounded in that of $D$.

Let $T$ be a tree representing $Q$. We root the tree at a node containing attribute $A$, say $\mathbf{R}_1$. We prune the tree leaf to root while associating a graph $G_i$ with every relation $R_i$. The nodes of $G_i$ are the tuples of the current version of $R_i$. Initially, there is an edge between two tuples of $R_i$ if they agree on the attributes of $X \cap \mathbf{R}_i$. Let $T_i$ be the subtree of $T$ rooted at node $\mathbf{R}_i$, $Z_i$ the set of attributes labelling the edge from $\mathbf{R}_i$ to its father ($Z_i \subseteq \mathbf{R}_i$), and $X_i$ the set of attributes of $X$ that are covered by $T_i$. The deletion of node $\mathbf{R}_i$ is carried out as follows. Let $R'_i$ be the current version of $R_i$ (a relation on the set of attributes $\mathbf{R}_i$) and $R'_j$ the current version for its father $\mathbf{R}_j$. At first we project $R'_i$ onto $Z_i$, and merge nodes of $G_i$ that have the same $Z_i$-projection.
Two merged
now plays the role of the set X there (i.e. apply the algorithm to the minimal subtree of T covering X4, or the tree of the set S there for
U).
nodes II, v of Gi are adjacent if there were two nodes u, v respectively merged into them that were adjacent. Then we replace RI by RI W nZ,(Ri), delete Ri and update Gj as follows. TWO nodes a, Y
As for testing if D satisfies the functional dependency f: X - A and the join dependency Q (or equivalently, whether D[XA] satisfies fl, we can either (i) form the universal relation R(D), project back down en the schemes and then check if f is satisfied in the join of the new relations, or still netter (ii) find the set S for XA as in Theorem 4.3, compute the corresponding R;s as for Theorem 5.3 and then check if f holds in the join of the Ris. However, if we
of the new Gj are adjacent if they were adjacent in the old Gj and their Z,-projectii were adjacent in Gj. Let RI be the version of
the root relation when all other nodes have been deleted and G, the corresponding graph. Let R,, be the projection of RI onto A, and G., the graph (on R,,) obtained from Gt by merging all nodes with the same A-projection. tions off X-value. We claim that the edges of G,, give all viola-
in yRi, i.e. all pairs of A-values associated with the same Thus, YRi satitiesf iff GA has no edges
have a set F of functional dependencies, it is not sufficient to check each dependency individually; i.e., D may satisfy $Q$ and f for each fd f of F but still violate the dependencies as a whole.
At first note that, apart from the graphs, the algorithm computes $\pi_{XA}(\bowtie R_i)$ as in the beginning of Section 4. (The reduction process is not needed here since the output is small.) Therefore, when node $R_i$ is deleted, the current relation $R_i$ is equal to a projection of the join of the relations in its subtree. In the graph $G_i$, two tuples $a$ and $b$ are adjacent iff there are two tuples $u$, $v$ of this join which agree with each other on $X_i$ and agree with $a$ and $b$, respectively. Therefore, at the end two A-values $a_1$, $a_2$ are adjacent in $G_A$ iff there are two tuples $t_1$, $t_2$ in $\bowtie R_i$ which agree with each other on X and have $t_1[A] = a_1$, $t_2[A] = a_2$.
Theorem 7.1 Let D be a database over an acyclic scheme $Q$ and $f: X \to A$ a functional dependency. We can check in polynomial time if $\bowtie R_i$ satisfies f, and list all violations (pairs of A-values associated with the same X-value).
We could easily modify the algorithm to give, for each pair of A-values in the list of violations, an X-value with which they are associated, or all such X-values, if so desired. In the last case the algorithm would run in time polynomial in the input and the output. Also, if the database is already reduced (the projection of a universal instance), then we can use the improvements of Section 4.
A similar algorithm can be used to list, for two sets X, Y, all pairs of Y-tuples that are associated with a common X-tuple in $\bowtie R_i$. In this case we must first reduce the database (so that intermediate results will be bounded by the size of the input and the output) and then we can root the tree at any node.
Besides testing a functional dependency, the algorithm can also be used to compute some queries involving equijoins; e.g., "Have these two actors worked with the same director?", "Which films are made by the same director and producer?" Equijoins can be transformed to natural joins by renaming attributes. For example, the second query above corresponds to
$\pi_{FF'}(FD \bowtie FP \bowtie F'D \bowtie F'P)$
where $F'$ is a renaming of F. However, the transformed query may involve cyclic joins.
7.2. Computing the chase
Let $Q = \{R_1, \ldots, R_k\}$ be an acyclic join dependency (database scheme) and F a set of functional dependencies of the form $X \to A$. We shall show how to compute efficiently the chase of a relation R under $Q$ and F. [MSY] shows that any tuple in the chase of a tableau under a single join dependency and a set of functional dependencies can be generated nondeterministically in polynomial time.
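To make two ingredients of this section concrete, the violation graph $G_A$ of Section 7.1 and the identification of values that a chase with functional dependencies performs, here is a minimal Python sketch. It is not the polynomial-time procedure of the paper: it works on one explicitly given relation (a list of dicts) rather than on an acyclic database whose join is never materialized, it ignores the join dependency, and it treats every value as a symbol that may be identified with another, with no distinction between constants and variables. The names `violating_pairs`, `component_finder` and `chase_fds`, and the encoding of an fd as a pair `(X, A)`, are illustrative assumptions.

```python
from itertools import combinations


def _a_values_by_x(rel, X, A):
    """Group the A-values of rel by the X-value of the tuple carrying them."""
    groups = {}
    for t in rel:
        groups.setdefault(tuple(t[a] for a in X), set()).add(t[A])
    return groups


def violating_pairs(rel, X, A):
    """Edges of G_A: pairs of distinct A-values that share an X-value."""
    return {
        frozenset(pair)
        for values in _a_values_by_x(rel, X, A).values()
        for pair in combinations(sorted(values, key=repr), 2)
    }


def component_finder(edges):
    """Union-find over the edges; returns a function value -> representative."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Sort the edges so the chosen representatives do not depend on set order.
    for edge in sorted(edges, key=lambda e: sorted(map(repr, e))):
        a, b = sorted(edge, key=repr)
        parent[find(b)] = find(a)
    return find


def chase_fds(rel, fds):
    """Identify A-values in the same component of G_A until nothing changes."""
    rel = [dict(t) for t in rel]  # work on a copy of the input
    changed = True
    while changed:
        changed = False
        for X, A in fds:
            find = component_finder(violating_pairs(rel, X, A))
            for t in rel:
                rep = find(t[A])
                if rep != t[A]:
                    t[A] = rep  # typed symbols: only column A is rewritten
                    changed = True
        # Delete duplicate tuples, keeping the first occurrence of each.
        rel = list({tuple(sorted(t.items())): t for t in rel}.values())
    return rel


films = [
    {"F": "f1", "D": "d1", "P": "p1"},
    {"F": "f1", "D": "d2", "P": "p1"},
    {"F": "f2", "D": "d2", "P": "p2"},
]
chased = chase_fds(films, [(("F",), "D")])
```

On this three-tuple relation the fd F → D has the single violating pair {d1, d2}; identifying the two values makes the first two tuples coincide, so `chased` has two tuples and no remaining violations. Grouping A-values by X-value keeps the sketch short, but it still enumerates all pairs inside a group, so it is meant only to illustrate the definitions.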
Our algorithm amounts essentially to turning all the nondeterministic steps of the [MSY] procedure into deterministic ones. From R we shall construct (in polynomial time) a relation $R'$ by identifying all symbols that the chase will identify, and such that $chase(R) = m_Q(R')$. Thus, by computing the chase efficiently we mean that we can check if a certain (possibly partially specified) tuple is in it, or generate a projection of it or the total tuples in a projection (if there are nulls), in time polynomial in the size of the input and the output.

Computing the chase of R
While there is a change do
    For each f: X → A in F do
    Begin
        Let D be the projection of R onto $Q$;
        Apply the algorithm of Section 7.1 for f and D;
        Identify A-values in the same connected component of $G_A$ and delete duplicate tuples from R;
    End;

Clearly, the while-loop cannot be executed more than $|R|$ times. Since the algorithm of Section 7.1 takes a polynomial amount of time in the size of D (which is bounded by $|R||Q|$), the algorithm stops with a final relation $R'$ in time polynomial in $|R|$, $|Q|$ and $|F|$. We claim that $chase(R) = m_Q(R')$. Clearly, any two values identified by the algorithm are correctly identified. Since $m_Q(R')$ then satisfies the j.d. $Q$ and all functional dependencies in F, it is the chase of R.
Theorem 7.2 We can compute in polynomial time from a relation R another relation $R'$ such that the chase of R under an acyclic join dependency $Q$ and a set of functional dependencies is equal to $m_Q(R')$. □
Corollary 7.1 Given a set P consisting of functional dependencies and a single acyclic join dependency, we can decide in polynomial time (1) if a given database satisfies P, (2) if a given template dependency is implied by P. □
For a set P of functional dependencies and a single general join dependency $*Q$, we know that it is NP-complete to determine if P implies another join dependency, but functional and multivalued dependencies can be efficiently inferred [MSY, V]. Also, it is easy to modify the proof of Theorem 3.1 to show that it is NP-complete to determine if a given database D satisfies P, even if D is the projection of a universal instance. [H] shows how to test if a database satisfies a set of functional dependencies (no jds).

REFERENCES
A. V. Aho, C. Beeri, J. D. Ullman, The theory of joins in relational databases, ACM Trans. on Database Systems, 4(3), 297-314, (1979).
A. V. Aho, Y. Sagiv, J. D. Ullman, Equivalence among relational expressions, SIAM J. Computing, 8(2), 218-246, (1979).
A. V. Aho, Y. Sagiv, J. D. Ullman, Efficient optimization of a class of relational expressions, ACM Trans. on Database Systems, 4(4).
C. W. Bachman, Data structure diagrams, Data Base, 1(2), 4-10, (1969).
C. Beeri, On the membership problem for multivalued dependencies in relational databases, ACM Trans. on Database Systems.
C. Beeri, P. A. Bernstein, Computational problems related to the design of normal form relational schemes, ACM Trans. on Database Systems, 4(1), 30-59, (1979).
C. Beeri, R. Fagin, D. Maier, A. Mendelzon, J. D. Ullman, M. Yannakakis, Properties of acyclic database schemes, ACM Symp. Theory of Computing, (1981).
C. Beeri, M. Y. Vardi, On the properties of total join dependencies, Proc. Workshop on Formal Bases for Databases, Toulouse, (1979).
P. A. Bernstein, Synthesizing third normal form relations from functional dependencies, ACM Trans. on Database Systems, 1(4), 277-298, (1976).
P. A. Bernstein and N. Goodman, The theory of semi-joins, TR CCA-79-27, Computer Corp. of America, (1979).
C. Berge, Graphs and Hypergraphs, North-Holland, 1973.
E. F. Codd, A relational model for large shared data banks, Comm. ACM, 13(6), 377-387, (1970).
R. Fagin, Multivalued dependencies and a new normal form for relational databases, ACM Trans. on Database Systems, 2(3), 262-278, (1977).
R. Fagin, A. Mendelzon, J. D. Ullman, A simplified universal relation assumption and its properties, RJ2900, IBM, San Jose, Cal., (1980).
P. Honeyman, R. Ladner, M. Yannakakis, Testing the universal instance assumption, Inf. Proc. Letters, 10(1), 14-19, (1980).
Y. E. Lien, On the equivalence of database models, Bell Labs memorandum, (1979).
D. Maier, Discarding the universal instance assumption: preliminary results, Proc. XP1 Conference, Stonybrook, NY, (1980).
D. Maier, A. Mendelzon, Y. Sagiv, Testing implications of data dependencies, ACM Trans. on Database Systems, 4(4), 455-469, (1979).
D. Maier, Y. Sagiv, M. Yannakakis, Testing implications of functional and join dependencies, Journal of ACM, to appear.
J. Rissanen, Theory of relations for databases - a tutorial survey, Proc. 7th Symp. on Mathematical Foundations of Computer Science, Lecture Notes in Computer Science 64, Springer-Verlag, 537-551, (1978).
F. Sadri, J. D. Ullman, A complete axiomatization for a large class of dependencies in relational databases, Proc. 12th Ann. ACM Symp. on Theory of Computing, 117-122, (1980).
Y. Sagiv, Can we use the universal instance assumption without using null values?, 7th ACM-SIGMOD Intl Conf. on Management of Data, 108-120, (1981).
Y. Sagiv and M. Yannakakis, Equivalences among relational expressions with the union and difference operators, Journal of ACM, 27(4), 633-655, (1980).
E. Sciore, Real-world MVDs, 7th ACM-SIGMOD Intl Conf. on Management of Data, 121-132, (1981).
J. D. Ullman, Principles of Database Systems, Computer Science Press, 1979.
M. Y. Vardi, Inferring multivalued dependencies from functional and join dependencies, Dept. of Applied Math, Weizmann Inst. of Science, Rehovot, Israel, (1980).
A. Walker, Time and space in a lattice of universal relations with blank entries, Proc. XP1 Conference, Stonybrook, NY, (1980).
C. Zaniolo, Analysis and design of relational schemata for database systems, TR UCLA-ENG-7769, Dept. of Comp. Sci., UCLA, (1976).