
ALGORITHMS FOR ACYCLIC DATABASE SCHEMES

Mihalis Yannakakis

Bell Laboratories, Murray Hill, NJ 07974

ABSTRACT: Many real-world situations can be captured by a set of functional dependencies and a single join dependency of a particular form called acyclic [B..]. The join dependency corresponds to a natural decomposition into meaningful objects (an acyclic database scheme). Our purpose in this paper is to describe efficient algorithms in this setting for various problems, such as computing projections, minimizing joins, inferring dependencies, and testing for dependency satisfaction.

1. INTRODUCTION

An important part in the design of relational database schemes is the specification of constraints satisfied by the data, called dependencies. The first dependencies to be introduced were the functional dependencies [C]. Their properties are well understood and efficient algorithms have been developed for inferring new dependencies and designing database schemes [BE, Ber]. Multivalued dependencies [F, Z] were introduced to describe those cases where a relation can be decomposed into two of its projections, and join dependencies [R] for a decomposition into several projections without loss of information; i.e. the original relation can be reconstructed by joining the projections. There are also efficient algorithms for the inference of multivalued dependencies [Bee] and their use in the design process. However, in general they are harder to grasp and deal with than functional dependencies, e.g. the best known algorithm to infer a join dependency from multivalued dependencies takes exponential time and space [ABU]. Join dependencies are studied in [BV, MSY, Y].

Recently, [FMU] advanced the hypothesis that most real-world situations have a particularly simple structure: they can be captured by some functional dependencies and a single join dependency that describes a natural decomposition into meaningful objects. Furthermore, the join dependency is in most cases of a special form, called acyclic [B..]. Acyclic join dependencies are those that are equivalent to some set of multivalued dependencies. Such sets of multivalued dependencies are conflict-free, a notion introduced by [L] in the study of the relation between the network and the relational model. [S] argues also that most real-world sets of mvds fall into this category or can be put in such a form. The class of acyclic database schemes contains the class of loop-free Bachman diagram schemes of [L], the class of simply connected schemes of [Z], and bears a close resemblance to the class of tree queries of [BG].

In this paper we shall give efficient algorithms for several problems on acyclic database schemes, such as computing projections, testing satisfaction of the dependencies by a database, and inferring other dependencies. In Section 2 we review the basic terminology. Sections 3-6 deal with a database scheme and its associated join dependency (no functional dependencies). In Section 3 we assume a general database scheme $Q$. We examine the complexity of computing a projection of the join of the relations in a database over $Q$, of determining if the projection can be computed by joining only some of the relations, and of inferring dependencies from the join dependency associated with $Q$. In Section 4 we examine the same problems when $Q$ is an acyclic database scheme. In Section 5 we define the association $D[X]$ between the attributes of a set $X$ represented by a database $D$ and show how to compute it. In Section 6 we examine loop-free Bachman diagram schemes and give a characterization of them in terms of lossless joins. Section 7 assumes a set of functional dependencies and an acyclic join dependency; it examines the inference of other dependencies, and testing if a given database satisfies the dependencies.

CH1701-2/81/0000/008~.75 © 1981 IEEE

2. TERMINOLOGY

In this Section we will go briefly over the basic relational theory terminology. For more details the reader is referred to [U]. The universe is a finite set $U$ of attributes. A relation scheme $\mathbf{R}$ is a subset of $U$. A database scheme $Q$ (over $U$) is a set of relation schemes with union $U$. Every attribute $A$ has an associated set of values, its domain $Dom(A)$. If $X$ is a set of attributes, an $X$-tuple (or $X$-value) is a mapping $\mu$ from $X$ into $\bigcup_{A \in X} Dom(A)$, such that $\mu(A) \in Dom(A)$ for each $A$ in $X$. A relation $R$ over relation scheme $\mathbf{R}$ is a finite set of $\mathbf{R}$-tuples. A database $D$ over a database scheme $Q$ is a set of relations containing one relation over each relation scheme of $Q$.

The projection $t[Y]$ of an $X$-tuple $t$ onto a subset $Y$ of $X$ is the restriction of $t$ to $Y$. The projection $\pi_Y(R)$ of a relation $R$ over $X$ to $Y$ is the set of projections of the tuples in $R$ to $Y$. Let $R_1, \ldots, R_k$ be relations over schemes $\mathbf{R}_1, \ldots, \mathbf{R}_k$ respectively and $\mathbf{R} = \bigcup_i \mathbf{R}_i$. The join of $R_1, \ldots, R_k$, denoted $R_1 \bowtie \cdots \bowtie R_k$ (or $\bowtie_i R_i$ or $\bowtie \{R_1, \ldots, R_k\}$), is the set of $\mathbf{R}$-tuples $t$ with $t[\mathbf{R}_i] \in R_i$ for $i = 1, \ldots, k$.

Using the projection operator $\pi_X$ ($X \subseteq U$) and the join operator we can build project-join relational expressions. An expression $\varphi$ defines a mapping from relations over $U$ to relations over a certain set of attributes, the target relation scheme of $\varphi$, $trs(\varphi)$. An expression $\varphi$ is contained in an expression $\psi$, or $\varphi \subseteq \psi$, if $trs(\varphi) = trs(\psi)$ and $\varphi(R) \subseteq \psi(R)$ for every relation $R$ over $U$; $\varphi$ and $\psi$ are equivalent, denoted $\varphi \equiv \psi$, if $\varphi \subseteq \psi$ and $\psi \subseteq \varphi$.

A useful tool for comparing expressions is the tableau [ASU1]. Each attribute $A$ of $U$ has an associated symbol set $S(A) = \{a, a_1, a_2, \ldots\}$; $a$ is called a distinguished symbol and the $a_i$'s are nondistinguished. A tableau $T$ is a relation over $U$ with the symbol sets as the domains of the attributes. The target relation scheme $trs(T)$ of $T$ is the set of attributes in which $T$ has a distinguished symbol. The summary $s_T$ of $T$ is the $trs(T)$-tuple of distinguished symbols. A tableau $T$ defines a mapping $f_T$ from relations over $U$ (or universal relations) to relations over $trs(T)$ as follows. A valuation $\rho$ is a mapping that maps, for each attribute $A$, $S(A)$ into $Dom(A)$. A valuation is extended to tuples componentwise and to relations elementwise. $f_T$ is defined by $f_T(R) = \{\rho(s_T) \mid \rho \text{ is a valuation with } \rho(T) \subseteq R\}$. A tableau $T_1$ is contained in another tableau $T_2$ (denoted $T_1 \subseteq_T T_2$) if they both have the same target relation scheme and $f_{T_1}(R) \subseteq f_{T_2}(R)$ for every universal relation $R$; $T_1$ and $T_2$ are equivalent (or $T_1 \equiv_T T_2$) if $T_1 \subseteq_T T_2$ and $T_2 \subseteq_T T_1$. Note that, if $T_1 \subseteq T_2$ (where $\subseteq$ is set-inclusion) and $trs(T_1) = trs(T_2)$, then $T_2 \subseteq_T T_1$. We now list some of the basic results of the theory of tableaux [ASU1, ASU2].

(1) For every project-join expression $\varphi$ there is a tableau $T$ with $\varphi(R) = f_T(R)$ for every universal relation $R$. The tableau $T$ is constructed recursively from $\varphi$ as follows. If $\varphi = \pi_X \sigma$, and $S$ is the tableau for $\sigma$, $T$ is obtained from $S$ by changing each distinguished symbol $a$ with $A \notin X$ into a new nondistinguished symbol. If $\varphi = \sigma_1 \bowtie \sigma_2 \bowtie \cdots \bowtie \sigma_k$, and $T_1, T_2, \ldots, T_k$ are the tableaux for $\sigma_1, \ldots, \sigma_k$, then $T$ is the union of the $T_i$'s.

(2) Let $T_1$, $T_2$ be two tableaux with the same target relation scheme. A homomorphism $h$ from $T_1$ to $T_2$ is a mapping from $S(A)$ to $S(A)$ for each $A$, such that $h(a) = a$ and $h(T_1) \subseteq T_2$; the mapping from the tuples of $T_1$ to the tuples of $T_2$ induced by $h$ is a containment mapping. Such a homomorphism exists if and only if $T_2 \subseteq_T T_1$.

(3) Each tableau $T$ has a minimal subset $\hat{T}$ equivalent to $T$; $\hat{T}$ is unique up to renaming of nondistinguished symbols and is called the minimal equivalent tableau of $T$. If $T$ is the tableau of an expression $\varphi$, then $\hat{T}$ is the tableau of an expression $\psi$ equivalent to $\varphi$ that contains the minimum number of (binary) joins.

A functional dependency (or fd) is a statement of the form $X \to Y$ where $X, Y \subseteq U$. It is satisfied by a universal relation $R$ if for all tuples $t_1, t_2$ of $R$ with $t_1[X] = t_2[X]$, also $t_1[Y] = t_2[Y]$ holds. A join dependency (or jd) is a statement of the form $*Q$ where $Q$ is a database scheme over $U$; it is satisfied by a universal relation $R$ if $\bowtie \{\pi_{\mathbf{R}_i}(R) \mid \mathbf{R}_i \in Q\} = R$. A multivalued dependency (or mvd)

$X \twoheadrightarrow Y$ is the join dependency $*\{XY, XZ\}$ where $Z = U - XY$. An embedded join dependency (ejd) is like a join dependency except that the union of the relation schemes does not have to contain all the attributes; i.e. if $S$ is a collection of relation schemes with union $X$, the ejd $*S$ is satisfied by a universal relation $R$ if $\bowtie \{\pi_{X_i}(R) \mid X_i \in S\} = \pi_X(R)$. We will usually say that $S$ has a lossless join. The expression $\bowtie \{\pi_{X_i} \mid X_i \in S\}$ is called a project-join mapping and is denoted by $m_S$. It has the properties (1) $m_S \supseteq \pi_X$ and (2) $m_S(m_S) = m_S$ (idempotency). From the idempotency of the project-join mapping it follows that the join of the relations of a database $D$ over $Q$ satisfies the jd $*Q$.

A template dependency (or td) [SU] is a very general kind of dependency; it is a statement of the form $T/s_T$ where $T$ is a tableau and $s_T$ its summary. It is satisfied by a universal relation $R$ if $f_T(R) = \pi_X(R)$ where $X$ is the target relation scheme of $T$.

A set $P$ of dependencies implies another dependency $\sigma$ (or $P \models \sigma$) if $\sigma$ holds in every relation satisfying $P$. A procedure for testing if $P \models \sigma$ is the chase ([AM, MMS]). The tableau $T_\sigma$ of a template dependency $\sigma = T/s_T$ is $T$. The tableau $T_\sigma$ of a functional dependency $\sigma = X \to Y$ has two tuples; the one has distinguished symbols in the attributes of $XY$ and nondistinguished in the rest, and the other tuple has distinguished symbols in the attributes of $X$ and new nondistinguished symbols in the rest of the attributes. The chase procedure modifies the tableau $T_\sigma$, trying to include a tuple with projection $s_T$ in it if $\sigma$ is a template dependency, or to eliminate the nondistinguished symbols from the attributes of $Y$ if $\sigma$ is an fd; if this happens then it succeeds. Here we will consider only sets $P$ of functional and join dependencies. The rules for modifying the tableau $T_\sigma$ are as follows.

FD-rule. If $f: X \to Y$ is an fd in $P$ and there are two tuples $r_1, r_2$ that agree on $X$ but disagree on an attribute $B$ of $Y$, replace all occurrences of one of $r_1(B)$, $r_2(B)$ by the other, keeping a distinguished symbol or the nondistinguished symbol with a lower subscript.

JD-rule. If $*Q$ is a jd in $P$, replace the tableau $T$ by $m_Q(T)$.

The chase of $T_\sigma$ under $P$, $chase_P(T_\sigma)$, is the tableau resulting after applying the rules for the dependencies in $P$ as far as possible. The procedure has the Church-Rosser property (i.e. $chase_P(T_\sigma)$ does not depend on the order in which the rules are applied), and $P \models \sigma$ iff it succeeds [MMS].

The chase procedure can be applied also to relations $R$ with elements from $Dom(A) \cup S(A)$ for each attribute $A$. In this case, if an element of $S(A)$ is identified with an element (or constant) of $Dom(A)$ in the application of an FD-rule, we keep the constant. If two constants are identified, then the chase procedure stops and we say that a contradiction arises. Again, the order in which the rules are applied is immaterial: either a contradiction will arise or else the final relation $chase_P(R)$ is the same.

With a database scheme $Q$ or a jd $*Q$ we can associate a hypergraph. A hypergraph [Bg], like a graph, consists of a set of nodes and a set of edges. The only difference from a graph is that an edge can contain an arbitrary number of nodes. Graph notions like path, connectedness etc. carry over to hypergraphs in a straightforward way. The hypergraph associated with $Q$ has the universe $U$ as its set of nodes, and the relation schemes of $Q$ as its set of edges. We will denote it also as $Q$. We are going to assume for the rest of this paper that $Q$ is connected; if $Q$ is disconnected then the attributes in the different components of $Q$ are independent of each other, and the results generalize in a straightforward way by considering each component separately.

An acyclic join dependency $*Q$ is defined in [B..] in terms of the topological properties of the associated hypergraph. We will not give the definition here but rather list some equivalent characterizations. It is a join dependency which implies and is implied by (or is equivalent to) some set of mvds. A database scheme $Q$ is acyclic if $*Q$ is an acyclic jd. We are going to use the following characterizations of acyclic database schemes $Q$ [B..].

(1) $Q$ can be reduced to the empty set by repeatedly deleting an attribute if it occurs in exactly one relation scheme, and deleting a relation scheme if it is contained in another scheme.

(2) There is a tree $T$ with the relation schemes as nodes such that the subgraph of $T$ induced by the nodes containing an attribute $A$ is

connected (i.e. a subtree) for each attribute $A$ of $U$. We say that $T$ represents $Q$.

Acyclic schemes $Q$ have also the following properties.

(3) We say that a database $D = \{R_1, \ldots, R_k\}$ over $Q$ is (fully) reduced or the projection of a universal instance if there is a universal relation $R$ such that $R_i = \pi_{\mathbf{R}_i}(R)$ for each $\mathbf{R}_i \in Q$. If $D$ is a database over an acyclic scheme $Q$, then $D$ is reduced iff $R_i = \pi_{\mathbf{R}_i}(R_i \bowtie R_j)$ for each pair of relations $R_i, R_j$ in $D$. There is an efficient algorithm using a particular kind of joins (called semijoins) which computes the reduction of $D$: $D' = \{R'_1, \ldots, R'_k\}$ where $R'_i = \pi_{\mathbf{R}_i}(\bowtie_j R_j)$ [BG].

(4) If $Z$ is a set of attributes, let $Q(Z)$ be the family of nonempty sets $\mathbf{R}_i \cap Z$ where $\mathbf{R}_i \in Q$. $Q(Z)$ is an acyclic scheme over the universe $Z$, called the scheme generated by $Z$.

3. GENERAL DATABASE SCHEMES

Let $Q = \{\mathbf{R}_1, \ldots, \mathbf{R}_k\}$ be a database scheme and $D = \{R_1, \ldots, R_k\}$ a database over it. It is trivial to determine if a given tuple is in the join of the relations in $D$. If however the tuple is only partially specified (in some set of attributes $X$) then the problem is much harder:

Theorem 3.1. It is NP-complete to test if an $X$-tuple $t$ is in $\pi_X(\bowtie_i R_i)$ where $D = \{R_i\}_{i=1,\ldots,k}$ is a database over an arbitrary scheme, even if $D$ is the projection of a universal instance.

Proof. Membership in NP is obvious. For the NP-hardness part we use a result of [SY], that it is NP-complete to test if an expression $\varphi_1 = \pi_X(\bowtie_j \pi_{Y_j})$ is contained in another expression $\varphi_2 = \pi_X(\bowtie_i \pi_{\mathbf{R}_i})$. Let $T_1$ be the tableau of $\varphi_1$ and $s_1$ its summary. From the theory of tableaux, $\varphi_1 \subseteq \varphi_2$ if and only if $s_1 \in \varphi_2(T_1)$. Thus, we can decide if $\varphi_1 \subseteq \varphi_2$ by testing whether $s_1 \in \pi_X(\bowtie_i R_i)$ where $R_i = \pi_{\mathbf{R}_i}(T_1)$. □

As a consequence of Theorem 3.1, it is NP-complete to test if an arbitrary join dependency implies a template dependency of the form $\pi_X m_S = \pi_X$. (Membership in NP follows from the idempotency of the project-join mapping.) There are however some cases of particular interest which can be efficiently decided even for general join dependencies. Let $J = *\{X_1, \ldots, X_n\}$ be a join dependency.

(1) Inferring lossless joins [BeV].

Method. Let $S = \{Y_j\}_{j \in I}$ be a collection of subsets of $U$ with $Y = \bigcup_j Y_j$. We can decide if $J$ implies that $S$ has a lossless join (i.e. if $\bowtie_j \pi_{Y_j}(R) = \pi_Y(R)$ for every universal relation satisfying $J$) as follows. Form a bipartite graph $G$ with nodes the $X_j$'s and the attributes in $U - Y$, and an edge between an attribute and a set if the attribute is in the set. Let $K_1, \ldots, K_l$ be the connected components of $G$. Let $Z_i$ be all the attributes of $Y$ that belong to a set $X_j$ in $K_i$, for $i = 1, \ldots, l$. [BeV] shows that $\bowtie_j \pi_{Y_j}$ is lossless iff each $Z_i$ is contained in some $Y_j$. □

(2) Let $S \subseteq \{X_1, \ldots, X_n\}$ and $X \subseteq \bigcup_{X_i \in S} X_i$. Deciding if $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) = \pi_X$ (*).

Method. We claim that $J$ implies (*) if and only if $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) \equiv \pi_X(\bowtie_{i=1}^{n} \pi_{X_i})$ is a tautology (i.e. holds for all universal relations). At first note that for all universal relations $R$, $\pi_X[m_J(R)] \subseteq \pi_X[m_S(R)]$. Suppose that there is a relation $R$ with $\pi_X m_J(R) \subsetneq \pi_X m_S(R)$. Let $R' = m_J(R)$. Then $R'$ satisfies the join dependency $J$ but violates (*), since $\pi_X(R') = \pi_X[m_J(R)]$ (from the idempotency of $m_J$). Conversely, if $\pi_X[m_J(R)] = \pi_X[m_S(R)]$ for every $R$, then for every instance $I$ satisfying $J$ we have $\pi_X[m_S(I)] = \pi_X[m_J(I)] = \pi_X(I)$, since $m_J(I) = I$. Thus, it suffices to test if $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) \equiv \pi_X(\bowtie_{i=1}^{n} \pi_{X_i})$. Both expressions are simple [ASU1], and therefore their equivalence can be tested in polynomial time using tableaux techniques [ASU1]. □

The significance of (2) is the following: Suppose that a universal instance $R$ (satisfying $J$) is decomposed into its projections on the $X_i$'s. Then we can test efficiently if a projection of it (on some set $X$) can be recovered by joining only some of the relations in the database. Moreover, we can find efficiently the minimum number

of such relations whose join gives the projection of $R$ on $X$: we just have to minimize the (simple) tableau of $\pi_X m_J$ [ASU1, ASU2]. From the theory of tableaux, all minimal subsets $S$ of $J$ with $\pi_X m_S = \pi_X$ have the same minimum cardinality, and the expressions $\pi_X m_S$ have identical tableaux $T$ up to renaming of nondistinguished symbols. Therefore, for every minimal $S = \{X_{i_1}, \ldots, X_{i_m}\}$, $\pi_X m_S(R) = \pi_X[\bowtie_j \pi_{Y_j}(R)]$, where $Y_j$ is the set of attributes in which the $j$-th row of the minimal tableau $T$ has a distinguished or repeated nondistinguished symbol ($Y_j \subseteq X_{i_j}$). In other words, for every minimal set $S$, the computation of $\pi_X(R)$ involves the join of exactly the same relations ($\pi_{Y_j}(R)$) - the best choice of $S$ depends on how fast $\pi_{Y_j}(R)$ can be obtained from $R_{i_j} = \pi_{X_{i_j}}(R)$ (organization of the various relations, supporting data structures, etc.).

Note that we might have $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) = \pi_X$ even though the join is lossy - i.e. $\bowtie_{X_i \in S} \pi_{X_i} \neq \pi_Y$ where $Y = \bigcup_{X_i \in S} X_i$. However, any nonredundant join for computing the projection on $X$ is lossless:

Theorem 3.2 Let $*J$ be a jd and suppose that $S$ is a minimal subset of $J$ such that $\pi_X(\bowtie_{X_i \in S} \pi_{X_i}) = \pi_X$ is implied by $*J$. Then $\bowtie_{X_i \in S} \pi_{X_i}$ is a lossless join. □

Regarding the actual computation of $\pi_X(\bowtie_i R_i)$, the size of the output relation can be exponential in the size of the input database. Thus, the best we can hope for is an algorithm polynomial in the size of the input and the output. However, we know that it is NP-complete to determine if $\bowtie_i R_i$ is empty, or if there is a (universal) relation $R$ with $R_i = \pi_{\mathbf{R}_i}(R)$ for each $i$ [WY]. Also, it is NP-complete to determine if a universal relation $R$ satisfies a join dependency, i.e. $R = \bowtie_i \pi_{\mathbf{R}_i}(R)$ [MSY]. Thus, probably (unless P = NP) there is no efficient (polynomial in the sizes of the input and the output) algorithm to carry out either the reduction of the database or the join of a reduced database.

4. ACYCLIC DATABASE SCHEMES

Let $D = \{R_1, \ldots, R_k\}$ be a database over the acyclic scheme $Q = \{\mathbf{R}_1, \ldots, \mathbf{R}_k\}$. Let $X$ be a set of attributes. We shall show how to compute $\pi_X(\bowtie_i R_i)$ in time polynomial in the size of the input and the output. The algorithm is based on the relation between acyclic database schemes and tree queries. Let $T$ be a tree representing the scheme $Q$ as explained in Section 2. At first we compute a full reduction $D'$ of $D$ as in [BG] using semijoins. Then we root the tree at an arbitrary node, say $\mathbf{R}_1$, and prune it leaf-to-root. Let $T_i$ be the subtree of $T$ rooted at node $\mathbf{R}_i$, $Z_i$ the set of attributes labelling the edge from $\mathbf{R}_i$ to its father $\mathbf{R}_j$ ($Z_i = \mathbf{R}_i \cap \mathbf{R}_j$), and $X_i$ the set of attributes of $X$ that are contained in some node of $T_i$. When the turn of $\mathbf{R}_i$ comes to be deleted, $R_i$ has been replaced by a relation $R'_i$ over the set of attributes $X_i \mathbf{R}_i$. The deletion of node $\mathbf{R}_i$ is carried out by deleting the relation $R'_i$ and replacing the current relation $R'_j$ of its father $\mathbf{R}_j$ by $R'_j \bowtie \pi_{X_i Z_i}(R'_i)$. It is easy to show by an induction that when node $\mathbf{R}_i$ is deleted, the current relation $R'_i$ is equal to $\pi_{X_i \mathbf{R}_i}(\bowtie_{\mathbf{R}_l \in T_i} R_l)$. Thus, when we are left with the root, the relation that is stored there is $R'_1 = \pi_{X \mathbf{R}_1}(\bowtie_i R_i)$, and $\pi_X(R'_1) = \pi_X(\bowtie_i R_i)$.

We had to reduce the database first so that intermediate results will not get exponential in the input and the output:

Lemma 4.1 Throughout the computation, the size of the current version $R'_j$ of $R_j$ is bounded by $|R_j| \cdot |\pi_X(\bowtie_i R_i)|$. □

Theorem 4.1 If $\{R_1, \ldots, R_k\}$ is a database over an acyclic scheme, then $\pi_X(\bowtie_i R_i)$ can be computed in time polynomial in the input and the output. □

Simple variations of the algorithm can be used to

(1) Test if a universal relation $R$ satisfies an acyclic join dependency $*J$. Apply the algorithm to $D = \{\pi_{X_i}(R) \mid X_i \in J\}$ with $X = U$ while making sure that the various relations along the way don't become different from the corresponding projections of $R$. Another way for

doing this is to take a set $M$ of mvds equivalent to $*J$ with $|M| \le |J|$ (there is always such a set [B..]) and check if $R$ satisfies $M$.

(2) Include selections. Let $Y$ be a set of attributes and $Sl(A)$ a subset of $Dom(A)$ for each attribute $A$ in $Y$. To compute the projections on $X$ of the tuples in $\bowtie_i R_i$ which have $A$-value in $Sl(A)$ for each $A \in Y$, we first select those tuples from each relation $R_i$ that can contribute to the result. That is, we remove from $R_i$ those tuples $t$ with $t[A] \notin Sl(A)$ for some $A$ in $Y \cap \mathbf{R}_i$. Then we can apply the algorithm to the remaining relations.

The following now is immediate.

Corollary 4.1 (1) Given a database $\{R_1, \ldots, R_k\}$ and an $X$-tuple $t$, we can decide in polynomial time if $t \in \pi_X(\bowtie_i R_i)$. (2) We can decide in polynomial time if an acyclic join dependency implies a template dependency. □

Returning to the computation of $\pi_X(\bowtie_i R_i)$, we note that we do not have to carry out the second phase using the whole tree. Since the join of the relations in the database satisfies the jd $*Q$ we can use dependencies that are implied by it.

Lemma 4.2. The join of a subset of the relations, whose schemes form a subtree (connected subgraph) of $T$, is lossless. □

Thus, it suffices to join a set of relations whose schemes form a subtree $T'$ of $T$ and have a union containing $X$. We will say that $T'$ covers $X$. If $X$ is contained in some relation scheme $\mathbf{R}_i$, then we can obtain $\pi_X(\bowtie_i R_i)$ by projecting the corresponding relation $R_i$ onto $X$. If not, then the intersection of all $T'$ covering $X$ is also a subtree covering $X$:

Lemma 4.3. If $X$ is not contained in two or more relation schemes, then there is a unique minimal subtree $T_X$ of $T$ covering $X$.

Proof (sketch). The subtree $T_X$ is defined as follows. Let $\mathbf{R}_i$ be a node of $T$ and $T_1, T_2, \ldots, T_l$ the subtrees hanging from it (see Figure 1). If one of the $T_i$'s covers $X$ then $\mathbf{R}_i \notin T_X$. Otherwise we include $\mathbf{R}_i$ into $T_X$.

Figure 1

It can be shown that (1) every subtree that covers $X$ has to contain $T_X$, and (2) $T_X$ covers $X$ and is connected. □

Even though $T_X$ is a minimal subtree covering $X$, it might still contain redundant relations. For example, suppose that we have a film database with relations FD (film-director), FP (film-producer), FA (film-actor) arranged as in the tree of Figure 2.

Figure 2

The minimal subtree that relates directors to producers is the whole tree. However, clearly $DP = \pi_{DP}(FD \bowtie FP)$. We could change the arrangement so that FD and FP become adjacent. This is true in general:

Theorem 4.2 Let $S$ be a subset of schemes from $Q$ that join losslessly. There is a tree $T$ representing $Q$ such that the schemes in $S$ form a subtree of $T$.

Proof (sketch). Let $S = \{\mathbf{R}_{i_1}, \ldots, \mathbf{R}_{i_m}\}$ and $Y = \bigcup_j \mathbf{R}_{i_j}$. From our discussion in Section 3 (see Property (1) of general join dependencies) the schemes $S$ join losslessly if and only if the schemes in $Q - S$ can be partitioned into sets $K_1, \ldots, K_m$, so that (a) if $\mathbf{R}_l \in K_j$ then $\mathbf{R}_l \cap Y \subseteq \mathbf{R}_{i_j}$, and (b) if $\mathbf{R}_l$ and $\mathbf{R}_p$ belong to different $K_j$'s then $\mathbf{R}_l \cap \mathbf{R}_p \cap (U - Y) = \emptyset$. First we show that $S$ and each of $K_j \cup \{\mathbf{R}_{i_j}\}$ are acyclic schemes. We can construct thus trees $T_0, T_1, \ldots, T_m$ representing respectively $S$, $K_1 \cup \{\mathbf{R}_{i_1}\}$, ..., $K_m \cup \{\mathbf{R}_{i_m}\}$. We attach the trees

$T_1, \ldots, T_m$ to $T_0$ at the nodes corresponding to $\mathbf{R}_{i_1}, \ldots, \mathbf{R}_{i_m}$ to get the desired tree. □

Corollary 4.2 All lossless joins (and their projections) can be computed efficiently. □

Note that a lossless join might give a different result (a proper superset) than the projection of the join of all the relations, if the database is not the projection of a universal instance. And there might be good reason for wanting this particular join: for example, if we have also a relation FS (film-sound engineer) in the film database, the way to compute directors of talkies is by joining FD with FS. By contrast to the Corollary, lossy joins can be hard to compute: every database scheme $Q'$ can be augmented to an acyclic one $Q$ by adding appropriate relation schemes; thus, the NP-completeness results in Section 3 are relevant in computing the join of the relations over $Q$.

Combining Theorem 4.2 with Theorem 3.2 we can conclude that the schemes of any nonredundant join computing the projection of $R$ on $X$ form a connected set in some tree representation. We use this fact now to devise an easy method for deriving a (in fact all) minimal join for $X$. That is, we will find a collection $S = \{Y_1, \ldots, Y_m\}$ of sets such that $\pi_X m_S = \pi_X$ is implied by the jd $*Q$, and every subset $S'$ of $Q$, for which the jd $*Q$ implies $\pi_X m_{S'} = \pi_X$, contains $m$ distinct sets $\mathbf{R}_{i_1}, \ldots, \mathbf{R}_{i_m}$ with $Y_j \subseteq \mathbf{R}_{i_j}$ (for $j = 1, \ldots, m$). We compute $S$ using the following elimination algorithm. Let $Q'$ be a family of relation schemes, initially set equal to $Q$. 1) Eliminate an attribute in $U - X$ if it belongs to exactly one set of $Q'$. 2) Eliminate a set in $Q'$ if it is contained in another set. Suppose now that $S = \{Y_1, \ldots, Y_m\}$ is the final family $Q'$ of sets after applying steps 1) and 2) as far as possible.

Theorem 4.3 (1) The j.d. $*Q$ implies $\pi_X m_S = \pi_X$. (2) If $S'$ is any subset of $Q$ that has for each $j = 1, \ldots, m$ a scheme $\mathbf{R}_{i_j}$ containing $Y_j$, then the j.d. $*Q$ implies $\pi_X m_{S'} = \pi_X$. (3) If $S'$ is any subset of $Q$ with $\pi_X m_{S'} = \pi_X$ implied by $*Q$, then $S'$ contains distinct sets $\mathbf{R}_{i_1}, \ldots, \mathbf{R}_{i_m}$ with $Y_j \subseteq \mathbf{R}_{i_j}$ (for $j = 1, \ldots, m$). □

Remark. Parts (1) and (2) hold even if $*Q$ is an arbitrary j.d. Part (3) does not; i.e. the algorithm will not necessarily compute the correct $Y_j$'s if the jd is not acyclic.

5. ASSOCIATIONS WHEN THE DATABASE IS NOT THE PROJECTION OF A UNIVERSAL INSTANCE

Consider the example scheme of Figure 2, and suppose we want to know the associations between producers and directors. The natural interpretation of such a query is the set of pairs $(p, d)$ where director $d$ has worked for (directed a film produced by) producer $p$. However, computing $\pi_{PD}(\bowtie_i R_i)$ will miss all associations of directors of documentaries with (or their) producers. That is, the attribute Actor is inapplicable (or unknown, unspecified etc.) in these cases.

Let $D = \{R_1, \ldots, R_k\}$ be a database. We denote by $D[X]$ the set of $X$-tuples that are in the $X$-projection of each universal relation $R$ satisfying the join dependency $*Q$ with $R_i \subseteq \pi_{\mathbf{R}_i}(R)$ for $i = 1, \ldots, k$. Such a relation $R$ is called a containing instance. $D[X]$ can be computed as follows. We form a universal instance $R(D)$ by padding out to $U$ every tuple in each $R_i$ with new distinct nulls (marked nulls in [W], [M]). Let $R^*(D) = m_Q R(D)$. A tuple of $R^*(D)$ is $X$-total if it contains no nulls in the attributes of $X$. $D[X]$ is the projection onto $X$ of all $X$-total tuples of $R^*(D)$.

From our discussion in the previous Sections, it follows that for a general database scheme it is hard to compute $D[X]$ or determine if a particular $X$-tuple is in it, since both problems are hard even if $D$ is the projection of a universal instance. (Note that if $D$ is the projection of a universal instance then $D[X] = \pi_X(\bowtie_i R_i)$.) Suppose now that $Q$ is an acyclic scheme. Consider Remark (2) after Theorem 4.1, with the projection of $R(D)$ on the relation schemes of $Q$ in place of the database $D$ there, the set $X$ in place of $Y$, and with $Sl(A)$ the set of nonnull $A$-values for each $A$ in $X$. From Remark (2) we have:

Theorem 5.1 If $Q$ is an acyclic database scheme, we can test if an $X$-tuple is in $D[X]$ and compute $D[X]$ (or a projection of it) in time polynomial in the size of the input and the output. □

Returning to the representation of $Q$ by a tree $T$, we noted before that navigating through $T$ (joining the relations on the way) produces valid associations (the joins are lossless). In other words, if the database is the projection of a universal relation $R$ (satisfying the join dependency $*Q$), then the computed relation is the projection of $R$ on the corresponding set $Y$ of attributes - and is therefore independent of the choice of the tree $T$ that represents $Q$. If $D$ is not the projection of a universal instance, then again only valid associations will be produced (i.e. every produced $Y$-tuple is in $D[Y]$); however, some valid associations might be lost, and the result might be sensitive to the tree $T$ chosen. We shall show that $D[X]$ is the set of associations derived by considering all trees representing $Q$.

Lemma 5.1 Let $Q$ be a general database scheme, $D = \{R_1, \ldots, R_k\}$ a database over it. Let $t$ be a tuple of $R^*(D)$ defined on the set of attributes $Y$. Then there is a subset $S$ of $Q$ of relation schemes with a lossless join, with $t[\mathbf{R}_i] \in R_i$ for each $\mathbf{R}_i \in S$ and $Y = \bigcup \{\mathbf{R}_i \mid \mathbf{R}_i \in S\}$. □

Theorem 5.2 Let $Q$ be an acyclic database scheme, $D = \{R_1, \ldots, R_k\}$ a database over it, and $X$ a set of attributes.
(1) If $X$ is contained in a relation scheme, then $D[X] = \bigcup \{\pi_X R_i \mid X \subseteq \mathbf{R}_i\}$.
(2) If $X$ is not contained in any relation scheme, then $D[X] = \bigcup \{\pi_X(\bowtie_{\mathbf{R}_i \in T_X} R_i) \mid T \text{ a tree representing } Q\}$, where $T_X$ is the minimal subtree of $T$ covering $X$ as in Lemma 4.3.

Proof. (1) From the construction of $R^*(D)$, $\bigcup \{\pi_X R_i \mid X \subseteq \mathbf{R}_i\} \subseteq \pi_X R(D) \subseteq \pi_X R^*(D)$. For the opposite inclusion, it is easy to see that any lossless join of a family $S$ of schemes covering $X$ must contain at least one scheme containing $X$. The conclusion then follows from Lemma 5.1.
(2) The one inclusion follows from Lemma 4.2, and the other from Lemma 5.1 and Theorem 4.2. □

In the next Section we shall see that there are some schemes which have a representing tree such that the union can be dropped from Theorem 5.2, and thus $D[X]$ can be computed by just joining some relations from $D$. Even for a general $Q$ however, we can compute $D[X]$ without introducing null values - and clearly faster than the general method described at the beginning of the Section (as shown also in [Sa]). Let $S = \{Y_1, \ldots, Y_m\}$ be the collection of sets in which the rows of the tableau of $\pi_X m_Q$ have a distinguished or repeated nondistinguished symbol. Let $R'_j = \bigcup_{Y_j \subseteq \mathbf{R}_i} \pi_{Y_j}(R_i)$.

Theorem 5.3 [Sa] Let $Q$ be a general database scheme, $D = \{R_1, \ldots, R_k\}$ a database over it. Let $X$ be a set of attributes and let the $R'_j$'s be defined as above. Then $D[X] = \pi_X(\bowtie_j R'_j)$.

Proof. (1) $\pi_X(\bowtie_j R'_j) \subseteq D[X]$. We know that $\pi_X m_S = \pi_X m_Q$. From the definition of $R(D)$, $R'_j \subseteq \pi_{Y_j} R(D)$. Thus, $\pi_X(\bowtie_j R'_j) \subseteq \pi_X m_S R(D) \subseteq \pi_X m_Q R(D) = \pi_X R^*(D)$. Therefore, the set of $X$-total tuples of the left-hand side ($= \pi_X(\bowtie_j R'_j)$) is contained in the set of $X$-total tuples of the right-hand side ($= D[X]$).

(2) $D[X] \subseteq \pi_X(\bowtie_j R'_j)$. Let $t$ be an $X$-total tuple of $R^*(D)$ defined on $Z \supseteq X$. From Lemma 5.1 there is a subfamily $S'$ of $Q$ of relation schemes with $t[\mathbf{R}_i] \in R_i$ for each $\mathbf{R}_i \in S'$, $Z = \bigcup \{\mathbf{R}_i \mid \mathbf{R}_i \in S'\}$ and $m_{S'} = \pi_Z m_Q$. Thus, $\pi_X m_{S'} = \pi_X m_Q$ and therefore $S'$ contains distinct sets $\mathbf{R}_{i_1}, \ldots, \mathbf{R}_{i_m}$ with $Y_j \subseteq \mathbf{R}_{i_j}$ for $j = 1, \ldots, m$. Since $t[\mathbf{R}_{i_j}] \in R_{i_j}$, $t[Y_j] \in R'_j$ for each $j$. Let $Y = \bigcup_j Y_j$; we have $t[Y] \in \bowtie_j R'_j$, and therefore $t[X] \in \pi_X(\bowtie_j R'_j)$. □
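The definition of $D[X]$ by padding with marked nulls can be made concrete with a small sketch. The Python below is only an illustration of the general (not the efficient) method described at the beginning of this Section; the names `Null`, `row`, `project`, `join` and `associations`, the dictionary-based tuple representation, and the sample film data are assumptions introduced here, not part of the paper.

```python
from functools import reduce
from itertools import count


class Null:
    """A marked null (assumed representation): equal only to itself."""
    _ids = count(1)

    def __init__(self):
        self.label = f"null{next(Null._ids)}"

    def __repr__(self):
        return self.label


def row(mapping):
    """Canonical hashable tuple: (attribute, value) pairs sorted by attribute."""
    return tuple(sorted(mapping.items(), key=lambda kv: kv[0]))


def project(rows, attrs):
    """Projection of a set of rows onto the attributes in attrs."""
    return {row({a: v for a, v in r if a in attrs}) for r in rows}


def join(rows1, rows2):
    """Natural join: merge every pair of rows that agree on common attributes."""
    out = set()
    for r in rows1:
        d1 = dict(r)
        for s in rows2:
            d2 = dict(s)
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                out.add(row({**d1, **d2}))
    return out


def associations(database, universe, x):
    """D[X]: the X-projection of the X-total tuples of R*(D) = m_Q(R(D))."""
    # R(D): pad every tuple out to the universe with fresh marked nulls.
    padded = set()
    for rows in database.values():
        for t in rows:
            padded.add(row({a: t[a] if a in t else Null() for a in universe}))
    # R*(D): join of the projections of R(D) onto the relation schemes.
    r_star = reduce(join, (project(padded, scheme) for scheme in database))
    # Keep the X-total tuples (no null on an attribute of X) and project.
    total = {r for r in r_star
             if not any(isinstance(v, Null) for a, v in r if a in x)}
    return project(total, x)


# Hypothetical film database: film f2 is a documentary and has no actor.
films = {
    frozenset("FD"): [{"F": "f1", "D": "d1"}, {"F": "f2", "D": "d2"}],
    frozenset("FP"): [{"F": "f1", "P": "p1"}, {"F": "f2", "P": "p2"}],
    frozenset("FA"): [{"F": "f1", "A": "a1"}],
}
with_nulls = associations(films, "FDPA", "DP")
plain = project(
    reduce(join, ({row(t) for t in rows} for rows in films.values())), "DP")
```

On this sample data `plain`, the projection of the join of all three relations, loses the director-producer pair of the documentary, while `with_nulls` retains it. The sketch joins all projections of the padded instance and therefore makes no attempt at the polynomial bounds of Theorem 5.1 or the null-free method of Theorem 5.3.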

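For reference, the elimination procedure of Section 4 (steps 1 and 2 preceding Theorem 4.3) is short enough to sketch directly; with $X = \emptyset$ it coincides with the reduction in characterization (1) of acyclic schemes in Section 2. The Python below is a minimal sketch under the assumption that schemes are given as sets of attribute names; `eliminate` and `is_acyclic` are names chosen here for illustration.

```python
def eliminate(schemes, keep=frozenset()):
    """Apply the two elimination steps as far as possible.

    Step 1: delete an attribute outside `keep` that occurs in exactly one set.
    Step 2: delete a set that is contained in another set of the family.
    """
    family = {frozenset(s) for s in schemes}
    keep = frozenset(keep)
    changed = True
    while changed:
        changed = False
        for s in list(family):
            lonely = {a for a in s - keep
                      if sum(a in t for t in family) == 1}
            if lonely:
                family.remove(s)
                family.add(s - lonely)  # equal sets merge automatically
                changed = True
        for s in list(family):
            if any(s < t for t in family):  # s is a proper subset of t
                family.remove(s)
                changed = True
    return family


def is_acyclic(schemes):
    """Characterization (1): the family reduces to nothing but the empty set."""
    return eliminate(schemes) <= {frozenset()}


film_schemes = [{"F", "D"}, {"F", "P"}, {"F", "A"}]
minimal_join = eliminate(film_schemes, keep={"D", "P"})
triangle = [{"A", "B"}, {"B", "C"}, {"C", "A"}]
```

For the film scheme with $X = \{D, P\}$ the procedure leaves the two sets FD and FP, in line with the observation in Section 4 that $DP = \pi_{DP}(FD \bowtie FP)$; the triangle scheme is the standard example that does not reduce and is therefore not acyclic.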

6. LOOP-FREE BACWAN DIAGRAMS A loop-free Bochman diagram (IfLSd) [Ba. Li] is a tree T with directions on its edges, whose nodes are distinct sets of attributes, satisfying the following conditions. (1) Every attribute is a node, (2) If X - Y, then X c Y, and (3) If X E Y are nodes, then X : Y. ( is the transitive closure of -1. Let Q be the database scheme containing the nodes of the diagram; we will call Q a lfsd scheme. Clearly. the diagram T is a tree representationof Q ; thus Q is an acyclic scheme. From the definition of a loop-free Bachmandiagram, it follows easily that (a) for every set X, the set of nodes that contain X form a rooted tree, (b) If X and Y are two nodes with a nonempty intersection, then X fl Y is also a node: their lowest commonancestor.

Loop-free Bachman diagram schemes have the following nice property.

(CJ): The jd $\bowtie Q$ implies that every subset of Q with a connected hypergraph has a lossless join.

Theorem 6.1 Every lfBd scheme has the (CJ) property.

Proof. Let Q be the database scheme of a loop-free Bachman diagram T, and let S be a subset of Q with a connected hypergraph. Let $\mathbf{R}_i$, $\mathbf{R}_j$ be two relation schemes in S with a nonempty intersection. From property (b) above there is a path from $\mathbf{R}_i$ to $\mathbf{R}_j$ in T that goes from $\mathbf{R}_i$ up to $\mathbf{R}_i \cap \mathbf{R}_j$ meeting nodes that are subsets of $\mathbf{R}_i$ and then down to $\mathbf{R}_j$ through nodes that are subsets of $\mathbf{R}_j$ (see Figure 3). Let p be this path. We have $\pi_{\mathbf{R}_i} \bowtie \pi_{\mathbf{R}_j} = \bowtie \{\pi_q \mid q \in p\}$. Let S' be the set of nodes that lie in such a path p connecting two sets in S with a nonempty intersection. We have $m_S = m_{S'}$. Since S has a connected hypergraph, S' induces a subtree of T. The conclusion now follows from Lemma 4.2. □

Figure 3

Corollary 6.1 If D is a database over a lfBd scheme Q, then all joins (and their projections) can be computed efficiently.

Proof. Let S be a subset of Q. The join of the relations in each connected component of (the hypergraph of) S is lossless and thus can be computed efficiently. The join of the relations in S is just the Cartesian product of these joins. □

Note that an arbitrary acyclic scheme may have a connected subset with a lossy join. In fact, lfBd schemes are the only ones with the (CJ) property:

Theorem 6.2 Let Q be a database scheme with the (CJ) property. Then Q is contained in a lfBd scheme Q' with the same maximal sets (and thus with an equivalent jd).

The proof of Theorem 6.2 is based on the following two Lemmas whose proof we omit.

Lemma 6.1 If Q satisfies (CJ), then the closure under intersection of Q has also the (CJ) property. □

Let Q' now be the union of the closure under intersection of Q and the set of all singletons. Let us define a directed graph G on the elements of Q' by having an arc $X \to Y$ if $X, Y \in Q'$, $X \subset Y$, and there is no Z in Q' with $X \subset Z \subset Y$ (i.e. $\to$ is the transitive reduction of $\subset$).

Lemma 6.2 If Q has the (CJ) property then G contains no (undirected) cycles. □

The closure under intersection of an arbitrary family of sets can have size exponential in the size of the original family. (Just consider the family of subsets of cardinality n - 1 of a set of size n.)


This is not true for families with the (CJ) property:

Lemma 6.3 If Q is a database scheme with the (CJ) property and Q' is as above, then $|Q'| \le |Q| + 2|U|$. □
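The exponential blow-up for arbitrary families, which Lemma 6.3 rules out under (CJ), can be seen concretely with a small Python sketch. It closes the family of all (n - 1)-element subsets of an n-element set under intersection. The function names are hypothetical, and the sketch assumes that the empty intersection is not counted as a member of the closure.

```python
def closure_under_intersection(family):
    """Smallest superfamily closed under nonempty pairwise intersection."""
    closed = {frozenset(s) for s in family}
    frontier = set(closed)
    while frontier:
        new = set()
        for a in frontier:
            for b in closed:
                c = a & b
                if c and c not in closed:   # assumption: empty sets are dropped
                    new.add(c)
        closed |= new
        frontier = new                       # only new sets can yield new intersections
    return closed

def co_singletons(n):
    """All subsets of cardinality n - 1 of the set {0, ..., n - 1}."""
    universe = frozenset(range(n))
    return [universe - {x} for x in universe]

for n in range(2, 8):
    family = co_singletons(n)
    print(n, len(family), len(closure_under_intersection(family)))
```

Every nonempty proper subset of the universe is an intersection of some of the (n - 1)-subsets, so the closure has $2^n - 2$ members for an input family of only n sets.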

Thus, we can represent a database scheme with the (CJ) property by a loop-free Bachman diagram without having to introduce too many new relation schemes. These relations can facilitate the computation of the associations D[X] by having essentially ready the relations $R_j$ of Theorem 5.3.

A database D over Q is legal if it satisfies the following existence constraints. For all nodes X, Y with $X \to Y$ the corresponding relations satisfy $R_X \supseteq \pi_X(R_Y)$.

Theorem 6.3 Let D be a legal database over the database scheme Q of a loop-free Bachman diagram T, and X a set of attributes.
(1) If X is contained in at least two relation schemes, then $D[X] = \pi_X(R_Y)$, where Y is the unique minimal set of Q containing X.
(2) If X is not contained in two or more relation schemes, then $D[X] = \pi_X(\bowtie_{T_X} R_i)$, where $T_X$ is as in Lemma 4.3. □

We leave it as an open problem to determine those acyclic schemes Q that have a representing tree T which can produce all D[X] (by joining connected subsets), even with the addition of existence constraints. (It is easy to see that some kind of such constraints is necessary for this to hold.) Such a tree would keep, in some sense, attributes as closely connected as possible. For example, the scheme of Figure 2 has the Bachman diagram of Figure 4 (nodes A, P, D are unnecessary).

Figure 4

The scheme Q = {FPM, FPA, FAR, FP, FA, F} has such a tree with the appropriate existence constraints (but has no loop-free Bachman diagram); $Q \cup \{FD\}$ has no such tree.

7. AN ACYCLIC JOIN DEPENDENCY WITH A SET OF FUNCTIONAL DEPENDENCIES

As in Section 5, a database $D = \{R_1, \ldots, R_k\}$ represents the information common to all containing universal instances (i.e. instances R with $\pi_{\mathbf{R}_i}(R) \supseteq R_i$) which satisfy the dependencies, if there is such a containing universal instance [H]. We can check the dependencies and find the information represented by D, by forming a universal relation R(D) as in Section 5 and then chasing the dependencies on R(D). In this Section we will see how to do this efficiently in the presence of an acyclic join dependency $\bowtie Q$ and a set of functional dependencies.

7.1. Testing a functional dependency

Let $D = \{R_1, \ldots, R_k\}$ be a database over the acyclic scheme Q and f: $X \to A$ a functional dependency. We shall show how to check if $\bowtie R_i$ satisfies f, and list all violations if it does not, i.e. list all pairs of distinct A-values associated with the same X-value, in time polynomial in the size of D. Note that we cannot simply compute $\pi_{XA}(\bowtie R_i)$ and then check f, since its size might not be polynomially bounded in that of D.

Let T be a tree representing Q. We root the tree at a node containing attribute A, say $\mathbf{R}_1$. We prune the tree leaf to root while associating a graph $G_i$ with every relation $R_i$. The nodes of $G_i$ are the tuples of the current version of $R_i$. Initially, there is an edge between two tuples of $R_i$ if they agree on the attributes of $X \cap \mathbf{R}_i$. Let $T_i$ be the subtree of T rooted at node $\mathbf{R}_i$, $Z_i$ the set of attributes labeling the edge from $\mathbf{R}_i$ to its father ($Z_i \subseteq \mathbf{R}_i$), and $X_i$ the set of attributes of X that are covered by $T_i$. The deletion of node $\mathbf{R}_i$ is carried out as follows. Let $R_i$ be the current version of the relation for $\mathbf{R}_i$ (a relation on the set of attributes $\mathbf{R}_i$) and $R_j$ the current version for its father $\mathbf{R}_j$. At first we project $R_i$ onto $Z_i$, and merge nodes of $G_i$ that correspond to tuples with the same $Z_i$-projection.
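The two ingredients introduced so far, the initial graph $G_i$ and the grouping of tuples by their $Z_i$-projection, can be sketched in Python as follows. This covers only the set-up, not the full pruning procedure. Relations are assumed to be lists of dicts keyed by attribute name, tuples are identified by their list index, and the function names and sample data are hypothetical.

```python
from itertools import combinations

def initial_edges(relation, scheme, x_attrs):
    """Edges of the initial graph: index pairs of tuples that agree on X ∩ R_i."""
    shared = sorted(set(scheme) & set(x_attrs))

    def key(t):
        return tuple(t[a] for a in shared)

    # If X ∩ R_i is empty, every key is () and the graph is complete.
    return {(i, j) for i, j in combinations(range(len(relation)), 2)
            if key(relation[i]) == key(relation[j])}

def merge_by_projection(relation, z_attrs):
    """Group tuple indices by their projection onto Z_i; each group becomes one node."""
    z = sorted(z_attrs)
    groups = {}
    for idx, t in enumerate(relation):
        groups.setdefault(tuple(t[a] for a in z), []).append(idx)
    return groups

# Hypothetical relation on scheme {F, D}, with X = {P, D} and Z_i = {F}.
films = [
    {"F": "f1", "D": "d1"},
    {"F": "f2", "D": "d1"},
    {"F": "f1", "D": "d2"},
]
edges = initial_edges(films, {"F", "D"}, {"P", "D"})
groups = merge_by_projection(films, {"F"})
print(edges, groups)
```

Here tuples 0 and 1 share the director value and so are joined by an initial edge, while tuples 0 and 2 share the film value and would be merged into one node when the relation is projected onto F.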

Two merged nodes u', v' of $G_i$ are adjacent if there were two nodes u, v respectively merged into them that were adjacent. Then we replace $R_j$ by $R_j \bowtie \pi_{Z_i}(R_i)$, delete $R_i$ and update $G_j$ as follows. Two nodes u, v of the new $G_j$ are adjacent if they were adjacent in the old $G_j$ and their $Z_i$-projections were adjacent in $G_i$. Let $R_1$ be the version of the root relation when all other nodes have been deleted and $G_1$ the corresponding graph. Let $R_A$ be the projection of $R_1$ onto A, and $G_A$ the graph (on $R_A$) obtained from $G_1$ by merging all nodes with the same A-projection. We claim that the edges of $G_A$ give all violations of f in $\bowtie R_i$, i.e. all pairs of A-values associated with the same X-value. Thus, $\bowtie R_i$ satisfies f iff $G_A$ has no edges.

At first note that, apart from the graphs, the algorithm computes $\pi_A(\bowtie R_i)$ as in the beginning of Section 4. (The reduction process is not needed here since the output is small.) Therefore, when node $\mathbf{R}_i$ is deleted the current relation $R_i$ is equal to $\pi_{\mathbf{R}_i}(\bowtie_{T_i} R_j)$. We can show also by an induction that in the corresponding version of the graph $G_i$, two $\mathbf{R}_i$-tuples u' and v' are adjacent iff there are two tuples u, v of $\bowtie_{T_i} R_j$ which agree with each other on $X_i$, and with u' and v' respectively on $\mathbf{R}_i$. Therefore, at the end two A-values $a_1$, $a_2$ are adjacent in $G_A$ iff there are two tuples $t_1$, $t_2$ in $\bowtie R_i$ which agree with each other on X, and have $t_1[A] = a_1$, $t_2[A] = a_2$.

Theorem 7.1 Let D be a database over an acyclic scheme Q and f: $X \to A$ a functional dependency. We can check in polynomial time if $\bowtie R_i$ satisfies f, and list all violations (pairs of A-values associated with the same X-value) if it does not. □

We could easily modify the algorithm to give for each pair of A-values in the list of violations an X-value with which they are associated, or all such X-values, if so desired. In the last case the algorithm would run in time polynomial in the input and the output. Also, if the database is already reduced (the projection of a universal instance), then we can use the improvements of Section 4 where XA now plays the role of the set X there (i.e. apply the algorithm to the minimal subtree of T covering XA, or the tree of the set S there for XA).

As for testing if D satisfies the functional dependency f: $X \to A$ and the join dependency $\bowtie Q$ (or equivalently, whether D[XA] satisfies f), we can either (i) form the universal relation R(D), project back down on the schemes and then check if f is satisfied in the join of the new relations, or still better (ii) find the set S for XA as in Theorem 4.3, compute the corresponding $R_j$'s as for Theorem 5.3 and then check if f holds in the join of the $R_j$'s. However, if we have a set F of functional dependencies, it is not sufficient to check each dependency individually; i.e. D may satisfy $\bowtie Q$ and f for each fd f of F but still violate the dependencies as a whole.

A similar algorithm can be used to list for two sets X, Y, all pairs of Y-tuples that are associated with a common X-tuple in $\bowtie R_i$. In this case we must first reduce the database (so that intermediate results will be bounded by the size of the input and the output) and then we can root the tree at any node. Besides testing a functional dependency the algorithm can be used also to compute some queries involving equijoins; e.g. Have these two actors worked with the same director?, Which films are made by the same director and producer? Equijoins can be transformed to natural joins by renaming attributes. For example, the second query above corresponds to $\pi_{FF'}(FD \bowtie FP \bowtie F'D \bowtie F'P)$ where F' is a renaming of F. However, the transformed query may involve cyclic joins.

7.2. Computing the chase

Let $Q = \{\mathbf{R}_1, \ldots, \mathbf{R}_k\}$ be an acyclic join dependency (database scheme) and F a set of functional dependencies of the form $X \to A$. We shall show how to compute efficiently the chase of a relation R under $\bowtie Q$ and F. [MSY] shows that any tuple in the chase of a tableau under a single join dependency and a set of functional dependencies can be generated nondeterministically in polynomial time. Our algorithm


amounts essentially to turning all nondeterministic steps in this algorithm into deterministic. From R we shall construct (in polynomial time) a relation R' by identifying all symbols that the chase will, and such that chase(R) = $m_Q(R')$. Thus, by computing the chase efficiently we mean that we can check if a certain (possibly partially specified) tuple is in it, generate a projection of it or the total tuples in a projection (if there are nulls) in time polynomial in the size of the input and the output.

Computing the chase of R
While there is a change do
    For each f: X → A in F do
    Begin
        Let D be the projection of R onto Q;
        Apply the algorithm of Section 7.1 for f and D;
        Identify A-values in the same connected component of G_A
            and delete duplicate tuples from R;
    End;

Clearly, the while-loop cannot be executed more than |R| times. Since the algorithm of Section 7.1 takes polynomial amount of time in the size of D (which is bounded by k|R|), the algorithm stops with a final relation R' in time polynomial in k, |F| and |R|. We claim that chase(R) = $m_Q(R')$. Clearly, any two values identified by the algorithm are correctly identified. Since $m_Q(R')$ then satisfies the j.d. $\bowtie Q$ and all functional dependencies in F, it is the chase of R.

Theorem 7.2 We can compute in polynomial time from a relation R another relation R' such that the chase of R under an acyclic join dependency $\bowtie Q$ and a set of functional dependencies is equal to $m_Q(R')$. □

Corollary 7.1 Given a set P consisting of functional dependencies and a single acyclic join dependency, we can decide in polynomial time (1) if a given database satisfies P, (2) if a given template dependency is implied by P. □

For a set P of functional dependencies and a single general join dependency $\bowtie Q$, we know that it is NP-complete to determine if P implies another join dependency, but functional and multivalued dependencies can be efficiently inferred [MSY, V]. Also, it is easy to modify the proof of Theorem 3.1 to show that it is NP-complete to determine if a given database D satisfies P, even if D is the projection of a universal instance. [H] shows how to test if a database satisfies a set of functional dependencies (no jds).

REFERENCES

[ABU] A. V. Aho, C. Beeri, J. D. Ullman, The theory of joins in relational databases, ACM Trans. on Database Systems, 4(3), 297-314, (1979).
[ASU1] A. V. Aho, Y. Sagiv, J. D. Ullman, Equivalence among relational expressions, SIAM J. Computing, 8(2), 218-246, (1979).
[ASU2] A. V. Aho, Y. Sagiv, J. D. Ullman, Efficient optimization of a class of relational expressions, ACM Trans. on Database Systems, 4(4).
[Ba] C. W. Bachman, Data structure diagrams, Data Base, 1(2), 4-10, (1969).
[Be] C. Beeri, On the membership problem for multivalued dependencies in relational databases, ACM Trans. on Database Systems.
[BB] C. Beeri, P. A. Bernstein, Computational problems related to the design of normal form relational schemes, ACM Trans. on Database Systems, 4(1), 30-59, (1979).
[B..] C. Beeri, R. Fagin, D. Maier, A. Mendelzon, J. D. Ullman, M. Yannakakis, Properties of acyclic database schemes, ACM Symp. Theory of Computing, (1981).
[BV] C. Beeri, M. Y. Vardi, On the properties of total join dependencies, Proc. Workshop on Formal Bases for Databases, Toulouse, (1979).
[Ber] P. A. Bernstein, Synthesizing third normal form relations from functional dependencies, ACM Trans. on Database Systems, 1(4), 277-298, (1976).
[BG] P. A. Bernstein and N. Goodman, The theory of semi-joins, TR CCA-79-27, Computer Corp. of America, (1979).
[Bg] C. Berge, Graphs and Hypergraphs, North-Holland, 1973.
[C] E. F. Codd, A relational model for large shared data banks, Comm. ACM, 13(6), 377-387, (1970).
[F] R. Fagin, Multivalued dependencies and a new normal form for relational databases, ACM Trans. on Database Systems, 2(3), 262-278, (1977).
[FMU] R. Fagin, A. Mendelzon, J. D. Ullman, A simplified universal relation assumption and its properties, RJ2900, IBM, San Jose, Cal., (1980).
[H] P. Honeyman, Testing functional dependency satisfaction, to appear in Journal of ACM.
[HLY] P. Honeyman, R. Ladner, M. Yannakakis, Testing the universal instance assumption, Inf. Proc. Letters, 10(1), 14-19, (1980).
[Li] Y. E. Lien, On the equivalence of database models, Bell Labs memorandum, (1979).
[M] D. Maier, Discarding the universal instance assumption: preliminary results, Proc. XP1 Conference, Stonybrook, NY, (1980).
[MMS] D. Maier, A. Mendelzon, Y. Sagiv, Testing implications of data dependencies, ACM Trans. on Database Systems, 4(4), 455-469, (1979).
[MSY] D. Maier, Y. Sagiv, M. Yannakakis, Testing implications of functional and join dependencies, Journal of ACM, to appear.
[R] J. Rissanen, Theory of relations for databases - a tutorial survey, Proc. 7th Symp. on Mathematical Foundations of Computer Science, Lecture Notes in Computer Science 64, Springer-Verlag, 537-551, (1978).
[SU] F. Sadri, J. D. Ullman, A complete axiomatization for a large class of dependencies in relational databases, Proc. 12th Ann. ACM Symp. on Theory of Computing, 117-122, (1980).
[Sa] Y. Sagiv, Can we use the universal instance assumption without using null values?, 7th ACM-SIGMOD Intl Conf. on Management of Data, 108-120, (1981).
[SY] Y. Sagiv and M. Yannakakis, Equivalences among relational expressions with the union and difference operators, Journal of ACM, 27(4), 633-655, (1980).
[S] E. Sciore, Real-world MVDs, 7th ACM-SIGMOD Intl Conf. on Management of Data, 121-132, (1981).
[U] J. D. Ullman, Principles of Database Systems, Computer Science Press, 1979.
[V] M. Y. Vardi, Inferring multivalued dependencies from functional and join dependencies, Dept. of Applied Math, Weizmann Inst. of Science, Rehovot, Israel, (1980).
[W] A. Walker, Time and space in a lattice of universal relations with blank entries, Proc. XP1 Conference, Stonybrook, NY, (1980).
[Z] C. Zaniolo, Analysis and design of relational schemata for database systems, TR UCLA-ENG-7769, Dept. of Comp. Sci., UCLA, (1976).
