
Multiple Join Size Estimation by Virtual Domains

(extended abstract)

Allen Van Gelder*

University of California, Santa Cruz

Abstract

A model is described to estimate the size of intermediate relations produced by large relational algebra expressions, in particular, those containing several equi-joins. The intended application is within query optimization searches, where fast estimates are needed as many alternative plans are examined. It is shown that previous methods, which use an independence assumption when several attributes are joined, can lead to unrealistically low size estimates. This method attempts to overcome that problem by the introduction of “virtual domains”, which avoid the independence assumption. The method does not require extensive statistics about the database. After describing an “exact” version, an approximation that is simpler and faster is presented.

1 Introduction

Optimization of relational queries is a relatively thankless task because the lack of regular structure in relation instances makes the various costs difficult and complicated to predict. One of the factors that influences costs is the size of intermediate relations that are formed during the computation of an involved query. Following the ground-breaking work on System R [SAC+79, AKS80], research has divided along a number of lines. One line has been to use static statistics more effectively to estimate selectivity of selections [PSC84, Lyn88]. A second has been to take samples dynamically to get statistics on the actual data [HOT88, LN90]. Both approaches are oriented toward small expressions.

Other research has concentrated on search strategies for large expressions involving many joins. Due to associativity and commutativity of the join, search spaces quickly become very large [KBZ86]. The exhaustive approach of System R [SAC+79] becomes impractical, and many approximation heuristics have been investigated [IW87, IK90, IK91, LVZ92, SG88, Swa89]. In this setting, where thousands of alternatives may be investigated, it would not be practical to put significant resources into one size estimate. Therefore, extensive statistics and/or sampling methods have not been combined with large scale searches.

For large scale optimization to pay off it is not necessary to find the globally optimum plan. It is sufficient to avoid really bad plans. However, with a model of limited accuracy underpinning cost estimates, there is a danger that plans that are evaluated as low cost can have an unexpectedly high cost. This phenomenon has been reported in connection with selections against a bibliographic database [Lyn88].

This paper describes how order-of-magnitude errors can arise in connection with estimates of join size. The dangerous situations are those in which relations are joined on several attributes, or one attribute appears in several join predicates. The “traditional” method used in System R [SAC+79] and apparently adopted by most researchers in large scale optimization can sometimes underestimate the size of such an intermediate relation by factors of 100's or 1000's, as shown in Section 3. This is just the kind of error that can lead to a really bad plan selection.

*Work partially supported by NSF grants CCR-89-58590 and IRI-9102513.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
ACM-PODS-5/93/Washington, D.C.
© 1993 ACM 0-89791-593-3/93/0005/0180...$1.50

The purpose of this paper is to develop an improved model for estimating join size when several attributes are joined and when one attribute appears in several joins. It introduces a “virtual domain” model upon which to base the size estimate (Section 4). The method requires no more statistics than System R uses; specifically, it uses the cardinalities of the base relations and the numbers of distinct values of attributes involved in equi-joins. It is designed to give “rough and ready” estimates quickly in a setting where many such
estimates must be cranked out.

The “theoretical” version is based on heuristics and assumptions that are generalizations of those used to justify the System R calculation. A central assumption is that of referential integrity on joined attributes; this assumption is extended to sets of joined attributes.

The principal innovation is that an assumption of independence of join predicates is avoided without requiring extensive (or any) statistics on correlations. (The independence assumption is the main culprit in gross underestimates.) The technicalities of size estimation involve estimating the number of dangling tuples (Section 5) and calculation of join size after their elimination (Section 6).

However, the “theoretical” version involves what may be unacceptably expensive computations in practice. Approximate methods that are more efficient in computation are proposed for practical use (Section 7, Appendix A).

2 Notation

We shall use capital letters or capitalized words to denote relation names and symbols beginning with lowercase letters for attributes. We use R.a to denote attribute a in relation R.

Joins and selections are written in prefix form, with the predicate first, then the relational operands:

$$\bowtie\big((R.a = S.a \wedge S.b = T'.b);\ R,\ S,\ \sigma(T.c = 1;\ T)\big)$$

denotes the join of R, S and T', where R and S are joined on a, and S and T' are joined on b, and T' is the intermediate relation produced by selecting on c = 1 in T.

We consider only equi-joins, not more general $\theta$-joins, so we will drop the “equi-” prefix hereafter.

For representing sizes, we use the notation

$$d(R.a) \stackrel{\mathrm{def}}{=} \text{distinct values of attribute } a \text{ in relation } R$$

Also, $|R|$ denotes the cardinality, or number of tuples, of R.

3 Problems with Multiple Attribute Joins

When a join expression involves multiple attributes, the traditional method is to estimate a selectivity factor for each equality predicate, then multiply them, thereby assuming independence [SAC+79]. This section shows how this can lead to unrealistically low size estimates.

Informally, the selectivity factor for a join of R and S is the probability that a randomly chosen pair of tuples from each relation is joinable. The System R method [SAC+79] to estimate the selectivity factor p relies on the availability of a distinct values number for each relation:

$$p(R.a = S.b) = 1 / \max(d(R.a),\ d(S.b)) \qquad \text{(System R)}$$

This apparently arbitrary formula is actually well reasoned, as will be discussed later. For now, we want to illustrate that its use in compound expressions may lead to problems.

Example 3.1: Consider the relation scheme in Figure 1, where long names are followed by abbreviations and the number of distinct values of various attributes is shown. The rationale of the numbers is that the Emp-Time relation only contains records for current employees to conserve space, but the Employee relation has some ex-employees as well. The company has 10 branches, which use overlapping empno's. The Emp-Time relation contains weekly data for about two years (100 dates).

Relation          Tuples    Attributes(Abbrev, Distinct Values)
Employee(Emp)     2,000     branch(br, 10), empno(1000), currproj(cpj, 50)
Emp-Time(Time)    200,000   branch(br, 10), empno(700), date(200), project(pj, 100)
Project(Proj)     125       project(pj, 125), manager(mgr, 25), budget(bdg, 115)

Figure 1: Example relation scheme with statistics.

Consider the following expression, which asks for Time records of employees on their current project (and related data).

$$\bowtie\big((\text{Emp.br} = \text{Time.br} \wedge \text{Emp.e} = \text{Time.e} \wedge \text{Emp.cpj} = \text{Time.pj} \wedge \text{Emp.cpj} = \text{Proj.pj});\ \text{Emp},\ \text{Time},\ \text{Proj}\big)$$

The selectivities involved are

p(Emp.br = Time.br) = 0.1
p(Emp.e = Time.e) = 0.001
p(Emp.cpj = Time.pj) = 0.01
p(Emp.cpj = Proj.pj) = 0.008

If these selectivities are assumed to be independent, their product, which is $8 \cdot 10^{-9}$, is considered to be the probability that a randomly chosen tuple of the cross product is in the result. Since the cross product has $5 \cdot 10^{10}$ tuples, this leads to an estimate of 400 tuples in the result.

This estimate is unlikely to be reasonable. Apparently there are 700–1400 employees represented in the Time relation, averaging 150–300 tuples each. Assuming employees have been on their current project for an average of 15 weeks, the join could easily contain 20,000 tuples. Clearly something has gone wrong, due to the independence assumption. But, lacking elaborate statistics, it is not clear what other assumption is possible. This topic is addressed in later sections. ❑

Estimates of size that are off by orders of magnitude may lead to poor choices of join order by optimizers; a temporary relation that is estimated to be very small will probably be computed early and used as input to later operations in a complex expression. If it turns out to be large, it will be heavy baggage that is carried through the rest of the computation, perhaps unnecessarily.

Before examining a new approach, we consider another problem that can arise in compound expressions. This problem stems from the hidden effects of transitivity of equality.
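The independence-based estimate of Example 3.1 can be reproduced with a few lines of Python. This is only a sketch under the statistics of Figure 1; the function names and the dictionary layout are illustrative assumptions, not part of the paper.

```python
from math import prod

def system_r_selectivity(d_left: int, d_right: int) -> float:
    """System R rule for the predicate R.a = S.b: 1 / max(d(R.a), d(S.b))."""
    return 1.0 / max(d_left, d_right)

def independent_join_size(cardinality, distinct, predicates) -> float:
    """Cross-product size times the product of per-predicate selectivities."""
    cross_product = prod(cardinality.values())
    selectivity = prod(
        system_r_selectivity(distinct[left], distinct[right])
        for left, right in predicates
    )
    return cross_product * selectivity

# Statistics from Figure 1 (relation and attribute abbreviations as in the text).
CARDINALITY = {"Emp": 2_000, "Time": 200_000, "Proj": 125}
DISTINCT = {
    ("Emp", "br"): 10, ("Emp", "empno"): 1000, ("Emp", "cpj"): 50,
    ("Time", "br"): 10, ("Time", "empno"): 700, ("Time", "pj"): 100,
    ("Proj", "pj"): 125,
}
PREDICATES = [
    (("Emp", "br"), ("Time", "br")),
    (("Emp", "empno"), ("Time", "empno")),
    (("Emp", "cpj"), ("Time", "pj")),
    (("Emp", "cpj"), ("Proj", "pj")),
]

estimate = independent_join_size(CARDINALITY, DISTINCT, PREDICATES)
print(round(estimate))  # compare with the 400 tuples quoted in Example 3.1
```

Each additional predicate multiplies in another factor below 1, so the estimate can only shrink as predicates are added, whether or not they carry new information.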

Example 3.2: Consider this relation scheme.

Relation   Tuples   Attributes(Distinct Values)
R1         1,000    a(100), ...
R2         1,000    a(200), ...
R3         1,000    a(500), ...

Suppose an expression is presented as:

$$\bowtie\big((R1.a = R2.a \wedge R2.a = R3.a);\ R1,\ R2,\ R3\big)$$

We get selectivities p(R1.a=R2.a) = 0.005 and p(R2.a=R3.a) = 0.002, with a product of $10^{-5}$. So the result is estimated to have 10,000 tuples.

But the equivalent expression

$$\bowtie\big((R1.a = R3.a \wedge R3.a = R2.a);\ R1,\ R3,\ R2\big)$$

produces selectivities p(R1.a=R3.a) = 0.002 and p(R3.a=R2.a) = 0.002, with a product of $4 \cdot 10^{-6}$, and an estimated result of 4,000 tuples.

But there is worse to come. The expression could have been written as

$$\bowtie\big((R1.a = R2.a \wedge R2.a = R3.a \wedge R1.a = R3.a);\ R1,\ R2,\ R3\big)$$

Now we have p(R1.a=R2.a) = 0.005, p(R2.a=R3.a) = 0.002, and p(R1.a=R3.a) = 0.002, with the product of $2 \cdot 10^{-8}$. Now the estimated result size is only 20 tuples. ❑

Clearly it is undesirable that minor syntactic features of the relational algebra expression should permit size estimates to vary from 20 to 10,000 tuples. This shows again that methods that work well on simple expressions break down on larger expressions, unless great care is taken.

Figure 2: The nondimensional function $\phi_n(\mu)$.

4 Virtual Domain Technique

This section introduces the virtual domain model for size estimation. First we establish some notation.

Definition 4.1: Assume that a relation of n k-tuples was drawn (uniformly at random with replacement) from a cross product of virtual domains $D_1, \ldots, D_k$ of cardinalities $m_1, \ldots, m_k$. Assume that the number of distinct values of attribute i that were “observed” is $d_i(\text{observed})$. Let $d_i$ be the average value of $d_i(\text{observed})$.

It will also be convenient to define some key ratios: $\delta_i = d_i/n$ and $\mu_i = m_i/n$. ❑

The probability that a specific $x \in D_i$ was not drawn in n samples is

$$\left(1 - \frac{1}{m_i}\right)^n$$

It follows that

$$d_i = m_i\left(1 - \left(1 - \frac{1}{m_i}\right)^n\right)$$

Definition 4.2: For positive integers n, let us define:

$$\phi_n(\mu) \stackrel{\mathrm{def}}{=} \mu\left(1 - \left(1 - \frac{1}{n\mu}\right)^n\right)$$

Also, we define $\phi_n(\infty) = 1$. ❑

Observe that $\phi_n$ is monotonically strictly increasing from $\frac{1}{n}$ to 1 as $\mu$ increases from $\frac{1}{n}$ to $\infty$ (see Figure 2).
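As a quick numerical check of Definition 4.2, the sketch below evaluates $\phi_n(\mu) = \mu(1 - (1 - 1/(n\mu))^n)$ directly and confirms the stated endpoints and monotonicity on a grid. The function name `phi`, the argument order and the grid are illustrative choices, not taken from the paper.

```python
import math

def phi(mu: float, n: int) -> float:
    """phi_n(mu) = mu * (1 - (1 - 1/(n*mu))**n), with phi_n(inf) defined as 1."""
    if math.isinf(mu):
        return 1.0
    return mu * (1.0 - (1.0 - 1.0 / (n * mu)) ** n)

n = 125
# Grid of mu values starting at the smallest meaningful ratio 1/n
# (a virtual domain holding a single value).
grid = [k / n for k in range(1, 20 * n)]
values = [phi(mu, n) for mu in grid]

# Lowest point: a one-value domain always shows exactly one distinct value.
print(values[0], 1 / n)
# Strictly increasing and bounded above by 1 on the grid.
print(all(a < b for a, b in zip(values, values[1:])), max(values) < 1.0)
```

Multiplying by n converts back to counts: `n * phi(m / n, n)` is the expected number of distinct values seen in n draws from a domain of size m.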

For large n, we have the well-known approximation

$$\phi_n(\mu) \approx \mu\left(1 - e^{-1/\mu}\right)$$

Now the ratios $\delta_i$ and $\mu_i$ are related by

$$\delta_i = \phi_n(\mu_i), \qquad \mu_i = \phi_n^{-1}(\delta_i)$$

The inverse is well-defined for $\frac{1}{n} \le \delta_i \le 1$ provided we allow $\infty$ as a value. The inverse has no closed form, but it can be evaluated by an iterative method, or by approximations, to be discussed later.

The main idea of the virtual domain method is to assume that the number of distinct values $d_i$ in a relation is average for n samples, and to infer the virtual domain size by the equation

$$m_i = n\,\phi_n^{-1}(d_i/n)$$

We allow $m_i = \infty$ when $d_i = n$. (The implementation can easily avoid representing $\infty$.)

4.1 Combining Predicates Using Independence

Let us review the rationale for the System R predicate selectivity rule, from the beginning of Section 3. Suppose we are joining on (R.a=S.b). Then there is a strong presumption that these attributes are related in some essential way. A common kind of referential integrity constraint is that one set of values is a subset of the other. These may arise in the cases of foreign keys, as in Example 3.1 where it is logical that Time.empno $\subseteq$ Emp.empno. To apply this presumption to size estimation, we assume that the values of whichever relation has the smaller number of distinct values in the joined attribute form a subset of the other's. For concreteness, suppose $d(R.a) \le d(S.b)$. Then we assume that R has no dangling tuples. Now if we pick a random pair of tuples (r, s), where $r \in R$, $s \in S$, then the probability that they join is

$$p(R.a = S.b) = \sum_{x \in R.a} \mathrm{Prob}(r.a = x)\,\mathrm{Prob}(s.b = x) \qquad \text{for } d(R.a) \le d(S.b)$$

If we assume that all values in S.b occur equally often, this reduces to p(R.a=S.b) = 1/d(S.b). The case where $d(S.b) \le d(R.a)$ is symmetric, leading to the System R formula.

When the join is on several attributes, it is desirable to be able to combine selectivities for individual attributes somehow. An obvious choice is multiplication, which can be rationalized by an assumption of independence.

However, a close examination of Example 3.1 reveals how this strategy can have problems. From that example, let us consider a join on branch and empno. The predicate is

(Emp.br = Time.br $\wedge$ Emp.empno = Time.empno)

leading to the selectivity

$$p = \frac{1}{d(\text{Emp.br})\ d(\text{Emp.empno})}$$

For the moment suppose we had a combined attribute br-empno in each relation. Then the selectivity would be

$$p = \frac{1}{d(\text{Emp.br-empno})}$$

Thus, the multiplication rule implicitly assumes that the number of distinct values of the pair, br-empno, is the product of the numbers of distinct values of each attribute. In the example, this product is 10,000. But it is clearly impossible to have 10,000 distinct values in a relation of 2,000 tuples.

4.2 Combining Predicates Using Virtual Domains

To overcome the problems with the independence assumption, it is necessary to find a new way to combine selectivities. In this section we develop a method to estimate the number of distinct values of sets of attributes, which we will call sub-tuples. We are interested in the sub-tuples consisting of attributes that appear in the equi-join predicate. The idea is to estimate how many distinct such sub-tuples occur in the relations being joined.

For a selectivity-combining operation to make sense, it must be associative and commutative, and must yield values in the range 0 to 1. The following lemma is useful for “discovering” or “inventing” binary functions that are associative and commutative.

Lemma 4.1: Let $f : X \times X \to X$ be an associative-commutative function and let $g : X \to Y$ be bijective (1-1 and onto). Then $h : Y \times Y \to Y$ is also associative-commutative, where h is defined by

$$h(y_1, y_2) = g\big(f(g^{-1}(y_1),\ g^{-1}(y_2))\big)$$

If $e \in X$ is an identity of f, then g(e) is an identity of h.

Proof: Commutativity is obvious. Direct substitution verifies the identity property. For associativity, we need $h(y_1, h(y_2, y_3)) = h(h(y_1, y_2), y_3)$. But

$$h(y_1, h(y_2, y_3)) = g\big(f(g^{-1}(y_1),\ f(g^{-1}(y_2),\ g^{-1}(y_3)))\big)$$
$$h(h(y_1, y_2), y_3) = g\big(f(f(g^{-1}(y_1),\ g^{-1}(y_2)),\ g^{-1}(y_3))\big)$$

and the conclusion follows by associativity of f. ∎
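Lemma 4.1 is easy to exercise numerically. The sketch below uses a deliberately simple instance chosen for illustration (it is not the instance used by the paper): f is multiplication on the positive reals and g is the natural logarithm, so the induced h is ordinary addition. The helper name `conjugate` is hypothetical.

```python
import math
from itertools import product

def conjugate(f, g, g_inv):
    """Build h(y1, y2) = g(f(g_inv(y1), g_inv(y2))) as in Lemma 4.1."""
    def h(y1, y2):
        return g(f(g_inv(y1), g_inv(y2)))
    return h

# f: multiplication on (0, inf), associative and commutative with identity 1.
# g: natural log, a bijection from (0, inf) onto the reals; its inverse is exp.
h = conjugate(lambda a, b: a * b, math.log, math.exp)

samples = [-2.0, -0.5, 0.0, 1.5, 3.0]
commutative = all(
    math.isclose(h(y1, y2), h(y2, y1), abs_tol=1e-9)
    for y1, y2 in product(samples, repeat=2)
)
associative = all(
    math.isclose(h(y1, h(y2, y3)), h(h(y1, y2), y3), abs_tol=1e-9)
    for y1, y2, y3 in product(samples, repeat=3)
)
identity = math.log(1.0)  # g(e), where e = 1 is the identity of f
has_identity = all(math.isclose(h(identity, y), y, abs_tol=1e-9) for y in samples)
print(commutative, associative, has_identity)
```

The same helper accepts any bijection with a computable inverse, which is how a non-obvious combining function can be derived from a familiar one.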

Definition 4.3: We define the binary function $\otimes_n$ as follows:

$$\otimes_n(\delta_1, \delta_2) \stackrel{\mathrm{def}}{=} \phi_n\big(n\,\phi_n^{-1}(\delta_1)\,\phi_n^{-1}(\delta_2)\big)$$

where $\phi_n$ is given by Definition 4.2. ❑

Corollary 4.2: The function $\otimes_n$ is associative and commutative on the space¹ $[\frac{1}{n}, 1]$, and has the identity $\frac{1}{n}$.

Proof: It was observed that $\phi_n$ is bijective from $[\frac{1}{n}, \infty]$ to $[\frac{1}{n}, 1]$, so let it play the role of g in Lemma 4.1. Let $f(z_1, z_2) = n z_1 z_2$ and apply the lemma. ∎

Now let us apply the strange function $\otimes_n$ to the estimation of numbers of distinct pairs of attributes. Suppose attributes 1 and 2 have $d_1$ and $d_2$ distinct values, respectively, in a relation with n tuples. Define $\delta_i = d_i/n$, as usual. The virtual domain sizes are $m_i = \mu_i n$, where

$$\mu_i = \phi_n^{-1}(\delta_i) \qquad i = 1, 2$$

Now suppose we draw n samples from the cross-product domain, of size $m_1 m_2$. The expected number of distinct values $d_{12}$, in analogy with the single-attribute case (Section 4), is:

$$d_{12} = m_1 m_2\left(1 - \left(1 - \frac{1}{m_1 m_2}\right)^n\right)$$

Defining $\delta_{12} = d_{12}/n$, we have:

$$\delta_{12} = \otimes_n(\delta_1, \delta_2)$$

Let us examine some of the properties of $\otimes_n$, and interpretations of $\delta$.

1. Because $\otimes_n$ is associative-commutative with an identity, it can be used as a “set-reduce” function. That is, we can regard it as a function of a set of $\delta$'s, and use it to combine size estimates of any number of attributes consistently.

2. The value $\delta = \frac{1}{n}$ corresponds to a domain with one value, confirming its role as the identity for cross products.

3. The value $\delta = 1$ indicates that the attributes of the sub-tuple corresponding to $\delta$ constitute a key (for this instance, at least): each sub-tuple is unique.

4. The possibly dubious extension $\phi_n(\infty) = 1$ implies that $\otimes_n(1, \delta) = 1$ for any $\delta$. But this is consistent, because a superset of key attributes is also a key.

Although we now have a tool to estimate the number of distinct values of the sub-tuples involved in the join predicate, it cannot be applied immediately, for reasons discussed in the next section.

¹ Recall that [a, b] denotes the interval $\{z \mid a \le z \le b\}$.

5 Dangling Tuple Elimination

Whenever the domains of joined attributes do not coincide, there will be “dangling” tuples; that is, tuples that do not contribute to the join result. The System R size estimation can also be viewed as first eliminating dangling tuples, then estimating the join size from “productive” tuples. The function $\otimes_n$ developed in the previous section allows us to generalize to multi-attribute joins.

Consider a join with (R.a=S.b), where $d(R.a) \le d(S.b)$. Recall that we assume R.a $\subseteq$ S.b due to referential integrity. Let S' be the relation that results from eliminating the dangling tuples (such that b does not match any value in R.a) from S. The relation sizes are $|R|$, $|S|$, and $|S'|$, respectively. We assume that all b values occur equally often, so

$$|S'| = |S|\,\frac{d(R.a)}{d(S.b)}$$

The rest of the S tuples “dangle”. Now a standard size estimate using multiplicities is

$$\text{join size} = d(R.a)\left(\frac{|R|}{d(R.a)}\right)\left(\frac{|S'|}{d(S'.b)}\right)$$

which again leads to the System R formula, since d(R.a) = d(S'.b).

Now let us consider how the reduction of S by the factor d(R.a)/d(S.b) affects the number of distinct values of other attributes. Using the virtual domain model, another attribute, say S.c, with d(S.c) distinct values is estimated to have a virtual domain of size

$$m_c = |S|\;\phi_{|S|}^{-1}\!\left(\frac{d(S.c)}{|S|}\right)$$

But now, instead of taking $|S|$ samples, we take $|S'|$, so we expect the number of distinct values to be

$$d(S'.c) = |S'|\;\phi_{|S'|}\!\left(\frac{m_c}{|S'|}\right)$$

Note that when $m_c \ll |S'|$, then $\phi$ is nearly the identity, so d(S'.c) is little changed from d(S.c), and rounding to an integer ensures there is no change at all.

Example 5.1: Referring to the table in Figure 1, let us see the effects of the removal of dangling tuples. Throughout this discussion, many quantities are estimates, although they are stated as actual values (“the reduced relation has ...”, rather than “the reduced relation is estimated to have ...”).

The relation Proj has d(Proj.pj) = 125, whereas d(Emp.cpj) = 50. Therefore, 60% of the Proj tuples are made dangling by the join predicate (Emp.cpj=Proj.pj).

We require d(Proj'.pj) = 50 to agree with the value of d(Emp.cpj) = 50. Other attributes are treated using virtual domains. For the virtual domains of Proj.bdg and Proj.mgr we have

$$m_{bdg} = 125\;\phi_{125}^{-1}(115/125) = 740$$
$$m_{mgr} = 125\;\phi_{125}^{-1}(25/125) = 25.2$$

Using these values with the reduced relation Proj':

$$d(\text{Proj}'.\text{bdg}) = 50\;\phi_{50}(740/50) = 48$$
$$d(\text{Proj}'.\text{mgr}) = 50\;\phi_{50}(25.2/50) = 22$$

These attributes happen not to be involved in join predicates in this example.

In the Emp relation, of the 1000 distinct values of empno only 70% match an empno in Time. Therefore, eliminating those dangling tuples reduces Emp to 1400 tuples, and in the process sets d(Emp'.empno) = 700. (Recall that these are the working estimates; the actual number of tuples remaining could vary from 700 to 1700.) This reduction affects d(Emp.cpj) (which was 50) by only a minuscule amount, not enough to change its integer value. Intuitively, whether we take 2000 or 1400 samples in a domain of 50, we expect to hit every value. Similarly, d(Emp.br) is unaffected by the reduction.

Now we eliminate dangling tuples from Time, based on its having d(Time.pj) = 100 and d(Emp.cpj) = 50. This reduces Time to 100,000 tuples, but again, the integer values of d(Time.empno) and d(Time.br) are not affected, being so much smaller.

After removing dangling tuples, the new table is shown in Figure 3. The compound domain br-empno-pj is now identical for Emp and Time. ❑

Observe that feedback can occur in the reductions to remove dangling tuples. In practice, the feedback usually dies out quickly, as in the previous example. A general analysis of the feedback phenomenon is beyond the scope of this abstract. We illustrate one pathological case next. With one approximate method described later, feedback can be avoided.

Example 5.2: Assume the following initial table.

Relation   Tuples   Attributes(Distinct Values)
R          1,000    a(1000), b(500)
S          1,000    a(500), b(1000)

Suppose the join predicate is (R.a=S.a $\wedge$ R.b=S.b). The virtual domains of R.b and S.a are found to be 628. First we remove the dangling tuples of R based on the a attribute. By the virtual domain model, this leads to an estimate d(R'.b) = 345. Similarly, d(S'.a) = 345. The new table is

Relation   Tuples   Attributes(Distinct Values)
R'         500      a(500), b(345)
S'         500      a(345), b(500)

Due to feedback, the reductions have created a substantial number of new dangling tuples in R' and S'. Because the reduced relations have been “filtered” it is no longer plausible to view them as a random sample from a virtual cross-product domain. It is not clear how to further apply the virtual domain model. If we were to blindly continue, the fixpoint would be 1 tuple in each relation. Actually, the System R method does estimate that the join has only 1 tuple!

We propose a rather arbitrary solution, which is to reduce domains as little as possible to achieve agreement. This has the effect of making a conservative (i.e., high) size prediction. In this example, d(R''.a) must be reduced to 345 to agree with d(S'.a), so the cardinality of R'' must be similarly reduced. However, no adjustment of d(R'.b) is forced, so d(R''.b) = 345. Similar adjustments are made for S''. We use the data for R'' and S'' (345 for everything) as the final result of dangling tuple removal. Thus the join is estimated to have 345 tuples. ❑

6 Join Size by Multiplicities

After estimating the reductions in sizes of base relations due to elimination of the dangling tuples, we can obtain estimates for the numbers of distinct values of each “combination attribute” that defines the attributes involved in the relation's join predicates, using the function $\otimes_n$ from Definition 4.3. The method involves calculating a multiplicity of each relation, and estimating the number of distinct values of the universal join key, as defined below.

For the general procedure on relations $R_i$, $i = 1, \ldots, k$, we assume without loss of generality that the attributes are renamed so that the join is a “natural join”. The set of all join-related attributes is denoted by A, and is called the universal join key. That is, for convenience of notation we assume all join predicates are of the form $(R_i.a = R_j.a)$ for some attribute $a \in A$. The lack of dangling tuples implies that the relations satisfy the join dependency $R_i = \pi_{R_i}(\bowtie_j (R_j))$ for each i.

Let d(A) denote the number of distinct values of the universal join key and let $M_i$ denote the multiplicity of $R_i$, i.e., the average number of tuples in $R_i$ that match each universal join key tuple. Then

$$\text{join size} = d(A) \prod_{i=1}^{k} M_i$$

The general procedure to estimate d(A) and $M_i$ is a bit involved, and can be expensive, but most of it only has to be done once for the entire search. Before describing the general case, we illustrate with some examples.
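The dangling-tuple adjustment of Section 5 can be sketched in Python as follows, applied to the Proj figures of Example 5.1. The text only says the inverse can be found "by an iterative method"; plain bisection is used here as one possible choice, and the function names are hypothetical. The sketch works with domain sizes m directly rather than the ratios $\mu = m/n$, which is equivalent.

```python
import math

def expected_distinct(m: float, n: int) -> float:
    """Expected distinct values in n uniform draws from a domain of size m."""
    if math.isinf(m):
        return float(n)
    return m * (1.0 - (1.0 - 1.0 / m) ** n)

def virtual_domain_size(d: float, n: int) -> float:
    """Solve expected_distinct(m, n) = d for m by bisection (assumes 1 <= d <= n)."""
    if d >= n:
        return math.inf
    lo, hi = 1.0, 2.0
    while expected_distinct(hi, n) < d:  # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if expected_distinct(mid, n) < d:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def reduce_distinct(d: float, n_old: int, n_new: int) -> float:
    """Distinct-value estimate after the relation shrinks from n_old to n_new tuples."""
    return expected_distinct(virtual_domain_size(d, n_old), n_new)

# Proj shrinks from 125 to 50 tuples when its dangling tuples are removed.
m_bdg = virtual_domain_size(115, 125)   # Example 5.1 quotes roughly 740
m_mgr = virtual_domain_size(25, 125)    # Example 5.1 quotes roughly 25.2
d_bdg = reduce_distinct(115, 125, 50)
d_mgr = reduce_distinct(25, 125, 50)
print(round(m_bdg), round(m_mgr, 1), round(d_bdg), round(d_mgr))
```

Because this sketch uses the exact finite-n formula, the virtual domain sizes may differ slightly from the rounded figures quoted in the example.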

Relation Tuples Attributes(Abbrev, Distinct Values)
Emp’ 1,400 branch(br, 10), empno(700), currproj(cpj, 50)
Time’ 100,000 branch(br, 10), empno(700), date(200), project(pj, 50)
Proj’ 50 project(pj, 50), manager(mgr, 22), budget(bdg, 48)

Figure 3: Example schema and (estimated) statistics after removal of dangling tuples. Estimates are based on virtual
domain model.
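Using the statistics of Figure 3, the compound-key estimates obtained with the combining function of Definition 4.3 can be sketched as below: each attribute's virtual domain size is recovered, the sizes are multiplied, and the expected number of distinct values in n draws from that product domain is computed. The helper names and the bisection inverse are assumptions made for illustration.

```python
import math

def expected_distinct(m: float, n: int) -> float:
    """Expected distinct values in n uniform draws from a domain of size m."""
    if math.isinf(m):
        return float(n)
    return m * (1.0 - (1.0 - 1.0 / m) ** n)

def virtual_domain_size(d: float, n: int) -> float:
    """Invert expected_distinct in m by bisection (assumes 1 <= d <= n)."""
    if d >= n:
        return math.inf
    lo, hi = 1.0, 2.0
    while expected_distinct(hi, n) < d:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if expected_distinct(mid, n) < d:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def combined_distinct(distinct_counts, n: int) -> float:
    """Distinct sub-tuples over several attributes of an n-tuple relation."""
    m = math.prod(virtual_domain_size(d, n) for d in distinct_counts)
    return expected_distinct(m, n)

# Join keys after dangling-tuple removal (Figure 3).
d_emp = combined_distinct([10, 700, 50], 1_400)      # Emp'.br-empno-cpj
d_time = combined_distinct([10, 700, 50], 100_000)   # Time'.br-empno-pj
d_key = min(d_emp, d_time)                           # shared compound domain

# Multiplicities: tuples per key value in each relation (Proj' has one per pj).
join_size = d_key * (1_400 / d_key) * (100_000 / d_key) * 1.0
print(round(d_emp), round(d_time), round(join_size))
```

The join size here is computed with unrounded multiplicities, so it will not match a figure obtained by rounding each multiplicity first.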

Example 6.1: Recall the final table of Example 5.1, as shown in Figure 3. Our working assumption is that all domains involved in joins are consistent now, including combination domains. In particular, we assume that Emp'.br-empno-cpj has the same values as Time'.br-empno-pj. The number of such values is estimated to be the lesser of the estimates in each relation. This is essentially an extension of the referential integrity assumption to compound domains. Recall that $\otimes_n$ can be regarded as a function on a set. The competing estimates are:

$$d(\text{Emp}'.\text{br-empno-cpj}) = 1400\;\otimes_{1400}\left\{\tfrac{10}{1400},\ \tfrac{700}{1400},\ \tfrac{50}{1400}\right\} = 1398$$
$$d(\text{Time}'.\text{br-empno-pj}) = 100000\;\otimes_{100000}\left\{\tfrac{10}{100000},\ \tfrac{700}{100000},\ \tfrac{50}{100000}\right\} = 86983$$

So we assume both domains have 1398 distinct values of this compound attribute. Consequently, the multiplicity of tuples is 1.002 for Emp' and 100000/1398 = 71 for Time'; in other words, we estimate that each value of br-empno-pj occurs 71 times in Time'. Clearly the multiplicity for Proj' is 1. The size is estimated to be the product of the number of distinct values in the compound domain covering all join attributes and the multiplicities of each relation:

$$\text{join size} = 1398 \cdot 1 \cdot 1.002 \cdot 71 = 99457$$

Compare this with the estimate of 400 by the System R formula. ❑

Example 6.2: Recall the scheme from Example 3.2.

Relation   Tuples   Attributes(Distinct Values)
R1         1,000    a(100), ...
R2         1,000    a(200), ...
R3         1,000    a(500), ...

The System R estimate for the size of the join of all three relations varied from 20 to 4,000 to 10,000 depending on syntax of the expression. Our method eliminates dangling tuples, giving d(A) = 100, $|R2'| = 500$, and $|R3'| = 200$. So $M_1 = 1$, $M_2 = 5$, and $M_3 = 2$. The result is estimated to have 1,000 tuples. ❑

Now we turn to the details of the general case for d(A) and $M_i$. We identify a compound attribute $K_i$ for each $R_i$, which consists of all attributes in that relation that appear in join predicates; we call $K_i$ the join key of the relation $R_i$ (e.g., br-empno-pj in the previous example). The notation d(abc) now unambiguously means the number of distinct values of the set of attributes (or compound attribute) abc.

Regarding each join key as a subset of the space of all (join-related) attributes A, the set of join keys induces a partition on A. The number of distinct values for each partition element is taken as the minimum of the estimates in the various relations in which those attributes appear (by the natural join assumption, all appearances of the attribute are involved in the join condition).

Example 6.3: For the scheme

{R(a, b, c, e), S(a, b, d), T(c, d, e)}

the partition contains ab, c, d and e. The number of distinct values of ab is estimated as

$$d(ab) = \min\left\{\ |R|\;\otimes_{|R|}\!\left(\frac{d(R.a)}{|R|},\ \frac{d(R.b)}{|R|}\right),\ \ |S|\;\otimes_{|S|}\!\left(\frac{d(S.a)}{|S|},\ \frac{d(S.b)}{|S|}\right)\right\}$$

❑

Combinations of two partition elements are estimated by taking the minimum over all estimates in relations that contain those attributes. In the previous example d(ce) is the minimum of the estimates obtained from relations R and T using $\otimes_{|R|}$ and $\otimes_{|T|}$.

Combinations of three partition elements are estimated from combinations of two and single elements, and so on. Only combinations that occur entirely within a relation are estimated. Thus d(abcd) is not estimated in the above example, but d(abce) is computed as

$$d(abce) = |R|\;\otimes_{|R|}\!\left(\frac{d(ab)}{|R|},\ \frac{d(ce)}{|R|}\right)$$

This process culminates in the calculation of distinct-value estimates for all join keys. The multiplicities of

the $R_i$ are now computable as

$$M_i = |R_i| / d(K_i) \qquad i = 1, \ldots, k$$

It remains to calculate the number of distinct values in the entire space, d(A). When the join keys form an acyclic hypergraph [Ull82], follow the “ear removal”, or Graham reduction algorithm. Suppose the join keys are removed in the order $K_1, \ldots, K_k$. (For any other permutation the adjustment is obvious.) Define the remaining common attributes

$$C_i = K_i \cap (K_{i+1} \cup \cdots \cup K_k) = K_i \cap K_j \quad \text{for some } j > i$$

where the second equation reflects the fact that $K_i$ is an ear with respect to $K_{i+1}, \ldots, K_k$. Then it is easy to show by induction that the following expression is well-defined:

$$d(A) = \frac{\prod_{i=1}^{k} d(K_i)}{\prod_{i=1}^{k-1} d(C_i)}$$

That is, any Graham reduction yields the same answer for d(A). Unfortunately, cyclic hypergraphs of join keys do not have such a unique expression for d(A).

Example 6.4: Let d(a) = 2, d(b) = 3, d(c) = 5, d(ab) = 6, d(bc) = 8, and d(ac) = 9. The join keys are ab, bc and ac. The numbers given are realizable. Defining $C_i = K_i \cap K_{i+1}$ leads to estimates of 28.8, 43.2 and 72, depending on order. Defining $C_i = K_i \cap (K_{i+1} \cup \cdots \cup K_k)$ leads to estimates of 14.4, 16, and 27. The maximum cardinality of any abc relation that satisfies the constraints is 16 (as a × bc is an upper bound). ❑

To deal with cyclic hypergraphs we advocate this heuristic: Proceed with the Graham reduction as long as possible. When stuck, for non-ears $K_j$, where $i \le j \le k$, calculate $C_j = K_j \cap (K_i \cup \cdots \cup K_k)$. Then remove a non-ear of maximum $d(C_j)$ value, and continue with Graham reduction until stuck again, etc.

7 A Simple Approximation

The function $\phi_n$ of Definition 4.2 can be approximated by the extremely simple function (see Appendix A for a smoother alternative):

$$\hat{\phi}(\mu) = \min(\mu, 1), \qquad \hat{\phi}^{-1}(\delta) = \delta$$

The inverse is not quite genuine, as values of $\mu$ greater than 1 cannot be mapped back into. This turns out not to matter in practice because $\hat{\phi}$ will never be applied to $\mu > 1$. The analog of Definition 4.3 is:

$$\hat{\otimes}_n(\delta_1, \delta_2) = \min(n\,\delta_1 \delta_2,\ 1) \qquad \tfrac{1}{n} \le \delta_1, \delta_2 \le 1$$

This function is also associative, commutative, and has $\frac{1}{n}$ as its identity. In terms of distinct values of compound attributes:

$$d_{ij} = n\;\hat{\otimes}_n\!\left(\frac{d_i}{n},\ \frac{d_j}{n}\right) = \min(d_i d_j,\ n)$$

This formula has the following natural interpretation: Whenever the cross-product of actual domains (projections) is larger than the relation, and those attributes are specified in an equi-join, assume they comprise a key.

The elimination of dangling tuples is modeled simply with $\hat{\phi}$, as follows. When reducing a relation's estimated size from $|R|$ to $|R'|$, adjust d(R.a) only if it exceeds $|R'|$; in this case, set d(R'.a) = $|R'|$. In other words, reduce the number of distinct values estimated only when it exceeds the cardinality of the (reduced) relation. Feedback cycles can always be avoided by reducing first for the attribute with the least number of distinct values.

Example 7.1: Referring to the table in Figure 1, and comparing with Example 5.1, let us see the effects of the removal of dangling tuples, as estimated with $\hat{\phi}$.

The relation Proj has d(Proj.pj) = 125, whereas d(Emp.cpj) = 50. Therefore, we require d(Proj'.pj) = 50 to agree with the value of d(Emp.cpj) = 50. Other attributes are treated using approximated virtual domains.

d(Proj'.bdg) = 50
d(Proj'.mgr) = 25

Following the rule that smallest actual domains are adjusted first, Time is reduced based on d(Time.pj) = 100 and d(Emp.cpj) = 50. This reduces Time to 100,000 tuples, but does not affect any estimates of distinct values.

Lastly, in the Emp relation, of the 1000 distinct values of empno only 70% match an empno in Time. Therefore, eliminating those dangling tuples reduces Emp to 1400 tuples, and in the process sets d(Emp'.empno) to 700. This reduction does not affect other attributes because they all are estimated at less than 1400.

After removing dangling tuples in this simplified model, the new table is shown in Figure 4. Using $\hat{\otimes}_n$ to estimate the final join size, we simply observe that 10 · 700 · 50 > 1400 so estimate that the compound attribute br-empno-cpj has 1400 distinct values. Both Emp' and Proj' have multiplicities of 1, so the estimated join size is 100,000. ❑

Although the function evaluation is easier with this simplified virtual domain model, the difficulties in the final steps that arose from cyclic hypergraphs of join keys remain.

Relation   Tuples    Attributes (Abbrev, Distinct Values)
Emp’       1,400     branch(br, 10), empno(700), currproj(cpj, 50)
Time’      100,000   branch(br, 10), empno(700), date(200), project(pj, 50)
Proj’      50        project(pj, 50), manager(mgr, 25), budget(bdg, 50)

Figure 4: Example schema and statistics after removing dangling tuples. Estimates are based on simplified virtual
domain model.
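
The compound-attribute step of the simplified model can be sketched in a few lines of Python. This is only an illustration under an assumption that this excerpt does not state explicitly: that each δ is the number of distinct values of an attribute divided by the relation size n. That reading agrees with the worked example (10 · 700 · 50 > 1400 gives 1400 distinct values). The function names `oplus` and `compound_distinct` are hypothetical.

```python
from functools import reduce


def oplus(n, delta1, delta2):
    """Simplified two-argument combination: min(n * delta1 * delta2, 1)."""
    return min(n * delta1 * delta2, 1.0)


def compound_distinct(n, distinct_counts):
    """Estimate the distinct values of a compound attribute.

    Assumption (not stated in the excerpt): delta = d / n, where d is the
    number of distinct values of one attribute and n is the tuple count.
    distinct_counts must be non-empty.
    """
    deltas = [d / n for d in distinct_counts]
    # Fold the binary combination over all component attributes.
    combined = reduce(lambda a, b: oplus(n, a, b), deltas)
    return n * combined


# Emp' from Figure 4: 1,400 tuples; br, empno, cpj have 10, 700, 50 values.
emp_compound = compound_distinct(1400, [10, 700, 50])
# A case where the product stays below n: br and cpj alone.
br_cpj = compound_distinct(1400, [10, 50])
```

Under this reading the estimate equals the product of the component distinct counts, capped at the number of tuples, which is why the simplified function needs no iterative computation.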

Although the cost of evaluating an expression depends on the order of operations, the size of the result does not. Consequently, size estimates can be done once in a large scale optimization, and re-used as necessary during the search. This should make the cost of computing a more careful estimate acceptable.

8 Conclusion and Future Work
There is little reported in the literature about the correspondence (if any) between plans chosen by optimizers and actual optimum plans on actual databases. Several published “experimental verifications” are actually against simulated databases that were generated in accordance with the assumptions inherent in the method being verified. There seem to be practical problems: it is not clear what databases and queries constitute good benchmarks; actual databases often have sensitive or valuable information, and are not typically put into the public domain for experimentation; experimenters often cannot get control of the low-level database code for a real-world system. This kind of validation is necessary as a check on speculative theory (such as this paper).

This paper has developed what we hope is a more realistic model for estimating the size of intermediate relations than those currently in use. It can be “plugged into” existing methods that follow System R. It is amenable to extension should more statistics be available.

Acknowledgements
We are grateful to Serge Abiteboul and Patrick Valduriez for helpful discussions.

References
[AKS80] M. M. Astrahan, W. Kim, and M. Schkolnick. Evaluation of the System R Access Path Selection Mechanism. Technical Report RJ2797, IBM Research Laboratory, San Jose, CA, 1980.

[HOT88] W.-C. Hou, G. Ozsoyoglu, and B. Taneja. Statistical estimators for relational algebra expressions. In ACM Symposium on Principles of Database Systems, 1988.

[IK90] Y. E. Ioannidis and Y. C. Kang. Randomized algorithms for optimizing large join queries. In ACM SIGMOD Int’l Conf. on Management of Data, 1990.

[IK91] Y. E. Ioannidis and Y. C. Kang. Left-deep vs. bushy trees: An analysis of strategy spaces and its implications for query optimization. In ACM SIGMOD Int’l Conf. on Management of Data, 1991.

[IW87] Y. E. Ioannidis and E. Wong. Query optimization by simulated annealing. In ACM SIGMOD Int’l Conf. on Management of Data, 1987.

[KBZ86] R. Krishnamurthy, H. Boral, and C. Zaniolo. Optimization of nonrecursive queries. In 12th Int’l Conf. on Very Large Data Bases, 1986.

[LN90] R. J. Lipton and J. F. Naughton. Practical selectivity estimation through adaptive sampling. In ACM SIGMOD Int’l Conf. on Management of Data, 1990.

[LVZ92] R. S. G. Lanzelotte, P. Valduriez, and M. Zait. Optimization of object-oriented recursive queries using cost-controlled strategies. In ACM SIGMOD Int’l Conf. on Management of Data, 1992.

[Lyn88] C. A. Lynch. Selectivity estimation and query optimization in large databases with highly skewed distributions of column values. In 14th Int’l Conf. on Very Large Data Bases, 1988.

[PSC84] G. Piatetsky-Shapiro and C. Connell. Accurate estimation of the number of tuples satisfying a condition. In ACM SIGMOD Int’l Conf. on Management of Data, 1984.

[SAC+79] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In ACM SIGMOD Int’l Conf. on Management of Data, 1979.

[SG88] A. Swami and A. Gupta. Optimization of large join queries. In ACM SIGMOD Int’l Conf. on Management of Data, 1988.

[Swa89] A. Swami. Optimization of large join queries: combining heuristics and combinatorial techniques. In ACM SIGMOD Int’l Conf. on Management of Data, 1989.

[Ull82] J. D. Ullman. Principles of Database Systems. Computer Science Press, Rockville, Md., 1982.

Appendix A A Smoother Approximation
The function $\phi_n$ of Definition 4.2 can be approximated more smoothly than the method of Section 7 as follows:

JTZ(P) = l+/L- J-- O<p <co

The limits are well-defined at $\infty$ and 1, respectively.


The analog of Definition 4.3 is:

This function is also associative and commutative, but has 0 as its identity. The value ~ is nearly an identity for large n, however. The polyadic form of $\oplus_n$ is:

&{61, . . ..6k}=& (%!(ni;’(’a))

This approximation can be computed in closed form, whereas the original definition required an iterative computation.
