(extended abstract)
A model is described to estimate the size of intermediate relations produced by large relational algebra expressions, in particular, those containing several equi-joins. The intended application is within query optimization searches, where fast estimates are needed as many alternative plans are examined. It is shown that previous methods, which use an independence assumption when several attributes are joined, can lead to unrealistically low size estimates. This method attempts to overcome that problem by the introduction of “virtual domains”, which avoid the independence assumption. The method does not require extensive statistics about the database. After describing an “exact” version, an approximation that is simpler and faster is presented.

*Work partially supported by NSF grants CCR-89-58590 and IRI-9102513.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
ACM-PODS-5/93/Washington, D.C.
© 1993 ACM 0-89791-593-3/93/0005/0180...$7.50

1 Introduction
Optimization of relational queries is a relatively thankless task because the lack of regular structure in relation instances makes the various costs difficult and complicated to predict. One of the factors that influences costs is the size of intermediate relations that are formed during the computation of an involved query. Following the ground-breaking work on System R [SAC+ 79, AKS80], research has divided along a number of lines. One line has been to use static statistics more effectively to estimate selectivity of selections [PSC84, Lyn88]. A second has been to take samples dynamically to get statistics on the actual data [HOT88, LN90]. Both approaches are oriented toward small expressions.
Other research has concentrated on search strategies [...] associativity and commutativity of the join, search spaces quickly become very large [KBZ86]. The exhaustive approach of System R [SAC+ 79] becomes impractical, and many approximation heuristics have been investigated [IW87, IK90, IK91, LVZ92, SG88, Swa89]. In this setting, where thousands of alternatives may be investigated, it would not be practical to put significant resources into one size estimate. Therefore, extensive statistics and/or sampling methods have not been combined with large scale searches.
For large scale optimization to pay off it is not necessary to find the globally optimum plan. It is sufficient to avoid really bad plans. However, with a model of limited accuracy underpinning cost estimates, there is a danger that plans that are evaluated as low cost can have an unexpectedly high cost. This phenomenon has been reported in connection with selections against a bibliographic database [Lyn88].
This paper describes how order-of-magnitude errors can arise in connection with estimates of join size. The dangerous situations are those in which relations are joined on several attributes, or one attribute appears in several join predicates. The “traditional” method used in System R [SAC+ 79] and apparently adopted by most researchers in large scale optimization can sometimes underestimate the size of such an intermediate relation by factors of 100’s or 1000’s, as shown in Section 3. This is just the kind of error that can lead to a really bad plan selection.
The purpose of this paper is to develop an improved model for estimating join size when several attributes are joined and when one attribute appears in several joins. It introduces a “virtual domain” model upon which to base the size estimate (Section 4). The method requires no more statistics than System R uses; specifically, it uses the cardinalities of the base relations and the numbers of distinct values of attributes involved in equi-joins. It is designed to give “rough and ready” estimates quickly in a setting where many such
estimates must be cranked out.
The “theoretical” version is based on heuristics and assumptions that are generalizations of those used to justify the System R calculation. A central assumption is that of referential integrity on joined attributes; this [...]

We shall use capital letters or capitalized words to [...] denotes the join of R, S and T’ where R and S are joined on a, and S and T’ are joined on b, and T’ is the intermediate relation produced by selecting on c=1 in T.
We consider only equi-joins, not more general θ-joins, so we will drop the “equi-” prefix hereafter.
For representing sizes, we use the notation

    d(R.a) =def distinct values of attribute a in relation R

Also, |R| denotes the cardinality, or number of tuples, of R.

3 Problems with Multiple Attribute Joins
When a join expression involves multiple attributes, the traditional method is to estimate a selectivity factor for each equality predicate, then multiply them, thereby assuming independence [SAC+ 79]. This section shows how this can lead to unrealistically low size estimates.
Informally, the selectivity factor for a join of R and S is the probability that a randomly chosen pair of tuples from each relation is joinable.
The System R method [SAC+ 79] to estimate the selectivity factor p relies on the availability of a distinct values number for each relation:

    p(R.a=S.b) = 1 / max(d(R.a), d(S.b))    (System R)

Relation         Tuples    Attributes(Abbrev, Distinct Values)
Employee(Emp)    2,000     branch(br, 10), empno(1000), currproj(cpj, 50)
Emp-Time(Time)   200,000   branch(br, 10), empno(700), date(200), project(pj, 100)
Project(Proj)    125       project(pj, 125), manager(mgr, 25), budget(bdg, 115)

[...] Time records of employees on their current project (and [...]

    p(Emp.br = Time.br) = 0.1
    p(Emp.e = Time.e) = 0.001
    p(Emp.cpj = Time.pj) = 0.01
    p(Emp.cpj = Proj.pj) = 0.008

If these selectivities are assumed to be independent, their product, which is 8 × 10^-9, is considered to be the probability that a randomly chosen tuple of the cross product is in the result. Since the cross product has 5 × 10^10 tuples, this leads to an estimate of 400 tuples in the result.
This estimate is unlikely to be reasonable. Apparently there are 700–1400 employees represented in the Time relation, averaging 150–300 tuples each. Assuming employees have been on their current project for an average of 15 weeks, the join could easily contain 20,000 tuples. Clearly something has gone wrong, due to the independence assumption. But, lacking elaborate statistics, it is not clear what other assumption is possible. This topic is addressed in later sections. □

Estimates of size that are off by orders of magnitude may lead to poor choices of join order by optimizers; a temporary relation that is estimated to be very small will probably be computed early and used as input to later operations in a complex expression. If it turns out to be large, it will be heavy baggage that is carried through the rest of the computation, perhaps unnecessarily.
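
The arithmetic behind the 400-tuple estimate can be replayed in a few lines. The following is a minimal sketch, not code from the System R implementation: the dictionary layout and the function names `system_r_selectivity` and `system_r_join_size` are illustrative assumptions, and the abbreviation `e` for empno is inferred from the selectivity list. It applies the 1/max rule to each predicate and multiplies the results under the independence assumption.

```python
from math import prod

# Statistics taken from the example schema: relation cardinalities and
# distinct-value counts of the attributes that appear in join predicates.
cardinality = {"Emp": 2_000, "Time": 200_000, "Proj": 125}
distinct = {
    ("Emp", "br"): 10, ("Emp", "e"): 1_000, ("Emp", "cpj"): 50,
    ("Time", "br"): 10, ("Time", "e"): 700, ("Time", "pj"): 100,
    ("Proj", "pj"): 125,
}
join_predicates = [
    (("Emp", "br"), ("Time", "br")),
    (("Emp", "e"), ("Time", "e")),
    (("Emp", "cpj"), ("Time", "pj")),
    (("Emp", "cpj"), ("Proj", "pj")),
]


def system_r_selectivity(left, right):
    """Selectivity of left = right as 1 / max of the distinct-value counts."""
    return 1.0 / max(distinct[left], distinct[right])


def system_r_join_size(predicates):
    """Cross-product size times the product of per-predicate selectivities
    (this multiplication is exactly the independence assumption)."""
    cross_product = prod(cardinality.values())
    selectivity = prod(system_r_selectivity(l, r) for l, r in predicates)
    return cross_product * selectivity


estimate = system_r_join_size(join_predicates)
print(f"System R estimate: {estimate:.0f} tuples")
```

Because each added predicate contributes another factor below 1, the estimate can only shrink as predicates accumulate, which is how it ends up far below the 20,000 tuples argued for in the text.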
Before examining a new approach, we consider another problem that can arise in compound expressions. This problem stems from the hidden effects of transitivity of equality.
Now we have p(R1.a=R2.a) = 0.005, p(R2.a=R3.a) = 0.002, and p(R3.a=R2.a) = 0.002, with the product of 2 × 10^-8. Now the estimated result size is only 20 tuples. □
Clearly it is undesirable that minor syntactic features of the relational algebra expression should permit size estimates to vary from 20 to 10,000 tuples. This shows again that methods that work well on simple expressions break down on larger expressions, unless great care is taken.

[...]

    $d_i = m\left(1 - \left(1 - \frac{1}{m}\right)^{n}\right)$

Definition 4.2: For positive integers n, let us define:

    $\phi_n(\mu) \stackrel{\mathrm{def}}{=} \mu\left(1 - \left(1 - \frac{1}{\mu n}\right)^{n}\right)$

Also, we define $\phi_n(\infty) = 1$. □

Observe that $\phi_n$ is monotonically strictly increasing from $1/n$ to 1 as $\mu$ increases from $1/n$ to $\infty$ (see Figure 2).
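
Definition 4.2 is straightforward to evaluate numerically. The sketch below assumes the definition reads $\phi_n(\mu) = \mu(1 - (1 - 1/(\mu n))^n)$, which agrees with the worked values later in the paper. The names `phi` and `phi_inv` are illustrative, and bisection is chosen here only because the function is strictly increasing; it is not necessarily the iterative method the author had in mind.

```python
import math


def phi(n, mu):
    """phi_n(mu) = mu * (1 - (1 - 1/(mu*n))**n), evaluated stably."""
    if math.isinf(mu):
        return 1.0                      # phi_n(infinity) is defined as 1
    m = mu * n                          # corresponding virtual domain size
    if m <= 1.0:
        return mu                       # one-value domain: (1 - 1/m)**n is 0
    # -expm1(y) equals 1 - exp(y) without cancellation for small |y|.
    return -mu * math.expm1(n * math.log1p(-1.0 / m))


def phi_inv(n, delta, iterations=200):
    """Inverse of phi_n on [1/n, 1] by bisection (phi_n is increasing)."""
    if not 1.0 / n <= delta <= 1.0:
        raise ValueError("delta must lie in [1/n, 1]")
    if delta == 1.0:
        return math.inf
    lo, hi = 1.0 / n, 1.0
    while phi(n, hi) < delta:           # grow the bracket until it covers delta
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if phi(n, mid) < delta:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    n = 125
    for mu in (1.0 / n, 0.2, 1.0, 5.0, 50.0, math.inf):
        print(f"phi_{n}({mu:g}) = {phi(n, mu):.6f}")
```

The inverse is poorly conditioned when delta is close to 1, because the curve flattens there; small changes in the observed distinct-value count then move the virtual domain size considerably.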
For large n, we have the well-known approximation [...]

    $\delta_i = \phi_n(\mu_i)$
    $\mu_i = \phi_n^{-1}(\delta_i)$

The inverse is well-defined for $1/n \le \delta_i \le 1$ provided we allow $\infty$ as a value. The inverse has no closed form, but it can be evaluated by an iterative method, or by approximations, to be discussed later.

[...] example, let us consider a join on branch and empno. [...] leading to the selectivity

    p = 1 / (d(Emp.br) d(Emp.empno))

For the moment suppose we had a combined attribute br-empno in each relation. Then the selectivity would [...]
Definition 4.3: We define the binary function $\oplus_n$ as follows:

    $\oplus_n(\delta_1, \delta_2) \stackrel{\mathrm{def}}{=} \phi_n\left(n\,\phi_n^{-1}(\delta_1)\,\phi_n^{-1}(\delta_2)\right)$

where $\phi_n$ is given by Definition 4.2. □

Corollary 4.2: The function $\oplus_n$ is associative and commutative on the space $[1/n, 1]$, and has the identity $1/n$.
Proof: It was observed that $\phi_n$ is bijective from $[1/n, \infty]$ to $[1/n, 1]$, so let it play the role of g in Lemma 4.1. Let $f(x_1, x_2) = n x_1 x_2$ and apply the lemma. ■

Now let us apply the strange function $\oplus_n$ to the estimation of numbers of distinct pairs of attributes. Suppose attributes 1 and 2 have $d_1$ and $d_2$ distinct values, respectively, in a relation with n tuples. Define $\delta_i = d_i/n$, as usual. The virtual domain sizes are $m_i = \mu_i n$, where

    $\mu_i = \phi_n^{-1}(\delta_i) \qquad i = 1, 2$

Now suppose we draw n samples from the cross-product domain, of size $m_1 m_2$. The expected number of distinct values $d_{12}$, in analogy with the single-attribute case (Section 4), is:

[...]

1. Because $\oplus_n$ is associative-commutative with an identity, it can be used as a “set-reduce” function. That is, we can regard it as a function of a set of $\delta$’s, and use it to combine size estimates of any number of attributes consistently.

5 Dangling Tuple Elimination
Whenever the domains of joined attributes do not coincide, there will be “dangling” tuples; that is, tuples that do not contribute to the join result. The System R size estimation can also be viewed as first eliminating dangling tuples, then estimating the join size from “productive” tuples. The function $\oplus_n$ developed in the previous section allows us to generalize to multi-attribute joins.
Consider a join with (R.a=S.b), where d(R.a) < d(S.b). Recall that we assume R.a ⊆ S.b due to referential integrity. Let S’ be the relation that results from eliminating the dangling tuples (such that b does not match any value in R.a) from S. The relation sizes are |R|, |S|, and |S’|, respectively. We assume that all b values occur equally often, so

    $|S'| = |S|\,\frac{d(R.a)}{d(S.b)}$

The rest of the S tuples “dangle”. Now a standard size estimate using multiplicities is

    $\text{join size} = d(R.a)\left(\frac{|R|}{d(R.a)}\right)\left(\frac{|S'|}{d(R.a)}\right)$

[...]

    $m_c = |S|\,\phi_{|S|}^{-1}\left(\frac{d(S.c)}{|S|}\right)$

But now, instead of taking |S| samples, we take |S’|, so we expect the number of distinct values to be
We require d(Proj’.pj) = 50 to agree with the value of d(Emp.cpj) = 50. Other attributes are treated using virtual domains. For the virtual domains of Proj.bdg and Proj.mgr we have

    $m_{bdg} = 125\,\phi_{125}^{-1}(115/125) = 740$
    $m_{mgr} = 125\,\phi_{125}^{-1}(25/125) = 25.2$

Using these values with the reduced relation Proj’:

    $d(\mathit{Proj}'.bdg) = 50\,\phi_{50}(740/50) = 48$

[...]

Suppose the join predicate is (R.a=S.a ∧ R.b=S.b). The virtual domains of R.b and S.a are found to be 628. First we remove the dangling tuples of R based on the a attribute. By the virtual domain model, this leads to an estimate d(R’.b) = 345. Similarly, d(S’.a) = 345. The new table is

[...]

Due to feedback, the reductions have created a substantial number of new dangling tuples in R’ and S’. Because the reduced relations have been “filtered” it is no longer plausible to view them as a random sample from a virtual cross-product domain. It is not clear how to further apply the virtual domain model. If we were to blindly continue, the fixpoint would be 1 tuple in each relation. Actually, the System R method does estimate [...]

    $\text{join size} = d(A) \prod_{i=1}^{k} M_i$

The general procedure to estimate d(A) and $M_i$ is a bit involved, and can be expensive, but most of it only has to be done once for the entire search. Before describing the general case, we illustrate with some examples.
Relation Tuples Attributes(Abbrev, Distinct Values)
Emp’ 1,400 branch(br, 10), empno(700), currproj(cpj, 50)
Time’ 100,000 branch(br, 10), empno(700), date(200), project(pj, 50)
Proj’ 50 project(pj, 50), manager(mgr, 22), budget(bdg, 48)
Figure 3: Example schema and (estimated) statistics after removal of dangling tuples. Estimates are based on virtual
domain model.
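
The Proj’ row of Figure 3 can be checked numerically. The sketch below is an illustration under stated assumptions rather than the author's procedure: it reads Definition 4.2 as $\phi_n(\mu) = \mu(1 - (1 - 1/(\mu n))^n)$, inverts it by bisection, and uses the hypothetical helper names `virtual_domain` and `expected_distinct`. It first derives the virtual domain sizes of bdg and mgr from the 125-tuple Proj relation, then re-samples them with the 50 tuples that survive dangling-tuple removal.

```python
import math


def phi(n, mu):
    """phi_n(mu) = mu * (1 - (1 - 1/(mu*n))**n), evaluated stably."""
    if math.isinf(mu):
        return 1.0
    m = mu * n                          # virtual domain size
    if m <= 1.0:
        return mu                       # degenerate one-value domain
    return -mu * math.expm1(n * math.log1p(-1.0 / m))


def phi_inv(n, delta, iterations=200):
    """Inverse of phi_n on [1/n, 1] by bisection (an assumed method)."""
    if delta >= 1.0:
        return math.inf
    lo, hi = 1.0 / n, 1.0
    while phi(n, hi) < delta:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if phi(n, mid) < delta:
            lo = mid
        else:
            hi = mid
    return hi


def virtual_domain(n, d):
    """Virtual domain size m that yields d distinct values in n samples."""
    return n * phi_inv(n, d / n)


def expected_distinct(n, m):
    """Expected distinct values when n samples are drawn from m values."""
    return n * phi(n, m / n)


# Proj has 125 tuples with 115 budgets and 25 managers; Proj' keeps 50 tuples.
m_bdg = virtual_domain(125, 115)
m_mgr = virtual_domain(125, 25)
d_bdg = expected_distinct(50, m_bdg)
d_mgr = expected_distinct(50, m_mgr)
print(f"m_bdg = {m_bdg:.0f}, m_mgr = {m_mgr:.1f}")
print(f"d(Proj'.bdg) = {d_bdg:.1f}, d(Proj'.mgr) = {d_mgr:.1f}")
```

The value obtained for `m_bdg` comes out slightly below the 740 quoted in the text, which is unsurprising because the inverse is ill-conditioned when nearly every tuple has its own value; the resulting distinct-value estimates still round to the 48 and 22 shown in Figure 3.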
Example 6.1: Recall the final table of Example 5.1, as shown in Figure 3. Our working assumption is that all domains involved in joins are consistent now, including combination domains. In particular, we assume that Emp’.br-empno-cpj has the same values as Time’.br-empno-pj. The number of such values is estimated to be the lesser of the estimates in each relation. This is essentially an extension of the referential integrity assumption to compound domains. Recall that $\oplus_n$ can be regarded as a function on a set. The competing estimates are:

    $d(\mathit{Emp}'.\text{br-empno-cpj}) = 1400\,\oplus_{1400}\left\{\frac{10}{1400}, \frac{700}{1400}, \frac{50}{1400}\right\} = 1398$
    $d(\mathit{Time}'.\text{br-empno-pj}) = 100000\,\oplus_{100000}\left\{\frac{10}{100000}, \frac{700}{100000}, \frac{50}{100000}\right\} = 86983$

So we assume both domains have 1398 distinct values [...] Compare this with the estimate of 400 by the System R formula. □

Example 6.2: Recall the scheme from Example 3.2.

Tuples  Attributes(Distinct Values)
[...]

The System R estimate for the size of the join of all three relations varied from 20 to 4,000 to 10,000 depending on syntax of the expression. Our method eliminates dangling tuples, giving d(A) = 100, |R2’| = 500, and |R3’| = 200. So $M_1 = 1$, $M_2 = 5$, and $M_3 = 2$. The result is estimated to have 1,000 tuples. □

Now we turn to the details of the general case for d(A) and $M_i$. We identify a compound attribute $K_i$ for each $R_i$, which consists of all attributes in that relation that appear in join predicates; we call $K_i$ the join key of the relation $R_i$ (e.g., br-empno-pj in the previous example). The notation d(abc) now unambiguously means the number of distinct values of the set of attributes (or compound attribute) abc.
Regarding each join key as a subset of the space of all (join-related) attributes A, the set of join keys induces a partition on A. The number of distinct values for each partition element is taken as the minimum of the estimates in the various relations in which those attributes appear (by the natural join assumption, all appearances of the attribute are involved in the join condition).

Example 6.3: For the scheme [...]

Combinations of two partition elements are estimated by taking the minimum over all estimates in relations that contain those attributes. In the previous example d(ce) is the minimum of the estimates obtained from relations R and T using $\oplus_{|R|}$ and $\oplus_{|T|}$.
Combinations of three partition elements are estimated from combinations of two and single elements, and so on. Only combinations that occur entirely within a relation are estimated. Thus d(abcd) is not estimated in the above example, but d(abce) is computed as

    $d(abce) = |R|\,\oplus_{|R|}\left(\frac{d(ab)}{|R|}, \frac{d(ce)}{|R|}\right)$

This process culminates in the calculation of distinct-value estimates for all join keys. The multiplicities of the $R_i$ are now computable as

    $M_i = |R_i| / d(K_i) \qquad i = 1, \ldots, k$

[...] are removed in the order $K_1, \ldots, K_k$. (For any other permutation the adjustment is obvious.) Define the [...]

This function is also associative, commutative, and has [...] as its identity. In terms of distinct values of compound attributes:

[...]

Whenever the cross-product of actual domains (projections) is larger than the relation, and those attributes [...]
Relation Tuples Attributes(Abbrev, Distinct Values)
Emp’ 1,400 branch(br, 10), empno(700), currproj(cpj, 50)
Time’ 100,000 branch(br, 10), empno(700), date(200), project(pj, 50)
Proj’ 50 project(pj, 50), manager(mgr, 25), budget(bdg, 50)
Figure 4: Example schema and statistics after removing dangling tuples. Estimates are based on simplified virtual
domain model.
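
Returning to the exact model, the compound-attribute estimates of Example 6.1 and the join size of Example 6.2 can be replayed as follows. This is a sketch under assumptions: Definition 4.2 is read as $\phi_n(\mu) = \mu(1 - (1 - 1/(\mu n))^n)$, Definition 4.3 as $\oplus_n(\delta_1, \delta_2) = \phi_n(n\,\phi_n^{-1}(\delta_1)\,\phi_n^{-1}(\delta_2))$, the inverse is found by bisection, and the names `oplus`, `distinct_compound` and `join_size` are illustrative only.

```python
import math
from functools import reduce


def phi(n, mu):
    """phi_n(mu) = mu * (1 - (1 - 1/(mu*n))**n), evaluated stably."""
    if math.isinf(mu):
        return 1.0
    m = mu * n                          # virtual domain size
    if m <= 1.0:
        return mu                       # degenerate one-value domain
    return -mu * math.expm1(n * math.log1p(-1.0 / m))


def phi_inv(n, delta, iterations=200):
    """Inverse of phi_n on [1/n, 1] by bisection (an assumed method)."""
    if delta >= 1.0:
        return math.inf
    lo, hi = 1.0 / n, 1.0
    while phi(n, hi) < delta:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if phi(n, mid) < delta:
            lo = mid
        else:
            hi = mid
    return hi


def oplus(n, delta1, delta2):
    """Binary combination: phi_n(n * phi_n^-1(delta1) * phi_n^-1(delta2))."""
    return phi(n, n * phi_inv(n, delta1) * phi_inv(n, delta2))


def distinct_compound(n, distinct_counts):
    """Distinct values of a compound attribute in an n-tuple relation,
    folding oplus over the per-attribute ratios d_i / n (set-reduce)."""
    deltas = [d / n for d in distinct_counts]
    return n * reduce(lambda a, b: oplus(n, a, b), deltas)


def join_size(d_all, multiplicities):
    """join size = d(A) * product of the multiplicities M_i."""
    return d_all * math.prod(multiplicities)


# Example 6.1: the two competing estimates for the compound join key.
emp_key = distinct_compound(1_400, [10, 700, 50])
time_key = distinct_compound(100_000, [10, 700, 50])
key_values = min(emp_key, time_key)
print(f"Emp': {emp_key:.0f}, Time': {time_key:.0f}, used: {key_values:.0f}")

# Example 6.2: d(A) = 100 with multiplicities 1, 5 and 2.
print("Example 6.2 join size:", join_size(100, [1, 5, 2]))
```

Folding the binary function is only one way to realise the set-reduce reading; multiplying the virtual domain sizes and applying $\phi_n$ once should give the same result while avoiding the intermediate inversions.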
Although the cost of evaluating an expression depends on the order of operations, the size of the result does not. Consequently, size estimates can be done once in a large scale optimization, and re-used as necessary during the search. This should make the cost of computing a more careful estimate acceptable.

8 Conclusion and Future Work
There is little reported in the literature about the correspondence (if any) between plans chosen by optimizers and actual optimum plans on actual databases. Several published “experimental verifications” are actually against simulated databases that were generated in accordance with the assumptions inherent in the method being verified. There seem to be practical problems: it is not clear what databases and queries constitute good benchmarks; actual databases often have sensitive or valuable information, and are not typically put into the public domain for experimentation; experimenters often cannot get control of the low-level database code for a real-world system. This kind of validation is necessary as a check on speculative theory (such as this paper).
This paper has developed what we hope is a more realistic model for estimating the size of intermediate relations than those currently in use. It can be “plugged into” existing methods that follow System R. It is amenable to extension should more statistics be available.

Acknowledgements
We are grateful to Serge Abiteboul and Patrick Valduriez for helpful discussions.

References
[AKS80] M. M. Astrahan, W. Kim, and M. Schkolnick. Evaluation of the System R Access Path Selection Mechanism. Technical Report RJ2797, IBM Research Laboratory, San Jose, CA, 1980.
[HOT88] W.-C. Hou, G. Ozsoyoglu, and B. Taneja. Statistical estimators for relational algebra expressions. In ACM Symposium on Principles of Database Systems, 1988.
[IK90] Y. E. Ioannidis and Y. C. Kang. Randomized algorithms for optimizing large join queries. In ACM SIGMOD Int’l Conf. on Management of Data, 1990.
[IK91] Y. E. Ioannidis and Y. C. Kang. Left-deep vs. bushy trees: An analysis of strategy spaces and its implications for query optimization. In ACM SIGMOD Int’l Conf. on Management of Data, 1991.
[IW87] Y. E. Ioannidis and E. Wong. Query optimization by simulated annealing. In ACM SIGMOD Int’l Conf. on Management of Data, 1987.
[KBZ86] R. Krishnamurthy, H. Boral, and C. Zaniolo. Optimization of nonrecursive queries. In 12th Int’l Conf. on Very Large Data Bases, 1986.
[LN90] R. J. Lipton and J. F. Naughton. Practical selectivity estimation through adaptive sampling. In ACM SIGMOD Int’l Conf. on Management of Data, 1990.
[LVZ92] R. S. G. Lanzelotte, P. Valduriez, and M. Zait. Optimization of object-oriented recursive queries using cost-controlled strategies. In ACM SIGMOD Int’l Conf. on Management of Data, 1992.
[Lyn88] C. A. Lynch. Selectivity estimation and query optimization in large databases with highly skewed distributions of column values. In 14th Int’l Conf. on Very Large Data Bases, 1988.
[PSC84] G. Piatetsky-Shapiro and C. Connell. Accurate estimation of the number of tuples satisfying a condition. In ACM SIGMOD Int’l Conf. on Management of Data, 1984.
[SAC+ 79] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In ACM SIGMOD Int’l Conf. on Management of Data, 1979.
[SG88] A. Swami and A. Gupta. Optimization of large join queries. In ACM SIGMOD Int’l Conf. on Management of Data, 1988.
[Swa89] A. Swami. Optimization of large join queries: combining heuristics and combinatorial techniques. In ACM SIGMOD Int’l Conf. on Management of Data, 1989.
[Ull82] J. D. Ullman. Principles of Database Systems. Computer Science Press, Rockville, Md., 1982.
Appendix A  A Smoother Approximation
The function $\phi_n$ of Definition 4.2 can be approximated more smoothly than the method of Section 7 as follows:

    &{61, . . ..6k}=& (%!(ni;’(’a))

This approximation can be computed in closed form, whereas the original definition required an iterative computation.