On Estimating Access Costs in Relational Databases
North-Holland
In the last decade relational data base management systems received a great deal of attention from researchers; they offer, in fact, many advantages over more conventional systems. From the user’s point of view one of the most appealing features is probably the fact that data bases may be managed via high-level non-procedural query languages. On the other hand, the performance of a relational DBMS is a critical factor when compared with other systems. For this reason, considerable research effort has been made on developing complex software modules, commonly called optimizers, whose task is the choice of an efficient access path for a given query [1-4].

The optimizer generally compares different choices from among the available access paths in order to reduce the query cost both in terms of the number of page fetches and CPU time. The capability to select the best access path depends greatly on the degree of formulae refinement and the data base parameters used by the optimizer in cost prediction. In this paper we propose a method to evaluate the cost of accessing tuples of a relation via an index. This method takes into account the measurement of possible clustering of column values in data pages, thus overcoming some of the limits of the most common approaches [5-8], which are based on uniform distribution assumptions.

In order to correctly evaluate the cost of executing a selection statement (or the selection part of an update) over a relation via an index, the optimizer must have a statistical model of the data available. The model should be able to estimate for a given column value the expected number of tuples containing that value and the number of page fetches needed to retrieve those records. The most common approach [2,5-9] relies on simplifying assumptions:
(a) Distribution of column values: The values of each column are uniformly distributed over the records; furthermore, column values are uniformly distributed over the domain;
(b) Distribution of the records with a given column value over the relation pages: Two cases are considered for a given relation:
    (b') the sorted column, where it exists, forces a correspondence between data pages and column values;
    (b'') the unsorted column, where column values are assumed uniformly distributed over the relation tuples and, consequently, over the pages.
These assumptions are justified since monitoring and maintaining statistics are complex operations except for those cases that are easily represented by well-known distributions. To clarify the use of
Volume 19, Number 3 INFORMATION PROCESSING LETTERS 19 October 1984
more likely that (3) and (4) would give accurate results, if there is no dependence between DNO and EMPNO.

In our experience with the use of the relational system EASIER [11,12], developed at the University of Bologna, in relations storing information about

in some pages. Let us define

$$\mathrm{PID}_i = \sum_k \mathrm{NPID}_{i,k}$$

as the sum over all column values in the index of PID numbers. $\mathrm{PID}_i$ will always lie within
clustering of column values in some pages. Formulae (3) and (4) are not.
(2) We avoid the problems due to the approximation in (3) or the number of iterations in (4).

The method proposed is suitable for systems that do not collect statistics on key values and query frequencies as a practical alternative to the use of (3) and (4) without affecting the actual optimizers with heavy modifications.

(PTNO, PTYPE, SUPPNO, SHELF, ..., QTY, PRICE)
sorted on SHELF, with indexes on PTYPE (part type) and SUPPNO (supplier number). We have to estimate the cost of accessing the tuples for the following query:

    select PTNO, PRICE, QTY
    from SPARE-R
    where PTYPE = 'SCREW'
    and SUPPNO = 15

Let us assume that

$$\mathrm{NTPL} = 6000, \quad \mathrm{NPAG} = 300, \quad N_{\mathrm{SUPPNO}} = 60, \quad N_{\mathrm{PTYPE}} = 50.$$

values are clustered to a lesser extent. If the measured PID numbers are

$$\mathrm{PID}_{\mathrm{SUPPNO}} = 4000 \ (\text{i.e., } cf_{\mathrm{SUPPNO}} = 1.5), \qquad \mathrm{PID}_{\mathrm{PTYPE}} = 2000 \ (\text{i.e., } cf_{\mathrm{PTYPE}} = 3)$$

the expected costs using (5) are

$$EP_{\mathrm{SUPPNO}} = \tfrac{1}{60} \cdot 4000 \approx 67, \qquad EP_{\mathrm{PTYPE}} = \tfrac{1}{50} \cdot 2000 = 40.$$

In this article we have proposed an operational method for the evaluation of the cost of accessing tuples of a relation via an index. The clustering of column values in data pages is taken into account in the cost evaluation. The proposed method uses the average number of different page identifiers associated in the index with each column value. This number may be obtained by means of a simple measurement. Other commonly used methods, assuming uniform distribution of selected tuples in the relation pages, tend to overestimate the costs and may lead the query optimizer to wrong access path choices.
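The figures of the SPARE-R example can be checked with a short sketch. It rests on two relations that are inferred from the arithmetic of the example rather than quoted from formula (5), which is not reproduced in this excerpt: the expected number of page fetches per column value is taken as PID divided by the number of distinct values, and the clustering factor cf as NTPL divided by PID. The function names and the representation of an index as (column value, page identifier) pairs are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def measure_pid(index_entries):
    """Return (PID, N) for one indexed column.

    index_entries: iterable of (column_value, page_id) pairs, one per tuple.
    This layout is an assumption made for the sketch, not the paper's format.
    """
    pages_by_value = defaultdict(set)
    for value, page_id in index_entries:
        pages_by_value[value].add(page_id)
    # PID: sum, over all column values, of the number of distinct page ids.
    pid = sum(len(pages) for pages in pages_by_value.values())
    return pid, len(pages_by_value)


def expected_pages(pid, n_values):
    """Average number of distinct pages per column value (assumed form of (5))."""
    return pid / n_values


def clustering_factor(ntpl, pid):
    """Assumed definition cf = NTPL / PID, consistent with the example figures."""
    return ntpl / pid


if __name__ == "__main__":
    # Toy index: six tuples, three distinct values, spread over three pages.
    toy = [("SCREW", 1), ("SCREW", 1), ("SCREW", 2),
           ("BOLT", 2), ("BOLT", 2), ("NUT", 3)]
    print(measure_pid(toy))

    # Figures from the SPARE-R example.
    ntpl = 6000
    for column, pid, n_values in [("SUPPNO", 4000, 60), ("PTYPE", 2000, 50)]:
        print(column, clustering_factor(ntpl, pid), expected_pages(pid, n_values))
```

Under these assumptions the index on PTYPE gives the lower estimate, so an optimizer comparing only these two access paths would prefer it.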
[6] S.B. Yao, Approximating block accesses in data base organizations, Comm. ACM 20 (4) (1977).
[7] K.Y. Whang, G. Wiederhold and D. Sagalowitz, Estimating block accesses in database organizations: A closed noniterative formula, Comm. ACM 26 (11) (1983).
[8] W.S. Luk, On estimating block accesses in database organization, Comm. ACM 26 (11) (1983).
[9] T.-Y. Cheung, A statistical model for estimating the number of records in a relational database, Inform. Process. Lett. 15 (3) (1982) 115-118.
[10] M.M. Astrahan et al., System R: A relational approach to database management, ACM TODS 1 (2) (1976).
[11] S. Bergamaschi and F. Bonfatti, EASIER: Un linguaggio relazionale per utenti finali, Rivista di Informatica 10 (2) (1980) (in Italian).
[12] F. Bonfatti, D. Maio, M. Spadoni and P. Tiberio, An indexing technique for relational data bases, Proc. IEEE COMPSAC, Chicago, 1980.
[13] C.T. Yu, W.S. Luk and M.K. Siu, On the estimation of the number of desired records with respect to a given query, ACM TODS 3 (1) (1978).
[14] R. Demolombe, Estimation of the number of tuples satisfying a query expressed in relational algebra, Proc. VLDB 1980, Montreal, 1980.
[15] S. Christodoulakis, Estimating selectivities in data bases, CSRG 136, Ph.D. Thesis, University of Toronto, 1981.
[16] S. Christodoulakis, A multivariate statistical model for data base performance evaluation, Applied Probability and Computer Science Conf., Florida, 1981.
[17] M. Schkolnick and P. Tiberio, Considerations in developing a design tool for a relational DBMS, Proc. IEEE COMPSAC Conf., 1979; also: Data Base Management in the 1980’s, IEEE cat. n. EHO-181-8, 1981.
[18] F. Bonfatti, D. Maio and P. Tiberio, A separability-based method for secondary index selection in physical database design, in: S. Ceri, ed., Methodology and Tools for Data Base Design (North-Holland, Amsterdam, 1983).