
Information Processing Letters 19 (1984) 157-161 19 October 1984

North-Holland

ON ESTIMATING ACCESS COSTS IN RELATIONAL DATABASES

D. MAIO, M.R. SCALAS and P. TIBERIO


Dipartimento di Elettronica, Informatica e Sistemistica CIOC-CNR, University of Bologna, 40126 Bologna, Italy

Communicated by Kenneth C. Sevcik


Received 20 February 1983
Revised 12 September 1983; 22 March 1984

Keywords: Data management, relational databases, indexes, query optimization

1. Introduction

In the last decade relational data base management systems received a great deal of attention from researchers; they offer, in fact, many advantages over more conventional systems. From the user's point of view one of the most appealing features is probably the fact that data bases may be managed via high-level non-procedural query languages. On the other hand, the performance of a relational DBMS is a critical factor when compared with other systems. For this reason, considerable research effort has been made on developing complex software modules, commonly called optimizers, whose task is the choice of an efficient access path for a given query [1-4].

The optimizer generally compares different choices from among the available access paths in order to reduce the query cost both in terms of the number of page fetches and CPU time. The capability to select the best access path depends greatly on the degree of formulae refinement and the data base parameters used by the optimizer in cost prediction. In this paper we propose a method to evaluate the cost of accessing tuples of a relation via an index. This method takes into account the measurement of possible clustering of column values in data pages, thus overcoming some of the limits of the most common approaches [5-8], which are based on uniform distribution assumptions.

2. Cost formulae

In order to correctly evaluate the cost of executing a selection statement (or the selection part of an update) over a relation via an index, the optimizer must have a statistical model of the data available. The model should be able to estimate for a given column value the expected number of tuples containing that value and the number of page fetches needed to retrieve those records. The most common approach [2,5-9] relies on simplifying assumptions:
(a) Distribution of column values: The values of each column are uniformly distributed over the records; furthermore, column values are uniformly distributed over the domain;
(b) Distribution of the records with a given column value over the relation pages: Two cases are considered for a given relation:
    (b1) the sorted column, where it exists, forces a correspondence between data pages and column values;
    (b2) the unsorted column, where column values are assumed uniformly distributed over the relation tuples and, consequently, over the pages.
These assumptions are justified since monitoring and maintaining statistics are complex operations except for those cases that are easily represented by well-known distributions. To clarify the use of

0020-0190/84/$3.00 © 1984, Elsevier Science Publishers B.V. (North-Holland)


Volume 19, Number 3 INFORMATION PROCESSING LETTERS 19 October 1984

the above assumptions let us make reference to a well-known DBMS which is widely described in the literature: System R [10]. In the following we assume that the reader is familiar with relational DB technology and shall use SQL as an example language to describe the queries. In this paper we shall refer to a DBMS with similar characteristics.

We recall that in System R at most one index can be used for each relation in tuple accessing and the index structure is like a B+-tree. The index leaves contain all the values of the column, each value being followed by the ordered list of the identifiers (TIDs) of the tuples where that value appears. Each TID is composed of a page identifier (called PID in this paper) and an offset inside the page.

In order to show the use of the uniform assumption we refer here to the following relation:

    EMPLOYEE(DNO, EMPNO, NAME, JOB, HYEAR, ...)

which, for each employee, stores the department number, identification number, name, activity and other information. Given a selection query such as

    select NAME
    from EMPLOYEE
    where JOB = 'RESEARCHER'
    and ...

the System R optimizer must evaluate the cost of accessing tuples along different access paths: indexes and sequential relation scan. The cost in terms of page requests (index and data pages) using a secondary index, for instance on JOB (unsorted column), is evaluated as follows. Assume that:
- $N_{\mathrm{JOB}}$ is the number of different jobs;
- NTPL is the number of tuples in EMPLOYEE;
- NPAG is the number of pages in EMPLOYEE;
- $\mathrm{NLEAF}_{\mathrm{JOB}}$ is the number of leaves of the index built on the JOB column.
(For simplicity, in the following we disregard the intermediate index levels and we do not write out the ceiling function in the formulas.)

Because of (a) the expected number of tuples satisfying the predicate JOB = 'RESEARCHER' is

$$ET = \frac{1}{N_{\mathrm{JOB}}} \cdot \mathrm{NTPL} \tag{1}$$

where $1/N_{\mathrm{JOB}}$ is, in this case, the 'predicate filter factor' $f_{\mathrm{JOB}}$. The expected number of leaves to be accessed is

$$EL = f_{\mathrm{JOB}} \cdot \mathrm{NLEAF}_{\mathrm{JOB}}. \tag{2}$$

Because of (b2) the expected number of data pages to be accessed is given by

$$EP = \mathrm{NPAG} \cdot \left(1 - (1 - 1/\mathrm{NPAG})^{ET}\right). \tag{3}$$

This formula is the same as the one introduced by Cardenas [5] and is commonly used in evaluating file access costs. The formula was criticized by Yao [6] since it implies selection with replacement. Yao's formula is iterative:

$$EP = \mathrm{NPAG} \cdot \left[1 - \prod_{i=1}^{ET} \frac{\mathrm{NTPL} - \mathrm{NTPL}/\mathrm{NPAG} - i + 1}{\mathrm{NTPL} - i + 1}\right] \tag{4}$$

when $ET \le \mathrm{NTPL} - \mathrm{NTPL}/\mathrm{NPAG}$, or $EP = \mathrm{NPAG}$ when $ET > \mathrm{NTPL} - \mathrm{NTPL}/\mathrm{NPAG}$.

For the discussion on the limits of these formulas the reader is referred to [7,8].

Both assumptions (a) and (b2) are often unrealistic and may lead to wrong access path choices. Different assumptions for the distribution of column values require refined knowledge of the data; the system catalogs should store adequate distribution parameters and the optimizer should be able to use appropriate evaluation formulas. We note here that distribution of the records with a given column value over the relation pages depends also on the record placement. Consider, for instance, the relation EMPLOYEE, where the activity (JOB) is frequently related to the type of department. In fact, in a research department JOB will often assume the value 'RESEARCHER' and very seldom that of 'LAWYER'. The reverse happens in a legal department. If the relation is sorted on DNO, clusters of some JOB values will occur with high probability in some pages and not in others and formulas (3) and (4) overestimate the cost, while other JOB values, e.g. 'SECRETARY', are distributed more uniformly all over the departments. If, instead, the relation were sorted on EMPNO it is


more likely that (3) and (4) would give accurate results, if there is no dependence between DNO and EMPNO.

In our experience with the use of the relational system EASIER [11,12], developed at the University of Bologna, in relations storing information about census, land use and regional planning, the presence of significant clusters of values in the pages has been observed for some attributes. Consider, for instance, that farms located in given areas present similar types of cultivations and peculiarities: in other words, there are evident clusters due to attribute value dependencies.

Recently, some more involved models have been proposed in the literature, which avoid the uniformity and independence assumptions in estimating tuples and page selectivities [13-15]. In particular, Christodoulakis [15,16] presented a multivariate statistical model which generalizes the previous approaches and shows that careless modelling of attribute correlations, distributions over the domains, data placement and workload characteristics may lead to serious errors in the estimated performance. The study is exhaustive both from the theoretical and experimental point of view. However, more insight must be provided in order to evaluate the increasing computational effort versus performance improvement which can be obtained by implementing these techniques in the optimizer access cost models.

The aim of this paper is to propose a different simplifying evaluation method which overcomes some of the limits of (3) and (4) by taking into account the cluster presence in data pages. Since, as mentioned above, determining the shape of the real distributions and using them for estimating selectivities would significantly affect the structure of catalogs and optimizers of present systems, we propose a simple model which requires only one extra information item that can be easily detected from indexes with a low cost.

Let us, for an index built on column j, define $\mathrm{NTID}_{j,k}$ as the number of TIDs associated with value k of the column j. If we count the number $\mathrm{NPID}_{j,k}$ of different page identifiers in $\mathrm{NTID}_{j,k}$ we have

$$\mathrm{NPID}_{j,k} \le \mathrm{NTID}_{j,k}.$$

The strict inequality holds if the k-th value is clustered in some pages. Let us define

$$\mathrm{PID}_j = \sum_k \mathrm{NPID}_{j,k}$$

as the sum over all column values in the index of PID numbers. $\mathrm{PID}_j$ will always lie within

$$\mathrm{NPAG} \le \mathrm{PID}_j \le \mathrm{NTPL}$$

where NPAG is the number of relation pages, NTPL is the number of relation tuples (TIDs) (note that bounds cannot be reached in all cases).

$\mathrm{PID}_j$ can be easily measured when the index is built, and updated either when the index itself is modified or periodically by performing a sequential scan over the leaves. Obviously, $\mathrm{PID}_j$ must be stored in the system catalog. For column j we introduced in [12] the concept of a clustering factor:

$$cf_j = \mathrm{NTPL}/\mathrm{PID}_j,$$

i.e., the average number of tuples with the same column value per page. Since in this context we assume that each column value equally contributes to the clusters' presence in the pages, we have

$$\mathrm{NPID}_{j,k} = \mathrm{PID}_j/\mathrm{NKEY}_j$$

where $\mathrm{NKEY}_j$ is the cardinality of column j.

Now, for an "=" predicate the expected number of data pages containing tuples satisfying the selection predicate is given by

$$EP_j = \mathrm{NPID}_{j,k}.$$

Referring to the previous example we can replace (3) or (4) simply with

$$EP_{\mathrm{JOB}} = \mathrm{PID}_{\mathrm{JOB}}/N_{\mathrm{JOB}}. \tag{5}$$

Extension to conjunctions and disjunctions is straightforward.

In conclusion, by storing the number of PIDs, $\mathrm{PID}_j$, in the system catalog for each index, in addition to the other characteristics, and by paying a little computation overhead to keep it updated, we obtain the following advantages:
(1) We consider the arithmetic average of the real number of pages to be accessed for each column value, instead of the mean obtained assuming (b2). Our approach is sensitive to the


clustering of column values in some pages. Formulae (3) and (4) are not.
(2) We avoid the problems due to the approximation in (3) or the number of iterations in (4).

The method proposed is suitable for systems that do not collect statistics on key values and query frequencies, as a practical alternative to the use of (3) and (4) that does not require heavy modifications to existing optimizers.

An example

We now illustrate formula (5) derived above and compare it with (3). The example shows that ignoring the effect of clustering may lead an optimizer to incorrect access path evaluations.

Suppose we have the following relation on spare parts:

    SPARE_PT (PTNO, PTYPE, SUPPNO, SHELF, ..., QTY, PRICE)

sorted on SHELF, with indexes on PTYPE (part type) and SUPPNO (supplier number). We have to estimate the cost of accessing the tuples for the following query:

    select PTNO, PRICE, QTY
    from SPARE_PT
    where PTYPE = 'SCREW'
    and SUPPNO = 15

Let us assume that

$$\mathrm{NTPL} = 6000, \quad \mathrm{NPAG} = 300, \quad N_{\mathrm{SUPPNO}} = 60, \quad N_{\mathrm{PTYPE}} = 50.$$

Formula (3) estimates

$$EP_{\mathrm{SUPPNO}} = 86, \quad EP_{\mathrm{PTYPE}} = 100.$$

Therefore, in the case of a system that uses one index per table to access tuples, the optimizer would choose the index on SUPPNO. Let us now assume that a supplier may provide parts classified under several part types and parts of a part type may be provided by different suppliers. Consider that the shelves hold parts of the same type; therefore, sorting the relation on SHELF implies a significant cluster of PTYPE values, while SUPPNO values are clustered to a lesser extent. If the measured PID numbers are

$$\mathrm{PID}_{\mathrm{SUPPNO}} = 4000 \ (\text{i.e., } cf_{\mathrm{SUPPNO}} = 1.5),$$
$$\mathrm{PID}_{\mathrm{PTYPE}} = 2000 \ (\text{i.e., } cf_{\mathrm{PTYPE}} = 3)$$

the expected costs using (5) are

$$EP_{\mathrm{SUPPNO}} = \tfrac{1}{60} \cdot 4000 = 67,$$
$$EP_{\mathrm{PTYPE}} = \tfrac{1}{50} \cdot 2000 = 40.$$

Therefore, ignoring clustering effects leads in this example to overestimating the access costs and, even worse, to a wrong choice of the access path. Furthermore, it should be noted that more realistic cost estimates play an important role in physical design, especially when this is carried out using the optimizer itself as the cost model [17,18].

3. Conclusions

In this article we have proposed an operational method for the evaluation of the cost of accessing tuples of a relation via an index. The clustering of column values in data pages is taken into account in the cost evaluation. The proposed method uses the average number of different page identifiers associated in the index with each column value. This number may be obtained by means of a simple measurement. Other commonly used methods, assuming uniform distribution of selected tuples in the relation pages, tend to overestimate the costs and may lead the query optimizer to wrong access path choices.

References

[1] P.G. Selinger et al., Access path selection in a relational database management system, Proc. ACM SIGMOD Conf., Boston, 1979.
[2] M.M. Astrahan, M. Schkolnick and W. Kim, Performance of the SYSTEM R access path selection mechanism, Proc. IFIP 80 Conf., Melbourne, 1980.
[3] E. Wong and K. Youssefi, Decomposition: A strategy for query processing, ACM TODS 1 (3) (1979).
[4] S.B. Yao, Optimization of query evaluation algorithms, ACM TODS 4 (2) (1979).
[5] A.F. Cardenas, Analysis and performance of inverted data base structures, Comm. ACM 18 (5) (1975).


[6] S.B. Yao, Approximating block accesses in data base organizations, Comm. ACM 20 (4) (1977).
[7] K.Y. Whang, G. Wiederhold and D. Sagalowitz, Estimating block accesses in database organizations: A closed noniterative formula, Comm. ACM 26 (11) (1983).
[8] W.S. Luk, On estimating block accesses in database organization, Comm. ACM 26 (11) (1983).
[9] T.-Y. Cheung, A statistical model for estimating the number of records in a relational database, Inform. Process. Lett. 15 (3) (1982) 115-118.
[10] M.M. Astrahan et al., System R: A relational approach to database management, ACM TODS 1 (2) (1976).
[11] S. Bergamaschi and F. Bonfatti, EASIER: Un linguaggio relazionale per utenti finali, Rivista di Informatica 10 (2) (1980) (in Italian).
[12] F. Bonfatti, D. Maio, M. Spadoni and P. Tiberio, An indexing technique for relational data bases, Proc. IEEE COMPSAC, Chicago, 1980.
[13] C.T. Yu, W.S. Luk and M.K. Siu, On the estimation of the number of desired records with respect to a given query, ACM TODS 3 (1) (1978).
[14] R. Demolombe, Estimation of the number of tuples satisfying a query expressed in relational algebra, Proc. VLDB 1980, Montreal, 1980.
[15] S. Christodoulakis, Estimating selectivities in data bases, CSRG 136, Ph.D. Thesis, University of Toronto, 1981.
[16] S. Christodoulakis, A multivariate statistical model for data base performance evaluation, Applied Probability and Computer Science Conf., Florida, 1981.
[17] M. Schkolnick and P. Tiberio, Considerations in developing a design tool for a relational DBMS, Proc. IEEE COMPSAC Conf., 1979; also: Data Base Management in the 1980's, IEEE cat. n. EHO-181-8, 1981.
[18] F. Bonfatti, D. Maio and P. Tiberio, A separability-based method for secondary index selection in physical database design, in: S. Ceri, ed., Methodology and Tools for Data Base Design (North-Holland, Amsterdam, 1983).
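
As an illustrative sketch only (it is not taken from the article), the page estimates of the SPARE_PT example can be reproduced by coding formulas (1), (3), (4) and (5) directly. The function names, the `columns` dictionary and the final rounding up with `math.ceil` are assumptions made here; the article states only that ceilings are left implicit, and rounding up happens to reproduce the figures it quotes for (3) and (5).

```python
import math

def expected_tuples(ntpl: int, nkey: int) -> float:
    """Formula (1): ET = NTPL / NKEY under the uniformity assumption (a)."""
    return ntpl / nkey

def cardenas_pages(npag: int, et: float) -> float:
    """Formula (3): EP = NPAG * (1 - (1 - 1/NPAG) ** ET)."""
    return npag * (1.0 - (1.0 - 1.0 / npag) ** et)

def yao_pages(ntpl: int, npag: int, et: int) -> float:
    """Formula (4): selection without replacement, evaluated iteratively."""
    per_page = ntpl / npag
    if et > ntpl - per_page:
        return float(npag)
    prod = 1.0
    for i in range(1, et + 1):
        prod *= (ntpl - per_page - i + 1) / (ntpl - i + 1)
    return npag * (1.0 - prod)

def clustered_pages(pid: int, nkey: int) -> float:
    """Formula (5): EP_j = PID_j / NKEY_j."""
    return pid / nkey

# Figures quoted in the SPARE_PT example.
NTPL, NPAG = 6000, 300
columns = {
    "SUPPNO": {"nkey": 60, "pid": 4000},
    "PTYPE": {"nkey": 50, "pid": 2000},
}

def estimates() -> dict:
    """Page estimates per indexed column, rounded up (an assumption)."""
    out = {}
    for name, c in columns.items():
        et = expected_tuples(NTPL, c["nkey"])
        out[name] = {
            "cardenas": math.ceil(cardenas_pages(NPAG, et)),
            "yao": math.ceil(yao_pages(NTPL, NPAG, round(et))),
            "clustered": math.ceil(clustered_pages(c["pid"], c["nkey"])),
        }
    return out

def cheapest(est: dict, formula: str) -> str:
    """Index an optimizer would pick if it trusted the given formula."""
    return min(est, key=lambda name: est[name][formula])

if __name__ == "__main__":
    est = estimates()
    for name, row in est.items():
        print(name, row)
    print("choice under (3):", cheapest(est, "cardenas"))
    print("choice under (5):", cheapest(est, "clustered"))
```

Under these assumptions the sketch selects the SUPPNO index when formula (3) is trusted and the PTYPE index when formula (5) is used, which is the reversal the example describes.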

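
The measurement of the PID number described in Section 2 can be sketched as a single pass over the index leaves. This is a minimal illustration, assuming the leaves are available as an in-memory mapping from each column value to its list of TIDs, with each TID modelled as a (page identifier, offset) pair; the mapping layout, the function names and the toy data are hypothetical and are not the structures of System R or EASIER.

```python
from typing import Dict, Hashable, List, Tuple

TID = Tuple[int, int]  # (page identifier, offset inside the page)
IndexLeaves = Dict[Hashable, List[TID]]

def measure_pid(leaves: IndexLeaves) -> int:
    """PID_j: sum over all column values k of NPID_{j,k},
    the number of distinct page identifiers among the TIDs of value k."""
    return sum(len({page for page, _ in tids}) for tids in leaves.values())

def clustering_factor(leaves: IndexLeaves) -> float:
    """cf_j = NTPL / PID_j, with NTPL taken as the total number of TIDs."""
    ntpl = sum(len(tids) for tids in leaves.values())
    return ntpl / measure_pid(leaves)

def expected_pages_equality(leaves: IndexLeaves) -> float:
    """Formula (5): EP_j = PID_j / NKEY_j for an "=" predicate."""
    return measure_pid(leaves) / len(leaves)

# Hypothetical toy index on JOB over a 3-page relation with 9 tuples.
toy_index: IndexLeaves = {
    "RESEARCHER": [(1, 0), (1, 1), (1, 2), (2, 0)],  # clustered: 2 pages
    "LAWYER": [(3, 0), (3, 1)],                      # clustered: 1 page
    "SECRETARY": [(1, 3), (2, 1), (3, 2)],           # spread: 3 pages
}

if __name__ == "__main__":
    print("PID_j =", measure_pid(toy_index))
    print("cf_j  =", clustering_factor(toy_index))
    print("EP_j  =", expected_pages_equality(toy_index))
```

In the toy data the PID number is 6, which lies between NPAG = 3 and NTPL = 9 as the bounds in Section 2 require. A real system would keep this counter in the catalog and refresh it during index maintenance or a periodic leaf scan, as the text describes.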
