
Applying Cosine Series to XML Structural Join Size Estimation

Cheng Luo, Zhewei Jiang, Wen-Chi Hou, Qiang Zhu, and Chih-Fang Wang
1 Computer Science Department, Southern Illinois University Carbondale,
Carbondale, IL 62901, U.S.A.
{cluo, zjiang, hou, wang}@cs.siu.edu
2 Computer and Information Science Department, University of Michigan,
Dearborn, MI 48128, U.S.A.
qzhu@umich.edu

Abstract. As XML has become the de facto standard for data presentation
and exchange on the Web, XML query optimization has emerged as an
important research issue. It is widely accepted that structural joins,
which evaluate the containment (ancestor-descendant) relationships
between XML elements, are important to XML query processing. Estimating
structural join size accurately and quickly thus becomes crucial to the
success of XML query plan selection. In this paper, we propose to apply
the Cosine transform to structural join size estimation. Our approach
captures structural information of XML data using mathematical
functions, which are then approximated by the Cosine series. We derive
a simple formula to estimate the structural join size using the Cosine
series. Theoretical analyses and extensive experiments have been
performed. The experimental results show that, compared with the
state-of-the-art IM-DA-Est method, our method is several orders of
magnitude faster, requires less memory, and yields better or comparable
estimates.

1 Introduction
Extensible Markup Language (XML) has recently become the de facto standard
for presenting, storing, and exchanging data on the Internet. Queries over XML
data are usually specified as pattern trees [12] or path expressions [3,5].
Existing approaches that estimate the XML query selectivity follow two trends.
One is to estimate the selectivity of path expressions or pattern trees [1,4,7,13].
Methods in this direction rely on some statistics to capture the structures of XML
documents. The other trend is to identify the key operations performed in the
query and estimate the selectivity of these operations. Since trees can be viewed
as collections of paths, and paths can be further interpreted as links between
pairs of XML nodes, structural joins that study the structural relationships
between pairs of XML nodes have been recognized as vital operations of XML
queries. A structural join between an ancestor set A and a descendant set D
finds all pairs (x, y) such that x ∈ A, y ∈ D, and x contains y.
Due to the importance of structural join operations, a variety of methods have
been proposed. While most of them concentrate on efficient execution of
structural join operations [9,17,2,11], few [16,15] address the issue of structural join

S. Bressan, J. Küng, and R. Wagner (Eds.): DEXA 2006, LNCS 4080, pp. 761–770, 2006.

© Springer-Verlag Berlin Heidelberg 2006

size estimation, which is nevertheless crucial to the query optimization because


from which selectivity of the paths or trees can be derived easily.
Wu et al. [16] proposed the PH histogram and the Coverage histogram, while
Wang et al. [15] proposed the adaptive PL histogram, as well as two sampling
methods called IM-DA-Est and PM-Est. The PH histogram and Coverage
histogram [16] represent the entire XML dataset as a two-dimensional feature space
and partition this space into predefined grid cells. Each grid cell is associated
with a count that indicates the number of nodes that fall in it. A structural join
is then estimated according to the spatial relationships between grid cells. The
PL histogram models the XML dataset as a one-dimensional feature space and
a structural join is computed as the sum of the average number of descendant
nodes contained in each bucket. IM-DA-Est and PM-Est [15] perform structural
joins on samples and then scale up the results proportionally to estimate the
structural join size. It has been shown [15] that the IM-DA-Est method of
Wang et al. provides the best estimates among all the methods discussed.
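The sample-and-scale idea behind the sampling methods can be illustrated with a minimal sketch. This is a generic uniform-sampling estimator, not IM-DA-Est or PM-Est themselves, whose details are given in [15]. The function names, the `(start, end)` interval representation of a node, and the containment test are assumptions made for illustration only.

```python
import random

def exact_join_size(ancestors, descendants):
    """Count pairs (a, d) where interval a contains interval d.

    Nodes are hypothetical (start, end) tuples; this representation is an
    assumption of the sketch, not taken from the sampling methods in [15].
    """
    return sum(
        1
        for (a_start, a_end) in ancestors
        for (d_start, d_end) in descendants
        if a_start <= d_start and d_end <= a_end
    )

def sampled_join_estimate(ancestors, descendants, sample_size, seed=0):
    """Join a uniform sample of ancestors and scale the count up."""
    rng = random.Random(seed)
    sample = rng.sample(ancestors, sample_size)
    hits = exact_join_size(sample, descendants)
    # Scale by the inverse sampling fraction |A| / sample_size.
    return hits * len(ancestors) / sample_size

ancestors = [(1, 10), (11, 20), (21, 30)]
descendants = [(2, 3), (4, 5), (12, 13), (40, 41)]
exact = exact_join_size(ancestors, descendants)
full_sample_estimate = sampled_join_estimate(ancestors, descendants, 3)
```

When the sample covers the whole ancestor set, the estimate coincides with the exact join size. Smaller samples trade accuracy for speed.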
In this paper, we approximate the distribution of the nodes that satisfy a
predicate with a small number of Cosine coefficients, and then estimate the
join size by performing simple calculations on these coefficients. The
experimental results show that, compared with the state-of-the-art IM-DA-Est
method, our method can be more than 10^5 times faster, requires much less
memory space, and generates better or comparable results.
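To make the notion of summarizing a distribution by a few Cosine coefficients concrete, the sketch below shows the textbook cosine series estimate of a density on [0, 1]. It assumes node positions have been normalized to [0, 1]. The function names are hypothetical, and the sketch does not reproduce the functions or the join size formula developed in Section 3.

```python
import math

def cosine_coefficients(points, m):
    """Return the first m cosine coefficients a_k = mean(cos(k*pi*x))."""
    n = len(points)
    return [
        sum(math.cos(k * math.pi * x) for x in points) / n
        for k in range(m)
    ]

def cosine_density(coeffs, x):
    """Evaluate f(x) ~ a_0 + 2 * sum_{k>=1} a_k * cos(k*pi*x) on [0, 1]."""
    tail = sum(
        c * math.cos(k * math.pi * x)
        for k, c in enumerate(coeffs)
        if k > 0
    )
    return coeffs[0] + 2.0 * tail

# Hypothetical normalized positions of the nodes satisfying a predicate.
positions = [0.1, 0.2, 0.25, 0.7, 0.9]
coeffs = cosine_coefficients(positions, 8)

# Midpoint-rule check that the approximated density integrates to one.
grid = 1000
mass = sum(cosine_density(coeffs, (j + 0.5) / grid) for j in range(grid)) / grid
```

Only the `m` coefficients need to be stored, regardless of how many nodes contributed to them, which is what makes such a summary compact.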
The rest of the paper is organized as follows. Section 2 briefly reviews related
research in XML structural join size estimation. Section 3 models the distribution
of XML nodes by mathematical functions and applies Cosine series to XML
structural join size estimation. Section 4 compares our method with the
sampling-based method IM-DA-Est. Detailed theoretical analyses and experimental results
are presented. Finally, Section 5 concludes this paper.

2 Related Work
To facilitate structural join operations, Wu et al. [16] proposed a region coding
scheme, which is similar to the one adopted in the Niagara [17] project. The coding
scheme assigns a pair of values, start and end, called the region codes, to each node
in the XML data tree. The region codes specify the nodes’ locations and coverage.
A structural join between an ancestor node a and a descendant node d essentially
evaluates the logical expression a.start ≤ d.start && d.end ≤ a.end.
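As an illustration of the region codes and the containment test above, the following minimal sketch numbers the nodes of a small document with a depth-first counter and computes an exact structural join by brute force. The numbering scheme, function names, and sample document are assumptions made for illustration. The actual scheme in [16] may assign codes differently.

```python
import xml.etree.ElementTree as ET

def assign_region_codes(root):
    """Return (tag, start, end) triples from a depth-first traversal.

    A single counter is incremented on entering and on leaving a node, so
    every descendant's interval nests inside its ancestor's interval.
    """
    codes = []
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        start = counter
        for child in node:
            visit(child)
        counter += 1
        codes.append((node.tag, start, counter))

    visit(root)
    return codes

def contains(a, d):
    # The containment expression quoted in the text.
    return a[1] <= d[1] and d[2] <= a[2]

def structural_join_size(codes, anc_tag, desc_tag):
    """Count (ancestor, descendant) pairs by checking every pair."""
    ancestors = [c for c in codes if c[0] == anc_tag]
    descendants = [c for c in codes if c[0] == desc_tag]
    # Skip pairing a node with itself, which the <= test would accept.
    return sum(
        1
        for a in ancestors
        for d in descendants
        if a is not d and contains(a, d)
    )

doc = ET.fromstring("<a><b><c/></b><a><c/><c/></a></a>")
codes = assign_region_codes(doc)
size_a_c = structural_join_size(codes, "a", "c")
```

The brute-force count costs time proportional to |A| × |D|, which is why an estimator that avoids performing the join is attractive for query optimization.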
Existing techniques for structural join size estimation include histogram- and
sampling-based algorithms [15,16]. The PH histogram [16] maps XML nodes to points
in a two-dimensional space that is partitioned into predefined grid cells. The
structural join size is estimated by examining the spatial relationships between
the grid cells based on the assumption that XML nodes are uniformly distributed
in the two-dimensional space. However, such an assumption could lead to poor
estimation accuracy, especially when the ancestor nodes are not self-nested. The
Coverage histogram [16] is thus proposed to remedy this problem by estimating
the fraction of coverage.
