Applying Cosine Series To XML Structural Join Size Estimation
Cheng Luo, Zhewei Jiang, Wen-Chi Hou, Qiang Zhu, and Chih-Fang Wang
1 Computer Science Department in Southern Illinois University Carbondale,
Carbondale, IL 62901, U.S.A.
{cluo, zjiang, hou, wang}@cs.siu.edu
2 Computer and Information Science Department in University of Michigan,
Dearborn, MI, 48128, U.S.A.
qzhu@umich.edu
Abstract. As XML has become the de facto standard for data presentation
and exchange on the Web, XML query optimization has emerged as an
important research issue. It is widely accepted that structural joins,
which evaluate the containment (ancestor-descendant) relationships
between XML elements, are important to XML query processing. Estimating
structural join size accurately and quickly is thus crucial to the
success of XML query plan selection. In this paper, we propose applying
the Cosine transform to structural join size estimation. Our approach
captures the structural information of XML data using mathematical
functions, which are then approximated by Cosine series. We derive a
simple formula to estimate the structural join size using the Cosine
series. Theoretical analyses and extensive experiments have been
performed. The experimental results show that, compared with the
state-of-the-art IM-DA-Est method, our method is several orders of
magnitude faster, requires less memory, and yields better or comparable
estimates.
1 Introduction
Extensible Markup Language (XML) has recently become the de facto standard
for presenting, storing, and exchanging data on the Internet. Queries over XML
data are usually specified as pattern trees [12] or path expressions [3,5].
Existing approaches that estimate the XML query selectivity follow two trends.
One is to estimate the selectivity of path expressions or pattern trees [1,4,7,13].
Methods in this direction rely on some statistics to capture the structures of XML
documents. The other trend is to identify the key operations performed in the
query and estimate the selectivity of these operations. Since trees can be viewed
as collections of paths, and paths can be further interpreted as links between
pairs of XML nodes, structural joins that study the structural relationships
between pairs of XML nodes have been recognized as vital operations of XML
queries. A structural join between an ancestor set A and a descendant set D
finds all pairs (x, y) such that x ∈ A, y ∈ D, and x contains y.
Due to the importance of structural join operations, a variety of methods have
been proposed. While most of them concentrate on efficient execution of
structural join operations [2,9,11,17], few [15,16] address the issue of structural join
S. Bressan, J. Küng, and R. Wagner (Eds.): DEXA 2006, LNCS 4080, pp. 761–770, 2006.
© Springer-Verlag Berlin Heidelberg 2006
2 Related Work
To facilitate structural join operations, Wu et al. [16] proposed a region coding
scheme, which is similar to the one adopted in the Niagara [17] project. The coding
scheme assigns a pair of values, start and end, called the region codes, to each node
in the XML data tree. The region codes specify the nodes’ locations and coverage.
A structural join between an ancestor node a and a descendant node d essentially
evaluates the logical expression a.start ≤ d.start && d.end ≤ a.end.
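The region coding and the containment test can be illustrated with a small sketch. It assumes one common way of producing region codes (a depth-first traversal with a single counter); the exact numbering used in [16] may differ. The names Node, assign_region_codes, contains, and structural_join_size are hypothetical, and excluding a node from containing itself is an added assumption for the case where A and D overlap.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    start: int = 0
    end: int = 0


def assign_region_codes(root: Node) -> None:
    """Assign (start, end) codes with a depth-first traversal (assumed scheme)."""
    counter = 0

    def visit(node: Node) -> None:
        nonlocal counter
        counter += 1
        node.start = counter          # code on entering the node
        for child in node.children:
            visit(child)
        counter += 1
        node.end = counter            # code on leaving the node

    visit(root)


def contains(a: Node, d: Node) -> bool:
    """Containment predicate from the text; self-pairs excluded by assumption."""
    return a is not d and a.start <= d.start and d.end <= a.end


def structural_join_size(ancestors: List[Node], descendants: List[Node]) -> int:
    """Exact join size by brute force: count all containing pairs."""
    return sum(1 for a in ancestors for d in descendants if contains(a, d))


# Hypothetical document: a book with two chapters holding three sections.
sections = [Node("section") for _ in range(3)]
chapters = [Node("chapter", sections[:2]), Node("chapter", sections[2:])]
book = Node("book", chapters)
assign_region_codes(book)

chapter_section_pairs = structural_join_size(chapters, sections)
```

The brute-force count costs |A| x |D| comparisons, which is the cost an estimator is meant to avoid at query optimization time.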
Existing techniques for structural join size estimation include histogram- and
sampling-based algorithms [15,16]. The PH histogram [16] maps XML nodes to points
in a two-dimensional space that is partitioned into predefined grid cells. The
structural join size is estimated by examining the spatial relationships between
the grid cells based on the assumption that XML nodes are uniformly distributed
in the two-dimensional space. However, such an assumption could lead to poor
estimation accuracy, especially when the ancestor nodes are not self-nested. The
Coverage histogram [16] was thus proposed to remedy this problem by estimating
the fraction of coverage.
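To make the grid idea concrete, the following is a deliberately simplified sketch and not the PH estimation formula of [16]. It assumes that each node is mapped to the point (start, end), that the grid is g x g over the code range, and that start and end values are uniform and independent within a cell, so that two values in the same bucket satisfy the required ordering with probability 1/2. The names build_grid, fraction_leq, and estimate_join_size are hypothetical.

```python
from collections import Counter


def build_grid(points, max_code, g):
    """Count (start, end) points per cell of a g x g grid over [0, max_code]."""
    width = max_code / g
    hist = Counter()
    for start, end in points:
        i = min(int(start / width), g - 1)   # bucket of the start code
        j = min(int(end / width), g - 1)     # bucket of the end code
        hist[(i, j)] += 1
    return hist


def fraction_leq(low_bucket, high_bucket):
    """P(value in low_bucket <= value in high_bucket) under uniformity."""
    if low_bucket < high_bucket:
        return 1.0
    if low_bucket == high_bucket:
        return 0.5                            # assumed: independent uniform values
    return 0.0


def estimate_join_size(anc_hist, desc_hist):
    """Sum expected containing pairs over all (ancestor cell, descendant cell)."""
    total = 0.0
    for (a_start, a_end), n_anc in anc_hist.items():
        for (d_start, d_end), n_desc in desc_hist.items():
            p_start = fraction_leq(a_start, d_start)   # a.start <= d.start
            p_end = fraction_leq(d_end, a_end)         # d.end <= a.end
            total += n_anc * n_desc * p_start * p_end
    return total


# Hypothetical region codes: two ancestors and three descendants.
ancestor_points = [(2, 7), (8, 11)]
descendant_points = [(3, 4), (5, 6), (9, 10)]
anc_hist = build_grid(ancestor_points, max_code=12, g=4)
desc_hist = build_grid(descendant_points, max_code=12, g=4)
estimate = estimate_join_size(anc_hist, desc_hist)
```

On this tiny input the sketch yields an estimate of 2.0 against an exact join size of 3. The gap comes from the cells that share a bucket, which is where the uniformity assumption does the most work.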