
Li GL, Feng JH. An effective semantic cache for exploiting XPath query/view answerability. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 25(2): 347-361 Mar. 2010

An Effective Semantic Cache for Exploiting XPath Query/View Answerability


Guo-Liang Li, Member, CCF, ACM, and Jian-Hua Feng, Senior Member, CCF, Member, ACM, IEEE

Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China
E-mail: {liguoliang, fengjh}@tsinghua.edu.cn

Received April 2, 2008; revised October 9, 2009.

Abstract  Maintaining a semantic cache of materialized XPath views inside or outside the database is a novel, feasible and efficient approach to facilitating XML query processing. However, most of the existing approaches suffer from the following disadvantages: 1) they cannot discover enough potential cached views to effectively answer subsequent queries; or 2) they are inefficient for view selection due to the complexity of XPath expressions. In this paper, we propose SCEND, an effective Semantic Cache based on dEcompositioN and Divisibility, to exploit XPath query/view answerability. The contributions of this paper include: 1) a novel technique of decomposing complex XPath queries into much simpler ones, which can discover more potential views to answer a new query than the existing methods and thus adequately exploits the query/view answerability; 2) an efficient view-selection method that checks the divisibility between two positive integers assigned to queries and views; 3) a cache-replacement approach that further enhances the query/view answerability; 4) an extensive experimental study which demonstrates that our approach achieves higher performance and outperforms the existing state-of-the-art alternative methods significantly.

Keywords  XML query processing, semantic cache, view selection, cache lookup

1 Introduction

XML is increasingly being used in data-intensive applications and has become the de facto standard for data exchange over the Internet. Major database vendors are incorporating native XML support into the latest versions of their relational database products. The number and size of XML databases are rapidly increasing, and XML data have become the focus of query evaluators and optimizers. In a relational database system, the in-memory buffer cache is crucial for good performance, and a similar buffer cache can be employed in XML systems. Maintaining a semantic cache of query results has been proposed[1-3]. Such caches address the computational cost and complement the buffer cache. The cached queries are essentially materialized views, which can be used in query processing. Thus, at any moment, the semantic cache contains some views {V1, V2, ..., Vn}. When the system has to evaluate a new query Q, it inspects each view Vi in the cache and determines whether it is possible to answer Q from the cached result of Vi. We say that view Vi answers query Q if there exists a query CQ which, when executed on the result of Vi, gives the result of Q. We denote this as CQ ∘ Vi = Q, and call CQ the Compensating Query. When some cached view can answer an issued query, we have a hit; otherwise we have a miss.

There are several applications for such a semantic cache. Firstly, consider its use inside the XML database system. Suppose query Q can be answered by view V with compensating query CQ. Then we can answer Q by executing CQ, which is simpler than Q, on the result of V, which is a much smaller XML fragment than the original data instance. This can result in a significant speedup, as we show in our experiments. Secondly, the semantic cache can also be maintained at the application tier. Here, there are additional savings on a hit, from not having to connect to the backend database. For a heavily loaded backend server, these savings can be large. This kind of middle-tier caching has become popular for Web applications using relational databases[4]. Further, the semantic cache can also be maintained in a

Regular Paper
This work is partly supported by the National Natural Science Foundation of China under Grant No. 60873065, the National High Technology Research and Development 863 Program of China under Grant Nos. 2007AA01Z152 and 2009AA011906, and the National Basic Research 973 Program of China under Grant No. 2006CB303103.
©2010 Springer Science + Business Media, LLC & Science Press, China


different database system, on a remote host. Thus, unlike the page-based buffer cache, it can also be employed in a distributed setting. Finally, the semantic cache can be employed in a setting like distributed XQuery[5], where sub-queries of a query may refer to remote XML data sources connected over a WAN. Here, a sub-query that hits in the local cache does not have to be sent over the network, and the savings can be huge.

Checking query/view answerability requires matching operations between the tree patterns of the query and the view. Looking up the semantic cache by iterating over all the views is rather inefficient when the number of views is large. Mandhani and Suciu[2] proposed a semantic cache which maintains a table for XML views in the relational database and needs string matching and other complicated operations, and thus this method is not cost-efficient. Further, it cannot discover sufficient views to answer a new query. We present some examples to show how queries are answered from the cache and to illustrate the disadvantages of existing studies. They will clarify the challenges in doing efficient lookup in a large cache and also illustrate query rewriting for cache hits.

Example 1. Suppose there is a cached view V and seven queries Q1, Q2, ..., Q7 as shown in Fig.1.

V = a[//e]//b[c[//a][b]][a]/d[//b][c > 50],
Q4 = a[//e]//b[c[//a][b]][a]/d[//a[b][d]][c > 50],
Q5 = a[//e]//b[c[//a][b]][a]/d[//a[b]][c > 100].

It is obvious that the results of V contain the results of Q4 and Q5. Consider Q4 with CQ4 = d[//a[b][d]]: we only need to check whether the d elements in the result of V satisfy CQ4. Note that processing CQ4 on V is much easier than processing Q4 on the original instance. Similarly, for Q5 with CQ5 = d[//a[b]][c > 100], we only need to check whether the d elements in the result of V satisfy CQ5. We need not process Q4 and Q5 with traditional XML query processing methods; instead, we construct compensating queries and answer these simpler queries, which saves I/O and improves query performance.

It is easy to find that the results of V also contain the results of Q1, Q2 and Q3. However, in a naive cache system, none of Q1, Q2, ..., Q5 can be answered by V, because they are not equivalent to V. Even in [2], Q1, Q2 and Q3 cannot be answered by V, as they do not satisfy the string match. Moreover, although Q6 does match V as a string, V obviously cannot answer Q6. Further, the method proposed there cannot support // and wildcards in XPath queries.

To address the above-mentioned problems, in this paper we demonstrate how to discover V to answer Q effectively. We propose SCEND, an efficient Semantic Cache based on dEcompositioN and Divisibility. In SCEND, V can answer Q1, Q2, ..., Q5. Most importantly, we can effectively filter out Q6 and Q7 during view selection. To summarize, we make the following contributions:

We propose SCEND, an efficient Semantic Cache based on dEcompositioN and Divisibility, for effective query caching and view selection, which can significantly improve the query/view answerability.

Fig.1. A cached view and seven queries. V is a cached query and Q1, Q2, ..., Q7 are seven user-issued queries. We describe how to use V to answer queries Q1, Q2, ..., Q7.


We introduce a novel technique for effective view selection by checking the divisibility of two positive integers assigned to queries and views, which can significantly improve the efficiency of view selection.

We demonstrate an effective technique to exploit the cache answerability by decomposing complex XPath queries into much simpler ones, which can discover sufficient cached views to answer queries and thus improve the cache hit rate.

We have implemented our proposed approach and conducted an extensive performance study using both real and synthetic datasets with various characteristics. The results show that our algorithm achieves high performance and outperforms existing state-of-the-art approaches[2] significantly.

The rest of this paper is organized as follows. We start with the background and introduce some preliminaries in Section 2. Section 3 proposes a novel strategy for efficient cache lookup and Section 4 presents a new technique for effective view selection. We devise effective algorithms for view selection and query rewriting in Section 5. Section 6 proposes a novel method for cache replacement. In Section 7, we provide our experimental results; we review related work in Section 8. Finally, we conclude in Section 9.

2 Problem Statement

2.1 Preliminaries

This subsection formally introduces query/view answerability. The question we consider is: given a view V and a query Q, does V answer Q, and if yes, what should CQ be so that CQ ∘ V = Q? For a given view V, selecting which nodes (and their answers) to cache is an important problem. Note that the more nodes are selected to cache, the higher the hit rate is, but the more storage is needed to cache them. With limited memory, there is a trade-off between caching more nodes of a given query and caching more queries. In addition, which node is selected to cache also influences the performance of the XML-DBMS.

Note that the result of V containing that of Q does not imply that V can answer Q. For example, suppose V = a[c]//b/d[e] and Q = a[c]/b/d[e]; the result of Q is contained in that of V. If only the result of the returned node d of V is cached, it is impossible to answer Q based on the result of V: as there are no results of nodes a, b in the cache, we do not know which d element in the cached view satisfies a/b/d. However, if the results of nodes a, b in V are also cached, we can use the results of V to answer Q as follows. We first get the element sets Sa, Sb by selecting elements that satisfy a/b from the cached results of a, b respectively, and then compute the result set Sd by selecting the elements in the result of d that have a parent in Sb. In this paper, to improve query/view answerability, besides caching the result of the returned node, we also cache the results of some other nodes. We will introduce the techniques of view selection and query processing with cached views in the following subsections.

2.2 Notations

Definition 1 (Tree Pattern). A tree pattern is a labeled tree TP = (V, E), where V is the vertex set and E is the edge set. Each vertex v has a label, denoted by v.label, in tagSet ∪ {*}, where tagSet is the set of all element names in the context. An edge can be a child edge (P-C edge) representing the parent-child relationship (/) or a descendant edge (A-D edge) representing the ancestor-descendant relationship (//).

In this paper, XPath queries are naturally represented as tree patterns. We will use these tree patterns to derive a sound procedure for answering the question whether V can answer Q and how to construct a compensating query CQ. We present an example showing how to represent an XPath query as a tree pattern. Fig.1 shows the tree pattern for V = a[//e]//b[c[//a][b]][a]/d[//b][c > 50]. Child and descendant axes are denoted by a single slash and double slashes respectively. The ellipse-shaped nodes are predicates qualifying their parent nodes. Note that the returned node d of the query is marked by a yellow circle.

For any view V and query Q, V can answer Q only if the result of Q is contained in that of V. A query p is contained in a query q if and only if there is a homomorphism from q to p[6]; this is a classical characterization result for conjunctive queries against relational databases. Similar characterizations can also be given for some tree patterns. We use the concepts of homomorphism and tree inclusion[7] to define the inclusion between two trees. If we employ the results of V to answer Q, then Q must be included in V; however, this is a necessary but not sufficient condition. Moreover, Tree Pattern Inclusion is complicated and difficult to validate, thus we introduce the concept of Restrictive Tree Pattern Inclusion. The difference between them is that the latter requires the homomorphism h to be an injection. More importantly, Restrictive

A view V denotes both the cached query and its corresponding result; when there is no ambiguity, we also refer to V as the cached query.
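The tree-pattern representation of Definition 1 can be sketched as follows. This is an illustrative data structure of our own, not the paper's implementation: labels come from tagSet or "*", and each node records whether it is reached from its parent by a "/" (P-C) or "//" (A-D) edge.

```python
from dataclasses import dataclass, field
from typing import List

# A minimal sketch of a tree pattern (Definition 1). Names are illustrative.
@dataclass
class PatternNode:
    label: str                      # element name, or "*" for the wildcard
    axis: str = "/"                 # edge type ("/" or "//") to the parent
    children: List["PatternNode"] = field(default_factory=list)
    is_returned: bool = False       # True for the query's returned node

def add_child(parent, label, axis="/", returned=False):
    node = PatternNode(label, axis, [], returned)
    parent.children.append(node)
    return node

# A fragment of V from Fig.1: root a with predicate //e and axis path //b/d.
v = PatternNode("a")
add_child(v, "e", "//")                    # predicate node
b = add_child(v, "b", "//")                # axis node
add_child(b, "a", "/")                     # predicate node
d = add_child(b, "d", "/", returned=True)  # returned axis node
```

Predicate nodes and axis nodes share one node type here; the main path is recovered by following the chain of axis nodes to the returned node.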


Tree Pattern Inclusion is easier to validate than general Tree Pattern Inclusion, which will be further demonstrated in Section 4.

Example 2. In Fig.1, as Q1, Q2, ..., Q5 are restrictively included in V, their results are contained in that of V. For each i, 1 ≤ i ≤ 5, we need to construct a compensating query CQi, which satisfies CQi ∘ V = Qi, to answer Qi. Note that processing CQi on V is much easier than directly answering Qi.

3 SCEND: An Effective Semantic Cache

In this section, we present a novel framework of a semantic cache for effective query caching and view selection.

3.1 Criteria for Answerability
Fig.2. Tree patterns for V and Q.

For simplicity, we first give some notations.

Definition 2 (Main Path). A tree pattern's Main Path is the path from the root node to the returned node in the query tree pattern. Nodes on this path are called axis nodes, while the others are called predicate nodes. The query depth of a tree pattern Q is the number of axis nodes, denoted as Dep(Q).

Definition 3 (Prefix(Q, k) and Predicates). Prefix(Q, k) is the query obtained by truncating query Q at its k-th axis node; the k-th axis node is included, but its predicates are not. Preds(Q, k) is the set of predicates of the k-th axis node of Q. Infix(Q, k) is composed of the k-th axis node and its predicates, without the (k+1)-th axis node. Q_k denotes the subtree of Q rooted at its k-th axis node.

To illustrate these notations, consider the query Q = a[v]/b[@w = val1][x[//y]]//c[z > val2] in Fig.2. The depth of Q is three, and a, b, c are its first, second and third axis nodes respectively. Prefix(Q, 2) = a[v]/b; Preds(Q, 2) = {@w = val1, x[//y]}; Q_2 = b[@w = val1][x[//y]]//c[z > val2]; Infix(Q, 2) = b[@w = val1][x[//y]].

The XPath fragment we cover includes the // axis and node labels. Predicates can be any of the following: equalities with string or numeric constants, comparisons with numeric constants, or an arbitrary XPath expression from this fragment. We also consider join predicates. In this paper, to improve cache answerability, we cache the results of all the axis nodes.

We give a sufficient condition for Q ⊑ V, formalized in Theorem 1. For ease of presentation, we introduce some notations. MainPath(Q, k) denotes the path of Q from the root to the k-th axis node of Q. MainPath(V) denotes the main path of V. The depth

of V is denoted as Dep(V). axisNode(Q, k) denotes the k-th axis node of Q.

Theorem 1. Q ⊑ V if (i) MainPath(Q, Dep(V)) ⊑ MainPath(V); and (ii) ∀k, 1 ≤ k ≤ Dep(V), 1) Infix(Q, k) ⊑ Infix(V, k); and 2) axisNode(Q, k) = axisNode(V, k).

Proof. As MainPath(Q, Dep(V)) ⊑ MainPath(V), there exists a homomorphism h′ from MainPath(V) to MainPath(Q, Dep(V)). As Infix(Q, k) ⊑ Infix(V, k), there exists a homomorphism h_k from Infix(V, k) to Infix(Q, k). As the k-th axis nodes of Q and V are the same, for any node a in MainPath(V), a must be an axis node. Without loss of generality, suppose a_k is the k-th axis node; thus a_k is the root node of Infix(V, k), and h′(a_k) = h_k(a_k) = a_k. Therefore, we can construct h as follows: for each v ∈ V, there exists exactly one k with v ∈ Infix(V, k); let h(v) = h_k(v). It is obvious that h is a homomorphism from V to Q. Hence, Q ⊑ V.

Based on Theorem 1, we give a condition under which V can answer Q.

Definition 4 (V ⊵ Q). V ⊵ Q if V and Q satisfy: (i) MainPath(Q, Dep(V)) ⊑ MainPath(V); and (ii) ∀k, 1 ≤ k ≤ Dep(V), Infix(Q, k) ⊑ Infix(V, k) and axisNode(Q, k) = axisNode(V, k).

Corollary 1. If V ⊵ Q, the result of V contains that of Q, that is, the view V can answer query Q.

Corollary 1 follows directly from Theorem 1, which assures that if V ⊵ Q, then Q can be answered from the result of V. In this paper, if V ⊵ Q, we say that Q can be answered by the cached view; otherwise Q cannot be answered. For example, in Fig.1, for each i, 1 ≤ i ≤ 5, V ⊵ Qi, thus Q1, Q2, ..., Q5 can be answered by V, while V cannot answer Q6 and Q7.
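The decomposition of Definition 3 can be illustrated on a flattened main-path representation. This is a sketch of our own: each axis step carries its axis, label and predicate strings (kept opaque), and the helper names are not the paper's.

```python
# Q = a[v]/b[@w=val1][x[.//y]]//c[z>val2], as a list of (axis, label, preds).
steps = [
    ("/",  "a", ["v"]),
    ("/",  "b", ["@w=val1", "x[.//y]"]),
    ("//", "c", ["z>val2"]),
]

def prefix(steps, k):
    """Prefix(Q,k): truncate at the k-th axis node, dropping only its predicates."""
    parts = []
    for i, (axis, label, ps) in enumerate(steps[:k]):
        step = label if i == k - 1 else label + "".join(f"[{p}]" for p in ps)
        parts.append((axis if i > 0 else "") + step)
    return "".join(parts)

def preds(steps, k):
    """Preds(Q,k): predicate set of the k-th axis node."""
    return steps[k - 1][2]

def infix(steps, k):
    """Infix(Q,k): the k-th axis node together with its predicates."""
    _, label, ps = steps[k - 1]
    return label + "".join(f"[{p}]" for p in ps)
```

On the running example, `prefix(steps, 2)` yields `a[v]/b` and `infix(steps, 2)` yields `b[@w=val1][x[.//y]]`, matching the values given in the text.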


Note that for any tree patterns V and Q, it is much easier to check whether V ⊵ Q holds through Definition 4 than to check whether Q ⊑ V holds. Because the main path contains only /, // and * without [ ], and the complexity of containment for XPath{/,//,*} is proven to be polynomial[8], it is easy to check whether MainPath(Q, Dep(V)) ⊑ MainPath(V) holds. In addition, Infix(Q, k) is simpler than Q, thus it is easy to check whether Infix(Q, k) ⊑ Infix(V, k) holds.

To answer Q, we should construct a compensating query CQ which satisfies CQ ∘ V = Q. We will present how to construct CQ in Subsection 3.2. If more than one cached view satisfies V ⊵ Q, it is better to select the best V, i.e., the one that needs the fewest additional operations on V to answer Q. We will address this issue in the following subsection.

3.2 Compensating Queries

To improve the cache hit rate, we cache the results of all the axis nodes, which improves the performance of the semantic cache. For the example in Fig.1, if we only cache the result of the returned node d of V, V can answer Q4 and Q5. However, if we cache the results of a, b, d instead of the result of d alone, whose space cost is linear in that of caching only d, V can answer Q1, Q2, ..., Q5, and thus the cache hit rate is improved.

Suppose, for each k, 1 ≤ k ≤ Dep(V), CQ_k = Infix(Q, k), VR_k is the cached result of the k-th axis node of V, and Infix(Q, k) is considered as a query taking the k-th axis node as its returned node. Let CQ_MP = MainPath(Q, Dep(V)), D = Dep(V), and let Q_D take the returned node of Q as its own returned node.

Theorem 2. (CQ_MP ∘ (CQ_1 ∘ VR_1, CQ_2 ∘ VR_2, ..., CQ_D ∘ VR_D)) ∘ Q_D = Q.

Proof. As Infix(Q, k) ⊑ Infix(V, k), CQ_k ∘ VR_k gives the result of Infix(Q, k). As MainPath(Q, Dep(V)) ⊑ MainPath(V), CQ_MP ∘ (CQ_1 ∘ VR_1, CQ_2 ∘ VR_2, ..., CQ_D ∘ VR_D) gives the result of Prefix(Q, D). Accordingly, (CQ_MP ∘ (CQ_1 ∘ VR_1, CQ_2 ∘ VR_2, ..., CQ_D ∘ VR_D)) ∘ Q_D = Q.

Theorem 2 describes how to construct CQ to answer Q on the cached results of V, as shown in Fig.2. CQ_k ∘ VR_k means that we can get the result of Infix(Q, k) by querying CQ_k on VR_k. CQ_MP ∘ (CQ_1 ∘ VR_1, CQ_2 ∘ VR_2, ..., CQ_D ∘ VR_D) means that we can get the result of the D-th axis node of Q by integrating each CQ_k ∘ VR_k with CQ_MP. Finally, we get the result of Q by processing Q_D.

4 View Selection

We propose how to select the best V to answer Q and introduce a novel technique to accelerate view selection in this section.

4.1 Tree Pattern's Prime ProducT (PPT)

We have presented how to check whether V can answer Q in Section 3. However, if there are hundreds or thousands of views in the semantic cache, checking Definition 4 against each of them is inefficient. To accelerate view selection, we introduce a more effective technique. We begin with the novel concept of a Tree Pattern's Prime ProducT (PPT) and then give a technique to improve the efficiency of view selection, formalized in Theorem 3.

Definition 5 (Tree Pattern's Prime ProducT (PPT)). We assign different nodes in a tree pattern distinct prime numbers[9] (nodes with the same label are taken as the same node). A Tree Pattern TP's Prime ProducT (PPT) is defined as: TP_PPT = ∏_{(u,v)∈TP} (p(u) · p(v)), where (u, v) ranges over the edges of TP and p(u) is the prime number assigned to u.
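Definition 5 can be sketched directly: fix a prime per label (1 for the wildcard) and multiply p(u)·p(v) over the edges. The prime assignment follows Example 3 below; the edge lists are reconstructed from the products shown there and are therefore an assumption about the exact shapes of Q and V1~V3 in Fig.3.

```python
# Prime assignment from Example 3; "*" gets 1 since it matches any label.
prime = {"a": 2, "b": 3, "c": 5, "d": 7, "e": 11, "*": 1}

def ppt(edges):
    """PPT of a tree pattern given as a list of (parent_label, child_label) edges."""
    result = 1
    for u, v in edges:
        result *= prime[u] * prime[v]
    return result

# Edge lists reconstructed from the factorizations in Example 3 (assumed shapes).
q_edges  = [("a", "b"), ("b", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
v1_edges = [("a", "b"), ("b", "*"), ("*", "a"), ("a", "d")]
v2_edges = [("a", "d"), ("d", "*"), ("*", "a"), ("a", "b")]
v3_edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a")]
```

Running `ppt` on these edge lists reproduces the four products of Example 3 (113400, 504, 1176, 3600).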

Fig.3. Assigned prime numbers of V and Q. (a) Q. (b)~(d) V1~V3.

Example 3. In Fig.3, we assign the nodes distinct prime numbers as follows: a(2), b(3), c(5), d(7), e(11), *(1); the wildcard * is always assigned 1, since * can be matched by any label. We have

Q_PPT = (2·3)·(3·3)·(3·2)·(2·5)·(5·7) = 113400,
V1_PPT = (2·3)·(3·1)·(1·2)·(2·7) = 504,
V2_PPT = (2·7)·(7·1)·(1·2)·(2·3) = 1176,


V3_PPT = (2·3)·(3·2)·(2·5)·(5·2) = 3600.

Theorem 3. Given two tree patterns P and Q, if P is restrictively included in Q (i.e., P ⊑ Q), then Q_PPT | P_PPT, where X | Y denotes that integer Y can be exactly divided by integer X (i.e., there exists another integer Z with Y = XZ).

Proof. Since P is restrictively included in Q, there is a homomorphism h from Q to P which is an injection, i.e., if u ≠ v then h(u) ≠ h(v). Let E_Q^h = {(h(u), h(v)) | (u, v) ∈ Q}. For any node u of Q, if u = *, then p(u) = 1; otherwise h(u) has the same label as u, and thus p(u) | p(h(u)). Therefore, for each edge (u, v) ∈ Q, p(u)·p(v) | p(h(u))·p(h(v)). Since h is an injection, the edges in E_Q^h are pairwise distinct, and hence

Q_PPT = ∏_{(u,v)∈Q} (p(u)·p(v)) divides ∏_{(h(u),h(v))∈E_Q^h} (p(h(u))·p(h(v))).

Moreover, E_Q^h is a subset of the edges of P, so

∏_{(h(u),h(v))∈E_Q^h} (p(h(u))·p(h(v))) divides ∏_{(u,v)∈P} (p(u)·p(v)) = P_PPT.

Therefore, Q_PPT | P_PPT.

Corollary 2. If V_PPT ∤ Q_PPT, then V cannot answer Q.

Proof. We prove it by the contrapositive. If V_PPT | Q_PPT is false, then Q ⊑ V is false according to Theorem 3. If Q ⊑ V is false, then V ⊵ Q is false by Definition 4. That is, V cannot answer Q. Thus the corollary holds.

Theorem 3 gives a necessary but not sufficient condition for P to be included in Q. Corollary 2 characterizes views that cannot answer Q. Accordingly, we can filter out many views that cannot answer Q by comparing the PPTs of V and Q according to Corollary 2, which is very easy to implement.

Example 4. In Fig.3, as V2_PPT | Q_PPT and V3_PPT | Q_PPT are false, Q is not included in V2 or V3; thus V2 and V3 cannot answer Q. As Q is included in V1, V1_PPT | Q_PPT is true. Similarly, in Fig.1, as V_PPT | Q6_PPT and V_PPT | Q7_PPT are false, it is very easy to filter out Q6 and Q7 directly, since V cannot answer them. In contrast, Mandhani and Suciu[2] have to employ some complex operations to infer query/view answerability.

Therefore, when looking for V to answer Q, we first determine whether V_PPT | Q_PPT is true; if so, we then check whether Q is included in V; otherwise, V cannot answer Q and can be filtered out directly. Accordingly, this accelerates cache lookup.

4.2 SCEND Framework

This subsection gives the framework of our semantic cache and presents an optimization technique, formalized in Theorem 4, to facilitate checking whether V can answer Q.

Theorem 4. V can answer Q if V and Q satisfy the following conditions:
(i) MainPath(Q, Dep(V)) ⊑ MainPath(V);
(ii) ∀k, 1 ≤ k ≤ Dep(V), 1) axisNode(V, k) = axisNode(Q, k); 2) Infix(Q, k) ⊑ Infix(V, k); 3) Infix(V, k)_PPT | Infix(Q, k)_PPT;
(iii) MainPath(V)_PPT | MainPath(Q, Dep(V))_PPT;
(iv) V_PPT | Q_PPT.

Proof. As MainPath(Q, Dep(V)) ⊑ MainPath(V) in (i), there exists a homomorphism h′ from MainPath(V) to MainPath(Q, Dep(V)). As Infix(Q, k) ⊑ Infix(V, k) in (ii), there exists a homomorphism h_k from Infix(V, k) to Infix(Q, k). As the k-th axis nodes of Q and V are the same in (ii), any node a in MainPath(V) must be an axis node. Without loss of generality, suppose a_k is the k-th axis node; thus a_k is the root node of Infix(V, k), and h′(a_k) = h_k(a_k) = a_k. Therefore, we can construct h as follows: for each v ∈ V, there exists exactly one k with v ∈ Infix(V, k); let h(v) = h_k(v). It is obvious that h is a homomorphism from V to Q. Hence Q ⊑ V, and thus V can answer Q.

In addition, for (ii-3), as Infix(Q, k) ⊑ Infix(V, k), Infix(V, k)_PPT | Infix(Q, k)_PPT must hold by Theorem 3. Similarly, for (iii), as MainPath(Q, Dep(V)) ⊑ MainPath(V), MainPath(V)_PPT | MainPath(Q, Dep(V))_PPT must hold. For (iv), by Corollary 2, only if V_PPT | Q_PPT can V answer Q. These three divisibility conditions can thus be used for early termination: if any of them fails, V cannot answer Q.

We note that checking whether V can answer Q through Theorem 4 is more efficient than checking Definition 4 directly, since views that cannot answer Q can be efficiently filtered out by checking V_PPT | Q_PPT, MainPath(V)_PPT | MainPath(Q, Dep(V))_PPT, and Infix(V, k)_PPT | Infix(Q, k)_PPT. If any of these conditions fails, we conclude that V cannot answer Q. The worst case of our method is still coNP-complete[10]; however, we can terminate early in many cases, which improves the efficiency of finding a view to answer a new query among a large number of views.
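The early-termination cascade of Theorem 4 can be sketched as below. This is our own illustration, not the paper's code: the cheap integer divisibility tests run first, and the expensive containment checks (conditions (i) and (ii-2), stubbed here as callables) are only reached for views that survive the filters. We assume the query-side PPTs are precomputed against the view's depth.

```python
def divides(x, y):
    """x | y: y is exactly divisible by x."""
    return y % x == 0

def can_answer(view, query, main_path_contained, infix_contained):
    """Sketch of Theorem 4 with divisibility filters checked first."""
    # (iv) whole-pattern PPT filter
    if not divides(view["ppt"], query["ppt"]):
        return False
    # (iii) main-path PPT filter
    if not divides(view["mp_ppt"], query["mp_ppt"]):
        return False
    for k in range(view["depth"]):
        # (ii-1) axis-node labels must agree
        if view["axis"][k] != query["axis"][k]:
            return False
        # (ii-3) per-infix PPT filter, then (ii-2) the real containment test
        if not divides(view["infix_ppt"][k], query["infix_ppt"][k]):
            return False
        if not infix_contained(query, view, k):
            return False
    # (i) main-path containment, checked last
    return main_path_contained(query, view)

# Toy descriptors using the PPT values of Example 3; containment is stubbed.
view  = {"ppt": 504, "mp_ppt": 504, "depth": 1, "axis": ["d"], "infix_ppt": [6]}
query = {"ppt": 113400, "mp_ppt": 113400, "axis": ["d"], "infix_ppt": [113400]}
always = lambda *args: True
```

With these numbers the view passes every divisibility filter (504 | 113400), while a view with PPT 1176 is rejected at condition (iv) without any containment test.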


Fig.4. Architecture of semantic cache and the SQL for selecting views that can answer Q.
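The relational index of Fig.4 can be sketched as follows. This is a hypothetical schema and query of our own (using SQLite for illustration), not the paper's exact DDL: one row per cached view, with the PPT divisibility filter pushed into the SQL WHERE clause and results ordered by depth so the deepest candidate view is tried first.

```python
import sqlite3

# Assumed sketch of the TreePattern table from Fig.4 (a subset of its columns).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE TreePattern (
        TPID   INTEGER PRIMARY KEY,  -- tree pattern id
        PPT    INTEGER,              -- Prime ProducT of V
        MP     TEXT,                 -- main path of V
        MP_PPT INTEGER,              -- Prime ProducT of V's main path
        Dep    INTEGER               -- depth of V
    )""")
conn.executemany("INSERT INTO TreePattern VALUES (?, ?, ?, ?, ?)",
                 [(1, 504, "a//b/d", 504, 3),
                  (2, 1176, "a//d/b", 1176, 3)])

# Candidate selection: a view survives only if its PPTs divide the query's.
q_ppt, q_mp_ppt = 113400, 113400
rows = conn.execute(
    """SELECT TPID FROM TreePattern
       WHERE ? % PPT = 0 AND ? % MP_PPT = 0
       ORDER BY Dep DESC""",
    (q_ppt, q_mp_ppt)).fetchall()
```

Only view 1 survives here (504 divides 113400, 1176 does not); the surviving candidates would then go through the containment checks of Theorem 4.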

To facilitate view selection based on Theorem 4, we devise the architecture of the semantic cache shown in Fig.4. The views in the cache can be indexed in the RDBMS, and we can use the DBMS capabilities to select views that can answer Q by issuing an SQL statement, as shown in Fig.4. If more than one view can answer Q, we always select the V with the greatest depth to answer Q, as it needs the fewest additional operations to construct CQ.

In Fig.4, table TreePattern records the basic information of each view, where TPID denotes view V's tree pattern ID (a system-generated primary key); PPT denotes the Prime ProducT of V; MP denotes the main path of V; MP_PPT denotes the Prime ProducT of V's main path; OS denotes the Occupied Size of V; VF denotes the Visited Frequency of V; RVT denotes the Recently Visited Time of V; FDT denotes the Fetch Delay Time of V, i.e., the time to process V with a general XML query processing method without using the cached views. Table AxisNode records each axis node in V, where TPID is AxisNode's foreign key (referring to table TreePattern's primary key TPID); ANID denotes the position of the axis node in V (the ANID of the k-th axis node of V is k); AN_Name is the axis node's label; Infix(V, i)_PPT and Prefix(V, i)_PPT are the PPTs of Infix(V, i) and Prefix(V, i) respectively; and Infix(V, i)_RST is the result of the i-th axis node.

In this paper, once V and Q satisfy the four conditions in Theorem 4, we can answer Q through V with some simple operations. It is obvious that, in our approach, a given V can answer more queries than in [2]. Moreover, our method is more effective for finding V to answer Q, since if any of the conditions is not satisfied, V cannot answer Q and such a V can be skipped. The divisibility of two integers is easy to validate, and MainPath(Q, Dep(V)) ⊑ MainPath(V),

Infix(Q, k) ⊑ Infix(V, k) is easier to check than Q ⊑ V.

5 Algorithms

This section proposes two algorithms, for view selection and compensating query construction.

5.1 View Selection Algorithm

To further improve the efficiency of view selection, we devise an effective algorithm to look for the best V to answer Q; Algorithm View-Selection in Fig.5 gives the details. If we can find a view V to answer query Q in Algorithm View-Selection, it is called a cache hit; otherwise, it is called a cache miss. View-Selection is implemented based on the SQL statement in Fig.4, and it can skip views that cannot answer Q. Note that, to select the best V to answer Q, we maintain the views in the semantic cache sorted by Dep(V) in descending order, and always select the V with the greatest depth to answer Q. Moreover, View-Selection skips the views which do not satisfy V_PPT | Q_PPT and MainPath(V)_PPT | MainPath(Q, D)_PPT in line 3. Note that it is very efficient to check the divisibility of two positive integers; if a view cannot answer this query, we can skip it directly. Subsequently, only the views which satisfy MainPath(Q, D) ⊑ MainPath(V) can answer Q (line 5). As the main path contains only /, // and *, MainPath(Q, D) ⊑ MainPath(V) can be validated in polynomial time, as proved in [8]. Moreover, View-Selection skips the views which do not satisfy condition (ii) of Theorem 4. Accordingly, we can select V to answer Q effectively.

Example 5. In Fig.1, when queries Q6 and Q7 arrive, as V_PPT ∤ Q6_PPT and V_PPT ∤ Q7_PPT, we can determine that V cannot answer Q6 and Q7 based on


Corollary 2 (line 3 of Algorithm View-Selection). Thus, we can effectively select the best view to answer queries.

We now analyze the time complexity of the View-Selection algorithm. For each view V, we need to verify whether Q's main path is included in V's main path, and the time complexity is O(Dep(V)^2). For each node on the main path, we need to verify whether Infix(Q, k) ⊑ Infix(V, k), and the time complexity is |Infix(Q, k)|^|Infix(V, k)|, where |Infix(V, k)| is the number of nodes in Infix(V, k). Thus the total time complexity is n · (Dep(V)^2 + Σ_{1≤k≤Dep(V)} |Infix(Q, k)|^|Infix(V, k)|), where n is the number of views in the cache.
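The polynomial main-path check can be sketched as a small dynamic program. This is our own illustration: a main path is a list of (axis, label) steps, and we search for a homomorphism from V's main path onto Q's, where "/" edges must map to "/" edges and "//" edges may skip ahead. Finding such a homomorphism is sufficient for containment; we do not claim this sketch is complete for the full /,//,* fragment.

```python
from functools import lru_cache

def main_path_contained(q, v):
    """Sufficient check for MainPath(Q) contained in MainPath(V).

    q, v: lists of (axis, label) steps, axis in {"/", "//"}, label or "*".
    Searches for a homomorphism mapping v onto q, root to root, end to end.
    """
    @lru_cache(maxsize=None)
    def emb(i, j):                           # try mapping v[i] onto q[j]
        if v[i][1] != "*" and v[i][1] != q[j][1]:
            return False                     # labels must agree unless v has *
        if i == len(v) - 1:
            return j == len(q) - 1           # returned node maps to returned node
        axis = v[i + 1][0]                   # edge from v[i] to v[i+1]
        if axis == "/":                      # "/" must map onto a "/" edge
            return j + 1 < len(q) and q[j + 1][0] == "/" and emb(i + 1, j + 1)
        return any(emb(i + 1, j2) for j2 in range(j + 1, len(q)))  # "//" skips
    return emb(0, 0)

v_mp = [("/", "a"), ("//", "b"), ("/", "d")]   # MainPath(V) = a//b/d
q_mp = [("/", "a"), ("/", "b"), ("/", "d")]    # MainPath(Q) = a/b/d
```

Here a/b/d is contained in a//b/d but not vice versa, matching the intuition that // generalizes /.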

Fig.5. View-Selection algorithm.

5.2 Query Rewriting Algorithm

Fig.6. Query-Rewriting algorithm.

This subsection presents an algorithm, Query-Rewriting in Fig.6, to construct a compensating query so as to employ the best V to answer Q. We begin by introducing the standard form of a tree pattern. Some tree patterns are equivalent but written differently[2]. For example, suppose P = a[c[d]/e]/b[e[f]/g] and Q = a[c[d][e]]/b[e[f][g]]. Although P and Q are not syntactically identical, they are equivalent; we take Q as the standard form of P. To address this issue, we transform tree patterns into their standard form as follows: given a query Q, for any node that has more than one child, we sort its children by their labels in lexicographical order. Accordingly, all equivalent queries are transformed into a unique standard form, obtained by calling procedure TransformationTree in Fig.6.

Suppose the results of Infix(V, k) and Infix(Q, k) are VR_k and QR_k respectively. If Infix(V, k) and Infix(Q, k) have the same standard form, we have QR_k = VR_k, as shown in lines 3~4 of Algorithm Query-Rewriting in Fig.6; otherwise, Query-Rewriting will


process each CQk on the corresponding sub-view VRk to get QR k , i.e., QRk =CQ k VR k = Inx (Q, k ) VR k as shown in lines 56 in Algorithm Query-Rewriting in Fig.6. CQk VR k is used to get the result of querying CQk on sub-view VR k , which is similar to general XML query processing method. However, it is much easier than directly processing Q. We employ a holistic twig join algorithm[11] to implement it. Then, we retrieve the result of the path in Q from the 1st axis node to the k -th axis node based on algorithm PathStack[12] as shown in line 8, the complexity of which is O(|QR k | + |FVR MP |). Finally, if D = Dep(V ) = Dep(Q), that is, the D-th axis node is the returned node of QD , Query-Rewriting returns FVR MP directly in line 10; otherwise it gets the result of Q by querying QD on FVRMP in line 12. Example 6. Considering queries Q1 , Q2 , . . . , Q7 in Fig.1, we nd that V can answer Q1 , Q2 , . . . , Q5 , but cannot answer Q6 and Q7 . Because VPPT | Q6PPT and VPPT | Q7PPT , Q6 and Q7 are ltered out directly in line 3 of Algorithm ViewSelection in Fig.5, then we call Algorithm Query-Writing in Fig.6 to construct CQ to answer Q1 , Q2 , . . . , Q5 . Considering Q1 , as SubQueryMatch(Inx(V , 1), 1 1 Inx(Q1 , 1)) is not true, we have QR 1 1 =CQ 1 VR 1 = As SubQueryMatch(Inx(V , 2), Inx (Q1 , 1) VR 1 1. Inx(Q1 , 2)) is true and SubQueryMatch(Inx(V , 3), 2 3 Inx(Q1 , 3)) is true, QR2 1 = VR 1 and QR 1 = 3 MP in line 8 by processing the VR 1 . We get FVR main path of view V as illustrated in Fig.6. As Dep(Q1 ) = Dep(V ), we get the result of the returned node c of Q1 by processing QD . For Q2 , as SubQueryMatch(Inx(V , 1), Inx (Q2 , 1)) is not true, 2 2 As Subwe get QR 2 2 =CQ 2 (Inx (Q2 , 2)) VR . QueryMatch(Inx(V , 1), Inx(Q2 , 1)), and SubQuery1 Match(Inx(V , 3), Inx (Q2 , 3)) hold, QR1 2 = VR 2 and 3 3 D QR2 = VR 2 . Finally, we get FVR through PathStack in line 8. As Dep(Q1 ) = Dep(V ), we return FVR MP in line 10. 
Consider Q3: CQ_3^1 = a[e], CQ_3^2 = b[c[a][b]], CQ_3^3 = d[b], and CQ_3^MP = a/b/d; thus we get FVR^D through PathStack in line 8. As Dep(Q3) = Dep(V), we return FVR^D directly. Similarly, we can get the results of Q4 and Q5. Table 1 gives each CQ^k, CQ^MP and Q^D for Q1, Q2, ..., Q5, where a broken line for CQ^MP denotes that it is exactly MainPath(Q, D), a broken line for CQ^k denotes that it is Infix(Q, k), and a broken line for Q^D denotes that it is Q^D itself. Now we give the time complexity analysis of the Query-Rewriting algorithm. For each node on V's main path, we need to verify whether the subqueries Infix(V, k) and Infix(Q, k) match. Given the root nodes of Infix(V, k) and Infix(Q, k), we need to sort their

children; thus the time complexity is O(|Infix(V, k)| log(|Infix(V, k)|) + |Infix(Q, k)| log(|Infix(Q, k)|)), where |Infix(V, k)| is the number of nodes in Infix(V, k). Then the algorithm combines QR^i (1 ≤ i ≤ D), whose complexity is O(Dep(V)). Accordingly, the overall complexity of the algorithm is O(Dep(V) × (|Infix(V, k)| log(|Infix(V, k)|) + |Infix(Q, k)| log(|Infix(Q, k)|))).
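The sorting step behind the |T| log(|T|) terms above can be sketched as follows, under a plausible reading of SubQueryMatch in which two subqueries match when they are identical as unordered trees. The tuple-based tree encoding and the function names here are illustrative assumptions, not the paper's implementation.

```python
def canonical(node):
    # node = (label, [children]); children order is irrelevant in the twig
    # pattern, so sort the children's canonical forms recursively to get an
    # order-insensitive encoding. Sorting dominates the cost, giving the
    # |T| log(|T|) terms in the complexity analysis.
    label, children = node
    return (label, tuple(sorted(canonical(c) for c in children)))

def subquery_match(view_sub, query_sub):
    # Two subqueries match iff their canonical forms coincide.
    return canonical(view_sub) == canonical(query_sub)
```

For example, b[c[a][b]] and b[c[b][a]] differ only in sibling order, so they match; b[c[a]] does not match either of them.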
Table 1. Compensating Queries of Q1∼Q5
Columns: Query, CQ^MP, CQ^1, CQ^2, CQ^3, Q^D, with one row per query Q1∼Q5. Entries include a/b/d, a[e], d[b]/c, b[c[//d[a]][b]], b[c[a][b]], d[b], d[//a[b][d]], and d[//a[b]][c > 100]; the remaining cells are broken lines denoting the default values described above.
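The prime-product filter that Example 6 applies in line 3 of ViewSelection (checking whether V_PPT divides Q_PPT) can be sketched as follows. The label-to-prime assignment and the flat label lists are illustrative simplifications of the node-labeling scheme described earlier in the paper.

```python
# Sketch of the divisibility-based pre-filter: each distinct node label is
# assigned a prime, and a query's PPT is the product of the primes of its
# node labels (with multiplicity). A view V can possibly answer a query Q
# only if PPT(V) divides PPT(Q); this is necessary, not sufficient.

_primes = iter([2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37])
_label_prime = {}

def prime_of(label):
    # Assign each distinct label the next unused prime.
    if label not in _label_prime:
        _label_prime[label] = next(_primes)
    return _label_prime[label]

def ppt(labels):
    # Product of primes over all node labels.
    p = 1
    for label in labels:
        p *= prime_of(label)
    return p

def may_answer(view_labels, query_labels):
    # Necessary condition for the view to answer the query.
    return ppt(query_labels) % ppt(view_labels) == 0
```

Views failing this integer test are filtered out with a single modulo operation, without any structural matching; views passing it still undergo the full SubQueryMatch and Query-Rewriting steps.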

6 Cache Replacement

If the space for admitting a new query and its result is not sufficient, some cached queries and their corresponding results need to be replaced. Inspired by LFU and LRU, we integrate the two into a policy we call LFRU: we always replace the cached query that is the least frequently and least recently used. As frequent query patterns are more likely to be issued subsequently, we cache the recent frequent query patterns. When cache replacement is needed, we first replace the infrequent query patterns and their corresponding answers. If the space for admitting the new query result is still not sufficient, the cached results corresponding to some frequent query patterns are replaced as follows. The cached queries are classified into two categories according to their visited time: the 20% most recently visited queries, and the remaining 80%. We assign the two categories different importance ratios. Suppose the cached queries are {q1, q2, ..., qn}; we record the visited frequency f_i, the most recent visited time t_i, the execution cost c_i and the occupied size s_i for each query q_i. We always first replace the query q_i whose score (α_i · f_i · c_i)/s_i is minimal, where α_i = α if q_i is in the category of the 20% recent queries, and α_i = 1 otherwise.
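The LFRU replacement rule above can be sketched as follows; the data layout and the way the 20% recency split is computed are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class CachedQuery:
    freq: int        # visited frequency f_i
    last_visit: int  # most recent visited time t_i
    cost: float      # execution cost c_i
    size: float      # occupied size s_i

def pick_victim(cache, alpha=2.0):
    # cache: dict mapping query id -> CachedQuery.
    # The 20% most recently visited queries get weight alpha (> 1);
    # the remaining 80% get weight 1.
    by_recency = sorted(cache, key=lambda k: cache[k].last_visit, reverse=True)
    recent = set(by_recency[:max(1, len(cache) // 5)])

    def score(k):
        q = cache[k]
        a = alpha if k in recent else 1.0
        return (a * q.freq * q.cost) / q.size

    # Evict the query with the minimal LFRU score.
    return min(cache, key=score)
```

A rarely used, cheap-to-recompute query that was visited long ago thus gets the lowest score and is evicted first, while recent or expensive results are retained.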

We note that recent queries are generally more important, therefore α should be larger than 1. We use our

incremental algorithms[13] to mine the frequent queries to cache. We will experimentally demonstrate the effectiveness of our proposed techniques in Section 7.

7 Experimental Study

In this section, we present the experiments conducted to evaluate the efficiency of the various algorithms and the obtained results. Mandhani and Suciu[2] proposed a technique for view selection based on string match; we call it SCSM, Semantic Cache based on String Match. However, SCSM cannot fully exploit the query/view answerability. We compared our method SCEND with the existing state-of-the-art method SCSM[2], the containment checking algorithm CheckContainment[10], and the naive cache, which requires exact string match between the query and view. CheckContainment needs to check the containment between each view and the query. All the algorithms were coded in C++, and the experiments were conducted on an AMD 2600+ PC with 1 GB RAM, running Windows 2000 Server. We used the beta 2 release of Microsoft SQL Server 2005 for both the cache and XML databases. We cached the views in the semantic cache similarly to [2]. Moreover, we randomly added some // and * to the XPath queries. We employed the datasets DBLP[14], TreeBank[15], and XMark[16] for our experiments: 1) XMark is synthetic and generated by an XML data generator; 2) DBLP is a collection of papers and articles; 3) TreeBank has a highly recursive structure. The deep
Table 2. Characteristics of Datasets

Datasets   Average No. Nodes   Max Depth   Max Fan-Out
XMark      8.4                 11          11
DBLP       7.6                 8           12
TreeBank   12.2                20          10

recursive structure of this data makes it an interesting case for the experiments. Based on the DTDs of the selected datasets, some // and * nodes are added to construct the queries and views used as the input. Different characteristics of the queries are summarized in Table 2. As TreeBank has a complicated schema, it leads to lower search performance than the other datasets. The average number of nodes, maximum depth and fan-out of the queries reflect the complexity of the datasets. All the query workloads follow a Zipf distribution with exponent z, where z is a parameter and the probability of choosing the i-th query is proportional to 1/i^z.

7.1 Cache Hit Rate
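A Zipf-distributed workload of the kind used throughout these experiments can be generated with a sketch like the one below; the sampler and its parameters are illustrative, as the paper does not specify its actual generator.

```python
import random

def make_zipf_sampler(n, z, seed=42):
    # Returns a sampler over query ids 1..n with
    # P(i) proportional to 1/i**z (Zipf distribution with exponent z).
    rng = random.Random(seed)
    ids = list(range(1, n + 1))
    weights = [1.0 / (i ** z) for i in ids]
    def sample():
        return rng.choices(ids, weights=weights)[0]
    return sample
```

With a larger z, a small set of hot queries dominates the workload, so query locality increases, which is why the cache hit rates below rise with Z.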

This subsection evaluates the query/view answerability of the various methods, using cache hit rate as the metric. Fig.7 shows the experimental results with different Zipf exponents Z used for generating queries. The cached views and test queries we employed were 200 000 and 100 000 respectively for each Z value. As Z increases, the locality of the queries increases, and thus the cache hit rates increase. We note that SCEND always achieves a high cache hit rate because it employs the decomposition-based method for view selection, which can exploit sufficient views to answer queries. Although CheckContainment achieves a slightly higher hit rate than SCEND, CheckContainment is rather expensive in checking the containment between views and queries, especially for large numbers of views. Fig.8 shows how the cache hit rate varies with the number of queries. We cached 200 000 queries and set Z to 1.2. The cache hit rate for SCEND does not drop as the number of queries increases, because our cache replacement policy is very effective for query caching and replacement. However, the cache hit rates for SCSM and the naive cache vary with different numbers of queries. We observe that SCEND achieves a higher cache hit rate than the alternative methods: more than 20% higher than SCSM[2] and 50% higher

Fig.7. Cache hit rate vs. different Zipf exponents (200 000 views).

http://www.cs.washington.edu/research/xmldatasets/data/treebank/.

Fig.8. Cache hit rate vs. different numbers of queries (Z = 1.2).

than the naive cache on each dataset. Thus, the query/view answerability that we capture is much richer than that of SCSM and the naive cache. Moreover, different datasets do not influence the performance of caching, which reflects the better scalability of our method.

7.2 Cache Lookup Time

In this subsection we evaluate the efficiency of cache lookup. Note that the lookup time does not include the time for obtaining the result of Q by executing CQ (for a cache hit) or Q (for a cache miss). Fig.9 shows the experimental results with different Zipf exponents Z used for generating queries. The cached views and test queries we employed were 200 000 and 100 000 respectively for each Z value. Fig.10 shows how the average cache lookup time varies with the number

of queries. Here we see how well the lookup scales to a large number of cached views. In all cases, we cached 200 000 queries. We can see that the lookup time for SCEND remains constant at around 4 ms, even as the number of queries increases to 6 million. This time is very small compared with the time taken to execute a typical XPath query. However, CheckContainment takes more than 2000 milliseconds per cache lookup, which is even more than the query processing time. Moreover, SCEND is better than SCSM, which takes more than 12 ms per lookup. The naive cache takes a mere 2 ms per lookup; however, in terms of query processing performance, this difference is offset by the higher hit rate of the semantic cache, as we compare later.

Fig.9. Average cache lookup time vs. different Zipf exponents (200 000 views).

Fig.10. Average cache lookup time vs. different numbers of queries (Z = 1.2).

7.3 Query Performance

We evaluate the query performance of our proposed semantic cache. On a cache hit, we evaluate CQ on V to answer Q; on a cache miss, we process Q directly. We first evaluate the query performance with different numbers of queries, fixing the Zipf exponent Z at 1.2. The numbers of cached queries and test queries are 200 000 and 100 000 respectively. Fig.11 shows the average elapsed time of processing a query. The queries took 1600 milliseconds with no caching. CheckContainment increases this to more than 2000 milliseconds, as it takes more time for cache lookup. The naive cache brings this down to 1000 ms, while employing SCSM brings it down to 600 ms. SCEND brings it down to 200 ms, which is a speedup by factors of 10, 8, 5 and 3 over CheckContainment, no cache, the naive cache and SCSM respectively.

Further, Fig.12 shows how the average time per query varies with the number of queries. The average times for SCEND and for no cache do not shoot up as the number of queries increases. However, the average times for CheckContainment, SCSM and the naive cache increase with the number of queries. As the number of queries increases, the locality of the queries also changes; this does not influence the no-cache method but does influence the other three methods. Because our cache replacement policy is efficient, the performance of SCEND does not drop as the number of queries increases. This reflects the better scalability of our method. Finally, Table 3 shows some additional experimental results. For a cache hit, the semantic cache needs to query a cached fragment, whereas the naive cache simply retrieves the whole fragment. Considering this, the average lookup time per cache hit is 4.11 ms for SCEND, which is impressive, because the average

Fig.11. Average elapsed time vs. different Zipf exponents (200 000 views).

Fig.12. Average elapsed time vs. different numbers of queries (Z = 1.2).

Table 3. Evaluation of Different Methods

                              SCEND   SCSM    Naive Cache   No Cache   CheckContainment
Avg. Lookup Time/Hit (ms)     4.11    15.66   1.64          0          2253
Avg. Lookup Time/Miss (ms)    10.40   36.82   1.67          0          3426
Avg. Lookup Time (ms)         4.81    20.01   1.66          0          2408
Avg. Time/Hit (ms)            78      455     1.81          0          2442
Avg. Time/Miss (ms)           1124    1138    1602          1603       4312
Avg. Time (ms)                89      681     1276          1603       2549
Hit Rate                      0.952   0.781   0.214         0          0.963

time per cache hit is 78 ms. It is interesting to observe that the average time per miss for SCEND is 1124 ms, much longer than the overall average of 89 ms, while cache lookup only takes an extra 4.81 ms. Thus, the queries that are cache misses take longer to execute on the XML database than those that are cache hits. Compared with the average time per miss, the average lookup time for SCEND is negligible. Although CheckContainment can improve the cache hit rate, it spends much time on cache lookup and thus leads to low overall performance. Moreover, the cache hit rate of SCEND reaches 0.952 and its average time is only 89 ms, which are much better than those of SCSM, the naive cache, no cache, and CheckContainment.

8 Related Work

XML has become a standard for information representation and exchange over the Internet. Many researchers have been studying the problems of XML indexing[8], XML query processing[11-12,17-19], frequent XML query pattern discovery[13,20] and XML query caching and answering[2,20-23]. Chen et al.[24] attempted to apply the ideas of semantic caching to XML query processing systems, in particular the XQuery engine. Semantic caching implies view-based query answering and cache management. Hristidis and Petropoulos[25] presented a novel framework for semantic caching of XML databases. The cached XML data are organized using a modification of the incomplete tree, which has many desirable properties, such as incremental maintenance, containment decidability and remainder query generation in PTIME. Xu[26] introduced a framework for a new semantic caching system, which offers a representation system for cached XML data, together with algorithms to decide whether a new query can be totally answered by the cached XML data and to incrementally maintain the cached data. The work most related to our method is that on containment between XPath queries. Miklau and Suciu[10] proved that this problem is Co-NP-complete; they also presented a polynomial time algorithm for checking containment, which is sound but not complete. Balmin et al.[21] employed materialized XPath views to answer queries. However, their method is inefficient for view selection if there are a large number of views in the cache. Further, their criterion for query/view answerability is exactly containment between queries and views. Their version of what we call compensating queries requires navigating up from the returned nodes of the view being used. For each view, they store one or more of XML fragments, object ids, and typed data values, and they define query/view answerability accordingly.

This choice allows them to maintain some cached views outside the database too, and to target applications like middle-tier caching and distributed XQuery. Application-tier caching for relational databases has received a lot of attention lately, in the context of database-driven websites[4,27]. Our caching framework enables the same for XML databases. Further, when the cache is maintained inside the XML database system, object ids of the result nodes can be stored instead of the entire result fragment; the techniques that we describe in this paper remain equally applicable. Chen and Rundensteiner[28] proposed a semantic cache of XQuery views, focusing on various aspects of the query/view matching problem, which is harder for XQuery. Having XQuery views results in smaller cached results and concise rewritten queries, which speeds up cache hits. However, cache lookup optimization is much harder due to the more complex matching involved, and lookup is likely to become the bottleneck when there are a large number of cached views to consider. Mandhani and Suciu[2] proposed a method for finding a view V to answer Q by string match, but when there are large numbers of views in the cache, it is rather inefficient. Moreover, it may incur cache misses for queries that could in fact be answered by some cached views. In this paper we demonstrate how to improve the cache hit rate and adequately exploit the query/view answerability. Discovering frequent XML query patterns is a significant and effective premise of query optimization because of its capability of capturing the query focus. The rapid growth of XML repositories has provided the impetus to design and develop systems that can store and query XML data efficiently, and thus discovering frequent XML query patterns has recently attracted a large amount of attention, as the answers of these queries can be stored and cached so as to improve query performance.
The advantage of caching is that when a user refines a query by adding or removing one or more query terms, many of the answers that have already been cached can be delivered to the user right away. This avoids the expensive evaluation of repeated or similar queries. As to frequent XML query pattern mining, to the best of our knowledge, XQPMiner[29] is the first algorithm to mine frequent XQPs, using a global-XQP-schema-guided enumeration mining algorithm. It follows the traditional generate-and-test paradigm for tree-structured data mining: a global query pattern tree needs to be generated for XQP enumeration, along with expensive candidate generation and containment testing. FastXMiner[20] is the most efficient mining algorithm for XML frequent query pattern discovery, as only

valid candidate XQPs are enumerated for costly containment testing, as opposed to all the candidates of XQPMiner[29]. increQPMiner[30] studies the problem of incremental mining by using the mined results of the original databases. However, increQPMiner is not as efficient as our incremental algorithms[13], as increQPMiner does not take full advantage of the mined results of the original database. More importantly, we proposed a novel method[13] for effective incremental mining by employing the F-index and Q/F-index to facilitate the mining of frequent query patterns. More recently, we proposed a novel method of exploiting sequencing views in a semantic cache to accelerate XPath query evaluation so as to improve the answerability of caching[23]. We also devised an efficient approach to improve the answerability of the semantic cache by decomposing XML queries into simple components and employing a technique based on the divisibility of prime number products[22] for effective view selection, which leads to a dramatic improvement over prior work in terms of cache lookup. As a major value-added version of our preliminary work[22], we summarize the major extensions as follows. Firstly, we have added a novel cache replacement technique to further improve the performance of our proposal (Section 6). Secondly, we have added some examples to make the paper more understandable (Examples 5, 6 and 7). Thirdly, we proposed to use DBMS capabilities for effective view selection (Subsection 4.2). Finally, we have conducted an extensive new performance study on different datasets to further evaluate our algorithm and compared our approach with the existing state-of-the-art methods (Figs. 7∼12 and Table 3).

9 Conclusion

We have proposed a semantic cache, namely SCEND, for effective query caching and view selection to adequately exploit XPath query/view answerability. To enhance the query/view answerability, we decompose complex queries into simpler ones and use the decomposed queries to evaluate the query/view answerability of the complex queries and views, which can exploit sufficient views to answer queries. To improve the efficiency of view selection, we present a novel technique based on the divisibility of two numbers: we assign each query node a prime number and prove that the divisibility of the two positive numbers assigned to the query and the cached view is a necessary condition for query/view answerability, which significantly improves the efficiency of cache lookup. We have implemented our method, and the thorough experimental results give us confidence that our approach achieves high performance and outperforms the existing state-of-the-art methods significantly.

References

[1] Dar S, Franklin M J, Jónsson B T, Srivastava D, Tan M. Semantic data caching and replacement. In Proc. VLDB 1996, Mumbai (Bombay), India, September 3-6, 1996, pp.330-341.
[2] Mandhani B, Suciu D. Query caching and view selection for XML databases. In Proc. VLDB 2005, Trondheim, Norway, August 30-September 2, 2005, pp.469-480.
[3] Feng J H, Li G L, Ta N. A semantic cache framework for secure XML queries. J. Comput. Sci. & Technol., 2008, 23(6): 988-997.
[4] Luo Q, Krishnamurthy S, Mohan C, Pirahesh H, Woo H, Lindsay B G, Naughton J F. Middle-tier database caching for e-business. In Proc. ACM SIGMOD Int. Conf. Management of Data, Madison, USA, June 3-6, 2002, pp.600-611.
[5] Ré C, Brinkley J, Hinshaw K, Suciu D. Distributed XQuery. In Proc. Information Integration on the Web (IIWeb), VLDB Workshop, Toronto, Canada, Aug. 30, 2004, pp.116-121.
[6] Chandra A K, Merlin P M. Optimal implementation of conjunctive queries in relational data bases. In Proc. STOC, Boulder, Colorado, USA, May 2-4, 1977, pp.77-90.
[7] Miklau G, Suciu D. Containment and equivalence for a fragment of XPath. J. ACM, 2004, 51(1): 2-45.
[8] Milo T, Suciu D. Index structures for path expressions. In Proc. ICDT, Jerusalem, Israel, January 10-12, 1999, pp.277-295.
[9] Wu X, Lee M L, Hsu W. A prime number labeling scheme for dynamic ordered XML trees. In Proc. ICDE, Boston, USA, March 30-April 2, 2004, pp.66-78.
[10] Miklau G, Suciu D. Containment and equivalence for an XPath fragment. In Proc. PODS, Madison, USA, June 3-5, 2002, pp.65-76.
[11] Li G, Feng J, Zhang Y, Zhou L. Efficient holistic twig joins in leaf-to-root combining with root-to-leaf way. In Proc. DASFAA, Bangkok, Thailand, April 9-12, 2007, pp.834-849.
[12] Bruno N, Koudas N, Srivastava D. Holistic twig joins: Optimal XML pattern matching. In Proc. ACM SIGMOD Int. Conf. Management of Data, Madison, Wisconsin, June 3-6, 2002, pp.310-321.
[13] Li G, Feng J, Wang J, Zhang Y, Zhou L. Incremental mining of frequent query patterns from XML queries for caching. In Proc. ICDM, Hong Kong, China, December 18-22, 2006, pp.350-361.
[14] http://dblp.uni-trier.de/xml/.
[15] http://www.cs.washington.edu/research/.
[16] http://www.xml-benchmark.org/.
[17] Al-Khalifa S, Jagadish H V, Patel J M, Wu Y, Koudas N, Srivastava D. Structural joins: A primitive for efficient XML query pattern matching. In Proc. ICDE 2002, San Jose, USA, February 26-March 1, 2002, pp.141-152.
[18] Chen T, Lu J, Ling T W. On boosting holism in XML twig pattern matching using structural indexing techniques. In Proc. ACM SIGMOD Int. Conf. Management of Data, Baltimore, USA, June 14-16, 2005, pp.455-466.
[19] Lu J, Ling T W, Chan C Y, Chen T. From region encoding to extended dewey: On efficient processing of XML twig pattern matching. In Proc. VLDB, Trondheim, Norway, August 30-September 2, 2005, pp.193-204.
[20] Yang L H, Lee M L, Hsu W. Efficient mining of XML query patterns for caching. In Proc. VLDB, Berlin, Germany, September 9-12, 2003, pp.69-80.
[21] Balmin A, Özcan F, Beyer K S, Cochrane R, Pirahesh H. A framework for using materialized XPath views in XML query processing. In Proc. VLDB 2004, Toronto, Canada, August 31-September 3, 2004, pp.60-71.

[22] Li G, Feng J, Ta N, Zhang Y, Zhou L. SCEND: An efficient semantic cache to adequately explore answerability of views. In Proc. WISE 2006, Wuhan, China, October 23-26, 2006, pp.460-473.
[23] Feng J, Ta N, Zhang Y, Li G. Exploit sequencing views in semantic cache to accelerate XPath query evaluation. In Proc. WWW 2007, Banff, Canada, May 8-12, 2007, pp.1337-1338.
[24] Chen L, Rundensteiner E A, Wang S. XCache: A semantic caching system for XML queries. In Proc. ACM SIGMOD Int. Conf. Management of Data, Madison, USA, June 3-6, 2002, p.618.
[25] Hristidis V, Petropoulos M. Semantic caching of XML databases. In Proc. ACM SIGMOD Int. Conf. Management of Data, Madison, USA, June 3-6, 2002, pp.25-30.
[26] Xu W. The framework of an XML semantic caching system. In Proc. ACM SIGMOD Int. Conf. Management of Data, Baltimore, USA, June 13-16, 2005, pp.127-132.
[27] Yagoub K, Florescu D, Issarny V, Valduriez P. Caching strategies for data-intensive Web sites. In Proc. VLDB 2000, Cairo, Egypt, September 10-14, 2000, pp.188-199.
[28] Chen L, Rundensteiner E A. XCache: XQuery-based caching system. In Proc. Int. Workshop on the Web and Databases, Madison, Wisconsin, June 3-6, 2002, pp.31-36.
[29] Yang L H, Lee M L, Hsu W, Acharya S. Mining frequent query patterns from XML queries. In Proc. DASFAA 2003, Kyoto, Japan, March 26-28, 2003, pp.355-362.
[30] Chen Y, Yang L H, Wang Y G. Incremental mining of frequent XML query pattern. In Proc. ICDM 2004, Brighton, UK, November 1-4, 2004, pp.343-346.

Guo-Liang Li received his B.S. degree from the Department of Computer Science and Technology, Harbin Institute of Technology (HIT), and his M.S. and Ph.D. degrees from the Department of Computer Science and Technology, Tsinghua University, where he is currently working as a faculty member. He is a member of China Computer Federation (CCF). His research interests are in the fields of database indexing, data integration, and data cleaning. He has published papers in top international conferences, such as ACM SIGMOD, ACM SIGIR, VLDB, IEEE ICDE, WWW, ACM CIKM and IEEE ICDM, and in top international journals, such as DMKD and Information Systems.

Jian-Hua Feng received his B.S., M.S. and Ph.D. degrees in computer science and technology from Tsinghua University. He is currently working as a faculty member of the Department of Computer Science and Technology at Tsinghua University. He is a senior member of CCF. His main research interests are native XML databases, keyword search and data mining. He has published papers in top international journals and conferences, such as DMKD, Information Systems, ACM SIGMOD, ACM SIGKDD, VLDB, IEEE ICDE, ACM SIGIR, WWW, ACM CIKM, IEEE ICDM, SDM and ER.
