Professional Documents
Culture Documents
Project Report: University of Winnipeg
Project Report: University of Winnipeg
Project Report: University of Winnipeg
Project Report
Unordered Tree Matching
Harmeet Singh
Student # - 3050597
Introduction:
Problem Description :
Main aim of this report is to properly explain the problems regarding
the present XML querying strategies present in the industry. XML
Querying languages (such as Xpath, Xquery etc) use the tree modeling
approach for querying the data out of the XML Document.
Normal Approaches use extensive of join operations in the query which
makes it more complicated and moreover slow in terms of efficiency.
In this article, we discuss an efficient algorithm for tree mapping
problem in XML databases. Given a target tree T and a pattern tree Q,
the algorithm can find the embedding of Q in T in O(|T||Q|) time while
the existing approaches need exponential time in the worst case.
Motivation :
The main motivation was the fact that for every single method for
finding the embedding we have to first install a software for just for a
start.
By the method of the paper we don't need to install any proprietary
software so we can work on the problem using the open source library.
We can actually change the whole library which we want to use as to
our specification as the source code is also available and is natively
written in C++ language.
We can use this approach in xml database software as a plug-in to map a
sql query for a xml database.
This approach can be used for giving a extended support for current
Xpath and XSLT query languages.
This algorithm can be used for both DOM and SAX architecture trees
constructed dynamically using XERCES library .
Related Work :
Xerces-C++ : is a validating XML parser written in a portable subset of
c++. Xerces-C++ makes it easy to give your application the ability to read
and write XML data. A shared library is provided for parsing, generating,
manipulating, and validating XML documents using the DOM, SAX, and
SAX2 APIs.
RapidXml: is an attempt to create the fastest XML parser possible,
while retaining usability, portability and reasonable W3C
compatibility. It is an in-situ parser written in modern C++, with
parsing speed approaching that of strlen function executed on the
same data.
RapidXml has been around since 2006, and is being used by lots of
people. HTC uses it in some of its mobile phones.
gets its name from its use of a path notation as in URLs for
navigating through the hierarchical structure of an XML document.
Main Thrust :
NODE SEMANTICS :
In order to use the idea proposed by the research paper, we need
to define the semantics of the nodes traversal in the project. As
extension we can modify the code to find out the reachability
between two nodes in a tree.
Firstly a tree is considered namely T and visiting T in preorder
fashion, each node v will obtain a property pre(v) to maintain the
way nodes are traversed in the tree.
In a similar way, by traversing T in postorder, each node v will
get another number post(v). These two numbers can be used to
characterize the ancestor-descendant relationships as follows.
Let v and v be two nodes of a tree T. Then, v is a descendant of v
iff pre(v) > pre(v) and post(v) <post(v).
Algorithm Description :
The algorithm tree-matching( ) searches T bottom up in a recursive way
(see line 4). During the process, for each encountered node v in T, we
first check whether it is a leaf node (see line 2).
If it is a leaf node, the function leaf-node-check( ) is called (see line 12),
by which all the matching leaf nodes in Q will be stored in a temporary
variable S2 that will be added to (v) (see line 13).
If v is an internal node, lines 3 - 10 are first conducted and then the
function leaf-node-check( ) is invoked (see line 12).
By executing line 4, tree matching() is recursively called for each child
node vi of v.
After that, for each q appearing in (vi), its value is set to be v (see line
7).
In addition, qs parent is inserted into S, a temporary valuable to be used
in a next step Since (vi)s will not be used any more after this step,
they are simply removed (see line 8).
By executing lines 9 - 10, we check, for each q in S, whether v matches
q by calling node-check( ), in which the values of qs child nodes are
utilized to facilitate the checking's (see lines 3 - 5 in node-check( )).
Example :
Main XML Document:
Query Tree :
As we know the process will start with comparing the leaf nodes
as the algorithm works in a bottom up approach.
It will first compare the leaf node of both trees , and it will find the
match " Malibu " it will store the node in the data structure, in
which embedding will be stored.
It will also compute the values of the parent in each run and more
over two checks are called in each run i.e. node check and leaf
node check.
Once the node check or leaf node check is called the value of the
lists is updated and maintained in the manner described in the
research document.
Once a node is discovered as match, the tree matching procedure
will be called for each node in the sub tree of the discovered tree.
After Malibu the control of flow will go the parent of node that is
City, now the list of Malibu node in the query tree will store the
City of the document tree as the parent.
In a similar manner in every cycle, the list is updated with proper
ancestor and descendant values.\
At the end the list storing the match will get the result { Star,
Movie, City, Name,}
Analysis[1] :
This algorithm is almost the same as the previous one, but with
the merge operation involved, which effectively reduces the size
of each (v) from O(|Q|) to O(Qleaf).
Special attention should also be paid to line 7, by which we
generate a set S that contains the parent nodes of all those nodes
appearing in (vj)s (j = 1, ...,k), where vj is a child node of the
current node v.
Since the nodes in ( = merge(1, ..., k-1, k)) are left-to right
sorted (according to the nodes preorder and postorder
numbers), if there are more than one nodes in a sharing the same
parent, they must appear consecutively in the list. So each time
we insert a parent node q (of some q in ) into S, we need to
check whether it is the same as the previously inserted one.
If it is the case, q will be ignored. Thus, the size of S is also
bounded by O(Qleaf).
Computational complexity :
In this subsection, we prove the correctness of the algorithm treematching( ) and analyze its computational complexities.
Proposition 1: Let v be a node in T. Then, for each q in (v) generated
by tree-matching( ), we have T[v] contains Q[q].
Proof: We prove the proposition by induction on the height of Q,
height(Q).
Basic step: When height(Q) = 1, the proposition trivially holds.
Induction step: Assume that the proposition holds for any query tree Q
with height(Q) h. We consider a query tree Q of height h + 1. Let rQ be
the root of Q. Let q1, ..., qk be the child nodes of rQ.
Then, we have height(Q[qj]) h (j = 1, ..., k). In terms of the induction
hypothesis, for each q in Q[qj] (j = 1, ..., k), if it appears in (vi) (where vi
is a child node of v), we have T[vi] contains Q[q] and (q) will be set to
be v. Especially, if T[vi] contains Q[qj] (j = 1, ..., k), we have qj (vi) and
(qj) will be set to be v before v is checked against rQ.
Obviously, if label(v) = label(rQ) and for each qj (j = 1, ..., k), (qj) is
equal to v or a descendant of v, Q can be embedded into T[v]. So rQ is
inserted into (v).
Now we consider the time complexity of the algorithm, which can be divided
into four parts:
1. The first part is the time spent on merging (v1), ..., (vk),
where vi (i = 1, ..., k) is a child node of some node v in T. This
part of cost is bounded by O(Tii leaf d Q ) = O(|T|Qleaf), where
di represents the out degree of a node vi in T.
2. The second part is the time used for generating S from a
merged -list. Since the size of the -list is bounded by
O(Qleaf), so this part of cost is also bounded by O(Qleaf).
3. The third part is the time for checking a node vi in T against
each node qj in an S. Denote Si the set of the nodes in Q, which
Since at each time point at most T leaf nodes in T are associated with a
-list, the space overhead is bounded by O(TleafQleaf).
Future Works :
Pattern Tree Generation : In this subsection more of the
techniques that came around my notice for generation of the pattern
tree and document tree to facilitate query processing.
Experiments :
Data Structures:
Binary Trees:
struct Node
{
Node *left;
int data;
string name;
Node *right;
int preorderprefix;
int postorderprefix;
};
General Trees :
struct GTreeNode
{
int val;
int NChild;
GTreeNode **Child;
}
Data Used :
Created a document tree and pattern tree and tried to implement the
algorithm on both binary and general trees with some issues in general
trees.
Each node is associated with a node and a label but the comparison is
done on the basis of the data but can also be modified according to the
label also.
Test Results :
General Trees :
Conclusion:
Survey is done on various way to differentiate and find out the way to query
the data out of the XML document and tried to implement the method
described in the paper, Result generated for binary trees, general tree is also
implemented, code executed without any faults and due to low memory ,
facing runtime error. Method described in the paper can easily replace the
language Xquery or Xpath {which are general standard for querying the xml
data out of the xml document}.
References :
Chen, Y., 2007. An Efficient Algorithm for Tree Mapping in XML Databases. J.
Comput. Sci., 3: 487-493.
World Wide Web Consortium. XML Path Language (Xpath), W3C
Recommendation, Version 1.0, Nov. 1999, http://www.w3.org/TR/xpath.
World Wide Web Consortium. Xquery 1.0: An XML Query Language, W3C
Recommendation, Version 1.0, Dec. 2001, http://www.w3.org/TR/xquery.
http://xerces.apache.org/xerces-c/
http://www.codesynthesis.com/products/xsd/
http://rapidxml.sourceforge.net/manual.html
C++ Code :
Binary Tree:
#include <iostream>
#include<vector>
#include<cstdlib>
using namespace std;
enum BOOL {FALSE,TRUE};
vector<string> g;
struct Node
{
Node *left;
int data;
string name;
Node *right;
int preorderprefix;
int postorderprefix;
};
class BST
{
public:
BST();
~BST();
void operator= ( BST &);
Node* copy(Node *);
int operator== (BST &);
int compare(Node *, Node *);
void destruct(Node *);
while(temp != NULL)
{
if(temp->data == num)
{
x = temp;
found = TRUE;
return;
}
if(temp->data > num)
{
parent = temp;
temp = temp->left;
}
else
{
parent = temp;
temp = temp->right;
}
}
}
void BST::Insert(int num,string value)
{ string n = value;
BOOL found = FALSE;
Node *parent = NULL;
Node *x = NULL;
Search(num,found,parent,x);
if(found == TRUE)
{
cout << "\n" << num << " already exists in tree" << endl;
// return 0;
}
Node *temp = new Node;
temp->data = num;
temp->left = NULL;
temp->right = NULL;
temp->name =n;
if(pHead == NULL)
{
pHead = temp;
return;
}
(parent->data > num) ? (parent->left = temp):(parent->right = temp);
}
void BST::Delete(int num)
{
BOOL found = FALSE;
Node *parent = NULL;
Node *x = NULL;
Search(num,found,parent,x);
if(found == FALSE)
{
cout << "\n" << num << " doesn't exist in tree" << endl;
return;
}
Node *xSucc = NULL;
//When the node to be deleted has two childs
if(x->left != NULL && x->right != NULL)
{
Node *t = x;
//Save the address of x in a temporary pointer variable
t
xSucc = x->right;
//To traverse in-order, first move to one right node
while(xSucc->left != NULL)//and then move to the most left leaf node
{
x = xSucc;
//New Parent of xSucc
xSucc = x->left;
//New xSucc
}
//Now x is the in-order successor of t(i.e original x)
t->data = xSucc->data; //Copy the data of successor to t
//Now we just need to delete the in-order successor of t, which is now
xSucc
//Thus the parent will change now to x
parent = x;
//and the node to be deleted will be xSucc
x = xSucc;
if(parent->left == x)
{
parent->left = x->left;
delete x;
return;
}
else
{
parent->right = x->left;
delete x;
return;
}
}
}
void BST::Display()
{
//int choice;
//cout << "\n1.Pre-order, 2.In-order, 3.Post-order\n";
//cin >> choice;
//switch(choice)
//{
//case 1:
Pre(pHead);
//break;
//case 2:
//In(pHead);
// break;
//case 3:
Post(pHead);
// break;
//default:
// cout << "Wrong choice" << endl;
}
//void BST::In(Node *q)
//{
// if(q != NULL)
// {
// In(q->left);
Post(q->left);
Post(q->right);
q->postorderprefix = post;
cout << q->data <<"\t"<<q->name<<"\t"<<"\t"<<q>preorderprefix<<"\t"<<q->postorderprefix<<endl;
post++;
}
}
return 0;
}
else
compare(p->left,q->left);
compare(p->right,q->right);
}
}
int main()
{
BST treeA;
treeA.Insert(1,"Malibu");
treeA.Insert(3,"Winnipeg");
treeA.Insert(2,"City");
treeA.Insert(5,"Address");
treeA.Insert(6,"Star");
treeA.Insert(8,"Movie");
treeA.Insert(7,"StarMovieData");
cout<<"Creating the first query treeA is "<<endl;
system("PAUSE");
//Display the entered tree
treeA.Display();
post=0;
pre=0;
//Try to insert duplicate data
//tree.Insert(10);
//Try to delete a data which doesn't exists
//tree.Delete(50);
//Delete an existing data
//cout << "Deleteing 10" << endl;
//tree.Delete(10);
//Has two children
//Now Display
//tree.Display();
//cout << "Deleteing 1" << endl;
//tree.Delete(1);
//Has No children
//Now Display
//tree.Display();
//cout << "Deleteing 2" << endl;
//tree.Delete(2);
//Has one children, after deletion of 1
//Now Display
//tree.Display();
cout << "Creating the second query treeB"<<endl;
system("PAUSE");
BST treeB;
treeB.Insert(1,"City");
treeB.Insert(3,"Name");
treeB.Insert(2,"Star");
treeB.Insert(5,"Movie");
treeB.Insert(4,"StarMovieData");
treeB.Display();
//tree1.Display();
// cout << "Check if two tree are equal.\n";
//if(tree1 == tree)
//{
// cout << "Tree1 and Tree are equal." << endl;
//}
//else
//{
// cout << "Tree1 and Tree are not equal" << endl;
//}
//cout << "Delete 5 from tree" << endl;
//tree.Delete(5);
//tree.Display();
cout << "Now check if treeB and tree are equal" << endl;
system("PAUSE");
if(treeB == treeA)
{
cout << "TreeB can be embedded onto TreeA " << endl;
}
else
{
cout << "TreeB cannot be embedded onto TreeA" << endl;
}
system("PAUSE");
cout<<" The largest embedding found out in the two trees is";
for(int i=0;i < g.size();i++)
{
cout<<g[i]<<"\t";
}
return 0;
}
General Tree :
#include<iostream>
#include<stdlib.h>
#include<conio.h>
using namespace std;
// node structure follwos. each node has 3 parts : Value or info, No. of
children, an array of pointers (dynamic) to point to n child nodes
struct GTreeNode
{
int val;
int NChild;
GTreeNode **Child;
}**Root=NULL;
void CreateGeneralTree(GTreeNode**,int);
void ShowGeneralTree(GTreeNode*);
//will show details of all
nodes, one by one, not working fine.
int main()
{
int i, val, n;
//GTreeNode *NewNode ;
//
clrscr();
//Create Root Node.
cout<<"\nEnter Root Val " ;
cin>>val ;
cout<<"Enter No. of Children " ;
cin>>n;
//NewNode=(GTreeNode *)malloc(sizeof(GTreeNode)*n);
GTreeNode *NewNode = new GTreeNode;
NewNode->val=val;
NewNode->NChild=n; //new node has n children
for(i=0;i<n;i++)
//NewNode->Child[i] = NULL;
NewNode->Child[i]=(GTreeNode*)NULL; //initiall ymake
them all Null
Root=&NewNode; // root points to newnode. root is root of
tree.
CreateGeneralTree(Root,n); //now get details of n childern of
root & their children
ShowGeneralTree(NewNode);
return 0 ;
}
void CreateGeneralTree(GTreeNode **r, int n)
{
int i,k, m;
char ch ;
//allocate memory for n children.
(*r)->Child=(GTreeNode**)malloc(n*sizeof(GTreeNode));
//GTreeNode *temp = new GTreeNode;
////(*r)->Child=(GTreeNode**)new(sizeof(GTreeNode));
//Enter values of Child Nodes.
cout<<"\nNode Created at : "<< r<<"\n";
for(i=0;i<n;i++)
{
cout<<"\tEnter value for Child "<< i << " : ";
//cin>>*temp->Child[i]->val;
//*temp->Child[i]->NChild=0;
//*temp->Child[i]->Child=NULL;
cin>>(*r)->Child[i]->val;
(*r)->Child[i]->NChild=0;
(*r)->Child[i]->Child=NULL;
}
//ShowGTNode(*temp);
ShowGTNode(*r); //check is it ok.
cout<<"\nDo You Wish to Enter Info of Child Nodes : ";
ch=getch();
if(ch=='y' || ch=='Y')
{
for(k=0;k<n;k++)
{
cout<<"\n "<<k<<" Enter No. of Children of " << (*r)>Child[k]->val<<" : ";
cin>>m;
(*r)->Child[k]->NChild=m ;
CreateGeneralTree(&((*r)->Child[k]),m);
//recurisive
}
}
}
void ShowGTNode(GTreeNode *R)
{
int i, n=R->NChild;
cout<<"\nNo. Of Children : "<<n ;
for(i=0;i<n;i++)
cout<<R->Child[i];
}