Project Report: University of Winnipeg

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 28

University Of Winnipeg

Web based Document Databases


Applied Computer Science
Instructor : Dr. Yangjun Chen

Project Report
Unordered Tree Matching
Harmeet Singh
Student # - 3050597

Introduction:
Problem Description :
Main aim of this report is to properly explain the problems regarding
the present XML querying strategies present in the industry. XML
Querying languages (such as Xpath, Xquery etc) use the tree modeling
approach for querying the data out of the XML Document.
Normal Approaches use extensive of join operations in the query which
makes it more complicated and moreover slow in terms of efficiency.
In this article, we discuss an efficient algorithm for tree mapping
problem in XML databases. Given a target tree T and a pattern tree Q,
the algorithm can find the embedding of Q in T in O(|T||Q|) time while
the existing approaches need exponential time in the worst case.

Motivation :
The main motivation was the fact that for every single method for
finding the embedding we have to first install a software for just for a
start.
By the method of the paper we don't need to install any proprietary
software so we can work on the problem using the open source library.
We can actually change the whole library which we want to use as to
our specification as the source code is also available and is natively
written in C++ language.
We can use this approach in xml database software as a plug-in to map a
sql query for a xml database.
This approach can be used for giving a extended support for current
Xpath and XSLT query languages.
This algorithm can be used for both DOM and SAX architecture trees
constructed dynamically using XERCES library .

Related Work :
Xerces-C++ : is a validating XML parser written in a portable subset of
c++. Xerces-C++ makes it easy to give your application the ability to read
and write XML data. A shared library is provided for parsing, generating,
manipulating, and validating XML documents using the DOM, SAX, and
SAX2 APIs.
RapidXml: is an attempt to create the fastest XML parser possible,
while retaining usability, portability and reasonable W3C
compatibility. It is an in-situ parser written in modern C++, with
parsing speed approaching that of strlen function executed on the
same data.
RapidXml has been around since 2006, and is being used by lots of
people. HTC uses it in some of its mobile phones.

CodeSynthesis XSD: is an XML Data Binding compiler for C++


developed by Code Synthesis and dual-licensed under the GNU GPL
and a proprietary license. Given an XML instance specification (XML
Schema), it generates C++ classes that represent the given
vocabulary as well as parsing and serialization code.

Xpath: is the result of an effort to provide a common syntax and


semantics
for
functionality
shared
between
XSL
Transformations .The primary purpose of Xpath is to address parts
of an XML document.
In support of this primary purpose, it also provides basic facilities
for manipulation of strings, numbers and Booleans. Xpath uses a
compact, non-XML syntax to facilitate use of Xpath within URIs and
XML attribute values. Xpath operates on the abstract, logical
structure of an XML document, rather than its surface syntax. Xpath

gets its name from its use of a path notation as in URLs for
navigating through the hierarchical structure of an XML document.

Main Thrust :
NODE SEMANTICS :
In order to use the idea proposed by the research paper, we need
to define the semantics of the nodes traversal in the project. As
extension we can modify the code to find out the reachability
between two nodes in a tree.
Firstly a tree is considered namely T and visiting T in preorder
fashion, each node v will obtain a property pre(v) to maintain the
way nodes are traversed in the tree.
In a similar way, by traversing T in postorder, each node v will
get another number post(v). These two numbers can be used to
characterize the ancestor-descendant relationships as follows.
Let v and v be two nodes of a tree T. Then, v is a descendant of v
iff pre(v) > pre(v) and post(v) <post(v).

When we compare the labels of node b with node f we can


easily analyze that b is the ancestor of f according to the above
mentioned proposition.
Now we can take the example of any 2 nodes in the figure
above to prove the correctness of the rule.
We first compare the interval sequence of "a" with "h", that is
{1,7} with {7,6}. Pre of a =1, Post of a =7,Pre of h=7,post of h
=6.
Now we put the values into the formula that is {7>1} and
{6<7}. That simply means that h is a descendant of a and vice
versa.
Whenever the identity holds true for any pair of nodes in the
tree, interval sequence of descendant is subsumed into the
interval sequence of the ancestor.
Then, u is an ancestor of v if (p, q) is subsumed by (p, q).
In addition, if p < p and q < q, u is to the left of v. That means
the descendant child is in the left sub tree of node that is the
ancestor. Now we can also associate a level number if we want
to find out the about the depth of node in the tree and use it
for our advantage. Let the level number denoted by L.
For example, if pre(v) > pre(v), post(v) < post(v) and l(v) = l(v)
+ 1, then v is a child node of v.
Algorithm:
The basic algorithm to be given works in a bottom-up way. During the
process, two data structures are maintained and computed to facilitate the
discovery of subtree matchings.
Each node v in T is associated with a set, denoted (v), contains all
those nodes q in Q such that Q[q] can be imbedded into T[v], where
T[v] represents a subtree of T rooted at v.
Each q in Q is associated with a value (q), defined as follows:
Initially, for each q Q, (q) is set to . During the tree matching
process, (q) is dynamically changed as below:
1. Let v be a node in T with parent node u.
2. If q appears in (v), change the value of (q) to u.

Then, each time before we insert q into (v), we will


do
the
following checking :
(i) Check whether label(q) = label(v).
(ii) Let q1, ..., qk be the child nodes of q. For each
qi (i = 1, ..., k), check whether (qi) is equal to
v or to a descendent of v.
If both (1) and (2) are satisfied, insert q into (v).Below is a
bottom-up algorithm, working in a recursive way and taking a

node v in T as the input(which represents T[v]). Initially, the


input is the root of T. The algorithm will mark any node u in
T[v] if it finds that T[u] contains Q. In the process, two
functions are called :
node-check(u, q) :- It checks whether T[u]
contains Q[q]. If it is the case, return {q}.
Otherwise, it returns an empty set .
leaf-node-check(u) :- It returns a set of leaf nodes
in Q: {q1, ..., qk} such that for each qi (1 i k)
label(u) = label(qi).

Algorithm Description :
The algorithm tree-matching( ) searches T bottom up in a recursive way
(see line 4). During the process, for each encountered node v in T, we
first check whether it is a leaf node (see line 2).
If it is a leaf node, the function leaf-node-check( ) is called (see line 12),
by which all the matching leaf nodes in Q will be stored in a temporary
variable S2 that will be added to (v) (see line 13).
If v is an internal node, lines 3 - 10 are first conducted and then the
function leaf-node-check( ) is invoked (see line 12).
By executing line 4, tree matching() is recursively called for each child
node vi of v.
After that, for each q appearing in (vi), its value is set to be v (see line
7).
In addition, qs parent is inserted into S, a temporary valuable to be used
in a next step Since (vi)s will not be used any more after this step,
they are simply removed (see line 8).
By executing lines 9 - 10, we check, for each q in S, whether v matches
q by calling node-check( ), in which the values of qs child nodes are
utilized to facilitate the checking's (see lines 3 - 5 in node-check( )).

Example :
Main XML Document:

Query Tree :

As we know the process will start with comparing the leaf nodes
as the algorithm works in a bottom up approach.
It will first compare the leaf node of both trees , and it will find the
match " Malibu " it will store the node in the data structure, in
which embedding will be stored.

It will also compute the values of the parent in each run and more
over two checks are called in each run i.e. node check and leaf
node check.
Once the node check or leaf node check is called the value of the
lists is updated and maintained in the manner described in the
research document.
Once a node is discovered as match, the tree matching procedure
will be called for each node in the sub tree of the discovered tree.
After Malibu the control of flow will go the parent of node that is
City, now the list of Malibu node in the query tree will store the
City of the document tree as the parent.
In a similar manner in every cycle, the list is updated with proper
ancestor and descendant values.\
At the end the list storing the match will get the result { Star,
Movie, City, Name,}

Analysis[1] :
This algorithm is almost the same as the previous one, but with
the merge operation involved, which effectively reduces the size
of each (v) from O(|Q|) to O(Qleaf).
Special attention should also be paid to line 7, by which we
generate a set S that contains the parent nodes of all those nodes
appearing in (vj)s (j = 1, ...,k), where vj is a child node of the
current node v.
Since the nodes in ( = merge(1, ..., k-1, k)) are left-to right
sorted (according to the nodes preorder and postorder
numbers), if there are more than one nodes in a sharing the same
parent, they must appear consecutively in the list. So each time
we insert a parent node q (of some q in ) into S, we need to
check whether it is the same as the previously inserted one.
If it is the case, q will be ignored. Thus, the size of S is also
bounded by O(Qleaf).

Computational complexity :
In this subsection, we prove the correctness of the algorithm treematching( ) and analyze its computational complexities.
Proposition 1: Let v be a node in T. Then, for each q in (v) generated
by tree-matching( ), we have T[v] contains Q[q].
Proof: We prove the proposition by induction on the height of Q,
height(Q).
Basic step: When height(Q) = 1, the proposition trivially holds.
Induction step: Assume that the proposition holds for any query tree Q
with height(Q) h. We consider a query tree Q of height h + 1. Let rQ be
the root of Q. Let q1, ..., qk be the child nodes of rQ.
Then, we have height(Q[qj]) h (j = 1, ..., k). In terms of the induction
hypothesis, for each q in Q[qj] (j = 1, ..., k), if it appears in (vi) (where vi
is a child node of v), we have T[vi] contains Q[q] and (q) will be set to
be v. Especially, if T[vi] contains Q[qj] (j = 1, ..., k), we have qj (vi) and
(qj) will be set to be v before v is checked against rQ.
Obviously, if label(v) = label(rQ) and for each qj (j = 1, ..., k), (qj) is
equal to v or a descendant of v, Q can be embedded into T[v]. So rQ is
inserted into (v).
Now we consider the time complexity of the algorithm, which can be divided
into four parts:
1. The first part is the time spent on merging (v1), ..., (vk),
where vi (i = 1, ..., k) is a child node of some node v in T. This
part of cost is bounded by O(Tii leaf d Q ) = O(|T|Qleaf), where
di represents the out degree of a node vi in T.
2. The second part is the time used for generating S from a
merged -list. Since the size of the -list is bounded by
O(Qleaf), so this part of cost is also bounded by O(Qleaf).
3. The third part is the time for checking a node vi in T against
each node qj in an S. Denote Si the set of the nodes in Q, which

are checked against vi. We estimate this part of cost by the


following sum: O( TiSjji c ) = O(|T|Qleaf), where cj represents
the out degree of a node qj in Si.
4. The fourth part is the time for checking each node in T
against the leaf nodes in Q. Obviously, this part of cost is
bounded by O(Tileaf Q ) = O(|T|Qleaf).
In terms of the above analysis, we have the following proposition:
Proposition 2: The time complexity of tree-matching( ) is bounded by
O(|T|Qleaf).
Proof: See the above discussion.

Since at each time point at most T leaf nodes in T are associated with a
-list, the space overhead is bounded by O(TleafQleaf).

Future Works :
Pattern Tree Generation : In this subsection more of the
techniques that came around my notice for generation of the pattern
tree and document tree to facilitate query processing.

Use of Xerces library can help in querying the document as the


tree and query tree would be generated dynamically.

Which we could use during the run time processing of the


document.
The Main algorithm can be substantially improved by elaborating the
construction of (v)s. First, we notice that in the case that v is a leaf
node in T, (v) is a set of the leaf nodes in Q, which match v.
Modifying the library according to the type of xml document can speed
up the querying process as library has the general definitions of W3
Consortium.

Experiments :
Data Structures:
Binary Trees:
struct Node
{
Node *left;
int data;
string name;
Node *right;
int preorderprefix;
int postorderprefix;
};

General Trees :
struct GTreeNode
{
int val;
int NChild;
GTreeNode **Child;
}

Data Used :
Created a document tree and pattern tree and tried to implement the
algorithm on both binary and general trees with some issues in general
trees.
Each node is associated with a node and a label but the comparison is
done on the basis of the data but can also be modified according to the
label also.

Test Results :

General Trees :

Conclusion:
Survey is done on various way to differentiate and find out the way to query
the data out of the XML document and tried to implement the method
described in the paper, Result generated for binary trees, general tree is also
implemented, code executed without any faults and due to low memory ,
facing runtime error. Method described in the paper can easily replace the
language Xquery or Xpath {which are general standard for querying the xml
data out of the xml document}.

References :
Chen, Y., 2007. An Efficient Algorithm for Tree Mapping in XML Databases. J.
Comput. Sci., 3: 487-493.
World Wide Web Consortium. XML Path Language (Xpath), W3C
Recommendation, Version 1.0, Nov. 1999, http://www.w3.org/TR/xpath.

World Wide Web Consortium. Xquery 1.0: An XML Query Language, W3C
Recommendation, Version 1.0, Dec. 2001, http://www.w3.org/TR/xquery.
http://xerces.apache.org/xerces-c/
http://www.codesynthesis.com/products/xsd/
http://rapidxml.sourceforge.net/manual.html

C++ Code :
Binary Tree:
#include <iostream>
#include<vector>
#include<cstdlib>
using namespace std;
enum BOOL {FALSE,TRUE};
vector<string> g;
struct Node
{
Node *left;
int data;
string name;
Node *right;
int preorderprefix;
int postorderprefix;
};
class BST
{
public:
BST();
~BST();
void operator= ( BST &);
Node* copy(Node *);
int operator== (BST &);
int compare(Node *, Node *);
void destruct(Node *);

void Search(int num, BOOL &found, Node *&parent, Node *&x);


void Insert(int num,string value);
void Delete(int num);
void Display();
void In(Node *);
void Pre(Node *);
void Post(Node *);
//void nodevisedisplay(Node *);
private:
Node *pHead;
};
BST::BST()
{
pHead = NULL;
}
BST::~BST()
{
destruct(pHead);
}
void BST::destruct(Node *p)
{
if(p != NULL)
{
destruct(p->left);
destruct(p->right);
delete p;
}
}
void BST::Search(int num, BOOL &found, Node *&parent, Node *&x)
{
found = FALSE;
parent = NULL;
x = NULL;
if(pHead == NULL)
return;
Node *temp = pHead;

while(temp != NULL)
{
if(temp->data == num)
{
x = temp;
found = TRUE;
return;
}
if(temp->data > num)
{
parent = temp;
temp = temp->left;
}
else
{
parent = temp;
temp = temp->right;
}
}
}
void BST::Insert(int num,string value)
{ string n = value;
BOOL found = FALSE;
Node *parent = NULL;
Node *x = NULL;
Search(num,found,parent,x);
if(found == TRUE)
{
cout << "\n" << num << " already exists in tree" << endl;
// return 0;
}
Node *temp = new Node;
temp->data = num;
temp->left = NULL;
temp->right = NULL;
temp->name =n;
if(pHead == NULL)

{
pHead = temp;
return;
}
(parent->data > num) ? (parent->left = temp):(parent->right = temp);
}
void BST::Delete(int num)
{
BOOL found = FALSE;
Node *parent = NULL;
Node *x = NULL;
Search(num,found,parent,x);
if(found == FALSE)
{
cout << "\n" << num << " doesn't exist in tree" << endl;
return;
}
Node *xSucc = NULL;
//When the node to be deleted has two childs
if(x->left != NULL && x->right != NULL)
{
Node *t = x;
//Save the address of x in a temporary pointer variable
t
xSucc = x->right;
//To traverse in-order, first move to one right node
while(xSucc->left != NULL)//and then move to the most left leaf node
{
x = xSucc;
//New Parent of xSucc
xSucc = x->left;
//New xSucc
}
//Now x is the in-order successor of t(i.e original x)
t->data = xSucc->data; //Copy the data of successor to t
//Now we just need to delete the in-order successor of t, which is now
xSucc
//Thus the parent will change now to x
parent = x;
//and the node to be deleted will be xSucc
x = xSucc;

// Now the problem reduces to delete a node x, which have 0 child or 1


right child
}
//When the node to be deleted has no child.
if(x->left == NULL && x->right == NULL)
{
if(parent->left == x)
{
parent->left = NULL;
delete x;
return;
}
else
{
parent->right = NULL;
delete x;
return;
}
}
//When the node to be deleted has one child.
// One right child
if(x->left == NULL && x->right != NULL)
{
if(parent->left == x)
{
parent->left = x->right;
delete x;
return;
}
else
{
parent->right = x->right;
delete x;
return;
}
}
// One left child
if(x->left != NULL && x->right == NULL)
{

if(parent->left == x)
{
parent->left = x->left;
delete x;
return;
}
else
{
parent->right = x->left;
delete x;
return;
}
}
}
void BST::Display()
{
//int choice;
//cout << "\n1.Pre-order, 2.In-order, 3.Post-order\n";
//cin >> choice;
//switch(choice)
//{
//case 1:
Pre(pHead);
//break;
//case 2:
//In(pHead);
// break;
//case 3:
Post(pHead);
// break;
//default:
// cout << "Wrong choice" << endl;
}
//void BST::In(Node *q)
//{
// if(q != NULL)
// {
// In(q->left);

// cout << q->data <<"\t"<< q->name << "\t";


// In(q->right);
// }
// cout << endl;
//}
int pre = 0;
void BST::Pre(Node *q)
{
if(q != NULL)
{
q->preorderprefix = pre;
//cout << q->data <<"\t"<< q->name << "\t" << q->preorderprefix
<<endl;
pre = pre+1;
Pre(q->left);
Pre(q->right);
}
}
int post = 0;
void BST::Post(Node *q)
{
if(q != NULL)
{

Post(q->left);
Post(q->right);
q->postorderprefix = post;
cout << q->data <<"\t"<<q->name<<"\t"<<"\t"<<q>preorderprefix<<"\t"<<q->postorderprefix<<endl;
post++;
}
}

void BST::operator= ( BST &rhs)


{
pHead = copy(rhs.pHead);
}
Node* BST::copy(Node *q)
{
if(q == NULL)
{
return NULL;
}
Node *p = new Node;
p->data = q->data;
p->left = copy(q->left);
p->right = copy(q->right);
return p;
}
int BST::operator== (BST &rhs)
{
int flag;
flag = compare(pHead,rhs.pHead);
//cout<<flag;
return flag;
}
int BST::compare(Node *p, Node *q)
{
if( p == NULL && q == NULL)
return 1;
if(p != NULL && q != NULL)
{
cout<<"The current nodes under comparison: "<<p->name<<" "<<q>name<<endl;
string s;
s=p->name;
g.push_back(s);
system("PAUSE");
if(p->data != q->data)
{
//cout<<p->data<<" "<<q->data<<endl;

return 0;
}
else
compare(p->left,q->left);
compare(p->right,q->right);
}
}
int main()
{
BST treeA;
treeA.Insert(1,"Malibu");
treeA.Insert(3,"Winnipeg");
treeA.Insert(2,"City");
treeA.Insert(5,"Address");
treeA.Insert(6,"Star");
treeA.Insert(8,"Movie");
treeA.Insert(7,"StarMovieData");
cout<<"Creating the first query treeA is "<<endl;
system("PAUSE");
//Display the entered tree
treeA.Display();
post=0;
pre=0;
//Try to insert duplicate data
//tree.Insert(10);
//Try to delete a data which doesn't exists
//tree.Delete(50);
//Delete an existing data
//cout << "Deleteing 10" << endl;
//tree.Delete(10);
//Has two children
//Now Display
//tree.Display();
//cout << "Deleteing 1" << endl;
//tree.Delete(1);
//Has No children
//Now Display

//tree.Display();
//cout << "Deleteing 2" << endl;
//tree.Delete(2);
//Has one children, after deletion of 1
//Now Display
//tree.Display();
cout << "Creating the second query treeB"<<endl;
system("PAUSE");
BST treeB;
treeB.Insert(1,"City");
treeB.Insert(3,"Name");
treeB.Insert(2,"Star");
treeB.Insert(5,"Movie");
treeB.Insert(4,"StarMovieData");
treeB.Display();
//tree1.Display();
// cout << "Check if two tree are equal.\n";
//if(tree1 == tree)
//{
// cout << "Tree1 and Tree are equal." << endl;
//}
//else
//{
// cout << "Tree1 and Tree are not equal" << endl;
//}
//cout << "Delete 5 from tree" << endl;
//tree.Delete(5);
//tree.Display();
cout << "Now check if treeB and tree are equal" << endl;
system("PAUSE");
if(treeB == treeA)
{
cout << "TreeB can be embedded onto TreeA " << endl;
}
else

{
cout << "TreeB cannot be embedded onto TreeA" << endl;
}
system("PAUSE");
cout<<" The largest embedding found out in the two trees is";
for(int i=0;i < g.size();i++)
{

cout<<g[i]<<"\t";
}
return 0;
}

General Tree :
#include<iostream>
#include<stdlib.h>
#include<conio.h>
using namespace std;
// node structure follwos. each node has 3 parts : Value or info, No. of
children, an array of pointers (dynamic) to point to n child nodes
struct GTreeNode
{
int val;
int NChild;
GTreeNode **Child;
}**Root=NULL;
void CreateGeneralTree(GTreeNode**,int);
void ShowGeneralTree(GTreeNode*);
//will show details of all
nodes, one by one, not working fine.

void ShowGTNode(GTreeNode *R);


node. ok.

//will dhow details of one

int main()
{
int i, val, n;
//GTreeNode *NewNode ;
//
clrscr();
//Create Root Node.
cout<<"\nEnter Root Val " ;
cin>>val ;
cout<<"Enter No. of Children " ;
cin>>n;
//NewNode=(GTreeNode *)malloc(sizeof(GTreeNode)*n);
GTreeNode *NewNode = new GTreeNode;
NewNode->val=val;
NewNode->NChild=n; //new node has n children
for(i=0;i<n;i++)
//NewNode->Child[i] = NULL;
NewNode->Child[i]=(GTreeNode*)NULL; //initiall ymake
them all Null
Root=&NewNode; // root points to newnode. root is root of
tree.
CreateGeneralTree(Root,n); //now get details of n childern of
root & their children
ShowGeneralTree(NewNode);
return 0 ;
}
void CreateGeneralTree(GTreeNode **r, int n)
{
int i,k, m;

char ch ;
//allocate memory for n children.
(*r)->Child=(GTreeNode**)malloc(n*sizeof(GTreeNode));
//GTreeNode *temp = new GTreeNode;
////(*r)->Child=(GTreeNode**)new(sizeof(GTreeNode));
//Enter values of Child Nodes.
cout<<"\nNode Created at : "<< r<<"\n";
for(i=0;i<n;i++)
{
cout<<"\tEnter value for Child "<< i << " : ";
//cin>>*temp->Child[i]->val;
//*temp->Child[i]->NChild=0;
//*temp->Child[i]->Child=NULL;
cin>>(*r)->Child[i]->val;
(*r)->Child[i]->NChild=0;
(*r)->Child[i]->Child=NULL;
}
//ShowGTNode(*temp);
ShowGTNode(*r); //check is it ok.
cout<<"\nDo You Wish to Enter Info of Child Nodes : ";
ch=getch();
if(ch=='y' || ch=='Y')
{
for(k=0;k<n;k++)
{
cout<<"\n "<<k<<" Enter No. of Children of " << (*r)>Child[k]->val<<" : ";
cin>>m;
(*r)->Child[k]->NChild=m ;
CreateGeneralTree(&((*r)->Child[k]),m);
//recurisive
}

}
}
void ShowGTNode(GTreeNode *R)
{
int i, n=R->NChild;
cout<<"\nNo. Of Children : "<<n ;
for(i=0;i<n;i++)
cout<<R->Child[i];
}

void ShowGeneralTree(GTreeNode *r)


{
int i,n;
n=r->NChild;
cout<<"\nAddress : " <<r<<"\tValue : "<<r->val<<"\tChildren :
"<<n<<"\t";
for(i=0;i<n;i++)
cout<<" "<<r->Child[i]->val;
getch();
for(i=0;i<n;i++)
ShowGeneralTree(r->Child[i]);
}

You might also like