46 ChangeDetectioninWebPages
For example, a page created for a soccer tournament might be continuously updated as the tournament progresses. After the tournament has ended, the page might change to a presentation about the tournament results and sports injuries.
2. Presentation/Cosmetic changes are changes related to the document representation that do not reflect changes in the topic presented in the document. For instance, changes to HTML tags can modify the appearance of a Web page while it otherwise remains the same.
3. Structural changes refer to the underlying connection of the document to other documents. As an example, consider changes in the link destinations of a "Weekly Hot Links" Web page. While this page might be conceptually the same, the fact that the destination of the links has changed might be relevant, even if the text of the links has not. Structural changes are also important to detect, as they often might not be visually perceptible.
4. Behavioral changes refer to modifications to the active components of a document. For Web pages, this includes scripts, plug-ins and applets. The consequences of these changes are harder to predict, especially since many pages hide the script code in other files.

Fig. 1 Types of changes in the web pages

Structural changes can occur in the HTML or XML document (structured or semi-structured) of a specific web page. These changes can take the form of updated content (addition of a new node to the tree structure, deletion of nodes from the structure), image or link insertion/deletion, etc. Techniques like page checksum, digital fingerprinting, page mirroring, etc. are used for detecting content or semantic changes. We use an improved version of Page Digest [8] to detect structural changes in the web page, i.e. in the HTML tag structure. The suggested algorithm is based on level order search, which is a form of breadth first search: we move through the tree level by level. The basic modules that constitute the application are:
• Document tree construction (takes an HTML file as input, parses it with the help of a parser (as discussed later), identifies the opening tags as nodes of the tree and, maintaining the parent-child relationships, constructs the tree of the given web page);
• Parser level order child enumeration (tree traversal/parsing by level order);
• Tree mapping (finding the combined R.M.S. value of the tags and content for change detection/extraction); and
• Change detector and presenter.

Our extractor is very effective for real web data in practice and has linear scalability. The change detector works in two semantic phases: content changes and structural changes. We can further divide this into 3 views:
1. HTML view
2. Content view
3. Tree view
The HTML view shows and compares the old and the new version of the web page, e.g. any news portal. It extracts the HTML source code of those pages, and the changed portion is displayed in the source code only. The content view deals with the changed portions of the contents of the web page versions. It includes the text, paragraphs, headers, etc. An R.M.S. value calculator computes the R.M.S. value of the text, paragraph or heading. The tree view traverses the document tree structure and presents the changes by comparing the number of nodes and by comparing the array structure of the HTML page; e.g. if the number of nodes found on traversing the old and the new versions of the tree differs, a node has been added to or deleted from the page structure. In all, we need some fact or figure (i.e. a signature) on the basis of which we can distinguish the content of a node of the web page.
In this section, we study the types of problems that emerge in detecting new or changed information in web pages. For each problem, we present a practical solution and analyze its efficiency. The problems and solutions are:
1) How can changes in web pages be detected effectively? First we classify the problem into two broad categories: structural changes and content-based changes. We detect changes using a hypertext document tree encoding of the document. The first module renders an ordered tree structure of the input HTML web page and counts the number of nodes; if the node count differs, the structure of the web page has changed, i.e. a node has been added or deleted. A further case may arise where the older and the newer version of the HTML web page tree have an equal number of nodes, but the order of the tags has changed or a nested structure has been added. To detect these types of changes parsing plays an important role, which we implemented after studying different existing techniques based on data structures. The technique we use incorporates level
order ordered tree traversal, which has the advantage of traversing the tree level by level.
2) How can changes between different versions of web documents be presented/extracted effectively? We present an efficient and effective algorithm for comparing different versions of web documents. The change extractor based on this algorithm includes three phases: document tree construction, level order child enumeration (tree traversal/parsing by level order), and tree mapping (finding the combined R.M.S. value of the tags and content for change detection/extraction). Our extractor is very effective for real web data in practice and has linear scalability.
3) How should a change in either structure or content be detected and evaluated? We have constructed a specialized set of arrays for HTML pages which show the relationships among the tree nodes, i.e. child or parent. By analyzing the content of the arrays, we can easily determine the difference between the older version of the web page and the newer version. There are different cases in an HTML tree structure where the order of the tags plays an important role.

3.2 Change Detection:

Method for structural change detection: We consider the following sample HTML web page whose structure is to be checked for changes using the proposed algorithm.
There may be two types of changes in web pages:
(1) Addition/deletion of a node containing a tag of the HTML page.
(2) Modification of the tag or tag value, or a change in content.
The structure of every node of the tree representing the web page shall contain the following information:
1) ID: This index stores the unique id representing each node;
2) CHILD: This index stores the information of the child;
3) PARENT: This index stores the information of the parent;
4) LEVEL: This index stores the level where the node exists;
5) CONTENT VALUE: This index stores the RMS value of the sum of the characters of the content;
6) TAG NAME: This stores the tag name.

Fig. 3 HTML Tree structure of web page

Addition/Deletion of node (Tag) in the Tree
Suppose a new node is introduced into the tree, i.e. a new tag is added into the HTML code as described below.

Fig. 5 Final structure (after addition of TR)

The following is the change in the structure:

LEVEL     INITIALLY   LATER
LEVEL 1   1           1
LEVEL 2   1           1
LEVEL 3   1           2
LEVEL 4   1           1

Initially:
LEVEL = {1 2 3 4}
ID = {1 2 3 4}
CHILD = {1 1 1 NULL}
LEVEL_ARRY = {1 1 1 1}
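The arrays used in this example can be produced by a single level order (breadth first) traversal. The following Python sketch is only an illustration under stated assumptions, not the authors' implementation: the `Node` class, the helper names `level_order_arrays`, `first_changed_level` and `build`, and the `table > tbody > tr > td` tag chain are all hypothetical, chosen so that the tree has one node per level before a second TR is added.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One tree node; the fields loosely mirror ID, TAG NAME and CHILD."""
    id: int
    tag: str
    children: List["Node"] = field(default_factory=list)


def level_order_arrays(root: Node) -> Dict[str, list]:
    """Visit the tree breadth-first and fill the per-node and per-level arrays."""
    level, ids, child, per_level = [], [], [], []
    queue = deque([(root, 1)])  # (node, level); the root sits at level 1
    while queue:
        node, depth = queue.popleft()
        level.append(depth)
        ids.append(node.id)
        child.append(len(node.children) or None)  # None stands in for NULL
        if depth > len(per_level):
            per_level.append(0)
        per_level[depth - 1] += 1  # number of nodes seen at this level
        queue.extend((c, depth + 1) for c in node.children)
    return {"LEVEL": level, "ID": ids, "CHILD": child, "LEVEL_ARRAY": per_level}


def first_changed_level(old: Dict[str, list], new: Dict[str, list]) -> Optional[int]:
    """Return the first level whose node count differs, or None if all match."""
    a, b = old["LEVEL_ARRAY"], new["LEVEL_ARRAY"]
    for i in range(max(len(a), len(b))):
        if (a[i] if i < len(a) else 0) != (b[i] if i < len(b) else 0):
            return i + 1
    return None


def build(extra_tr: bool) -> Node:
    """Hypothetical sample page: table > tbody > tr > td, optionally a second tr."""
    rows = [Node(3, "tr", [Node(4, "td")])]
    if extra_tr:
        rows.append(Node(5, "tr"))  # the newly added TR gets the next free id
    return Node(1, "table", [Node(2, "tbody", rows)])


old = level_order_arrays(build(extra_tr=False))
new = level_order_arrays(build(extra_tr=True))
print(old)
print(new)
print("first changed level:", first_changed_level(old, new))
```

Under these assumptions the sketch yields the LEVEL, ID and CHILD arrays listed for the initial and the later version, per-level counts that agree with the table (level 3 goes from 1 to 2 nodes), and level 3 as the first changed level. Comparing node counts alone cannot reveal reordered tags when the counts are equal; for that case the tag names or content signatures of the nodes at each level would also have to be compared.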
Later:
LEVEL = {1 2 3 3 4}
ID = {1 2 3 5 4}
CHILD = {1 2 1 NULL NULL}
LEVEL_ARRY = {1 1 2 1 1}

Now, by comparing the two sets, we see that the modification has been made at LEVEL 3. To find the modification at a given level, we use level order traversal. The algorithm for level order traversal with breadth first search gives us the location where the change has taken place.

Now, the second case is when the modification has been made in the value or content of the text. Example:
Initially <td style="TEXT-ALIGN: left">
Structure of node <td> is as given below;

4. Results

By using the above techniques, we have been able to detect the structural and content-based changes successfully. Although much research has been conducted in the past to detect changes in web pages, in comparison our technique gives linear performance since we use level order traversal, which saves time by avoiding traversal of the whole tree: our technique aims at traversing only the changed portion of the tree. We conducted our tests on some practical web pages and have been able to extract the text successfully and present the changes between the two related versions of the web page, i.e. the old and new versions.

proposed the concept of page quality, which is closely
solutions to improve the quality, as well as the freshness, of the web index for retrieval. We leave as future work the question of how web change detection based on change frequency, quality, popularity, etc. can be used in a unified framework for the web index synchronization problem.

References:

[1] Junghoo Cho, Hector Garcia-Molina, The Evolution of the Web and Implications for an Incremental Crawler, Department of Computer Science, Stanford, CA 94305, December 2, 1999.

[2] Jenny Edwards, Kevin McCurley, John Tomlin, An Adaptive Model for Optimizing Performance of an Incremental Web Crawler, 2000.

[3] Junghoo Cho and Hector Garcia-Molina, Synchronizing a database to improve freshness, submitted for publication, 1999. http://www-db.stanford.edu/~cho/papers/cho-synch.ps.

[4] Junghoo Cho, Hector Garcia-Molina, and Lawrence Page, Efficient crawling through URL ordering, In Proceedings of the 7th World-Wide Web Conference, 1998.

[5] Junghoo Cho, University of California, Los Angeles, California, and Hector Garcia-Molina, Stanford University, Stanford, California, Effective Page Refresh Policies for Web Crawlers.

[6] Junghoo Cho, University of California, Los Angeles, and Hector Garcia-Molina, Stanford University, Estimating Frequency of Change.

[8] Daniel Rocco, David Buttler, Ling Liu, Page Digest for Large-Scale Web Services, Georgia Institute of Technology, College of Computing, Atlanta, GA 30332, U.S.A., Proceedings of the IEEE International Conference on E-Commerce (CEC'03), 2003.

Department of Computer Science, Texas A&M, College Station, TX, 77843-3112, US, 2001

[13] Zubin Dalal, Suvendu Dash, Pratik Dave, Luis Francisco-Revilla, Richard Furuta, Unmil Karadkar, Frank Shipman, Managing Distributed Collections: Evaluating Webpage Changes, Movement, and Replacement, Department of Computer Science and Center for the Study of Digital Libraries, Texas A&M University, College Station, TX 77843-3112. JCDL'04, June 7–11, 2004, Tucson, Arizona, USA. Copyright 2004 ACM.

[14] Latifur Khan, Lei Wang and Yan Rao, Change Detection of XML Documents Using Signatures, Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688.

[15] Shuohao Zhang, Curtis Dyreson, and Richard T. Snodgrass, Schema-Less, Semantics-Based Change Detection for XML Documents, Washington State University, Pullman, Washington, U.S.A. WISE 2004, LNCS 3306, pp. 279–290, © Springer-Verlag Berlin Heidelberg 2004.

[16] Timely Web tool, www.timelyweb.com/index.html

[17] URLyWarning, www.bleepingcomputer.com

[18] A. Ntoulas, J. Cho, and C. Olston. What's new on the web? The evolution of the web from a search engine perspective. In Proc. 13th International World Wide Web Conference, 2004.

[19] D. Fetterly, M. Manasse, M. Najork, and J. Wiener. A large-scale study of the evolution of web pages. In Proc. 12th International World Wide Web Conference, 2003.