
10th International Conference on Information Technology

Change Detection in Web Pages

Divakar Yadav, JIIT University, Noida (India), dsy99@rediffmail.com
A.K. Sharma, YMCA, Faridabad (India), ashokkale2@rediffmail.com
J.P. Gupta, JIIT University, Noida (India), jp.gupta@jiit.ac.in

Abstract

A large amount of new information is posted on the Web every day. Take the example of news portals, which change not only every day but every hour. How important a given piece of information is depends upon the perception of the specific user. The Internet and the World Wide Web have enabled a publishing explosion of useful online information, which has produced the unfortunate side effect of information overload: it is increasingly difficult for individuals to keep abreast of fresh information.
In this paper, we describe an approach for building a system for efficiently monitoring changes to Web documents. We discuss the mechanism that our proposed algorithm uses to discover and detect changes to web pages efficiently. Our solution for finding new information in a web page by tracking changes in the web document's structure is also discussed. In the methodology section, we present the algorithm and technique for detecting web pages that have changed, extracting changes from different versions of a web page, and evaluating the significance of those changes. Our algorithm for extracting web changes consists of three steps: document tree construction, document tree encoding, and tree matching (based upon the concept of the R.M.S. value of the content), for the detection of basically two types of changes: structural changes and content changes. It has linear time complexity and effectively extracts the changed content from different versions of a web page.

1. Introduction

The World Wide Web (the Web), as one of the most popular applications on the Internet, continues to grow at an astounding speed. Not only does the size of static web pages increase by approximately 15% per month, the number of dynamic pages generated by programs has been growing exponentially. The rapid growth of the Web has affected the ways in which fresh information is delivered and disseminated. Instead of having users track when to visit web pages of interest and identify what has changed and how, information change monitoring services are becoming increasingly popular, enabling information to be delivered while it is still fresh. Several tools are available to assist users in tracking changes in web pages of specific interest; examples are Timely Web and URLy Warning [16, 17]. In this scenario, it is natural that such tools are gaining popularity.
The Web provides access to a wide variety of information, but much of this information is fluid; it changes, moves, and occasionally disappears. Bookmarks, paths over Web pages, and catalogs like Yahoo! are examples of page collections that can become out-of-date as continuous changes are made to their components. Maintaining these collections requires that they be updated continuously. Tools to help in this maintenance require an understanding of which changes are important and which are not.
Keeping these aspects of the web in mind, in this paper we propose a scheme to detect changes in web documents. In section 3.1, we identify and discuss the types of changes that can take place in web pages. Sections 3.2 and 3.3 discuss the proposed solution for the problems identified in the previous section. We make use of the tree structure of web documents and traverse down the tree in order to detect changes in the structure as well as in the content of the web document. Instead of traversing the whole tree, we make use of the levels in the tree structure in order to detect the addition or deletion of nodes at a specific level or levels. In order to detect changes in the content, we make use of the R.M.S. value of the ASCII values of the characters of the text extracted from the web document; the text can be in the form of paragraphs, headers, or any other form.

2. Related work

Now the question which arises as soon as changes in web pages become the centre of concern is: what makes retrieving and managing web changes an effective method for retrieving new
0-7695-3068-0/07 $25.00 © 2007 IEEE
DOI 10.1109/ICIT.2007.37
information from the web? Some studies show that although the Web is growing and changing fast, the absolute amount of changed content on existing web pages during a short period is significantly smaller than the total amount of content on the Web. Two recent studies, summarized below, show the need for management of changes in web pages.
Ntoulas [18] collected a historical database for the web by downloading 154 popular Web sites (e.g., acm.org, hp.com and oreilly.com) every week from October 2002 until October 2003, for a total of 51 weeks. The average number of web pages downloaded weekly was 4.4 million. The experiments show that a significant fraction (around 50%) of web pages remained completely unchanged during the entire period studied. To measure the degree of change, they computed the shingles of each document and measured the difference of shingles between different versions of web documents. They show that many of the pages that do change undergo only minor changes in their content: even after a whole year, 50% of the changed pages are less than 5% different from their initial version.
Fetterly [19] performed a large crawl that downloaded 151 million HTML pages. They then attempted to fetch each of these 151 million pages ten more times over a span of ten weeks from Dec. 2002 to Mar. 2003. For each version of each document, they computed the checksum and shingles to measure the degree of change. The degree of change is categorized into 6 groups: complete change (no common shingles), large change (less than 30% common shingles), medium change (30%-70% common shingles), small change (70%-99% common shingles), no text change (100% common shingles), and no change (same checksum). Experiments show that about 76% of all pages fall into the groups of no text change and no change. The percentage for the group of small change is around 16%, while the percentage for the groups of complete change and large change is only 3%. These results are very supportive of our studies. They suggest that an incremental method may be very effective in updating web indexes, and that searching for new information appearing on the web by retrieving the changes will require a small amount of data processing compared to the huge size of the Web.
It has been discussed in [6] how to estimate the change frequency of a web page by revisiting the page periodically. A study [1] has been done on how often a crawler should visit a page when it knows how often the page changes. Their initial experiment tries to answer the following questions about the evolving web:
1) How often does a web page change?
2) What is the lifespan of a page?
3) How long does it take for 50% of the web to change?
4) Can we describe changes of web pages by a mathematical model?
References [1] and [2] experimentally study how often web pages change. References [1] and [5] study the relationship between the desirability of a page and its lifespan. Some papers investigate page changes to improve web caching policies. Compared with the research in active databases, the WebCQ [11] system differs in three ways: First, the WebCQ system targets monitoring and tracking changes to arbitrary web pages. Second, WebCQ monitors data provided by the content providers on remote servers, and the WebCQ monitoring and tracking service requires neither control over the data it monitors nor structural information about the data it is tracking.
A study [8] by Rocco, Buttler, and Liu has contributed to this field by developing a mechanism for efficient storage and processing of Web documents. The Page Digest design encourages a clean separation of the structural elements of Web documents from their content. Its encoding transformation produces many of the advantages of traditional string digest schemes yet remains invertible without introducing significant additional cost or complexity. Using the Page Digest encoding can provide at least an order of magnitude speedup when traversing a Web document compared to using a standard Document Object Model implementation. The experiments show that change detection using Page Digest operates in linear time, offering a 75% improvement in execution performance compared with existing systems. In addition, the Page Digest encoding can reduce the tag name redundancy found in Web documents, allowing a 30% to 50% reduction in document size. They have used depth-first traversal to parse the document tree.
Various online tools to detect and notify web page changes are available on the internet, such as TimelyWeb [16] and URLy Warning [17]. There has been a lot of research going on in this field, as it is a newer topic in the field of hypertext. Researchers are trying to increase the efficiency of change detection algorithms in order to increase the efficiency of web crawlers.

3. Proposed Methods

3.1 Classification of changes:

Broadly speaking, these changes can be classified into 4 major categories:
1. Content/Semantic changes refer to modifications of the page contents from the reader's point of view.
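The shingle-based degree-of-change measure behind these categories can be sketched as follows. This is our own minimal Python illustration, assuming 4-word shingles and simple set overlap; the implementations in [18] and [19] differ in detail:

```python
def shingles(text, w=4):
    """The set of w-word shingles (contiguous word windows) of a document."""
    words = text.split()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def degree_of_change(old_text, new_text, w=4):
    """Classify the change between two versions by the fraction of common
    shingles, following the six groups used in [19]."""
    if old_text == new_text:
        return "no change"
    old_s, new_s = shingles(old_text, w), shingles(new_text, w)
    common = len(old_s & new_s) / len(old_s | new_s)
    if common == 1.0:
        return "no text change"   # same words, e.g. markup-only difference
    if common >= 0.70:
        return "small change"
    if common >= 0.30:
        return "medium change"
    if common > 0.0:
        return "large change"
    return "complete change"

print(degree_of_change("the quick brown fox jumps",
                       "the quick brown fox jumps"))          # no change
print(degree_of_change("the quick brown fox jumps",
                       "an entirely different page here"))    # complete change
```

The thresholds (30%, 70%, 99%/100%) mirror the grouping reported above; a production system would shingle at the character or token level after stripping markup.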

For example, a page created for a soccer tournament might be continuously updated as the tournament progresses. After the tournament has ended, the page might change to a presentation about the tournament results and sports injuries.
2. Presentation/Cosmetic changes are changes related to the document representation that do not reflect changes in the topic presented in the document. For instance, changes to HTML tags can modify the appearance of a Web page while it otherwise remains the same.
3. Structural changes refer to the underlying connection of the document to other documents. As an example, consider changes in the link destinations of a "Weekly Hot Links" Web page. While this page might be conceptually the same, the fact that the destination of the links has changed might be relevant, even if the text of the links has not. Structural changes are also important to detect, as they often might not be visually perceptible.
4. Behavioral changes refer to modifications to the active components of a document. For Web pages, this includes scripts, plug-ins and applets. The consequences of these changes are harder to predict, especially since many pages hide the script code in other files.

Fig.1 Types of changes in the web pages

Structural changes can occur in the HTML or XML document (structured or semi-structured) of a specific web page; these changes can take the form of updated content (addition of a new node to the tree structure, deletion of nodes from the structure), image or link insertion/deletion, etc. Techniques like page checksum, digital fingerprinting, page mirroring, etc. are used for detecting content or semantic changes. We are using an improved version of Page Digest [8] to detect structural changes in the web page structure, i.e. in the HTML tag structure. The suggested algorithm is based on level order search, which is another form of breadth-first search: we move through the tree level by level.
The basic modules which constitute the application are:
• Document tree construction (it takes an HTML file as input, parses it with the help of a parser (as discussed later), identifies opening tags as nodes of the tree, and, maintaining the parent-child relationships, constructs the tree of the given web page);
• Parser level order child enumeration (tree traversal/parsing by level order);
• Tree mapping (finding the combined R.M.S. value of the tags and content for change detection/extraction); and
• Change detector and presenter.

Our extractor is very effective for real web data in practice and has linear scalability. The change detector works basically in two semantic phases: content changes and structural changes. We can further divide this into 3 views:
1. HTML view
2. Content view
3. Tree view
The HTML view shows and compares the old and the new version of the web page, e.g. any news portal. It extracts the HTML source code of those pages, and the changed portion is displayed in the source code itself. The content view deals with the changed portions of the contents of the web page versions. It includes text, paragraphs, headers, etc. There is an R.M.S. value calculator which calculates the R.M.S. value of the text, paragraph, or heading. The tree view traverses the document tree structure and presents the changes by comparing the number of nodes and by comparing the array structure of the HTML page; e.g. if the numbers of nodes on traversing the old and the new versions of the tree are different, that means a node has been added to or deleted from the page structure. In all, we need some fact or figure (i.e. a signature) on the basis of which we can distinguish the content of a node of the web page.
In this section, we study the types of problems that emerge in detecting new or changed information in web pages, and for each problem we present a practical solution and analyze its efficiency. The problems and solutions are:
1) How can changes in web pages be detected effectively? First we classify the problem into two broad categories: structural changes and content-based changes. We detect changes using a hypertext document tree encoding of the document. The first module renders an ordered tree structure of the input HTML web page and counts the number of nodes, i.e. the node count; if it is different, that means the structure of the web page has changed, or a node has been added or deleted. A further case may arise where there are equal numbers of nodes after comparing the older and the newer version of the HTML web page tree structure, but the order of the tags has changed or a nested structure has been added. In order to detect these types of changes, parsing plays an important role, which we have implemented after studying different existing techniques based on data structures. The technique which we have used incorporates level

order ordered tree traversal, which has the advantage of traversing the tree level by level.
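This level-by-level scheme can be sketched in Python (our own illustration; the authors' implementation is in Java). It builds a tree of opening tags with the standard `html.parser` module and derives a per-level node-count signature:

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Builds a parent/child tree of opening tags, recording each node's level."""
    def __init__(self):
        super().__init__()
        self.nodes = []   # each node: (id, tag, parent_id, level)
        self.stack = []   # ids of currently open tags
    def handle_starttag(self, tag, attrs):
        node_id = len(self.nodes) + 1
        parent = self.stack[-1] if self.stack else None
        self.nodes.append((node_id, tag, parent, len(self.stack) + 1))
        self.stack.append(node_id)
    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

def level_counts(html):
    """Return the number of nodes at each level (a level order signature)."""
    builder = TreeBuilder()
    builder.feed(html)
    counts = {}
    for _, _, _, level in builder.nodes:
        counts[level] = counts.get(level, 0) + 1
    return [counts[lvl] for lvl in sorted(counts)]

old = "<html><body><table><tr><td>x</td></tr></table></body></html>"
new = "<html><body><table><tr><td>x</td></tr><tr><td>y</td></tr></table></body></html>"
print(level_counts(old))  # [1, 1, 1, 1, 1]
print(level_counts(new))  # [1, 1, 1, 2, 2]
```

A differing count at some level immediately localizes where a node was added or deleted, without comparing the whole trees node by node.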
2) How can changes between different versions of web documents be presented/extracted effectively? We present an efficient and effective algorithm for comparing different versions of web documents. The change extractor based on this algorithm includes three phases: document tree construction, level order child enumeration (tree traversal/parsing by level order), and tree mapping (finding the combined R.M.S. value of the tags and content for change detection/extraction). Our extractor is very effective for real web data in practice and has linear scalability.
3) How should a change in structure or content be detected and evaluated? We have constructed a specialized set of arrays for HTML pages which show the relationships among the tree nodes, i.e. whether a node is a child or a parent. By analyzing the content of the arrays, we can easily determine the difference between the older version of the web page and the newer version. There are different cases in an HTML tree structure where the order of the tags plays an important role.

3.2 Change Detection:

Method for structural change detection: We consider the following sample HTML web page, whose structure is to be checked for changes using the proposed algorithm.

Fig. 2 An example web page

There may be two types of changes in web pages:
(1) Addition/deletion of a node containing a tag of the HTML page.
(2) Modification in the tag or tag value, or a change in content.
The structure of every node of the tree representing the web page shall contain the following information:
1) ID: This index stores the unique id representing each node;
2) CHILD: This index stores the information of the child;
3) PARENT: This index stores the information of the parent;
4) LEVEL: This index stores the level where the node exists;
5) CONTENT VALUE: This index stores the R.M.S. value of the characters of the content;
6) TAG NAME: This index stores the tag name.

Fig. 3 HTML Tree structure of web page

Addition/Deletion of a node (tag) in the tree

Suppose a new node is introduced into the tree, i.e. a new tag is added into the HTML code, as described below.

Fig 4. Initial structure (before addition)

Fig 5. Final structure (after addition of TR)

The following is the change in the structure:

LEVEL     INITIALLY   LATER
LEVEL 1   1           1
LEVEL 2   1           1
LEVEL 3   1           2
LEVEL 4   1           1

Initially:
LEVEL = {1 2 3 4}
ID = {1 2 3 4}
CHILD = {1 1 1 NULL}
LEVEL_ARRY = {1 1 1 1}

Later:
LEVEL = {1 2 3 3 4}
ID = {1 2 3 5 4}
CHILD = {1 2 1 NULL NULL}
LEVEL_ARRY = {1 1 2 1 1}

Now, by comparing the two sets we get the idea that the modification has been done at LEVEL 3. In order to find the modification at a given level, we use level order traversal. The algorithm for level order traversal with breadth-first search will give us the location where the change has taken place.
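The comparison of the two sets can be sketched as follows (an illustrative helper of our own, assuming LEVEL_ARRY-style per-level node counts as described above):

```python
def changed_levels(old_counts, new_counts):
    """Compare per-level node counts of two versions of a page tree and
    return the levels at which nodes were added or deleted."""
    changed = []
    max_levels = max(len(old_counts), len(new_counts))
    for level in range(max_levels):
        old_n = old_counts[level] if level < len(old_counts) else 0
        new_n = new_counts[level] if level < len(new_counts) else 0
        if old_n != new_n:
            changed.append(level + 1)   # levels are numbered from 1
    return changed

# The LEVEL_ARRY sets from the example above: the node count at
# level 3 goes from 1 to 2 after a TR node is added.
print(changed_levels([1, 1, 1, 1], [1, 1, 2, 1]))  # [3]
```

Only the reported levels then need to be traversed in detail, which is what keeps the overall cost linear rather than proportional to a full tree-to-tree comparison.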
Now, the second case is when the modification has been done in the value or content of the text. Example:

Initially <td style="TEXT-ALIGN: left">
The structure of node <td> is as given below:

Fig 6. Initial content of the arrays

Calculation of R.M.S. value before the change:
Cont_value = R.M.S. of (l, e, f, t) = R.M.S. of (108, 101, 102, 116) = 106.9170239

Now, <td style="TEXT-ALIGN: right">
The structure of node <td> is:

Fig 7. Final array structure

Calculation of R.M.S. value after the change:
Cont_value = R.M.S. of (r, i, g, h, t) = R.M.S. of (114, 105, 103, 104, 116) = 108.5375511
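The content signature above is the root mean square of the characters' ASCII codes. A small Python sketch (our own rendering of the described calculation) reproduces both worked values:

```python
def rms_value(text):
    """R.M.S. of the ASCII values of the characters in a piece of content."""
    codes = [ord(ch) for ch in text]
    return (sum(c * c for c in codes) / len(codes)) ** 0.5

# Reproduces the worked example for the td alignment value:
print(round(rms_value("left"), 7))   # 106.9170239
print(round(rms_value("right"), 7))  # 108.5375511
```

Since the signature is a single number per node, comparing two versions of a node's content is a constant-time check once the values are stored.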
For anagrams, i.e. text where the same characters occur with the same frequencies but at different positions, e.g. NAME and MANE, the R.M.S. value shall be the same. So, in order to avoid confusion, we propose a solution in which we multiply the position of each character with its ASCII value and then apply the same formula.
We have taken the example of the anagrams NAME and MANE. In the first word the positions are N=0, A=1, M=2, E=3, and in the second they are M=0, A=1, N=2, E=3. Since we are multiplying these positions with the ASCII value of each character, the R.M.S. values will be different. The formula for the R.M.S. value shall be:

Cont_value = sqrt((v1^2 + v2^2 + ... + vn^2) / n)

where vi is the (position-weighted) ASCII value of the i-th character of the content and n is the number of characters.
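The position-weighted variant can be sketched in Python (our own illustration; note that with the paper's 0-based positions the first character is multiplied by zero, so a 1-based weight would be needed to retain its contribution):

```python
def rms_value(text):
    """Plain R.M.S. of ASCII values: identical for anagrams."""
    codes = [ord(ch) for ch in text]
    return (sum(c * c for c in codes) / len(codes)) ** 0.5

def weighted_rms(text):
    """R.M.S. of position-weighted ASCII values: each character's code is
    multiplied by its 0-based position, as in the NAME/MANE example."""
    values = [i * ord(ch) for i, ch in enumerate(text)]
    return (sum(v * v for v in values) / len(values)) ** 0.5

# Anagrams share the plain R.M.S. but not the position-weighted one:
print(rms_value("NAME") == rms_value("MANE"))        # True
print(weighted_rms("NAME") == weighted_rms("MANE"))  # False
```

The weighted signature thus distinguishes reordered text that the plain signature would report as unchanged.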

4. Results

By using the above techniques, we have been able to detect structural and content-based changes successfully. Though much research has been conducted in the past to detect changes in pages, in comparison to others our technique gives linear performance, since we have used level order traversal, which saves time by avoiding traversal of the whole tree. Rather, our technique aims at traversing only the changed portion of the tree, thus saving time. We conducted our tests on some practical web pages and have been able to extract the text successfully and present the changes between the two versions of the web page, i.e. the old and the new versions.

5. Conclusion and Future Work

We presented a study of how to model web changes and how to detect them in web pages which are volatile, that is, which change almost every day. The proposed algorithm extracts changes between different versions of web pages. We have been able to detect structural as well as content-based changes by developing an efficient application in Java which scores over other techniques on the basis of simplicity and understandability. Though level order traversal is another form of breadth-first traversal, it may inherit some of the drawbacks of breadth-first traversal. First, we found that detecting changes using HTTP metadata can successfully reduce network traffic. Second, existing algorithms for extracting changes using tree edit distance have high computational cost, which is inappropriate for large-scale search engine development. We proposed a new algorithm which reduces the cost to linear using both tree encoding and level-by-level tree matching.
Tracking changes can successfully retrieve the summary but not the complete content of a newly created page. A good future direction is to integrate both changed content and newly created content into a unified search index. Such an index can have a higher coverage of new information on the web. A further thought is whether updates on popular pages, which are more likely to be updated by web authors, are of high quality for retrieval. Cho [1] proposed the concept of page quality, which is closely related to the popularity metrics of web pages. In our view, if such a quality metric can be used to evaluate the updates of web pages, then people may be able to develop
solutions to improve the quality, as well as freshness, of the web index for retrieval. We leave as a good future direction the question of how web change detection on the basis of change frequency, quality, popularity, etc. can be used in a unified framework for the web index synchronization problem.

References:

[1] Junghoo Cho, Hector Garcia-Molina, The Evolution of the Web and Implications for an Incremental Crawler, Department of Computer Science, Stanford University, CA 94305, December 2, 1999.

[2] Jenny Edwards, Kevin McCurley, John Tomlin, An Adaptive Model for Optimizing Performance of an Incremental Web Crawler, 2000.

[3] Junghoo Cho and Hector Garcia-Molina, Synchronizing a database to improve freshness, submitted for publication, 1999. http://www-db.stanford.edu/~cho/papers/cho-synch.ps

[4] Junghoo Cho, Hector Garcia-Molina, and Lawrence Page, Efficient crawling through URL ordering, In Proceedings of the 7th World-Wide Web Conference, 1998.

[5] Junghoo Cho (University of California, Los Angeles) and Hector Garcia-Molina (Stanford University), Effective Page Refresh Policies for Web Crawlers.

[6] Junghoo Cho (University of California, Los Angeles) and Hector Garcia-Molina (Stanford University), Estimating Frequency of Change.

[7] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Searching the Web, Stanford University, 1999.

[8] Daniel Rocco, David Buttler, Ling Liu, Page Digest for Large-Scale Web Services, Georgia Institute of Technology, College of Computing, Atlanta, GA 30332, U.S.A., Proceedings of the IEEE International Conference on E-Commerce (CEC'03), 2003.

[9] David Buttler, Daniel Rocco, Ling Liu, Efficient Web Change Monitoring with Page Digest, 2001.

[10] Budi Rahardjo, Roland H.C. Yap, Automatic Information Extraction from Web Pages, School of Computing, National University of Singapore, Republic of Singapore, 2001.

[11] Ling Liu, Calton Pu, Wei Tang, WebCQ – Detecting and Delivering Information Changes on the Web, 2000.

[12] Luis Francisco-Revilla, Frank M. Shipman III, Richard Furuta, Unmil Karadkar, Avital Arora, Perception of Content, Structure, and Presentation Changes in Web-based Hypertext, Center for the Study of Digital Libraries and Department of Computer Science, Texas A&M, College Station, TX 77843-3112, US, 2001.

[13] Zubin Dalal, Suvendu Dash, Pratik Dave, Luis Francisco-Revilla, Richard Furuta, Unmil Karadkar, Frank Shipman, Managing Distributed Collections: Evaluating Webpage Changes, Movement, and Replacement, Department of Computer Science and Center for the Study of Digital Libraries, Texas A&M University, College Station, TX 77843-3112, JCDL'04, June 7–11, 2004, Tucson, Arizona, USA. Copyright 2004 ACM.

[14] Latifur Khan, Lei Wang and Yan Rao, Change Detection of XML Documents Using Signatures, Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688.

[15] Shuohao Zhang, Curtis Dyreson, and Richard T. Snodgrass, Schema-Less, Semantics-Based Change Detection for XML Documents, Washington State University, Pullman, Washington, U.S.A., WISE 2004, LNCS 3306, pp. 279–290, © Springer-Verlag Berlin Heidelberg 2004.

[16] Timely Web tool, www.timelyweb.com/index.html

[17] URLy Warning, www.bleepingcomputer.com

[18] A. Ntoulas, J. Cho, and C. Olston, What's new on the web? The evolution of the web from a search engine perspective, In Proc. 13th International World Wide Web Conference, 2004.

[19] D. Fetterly, M. Manasse, M. Najork, and J. Wiener, A large-scale study of the evolution of web pages, In Proc. 12th International World Wide Web Conference, 2003.
