
U.ARUNDHATHI* et al.

[IJESAT] INTERNATIONAL JOURNAL OF ENGINEERING SCIENCE & ADVANCED TECHNOLOGY

ISSN: 2250-3676
Volume - 2, Special Issue - 1, 99-104

A SURVEY ON WEB PAGE SEGMENTATION AND ITS APPLICATIONS


U. Arundhathi 1, V. Sneha Latha 2, D. Grace Priscilla 3

1 M.Tech, CSE, K L University, Andhra Pradesh, India, arundhathi.ummadisetty89@gmail.com
2 Asst. Professor, CSE, K L University, Andhra Pradesh, India, sneha_2k@yahoo.com
3 M.Tech, CSE, K L University, Andhra Pradesh, India, prici.diyya@ gmail.com@lbrce.ac.in

Abstract
A web page is an HTML document that is displayed when a user types in or follows the page's address. Information extraction from the Web involves two problems: webpage structure understanding and natural language sentence processing. This paper surveys the webpage understanding problem, which consists of three subtasks: webpage segmentation, webpage structure labeling, and webpage text segmentation and labeling. It also reviews the effectiveness of leveraging layout and tag-tree structure for segmenting webpages and labeling HTML elements.

Index Terms: Webpage Segmentation, Webpage Structure Labeling, Webpage Text Segmentation and Labeling.

-----------------------------------------------------------------------***-----------------------------------------------------------------------

1. INTRODUCTION
The World Wide Web contains huge amounts of data. However, we cannot benefit much from this large volume of raw web pages unless the information within them is extracted accurately and organized well. Information Extraction (IE) plays an important role in web knowledge discovery and management. IE is the name given to any process that selectively structures and combines data which is found, explicitly stated or implied, in one or more texts. The final output of the extraction process varies; in every case, however, it can be transformed so as to populate some type of database. Information analysts working long term on specific tasks already carry out information extraction manually with the express goal of database creation.

The World Wide Web is a vast and rapidly growing repository of information, and various kinds of valuable semantic information are embedded in web pages. Some basic understanding of the structure and the semantics of web pages could significantly improve people's browsing and searching experience. Such understanding can be obtained from the web using a template-independent approach to segmenting webpages and labeling HTML elements [4]. By presenting the HTML elements according to their semantic meaning, users could save the time spent sifting information from thousands of webpages.

2. RELATED WORK:
Webpage understanding plays an important role in information retrieval from the Web. There are two main branches of work on webpage understanding: template-dependent approaches and template-independent approaches. Template-dependent approaches can generate wrappers either with supervision or without supervision. The supervised approaches take in some manually labeled webpages and learn extraction rules (i.e., wrappers) from the labeling results. Unsupervised approaches do not need labeled training samples. They first automatically discover clusters of webpages and then produce wrappers from the clustered pages. No matter how the wrappers are generated, they only work on web pages generated by the same template. Therefore, they are not suitable for general-purpose webpage understanding. In contrast, template-independent approaches can process pages from different templates. However, most of the methods in the literature can only handle special kinds of pages or specific tasks such as object block (i.e., data record) detection. Two types of search build on these techniques:

2.1 Block-based Search: We segment a webpage into semantic blocks and label the importance values of the blocks using a block importance model. The semantic blocks, along with their importance values, are then used to build block-based Web search engines.

2.2 Object-Level Vertical Search: We extract and integrate all the Web information about a real-world object/entity and generate a pseudo page for this object. These object pseudo pages are indexed to answer user queries, and users can get integrated information about a real-world object in one stop, instead of browsing through a long list of pages.
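The block-based search idea of Section 2.1 can be illustrated with a minimal Python sketch. It assumes the page has already been segmented and that a block importance model has already assigned each block a value between 0 and 1. The names `Block`, `block_score`, and `page_score`, the importance values, and the simple term-count scoring are all assumptions made for illustration, not the ranking function of any published system.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A semantic block of a page with an assumed importance value in [0, 1]."""
    text: str
    importance: float


def block_score(block: Block, query_terms: list[str]) -> float:
    """Count query-term occurrences in the block, weighted by its importance."""
    words = block.text.lower().split()
    matches = sum(words.count(term.lower()) for term in query_terms)
    return block.importance * matches


def page_score(blocks: list[Block], query_terms: list[str]) -> float:
    """Score a page as the sum of its importance-weighted block scores."""
    return sum(block_score(b, query_terms) for b in blocks)


# Hypothetical page: the importance values are invented for this example.
page = [
    Block("camera review sensor lens camera", importance=0.9),    # main content
    Block("home products camera deals", importance=0.2),           # navigation bar
    Block("copyright 2012 all rights reserved", importance=0.05),  # footer
]
print(page_score(page, ["camera"]))
```

With this weighting, a query term that appears in the main content contributes far more to the page score than the same term appearing in a navigation bar, which is the intended effect of searching at block level rather than page level.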

IJESAT | Jan-Feb 2012


Available online @ http://www.ijesat.org 99


3. Entity Relationship Search: Renlifang is a different kind of search engine, one that explores relationships between entities. In Renlifang, users can query the system about people, locations, and organizations and explore their relationships. These entities and their relationships are automatically mined from the text content on the Web. For each crawled webpage, the system extracts entity information and detects relationships, covering a spectrum of everyday individuals and well-known people, locations, and organizations. Below we list the key features of Renlifang.

Entity Relationship Mining and Navigation: Renlifang enables users to explore highly relevant information during searches to discover interesting relationships about entities associated with their query.

3.1 Expertise Finding: For example, Renlifang could return a ranked list of people known for dancing or any other topic.

3.2 Web-Prominence Ranking: Renlifang detects the popularity of an entity and enables users to browse entities in different categories ranked by their prominence on the Web during a given time period.

3.3 People Bio Ranking: Renlifang ranks text blocks from webpages by the likelihood of being biography/description blocks.

Conditional random fields (CRFs) are a framework for building probabilistic models to segment and label sequence data. They offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models.

4. Web Page Segmentation: The visual presentation of a webpage contains many useful cues for segmenting it into semantically coherent units. Generally, a webpage designer organizes the content of a webpage to make it easy to read. Thus, semantically coherent content is usually grouped together, and the entire page is divided into regions for different content using explicit or implicit visual separators such as lines, blank areas, images, font sizes, and colors [17]. Our goal is to derive this content structure from the visual presentation of a webpage.

Based on this definition, the output of webpage segmentation is the vision-tree of a webpage. Each node on this vision-tree represents a data region in the webpage, which is called a block. The root block represents the whole page. Each inner block is the aggregation of all its child blocks. All leaf blocks are atomic units (i.e., elements) and form a flat segmentation of the webpage. Since the vision-tree can effectively keep related content together while separating semantically different blocks from one another, we use it as the data representation format for webpage segmentation results. Figure 5 is a vision-tree for the page in Figure 4, where rectangles denote the inner blocks and ellipses denote the leaf blocks (or elements). Due to space limitations, the blocks denoted by dotted rectangles are not fully expanded.

Several methods have been explored to segment a web page into regions or blocks. In the DOM-based segmentation approach, an HTML document is represented as a DOM tree. In VIPS (Vision-based Page Segmentation) [17], each block is represented as a node in a tree. The root is the whole page; inner nodes are the top-level coarser blocks; child nodes are obtained by partitioning the parent node into finer blocks; and the leaf nodes together form a flat segmentation of the web page with an appropriate degree of coherence (DOC). The stopping of the VIPS algorithm is controlled by a predefined DOC (PDOC), which acts as a threshold indicating the finest granularity we are satisfied with. The segmentation stops only when the DOCs of all blocks are no smaller than the PDOC. Figure 2 shows the result of using VIPS to segment a sample CNN web page, shown in Figure 1.
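The PDOC stopping rule described in the Web Page Segmentation section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the real VIPS algorithm derives each block's DOC from visual cues, whereas here the DOC values are simply given as made-up integers, and the names `Block` and `flat_segmentation` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A node of the vision-tree; doc is its (assumed) degree of coherence."""
    name: str
    doc: int
    children: list["Block"] = field(default_factory=list)


def flat_segmentation(block: Block, pdoc: int) -> list[Block]:
    """Return the blocks at which splitting stops for the given PDOC.

    A block is kept whole once its DOC is no smaller than PDOC, or when it
    has no children left to split into.
    """
    if block.doc >= pdoc or not block.children:
        return [block]
    result: list[Block] = []
    for child in block.children:
        result.extend(flat_segmentation(child, pdoc))
    return result


# Hypothetical vision-tree with invented DOC values.
page = Block("page", 1, [
    Block("header", 8),
    Block("body", 4, [Block("article", 9), Block("sidebar", 7)]),
    Block("footer", 10),
])

for threshold in (3, 6):
    print(threshold, [b.name for b in flat_segmentation(page, threshold)])
```

Raising the threshold from 3 to 6 forces the "body" block, whose DOC is 4, to be split into its children, so a larger PDOC yields a finer segmentation.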


Figure 1: The vision-tree for webpage segmentation

5. Webpage Structure Labeling: After webpage segmentation, we have a vision-tree representation of a webpage that keeps semantically coherent content together as web blocks. The webpage structure labeling task is to assign semantic labels to the blocks on a webpage (i.e., the nodes of the vision-tree). The semantic label space differs between applications. For example, for Web object extraction, the label space consists of a label called Object Block and several labels corresponding to the individual attribute names of the object (for example, the name, image, price, and description of a product for sale). The web object extraction problem can be solved as a webpage structure labeling problem, assuming we do not need to further segment the HTML elements that are the leaf nodes of the vision-tree [18]. For the webpage main block detection application, the label space could consist of the following: Main Block, Navigation Bar, Copyright, Advertisement, etc. (see Figure 2).
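To make the labeling task concrete, the following Python sketch assigns labels from the main block detection label space to a sequence of leaf blocks using Viterbi decoding. This is a deliberate simplification: it treats the blocks as a linear chain rather than a tree, so it is not the hierarchical model of Figure 2, and every score in the example is invented for illustration.

```python
LABELS = ["Navigation Bar", "Main Block", "Copyright"]


def viterbi(emission, transition):
    """Return the highest-scoring label sequence.

    emission[i][label] is the score of giving block i that label;
    transition[(prev, cur)] is the score of cur following prev (default 0).
    """
    best = [{lab: emission[0][lab] for lab in LABELS}]
    back = []
    for scores in emission[1:]:
        column, pointers = {}, {}
        for cur in LABELS:
            # Best previous label for reaching `cur` at this position.
            prev = max(LABELS, key=lambda p: best[-1][p] + transition.get((p, cur), 0.0))
            column[cur] = best[-1][prev] + transition.get((prev, cur), 0.0) + scores[cur]
            pointers[cur] = prev
        best.append(column)
        back.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(LABELS, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical per-block scores; the second block is ambiguous on its own.
emission = [
    {"Navigation Bar": 2.0, "Main Block": 0.5, "Copyright": 0.0},
    {"Navigation Bar": 0.9, "Main Block": 1.0, "Copyright": 0.0},
    {"Navigation Bar": 0.0, "Main Block": 0.2, "Copyright": 1.5},
]
transition = {
    ("Navigation Bar", "Main Block"): 1.0,
    ("Main Block", "Copyright"): 1.0,
    ("Copyright", "Main Block"): -2.0,
}
print(viterbi(emission, transition))
```

The transition scores let the labels of neighboring blocks influence one another, which is the reason structured models are preferred over labeling each block independently.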

Figure 2: The HCRF model for webpage structure labeling

6. Webpage Text Segmentation and Labeling: Existing work on text processing cannot be directly applied to web text understanding. This is because the text content on webpages is often not as regular as that in natural language documents, and much of it consists of less grammatical text fragments. One possible method of using NLP techniques for web text understanding is to first manually or automatically identify logically coherent data blocks, and then concatenate the text fragments within each block into one string via some pre-defined ordering method. The concatenated strings are finally put into a text processing method, such as CRYSTAL or Semi-CRF, to identify target information. Two attempts have been made in this direction. It is natural to leverage the webpage structure labeling results to first concatenate the text fragments within the blocks generated by VIPS, and then use Semi-CRF to process the concatenated strings with the help of the structure labeling results. However, it would be more effective to jointly optimize the structure labeling task and the text segmentation and labeling task.

7. Integrated Webpage Understanding: We have now introduced the three subtasks of webpage understanding: webpage segmentation, webpage structure labeling, and webpage text segmentation and labeling. We argue that we need a unified model to jointly optimize these webpage understanding tasks. This is because with more semantic understanding of the text tokens we could perform better structure labeling, and with better structure labeling we could perform better page segmentation, and vice versa. We do not yet have a unified model that integrates all three subtasks, but some initial work has been done to jointly optimize two of them. The joint optimization of the webpage segmentation and structure labeling tasks improves the performance of both, and we also introduce more recent work on integrated webpage structure labeling and text segmentation and labeling.

Vision-Tree Representation: For web data extraction, the first step is to find a good representation format for webpages. A good representation can make the extraction task easier and improve extraction accuracy. In most previous work, the tag-tree, which is a natural representation of the tag structure, is commonly used to represent a webpage. However, as Cai et al. (2004) pointed out, tag-trees tend to reveal presentation structure rather than content structure, and are often not accurate enough to discriminate different semantic portions of a webpage. VIPS makes use of page layout features such as font, color, and size to construct a vision-tree for a page. It first extracts all suitable nodes from the tag-tree and then finds separators between these nodes. Here, separators denote horizontal or vertical lines in a webpage that visually do not cross any node. Based on these separators, the vision-tree of the webpage is constructed. Each node on this tree represents a data region in the webpage, which is called a block. The root block represents the whole page. Each inner block is the aggregation of all its child blocks. All leaf blocks are atomic units (i.e., elements) and form a flat segmentation of the webpage.
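The notion of a separator as a line that does not cross any node can be illustrated with a small Python sketch for the horizontal case. It assumes each node is reduced to a hypothetical (top, bottom) pixel range and looks only for geometric gaps; the actual VIPS algorithm also weighs separators using visual cues, which this sketch leaves out.

```python
def horizontal_separators(boxes):
    """Return (start, end) vertical gaps that no box overlaps.

    boxes is a list of (top, bottom) pixel ranges with top < bottom.
    """
    separators = []
    covered_until = None  # lowest pixel row covered by the boxes seen so far
    for top, bottom in sorted(boxes):
        if covered_until is not None and top > covered_until:
            # No box spans the rows between covered_until and top.
            separators.append((covered_until, top))
        covered_until = bottom if covered_until is None else max(covered_until, bottom)
    return separators


# Hypothetical node positions: a header, two overlapping body nodes, a footer.
nodes = [(0, 80), (100, 300), (120, 280), (340, 400)]
print(horizontal_separators(nodes))
```

Vertical separators can be found the same way by passing (left, right) ranges instead. The gaps found at each level then decide how a block is split into child blocks of the vision-tree.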
Record Detection and Attribute Labeling: Based on the definition of the vision-tree, we now formally define the concepts of record detection and attribute labeling.

7.1 Record detection: Given a vision-tree, record detection is the task of locating the root of a minimal subtree that contains the content of a record. For a list page containing multiple records, all the records need to be identified. For instance, in the vision-tree figure, the two blocks shown in gray are detected as data records. However, given a particular vision-tree, we are not guaranteed to find root nodes that correspond to data records. This is the very problem to be addressed by Dynamic Hierarchical Markov Random Fields.
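The phrase "root of a minimal subtree that contains the content of a record" can be made concrete with a short Python sketch. It assumes, purely for illustration, that the leaf elements belonging to a record are already known and that leaf names are unique; in practice that set is exactly what a record detection model has to infer. The names `Node`, `leaves`, and `minimal_subtree_root` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A block of the vision-tree; leaves have no children."""
    name: str
    children: list["Node"] = field(default_factory=list)


def leaves(node):
    """Return the set of leaf names under node."""
    if not node.children:
        return {node.name}
    found = set()
    for child in node.children:
        found |= leaves(child)
    return found


def minimal_subtree_root(node, record_leaves):
    """Return the deepest node whose leaves include all of record_leaves, or None."""
    if not record_leaves <= leaves(node):
        return None
    for child in node.children:
        deeper = minimal_subtree_root(child, record_leaves)
        if deeper is not None:
            return deeper
    return node


# Hypothetical list page with two product records.
page = Node("page", [
    Node("nav"),
    Node("list", [
        Node("item1", [Node("name1"), Node("price1")]),
        Node("item2", [Node("name2"), Node("price2")]),
    ]),
])
print(minimal_subtree_root(page, {"name1", "price1"}).name)
```

When the requested leaves span two records, for example "name1" and "price2", the function returns the enclosing "list" block. This mirrors the caveat in the text: a given vision-tree may not contain a node that corresponds exactly to one record.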


7.2 Attribute labeling: For each identified record, attribute labeling is the task of assigning attribute labels to the leaf blocks (elements) within the record. We can build a complete model to extract both records and attributes by sequentially combining existing record detection and attribute labeling algorithms. However, as we have stated, this decoupled strategy is highly ineffective. Therefore, we propose an integrated approach that conducts simultaneous record extraction and attribute labeling.

Integrated Web Data Extraction: Based on the above definitions, both record detection and attribute labeling are tasks of assigning labels to blocks of the vision-tree for a webpage. Therefore, we can define one probabilistic model to deal with both tasks.

8. Web Data Extraction and Integration Process: In the integration process, two continuously executed phases can be distinguished:
System preparation: creation and continuous update of the resources required for usage of the system.
System usage: actual lazy or eager data integration from dispersed sources.

During the system preparation phase, the sources first need to be discovered. This step is currently assumed to be done by human beings: the list of sources is usually provided to the integration system. Subsequently, the sources need to be described: their schema, query capabilities, quality, and coverage need to be provided for further steps in the integration process. Most importantly, schema descriptions are used in the next step to map the sources either to one another or to some global schema. The last part of system preparation is the definition of navigation and extraction rules that allow the system to get to the data through a source's Web interface.

These resources are used once a query needs to be answered: the sources to ask are selected based on their subject and coverage, the plan is generated using schema mappings and source query capabilities, and the query is executed through a Web interface using the navigation model for particular queries and the data extraction rules defined for them. Once data are extracted from HTML pages, they need to be translated and cleaned according to data translation rules. The overall process includes all the activities necessary to integrate data from Web databases identified so far in the literature, and adds a few steps (discovery of web sources, declarative description of the navigation model) that have been neglected so far. As such, it can serve as a basis for a generic data integration system.
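The usage phase described above (select sources, map schemas, extract, then translate and clean) can be outlined in Python. This is a toy sketch under several assumptions: the `Source` description, its field names, and the sample data are invented, and the navigation and HTML extraction steps are replaced by rows held in memory.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical source description produced during system preparation."""
    name: str
    subjects: set[str]
    schema_map: dict[str, str]   # source field -> global schema field
    rows: list[dict[str, str]]   # stands in for data extracted from HTML pages


def select_sources(sources, subject):
    """Pick the sources whose description covers the query subject."""
    return [s for s in sources if subject in s.subjects]


def translate(source, row):
    """Rename fields to the global schema and strip stray whitespace."""
    return {source.schema_map[k]: v.strip()
            for k, v in row.items() if k in source.schema_map}


def answer(sources, subject):
    """Run the usage phase: select sources, extract rows, translate and clean."""
    results = []
    for source in select_sources(sources, subject):
        results.extend(translate(source, row) for row in source.rows)
    return results


sources = [
    Source("shop_a", {"books"}, {"ttl": "title", "cost": "price"},
           [{"ttl": " Web Mining ", "cost": "30"}]),
    Source("shop_b", {"music"}, {"album": "title"}, [{"album": "Blue"}]),
]
print(answer(sources, "books"))
```

Keeping the schema mapping inside the source description reflects the split in the text: everything in `Source` is created during system preparation, while `answer` only consumes those resources at query time.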


CONCLUSION:
The webpage understanding problem consists of three subtasks: webpage segmentation, webpage structure labeling, and webpage text segmentation and labeling. Solutions to the problem have to be template-independent because of its web-scale nature. This survey argues for an integrated model that incorporates both structure understanding and text content understanding for effective webpage understanding.

REFERENCES
[1] J. Cowie and W. Lehnert, Information extraction, Commun. ACM, vol. 39, no. 1, pp. 80-91, 1996.
[2] C. Cardie, Empirical methods in information extraction, AI Magazine, vol. 18, no. 4, pp. 65-80, 1997.
[3] R. Baumgartner, S. Flesca, and G. Gottlob, Visual web information extraction with Lixto, in Proceedings of VLDB, 2001, pp. 119-128.
[4] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y. Ma, Simultaneous record detection and attribute labeling in web data extraction, in Proceedings of SIGKDD, 2006.
[5] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, Block-based web search, in Proceedings of SIGIR, 2004, pp. 456-463.
[6] Z. Nie, Y. Ma, S. Shi, J.-R. Wen, and W.-Y. Ma, Web object retrieval, in Proceedings of WWW, 2007, pp. 81-90.
[7] N. Kushmerick, Wrapper induction: Efficiency and expressiveness, Artif. Intell., vol. 118, no. 1-2, pp. 15-68, 2000.
[8] I. Muslea, S. Minton, and C. A. Knoblock, Hierarchical wrapper induction for semi-structured information sources, Autonomous Agents and Multi-Agent Systems, vol. 4, no. 1/2, pp. 93-114, 2001.
[9] C.-H. Chang and S.-C. Lui, IEPAD: Information extraction based on pattern discovery, in Proceedings of WWW, 2001, pp. 681-688.
[10] V. Crescenzi, G. Mecca, and P. Merialdo, RoadRunner: Towards automatic data extraction from large web sites, in Proceedings of VLDB, 2001, pp. 109-118.
[11] A. Arasu and H. Garcia-Molina, Extracting structured data from web pages, in Proceedings of SIGMOD Conference, 2003, pp. 337-348.
[12] K. Lerman, S. Minton, and C. A. Knoblock, Wrapper maintenance: A machine learning approach, J. Artif. Intell. Res. (JAIR), vol. 18, pp. 149-181, 2003.
[13] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. T. Yu, Fully automatic wrapper generation for search engines, in Proceedings of WWW, 2005, pp. 66-75.
[14] R. Song, H. Liu, J.-R. Wen, and W.-Y. Ma, Learning block importance models for webpages, in Proceedings of WWW, 2004.
[15] D. Cai, X. He, J.-R. Wen, and W.-Y. Ma, Block-level link analysis, in Proceedings of SIGIR, 2004.
[16] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, Block-based web search, in Proceedings of SIGIR, 2004.
[17] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, VIPS: A vision-based page segmentation algorithm, Microsoft Technical Report MSR-TR-2003-79, 2003.
[18] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y. Ma, Simultaneous record detection and attribute labeling in web data extraction, in Proceedings of SIGKDD, 2006.

BIOGRAPHIES
U. Arundhathi received her B.Tech. degree in Computer Science and Engineering from VelTech Multi tech Engineering College, Anna University, in 2010. She is currently pursuing an M.Tech in the Department of CSE, K L University.


V. Sneha Latha received her B.Tech degree in Electronics and Communication Engineering from JNTU and her M.Tech degree in Computer Science and Engineering from Acharya Nagarjuna University. She is now pursuing a Ph.D. and working as an Assistant Professor in the Department of Computer Science & Engineering, K L University, Vijayawada. She has 5 years of teaching experience. She has published six research papers in various international journals and five research papers in various international conferences. She has participated in seminars and workshops. She is a life member of the professional society CSI.

D. Grace Priscilla received her B.Tech. degree in Computer Science and Engineering Information from Narasaraopet Engineering College, JNTU University, in 2010. She is currently pursuing an M.Tech in the Department of Computer Science at K L University.
