
International Journal of Computer Information Systems, Vol. 3, No. 3, 2011

Architecture of Deep Web: Surfacing Hidden Value


Suneet Kumar
Computer Science Dept., Dehradun Institute of Technology, Dehradun, India
Suneetcit81@gmail.com
Abstract— Focused web crawlers have recently emerged as an alternative to the well-established web search engines. While well-known focused crawlers retrieve relevant web pages, various applications target whole websites instead of single pages; companies, for example, are represented by websites, not by individual web pages. To answer queries targeted at websites, web directories are an established solution. In this paper we introduce a novel focused website crawler that employs the paradigm of focused crawling for the search of relevant websites. The proposed crawler is based on a two-level architecture and corresponding crawl strategies with an explicit concept of websites: the external crawler views the web as a graph of linked websites, selects the websites to be examined next and invokes internal crawlers, while each internal crawler views the web pages of a single given website and performs focused (page) crawling within that website. Our experimental evaluation demonstrates that the proposed focused website crawler clearly outperforms previous methods of focused crawling adapted to retrieve websites instead of single web pages. The Deep Web, as a rich and largely unexplored data source, has become an important research topic. In previous years, data extraction from Web pages has received a lot of attention, and much experience has also been accumulated in the integration of traditional, relational databases. Today these research areas converge, leading to the development of systems for Deep Web data extraction and integration. Several approaches have been proposed and many systems have been built that enable extraction and integration of Deep Web data. In this paper we propose a classification framework that allows different approaches to be compared on the basis of the full model of the data extraction and integration process. We classify the most important systems reported in the literature according to the proposed framework and evaluate them with respect to their capabilities and coverage of the process. We conclude with a refinement of the architecture that would cover the complete data extraction and integration process for Web sources.

Keywords- Deep Web; link references; searchable databases; site page-views.

Virender Kumar Sharma
Electrical Engineering Dept., Bhagwant Group of Institutions, Muzaffarnagar, India
Viren_krec@yahoo.com

I. INTRODUCTION (THE DEEP WEB)

The importance of the Deep Web (DW) has grown substantially in recent years, not only because of its size [12], but also because Deep Web sources arguably contain the most valuable data as compared to the so-called Surface Web [4]. An overlap analysis between pairs of search engines conducted in 2001 [4] estimated that there exist about 200,000 Deep Web sites, providing access to 7,500 TB of data. Further studies (2004) estimated the number of Deep Web sites at slightly more than 300,000, providing access to about 450,000 databases through 1,260,000 query interfaces [12]. A lot of research in previous years has been devoted to information extraction from the Web (IEW) [22]. Data integration (DI) problems, including schema mapping, description of query capabilities, and data translation and cleaning, have also been studied extensively [23, 16]. Today these research areas converge, leading to the development of systems for Deep Web data extraction and integration (DWI). The Deep Web poses new challenges to data extraction as compared with the Surface Web, and new problems, unknown in traditional databases, also arise for data integration. Therefore new architectural requirements emerge for Deep Web data integration systems, as summarized in Table 1.

Challenge                                            Resulting requirement
Semi-structural data model                           Flexible data extraction, un-typed and weakly typed data
Heterogeneous data schemas                           Support for different schemas and query translation
Data dispersion                                      Scalability, source selection
Source incompleteness                                Effective query rewriting, relaxed constraints
Large amount of data, infeasibility of full          Selective querying
materialization
Heterogeneous data access interfaces                 Flexible data access layer
Restricted query capabilities                        Query rewriting based on query capabilities
Dynamic changes in data presentation technology      Extensibility, modular architecture

Table 1. Characteristics of data integration (DI), information extraction from the Web (IEW) and Deep Web data integration (DWI).

In this paper we describe a classification framework that allows different approaches to be compared on the basis of the full model of the data extraction and integration process. We also propose a refinement of the architecture models that would cover the complete data extraction and integration process for Web sources.

II. DEEP WEB DATA EXTRACTION AND INTEGRATION PROCESS

In the integration process two continuously executed phases can be distinguished: system preparation, that is, the creation and continuous update of the resources required for usage of the system, and system usage, that is, the actual lazy or eager data integration from dispersed sources. The process is depicted in Fig 1.

Fig 1. Deep Web data extraction and integration process. (The figure shows the preparation-phase activities: discover Web sources, describe individual Web sources, integrate schemas, define navigation and wrappers for individual sources; the usage-phase activities: select Web sources, plan and translate query, execute query, consolidate data; and the shared resources connecting them: list of Web sources, content descriptions, schema descriptions, schema mappings, querying capabilities descriptions, navigation model descriptors, data extraction rules and data translation rules. Data flow and control flow are marked separately.)

During the system preparation phase, first the sources need to be discovered. This step is currently assumed to be done by human beings: the list of sources is usually provided to the integration system. Subsequently the sources need to be described: their schema, query capabilities, quality and coverage need to be provided for further steps in the integration process. Most importantly, schema descriptions are used in the next step to map the schemas either to one another or to some global schema. The last part of system preparation is the definition of navigation and extraction rules that allow the system to get to the data through a source's Web interface. These resources are used once a query needs to be answered: the sources to ask are selected based on their subject and coverage, a plan is generated using the schema mappings and source query capabilities, and the query is executed through the Web interface using the navigation models for particular queries and the data extraction rules defined for them. Once data are extracted from HTML pages, they need to be translated and cleaned according to the data translation rules. The overall process includes all the activities necessary to integrate data from Web databases identified so far in the literature, and adds a few steps (discovery of Web sources, declarative description of the navigation model) that have been neglected so far. As such it can serve as a basis for a generic data integration system.
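To make the usage phase concrete, the following minimal Python sketch walks through the query-answering steps just described: source selection based on content descriptions, query rewriting against source query capabilities, execution through a wrapper's navigation model and extraction rules, and translation of the extracted records. All class and function names are illustrative assumptions, not part of any system surveyed in this paper.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SourceDescription:
    """Resources produced during the preparation phase for one Deep Web source."""
    name: str
    subjects: set[str]                       # content / coverage description
    capabilities: set[str]                   # attributes the source can be queried on
    navigation_model: Callable[[dict], str]  # how to reach the data (returns raw HTML)
    extraction_rules: Callable[[str], list[dict]]
    translation_rules: Callable[[dict], dict]

def select_sources(sources, query_subject):
    """Source selection: keep sources whose described content covers the query subject."""
    return [s for s in sources if query_subject in s.subjects]

def plan_query(source, query):
    """Query rewriting: keep only the conditions the source can answer;
    the rest must be applied after the data is unified."""
    supported = {k: v for k, v in query.items() if k in source.capabilities}
    post_filter = {k: v for k, v in query.items() if k not in source.capabilities}
    return supported, post_filter

def execute(source, supported_query):
    """Wrapper execution: navigate to the data, extract and translate records."""
    html = source.navigation_model(supported_query)
    records = source.extraction_rules(html)
    return [source.translation_rules(r) for r in records]

def answer_query(sources, query_subject, query):
    """Usage phase: select sources, plan, execute, consolidate."""
    results = []
    for source in select_sources(sources, query_subject):
        supported, post_filter = plan_query(source, query)
        for record in execute(source, supported):
            if all(record.get(k) == v for k, v in post_filter.items()):
                results.append(record)
    return results
```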

III. OPERATIONAL SOURCE SYSTEMS

These are the operational systems of record that capture the transactions of the business. The source systems should be thought of as outside the data warehouse, because there can be little or no control over the content and the format of the data in these operational legacy systems. The operational systems contain little or no historical data, and if a data warehouse is available, the source systems can be relieved of much of the responsibility for representing the past. Each source system is often a natural stovepipe application, where little investment has been made in sharing common data such as product, customer, geography or calendar with other operational systems in the organization.

IV. CLASSIFICATION OF DATA EXTRACTION AND INTEGRATION APPLICATIONS

We identified the 13 systems most prominently referred to in the subject literature (AURORA [17], DIASPORA [19], Protoplasm [15], MIKS [2], TSIMMIS [5], MOMIS [1, 3], GARLIC [18], SIMS [6], Information Manifold [11], Infomaster [14], DWDI [13, 9], PICSEL [21], Denodo [20]) to base our architecture on the approaches reported to date. We used the following set of criteria to evaluate the architectural decisions of the existing systems (the set arose as an outcome of the requirements posed by the integration process and the intended generality of the proposed architecture):
- does the system support source discovery,
- can source content / quality be described,
- what are the means for schema description and integration,
- is source navigation present as a declarative model or hard-coded,
- how are querying capabilities expressed,
- how are extraction rules defined,
- how are data translation rules supported,
- is query translation available,
- can the system address Deep Web challenges,
- can the system be distributed,
- what is the supported type of information (unstructured / semi-structured / structured).

After studying the literature the following conclusions were drawn (we restrict ourselves to conclusions only, due to lack of space to present the detailed analysis results):
- Almost none of the systems supports source discovery and quality description (only PICSEL supported the latter, with manual editing).
- Schema description usually required manual annotation.
- There was a wide spectrum of approaches to schema integration (from purely manual to fully automatic), with a variety of reported effectiveness for automatic integration. Most noteworthy, older systems tried to approach this issue automatically, while newer ones rely on human work.
- Source navigation was usually hard-coded in wrappers for particular sources.
- Extraction and data transformation rules were either hard-coded or (rarely) learned by example and represented declaratively.
- Only a few systems (Protoplasm, Denodo, AURORA) were applicable to the Deep Web out of the box. Most of the other systems would require either serious (including underlying model refinement) or minor (technical) modifications.
- Most of the systems were intended for structured (usually relational) information, and few of them could work as distributed systems (MIKS).
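The classification framework itself can be summarized as a simple record of the criteria above. The sketch below is a hypothetical encoding in Python (the field names and the example values are ours, not taken from the analysis) meant only to show how a surveyed system could be profiled against the criteria.

```python
from dataclasses import dataclass

@dataclass
class SystemProfile:
    """One surveyed system described against the classification criteria."""
    name: str
    supports_source_discovery: bool
    content_quality_describable: bool
    schema_integration: str          # "manual", "semi-automatic" or "automatic"
    declarative_navigation: bool     # declarative navigation model vs. hard-coded wrappers
    query_capabilities_described: bool
    extraction_rules: str            # "hard-coded" or "learned / declarative"
    data_translation_rules: bool
    query_translation: bool
    deep_web_ready: bool
    distributable: bool
    information_type: str            # "unstructured", "semi-structured" or "structured"

# Purely illustrative entry; the values do not describe any real system from the survey.
example = SystemProfile(
    name="ExampleMediator",
    supports_source_discovery=False,
    content_quality_describable=False,
    schema_integration="manual",
    declarative_navigation=False,
    query_capabilities_described=True,
    extraction_rules="hard-coded",
    data_translation_rules=True,
    query_translation=True,
    deep_web_ready=False,
    distributable=False,
    information_type="structured",
)
```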

V. CONCLUSION

Based on the analysis of the existing systems and the complete Deep Web data integration process presented above, we propose an extensible architecture for a Deep Web data integration system. The factors that motivated us and influenced the architecture design were the following: we strived to achieve maximal generality and extensibility of the architecture and wanted to cover all the steps in the data integration process. We designed the architecture in a way that allows for experimentation with and comparison of various techniques that can be applied during the data integration process. The list of components in the architecture (see Fig 2) includes:

Crawler (source discovery module): used in the system preparation phase to discover source candidates. It performs Web form parsing and provides a preliminary description of the sources.

Automatic source description module: another preparation-phase module, used to describe a source's schema, content (in a more detailed way than the crawler) and quality. This module is also responsible for providing the navigation model for the source (how to reach the data through the Web interface, once we know what to ask about), which can be obtained by learning and generalization. The last responsibility of this module is to create data extraction rules (by discovery).

Source description editor: this module is used for modifying the list of sources, their schemas and descriptions. Recording, manual specification or editing of the navigation model and data extraction rules is also a functionality of this module.

Automatic schema mapping module: it provides (during the preparation phase) means for grouping or clustering of sources and for discovery of schema matchings and data translation rules.

Schema mappings editor: the respective editor for the declarative mappings, clusters and data translation rules provided by the previous module.

Business-level data logic: this module is essentially external to the system; it can be applied once the data is retrieved and integrated, to conduct domain-dependent data cleansing, record grouping, ranking and similarity measurement, and even data mining tasks. The logic here is somewhat external to the data integration logic, even though there may be scenarios where parts of the business logic fit better in the last parts of the data consolidation process.

Data mediator: used in the query phase, it performs a union on data from several sources and executes post-union operations from the query plan. It may also maintain mediator-specific query cost statistics.

Query planner: responsible for source selection (it evaluates, based on the source descriptions, which sources should actually be considered for querying) and for query optimization based on cost statistics. An important feature of the query planner in data integration (as opposed to traditional database query planners) is query rewriting for specific sources with respect to their query capabilities. The planner has to take the sources' limitations into account and may schedule some operations to be performed after the data is unified.

Wrappers (query execution modules): used to execute the navigation plan for a particular query. There is a one-to-one mapping between the query types supported by a source and the navigation models for those queries; therefore wrappers simply execute navigation plans once given the queries that they should answer. During navigation they perform data extraction and translation. They can also provide source-specific cost statistics.

Many of the features executed during the preparation phase may be performed automatically; however, as visible from the component descriptions, we assume that corresponding editors also exist that allow for post-editing of the results of automated processes, to increase the system's quality of service, and that may replace the automatic solution in cases where automation of some tasks is not possible (missing technology) or not beneficial (too low reliability of the automated process). Some of the functions of the individual components listed above (e.g. schema matching [8, 7]) may be performed by several different approaches. Thus we propose that our architecture include placeholders (interfaces) supporting dynamically loaded plug-ins for several functions of subcomponents. We describe the proposed types of plug-ins below. This modular approach to the components has a direct impact on the proposed architecture's flexibility and extensibility. Another feature of our architecture is the level of parallelization or distribution: certain components may reside on several servers and be executed in parallel. Three classes of components are defined:

Single: a single instance of the component exists in the system.
Multiple: multiple instances of the component may exist (e.g. several parallel crawlers, or several mediators handling queries with workload balancing in mind).
Multiple per query: multiple instances of the component may exist to execute one query in a distributed manner.
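A compact way to read this classification is as a mapping from each component to its allowed instantiation class. The following Python sketch is illustrative only: the identifiers are ours, and the assignments follow the discussion of the individual components in this section.

```python
from enum import Enum

class Instantiation(Enum):
    SINGLE = "single instance in the system"
    MULTIPLE = "multiple instances may exist (e.g. parallel crawlers or mediators)"
    MULTIPLE_PER_QUERY = "multiple instances may cooperate on one query in a distributed manner"

# Parallelization classes of the architecture's components, as discussed in the text.
COMPONENT_CLASSES: dict[str, Instantiation] = {
    "crawler": Instantiation.MULTIPLE,
    "automatic_source_description_module": Instantiation.MULTIPLE,
    "schema_mapping_module": Instantiation.SINGLE,
    "data_mediator": Instantiation.MULTIPLE,
    "query_planner": Instantiation.MULTIPLE,
    "wrapper": Instantiation.MULTIPLE_PER_QUERY,
}

def may_distribute(component: str) -> bool:
    """True if more than one instance of the component may run at a time."""
    return COMPONENT_CLASSES[component] is not Instantiation.SINGLE

# Example: list the components that can be replicated across servers.
print([c for c in COMPONENT_CLASSES if may_distribute(c)])
```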
Fig 2. Proposed architecture for Deep Web data integration system. (The figure groups the preparation-phase components: crawler, automatic source description and schema mapping modules, source description editor and schema mappings editor, and the usage-phase components: business-level data logic, data mediator, query planner and wrapper/query execution modules. Both groups rely on individual plug-in implementations, a shared data exchange format and management component, the source descriptions and the individual data sources.)
The Crawler component may have multiple instances and is extensible in the sense that it allows for pluggable source detectors and crawling-level data source classifiers. The source description module can also be multiply instantiated. Its extension points include schema detectors (based on domain logic, type analysis etc.), content descriptor plug-ins (based on schema analysis, Web page analysis, data source probing etc.), quality measures (domain-dependent and domain-independent), and detectors of specific types of extraction rules (based on structure, context, visual parsing etc.). For the schema mapping module a single instance is sufficient, but it can be extended with different schema matchers. There can be multiple instances of data mediators and query planners (which can have pluggable source selectors, paired with corresponding content description plug-ins). The wrappers can be instantiated multiply per query and support pluggable data extractors that depend on the particular data extraction rules used for the query.
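The plug-in placeholders can be pictured as small interfaces with interchangeable implementations. The sketch below is a minimal illustration for one such extension point, schema matching; the interface, the two toy matchers and the loading function are assumptions made for illustration, not components of the proposed system or of any surveyed system.

```python
from abc import ABC, abstractmethod

class SchemaMatcher(ABC):
    """Placeholder (interface) for the schema-matching function of the mapping module."""

    @abstractmethod
    def match(self, source_schema: list[str], global_schema: list[str]) -> dict[str, str]:
        """Return a mapping from source attributes to global-schema attributes."""

class ExactNameMatcher(SchemaMatcher):
    """Trivial plug-in: match attributes whose names are identical (case-insensitive)."""
    def match(self, source_schema, global_schema):
        globals_lower = {g.lower(): g for g in global_schema}
        return {s: globals_lower[s.lower()] for s in source_schema if s.lower() in globals_lower}

class SynonymMatcher(SchemaMatcher):
    """Plug-in using a hand-made synonym table; a stand-in for learned matchers such as [7, 8]."""
    def __init__(self, synonyms: dict[str, str]):
        self.synonyms = synonyms  # source attribute name -> global attribute name
    def match(self, source_schema, global_schema):
        return {s: self.synonyms[s] for s in source_schema
                if s in self.synonyms and self.synonyms[s] in global_schema}

def load_matcher(name: str, **kwargs) -> SchemaMatcher:
    """Dynamically select the plug-in configured for the schema-mapping module."""
    plugins = {"exact": ExactNameMatcher, "synonym": SynonymMatcher}
    return plugins[name](**kwargs)

# Example: the mapping module asks the placeholder for whichever matcher is configured.
matcher = load_matcher("synonym", synonyms={"fname": "first_name", "sname": "last_name"})
print(matcher.match(["fname", "sname", "age"], ["first_name", "last_name", "birth_date"]))
```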
REFERENCES

[1] BENEVENTANO D., BERGAMASCHI S., The MOMIS methodology for integrating heterogeneous data sources. 18th IFIP World Computer Congress, 2004.
[2] BENEVENTANO D., VINCINI M., GELATI G., GUERRA F., BERGAMASCHI S., MIKS: An agent framework supporting information access and integration. AgentLink, 2003, 22-49.
[3] BERGAMASCHI S., CASTANO S., VINCINI M., BENEVENTANO D., Semantic integration of heterogeneous information sources. Data & Knowledge Engineering, 36(3), 2001, 215-249.
[4] BERGMAN M., The deep web: Surfacing hidden value. The Journal of Electronic Publishing, 7(1), 2001.
[5] CHAWATHE S., PAPAKONSTANTINOU Y., ULLMAN J., GARCIA-MOLINA H., IRELAND K., HAMMER J., WIDOM J., The TSIMMIS project: Integration of heterogeneous information sources. 10th Meeting of the Information Processing Society of Japan, 1994, 7-18.
[6] CHEE C., ARENS Y., HSU C., KNOBLOCK C., Retrieving and integrating data from multiple information sources. International Journal of Cooperative Information Systems, 2, 1993, 127-158.
[7] DOAN A., DOMINGOS P., HALEVY A., DHAMANKAR R., LEE Y., iMAP: discovering complex semantic matches between database schemas. 2004 ACM SIGMOD International Conference on Management of Data, 2004, 383-394.
[8] DOMINGOS P., HALEVY A., DOAN A., Learning to match the schemas of databases: A multistrategy approach. Machine Learning, 2003.
[9] FLEJTER D., KACZMAREK T., KOWALKIEWICZ M., ABRAMOWICZ W., Deep Web sources navigation. Submitted to the 8th International Conference on Web Information Systems Engineering, 2007.
[10] GRAVANO L., PAPAKONSTANTINOU Y., Mediating and metasearching on the Internet. IEEE Data Engineering Bulletin, 21(2), 1998, 28-36.
[11] HALEVY A., RAJARAMAN A., ORDILLE J., Querying heterogeneous information sources using source descriptions. 22nd International Conference on Very Large Data Bases, 1996, 251-262.
[12] HE B., PATEL M., ZHANG Z., CHEN-CHUAN CHANG K., Accessing the deep web. Communications of the ACM, 50(5), 2007, 94-101.
[13] KACZMAREK T., Deep Web data integration for company environment analysis (in Polish). PhD thesis, Poznan University of Economics, 2006.
[14] KELLER A., GENESERETH M., DUSCHKA O., Infomaster: an information integration system. 1997 ACM SIGMOD International Conference on Management of Data, 1997, 539-542.
[15] MELNIK S., PETROPOULOS M., BERNSTEIN P., QUIX C., Industrial-strength schema matching. SIGMOD Record, 33(4), 2004, 38-43.
[16] ORDILLE J., RAJARAMAN A., HALEVY A., Data integration: The teenage years. 32nd International Conference on Very Large Data Bases, 2006.
[17] OZSU M., LIU L., LING YAN L., Accessing heterogeneous data through homogenization and integration mediators. 2nd International Conference on Cooperative Information Systems, 1997, 130-139.
[18] PAPAKONSTANTINOU Y., GUPTA A., HAAS L., Capabilities-based query rewriting in mediator systems. 4th International Conference on Parallel and Distributed Information Systems, 1996.
[19] RAMANATH M., Diaspora: A highly distributed web-query processing system. World Wide Web, 3(2), 2000, 111-124.
[20] RAPOSO J., LARDAO L., MOLANO A., HIDALGO J., MONTOTO P., VINA A., ORJALES V., PAN A., ALVAREZ M., The DENODO data integration platform. 28th International Conference on Very Large Data Bases, 2002, 986-989.
[21] REYNAUD CH., GOASDOUE F., Modeling information sources for information integration. 11th European Workshop on Knowledge Acquisition, Modeling and Management, 1999, 121-138.
[22] TEIXEIRA J., RIBEIRO-NETO B., LAENDER A., DA SILVA A., A brief survey of web data extraction tools. SIGMOD Record, 31(2), 2002, 84-93.
[23] ZIEGLER P., DITTRICH K., Three decades of data integration - all problems solved? 18th IFIP World Computer Congress, 2004, 3-12.



AUTHORS PROFILE

Mr. Suneet Kumar received his M.Tech degree from Rajasthan Vidyapeeth University, Rajasthan, India in 2006 and is pursuing a Ph.D degree at Bhagwant University, Ajmer, Rajasthan, India. He is a member of the Technical Review Committee of the International Journal of Computer Science and Information Security, USA, and of the Editorial Review Board of the Journal of Information Technology Education, California, USA.
