Vector Space Model For Deep Web Data Retrieval and Extraction
ISSN 2278-6856
Abstract
Deep web data extraction is a challenging problem because the
structured data in deep web pages is embedded in an intricate
page structure. The extraction of data from deep web pages has
therefore received much attention from researchers. In this
research, a vector space model and content features are used
for deep web data extraction. First, the retrieved deep web
pages are taken as input to the proposed method and a
Document Object Model (DOM) tree is constructed. Using the
DOM tree, the content of each web page is split into blocks,
and each block with its contents is passed to the feature
computation process. Frequency-level, title-level and
numerical-level features are then calculated from the vector
space model, which is a vector of words and their
frequencies. Based on the feature score of every block, the
important blocks are chosen as the final useful data for the
given web page. The proposed approach is implemented using
deep web pages collected from the Complete Planet web site,
and the performance of the system is evaluated using
precision and recall.
1. INTRODUCTION
Currently, deep web databases are usually accessed through a
deep web search engine, a system used to extract the data
encoded in the underlying database. To make the encoded data
units machine-processable, deep web data extraction and
collection has become an important task in recent years, so
that useful tags can be assigned to the data units. The
literature presents several algorithms for deep web data
extraction, which has been a significant research topic for
the past decade [69]. However, most of these systems need
human intervention to find the desired information in the
sample pages, which is a tedious process if the number of web
pages retrieved for the user query is large. Automatic
extraction of web data from web pages is therefore needed to
achieve high extraction accuracy. In addition, such systems
suffer from poor scalability and are not appropriate for
applications that require extracting information from a large
number of web sources [1-5]. In this paper, the vector space
model is adapted to extract deep web data that is useful for
indexing and retrieval of web pages. The basic approach is to
split each web page into a set of blocks and pass each block
to the feature identification phase, or feature map phase.
The contents of the web pages are then analyzed block-wise,
using three different feature formulae, to find whether
important content is present. Based on the feature score
value, the important block or web data is
2. EXISTING SYSTEM
To make the data records and data items in a web database
machine-processable, which is significant for applications
such as deep Web crawling and metasearching, a structured
database should be constructed for those applications. To
accomplish this task, a vision-based approach is presented in
[1], which is independent of the Web-page programming
language. It uses different visual features of deep Web pages
for web data extraction.
Disadvantages: The approach presented in [1] can process only
deep web pages that have one data region, but most deep web
pages have a number of data regions. Also, the visual
features of Web pages are computed by calling the programming
APIs of IE, which is a time-consuming process. These two
problems are addressed in the proposed work. The first
problem is solved by identifying blocks and performing the
extraction based on the blocks of the web pages. The second
problem is solved by using the vector space model, which
reduces the time complexity significantly.
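The two ideas named above, block-wise splitting and a word-frequency vector per block, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the set of block-level tags, the class name `BlockSplitter`, and the helpers `split_into_blocks` and `term_vector` are all assumptions made for illustration, and the sketch assumes well-formed HTML in which every block tag is explicitly closed.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Assumption: these tags delimit candidate blocks; the paper does not list them.
BLOCK_TAGS = {"div", "table", "ul", "ol", "p", "section"}


class BlockSplitter(HTMLParser):
    """Collect the text under each outermost block tag as one block."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside block tags
        self.blocks = []    # finished blocks, as plain text
        self._buf = []      # text pieces of the block being read

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # An outermost block has closed: store its text.
                text = " ".join(self._buf).strip()
                if text:
                    self.blocks.append(text)
                self._buf = []

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self._buf.append(data.strip())


def split_into_blocks(html):
    """Return the text of each outermost block in the page."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def term_vector(text):
    """Vector space representation: each word mapped to its frequency."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))
```

Each block's `Counter` then plays the role of the "vector of words and its frequency" from which the block features are computed. Working on plain text and word counts avoids any call into a browser rendering API.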
3. PROPOSED
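\[ F_1 = \frac{1}{N} \sum_{i=1}^{k} \frac{f_i}{F_i} \]

\[ F_2 = \frac{1}{N} \sum_{i=1}^{k} T_i \]

\[ F_3 = \frac{1}{N} \sum_{i=1}^{k} N_i \]

\[ F_b = \frac{1}{3} \left( F_1 + F_2 + F_3 \right). \]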
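The definitions of the symbols in the three feature formulas did not survive in this copy, so the following Python sketch fixes one plausible reading and labels it as an assumption throughout: k is taken as the number of distinct words in the block, f_i as the frequency of word i in the block, F_i as its frequency in the whole page, N as the total number of words in the block, T_i as 1 when word i occurs in the page title, and N_i as 1 when word i is numeric. The function name `block_features` is hypothetical.

```python
from collections import Counter
import re


def tokens(text):
    """Lowercased alphanumeric words of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def block_features(block_text, page_text, title_text):
    """Return (F1, F2, F3, Fb) for one block under the assumed symbol meanings."""
    block_tf = Counter(tokens(block_text))   # f_i: word frequency in the block
    page_tf = Counter(tokens(page_text))     # F_i: word frequency in the page (assumed)
    title_words = set(tokens(title_text))
    n = sum(block_tf.values())               # N: total words in the block (assumed)
    if n == 0:
        return 0.0, 0.0, 0.0, 0.0

    # F1: frequency-level feature, sum of f_i / F_i over the block's words.
    f1 = sum(f / page_tf[w] for w, f in block_tf.items() if page_tf[w]) / n
    # F2: title-level feature, T_i = 1 when the word occurs in the title.
    f2 = sum(1 for w in block_tf if w in title_words) / n
    # F3: numerical-level feature, N_i = 1 when the word is a number.
    f3 = sum(1 for w in block_tf if w.isdigit()) / n
    # Fb: the block score is the mean of the three features.
    fb = (f1 + f2 + f3) / 3
    return f1, f2, f3, fb
```

Under a different reading of N or F_i the individual values change, but the structure stays the same: three per-block features averaged into a single block score.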
3.4. Identification of blocks
After the feature score has been computed for every block,
the blocks whose feature score is greater than the threshold
T are taken as the final output of this approach. These
blocks are important and can be useful for the indexing and
retrieval of web pages.
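The thresholding step can be sketched in a few lines of Python. The function name `select_blocks`, the block identifiers, and the threshold value used in the usage note are hypothetical; the text does not state the value of T.

```python
def select_blocks(block_scores, threshold):
    """Keep the blocks whose feature score is strictly greater than the threshold T.

    block_scores maps a block identifier to its feature score.
    """
    return [block for block, score in block_scores.items() if score > threshold]
```

For example, with the illustrative scores `{"b1": 0.62, "b2": 0.18, "b3": 0.45}` and an assumed threshold of 0.4, the call returns `["b1", "b3"]`. A block whose score equals the threshold is not kept, because the comparison is strict.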
\[ \text{Precision} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|} \]

\[ \text{Recall} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|} \]
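As a small illustration of the evaluation measures, the sketch below computes precision and recall over sets of block identifiers. It uses the standard definitions: precision divides the number of relevant retrieved items by the number of retrieved items, and recall divides it by the number of relevant items. The function name `precision_recall` and the block identifiers in the usage note are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)   # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

If the method returns blocks b1 to b4 and the relevant blocks are b1, b2 and b5, two of the four returned blocks are relevant, so precision is 0.5 and recall is 2/3.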
4.6.
5. CONCLUSION
This paper presented a deep web data extraction approach
that extracts deep web data useful for the indexing and
retrieval of web pages. First, web pages retrieved from the
deep web are taken as input to the proposed method, and a
DOM tree is constructed to split each web page into a set of
blocks. Then, feature words are extracted from every block
to build the vector space model, which is used to compute
the content-based features for every block. The feature
score for every block is computed based on frequency, title
matching and numerical matching. Finally, the three features
are combined into a final score value for every block, which
decides the importance of the block. For experimentation,
deep web pages are collected from the Complete Planet web
site which is
BIOGRAPHY
Dr. Poonam Yadav obtained a B.Tech in Computer Science &
Engg. from Kurukshetra University, Kurukshetra and an M.Tech
in Information Technology from Guru Govind Singh
Indraprastha University in 2002 and 2007, respectively. She
was awarded a Ph.D in Computer Science & Engg. from NIMS
University, Jaipur. She is currently working as Principal at
D.A.V College of Engg. & Technology, Kanina (Mohindergarh).
Her research interests include Information Retrieval,
Web-based retrieval and the Semantic Web. Dr. Poonam Yadav
is a lifetime member of the Indian Society for Technical
Education and her