Proceedings of International Conference on Recent Innovations in Engineering & Technology - 2015

Keyword Based Personalized Document Search on Web Using Visit of Link Algorithm
T. Balaji,
PG Scholar,
Department of Computer Science and Engineering,
Dhirajlal Gandhi College of Technology,
Salem, TN.
tbalajimecse@gmail.com

Mrs. B. Narmada,
Assistant Professor,
Department of Computer Science and Engineering,
Dhirajlal Gandhi College of Technology,
Salem, TN.
narmada@dgct.ac.in

Abstract: The main objective of this research is to make document search in publicly available search engines easier. In Google Scholar, for example, the search results for research papers are PDFs or links related to those papers. In this paper, the search results are extensively categorized, so that when users search for documents, the website provides them in formats such as pdf, docx, ppt, and others. Besides research papers, this website also provides more general documents related to the search query. User privacy is also accounted for in this paper. As an additional concept, a ranking is provided for each topic that has been searched. A methodology for the document search process based on keyword search is proposed: patterns are extracted from the keywords given by the user, and relevant documents are constructed. The next process clusters the patterns based on their meaning. Using the click-through concept, once the user clicks a document, a rank is assigned and the URL is provided based on the count.

Index Terms: Ranking, URL Counting, Document Extraction, User Interest.

I. INTRODUCTION
In general, document search is driven by the user's need to access documents in less time. From every user's point of view, documents need to be searched frequently, and the results should be meaningfully constructed. Document types include .ppt (PowerPoint presentation), .pdf (Portable Document Format), .docx (Microsoft Word 2007/2010/2013 document), and other formats. This research provides a search engine for finding documents such as .ppt, .pdf, and .docx based on user recommendation. Compare this with Google Scholar: it provides only research papers based on keywords, its results are only .pdf documents, and it provides direct .pdf links to documents for reference on the IEEE website.
A similar concept is the SlideShare website, which provides only documents of the .ppt (PowerPoint presentation) type. SlideShare also provides privacy for every user by maintaining their history, which is used to easily retrieve documents the user has already visited on that site. These concepts provide only document search on the web.
Both the Google Scholar and SlideShare websites provide document search results based on keywords. This paper provides a document search process and also provides ranking based on individual user histories. The Google search engine's PageRank is a unique concept and one of Google's ways of determining the reputation and rank of a webpage; this section covers the concept of PageRank, how PageRank is determined, and why PageRank is given to a site. The term PageRank [13] [14] [15] is a combination of two words, page and rank: page refers to any web page, and rank refers to the particular number assigned to a webpage by Google. PageRank determines the popularity and value of a particular webpage. It is a kind of vote of support for your webpage by all the other web pages on the web; the more support you get, the more value they add to your webpage, and the higher your webpage's PageRank.
The World Wide Web is a global information domain that provides the combined facility of reading and writing, and it contains a vast web of links. When a search engine handles a visitor's query, it goes through the links in the web and provides a result list, popularly known as a Search Engine Result Page (SERP). If the search engine reaches more sites through a single page, it considers that page more valuable, and thus determines the page's PageRank. Do not assume that all of this is done by human beings; rather, Google has a unique combination of advanced hardware and software and applies a specially designed PageRank algorithm when determining the PageRank of a webpage. In a real sense, Google treats a link from a page of site 1 to a page of site 2 as a vote by the page of site 1 for the page of site 2. But the number of links alone does not determine the PageRank; Google also looks at whether the linking pages have some value. If a page has some good links, then Google will obviously give more PageRank to that particular page.
Hence, a good website should get a higher PageRank than a mediocre site. PageRank is determined not only by the links a page receives but also by some other aspects. We surf the web for necessary information, and if our query does not meet the right information in a SERP, we will obviously be frustrated. To guard against this, Google also has provisions when assigning a PageRank to a page. Here comes the idea of keywords: if the visitor's query has some relevance to a certain page's information, and the content contains the exact vocabulary the visitor has asked for, then the page gets a higher PageRank. This means the content should contain relevant keywords to get a higher PageRank. But in the modern era of cut-throat competition, more often than not, many webmasters have been found resorting to spam techniques such as PageRank abuse for their survival.
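
To make the voting idea concrete, the following is a minimal sketch of the classic PageRank iteration on a toy link graph; the graph, the damping factor of 0.85, and the iteration count are illustrative assumptions for this sketch, not values taken from this paper.

    # Minimal PageRank sketch on a toy link graph (illustrative values only).
    # Each page's rank is the damped sum of the ranks of its in-linking pages,
    # with each linking page's rank split evenly among its outgoing links.

    def pagerank(links, damping=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, outgoing in links.items():
                if not outgoing:            # dangling page: spread rank evenly
                    share = damping * rank[page] / len(pages)
                    for p in pages:
                        new_rank[p] += share
                else:                       # split rank evenly among out-links
                    share = damping * rank[page] / len(outgoing)
                    for target in outgoing:
                        new_rank[target] += share
            rank = new_rank
        return rank

    toy_web = {
        "site1": ["site2", "site3"],   # site1 "votes" for site2 and site3
        "site2": ["site3"],
        "site3": ["site1"],
    }
    print(pagerank(toy_web))  # site3 accumulates the most votes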
All search engines provide search services, and Google does the same; the difference between Google and other search engines is maintained only by some unique features. Search engines revolutionized the World Wide Web, especially website marketing: several new concepts came into the sphere, and it has now become a full-fledged marketing strategy that does all kinds of marketing as we normally do in direct marketing.
But Google provides a ranking computed over all users. For example, consider the topic "Engineering College List in Tamilnadu" being searched manually on the web by any kind of user. The search engine provides a ranked result based on a weighted calculation over links and keywords and shows the user the resulting pages. But this weighted calculation covers a particular period and aggregates all the users visiting the same topic in that period.
The main objective of this paper is to provide a ranking for individual users based on their history of visited links, using the Visit of Link algorithm [13].
II. RELATED WORKS
A. Domain Terminology Extraction
During the domain terminology extraction step, domain-specific terms are extracted, which are used in the following step (concept discovery) to derive concepts. Relevant terms can be determined, e.g., by calculating TF/IDF values or by applying the C-value/NC-value method. The resulting list of terms has to be filtered by a domain expert. In the subsequent step, similarly [1] to coreference resolution in IE, the OL system determines synonyms, because they share the same meaning and therefore correspond to the same concept. The most common methods for this are clustering and the application of statistical similarity measures.
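
As a rough illustration of the TF/IDF step mentioned above, the sketch below scores the terms of one document against a toy corpus; the corpus and the simple whitespace tokenization are assumptions for the example, and a real system would add stemming, n-grams, and expert filtering.

    # Sketch of TF/IDF scoring for candidate domain terms (toy corpus).
    import math
    from collections import Counter

    corpus = [
        "data mining extracts patterns from large data sets",
        "text mining turns unstructured text into numeric indices",
        "search engines rank pages by keywords and links",
    ]

    docs = [doc.split() for doc in corpus]
    df = Counter()                       # document frequency of each term
    for tokens in docs:
        df.update(set(tokens))

    def tf_idf(term, tokens):
        tf = tokens.count(term) / len(tokens)
        idf = math.log(len(docs) / df[term])
        return tf * idf

    # Score every term of document 0; high scores suggest domain-specific terms.
    scores = {t: tf_idf(t, docs[0]) for t in set(docs[0])}
    for term, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
        print(f"{term}: {score:.3f}")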
B. Text Mining from Documents
The purpose of text mining is to process unstructured (textual) information, extract meaningful numeric indices from the text, and thus make the information contained in the text accessible to the various data mining (statistical and machine learning) algorithms. Information can be extracted to derive summaries for the words contained in the documents or to compute summaries for the documents based on the words contained in them. Hence, you can analyze words, or clusters of words used in documents, or you can analyze documents and determine similarities between them or how they relate to other variables of interest in the data mining project. In the most general terms, text mining "turns text into numbers" (meaningful indices), which can then be incorporated in other analyses such as predictive data mining projects or the application of unsupervised learning methods (clustering). These methods are described and discussed in great detail in the comprehensive overview by Manning and Schütze (2002), and for an in-depth treatment of these and related topics, as well as the history of this approach to text mining, we highly recommend that source.
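
A minimal sketch of "turning text into numbers": each document becomes a row of term counts that downstream clustering or predictive models can consume. The three documents and the raw-count weighting are illustrative assumptions of this sketch.

    # Sketch of a document-term count matrix (toy documents; real pipelines
    # would add stop-word removal, weighting such as TF/IDF, etc.).
    import numpy as np

    documents = [
        "web search ranks pages",
        "document search uses keywords",
        "keywords describe web pages",
    ]
    vocabulary = sorted({w for d in documents for w in d.split()})

    # Rows = documents, columns = vocabulary terms, entries = word counts.
    matrix = np.array([[d.split().count(t) for t in vocabulary]
                       for d in documents])
    print(vocabulary)
    print(matrix)   # these numeric vectors feed clustering or predictive models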
C. Latent Semantic Indexing
Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSI is based on the principle that words used in the same contexts tend to have similar meanings. Early challenges to LSI focused on scalability and performance: LSI requires relatively high computational performance and memory in comparison to other information retrieval techniques. However, with the implementation of modern high-speed processors and the availability of inexpensive memory, these considerations have been largely overcome; real-world applications involving more than 30 million documents fully processed through the matrix and SVD computations are not uncommon. A fully scalable (unlimited number of documents, online training) implementation of LSI is contained in the open source gensim software package.
Another challenge to LSI has been the alleged difficulty in determining the optimal number of dimensions to use for the SVD. As a general rule, fewer dimensions allow broader comparisons of the concepts contained in a collection of text, while a higher number of dimensions enables more specific (or more relevant) comparisons of concepts. The actual number of dimensions that can be used is limited by the number of documents in the collection. Research has demonstrated that around 300 dimensions usually provide the best results with moderate-sized document collections (hundreds of thousands of documents), and perhaps 400 dimensions for larger document collections (millions of documents).
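
The following sketch shows the core LSI computation with NumPy: decompose a toy term-document matrix with SVD and compare two documents in a k-dimensional latent space. The matrix, k = 2, and the cosine comparison are illustrative assumptions; real collections would use the far larger dimension counts discussed above.

    # Sketch of Latent Semantic Indexing: factor a term-document matrix with
    # SVD and keep only k latent dimensions.
    import numpy as np

    # Toy term-document count matrix: rows = terms, columns = documents.
    A = np.array([
        [2, 0, 1, 0],   # "data"
        [1, 1, 0, 0],   # "mining"
        [0, 2, 1, 0],   # "text"
        [0, 0, 1, 2],   # "search"
        [0, 0, 0, 1],   # "engine"
    ], dtype=float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    docs_k = np.diag(s[:k]) @ Vt[:k]    # documents in k-dim latent space

    # Documents close in this reduced space use words in similar contexts,
    # even if they share few exact terms.
    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    print(cosine(docs_k[:, 0], docs_k[:, 1]))  # doc similarity in latent space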
D. Rank Based on Visits of Links Algorithm
Most of the ranking algorithms proposed earlier are either link or content oriented and do not take user usage trends into account. The authors propose a page ranking mechanism called Page Ranking based on Visits of Links (PRVOL) [13] [14] for search engines, which builds on Google's basic ranking algorithm, PageRank, and takes the number of visits of inbound links of web pages into account. This concept is very useful for displaying the most valuable pages at the top of the result list on the basis of user browsing behavior, which reduces the search space to a large extent. As the authors describe, in the original PageRank algorithm the rank score of a page p is evenly divided among its outgoing links; that is, each inbound link brings the same rank value from its base page p. They therefore propose an improved PageRank algorithm that assigns more rank value to the outgoing links most visited by users. In this manner, a page's rank value is calculated based on visits of inbound links.
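
Here is a minimal sketch of the PRVOL idea under stated assumptions: the structure follows ordinary PageRank, but each page's rank is distributed over its out-links in proportion to observed visit counts rather than evenly. The toy graph, the visit counts, and the handling of pages with no recorded visits are assumptions of this sketch, not the exact formulation of [13].

    # Sketch of Page Ranking based on Visits of Links (PRVOL): like PageRank,
    # but a page's rank is shared among its out-links in proportion to the
    # number of user visits on each link instead of evenly. Toy values only.

    def prvol(links, visits, damping=0.85, iterations=50):
        """links: page -> list of linked pages.
        visits: (page, target) -> observed visit count of that link."""
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, outgoing in links.items():
                total_visits = sum(visits.get((page, t), 0) for t in outgoing)
                if total_visits == 0:
                    continue  # no observed visits: this page passes no rank
                for target in outgoing:
                    weight = visits.get((page, target), 0) / total_visits
                    new_rank[target] += damping * rank[page] * weight
            rank = new_rank
        return rank

    toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    toy_visits = {("a", "b"): 1, ("a", "c"): 9, ("b", "c"): 5, ("c", "a"): 5}
    print(prvol(toy_web, toy_visits))  # heavily visited links lift page c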

III. SYSTEM ARCHITECTURE

The system architecture is designed around the user's interactive session with the file server system. The search process is based on a particular topic or concept; for example, we can choose the topic "Introduction about Data Mining". The user's search query is taken, and two user types are considered here: the old user and the new user. An old user already has an account and has already visited the topic "Introduction about Data Mining" on the search engine. The search engine displays some results for that topic on the result page. Next, the user manually visits a link about the topic, and the link is added to the user's history; it will be used by the ranking concept the next time this user visits the same topic in the search engine.

A new user is new to this search engine and searches the topic "Introduction about Data Mining". The search engine likewise provides some results relevant to the topic on the result page. The new user visits some of the links on the page, and these are also added to the user history database. This concept is represented in figure 1 above, which shows the same topic being searched by various users.

This architecture describes how users interact with the file server and how the file server responds to users with relevant documents for every topic, such as .pdf, .ppt, and .docx files. Those documents are extracted from the file server based on the cosine similarity construction method, which is similar to semantic-similarity-based information retrieval from text documents; the number of keywords present in those documents is counted and compared as lower or higher values.
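
A minimal sketch of this cosine-similarity extraction step, assuming documents and the query are represented as raw term-count vectors; the file names and contents below are hypothetical.

    # Sketch of cosine-similarity retrieval: the user's keywords and each
    # stored document become term-count vectors, and documents are returned
    # in order of cosine similarity to the query (toy data).
    import math
    from collections import Counter

    def cosine_similarity(a, b):
        terms = set(a) | set(b)
        dot = sum(a[t] * b[t] for t in terms)
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    file_server = {   # hypothetical stored documents
        "intro_datamining.pdf": "introduction about data mining concepts",
        "web_search.ppt": "web search engine ranking",
        "dm_algorithms.docx": "data mining algorithms introduction",
    }
    query = Counter("introduction about data mining".split())

    ranked = sorted(file_server.items(),
                    key=lambda kv: -cosine_similarity(query,
                                                      Counter(kv[1].split())))
    for name, _ in ranked:
        print(name)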



IV. IMPLEMENTATION
This section describes how the overall system proceeds, module by module: the user module, snippets and pattern extraction, pattern clustering, and the ranking provided for each individual user.

A. User Module

In this module, users have authentication and security for accessing the details presented in the ontology system. Before accessing or searching the details, users must have an account; otherwise they should register first. Users need only log in to reach the home page (search page); if a user does not have an account, they need to register on the register page. This process provides privacy for every user by maintaining their histories based on visited links.

B. Snippets and Pattern Extraction

This module describes how documents are extracted from the file server, based on the information retrieval concept. Only the user knows exactly what they expect from the search result; the search engine does not know what result the user expects from the file server. So we take the keyword the user enters in the search engine and treat it as a snippet. The snippets are checked against the file server (database) to verify whether matching documents are available, and documents are retrieved using a cosine-similarity-based search, a method adopted from standard papers. The similarity concept verifies, for all keywords, how many times the same keyword occurs in the retrieved documents.

The pattern extraction algorithm considers all the words in a snippet and is not limited to extracting patterns only from the midfix. The consideration of gaps enables us to capture relations between distant words in a snippet. We use a modified version of the PrefixSpan algorithm to generate subsequences from a text snippet, and we use the constraints (2-4) to prune the search space of candidate subsequences. Lexical patterns are extracted from snippets to represent the numerous semantic relations that exist between two words, and a machine learning approach combines page-count-based co-occurrence measures and snippet-based lexical pattern clusters to construct a cosine similarity [2] [3] [4] based measure.

This process is defined in the data flow diagram of snippets and pattern extraction shown in figure 3.3.2.1, which represents how module 2 works and how snippets and pattern extraction are performed. Before the search option can be used, users need to log in to their account so that the search is personal to them.
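
The sketch below illustrates gapped subsequence extraction from a snippet around two target words, in the spirit of the modified PrefixSpan step described above; the length limit and the simple tokenization are assumptions of this sketch, not the paper's exact constraints (2-4).

    # Sketch of lexical pattern extraction from a snippet: enumerate gapped,
    # order-preserving subsequences of the words between two target words.
    from itertools import combinations

    def extract_patterns(snippet, word_x, word_y, max_len=4):
        tokens = snippet.lower().split()
        i, j = tokens.index(word_x), tokens.index(word_y)
        if i > j:
            i, j = j, i
        middle = tokens[i + 1:j]           # words between X and Y
        patterns = set()
        for r in range(1, min(max_len, len(middle)) + 1):
            for combo in combinations(middle, r):   # gaps allowed
                patterns.add("X " + " ".join(combo) + " Y")
        return patterns

    snippet = "ostrich is a large flightless bird of Africa"
    for p in sorted(extract_patterns(snippet, "ostrich", "bird")):
        print(p)   # e.g. "X is a Y", "X large flightless Y", ...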

C. Pattern Clustering

By sorting the lexical patterns in descending order of their frequency and clustering [6] [7] the most frequent patterns first, we form clusters for the more common relations first. This enables us to keep rare patterns, which are likely to be outliers, from attaching to otherwise clean clusters. The greedy sequential nature of the algorithm avoids pairwise comparisons between all lexical patterns. Clusters of lexical patterns [8] are extracted from snippets to represent the numerous semantic relations that exist between two words and to support a robust semantic similarity measure.
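
A minimal sketch of this greedy sequential clustering: patterns are processed in descending order of frequency and attached to the first sufficiently similar cluster, so rare outlier patterns end up alone. The word-overlap (Jaccard) similarity, the 0.3 threshold, and the sample patterns are illustrative assumptions.

    # Sketch of greedy frequency-ordered pattern clustering. Comparing each
    # pattern only against one representative per cluster avoids pairwise
    # comparisons between all lexical patterns.

    def jaccard(a, b):
        a = set(a.split()) - {"X", "Y"}   # ignore the slot placeholders
        b = set(b.split()) - {"X", "Y"}
        return len(a & b) / len(a | b) if a | b else 0.0

    def greedy_cluster(pattern_freq, threshold=0.3):
        clusters = []   # each cluster is a list of patterns
        for pattern, _ in sorted(pattern_freq.items(), key=lambda kv: -kv[1]):
            best, best_sim = None, threshold
            for cluster in clusters:
                sim = jaccard(pattern, cluster[0])
                if sim > best_sim:
                    best, best_sim = cluster, sim
            if best is None:
                clusters.append([pattern])   # rare patterns become outliers
            else:
                best.append(pattern)
        return clusters

    freqs = {"X is a Y": 120, "X is a large Y": 40,
             "X acquired Y": 25, "X bought Y": 18, "X flightless Y": 2}
    print(greedy_cluster(freqs))   # word overlap only; synonyms stay apart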

D. Ranking Provided for Searched Documents

Ranking is provided based on the particular user's history of searched documents. For example, when a user searches a particular topic in the search engine, some results are shown on the page. The user manually visits a link on that page, or possibly on the following pages, and that link is stored in the user's history. If the user searches the same topic again, the search engine shows what the user visited last time for that topic, based on the weighted priority of the Visit of Link algorithm [13] [14] [15].
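
To illustrate the per-user ranking step, here is a minimal sketch of URL counting based on click-through, assuming an in-memory history store; breaking count ties by the most recent visit anticipates the tie handling mentioned as future work in the conclusion.

    # Sketch of click-through URL counting: each click increments a per-user
    # counter, and the next search for the same topic orders previously
    # visited URLs by their counts (structure and tie-breaking are assumed).
    import time
    from collections import defaultdict

    history = defaultdict(lambda: {"count": 0, "last_visit": 0.0})

    def record_click(user, topic, url):
        entry = history[(user, topic, url)]
        entry["count"] += 1
        entry["last_visit"] = time.time()

    def ranked_urls(user, topic):
        visited = [(key[2], v) for key, v in history.items()
                   if key[0] == user and key[1] == topic]
        # Higher click count first; ties broken by most recent visit.
        visited.sort(key=lambda kv: (-kv[1]["count"], -kv[1]["last_visit"]))
        return [url for url, _ in visited]

    record_click("balaji", "data mining", "http://example.org/dm_intro.pdf")
    record_click("balaji", "data mining", "http://example.org/dm_intro.pdf")
    record_click("balaji", "data mining", "http://example.org/dm_algos.ppt")
    print(ranked_urls("balaji", "data mining"))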


V. CONCLUSION
In this approach, the user registration system and its functionalities are implemented. Pattern extraction is exercised using cosine similarity, in which the keyword produced by the user plays a major role. When the user provides a keyword, it is used as a snippet and matched against the already maintained database; on matching the snippet with the database information, the exact result is produced. The history of documents previously viewed by the user is stored for later access and analysis.

In the above approach, the results bring both related and unrelated information for the given search. In order to produce only related documents for the search, the concept of clustering is proposed: a cluster of information is developed based on the close relations exhibited by documents, so that for a given search, only documents highly related to the information being searched are produced. Based on the user history, the information is rated using the Visit of Link (VOL) algorithm. This approach is applied only to document-type content. The proposed system increases efficiency, and the time consumed searching for information on the web is reduced.

As future work, the ranking result mentioned above is based on the individual user; if two links have the same count, the result will be produced based on time, or other approaches are going to be implemented.


REFERENCES
[1]. Agirre, Eneko, et al. "A study on similarity and relatedness using distributional and WordNet-based approaches." Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2009.
[2]. Leung, K. W.-T., and Dik Lun Lee. "Deriving concept-based user profiles from search engine logs." IEEE Transactions on Knowledge and Data Engineering 22.7 (2010): 969-982.
[3]. Liu, Fang, Clement Yu, and Weiyi Meng.
"Personalized web search by mapping user queries to
categories." Proceedings of the eleventh international
conference on Information and knowledge
management. ACM, 2002.
[4]. Baeza-Yates, Ricardo, Carlos Hurtado, and Marcelo
Mendoza. "Query recommendation using query logs
in search engines." Current Trends in Database
Technology-EDBT 2004 Workshops. Springer Berlin
Heidelberg, 2005.
[5]. Cui, Hang, et al. "Probabilistic query expansion using
query logs." Proceedings of the 11th international
conference on World Wide Web. ACM, 2002.
[6]. Zhang, Zhiyong, and Olfa Nasraoui. "Mining search
engine query logs for query recommendation."
Proceedings of the 15th international conference on
World Wide Web. ACM, 2006.
[7]. Church, Kenneth Ward, and Patrick Hanks. "Word
association norms, mutual information, and
lexicography." Computational linguistics 16.1 (1990):
22-29.
[8]. Bagga, Amit, and Breck Baldwin. "Entity-based
cross-document coreferencing using the vector space
model." Proceedings of the 17th international
conference on Computational linguistics-Volume 1.
Association for Computational Linguistics, 1998.
[9]. Haveliwala, Taher H. "Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search." IEEE Transactions on Knowledge and Data Engineering 15.4 (2003): 784-796.
[10]. Qiu, Feng, and Junghoo Cho. "Automatic
identification of user interest for personalized
search." Proceedings of the 15th international
conference on World Wide Web. ACM, 2006.
[11]. Heymann, Paul, Georgia Koutrika, and Hector
Garcia-Molina. "Can social bookmarking improve
web search?." Proceedings of the 2008 International
Conference on Web Search and Data Mining. ACM,
2008.

[12]. Chahal, Poonam, Manjeet Singh, and Suresh Kumar. "An Ontology Based Approach for Finding Semantic Similarity between Web Documents." (2013).
[13]. Gupta, Sachin, and Pallvi Mahajan. "Improvement in
Weighted Page Rank based on Visits of Links (VOL)
algorithm." IJCCER 2.3 (2014): 119-124.
[14]. Langville, Amy N., and Carl D. Meyer. Google's
PageRank and beyond: The science of search engine
rankings. Princeton University Press, 2011.
[15]. Dubey, Hema, and B. N. Roy. "An Improved Page
Rank Algorithm based on Optimized Normalization
Technique." (2011).
