SEMINAR REPORT
2009-2011
CERTIFICATE
This is to certify that the seminar report entitled "Web Clustering Engines", submitted
by Deepthi Theresa K.K. in partial fulfillment of the requirements for the award of
M.Tech in Computer & Information Science, is a bona fide record of the seminar presented by
her during the academic year 2010.
ACKNOWLEDGEMENT
Computer Science, CUSAT who provided with the necessary facilities and advice. I am also
thankful to Mr. G.Santhosh Kumar, Lecturer, Dept of Computer Science, CUSAT for his
valuable suggestions and support for the completion of this seminar. With great pleasure I
remember Dr. Sumam Mary Idicula, Reader, Dept. of Computer Science, CUSAT for her
sincere guidance. I am also thankful to all the teaching and non-teaching staff in the
department and to my friends for extending their warm kindness and help.
I would like to thank my parents, without whose blessings and support I would not have
been able to accomplish my goal. I also extend my thanks to all my well-wishers. Finally, I
thank the Almighty for giving guidance and blessings.
ABSTRACT
Web clustering engines are an emerging trend in the field of information retrieval.
They organize search results by topic, thus offering a complementary view to the flat ranked
list returned by conventional search engines. The search results returned by traditional
search engines on different subtopics or meanings of a query are mixed together in the list,
so the user may have to sift through a large number of irrelevant items to locate those of
interest. Web clustering engines categorize the search results into hierarchical
groups (clusters) and display the cluster labels, so the user can locate the desired
document quickly.
In this seminar we discuss the phases in the implementation of web clustering
engines in detail and cover some web clustering algorithms, along with their advantages
and issues. We also introduce some web clustering engines currently in use. Some future
research directions are also presented.
Additional Key Words and Phrases: Web Clustering Engines, Information retrieval, meta
search engines, search results clustering, Search results acquisition, Preprocessing, Cluster
construction and labeling, Vector Space model, data centric clustering algorithms, description
aware algorithms
Contents
1. Introduction
1.1 Motivation
4. Conclusion
5. References
6. Appendix
1. INTRODUCTION
1.1 MOTIVATION
Search engines are an invaluable tool for retrieving information from the Web. In
response to a user query, they return a list of results ranked in order of relevance to the
query. The user starts at the top of the list and follows it down, examining one result at a
time, until the sought information has been found.
Nowadays efficient search engines such as Google and Yahoo are available. Although
they are good for navigational and transactional searching,
they are less effective for ambiguous queries.
An ambiguous query is one that has multiple meanings in different contexts. The
search results returned by conventional search engines on different subtopics or meanings
of a query are mixed together in the list, so the user may have to sift through a
large number of irrelevant items to locate those of interest. In this context, clustering of
search results comes into the picture.
Clustering is the act of grouping similar objects into sets. The distance between
objects in the same cluster (intra-cluster variation) should be minimal and the distance
between objects in different clusters (inter-cluster variation) should be maximal. In the
web search context, clustering means organizing web pages (search results) into groups so that different
groups correspond to different user needs.
In 1979 Van Rijsbergen introduced the Cluster Hypothesis in the field of
information retrieval. It states that "closely related documents tend to be relevant to the
same requests".
Web clustering engines are systems that perform clustering of web search
results. These systems group the results returned by a search engine into a hierarchy of
labeled clusters (also called categories).
Dept. Of Computer Science
CUSAT
To illustrate, Figure 1 in the appendix shows the clustered results returned for the
query "tiger". This result was produced by a popular web clustering engine,
Vivisimo (as of March 5, 2010). Like many queries on the Web, "tiger" has multiple
meanings: the feline, the Mac OS X computer operating system, the golf champion,
and so on. These different meanings are well represented in Figure 1. By contrast, if we
submit the query "tiger" to Google or Yahoo! (Figure 2), we can see that the items for each
meaning are scattered in the ranked list of search results, often across a large number of
result pages.
The first commercial clustering engine was Northern Light, at the end of the
1990s. It was based on a predefined set of categories, to which the search results were
assigned. A major breakthrough was then made by Vivisimo, whose clusters and cluster
labels were dynamically generated from the search results. Other available
clustering engines include Clusty, Grokker, KartOO, Lingo3G and CREDO.
It provides shortcuts to the items that relate to the same meaning. Since web
clustering engines group the search results with the same meaning within the
same cluster, it is easy for the user to find similar documents, and
search time is reduced.
Short input data description. For computational reasons, the data available
to the clustering algorithm for each search result are usually limited to a URL,
an optional title, and a short excerpt of the document's text (the snippet).
Meaningful labels. Each cluster label should indicate the contents of the
items within that cluster.
Selection of similarity measure. Many known methods exist for measuring
the dissimilarity/similarity between two items, such as Euclidean
distance and Manhattan distance.
Overlapping clusters. Since the same result may apply to different themes,
we may allow overlapping clusters. Handling overlapping clusters in a
dynamic environment is an open issue.
Unknown number of clusters. In search results clustering, both the number and
the size of clusters cannot be predetermined because they vary with the query.
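For the similarity-measure item in the list above, the two distances it names can be sketched as follows. This is a minimal illustration, assuming each search result has already been turned into a numeric feature vector; the vectors and function names are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical feature vectors for two search results.
result_a = [1.0, 0.0, 2.0]
result_b = [4.0, 4.0, 2.0]

print(euclidean_distance(result_a, result_b))  # 5.0
print(manhattan_distance(result_a, result_b))  # 7.0
```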
conventional search engines, in response to the query. The most elegant way of fetching
results from such search engines is to use the application programming interfaces (APIs)
these engines provide.
text, the most intuitive set of features would be simply words of a given language. But
this is not the only possibility. The features can vary from single words and fixed-length
tuples of words (n-grams) to frequent phrases (variable-length sequences of words), and
very algorithm-specific data structures, such as approximate sentences.
One method for representing a text is the Vector Space Model (VSM). A document d
is represented in the VSM as a vector $[w_{t_0}, w_{t_1}, \ldots, w_{t_n}]$, where $t_0, t_1, \ldots, t_n$ is a global set
of words (features) and $w_{t_i}$ expresses the weight (importance) of feature $t_i$ to document d.
Weights in a document vector typically reflect the distribution of occurrences of features
in that document. For example, a term vector for the phrase "Polly had a dog and the dog
had Polly" could appear as shown below (weights are simply counts of words; articles are
rarely specific to any document and normally would be omitted).
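A minimal sketch of such a count-based term vector, assuming whitespace tokenization and lowercasing (both are simplifications chosen for illustration; the function name is hypothetical, and articles are kept here rather than omitted):

```python
from collections import Counter

def term_vector(text, vocabulary=None):
    """Return (vocabulary, weights), where each weight is a raw word count."""
    counts = Counter(text.lower().split())
    if vocabulary is None:
        # With no global feature set given, use the document's own words.
        vocabulary = sorted(counts)
    return vocabulary, [counts[term] for term in vocabulary]

vocab, weights = term_vector("Polly had a dog and the dog had Polly")
for term, weight in zip(vocab, weights):
    print(f"{term}: {weight}")
# a: 1, and: 1, dog: 2, had: 2, polly: 2, the: 1
```

Passing the same `vocabulary` for every document keeps all vectors aligned on one global feature set, which is what the VSM definition above requires.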
cluster. Continue this process until the desired number of clusters, k, is reached. The complexity
of this algorithm is clearly $O(n^2)$, since we are using a matrix, where n is the number of
initial clusters.
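The merge-until-k procedure can be sketched as below. This is only an illustration under stated assumptions: Euclidean distance and single linkage (the distance between two clusters is that of their closest members) are choices made here, not ones the text specifies, and the function name and sample points are hypothetical.

```python
import math

def agglomerative(points, k):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    # Start with every point in its own cluster (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]
    # Pairwise distance matrix between the original points: n x n entries.
    dist = [[math.dist(p, q) for q in points] for p in points]

    def linkage(c1, c2):
        # Single linkage (an assumption): closest pair of members.
        return min(dist[i][j] for i in c1 for j in c2)

    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(agglomerative(points, 3))  # [[0, 1], [2, 3], [4]]
```

This sketch recomputes linkages on every pass for readability, so it does more work than the matrix size alone suggests.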
Another data-centric algorithm is K-means clustering. K is a predefined
value for the number of clusters, and the mean of each cluster's members is always taken as
the cluster centroid, hence the name. First, choose the number of clusters k. Randomly generate k
clusters and find each cluster's representative (centroid). Calculate the distance between each
cluster centroid and each document. Assign each document to the nearest cluster centroid. Recompute the cluster centroids. Repeat these steps until some convergence criterion is met.
The complexity is $O(nkT)$, where n is the number of
documents and T is the number of iterations the algorithm needs to reach a stable
state (one in which no document changes its cluster membership).
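The steps above can be sketched as follows. This is a minimal illustration, assuming documents are already numeric vectors and using Euclidean distance; the `initial_centroids`, `max_iter` and `seed` parameters are hypothetical conveniences added here so that runs can be made repeatable.

```python
import math
import random

def k_means(points, k, initial_centroids=None, max_iter=100, seed=0):
    """Return (assignment, centroids) after iterating until membership is stable."""
    rng = random.Random(seed)
    # Random initial centroids unless the caller supplies them.
    centroids = list(initial_centroids) if initial_centroids else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # convergence: no point changed cluster
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids = k_means(points, k=2)
print(assignment, centroids)
```

Each pass computes n times k distances, and the loop runs T times, which is where the n, k and T factors in the complexity come from.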
Data-centric algorithms borrow their strengths from well-known and proven
techniques targeted at clustering numeric data. Even though they use simple keyword-based
features, they are still powerful methods.
But these algorithms have some difficulties. None of them is
incremental in nature. Incremental means that, as each document arrives from the web,
we clean it and add it to the existing model. All of the above algorithms lack this
property.
Another difficulty with data-centric approaches concerns meaningful labels.
In these algorithms, cluster labels are created by selecting frequent keywords from the set
of cluster documents. This keyword-based representation is insufficient from
the user's perspective. Once a text is converted to a document vector we can hardly speak
of the text's meaning, because the vector is basically a collection of unrelated terms.
With features extracted in a keyword-based approach, the content of the cluster is not
very readable.
papers by Zamir and Etzioni in 1998, 1999, and implemented in a system called Grouper.
In practice, STC was very much a breakthrough in search results clustering.
Suffix Tree Clustering (STC) uses a data structure called a suffix tree. It uses
phrases (ordered sequences of words) as its atomic features rather than keywords. Suffix tree
clustering has three steps: data cleaning, identifying base
clusters and combining base clusters. We define a base cluster to be a set of documents
that share a common phrase.
A suffix tree: Definition
1. A suffix tree of a string S is a compact trie containing all suffixes of S.
2. It is a rooted tree.
3. Each internal node has at least two children.
4. Each edge is labeled with a non-empty substring of S. The label of a node is the
concatenation of the edge labels on the path from the root to that node.
5. No two edges out of the same node can have edge labels that begin with the same word.
For example, the suffixes of the sentence "mouse ate cheese too" are:
Suffix no.   Suffix
1.   mouse ate cheese too
2.   ate cheese too
3.   cheese too
4.   too
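The word-level suffixes of a sentence can be enumerated with a short sketch like the one below, assuming words are separated by whitespace (the function name is hypothetical):

```python
def word_suffixes(sentence):
    """Return every suffix of the sentence, treating words as the atomic units."""
    words = sentence.split()
    return [" ".join(words[i:]) for i in range(len(words))]

for number, suffix in enumerate(word_suffixes("mouse ate cheese too"), start=1):
    print(f"{number}. {suffix}")
# 1. mouse ate cheese too
# 2. ate cheese too
# 3. cheese too
# 4. too
```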
A Generalized Suffix Tree (GST) is a suffix tree that contains all the suffixes of two
or more sentences.
Step 1 - Data Cleaning
In this step, the string of text representing each document is transformed using a
light stemming algorithm (deleting word prefixes and suffixes and reducing plural to
singular). Sentence boundaries (identified via punctuation and HTML tags) are marked
and non-word tokens (such as numbers, HTML tags and most punctuation) are stripped.
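A rough sketch of this cleaning step is given below. It rests on loud assumptions: the `light_stem` function is a crude stand-in that only strips a trailing plural "s" (a real light stemmer also handles prefixes and other suffixes), and the regular expressions for tags and sentence punctuation are illustrative choices, not the ones used by STC.

```python
import re

def light_stem(word):
    # Crude stand-in for a light stemmer: only reduces simple plurals.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean(document):
    """Return a list of sentences, each a list of stemmed word tokens."""
    # Mark sentence boundaries at HTML tags and sentence punctuation.
    marked = re.sub(r"<[^>]+>|[.!?;:]+", "\n", document)
    sentences = []
    for raw in marked.split("\n"):
        # Keep alphabetic tokens only, which drops numbers and leftover punctuation.
        words = [light_stem(w) for w in re.findall(r"[a-z]+", raw.lower())]
        if words:
            sentences.append(words)
    return sentences

print(clean("<p>Cats ate 2 cheeses.</p> The mouse ate cheese, too!"))
# [['cat', 'ate', 'cheese'], ['the', 'mouse', 'ate', 'cheese', 'too']]
```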
Step 2 - Identifying Base Clusters
The following picture is an example of a Generalized Suffix Tree for the set of strings
1) "cat ate cheese", 2) "mouse ate cheese too" and 3) "cat ate mouse too". The nodes of the
suffix tree are drawn as circles. Each suffix-node has one or more boxes attached to it
designating the string(s) it originated from. The first number in each box designates the
string of origin (1-3 in our example, by the order the strings appear above); the second
number designates which suffix of that string labels that suffix-node.
Each node of the suffix tree represents a group of documents and a phrase that is
common to all of them. The label of the node represents the common phrase; the set of
documents tagging the suffix-nodes that are descendants of the node make up the
document group. Therefore, each node represents a base cluster.
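The idea of a base cluster (a phrase plus the documents that share it) can be illustrated without building the tree. The sketch below is a brute-force substitute, not a suffix tree: it enumerates every contiguous word phrase directly, so it is slower and may also list phrases that a compact suffix tree would not give a node of their own (for instance "cat", which is always followed by "ate" in these strings). The function name and `min_docs` parameter are hypothetical.

```python
from collections import defaultdict

def base_clusters(documents, min_docs=2):
    """Map each word phrase to the set of ids of the documents containing it."""
    phrase_docs = defaultdict(set)
    for doc_id, text in documents.items():
        words = text.split()
        # Every contiguous word sequence is a prefix of some suffix,
        # i.e. what a path from the root of a suffix tree spells out.
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                phrase_docs[" ".join(words[start:end])].add(doc_id)
    # Keep only phrases shared by at least min_docs documents.
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}

documents = {1: "cat ate cheese", 2: "mouse ate cheese too", 3: "cat ate mouse too"}
for phrase, docs in sorted(base_clusters(documents).items()):
    print(f"{phrase!r}: {sorted(docs)}")
```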
The following table lists the six marked nodes (a-f) from the example shown above
and their corresponding base clusters:
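Each base cluster B with phrase P is assigned a score $s(B) = |B| \cdot f(|P|)$ (following Zamir and Etzioni),
where |B| is the number of documents in base cluster B, and |P| is the number of words in
P that have a non-zero score (i.e., the effective length of the phrase).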
Step 3 - Combining Base Clusters
This step of the algorithm merges base clusters with a high overlap in their
document sets. For this we use a base cluster graph. The nodes in this graph
are base clusters. Base clusters are combined based on some similarity measure.
The following figure is a base cluster graph of the previous example.
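A sketch of the merging step is given below. The text does not fix the similarity measure, so this example assumes one: two base clusters are connected when their shared documents make up more than a `threshold` fraction (0.5 here) of both document sets, and each connected component of the graph becomes a final cluster. The function name and the hand-written base clusters for the three example strings are illustrative.

```python
def merge_base_clusters(base, threshold=0.5):
    """Return a list of (phrases, documents) pairs, one per merged cluster."""
    phrases = list(base)
    # Build the base cluster graph: an edge means high mutual overlap.
    neighbours = {p: set() for p in phrases}
    for i, p in enumerate(phrases):
        for q in phrases[i + 1:]:
            shared = len(base[p] & base[q])
            if shared / len(base[p]) > threshold and shared / len(base[q]) > threshold:
                neighbours[p].add(q)
                neighbours[q].add(p)

    # Each connected component of the graph becomes one final cluster.
    clusters, seen = [], set()
    for p in phrases:
        if p in seen:
            continue
        component, stack = [], [p]
        seen.add(p)
        while stack:
            node = stack.pop()
            component.append(node)
            for n in neighbours[node] - seen:
                seen.add(n)
                stack.append(n)
        docs = set().union(*(base[c] for c in component))
        clusters.append((sorted(component), docs))
    return clusters

# Base clusters written out by hand for the three example strings.
base = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
    "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2},
}
for phrases, docs in merge_base_clusters(base):
    print(phrases, sorted(docs))
```

Under this assumed criterion every base cluster overlaps strongly with "ate", so the example collapses into a single cluster covering documents 1, 2 and 3.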
The query used here is "salsa". Only the first five clusters are shown. The
words in bold are the shared phrases found in the clusters. Note the descriptive power of
phrases such as "Puerto Rico", "Latin Music" and "York Salsa Dancers".
Clusty
Clusty is a clustering engine developed by the company Vivisimo. Vivisimo won
the best meta-search engine award assigned by SearchEngineWatch.com from 2001 to
2003. Vivisimo means "lively", "bright", or "clever" in Spanish. Vivisimo's founders picked the
name to express their vision of optimizing and giving life to our information. Clusty is a
meta search engine, meaning it combines results from a variety of different sources. It
uses an algorithm to cluster content based on textual similarity. For every search,
Clusty pulls together data from other engines such as Ask, MSN and Wisenut. It then
organizes the search results in a way that helps us navigate away from ambiguity towards
a specific cluster of results.
Clusty uses a hierarchical folder approach. It is a very simple method and familiar
to everyone. Figure 1 in the appendix is a screenshot of Clusty (taken on March 5, 2010).
The hierarchical folders are confined to the left side of the screen so that the user can
quickly choose any cluster needed.
CREDO
CREDO (Conceptual REorganization of DOcuments) was developed at
Fondazione Ugo Bordoni by Claudio Carpineto and Gianni Romano. CREDO groups the
results of a web search (currently Yahoo APIs search results) in a lattice of conceptual
clusters that highlight the contents of the retrieved documents. CREDO is based on a
mathematical data representation termed a concept lattice. Compared to other systems for
clustering Web results, the clusters produced by CREDO are more justifiable, are easier
to navigate because they are organized in a lattice rather than a strict hierarchy, and allow
discovery of causal associations between the words contained in the results. CREDO is
an interesting example of a system that attempts to build the taxonomy of topics and their
descriptions simultaneously. Even though CREDO does not follow a strict hierarchical
organization, it can still use a tree-based visualization. See Figure 4 (taken on March 6,
2010) in the appendix for the visualization of CREDO.
A version of CREDO for PDAs (Credino) and for cellular phones (SmartCREDO)
has been developed in collaboration with Stefano Mizzaro and Andrea Della Pietra
(University of Udine).
Grokker
Grokker was developed by a company called Groxis, a tech company
based in San Francisco, California. The name Grokker is inspired by the 1961 Robert A.
Heinlein science fiction classic Stranger in a Strange Land, in which "grok" is a Martian
word meaning literally "to drink" and metaphorically "to be one with". To grok something
is to understand something so well that it is fully absorbed into oneself. It is to look at
every problem, opportunity, action, and point of view from any and all perspectives.
Grokker sits on top of multiple sources. After Grokker retrieves the information, it
"federates" it, meaning it meshes it all together. Finally, it clusters the returns into
categories. End users most frequently look at fewer than three screens of the thousands
of returned search results. Using Grokker, users immediately see the cluster(s) of greatest
relevance and drill down only within the cluster(s) that matter to them.
Grokker uses a nesting and zooming approach. A screenshot of Grokker
is shown in Figure 5 in the appendix. This Map View is a visual representation of the returned
hits. When the user clicks on one of the circles, its subcategories are shown. By
clicking on Search Options the user can change the number of hits returned. The
user can also choose which sites to search: Yahoo, Wikipedia and/or Amazon.
Simultaneous searching of different sites is also permitted. Finally, we can limit our
results by using the tools on the left side of the screen.
Some universities are using Grokker as their searching tool. Stanford University
was one of the first customers of Grokker. The new platform provides faculty and
students with a single point of access to multiple resources, including library catalogs,
proprietary subscription databases, and the Web. It helps Stanford users to be more
efficient in their research and navigation among the numerous available resources. The
desktop version of Stanford Grokker is no longer being supported, and is not available for
download. In March of 2009, Groxis ceased operations.
KartOO
KartOO was a meta search engine which displayed a visual interface. It operated
from 2001 to early 2010. KartOO had an advanced Adobe Flash GUI, as opposed to a
text-based list of results. It used a graph-based approach. Its color scheme was to a degree
reminiscent of Apple Computer's Aqua interface. Search results were presented as a
"map", with blob-like masses of varying color connecting each item. The shape of the
blobs depended on the relevance of the keyword corresponding to that blob,
according to the query. If one began a search with a general topic, KartOO sometimes
helped to narrow it down. Every "blob" clicked added another word to the search query.
The map would often succeed in presenting keywords or subtopics that defined the topic
one was searching on. See Figure 6 in the appendix for the visualization of KartOO.
It was co-founded in France by two cousins, Laurent and Nicholas Baleydier. This
project was then launched in 2001. In 2004, KartOO launched a new version called
UJIKO. In January 2010 KartOO closed down, removing all content from the KartOO
and UJIKO websites, but leaving a small message in French thanking its users for their
support.
4. CONCLUSION
Web clustering engines organize search results by topic, thus offering a
complementary view to the flat-ranked list returned by conventional search engines. Web
clustering engines have reached a level of maturity at which a body of research has developed and
commercial systems are being deployed. A number of advances must still be made in
cluster labels, coherence of cluster structure, performance evaluation studies and
advanced visualization techniques. Only then can web clustering engines entirely fulfill the
promise of being the PageRank of the future.
5. REFERENCES
Journal/Paper:
Claudio Carpineto, Stanisław Osiński, Giovanni Romano and Dawid Weiss, A Survey
of Web Clustering Engines, ACM Computing Surveys, Vol. 41, No. 3, Article 17, July
2009.
Oren Zamir and Oren Etzioni, Web Document Clustering: A Feasibility
Demonstration, In Proc. 21st Annual Int. ACM SIGIR Conf. on Research and
Development in Information Retrieval, pp. 46-54, 1998.
Books:
C. J. Van Rijsbergen, Information Retrieval, Butterworth, 1979.
Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval,
Addison Wesley Longman Publishing Co. Inc., 1999.
Websites:
http://clusty.com/
March 5, 2010
http://credo.fub.it
March 8, 2010
http://www2.parc.com/istl/projects/ia/sg-example1.html
http://credino.dimi.uniud.it/
http://smartcredo.dimi.uniud.it
March 4, 2010
6. APPENDIX
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9