Improving Web Clustering by Cluster Selection: By-Vishal Rathore Regd. No. 0721215022 (+91) 9861084119



Web Search
- Iterative Process
- Problems with Standard Web Search
  - Many Irrelevant Results
  - Single Long List
- Solution
  - Identify and Present Implicit Clusters

Web Clustering
Search Results for: Jaguar (1 – 6 of 70,000,000)

Clusters
1. Car
2. Animal
3. Mac OS
4. Other

Results
1. Official worldwide web site of Jaguar Cars.
2. Apple - Mac OS X
   The Apple Mac OS X product page.
3. Jaguar UK - R is for Racing
   The essence of the Jaguar breed
4. Jaguar
   General information from Big Cats Online.
5. Jaguar AU - Jaguar Cars
   Services and news
6. Jaguar -- Defenders of Wildlife
   Size, appearance, life span and diet.

Web Clustering Algorithms
- Many standard clustering algorithms
- Text-oriented clustering algorithms
  - STC - Suffix Tree Clustering
  - ESTC - Improvement on STC

Suffix Tree Clustering
1. Clean Pages
2. Identify Base Clusters
3. Combine Base Clusters
4. Rank/Select Clusters

STC: Identify Base Clusters


Example base clusters (phrase and the number shown beside it in the figure):
- animal: 4.5
- car: 5
- mac os x: 24
- car model: 10

Base clusters are each given a score: # documents × phrase score
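The scoring rule above can be sketched in a few lines. This is only an illustration under stated assumptions: it collects shared word n-grams with a dictionary instead of building a real suffix tree, and `phrase_score` is a hypothetical weighting (the slides do not define it). The example documents are made up.

```python
from collections import defaultdict

def phrase_score(num_words):
    # Hypothetical weighting, not taken from the slides: single words count
    # for less, longer phrases count for more, capped at 6 words.
    if num_words == 1:
        return 0.5
    return float(min(num_words, 6))

def base_clusters(docs, max_len=3, min_docs=2):
    """Map each shared phrase to (doc ids, score = #documents * phrase score)."""
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        # Collect every word n-gram up to max_len words (stand-in for a suffix tree).
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[" ".join(words[i:i + n])].add(doc_id)
    # Keep only phrases shared by at least min_docs documents.
    return {
        phrase: (ids, len(ids) * phrase_score(len(phrase.split())))
        for phrase, ids in phrase_docs.items()
        if len(ids) >= min_docs
    }

docs = [
    "jaguar car model",
    "new jaguar car model",
    "jaguar big cat",
]
clusters = base_clusters(docs)
for phrase, (ids, score) in sorted(clusters.items()):
    print(phrase, sorted(ids), score)
```

A real suffix tree finds the same shared phrases in time linear in the input; the dictionary version is used here only because it is short.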



STC: Combining Base Clusters


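Merge Clusters Based On Overlap

Figure values: 18, 12, 30 (consistent with base cluster scores of 18 and 12 summing to a merged score of 30)

Merged Cluster Score is the sum of base cluster scores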
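A minimal sketch of this merge step follows. The overlap rule is an assumption: two base clusters are linked when they share more than half of the documents of each, and linked clusters are grouped transitively. The slide only says "based on overlap", so the 0.5 threshold and the document ids are illustrative; the scores 18, 12 and 30 are the numbers that appear on the slide.

```python
def similar(a, b, threshold=0.5):
    # Assumed rule: the shared documents must exceed the threshold
    # fraction of BOTH clusters.
    common = len(a & b)
    return common / len(a) > threshold and common / len(b) > threshold

def merge_base_clusters(base, threshold=0.5):
    """base: list of (doc_id_set, score). Returns merged (doc_id_set, score) pairs."""
    parent = list(range(len(base)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of sufficiently overlapping base clusters.
    for i in range(len(base)):
        for j in range(i + 1, len(base)):
            if similar(base[i][0], base[j][0], threshold):
                parent[find(i)] = find(j)

    # Each connected group becomes one merged cluster; scores are summed (STC).
    groups = {}
    for i, (docs, score) in enumerate(base):
        merged = groups.setdefault(find(i), [set(), 0.0])
        merged[0] |= docs
        merged[1] += score
    return [(docs, score) for docs, score in groups.values()]

base = [({1, 2, 3}, 18.0), ({2, 3, 4}, 12.0), ({7, 8}, 5.0)]
merged = merge_base_clusters(base)
print(merged)
```

Here the first two base clusters share two of their three documents, so they merge into one cluster with score 18 + 12 = 30, while the third stays on its own.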



STC: Rank/Select Clusters
- Sort Clusters by Score
- Select Best N

Problems with STC
- STC is better than many other algorithms, BUT not good enough
- Scores
  - Poor Cluster Quality Measure
- Selection
  - Poor Coverage
  - Excessive Overlap

ESTC: Better Cluster Scoring
- Base Cluster Scores – OK
- Combined Cluster Scores – BAD
  - Overlap between clusters is over-counted in the sum
- Example - Particularly Similar Pages

ESTC: Scoring Solution
- Solution
  - Eliminate the over-counting of the overlap
- Merged Cluster Score
  - Sum over document scores
- Document Score
  - Average phrase score of base clusters containing the document in the merged cluster
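The two scoring rules can be compared directly. The sketch below follows the wording of the slides: the STC score sums the base cluster scores (# documents × phrase score), while the ESTC score sums one score per document, each being the average phrase score of the base clusters that contain it. The function names and the example cluster are illustrative assumptions.

```python
def stc_score(base_clusters):
    """STC: sum of base cluster scores, each being #documents * phrase score."""
    return sum(len(docs) * ps for docs, ps in base_clusters)

def estc_score(base_clusters):
    """ESTC: sum over documents of the average phrase score of the
    base clusters (within this merged cluster) that contain the document."""
    all_docs = set().union(*(docs for docs, _ in base_clusters))
    total = 0.0
    for doc in all_docs:
        scores = [ps for docs, ps in base_clusters if doc in docs]
        total += sum(scores) / len(scores)
    return total

# One merged cluster made of two heavily overlapping base clusters,
# given as (doc ids, phrase score).
cluster = [({1, 2, 3}, 2.0), ({1, 2, 3, 4}, 4.0)]
print(stc_score(cluster))   # documents 1-3 are counted in both base clusters
print(estc_score(cluster))  # every document is counted once
```

With these inputs the STC score is 6 + 16 = 22, whereas the ESTC score is 3 × 3.0 + 4.0 = 13, because documents 1 to 3 contribute a single averaged score instead of being counted twice.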

ESTC: Better Cluster Selection
- Top N Clusters – BAD
  - Dominant Topic – over-represented

Figure: clusters grouped by topic (Cars, Animals, Mac OS, Other)

ESTC: Smarter Selection – The Search
- Heuristic
  - Minimize Overlap
  - Maximize Coverage

ESTC: The Search
- Incremental
- Greedy
- Look-ahead Protection
- Sophisticated Branch and Bound Pruning
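To make the idea of an incremental, greedy selection concrete, here is a small sketch. It is not the ESTC search itself: the slides do not give the heuristic's formula, so `heuristic` (coverage minus a `beta`-weighted overlap penalty) is an assumption, and the look-ahead and branch and bound pruning are left out. The example clusters are invented.

```python
def heuristic(selected, beta=1.0):
    """Assumed heuristic: documents covered, minus a penalty for documents
    that appear in more than one selected cluster."""
    covered = set().union(*selected) if selected else set()
    overlap = sum(len(c) for c in selected) - len(covered)
    return len(covered) - beta * overlap

def greedy_select(clusters, n, beta=1.0):
    """Add one cluster at a time, always the one that most improves the heuristic."""
    selected, remaining = [], list(clusters)
    while remaining and len(selected) < n:
        best = max(remaining, key=lambda c: heuristic(selected + [c], beta))
        # Stop early if no remaining cluster improves the selection.
        if heuristic(selected + [best], beta) <= heuristic(selected, beta):
            break
        selected.append(best)
        remaining.remove(best)
    return selected

clusters = [
    frozenset({1, 2, 3, 4, 5}),  # cars
    frozenset({1, 2, 3, 4}),     # cars again, mostly the same pages
    frozenset({6, 7, 8}),        # animals
    frozenset({9, 10}),          # mac os
]
picked = greedy_select(clusters, n=3)
print(picked)
```

Taking the three largest clusters would return both car clusters; the greedy search skips the second one because it adds overlap without adding coverage, and picks the animal and Mac OS clusters instead.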

Evaluation Method
- Gold Standard - Ideal Clustering
- 2 Searches and 2 Types of Input Data
  - Jaguar and Salsa
  - Snippets and Full Text
- Precision
  - Cluster accuracy against the best matching ideal cluster
- Recall
  - Coverage of ideal cluster in matched clusters
- F-measure
  - Combination of precision and recall
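One way to read these definitions is sketched below. Several details are assumptions because the slides do not state them: each cluster is matched to the ideal cluster it fits most precisely, precision and recall are averaged with equal weights, and the F-measure is taken as the harmonic mean of the two. The gold standard and clusters in the example are made up.

```python
def precision(cluster, ideal):
    """Fraction of the cluster's documents that belong to the ideal cluster."""
    return len(cluster & ideal) / len(cluster)

def best_match(cluster, ideals):
    """Name of the ideal cluster that this cluster matches most precisely."""
    return max(ideals, key=lambda name: precision(cluster, ideals[name]))

def evaluate(clusters, ideals):
    matches = [best_match(c, ideals) for c in clusters]
    # Precision: average accuracy of each cluster against its best match.
    p = sum(precision(c, ideals[m]) for c, m in zip(clusters, matches)) / len(clusters)
    # Recall: how much of each ideal cluster is covered by the clusters matched to it.
    recalls = []
    for name, ideal in ideals.items():
        matched = [c for c, m in zip(clusters, matches) if m == name]
        covered = (set().union(*matched) & ideal) if matched else set()
        recalls.append(len(covered) / len(ideal))
    r = sum(recalls) / len(recalls)
    # F-measure: harmonic mean of precision and recall (assumed form).
    f = 0.0 if p + r == 0 else 2 * p * r / (p + r)
    return p, r, f

ideals = {"car": {1, 2, 3, 4}, "animal": {5, 6, 7, 8}}
clusters = [{1, 2, 3, 5}, {6, 7}]
p, r, f = evaluate(clusters, ideals)
print(p, r, f)
```

In this example the first cluster matches "car" with precision 0.75 and the second matches "animal" with precision 1.0, while the recall for the two ideal clusters is 0.75 and 0.5.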

Results – STC, STC-NS, ESTC


Jaguar Full Text Clustering Results

Results – ESTC vs Grokker
- Similar performance without page titles
- Page titles are often very useful

Algorithm   Input                     F-measure
ESTC        Snippets                  58%
Grokker     Snippets + Page Titles    62%
ESTC        Full Text                 74%

Conclusions
- ESTC has
  - A new cluster scoring function
  - A new cluster selection algorithm
- ESTC is better than STC, and compares favourably with Grokker.
- The ESTC scoring function is applicable to any agglomerative clustering algorithm.
- The ESTC cluster selection algorithm is more widely applicable.

Future Work
- Make improvements to other stages of STC
  - Particularly Combining Base Clusters
- Apply cluster selection method to other algorithms
- Improve cluster selection heuristic
