Efficient Clustering Approaches For Organizing Document Collection
[Figure: IR system overview. A spider crawls Web documents into a corpus; a query string goes to the IR system, which returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
IR System
• A perfect IR system retrieves all relevant documents without retrieving any non-relevant document.
• In reality, systems retrieve both relevant and non-relevant documents.
• Two ratios measure retrieval effectiveness: precision and recall.
• Present documents according to the user's need.
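The two ratios can be made concrete with a small sketch. It assumes the standard definitions (precision = relevant retrieved / all retrieved, recall = relevant retrieved / all relevant); the function name and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant (ground truth)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the corpus.
p, r = precision_recall(["d1", "d2", "d3", "d7"],
                        ["d1", "d2", "d3", "d4", "d5", "d6"])
print(p, r)  # 0.75 0.5
```

A perfect system in the sense above would score 1.0 on both ratios.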
Document Clustering
• Automatically partition documents into clusters based on content
– Documents within each cluster should be similar
– Documents in different clusters should be different
• Discover categories in an unsupervised manner
– No sample category labels provided by humans
– It is a common and important task that finds many
applications in IR and other places
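"Similar" and "different" need a measure. A minimal sketch, assuming bag-of-words vectors and cosine similarity (one common choice, not necessarily the one used in these slides); the three documents are hypothetical.

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Cosine similarity between two documents using raw term counts."""
    va, vb = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)  # Counter returns 0 for missing terms
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

d1 = "star galaxy telescope orbit"
d2 = "galaxy telescope star planet"
d3 = "film actor award star"

print(cosine(d1, d2))  # 0.75 -> likely the same cluster
print(cosine(d1, d3))  # 0.25 -> likely different clusters
```

No category labels are supplied anywhere; the grouping follows from content alone.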
Example “Star”
Why cluster documents?
• Whole corpus analysis/navigation
– Better user interface
• For improving recall in search applications
– Better search results
• For better navigation of search results
– Effective user recall will be higher
• For speeding up vector space retrieval
– Faster search
Challenging Task
• What are the challenges of Web data?
• Why is it difficult to cluster Web data?
Information Based Problem
• Larger repository
• Unlabelled
• Dynamic
• Duplication
• Interconnected (hyperlinks)
Structure Based Problem
• Unstructured
• Heterogeneous
• Distributed
• Language dependent
Combining Clustering with IR
• Pre-Clustering
• Post-Clustering
Pre-Clustering
• Retrieve one or more clusters in their entirety in response to a query
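A minimal sketch of this idea, assuming the clusters have already been built offline and that a cluster is matched to the query by cosine similarity against its summed term vector. The cluster names, documents, and the function `retrieve_cluster` are hypothetical.

```python
import math
from collections import Counter

def vec(text):
    return Counter(text.lower().split())

def cos(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def retrieve_cluster(query, clusters):
    """Return (cluster id, all documents) of the cluster closest to the query."""
    q = vec(query)

    def summed_vector(docs):
        total = Counter()
        for d in docs:
            total.update(vec(d))
        return total  # cosine is scale-invariant, so the sum acts as a centroid

    best = max(clusters, key=lambda cid: cos(q, summed_vector(clusters[cid])))
    return best, clusters[best]

clusters = {
    "astronomy": ["star galaxy telescope orbit", "galaxy telescope planet"],
    "cinema": ["film actor award star", "actor film director"],
}
cid, docs = retrieve_cluster("telescope star", clusters)
print(cid, docs)
```

The query is compared with one vector per cluster rather than with every document, which is where the speed-up comes from.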
Post-Clustering Approaches
• Clustering is used to improve document search and retrieval
• An attempt to improve conventional search techniques
• Enhancement of nearest-neighbor search
“Scatter/Gather Method”
Limitations
• Even the Buckshot or Fractionation algorithms may be too slow for a large Web corpus
• Quality of clustering
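For context, Buckshot as commonly described clusters only a random sample of about sqrt(k*n) documents with group-average agglomerative clustering, then uses the resulting k centroids as seeds for assigning the whole collection. The sketch below follows that outline under simplifying assumptions: documents are stood in for by 2-D numeric points, distance is Euclidean, and all names are hypothetical.

```python
import math
import random

def dist(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def buckshot_seeds(points, k, rng):
    """Pick k seed centroids by clustering a sample of about sqrt(k*n) points."""
    n = len(points)
    m = min(n, max(k, math.ceil(math.sqrt(k * n))))
    groups = [[p] for p in rng.sample(points, m)]
    # Group-average agglomerative clustering, run on the sample only.
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                total = sum(dist(p, q) for p in groups[i] for q in groups[j])
                avg = total / (len(groups[i]) * len(groups[j]))
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        merged = groups.pop(j)  # j > i, so index i stays valid
        groups[i].extend(merged)
    return [centroid(g) for g in groups]

def assign(points, seeds):
    """Assign every point to its nearest seed."""
    return [min(range(len(seeds)), key=lambda c: dist(p, seeds[c]))
            for p in points]

rng = random.Random(42)
points = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
points += [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(50)]
seeds = buckshot_seeds(points, k=2, rng=rng)
labels = assign(points, seeds)
```

This naive merge loop is written for readability, not speed. Even an efficient version still has to compare every document with every seed, which is the cost the limitation above refers to at Web scale.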
Web Document Clustering Using Suffix Tree Algorithm
• Incompleteness
• Share only a few short words
• Do not contain all documents
[Figure: clustering pipeline. Preprocessing (stop word elimination, stemming, …), then feature extraction and feature selection (document-term matrix), then the clustering algorithm, which outputs clusters.]
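The preprocessing and document-term matrix steps of this pipeline can be sketched as follows. The stop word list is a tiny illustrative one, and `crude_stem` is a hypothetical stand-in for a real stemmer such as Porter's; neither is taken from the slides.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "are"}

def crude_stem(word):
    """Very rough suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def document_term_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    bags = [Counter(preprocess(d)) for d in docs]
    vocab = sorted(set().union(*bags))
    matrix = [[bag[t] for t in vocab] for bag in bags]
    return vocab, matrix

vocab, matrix = document_term_matrix(
    ["The spider crawls the Web", "Clustering of Web documents"])
print(vocab)   # ['cluster', 'crawl', 'document', 'spider', 'web']
print(matrix)  # [[0, 1, 0, 1, 1], [1, 0, 1, 0, 1]]
```

Each row of the matrix is the feature vector handed to the clustering algorithm.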
Feature Selection
• A good feature set is
– Efficient
• As low-dimensional as possible (objective)
– Effective
• Discriminates documents as much as possible (subjective)
• Feature selection is an optimization process: minimize the number of features while maximizing the discriminating power of the feature set
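One simple way to approximate this trade-off is to keep only the m terms whose counts vary most across documents. Term variance as the score is an assumption made for illustration, since the slides do not name a criterion; the function name and toy matrix are hypothetical.

```python
def select_features(vocab, matrix, m):
    """Keep the m terms with the highest variance across documents."""
    n = len(matrix)

    def variance(j):
        col = [row[j] for row in matrix]
        mu = sum(col) / n
        return sum((x - mu) ** 2 for x in col) / n

    ranked = sorted(range(len(vocab)), key=variance, reverse=True)
    keep = sorted(ranked[:m])  # restore the original column order
    return [vocab[j] for j in keep], [[row[j] for j in keep] for row in matrix]

vocab = ["cluster", "crawl", "web"]
matrix = [[0, 2, 1],
          [3, 0, 1],
          [0, 0, 1]]
new_vocab, new_matrix = select_features(vocab, matrix, m=2)
print(new_vocab)   # ['cluster', 'crawl']
print(new_matrix)  # [[0, 2], [3, 0], [0, 0]]
```

The term "web" is dropped because it occurs equally often in every document and so cannot help tell documents apart.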
Problem statements
Surprising results!
Subspace Clustering
[Figure: pipeline. Preprocessing (stop word elimination, stemming, …), then subspace clustering, which outputs clusters.]
Why Subspace Clustering?
Extension of feature selection
• Integrates feature evaluation and clustering in order to find clusters in different subspaces
• Uncovers complex relationships in the data set
• Subspace clustering: find clusters in all the subspaces
• Covers the whole document collection when forming subspaces
• Can handle new features
• Find an initial clustering in the full set of dimensions
• Find the dense regions in low-dimensional spaces
Instances (pages) and features (keywords)
Find the set of keywords (subspace) for a given group of pages
Keywords connect the group
Each cluster represents a domain
Example
Data set: 3-D (dimensions a, b, c), 400 instances
Apply k-means
In order to this
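The effect this example points at can be reproduced with a simplified, hypothetical stand-in for the data set: 400 points in three dimensions, where two groups are well separated in (a, b) while c is pure noise. The sketch runs a small k-means (k = 2, deterministic farthest-point start) once on all three dimensions and once on the (a, b) subspace, then compares cluster purity. All names and parameters are assumptions, not the data from the slides.

```python
import random

def kmeans2(points, iters=20):
    """Lloyd's k-means for k = 2 with a deterministic farthest-point start."""
    def d2(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    first = points[0]
    second = max(points, key=lambda p: d2(p, first))
    centroids = [first, second]
    assign = []
    for _ in range(iters):
        assign = [0 if d2(p, centroids[0]) <= d2(p, centroids[1]) else 1
                  for p in points]
        for c in (0, 1):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return assign

def purity(assign, labels):
    """Fraction of points that carry the majority label of their cluster."""
    total = 0
    for c in set(assign):
        counts = {}
        for a, label in zip(assign, labels):
            if a == c:
                counts[label] = counts.get(label, 0) + 1
        total += max(counts.values())
    return total / len(labels)

rng = random.Random(0)
data, labels = [], []
for label, centre in enumerate([0.0, 10.0]):
    for _ in range(200):
        a = rng.gauss(centre, 0.5)
        b = rng.gauss(centre, 0.5)
        c = rng.uniform(-100.0, 100.0)  # irrelevant, noisy dimension
        data.append((a, b, c))
        labels.append(label)

full = purity(kmeans2(data), labels)
sub = purity(kmeans2([(a, b) for a, b, _ in data]), labels)
print("full space:", full, "subspace (a, b):", sub)
```

In the full space the noisy dimension dominates the distances, so k-means tends to split along c and mixes the two groups; restricted to (a, b) it separates them cleanly.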
Unique challenges in subspace clustering
• Finding an appropriate result depends on the clustering technique
• Strengths, weaknesses, and biases of the potential clustering techniques
Research Proposal
• To investigate computationally efficient ways of combining information retrieval with clustering.