Efficient Clustering Approaches For Organizing Document Collection

Dr. Aditi Sharan, Assistant Professor
Sonia, PhD Scholar
School of Computer & System Sciences
Jawaharlal Nehru University
New Delhi-110067
Table of Contents
• Information Retrieval
– Efficient Retrieval System
• Document Clustering
– Clustering Algorithm
• Feature Selection
– Dimensionality Reduction
• Subspace Clustering
– Subspace Creation
• Research Proposal
– Objective
Web Search System
[Figure: a spider crawls Web documents into a corpus; the IR system takes a query string and returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
IR System
• A perfect IR system retrieves all relevant documents without retrieving any non-relevant document.
• In reality, systems retrieve relevant as well as non-relevant documents.
• Two ratios measure retrieval effectiveness: precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved).
• Present documents according to the user's need.
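As a small illustration of the two ratios, the sketch below computes precision and recall for a single query. The function name and the document identifiers are made up for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: documents returned by the system
    relevant:  documents that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents retrieved, 3 relevant in the corpus.
p, r = precision_recall(["Doc1", "Doc2", "Doc3", "Doc4"],
                        ["Doc1", "Doc3", "Doc7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

A perfect system in the sense of the first bullet would score 1.0 on both ratios.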
Document Clustering
• Automatically partition documents into clusters based on content
– Documents within each cluster should be similar
– Documents in different clusters should be different
• Discover categories in an unsupervised manner
– No sample category labels provided by humans
– It is a common and important task with many applications in IR and elsewhere
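"Similar" needs a concrete measure before any clustering algorithm can run. One common choice, assumed here and not prescribed by these slides, is the cosine similarity between term-frequency vectors. The sentences below are invented toy data.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy documents (assumptions for illustration only).
sky_1 = "the star shines in the night sky"
sky_2 = "a bright star in the sky"
film = "the film star signed a movie contract"

print(cosine_similarity(sky_1, sky_2))  # higher: same sense of the word
print(cosine_similarity(sky_1, film))   # lower: different sense
```

Documents that share more vocabulary score closer to 1, so a clustering algorithm using this measure would tend to place the two sky sentences together.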
Example “Star”
Why cluster documents?
• Whole corpus analysis/navigation
– Better user interface
• For improving recall in search applications
– Better search results
• For better navigation of search results
– Effective user recall will be higher
• For speeding up vector space retrieval
– Faster search
Challenging Task
• What are the challenges of Web data?
• Why is it difficult to cluster Web data?
Information-Based Problems
• Large repository
• Unlabelled
• Dynamic
• Duplication
• Interconnected (hyperlinks)
Structure-Based Problems
• Unstructured
• Heterogeneous
• Distributed
• Language dependent
User-Based Problems
• Insufficient query
• Heterogeneous users
• Dynamic requirements
• Behavioral changes
Why is it difficult to cluster Web data?
• Data is heterogeneous
• High dimensionality of data
• No good definition of similarity itself
• Pre-clustering of data
• Therefore traditional clustering algorithms have to be modified, or new algorithms have to be developed, to cluster Web data
Clustering Algorithms
• Geometric
– Hierarchical: bottom-up clustering, top-down clustering
– Partitioning: K-means, Buckshot, Fractionation
– Embedding: Self-organizing maps (SOM), Multidimensional scaling (MDS), Latent Semantic Indexing (LSI)
• Generative Models & Probabilistic
– Generative distributions for documents
– Expectation Maximization (EM)
– Multiple Cause Mixture Model (MCMM)
– Aspect models and probabilistic LSI
– Model and feature selection
Combining Clustering with IR
Pre-Clustering
• Retrieve one or more clusters in their entirety in response to a query
Post-Clustering Approaches
• Clustering is used to improve document search and retrieval
• An attempt to improve conventional search techniques
• Enhancement of near-neighbor search
• Document clustering algorithms are often slow, with quadratic running times
• How can clustering be an effective method in its own right?
"Scatter/Gather Method"
A document browsing technique that employs document clustering as its primary operation.
Scatter/Gather: A Cluster-Based Approach
• How it works
– The system clusters documents into a small number of groups (Scatter)
– The system displays short summaries of the groups
– The user chooses one or more of the groups for further study
– The selected groups are gathered together to form a subcollection (Gather)
– With each successive iteration the groups become smaller and more detailed
• When the groups become small enough, the process bottoms out by displaying individual documents
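The loop described above can be sketched as follows. This is a minimal outline, not the original Scatter/Gather implementation: the clustering, summarizing, and choosing steps are passed in as functions, and the `toy_*` helpers are hypothetical stand-ins (a real system would use Buckshot or Fractionation for clustering and a human for choosing).

```python
from collections import Counter


def scatter_gather(docs, cluster_fn, summarize_fn, choose_fn, k=3, min_size=2):
    """Repeat scatter -> summarize -> choose -> gather until the collection is small."""
    collection = list(docs)
    while len(collection) > min_size:
        groups = cluster_fn(collection, k)                # Scatter
        summaries = [summarize_fn(g) for g in groups]     # short summaries
        picked = sorted({i for i in choose_fn(summaries) if 0 <= i < len(groups)})
        gathered = [d for i in picked for d in groups[i]]  # Gather
        if not gathered or len(gathered) >= len(collection):
            break  # nothing chosen, or no narrowing: stop instead of looping forever
        collection = gathered
    return collection  # small enough to display as individual documents


def toy_cluster(docs, k):
    """Stand-in clusterer: sort the texts and cut them into k contiguous chunks."""
    ordered = sorted(docs)
    size = -(-len(ordered) // k)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def toy_summary(group):
    """Stand-in summary: the two most frequent words of the group."""
    counts = Counter(w for d in group for w in d.split())
    return [w for w, _ in counts.most_common(2)]


def toy_choice(summaries):
    """Stand-in for the user: always pick the second group."""
    return [1]


docs = ["star cluster astronomy", "star galaxy telescope", "star movie actor",
        "star movie award", "tree forest leaf", "tree root soil",
        "car engine wheel", "car road driver"]
result = scatter_gather(docs, toy_cluster, toy_summary, toy_choice)
print(result)
```

Keeping the three steps pluggable mirrors the slide's point that clustering is the primary operation while the user steers the narrowing.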
Applications of Scatter/Gather
• Zooming into a large document collection
• Interactive browsing paradigm
• Effective information access tool
• Helpful in situations where the query is unspecified
• Comparatively fast algorithms: Buckshot and Fractionation
– linear-time preprocessing
– constant-time query processing
• Effective geometric clustering tool
Limitations
• Even the Buckshot and Fractionation algorithms may be too slow for a large corpus on the Web
• Quality of clustering
Web Document Clustering Using the Suffix Tree Algorithm
• Preparing the documents
• Suffix tree construction
• Merging clusters: merge clusters x and y if |Bx ∩ By| / |Bx| > k and |Bx ∩ By| / |By| > k
• Labeling clusters: one or more labels from the original suffix tree
• Scoring clusters: SC = NC × ∑ p(li)
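The merging criterion can be illustrated with a short sketch. Several things here are assumptions: each cluster is represented as a set of document IDs, the threshold is set to k = 0.5, and merging is applied transitively (connected components). The slides name only the threshold k and the two overlap ratios.

```python
def should_merge(bx, by, k=0.5):
    """True if the overlap covers more than a fraction k of BOTH clusters."""
    overlap = len(bx & by)
    return overlap / len(bx) > k and overlap / len(by) > k


def merge_clusters(clusters, k=0.5):
    """Merge non-empty clusters (sets of document IDs) that are linked by should_merge."""
    parent = list(range(len(clusters)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if should_merge(clusters[i], clusters[j], k):
                parent[find(i)] = find(j)

    merged = {}
    for i, docs in enumerate(clusters):
        merged.setdefault(find(i), set()).update(docs)
    return sorted(sorted(c) for c in merged.values())


# Hypothetical clusters over document IDs 1..8.
base = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
merged = merge_clusters(base)
print(merged)
```

In this toy input the first two clusters share two of their three documents, so both ratios exceed 0.5 and they are merged, while the third stays separate.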

Analysis of STC: Applications
• STC is an incremental, O(n)-time clustering algorithm that satisfies these requirements
• Effective for information retrieval
• Snippets versus whole-document clustering
• Low execution time
Analysis of STC: Drawbacks
• Non-exclusiveness: documents may appear in more than one cluster; no specific category
• Incompleteness: documents share only a few short words; clusters do not contain all documents
• Absoluteness: no information about document lengths or suffix mismatches
• Topic generation: topic identification for document clusters
Clustering High-Dimensional Data
• Clustering high-dimensional data
– Many applications: text documents, DNA micro-array data
– Major challenges:
• Many irrelevant dimensions may mask clusters
• Distance measures become meaningless due to equi-distance
• Clusters may exist only in some subspaces
• Methods
– Feature transformation: only effective if most dimensions are relevant
• PCA & SVD useful only when features are highly correlated/redundant
– Feature selection: wrapper or filter approaches
• useful to find a subspace where the data have nice clusters
– Subspace-clustering: find clusters in all the possible subspaces
• CLIQUE, ProClus, and frequent pattern-based clustering
Feature Selection
• Feature selection strategy
– Remove non-informative words from documents
– Improve categorization effectiveness
– Reduce computational complexity
– Remove redundant data
– Result: Dimensionality Reduction
[Figure: Dimensionality reduction from data space to feature space to cluster/class, with n >> m1 >> m2 >> k.]
Document Clustering Using Feature Selection
[Figure: pipeline. Documents → Preprocessing (stop word elimination, stemming, …) → Feature Extraction (document-term matrix) → Feature Selection → Clustering Algorithm → Clusters.]
Feature Selection
• A good feature set is
– Efficient
• Dimension as low as possible (objective)
– Effective
• Discriminates between documents as much as possible (subjective)
• The feature selection process is an optimization process: minimize the number of features while maximizing the discriminating power of the feature set
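One simple filter-style way to trade off the two goals is document-frequency thresholding: drop terms that occur in too few documents (little evidence) or in almost all of them (little discriminating power). The sketch below assumes this technique and its thresholds for illustration; the slides do not commit to a specific criterion.

```python
from collections import Counter


def select_features(tokenized_docs, min_df=2, max_df_ratio=0.8):
    """Keep terms whose document frequency lies within [min_df, max_df_ratio * N]."""
    n_docs = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(t for t, c in df.items()
                  if c >= min_df and c / n_docs <= max_df_ratio)


def document_term_matrix(tokenized_docs, vocabulary):
    """One row per document, one column per selected term (raw counts)."""
    return [[doc.count(term) for term in vocabulary] for doc in tokenized_docs]


# Toy corpus (assumption for illustration only).
docs = [
    "the star shines in the sky".split(),
    "the star of the film".split(),
    "the film won the award".split(),
    "the sky is blue".split(),
]
vocab = select_features(docs)
matrix = document_term_matrix(docs, vocab)
print(vocab)
print(matrix)
```

Here the word "the" is dropped because it appears in every document, and words seen only once are dropped as well, which shrinks eleven distinct terms to three.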
Problem statements
• Searching the feature space to find an optimum subset of features that satisfies the goal
• Feature selection is silent about clusters that exist in different subspaces
The Curse of Dimensionality
• When the number of dimensions increases,
– the distance between any two points becomes nearly the same
• Surprising results!
• This is the reason why we need to study subspace clustering
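The effect can be observed numerically. The sketch below draws uniformly random points and measures the relative contrast, (farthest − nearest) / nearest, between one query point and the rest. The uniform data, the sample size, and the choice of this contrast measure are assumptions made for illustration.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance (Python 3.8+)
    return (max(dists) - min(dists)) / min(dists)


for dim in (2, 10, 100, 500):
    print(dim, round(relative_contrast(dim), 3))
```

As the dimension grows the contrast shrinks toward zero, meaning the nearest and farthest neighbors are almost equally far away and distance-based clustering loses its footing.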

Document Clustering Using Subspace Clustering
[Figure: pipeline. Documents → Preprocessing (stop word elimination, stemming, …) → Subspace Clustering → Clusters.]
Why Subspace Clustering?
• An extension of feature selection
• Integrates feature evaluation and clustering in order to find clusters in different subspaces
• Uncovers complex relationships in the data set
• Subspace clustering: find clusters in all the subspaces
• Covers the whole document collection when forming subspaces
• Can handle new features
• Top-down subspace clustering search
• Bottom-up subspace clustering search
• Dense unit-based method
• Entropy-based method
• Transformation-based method
Subspace Clustering
• Top-down subspace clustering algorithms
– Multiple iterations of expensive clustering algorithms
– Find an initial clustering in the full set of dimensions
– Evaluate the subspace of each cluster
– Iterate to improve the result
• Bottom-up subspace clustering algorithms
– Integrate clustering and subspace selection
– Find the dense regions in low-dimensional spaces
– Combine them to form clusters
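The bottom-up idea can be sketched in the spirit of dense-unit methods such as CLIQUE. This is a simplified illustration, not a faithful implementation of any published algorithm: it assumes values in [0, 1), a fixed grid of `bins` cells per dimension, a density threshold `min_count`, and stops at 2-D subspaces. All names and the toy data are made up.

```python
from collections import Counter
from itertools import combinations


def cell_of(point, dims, bins):
    """Grid cell of a point projected onto the dimensions in dims."""
    return tuple(min(int(point[d] * bins), bins - 1) for d in dims)


def dense_units_1d(points, dim, bins, min_count):
    """Dense intervals of one dimension: cells holding at least min_count points."""
    counts = Counter(cell_of(p, (dim,), bins)[0] for p in points)
    return {c for c, n in counts.items() if n >= min_count}


def dense_units_2d(points, bins=4, min_count=4):
    """Combine dense 1-D units into candidate 2-D units and keep the dense ones."""
    n_dims = len(points[0])
    dense_1d = {d: dense_units_1d(points, d, bins, min_count) for d in range(n_dims)}
    result = {}
    for a, b in combinations(range(n_dims), 2):
        # A 2-D unit can only be dense if both of its 1-D projections are dense.
        candidates = {(ca, cb) for ca in dense_1d[a] for cb in dense_1d[b]}
        counts = Counter(c for c in (cell_of(p, (a, b), bins) for p in points)
                         if c in candidates)
        units = {c for c, n in counts.items() if n >= min_count}
        if units:
            result[(a, b)] = units
    return result


# Toy 3-D data (dimensions a, b, c): one group is tight in (a, b) and spread
# over c, the other is tight in (b, c) and spread over a.
points = [
    (0.10, 0.10, 0.05), (0.12, 0.15, 0.35), (0.05, 0.20, 0.65), (0.20, 0.05, 0.90),
    (0.05, 0.80, 0.80), (0.30, 0.85, 0.90), (0.60, 0.90, 0.85), (0.90, 0.80, 0.80),
]
subspaces = dense_units_2d(points)
print(subspaces)
```

Each key of the result is a pair of dimension indices, so the output names the subspace in which a dense region exists as well as where in that subspace it lies.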
Text mining is particularly relevant and presents unique challenges to subspace clustering.
Applications of Subspace Clustering
• Information Integration
• Web Text Mining
• DNA Microarray
Web Text Mining
• Web pages are represented as a document-term matrix
– Instances: pages
– Features: keywords
• Find the set of keywords (subspace) for a given group of pages
• The keywords connect the group
• Each cluster represents a domain
Example
• Data set: 400 instances in 3-D, dimensions (a, b, c)
• Cluster I: 100 instances, 2-D, in (a, b)
• Cluster II: 100 instances, 2-D, in (a, b)
• Cluster III: 100 instances, 2-D, in (b, c)
• Cluster IV: 100 instances, 2-D, in (b, c)
Apply k-means
• Does a poor job of finding the clusters
• Each cluster is spread out over its irrelevant dimensions
• Consider fewer dimensions

Apply Feature Transformation
• Transforms the data from a high to a low number of dimensions
• Relative distances are preserved
• The irrelevant dimensions are therefore not removed
Apply Feature Selection
• Reduces the dimensionality
• Finds clusters in the same subspace
• Does not explain clusters that lie in different subspaces
Apply Subspace Clustering
• Find the clusters in each subspace
• Represent the clusters in interpretable and meaningful ways
• Represent each cluster as well as the subspace in which it exists
• Uncover the complex relationships found in the data
Unique challenges in subspace clustering
• Finding an appropriate result depends on the clustering technique
• Strengths, weaknesses and biases of the potential clustering techniques
Research Proposal
• To investigate computationally efficient ways of combining information retrieval with clustering.
• To explore efficient clustering algorithms that work well on high-dimensional datasets, and to apply them to document clustering.
• To investigate feature vector representation and the reduction of its dimensionality using feature selection and subspace clustering, in order to make clustering algorithms more efficient for large sets of documents. Specifically, we will focus on word co-occurrence frequency to reduce the feature space for clustering.
Thanks
Suggestions!!!!
