
Efficient Clustering Approaches for Organizing Document Collection

Dr. Aditi Sharan, Assistant Professor
Sonia, PhD Scholar
School of Computer & System Sciences
Jawaharlal Nehru University
New Delhi-110067
Table of Contents
• Information Retrieval
– Efficient Retrieval System
• Document Clustering
– Clustering Algorithm
• Feature Selection
– Dimensionality Reduction
• Subspace Clustering
– Subspace Creation
• Research Proposal
– Objective
Web Search System
[Figure: a spider crawls the Web into a document corpus; the IR system takes a query string and the corpus and returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]
IR System
• A perfect IR system always retrieves all relevant documents without retrieving any non-relevant document.
• In reality, systems retrieve relevant as well as non-relevant documents.
• To measure the effectiveness of retrieval, two ratios are used: precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved).
• Goal: present documents according to the user's need.
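The two ratios can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below is only an illustration: the function name `precision_recall` and the document IDs are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents returned, 3 documents truly relevant.
retrieved = ["Doc1", "Doc2", "Doc3", "Doc4"]
relevant = ["Doc1", "Doc3", "Doc5"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

A perfect system in the sense of the first bullet would score 1.0 on both ratios.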
Document Clustering
• Automatically partition documents into clusters based on content
– Documents within each cluster should be similar
– Documents in different clusters should be different
• Discover categories in an unsupervised manner
– No sample category labels provided by humans
– It is a common and important task that finds many
applications in IR and other places
Example “Star”
Why cluster documents?
• Whole corpus analysis/navigation
– Better user interface
• For improving recall in search applications
– Better search results
• For better navigation of search results
– Effective user recall will be higher
• For speeding up vector space retrieval
– Faster search
Challenging Task
• What are the challenges of Web data?
• Why is it difficult to cluster Web data?
Information-Based Problems
• Large repository
• Unlabelled
• Dynamic
• Duplication
• Interconnected (hyperlinks)
Structure-Based Problems
• Unstructured
• Heterogeneous
• Distributed
• Language dependent
User-Based Problems
• Insufficient query
• Heterogeneous users
• Dynamic requirements
• Behavioral changes
Why is it difficult to cluster Web data?
• Data is heterogeneous
• High dimensionality of data
• No good definition of similarity itself
• Pre-clustering of data
• Therefore, traditional clustering algorithms have to be modified, or new algorithms should be developed, to cluster Web data
Clustering Algorithms
• Hierarchical
– Bottom-up clustering
– Top-down clustering
• Partitioning
– K-means
– Buckshot
– Fractionation
• Geometric Embedding
– Self-organizing maps (SOM)
– Multidimensional scaling (MDS)
– Latent Semantic Indexing (LSI)
• Generative Models & Probabilistic
– Generative Distributions for Documents
– Expectation Maximization (EM)
– Multiple Cause Mixture Model (MCMM)
– Aspect Models and Probabilistic LSI
– Model and Feature Selection
Combining Clustering with IR
• Pre-Clustering
– Retrieve one or more clusters in their entirety in response to a query
• Post-Clustering
Post-Clustering Approaches
• Clustering is used to improve document search and retrieval
• An attempt to improve conventional search techniques
• Enhancement of near-neighbor search
• Document clustering algorithms are often slow, with quadratic running times
• Clustering can also be an effective method in its own right: the “Scatter/Gather” method
– A document browsing technique that employs document clustering as its primary operation
Scatter/Gather: A Cluster-Based Approach
• How it works
– The system clusters documents into a small number of groups (Scatter)
– The system displays short summaries of them
– The user chooses one or more of the groups for further study
– The selected groups are gathered together to form a subcollection (Gather)
– With each successive iteration the groups become smaller and more detailed
• When the groups become small enough, the process bottoms out by displaying individual documents
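The browsing loop described above can be sketched as follows. This is a minimal illustration under several assumptions: `scatter` is a toy nearest-seed grouping standing in for a real clustering algorithm (it is not Buckshot or Fractionation), the summaries are just the most frequent words of each group, and the user's choice is modelled by a hypothetical `choose` callback.

```python
from collections import Counter


def jaccard(a, b):
    """Word-overlap similarity between two documents given as strings."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def scatter(docs, k):
    """Toy stand-in for a real clusterer: each document joins the most
    similar of the first k documents, which act as seeds."""
    seeds = docs[:k]
    groups = [[] for _ in seeds]
    for doc in docs:
        best = max(range(len(seeds)), key=lambda i: jaccard(doc, seeds[i]))
        groups[best].append(doc)
    return [g for g in groups if g]


def summarize(group, n=3):
    """Short summary of a group: its n most frequent words."""
    counts = Counter(word for doc in group for word in doc.split())
    return [word for word, _ in counts.most_common(n)]


def scatter_gather(docs, choose, k=2, leaf_size=2):
    """Repeat scatter -> summarize -> choose -> gather until the
    subcollection is small enough to show individual documents."""
    while len(docs) > leaf_size:
        groups = scatter(docs, k)
        picked = choose([summarize(g) for g in groups])
        gathered = [doc for i in picked for doc in groups[i]]
        if len(gathered) >= len(docs):  # nothing was narrowed down; stop
            break
        docs = gathered
    return docs


# Hypothetical mini-collection around the ambiguous word "star".
docs = [
    "star galaxy telescope",
    "movie star actor",
    "star galaxy planet",
    "movie star award",
    "galaxy planet orbit",
]
# Simulated user who always picks the groups whose summary mentions "movie".
pick_movie = lambda summaries: [i for i, s in enumerate(summaries) if "movie" in s]
print(scatter_gather(docs, pick_movie))
```

In an interactive system the `choose` callback would be replaced by the user's selection on screen, and `scatter` by a fast clustering routine.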
Application to Scatter/Gather
Zooming into a large document collection
• Interactive browsing paradigm
• Effective information access tool
• Helpful in situations where the query is unspecified
• Comparatively fast algorithms: Buckshot and Fractionation
– linear-time preprocessing
– constant-time query processing
• Effective geometric clustering tool

Limitations
• Even the Buckshot and Fractionation algorithms may be too slow for a large corpus on the Web
• Quality of clustering
Web Document Clustering Using Suffix Tree Algorithm
• Preparing the documents
• Suffix tree construction
• Merging clusters: merge clusters x and y if |Bx ∩ By| / |Bx| > k and |Bx ∩ By| / |By| > k, where Bx is the set of documents in cluster x
• Labeling clusters: one or more labels taken from the original suffix tree
• Scoring clusters: SC = NC * ∑p(li)
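The merging step can be illustrated with a small sketch. It assumes that Bx and By are the sets of document IDs in two base clusters, that both overlap ratios must exceed k, and that merging is applied transitively (clusters linked by the test end up together); the threshold `k=0.5` and the function names are choices made for this example only.

```python
def should_merge(bx, by, k=0.5):
    """Overlap test: |Bx ∩ By| / |Bx| > k and |Bx ∩ By| / |By| > k."""
    if not bx or not by:
        return False
    common = len(bx & by)
    return common / len(bx) > k and common / len(by) > k


def merge_base_clusters(base_clusters, k=0.5):
    """Merge base clusters that are linked, directly or through other
    clusters, by the overlap test (connected components via union-find)."""
    n = len(base_clusters)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if should_merge(base_clusters[i], base_clusters[j], k):
                parent[find(i)] = find(j)

    merged = {}
    for i, docs in enumerate(base_clusters):
        merged.setdefault(find(i), set()).update(docs)
    return list(merged.values())


# Hypothetical base clusters: sets of document IDs sharing a phrase.
base = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(merge_base_clusters(base))  # [{1, 2, 3, 4}, {7, 8}]
```

Because a document ID may sit in several base clusters, the merged clusters can still overlap, which matches the non-exclusive nature of this approach.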


Analysis of STC
Applications!
• STC: an incremental, O(n)-time clustering algorithm that satisfies these requirements
• Effective for information retrieval
• Clustering of snippets versus whole documents
• Low execution time
Analysis of the STC
Drawbacks!

• Non-Exclusiveness: documents may appear in more than one cluster; no specific category
• Incompleteness: documents may share only a few short words; clusters may not contain all documents
• Absoluteness: no information about document lengths or suffix mismatches
• Topic Generating: topic identification for document clusters

Clustering High-Dimensional Data
• Clustering high-dimensional data
– Many applications: text documents, DNA micro-array data
– Major challenges:
• Many irrelevant dimensions may mask clusters
• Distance measure becomes meaningless—due to equi-distance
• Clusters may exist only in some subspaces
• Methods
– Feature transformation: only effective if most dimensions are relevant
• PCA & SVD useful only when features are highly correlated/redundant
– Feature selection: wrapper or filter approaches
• useful to find a subspace where the data have nice clusters
– Subspace-clustering: find clusters in all the possible subspaces
• CLIQUE, ProClus, and frequent pattern-based clustering
Feature Selection
• Feature selection strategy
– Remove non-informative words from documents
– Improve categorization effectiveness
– Reduce computational complexity
– Remove redundant data
– Result: Dimensionality Reduction
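One simple filter-style strategy along these lines is to drop stop words and keep only terms whose document frequency falls in a useful range. The sketch below is an illustration, not a method taken from the slides: the stop-word list, the thresholds `min_df` and `max_df_ratio`, and the sample documents are all assumptions.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to"}


def select_features(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that occur in at least min_df documents and in at most
    max_df_ratio of all documents; return them in sorted order."""
    tokenized = [set(doc.lower().split()) - STOP_WORDS for doc in docs]
    doc_freq = Counter(term for terms in tokenized for term in terms)
    n_docs = len(docs)
    return sorted(
        term
        for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df_ratio
    )


docs = [
    "the star in the galaxy",
    "a star of the movie",
    "galaxy and planet",
    "movie star award",
]
print(select_features(docs))  # ['galaxy', 'movie', 'star']
```

Here the vocabulary shrinks from five content words to three, which is the dimensionality reduction named in the last bullet.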

[Figure: Dimensionality Reduction. Data Space → Feature Space → Cluster/Class, with sizes n >> m1 >> m2 >> k]
Document Clustering using Feature Selection
[Figure: Documents → Preprocessing (Stop word Elimination, Stemming, …) → Feature Extraction (Document-Term Matrix) → Feature Selection → Clustering Algorithm → Clusters]
Feature Selection
• A good feature set is
– Efficient
• As low-dimensional as possible (objective)
– Effective
• Discriminates documents as much as possible (subjective)
• Feature selection process: an optimization process that minimizes the number of features while maximizing the discriminating power of the feature set
Problem statements

• Searching the feature space to find an optimum subset of features that satisfies the goal
• Feature selection is silent about clusters that lie in different subspaces

The Curse of Dimensionality
• When the number of dimensions increases,
– the distance between any two points becomes nearly the same
• Surprising results!
• This is why we need to study subspace clustering
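The effect can be observed with a small experiment. The sketch below assumes uniformly random points in the unit hypercube and measures the relative contrast (farthest distance minus nearest distance, divided by nearest distance) from a random query point; the sample size and the dimensions tried are arbitrary choices for illustration.

```python
import math
import random


def relative_contrast(n_points, dim, rng):
    """(max distance - min distance) / min distance from a random query
    point to n_points uniformly random points in the unit hypercube."""
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so the run is repeatable
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(200, dim, rng), 3))
```

As the dimension grows the printed contrast shrinks toward zero, meaning the nearest and farthest neighbours become almost equally far away.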


Document Clustering using Subspace
[Figure: Documents → Preprocessing (Stop word Elimination, Stemming, …) → Subspace Clustering → Clusters]
Why Subspace Clustering?
• Extension of feature selection
• Integrates feature evaluation and clustering in order to find clusters in different subspaces
• Uncovers complex relationships in the data set
• Subspace clustering: find clusters in all the subspaces
• Covers the whole document collection when forming subspaces
• Can handle new features

• Top-down subspace clustering search
• Bottom-up subspace clustering search
• Dense unit-based method
• Entropy-based method
• Transformation-based method
Subspace Clustering
Top-down Subspace Clustering Algorithms
• Multiple iterations of expensive clustering algorithms
• Find an initial clustering in the full set of dimensions
• Evaluate the subspace of each cluster
• Iterate to improve the result
Bottom-up Subspace Clustering Algorithms
• Integrate clustering and subspace selection
• Find the dense regions in low-dimensional spaces
• Combine them to form clusters
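The bottom-up idea (find dense regions in low-dimensional spaces, then combine them) can be sketched with a grid. The code below is written in the spirit of dense-unit methods such as CLIQUE but is not a faithful implementation of any published algorithm: it stops at 2-D units, assumes coordinates in [0, 1), and uses arbitrary values for `bins` and `min_count`.

```python
from collections import Counter
from itertools import combinations


def dense_units(points, bins=4, min_count=3):
    """Return the dense 1-D grid units and the dense 2-D units built from them.

    A 1-D unit is (dimension, cell index); a 2-D unit is a pair of 1-D units.
    A unit is dense when at least min_count points fall into it.
    """
    def cell(x):
        return min(int(x * bins), bins - 1)

    dims = range(len(points[0]))
    one_d = Counter((d, cell(p[d])) for p in points for d in dims)
    dense1 = {unit for unit, count in one_d.items() if count >= min_count}

    two_d = Counter()
    for p in points:
        for d1, d2 in combinations(dims, 2):
            u1, u2 = (d1, cell(p[d1])), (d2, cell(p[d2]))
            # Only combine units that are already dense in one dimension.
            if u1 in dense1 and u2 in dense1:
                two_d[(u1, u2)] += 1
    dense2 = {unit for unit, count in two_d.items() if count >= min_count}
    return dense1, dense2


# Hypothetical data: four points cluster in dimensions 0 and 1,
# while dimension 2 is spread out and acts as noise.
points = [
    (0.10, 0.10, 0.05),
    (0.12, 0.15, 0.30),
    (0.20, 0.05, 0.55),
    (0.15, 0.20, 0.80),
    (0.90, 0.60, 0.40),
    (0.60, 0.90, 0.95),
]
dense1, dense2 = dense_units(points)
print(sorted(dense1))  # [(0, 0), (1, 0)]
print(sorted(dense2))  # [((0, 0), (1, 0))]
```

The only dense 2-D unit lies in the subspace of dimensions 0 and 1, so the cluster is reported together with the subspace in which it exists.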

Text mining is particularly relevant and presents unique challenges to subspace clustering.
Applications of Subspace Clustering
• Information Integration
• Web Text Mining
• DNA Microarray
Web Text Mining
• Web pages in a document-term matrix
– Instances: pages
– Features: keywords
• Find the set of keywords (subspace) for a given group of pages
• Keywords connect the group
• Clusters represent the domain
Example
Data set: 3-D (a, b, c), 400 instances
• Cluster I: 100 instances, 2-D (a, b)
• Cluster II: 100 instances, 2-D (a, b)
• Cluster III: 100 instances, 2-D (b, c)
• Cluster IV: 100 instances, 2-D (b, c)

Apply k-means
• Does a poor job of finding the clusters
• Each cluster is spread across dimensions that are irrelevant to it
• So consider fewer dimensions


Apply Feature Transformation
• Transforms the data from a high-dimensional to a low-dimensional space
• Relative distances are preserved
• So the effect of the irrelevant dimensions remains

Apply Feature Selection
• Reduces the dimensionality
• Finds clusters in the same subspace
• Does not explain clusters in different subspaces
• So find the clusters in each subspace


Apply Subspace Clustering
• Represents the clusters in interpretable and meaningful ways
• Represents each cluster as well as the subspace in which it exists
• Uncovers the complex relationships found in the data

In order to do this
• Unique challenges in subspace clustering
• Finding an appropriate result depends on the clustering technique
• Strengths, weaknesses, and biases of potential clustering techniques
Research Proposal
• To investigate computationally efficient ways of combining information retrieval with clustering.
• Efforts will be made to explore efficient clustering algorithms that work better on high-dimensional datasets and to apply them to document clustering.
• Work on feature vector representation and reduction of its dimensionality using feature selection and subspace clustering will be investigated to make clustering algorithms more efficient for large sets of documents. Specifically, we will focus on word co-occurrence frequency to reduce the feature space for clustering.
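The proposal does not say how co-occurrence frequency will be used, so the sketch below only shows the basic quantity: how often two words appear in the same document. The function name and the sample documents are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for every unordered word pair, the number of documents in
    which both words occur. Pairs are stored in alphabetical order."""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        counts.update(combinations(terms, 2))
    return counts


docs = ["star galaxy planet", "star galaxy", "movie star"]
counts = cooccurrence_counts(docs)
print(counts[("galaxy", "star")])  # 2
print(counts[("movie", "star")])   # 1
```

Word pairs with high counts could then be candidates for grouping into a single feature, which is one conceivable way of shrinking the feature space.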
Thanks
Suggestions!!!!
