
A TECHNICAL PAPER ON

DATA MINING AND DATA WAREHOUSING WITH SPECIAL REFERENCE
TO
PARTITIONAL ALGORITHMS IN CLUSTERING OF DATA MINING

Gudlavalleru Engineering College


by

V.M. SESHA GIRI MARELLA
III/IV B.Tech, CSE
Email: maruti_575@yahoo.co.in
Phone no: 9949805841

G.V. RAMANA
III/IV B.Tech, CSE
Email: ramana_581@yahoo.com
Phone no: 9985378225

Contents

1. Abstract

2. Keywords

3. Introduction

4. Clustering

5. Partitional Algorithms

6. K-medoid Algorithms

6.1 PAM

6.2 CLARA

6.3 CLARANS

7. Analysis

8. Conclusion

9. References

PARTITIONAL ALGORITHMS IN CLUSTERING OF DATA MINING

1. ABSTRACT

In the last few years there has been tremendous research interest in devising efficient data mining algorithms.
Clustering is an essential component of data mining. Interestingly, the special nature of data mining makes the
classical clustering algorithms unsuitable: the datasets are usually very large, they need not be numeric, and
importance must therefore be given to efficient input and output operations rather than to algorithmic complexity
alone. As a result, a number of clustering algorithms have been proposed for data mining in recent years. The
present paper gives a brief overview of the partitional clustering algorithms used in data mining. The first part
of the paper gives an overview of the clustering technique used in data mining; the second part discusses the
different partitional clustering algorithms used in mining of data.

2. KEYWORDS:
Knowledge discovery in databases, data mining, clustering, partitional algorithms, PAM,
CLARA, CLARANS.

3. INTRODUCTION:

Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately
understandable patterns in data. Knowledge discovery in databases (KDD) is a well-defined process consisting
of several distinct steps, and data mining is the core step, the one in which knowledge is actually discovered.
Data mining is a high-level application technique used to present and analyze data for decision-makers. There is
an enormous wealth of information embedded in the huge databases belonging to enterprises, and this has spurred
tremendous interest in the areas of knowledge discovery and data mining. The fundamental goals of data mining are
prediction and description. Prediction makes use of existing variables in the database in order to predict
unknown or future values of interest, while description focuses on finding patterns describing the data and on
their subsequent presentation for user interpretation. There are several mining techniques for prediction and
description. These are categorized as association, classification, sequential patterns, and clustering. The basic
premise of association is to find all associations such that the presence of one set of items in a transaction
implies the presence of other items. Classification develops profiles of different groups. Sequential pattern
mining identifies sequential patterns subject to a user-specified minimum support constraint. Clustering segments
a database into subsets, or clusters.

4. Clustering

Clustering is a useful technique for the discovery of the data distribution and of patterns in the underlying data.
The goal of clustering is to discover dense and sparse regions in a dataset. Data clustering has been studied in
the statistics, machine learning, and database communities with diverse emphases. There are two main types of
clustering techniques: partitional clustering techniques and hierarchical clustering techniques. Partitional
clustering techniques construct a partition of the database into a predefined number of clusters. Hierarchical
clustering techniques produce a sequence of partitions in which each partition is nested into the next partition
in the sequence.

[Figure: the same dataset before clustering (left) and after clustering (right)]

5. PARTITIONAL ALGORITHMS

Partitional algorithms construct a partition of a database of n objects into a set of k clusters. The construction
involves determining the partition that is optimal with respect to an objective function. There are approximately
kⁿ/k! ways of partitioning a set of n data points into k subsets; for example, even n = 25 and k = 5 give roughly
5²⁵/5! ≈ 2.5 × 10¹⁵ candidate partitions. An exhaustive enumeration method could therefore find the globally
optimal partition in principle, but it is practically infeasible unless n and k are very small. A partitional
clustering algorithm usually adopts an iterative optimization paradigm: it starts with an initial partition and
uses an iterative control strategy, swapping data points between clusters to see whether such a swap improves the
quality of the clustering. When no swap yields an improvement, it has found a locally optimal partition. The
quality of this clustering is very sensitive to the initially selected partition. There are two main categories of
partitioning algorithms.

• k-means algorithms, where each cluster is represented by the center of gravity of the cluster.

• k-medoid algorithms, where each cluster is represented by one of the objects of the cluster located near
its center (a minimal sketch contrasting the two representations follows).
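
To make the distinction concrete, here is a minimal Python sketch; it assumes numeric data points given as lists of coordinates, and the function names are illustrative rather than taken from any of the algorithms discussed below.

# Sketch: how a cluster's representative is computed under the two schemes.
# "dist" is any dissimilarity function, e.g. Euclidean distance.

def mean_representative(cluster):
    # k-means: the center of gravity, which need not be an actual data point.
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]

def medoid_representative(cluster, dist):
    # k-medoid: the member of the cluster with the least total
    # dissimilarity to all other members.
    return min(cluster, key=lambda c: sum(dist(c, p) for p in cluster))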

Most of the clustering algorithms specially designed for data mining are k-medoid algorithms. The main k-medoid
algorithms are PAM, CLARA, and CLARANS.
6. k-Medoid Algorithms

6.1 PAM
PAM uses a k-medoid method to identify the clusters. PAM selects k objects arbitrarily from the data as
medoids. In each step, a swap between a selected object Oi and a non-selected object Oh is made as long as such
a swap results in an improvement of the quality of the clustering. To calculate the effect of such a swap
between Oi and Oh, a cost Cih is computed, which is related to the quality of partitioning the non-selected objects
into k clusters represented by the medoids. So, at this stage, it is first necessary to understand the method of
partitioning the data objects when a set of k medoids is given.

Partitioning
If Oj is a non-selected object and Oi is a medoid, we say that Oj belongs to the cluster represented by Oi if
d(Oj,Oi) = Min(e) d(Oj,Oe), where the minimum is taken over all medoids Oe and d(Oa,Ob) denotes the distance,
or dissimilarity, between objects Oa and Ob. The dissimilarity matrix is known prior to the commencement of
PAM. The quality of clustering is measured by the average dissimilarity between an object and the medoid of
the cluster to which the object belongs.
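
A direct translation of this partitioning rule into Python might look as follows; this is a sketch, with dist standing for the dissimilarity looked up in the precomputed matrix, and all names illustrative.

def assign_to_medoids(objects, medoids, dist):
    # Partition: each non-selected object Oj joins the cluster of the
    # medoid Oi that minimizes d(Oj, Oi).
    clusters = {i: [] for i in range(len(medoids))}
    for o in objects:
        if o in medoids:
            continue
        nearest = min(range(len(medoids)), key=lambda i: dist(o, medoids[i]))
        clusters[nearest].append(o)
    return clusters

def clustering_quality(objects, medoids, dist):
    # Quality: average dissimilarity between each non-selected object
    # and the medoid of the cluster it belongs to (lower is better).
    non_selected = [o for o in objects if o not in medoids]
    return sum(min(dist(o, m) for m in medoids)
               for o in non_selected) / len(non_selected)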

Iterative Selection of Medoids

Let us assume that O1, O2, …, Ok are the k medoids selected at any stage, and let C1, C2, …, Ck denote the
respective clusters. From the foregoing discussion, for a non-selected object Oj (j ≠ 1, 2, …, k), if Oj ∈ Ch then
Min(1≤i≤k) d(Oj,Oi) = d(Oj,Oh). Let us now analyze the effect of swapping Oi and Oh; in other words, let us
compare the quality of clustering if we select the k medoids as O1, O2, …, Oi-1, Oh, Oi+1, …, Ok, where Oh replaces
Oi as one of the medoids. Due to the change in the set of medoids, three types of changes can occur in the actual
clustering.

• A non-selected object Oj such that Oj ∈ Ci before the swap and Oj ∈ Ch after it.
This case arises when the following conditions hold:
Min(e) d(Oj,Oe) = d(Oj,Oi) before the swap, and Min(e≠i) d(Oj,Oe) = d(Oj,Oh) after it.
Define the cost as Cjih = d(Oj,Oh) - d(Oj,Oi).

• A non-selected object Oj such that Oj ∈ Ci before the swap and Oj ∈ Cj΄ after it, with j΄ ≠ h. This case arises
when Min(e) d(Oj,Oe) = d(Oj,Oi) and Min(e≠i) d(Oj,Oe) = d(Oj,Oj΄). Define the cost as Cjih = d(Oj,Oj΄) - d(Oj,Oi).

• A non-selected object Oj such that Oj ∈ Cj΄ before the swap and Oj ∈ Ch after it.
This case arises when Min(e) d(Oj,Oe) = d(Oj,Oj΄) and d(Oj,Oh) < d(Oj,Oj΄).
Define the cost as Cjih = d(Oj,Oh) - d(Oj,Oj΄).

Define the total cost of swapping Oi and Oh as Cih = Σj Cjih. If Cih is negative, then the quality of clustering is
improved by making Oh a medoid in place of Oi. The process is repeated until we cannot find a negative Cih.
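
All three cases reduce to one computation: for each non-selected object Oj, Cjih is simply the difference between its distance to the nearest medoid after the swap and before it. A sketch of the total cost Cih in Python (illustrative names; it assumes dist(x, x) = 0 so that the replaced medoid's own contribution is counted correctly):

def swap_cost(objects, medoids, i, h, dist):
    # Total cost Cih of replacing the medoid medoids[i] by the
    # non-selected object h. Negative cost means the swap improves
    # the clustering, exactly as in the three cases above.
    new_medoids = medoids[:i] + [h] + medoids[i + 1:]
    cost = 0.0
    for o in objects:
        if o in new_medoids:
            continue
        before = min(dist(o, m) for m in medoids)      # nearest medoid now
        after = min(dist(o, m) for m in new_medoids)   # nearest after swap
        cost += after - before                         # this is Cjih
    return cost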

The algorithm can be stated as follows:

ALGORITHM

• Input: database of objects D; number of clusters k.

• Select arbitrarily k representative objects. Mark these objects as “selected” and mark the remaining as
“non-selected”.

• Repeat until no swap improves the quality of clustering:

Do for all selected objects Oi
Do for all non-selected objects Oh
Compute Cih
End do
End do

• Select imin, hmin such that C(imin,hmin) = Min(i,h) Cih.

• If C(imin,hmin) < 0,
then mark Oimin as non-selected and Ohmin as selected,
and repeat.

• Output the clusters C1, C2, …, Ck.
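
Assembling the pieces, a minimal PAM loop in Python might read as follows; it builds on the assign_to_medoids and swap_cost sketches above and picks the initial medoids at random, so treat it as an illustration rather than a reference implementation.

import random

def pam(objects, k, dist):
    # PAM sketch: keep performing the best negative-cost swap between a
    # selected (medoid) and a non-selected object until none exists.
    medoids = random.sample(objects, k)
    while True:
        best_cost, best_i, best_h = 0.0, None, None
        for i in range(k):
            for h in objects:
                if h in medoids:
                    continue
                c = swap_cost(objects, medoids, i, h, dist)
                if c < best_cost:
                    best_cost, best_i, best_h = c, i, h
        if best_i is None:          # no negative-cost swap: local optimum
            break
        medoids[best_i] = best_h    # perform the best swap and iterate
    return medoids, assign_to_medoids(objects, medoids, dist)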

6.2 CLARA

It can be observed that the major computational effort in PAM is to determine the k medoids through an
iterative optimization. CLARA, though it follows the same principle, attempts to reduce the computational effort
by relying on sampling to handle large datasets. Instead of finding representative objects for the entire dataset,
CLARA draws a sample of the dataset, applies PAM to this sample, and finds the medoids of the sample. If the
sample is drawn in a sufficiently random way, the medoids of the sample approximate the medoids of
the entire dataset. The steps of CLARA are summarized below:

ALGORITHM

• Input: database of objects D; number of clusters k.

• Repeat a fixed number of times:
1. Draw a sample S ⊂ D randomly from D.
2. Call PAM(S, k) to get k medoids.
3. Classify the entire dataset D into C1, C2, …, Ck.
4. Calculate the quality of the clustering as the average dissimilarity, and retain the medoid set with the best
quality found so far.
• End.
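
A corresponding CLARA sketch in Python is given below, reusing pam, clustering_quality, and assign_to_medoids from above; the five samples and the sample size of 40 + 2k are commonly quoted defaults assumed here, not values taken from this paper.

import random

def clara(objects, k, dist, samples=5, sample_size=None):
    # CLARA sketch: run PAM on random samples of the data and keep the
    # medoid set that clusters the FULL dataset best.
    if sample_size is None:
        sample_size = min(len(objects), 40 + 2 * k)   # assumed default
    best_medoids, best_quality = None, float("inf")
    for _ in range(samples):
        sample = random.sample(objects, sample_size)
        medoids, _ = pam(sample, k, dist)             # PAM on the sample only
        quality = clustering_quality(objects, medoids, dist)  # judged on all of D
        if quality < best_quality:
            best_medoids, best_quality = medoids, quality
    return best_medoids, assign_to_medoids(objects, best_medoids, dist)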

6.3 CLARANS

CLARANS (Clustering Large Applications based upon RANdomized Search) is similar to PAM, but it
applies a randomized iterative optimization for the determination of medoids. It is easy to see that in PAM, at
every iteration, we examine k(N-k) swaps to determine the pair corresponding to the minimum cost. CLARA, on the
other hand, tries to examine fewer pairs by restricting its search to a smaller sample of the database: if the
sample size is s ≤ N, it examines at most k(s-k) pairs at every iteration. CLARANS does not restrict
the search to any particular subset of objects, nor does it search the entire dataset. It randomly selects a few
pairs for swapping at the current state. CLARANS, like PAM, starts with a randomly selected set of k medoids.
It checks at most maxneighbour pairs for swapping, and if a pair with negative cost is found, it
updates the medoid set and continues. Otherwise, it records the current selection of medoids as a local
optimum and restarts with a new randomly selected medoid set to search for another local optimum.
CLARANS stops after numlocal local optimal medoid sets have been determined and returns the best
among these.
ALGORITHM

• Input: (D, k, maxneighbour, numlocal).

• Select arbitrarily k representative objects. Mark these objects as “selected” and all other objects as
“non-selected”. Call this set current.

• Set e = 1.

• Do while (e ≤ numlocal)
Set j = 1.
Do while (j ≤ maxneighbour)
o Consider randomly a pair (i, h) such that Oi is a selected object and Oh is a non-selected object.
o Calculate the cost Cih.
o If Cih is negative, update current: mark Oi non-selected, Oh selected, and set j = 1.
o Else increment j ← j + 1.
End do
Compare the cost of the clustering current_cost with mincost:
If current_cost < mincost
o mincost ← current_cost
o best_node ← current
Increment e ← e + 1.
• End do

• Return best_node.
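
The same steps in a Python sketch, reusing swap_cost and clustering_quality from above; all names are illustrative, and the tuning is left entirely to the maxneighbour and numlocal parameters.

import random

def clarans(objects, k, dist, maxneighbour, numlocal):
    # CLARANS sketch: randomized search over medoid sets; examine at most
    # maxneighbour random swaps per step and restart numlocal times.
    best_node, mincost = None, float("inf")
    for _ in range(numlocal):
        current = random.sample(objects, k)
        j = 1
        while j <= maxneighbour:
            i = random.randrange(k)
            h = random.choice([o for o in objects if o not in current])
            if swap_cost(objects, current, i, h, dist) < 0:
                current[i] = h       # update current and reset the counter
                j = 1
            else:
                j += 1
        current_cost = clustering_quality(objects, current, dist)
        if current_cost < mincost:   # record the better local optimum
            best_node, mincost = current, current_cost
    return best_node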

7. ANALYSIS

PAM is very robust to the existence of outliers, and the clusters it finds do not depend on the order in
which the objects are examined. However, it cannot handle very large datasets. CLARA samples the large dataset and
applies PAM to this sample, so its result depends on the sample. CLARANS applies randomized iterative
optimization for the determination of medoids and can be applied to large datasets as well. It is more efficient
than the earlier medoid-based methods, but it suffers from two major drawbacks: it assumes that all objects fit in
main memory, and its result is very sensitive to the input order. In addition, it may not find a real local
minimum, owing to the trimming of the search controlled by maxneighbour.

8. CONCLUSION

The PAM algorithm is efficient and gives good results when the dataset is small; however, it cannot be applied to
large datasets. CLARA's efficiency is determined by the sample of data taken in the sampling phase. CLARANS is
efficient for large datasets. Since the datasets from which the required data is mined are large, CLARANS is used
in practice and is an efficient partitional algorithm compared to PAM and CLARA.
9. REFERENCES:

1. Vasudha Bhatnagar, “On Mining of Data,” IETE Journal of Research, 2001.

2. Dunham, Data Mining and Warehousing.

3. IEEE papers.

4. www.datawarehouse.com

5. www.itpapers.com
