Performance Analysis of AIM-K-means & K-Means in Quality Cluster Generation

Samarjeet Borah, Mrinal Kanti Ghose
1 INTRODUCTION
Clustering [2][3][4] is a type of unsupervised learning method in which a set of elements is separated into homogeneous groups. It seeks to discover groups, or clusters, of similar objects. Generally, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster. The similarity between objects is often determined using distance measures over the various dimensions in the dataset. The variety of techniques for representing data, measuring similarity between data elements, and grouping data elements has produced a rich and often confusing assortment of clustering methods. Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification [5][3].
2 PARTITION BASED CLUSTERING METHODS

Partition based clustering methods create the clusters in one step. Only one set of clusters is created, although several different sets of clusters may be created internally within the various algorithms. Since only one set of clusters is output, the users must input the desired number of clusters. Given a database of n objects, a partition based [5] clustering algorithm constructs k partitions of the data, so that an objective function is optimized. In these clustering methods some metric or criterion function is used to determine the goodness of any proposed solution. This measure of quality could be the average distance between clusters or some other metric. One common measure of this kind is the squared error metric, which measures the squared distance from each point to the centroid of the associated cluster. Partition based clustering algorithms try to locally improve a certain criterion. The majority of them could be considered greedy algorithms, i.e., algorithms that at each step choose the best solution, which may not lead to optimal results in the end. The best solution at each step is the placement of a certain object in the cluster whose representative point is nearest to the object. This family of clustering algorithms includes the first ones that appeared in the data mining community. The most commonly used are K-means [JD88, KR90][6], PAM (Partitioning Around Medoids) [KR90], CLARA (Clustering LARge Applications) [KR90] and CLARANS (Clustering LARge ApplicatioNS) [NH94]. All of them are applicable to data sets with numerical attributes.

K-means [7] is a prototype-based, simple partitional clustering technique which attempts to find a user-specified number K of clusters. These clusters are represented by their centroids. A cluster centroid is typically the mean of the points in the cluster. K-means is a simple iterative clustering algorithm: it is simple to implement and run, relatively fast, easy to adapt, and common in practice. It is historically one of the most important algorithms in data mining. The general algorithm was introduced by Cox (1957), and Ball and Hall (1967) and MacQueen (1967) [6] first named it K-means. Since then it has become widely popular and is classified as a partitional or non-hierarchical clustering method (Jain and Dubes, 1988). It has a number of variations [8][11].

The K-means algorithm works as follows:

a. Select initial centres of the K clusters. Repeat steps b through c until the cluster membership stabilizes.
b. Generate a new partition by assigning each data point to its closest cluster centre.
c. Compute new cluster centres as the centroids of the clusters.

The algorithm can be briefly described as follows. Let us consider a dataset D having n data points x1, x2, ..., xn, partitioned into K clusters C1, C2, ..., CK with centroids m1, m2, ..., mK. K-means seeks the partition that minimizes the sum of squared error

E = Σ_{j=1..K} Σ_{xi ∈ Cj} ‖xi − mj‖²    (1)

________
Samarjeet Borah is with the Department of Computer Science & Engineering, Sikkim Manipal Institute of Technology, Majitar, Rangpo, East Sikkim-737132.
Mrinal Kanti Ghose is with the Department of Computer Science & Engineering as Professor & HOD, Sikkim Manipal Institute of Technology, Majitar, Rangpo, East Sikkim-737132.
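Steps a through c can be sketched in a few lines of plain Python (an illustrative sketch, not the implementation used in the paper; data points are assumed to be coordinate tuples):

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Minimal K-means on a list of coordinate tuples."""
    rng = random.Random(seed)
    # Step a: select initial centres arbitrarily from the data.
    centres = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step b: assign each point to its closest cluster centre.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centres[j]))
                      for p in points]
        if new_labels == labels:      # cluster membership has stabilized
            break
        labels = new_labels
        # Step c: recompute each centre as the centroid of its cluster.
        for j in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == j]
            if members:               # keep the old centre if a cluster empties
                centres[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centres, labels
```

On two well-separated groups of points the loop stabilizes after a handful of iterations, returning one centroid per group.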
2.2.1 Background

In probability theory and statistics, the Gaussian distribution is a continuous probability distribution that describes data that cluster around a mean or average. Assuming a Gaussian distribution, it is known that the interval of ±1σ around the mean contains about 68% of the population, and thus significant values concentrate around the cluster mean. Points beyond this interval may have a tendency of belonging to other clusters. We could have taken 2σ instead of 1σ, but the problem with 2σ is that it covers about 95% of the population and as a result it may lead to improper clustering: some points that are not so relevant to the cluster may also be included in it.
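The 1σ and 2σ figures quoted above are easy to check empirically (a quick sketch, not part of the paper's experiments):

```python
import random

# Empirically estimate what fraction of a Gaussian population falls
# within one and two standard deviations of the mean
# (theory: about 68.3% and 95.4% respectively).
rng = random.Random(0)
mu, sigma, n = 5.0, 2.0, 100_000
samples = [rng.gauss(mu, sigma) for _ in range(n)]
within_1s = sum(abs(x - mu) <= sigma for x in samples) / n
within_2s = sum(abs(x - mu) <= 2 * sigma for x in samples) / n
print(f"within 1 sigma: {within_1s:.3f}")  # close to 0.683
print(f"within 2 sigma: {within_2s:.3f}")  # close to 0.954
```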
2.2.2 Description

Let us assume a dataset D = {xi, i = 1, 2, ..., N} which consists of N data objects x1, x2, ..., xN, where each object has M different attribute values corresponding to the M different attributes. The value of the ith object can be given by:

Di = {xi1, xi2, ..., xiM}

Note that the relation xi = xk does not mean that xi and xk are the same objects in the real-world database; it means that the two objects have equal values for the attribute set A = {a1, a2, a3, ..., aM}. The main objective of the algorithm is to find the value k automatically before partitioning the dataset into k disjoint subsets. For distance calculation the sum of squared Euclidean distance is used in this algorithm. It aims at minimizing the average square error criterion, which is a good measure of the within-cluster variation across all the partitions. Thus the average square error criterion tries to make the k clusters as compact and separated as possible.
The distance threshold is computed from the pairwise distances between the objects in the dataset as

Distance_Threshold = μd + σd    (2)

where μd = the mean of the pairwise distances in the dataset and σd = the standard deviation of those distances.
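A minimal sketch of the Distance_Threshold computation, assuming the mean-plus-one-standard-deviation form per the 1σ argument of Sec. 2.2.1 (the exact formula used by the authors is an assumption here):

```python
import itertools
import math

def distance_threshold(points):
    """Assumed form of eq. (2): mean pairwise Euclidean distance in the
    dataset plus one standard deviation of those distances."""
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    mu = sum(dists) / len(dists)
    var = sum((d - mu) ** 2 for d in dists) / len(dists)
    return mu + math.sqrt(var)
```

For the toy 1-D dataset {0, 1, 2, 100} this gives a threshold of about 99, so only an object as far out as 100 could qualify as a second initial mean.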
Select the first mean of the initial mean set randomly from the dataset; the object selected as a mean is then removed from the temporary dataset. The procedure Distance_Threshold computes the distance threshold as given in eq. (2). Whenever a new object is considered as a candidate for a cluster mean, its average distance to the existing means is calculated as given in the equation below:

Average_Distance = (1/m) Σ_{i=1..m} Distance(mc, mi)    (3)

where M is the set of initial means chosen so far, i = 1, 2, ..., m with m ≤ n, and mc is the candidate for the new cluster mean. If the candidate satisfies the distance threshold, it is considered as a new mean and removed from the temporary dataset. The algorithm is as follows:
Input:
    D = {x1, x2, ..., xn}        // set of objects
Output:
    K                            // total number of clusters to be generated
    M = {m1, m2, ..., mk}        // the set of initial means
Algorithm:
    Copy D to a temporary dataset T
    Calculate Distance_Threshold on T
    Arbitrarily select xi as m1
    Insert m1 into M
    Remove xi from T
    For i = 1 to n - k do        // check for the next mean
    Begin:
        Arbitrarily select xi as mc
        Set L = 0
        For j = 1 to k do        // calculate the avg. distance
        Begin:
            L = L + Distance(mc, M[j])
        End
        Average_Distance = L / k
        If Average_Distance >= Distance_Threshold then:
            Remove xi from T
            Insert mc into M
            k = k + 1
    End
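The pseudocode above can be read as the following Python sketch (illustrative only: the comparison operator lost from the pseudocode is assumed to be >=, and the threshold of eq. (2) is passed in precomputed):

```python
import math

def aim_initial_means(points, threshold):
    """Sketch of the AIM selection loop: a candidate becomes a new
    initial mean when its average distance to the means chosen so far
    meets the distance threshold."""
    T = list(points)                 # temporary copy of the dataset
    means = [T.pop(0)]               # arbitrarily take the first object
    for candidate in list(T):
        # Eq. (3): average distance from the candidate to existing means.
        avg = sum(math.dist(candidate, m) for m in means) / len(means)
        if avg >= threshold:         # far enough from the existing means
            T.remove(candidate)
            means.append(candidate)
    return means                     # k = len(means) is handed to K-means
```

With points [(0,), (1,), (2,), (100,)] and a threshold of 99.0 this returns the two initial means (0,) and (100,), so K-means would then be run with k = 2.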
3 PERFORMANCE ANALYSIS
AIM is an extension of K-means that provides the number of clusters to be generated by the K-means algorithm. It also provides the initial set of means to K-means. Therefore it has been decided to make a comparative analysis of the clustering quality of AIM-K-means with conventional K-means. The main difference between the two algorithms is that in the case of AIM-K-means it is not necessary to provide the number of clusters to be generated in advance, whereas for K-means users have to provide the number of clusters to be generated.

In this evaluation process three datasets have been used. They have been fed to the algorithms in increasing order of their size. The programs were developed.

The above comparison was made on the basis of average sum of square error. From the study it has been found that AIM-K-means shows improvements in average sum of square error. This is basically because of the initial set of cluster means provided to the algorithm. In the case of K-means the value of k has been provided based on the output produced by AIM. But it is not possible to provide an initial set of cluster means to K-means.
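The evaluation metric, average sum of square error, can be sketched as follows (a minimal reading; how the authors aggregate it across datasets is not shown here):

```python
import math

def average_sse(points, centres, labels):
    """Average sum-of-square error: mean squared Euclidean distance
    from each point to the centroid of its assigned cluster."""
    total = sum(math.dist(p, centres[labels[i]]) ** 2
                for i, p in enumerate(points))
    return total / len(points)
```

A lower value means the clusters are more compact, which is the sense in which AIM-K-means is reported to improve on conventional K-means above.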
4 CONCLUSION
The most attractive property of the K-means algorithm in data mining is its efficiency in clustering large data sets. Its main disadvantage, however, is that the number of clusters has to be provided by the user. The algorithm AIM, which is an extension of K-means, can be used to enhance this efficiency by automating the selection of the initial means. From the experiments it has been found that it can improve the cluster generation process of the K-means algorithm without diminishing the clustering quality in most of the cases. The basic idea of AIM is to keep the simplicity and scalability of K-means while achieving automaticity.
ACKNOWLEDGMENT
This work has been carried out as part of a Research Promotion Scheme (RPS) Project funded by the All India Council for Technical Education, Government of India; vide sanction order 8023/BOR/RID/RPS217/200708.
REFERENCES
[1]