
Conference Paper · December 2012
DOI: 10.1109/ICECE.2012.6471633


Improvement of K-means Clustering algorithm with
better initial centroids based on weighted average
Md. Sohrab Mahmud,1 Md. Mostafizer Rahman,2 and Md. Nasim Akhtar3
Department of Computer Science & Engineering
Dhaka University of Engineering & Technology, Gazipur
Bangladesh
1 sohrabmahmud@gmail.com, 2 mostafiz26@gmail.com, 3 nasimntu@yahoo.com

Abstract—Clustering is the process of grouping similar data into a set of clusters. Cluster analysis is one of the major data analysis techniques, and k-means is one of the most popular and widely used partitioning clustering algorithms. However, the original k-means algorithm is computationally expensive, and the resulting set of clusters depends strongly on the selection of the initial centroids. Several methods have been proposed to improve the performance of the k-means clustering algorithm. In this paper we propose a heuristic method that finds better initial centroids, and thereby more accurate clusters, in less computational time. Experimental results show that the proposed algorithm generates clusters with better accuracy and thus improves the performance of the k-means clustering algorithm.

Index Terms—Data Mining, Clustering, K-means, Enhancing k-means, Improved initial centroids.

I. INTRODUCTION

Clustering is an unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). Classification is supervised learning, whereas clustering is unsupervised: classification is a procedure in which data objects are assigned to predefined classes, while clustering methods attempt to organize a set of data into clusters such that data within a given cluster have a high degree of similarity (are related), whereas objects belonging to different clusters have a high degree of dissimilarity (are unrelated). The term unsupervised here means that the clustering of data objects does not depend on predefined classes. These techniques have been widely used in various areas such as taxonomy, image processing, information retrieval, and data mining [1].

Clustering is one of the major data analysis tools in the field of data mining. Clustering algorithms are divided into two broad categories: partitioning algorithms and hierarchical algorithms. A partitioning clustering algorithm separates the data set into a defined number of sets in a single iteration, while a hierarchical clustering algorithm divides the data into smaller subsets in a hierarchical manner [3]. A good clustering method generates high-quality clusters. The quality of a clustering result depends on the similarity measure used by the method, on its implementation, and on its ability to discover some or all of the hidden patterns. A good survey of clustering methods is found in [2].

A number of clustering methods have been proposed to solve clustering problems [1]. K-means is the most popular partitioning clustering method thanks to its efficiency and simplicity in clustering large data sets. The K-means clustering algorithm partitions data objects into k different clusters [2][4][5]. Its accuracy compares well with other approaches, and it is efficient enough to cluster large data sets quickly, so it is widely used across a large number of datasets. But K-means has a major drawback: the selection of the initial seed points (centroids). The final cluster sets depend heavily, indeed almost exclusively, on the selection of these initial seed points [2]. In the original k-means the initial centroids are selected at random, which directly affects both the running time and the accuracy of the clusters: different selections of initial centroids produce different resulting clusters. The operational complexity of the basic k-means algorithm is also high [2]. Various methods have been proposed to enhance the efficiency and accuracy of the k-means clustering algorithm, and they suggest that a better initial centroid calculation can lead to accurate and efficient cluster groups as well as better time complexity [6].

In this paper we propose a novel idea to overcome this drawback of the k-means algorithm in selecting the initial seed points (centroids). The working procedure of our proposed algorithm is very simple: it calculates a standard score for each data point. A detailed discussion appears later in this paper. Both the accuracy and the complexity of our approach improve on the basic k-means algorithm. This paper is organized as follows. Section 2 presents an overview of related work on k-means with different initial centroid calculations. Section 3 discusses the original k-means algorithm. Section 4 introduces the proposed algorithm for better selection of initial centroids. Section 5 analyzes the computational complexity of the proposed algorithm. Section 6 presents the experimental data, and Section 7 gives the conclusion and future work.

II. RELATED WORK

The classic k-means algorithm is strongly dependent on the selection of the initial centroids: different selections of centroids lead to different resulting clusters, so the accuracy and quality of the clusters depend heavily on the initially selected points. A number of methods have been proposed over the last few years to improve the quality and accuracy of k-means [1].

A. M. Fahim et al. [4] proposed an enhanced algorithm to put data points into the suitable cluster. In [4], the authors proposed
an approach that requires less computational time compared to the original k-means, which is computationally expensive. But the initial centroids are still selected randomly in this method, so the method remains sensitive to the initial seed points and is not guaranteed to generate accurate clustering results.

Chen Zhang et al. in [7] proposed a new algorithm for the selection of initial centroids in k-means clustering that avoids the random selection of cluster seeds and improves the performance of k-means.

A. Bhattacharya et al. [8] proposed a heuristic clustering algorithm named the Divisive Correlation Clustering Algorithm (DCCA) for grouping data objects. The method generates clusters of the data set without taking the initial centroids or the desired number of clusters k as input. The time complexity of this algorithm is, however, very high.

Kathiresan V. et al. in [9] proposed an algorithm that selects better initial centroids based on a Z-Score ranking method. The authors calculate the Z-Score of each data point in the data set and then sort the data points by their Z-Score values. After sorting, the data set is divided into k subsets, where k is the number of desired clusters. The mean value of each subset is then calculated, and a nearby value from each subset is taken as an initial centroid. Experimental results show that the algorithm improves both accuracy and time complexity.

Madhu Yedla et al. proposed an enhanced method to find better initial centroids [1]. The algorithm first checks whether the data points contain negative data; if so, they are converted to positive data through a general conversion. The data points are then sorted based on their distance from the origin and partitioned into k (the number of desired clusters) equal sets. From each set the middle point is taken as an initial centroid.

K. A. Abdul Nazeer et al. proposed an iterative process to select initial centroids [2]. The system works in two sequential phases. In the first phase, k initial centroids are selected based on the relative distances of the data points; in the second phase, the clusters are computed based on the distance of each point from those initial centroids. The algorithm is constrained by its first phase: it needs to calculate the distance of each data point from all other data points, so a large set of data points may lead to tremendous computation.

III. K-MEANS ALGORITHM

K-means is a popular unsupervised learning algorithm and partitioning method for clustering. The basic idea of the K-means algorithm is to classify a dataset D into k different clusters, where D is a dataset of n data items and k is the number of desired clusters. The algorithm consists of two basic phases [10]. The first phase selects the initial centroids for each cluster randomly. The second phase takes each point in the dataset and assigns it to the nearest centroid [10], where the distance between points is measured by the Euclidean distance. When a new point is assigned to a cluster, the cluster mean is immediately updated by calculating the average of all the points in that cluster [3]. After all the points have been placed in some cluster, the early grouping is done. Each data object is then assigned to a cluster based on its closeness to the cluster center, where closeness is again measured by the Euclidean distance. This process of assigning data points to clusters and updating the cluster centroids continues until the convergence criterion is met, that is, until the centroids do not differ between two consecutive iterations. Once the centroids no longer move, the algorithm ends. The pseudocode for the k-means clustering algorithm is given below [3].

A. K-means partitioning algorithm:

Input:
D = d1, d2, ....., dn // set of n data items
k // the number of desired clusters

Output:
A set of k clusters.

Steps:
• Randomly choose k objects from the n items as the initial cluster centers.
• Repeat
• Assign each of the n objects to the one of the k clusters to which it is most similar, based on the mean value of the objects in the cluster.
• Update the cluster means by taking the mean value of the objects in each of the k clusters.
• Until there is no change in the cluster means / the minimum error E is reached.

IV. MODIFIED K-MEANS

A. Initial Centroids Calculation

In the improved algorithm discussed in this paper, a new method is used to find a weighted average score for the dataset. We use a uniform method to find a rank score by averaging the attributes of each data point, which generates initial centroids that follow the data distribution of the given set. A sorting algorithm is applied to the score of each data point, and the sorted data are divided into k subsets, where k is the number of desired clusters. Finally, the value nearest to the mean of each subset is taken as an initial centroid.

In recent times we often have to deal with multidimensional data. For example, a data set D consists of n data points d1, d2, d3, ....., dn, and each data point di of this set may contain multiple attributes
x1, x2, x3, ....., xm, where m is the number of attributes. In the case of multidimensional attributes we propose to attach a weight to each attribute, to ensure that the score computed while averaging the attributes follows the data distribution of the set. Another advantage of this method is that if we want to emphasize a particular feature of the data set, we can do so by increasing the weight associated with that attribute of the data objects. After multiplying each attribute by its weight factor, we sum the values and take the average by dividing the total by m. The entire set of data points is then sorted using Merge Sort, and the sorted list of data points is divided into k subsets. The data point nearest to the mean of each subset becomes the initial centroid of the cluster to be constructed. The pseudocode for the proposed algorithm is as follows.

Input:
D = d1, d2, ....., dn // set of n data items
k // number of desired clusters

Output:
A set of k initial centroids.

Steps:
1. Calculate the average score of each data point:
i) di = x1, x2, x3, ....., xm
ii) di(avg) = (w1*x1 + w2*x2 + w3*x3 + ...... + wm*xm) / m
// where x = the attribute values, m = the number of attributes, and w = the weights applied to ensure a fair distribution of clusters
2. Sort the data based on the average score;
3. Divide the data into k subsets;
4. Calculate the mean value of each subset;
5. Take the data point nearest to the mean as the initial centroid of each subset.

The method described above for finding the initial centroids of the clusters is more meaningful than the original k-means, where the centroids are selected randomly, and the algorithm converges faster than the original k-means.

V. COMPLEXITY ANALYSIS

In the basic K-means algorithm the initial centroids are chosen randomly. For this reason the cluster centroids are recalculated numerous times before the convergence criterion of the algorithm is met and the data points are assigned to their nearest centroids. Since a complete reassignment of the data points takes place according to the new centroids, this method takes O(nkl) time, where n is the number of data points, k is the number of clusters, and l is the number of iterations.

The proposed algorithm works in two phases. In the first phase, the time required to calculate the weighted average of all the data points is O(n), where n is the number of data points. The algorithm then sorts the data in ascending order; sorting the data points based on the weighted average of each item can be done in O(n log n) time using Merge Sort. The overall time complexity of the first phase of the proposed algorithm is therefore O(n log n).

The second phase of the proposed algorithm is the same as the original k-means algorithm: the distribution of the data points to the nearest cluster and the consequent recalculation of the centroids are conducted repeatedly until the convergence criterion is reached. This phase has a time complexity of O(nkl), where the symbols have the meanings given above. The experimental data show that the algorithm converges in fewer iterations, as the initial centroids are calculated strategically rather than randomly. Thus the overall complexity of the proposed algorithm is O(n(kl + log n)).

VI. EXPERIMENTAL RESULTS

A multivariate data set taken from the UCI repository of machine learning databases is used to test the accuracy and efficiency of the modified k-means algorithm. The same data set is given as input to both the standard k-means algorithm and the modified k-means algorithm. The value of k, the number of clusters, is taken as 3.

We have evaluated our algorithm on several datasets and compared our results with the standard k-means algorithm in terms of cluster accuracy and total execution time. The experimental results are shown in Table I.

TABLE I
PERFORMANCE COMPARISON OF THE PROPOSED ALGORITHM

DataSet          Number of Clusters   Algorithm          Average Time Taken (ms)
Diabetes         3                    Standard K-means   0.0938
                                      Proposed K-means   0.0781
Thyroid          3                    Standard K-means   0.1293
                                      Proposed K-means   0.1074
Blood Pressure   3                    Standard K-means   0.1055
                                      Proposed K-means   0.0924

In the standard k-means algorithm the centroids are taken randomly, but in our modified k-means algorithm the dataset and the value of k are the only inputs needed, since the initial centroids are computed automatically and the program finds optimal centroids by itself. In both the percentage of accuracy and the execution time taken for each experiment, the modified k-means algorithm shows better performance than the standard k-means algorithm.
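The seeding procedure of Section IV (steps 1-5) together with the k-means loop of Section III can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the authors' implementation: the function names, the default uniform weights, the batch-style mean update (the paper's prose describes an immediate update), and the toy data are our own assumptions.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid_of(cluster):
    """Component-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def weighted_average_scores(points, weights=None):
    """Step 1: score each point by the weighted average of its attributes."""
    m = len(points[0])
    if weights is None:
        weights = [1.0] * m  # uniform weights unless a feature is emphasized
    return [sum(w * x for w, x in zip(weights, p)) / m for p in points]

def initial_centroids(points, k, weights=None):
    """Steps 2-5: sort by score, split into k subsets, and from each subset
    take the point whose score is nearest the subset mean. Assumes len(points) >= k."""
    scored = sorted(zip(weighted_average_scores(points, weights), points))
    size = len(scored) // k
    seeds = []
    for i in range(k):
        subset = scored[i * size:(i + 1) * size] if i < k - 1 else scored[i * size:]
        mean_score = sum(s for s, _ in subset) / len(subset)
        seeds.append(min(subset, key=lambda sp: abs(sp[0] - mean_score))[1])
    return seeds

def kmeans(points, centroids, max_iter=100):
    """Batch k-means (second phase) starting from the given initial centroids."""
    k = len(centroids)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest centroid
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        new = [centroid_of(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:  # convergence: centroids no longer move
            break
        centroids = new
    return centroids, clusters
```

On a small two-group data set, `initial_centroids` picks one seed from each group, so the subsequent k-means run converges after only a couple of iterations, which is the behaviour the complexity argument above relies on.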
VII. CONCLUSION AND FUTURE WORK

The k-means algorithm is widely used for clustering large sets of data, but the standard k-means algorithm does not always ensure good results, as the accuracy of the final clusters depends on the selection of the initial centroids. The computational complexity of the standard k-means algorithm is also higher than that of our proposed k-means algorithm. This paper presents a modified k-means algorithm that finds approximate centroids which reduce the number of iterations needed to assign the data to clusters. One limitation of this algorithm is that the desired number of clusters must still be provided as input; this point remains an open research issue, and automating the choice of the value of k is suggested as future work.

REFERENCES

[1] Madhu Yedla, Srinivasa Rao Pathakota, and T. M. Srinivasa, "Enhancing K-means Clustering Algorithm with Improved Initial Center," International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 1 (2), 2010, pp. 121-125.
[2] K. A. Abdul Nazeer and M. P. Sebastian, "Improving the Accuracy and Efficiency of the k-means Clustering Algorithm," Proceedings of the World Congress on Engineering 2009, Vol. I, WCE 2009, July 1-3, 2009, London, U.K.
[3] Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, Pearson Education, 2006.
[4] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan, "An Efficient Enhanced k-means Clustering Algorithm," Journal of Zhejiang University, 10(7): 1626-1633, 2006.
[5] Kohei Arai and Ali Ridho Barakbah, "Hierarchical K-means: an algorithm for centroids initialization for K-means," Department of Information Science and Electrical Engineering, Polytechnic in Surabaya; Faculty of Science and Engineering, Saga University, Vol. 36, No. 1, 2007.
[6] K. A. Abdul Nazeer, S. D. Madhu Kumar, and M. P. Sebastian, "Enhancing the k-means clustering algorithm by using an O(n log n) heuristic method for finding better initial centroids," Second International Conference on Emerging Applications of Information Technology, 2011.
[7] Chen Zhang and Shixiong Xia, "K-means clustering algorithm with improved initial centroids," Second International Workshop on Knowledge Discovery and Data Mining (WKDD), pp. 790-792, 2009.
[8] A. Bhattacharya and R. K. De, "Divisive Correlation Clustering Algorithm (DCCA) for grouping of genes: detecting varying patterns in expression profiles," Bioinformatics, Vol. 24, pp. 1359-1366, 2008.
[9] V. Kathiresan and P. Sumathi, "An Efficient Clustering Algorithm based on Z-Score Ranking Method," International Conference on Computer Communication and Informatics (ICCCI-2012), Jan. 10-12, 2012.
[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Diego, 2001.
