
International Journal of Computational Intelligence and Information Security, November 2011, Vol. 2, No. 11

CLUSTERING ALGORITHMS IN DATA MINING: A REVIEW


Ruchi Saxena (Dept. of IT, BUIT, Bhopal), ruchisaxena88@gmail.com
Shubha Soni (Dept. of CSE, RKDF, Bhopal), soni.shub01@gmail.com
Upasana Sharma (Dept. of CSE, LNCT, Bhopal), sharmaupasana23@gmail.com

Abstract
Learning is the process of generating useful information from a huge amount of data. Learning can be classified into supervised learning and unsupervised learning; clustering is a kind of unsupervised learning, in which a pattern representing the common behavior or characteristics shared by a group of items can be generated. This paper gives an overview of different clustering algorithms. It describes their general working behavior, the methodologies they follow, and the parameters which affect their performance.

Keywords: Computational Intelligence, Information Security, Image Compression, Prediction, Networking

1. Introduction
Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of clustering is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters. Two or more objects belong to the same cluster if they are close according to a given distance; this is called distance-based clustering. Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if the cluster defines a concept common to all of them. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.

The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? There is no absolute best criterion that is independent of the final aim of the clustering; consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs. For instance, we could be interested in finding representatives of homogeneous groups, finding natural clusters and describing their unknown properties, finding useful and suitable groupings, or finding unusual data objects [1].

1.2 Types of Clustering

A large number of clustering algorithms exist in the literature. The choice of clustering algorithm depends both on the type of data available and on the particular purpose and application. If cluster analysis is used as a descriptive or exploratory tool, it is possible to try several algorithms on the same data to see what the data may disclose. In general, the major clustering methods can be classified into the following categories:

- Hierarchical clustering algorithms
- Partition clustering algorithms
- Spectral clustering algorithms
- Grid-based clustering algorithms
- Density-based clustering algorithms

1.2.1 Hierarchical Clustering Algorithm

Hierarchical clustering groups data objects to form a tree-shaped structure. It can be broadly classified into agglomerative and divisive hierarchical clustering. In the agglomerative approach, also called the bottom-up approach, each data point is initially considered a separate cluster, and on each iteration clusters are merged according to a criterion; the merging can be done using the single-link, complete-link, centroid, or Ward's method. In the divisive, or top-down, approach, all data points are initially considered a single cluster, which is then split into a number of clusters based on certain criteria. Examples of such algorithms are LEGClust [2], BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [3], CURE (Clustering Using REpresentatives) [4], and Chameleon [1].
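As an illustration of the bottom-up process just described, the following minimal Python sketch (not from any paper reviewed here; the toy data and the choice of Ward linkage are assumptions for demonstration) merges points into a tree with SciPy and then cuts the tree into a flat partition:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two loose groups of 2-D points (illustrative only).
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.5, (10, 2)),
                    rng.normal(5.0, 0.5, (10, 2))])

# Agglomerative step: every point starts as its own cluster and the two
# closest clusters are merged on each iteration (Ward's method here;
# 'single', 'complete', and 'centroid' are the other criteria named above).
merge_tree = linkage(points, method='ward')

# Cut the resulting tree to obtain a flat partition into two clusters.
labels = fcluster(merge_tree, t=2, criterion='maxclust')
print(labels)
```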

1.2.2 Spectral Clustering Algorithm

Spectral clustering refers to a class of techniques that rely on the eigenstructure of a similarity matrix; clusters are formed by partitioning the data points using this matrix. Any spectral clustering algorithm has three main stages [5]:

- Preprocessing: construction of the similarity matrix.
- Spectral mapping: construction of the eigenvectors of the similarity matrix.
- Postprocessing: grouping of the data points.

The advantages of spectral clustering are that strong assumptions on cluster shape are not made, it is simple to implement, its objective does not suffer from local optima, it is statistically consistent, and it is comparatively fast. The major drawback of this approach is its high computational complexity: for larger datasets it requires O(n^3) time, where n is the number of data points [6]. Examples of such algorithms are the SM (Shi and Malik), KVV (Kannan, Vempala and Vetta), and NJW (Ng, Jordan and Weiss) algorithms [2].

1.2.3 Grid-based Clustering Algorithm

Grid-based algorithms quantize the object space into a finite number of cells that form a grid structure [1], and all operations are performed on this grid. The advantage of the method is its low processing time: clustering complexity depends on the number of populated grid cells and not on the number of objects in the dataset. The features of this algorithm are: no distance computations; clustering is performed on summarized data points; cluster shapes are limited to unions of grid cells; and the complexity is usually O(number of populated grid cells). STING [1] is an example of this algorithm.

1.2.4 Density-based Clustering Algorithm

Density-based algorithms continue growing a given cluster as long as the density in its neighborhood exceeds a certain threshold [1], which makes them suitable for handling noise in the dataset. The features of this algorithm are: it handles clusters of arbitrary shape; it handles noise; it needs only one scan of the input dataset; and it needs density parameters to be initialized. DBSCAN, DENCLUE, and OPTICS [1] are examples of this algorithm.
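To make the density-based idea concrete, here is a brief sketch using scikit-learn's DBSCAN implementation (the toy data and the eps/min_samples settings are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two dense blobs plus a few scattered noise points.
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
                   rng.normal(4.0, 0.3, (20, 2))])
noise = rng.uniform(-2.0, 6.0, (5, 2))
points = np.vstack([blobs, noise])

# A cluster keeps growing while at least min_samples points fall inside an
# eps-radius neighborhood; points in sparse regions are flagged as noise.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)
print(labels)   # noise points receive the label -1
```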
1.2.5 Partition Clustering Algorithm

Partition clustering splits the data points into k partitions, where each partition represents a cluster. The partitioning is done based on a certain objective function; one such criterion is minimizing the square error, computed as

E = \sum_{i=1}^{k} \sum_{p \in C_i} \| p - m_i \|^2    (1)

where p is a point in cluster C_i and m_i is the mean of that cluster. The partition should exhibit two properties: (1) each group must contain at least one object, and (2) each object must belong to exactly one group. The main drawback of this algorithm [3] is that whenever a point is close to the center of another cluster, it gives a poor result due to the overlapping of data points.
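The square-error criterion of Eq. (1) can be evaluated directly on a k-means partition. The sketch below (a toy illustration; the data and parameters are assumptions) uses scikit-learn's KMeans:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
points = np.vstack([rng.normal(0.0, 0.4, (15, 2)),
                    rng.normal(3.0, 0.4, (15, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Square-error criterion of Eq. (1): sum of squared distances of each
# point p to the mean m_i of its assigned cluster C_i.
E = sum(np.sum((points[km.labels_ == i] - m) ** 2)
        for i, m in enumerate(km.cluster_centers_))
print(E, km.inertia_)   # km.inertia_ is the same quantity
```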

2. Literature Review of Clustering


2.1 Agglomerative Hierarchical Clustering based on Affinity Propagation Algorithm

The affinity propagation (AP) algorithm does not fix the number of clusters and does not rely on random sampling. It exhibits fast execution speed with a low error rate; however, it is hard to make it generate optimal clusters. This paper proposes an agglomerative hierarchical clustering method based on affinity propagation to overcome this limitation. The method generates an initial division by an AP partition and puts forward a k-cluster closeness, a novel cluster closeness based on neighbor relationships, to merge the clusters yielded by AP; this closeness can evade the influence of density. Based on it, the algorithm quickly and effectively performs agglomerative hierarchical clustering and generates better clusters. Experiments show that the method works better than the original AP, producing a more accurate division, and that it has an advantage in time complexity compared with adaptive affinity propagation. How to deal with data with complicated structure and noise is a direction for future research [7].
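As a hedged illustration of plain AP (the baseline this paper improves on), scikit-learn's implementation shows that the number of clusters is not supplied in advance; the toy data and damping value are assumptions:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(3)
points = np.vstack([rng.normal(c, 0.3, (10, 2)) for c in (0.0, 3.0, 6.0)])

# AP exchanges "responsibility" and "availability" messages between points;
# the number of clusters emerges from the data rather than being given as k.
ap = AffinityPropagation(damping=0.9, random_state=0).fit(points)
print(len(ap.cluster_centers_indices_), "clusters found")
```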

2.2 A Fast Genetic K-means Clustering Algorithm

This paper proposes a new clustering algorithm called the Fast Genetic K-means Algorithm (FGKA). FGKA is inspired by the Genetic K-means Algorithm (GKA) proposed by Krishna and Murty in 1999 but features several improvements over it, including efficient calculation of Total Within-Cluster Variations (TWCVs), avoidance of the illegal-string elimination overhead, and a simplified mutation operator; the initialization phase and the three operators are redefined to achieve these improvements. The experiments indicate that, while the k-means algorithm might converge to a local optimum, both FGKA and GKA always converge to the global optimum eventually, but FGKA runs much faster than GKA [8].
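FGKA's fitness is based on the TWCV; the paper's efficient incremental update rules are not reproduced here, but a direct (naive) computation of the quantity, under assumed toy data, looks like this:

```python
import numpy as np

def twcv(points, labels, k):
    """Total Within-Cluster Variation: the sum over clusters of the
    squared distances of member points to their cluster centroid."""
    total = 0.0
    for i in range(k):
        members = points[labels == i]
        if len(members) > 0:   # empty clusters contribute nothing
            total += np.sum((members - members.mean(axis=0)) ** 2)
    return total

rng = np.random.default_rng(4)
pts = rng.normal(0.0, 1.0, (30, 2))
lbl = rng.integers(0, 3, 30)   # a random assignment; FGKA evolves these
print(twcv(pts, lbl, k=3))
```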
2.3 Gene Expression Analysis Using Clustering

Data mining has become an important topic in the effective analysis of gene expression data due to its wide application in the biomedical industry. This paper extensively studies the k-means clustering algorithm for gene expression analysis. To demonstrate the effectiveness of k-means on a wide variety of data sets, two pattern recognition data sets and thirteen microarray data sets with both overlapping and non-overlapping class boundaries were studied, where the number of features/genes ranges from 4 to 7,129, the number of samples ranges from 32 to 683, and the number of clusters ranges from two to eleven. For pattern recognition, the IRIS and WBCD data are used; for microarray data, the serum data (Iyer et al.), yeast data (Cho et al.), leukemia data (Golub et al.), breast data (Golub et al.), lymphoma data (Alizadeh et al.), lung cancer data (Bhattacharjee et al.), and St. Jude leukemia data (Yeoh et al.) are used. To identify common subtypes in independent disease data, four different types of breast data (Golub et al.) and four Diffuse Large B-cell Lymphoma (DLBCL) data sets were used. The clustering error rate (or, equivalently, clustering accuracy) is the evaluation metric. K-means is easily applied to microarray data, but its performance varies with the nature and complexity of the data: maximum accuracy is achieved for the IRIS data and the lowest for DLBCL D. Since k-means has some serious drawbacks, many papers have been presented in the past to improve it; the authors plan to study k-means clustering combined with other heuristic search methods such as SA and GA [9].
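The clustering error rate used as the evaluation metric can be sketched as follows; mapping each cluster to its majority true class is one common convention, assumed here, and the study's exact formulation may differ:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# IRIS is one of the pattern recognition sets used in the study.
X, y = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Map each cluster to its majority true class, then count mismatches.
errors = 0
for c in np.unique(labels):
    members = y[labels == c]
    majority = np.bincount(members).argmax()
    errors += np.sum(members != majority)
print("clustering error rate:", errors / len(y))
```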

2.4 Enhancing Cluster Compactness using Genetic Algorithm Initialized K-means

This paper presents a new initialization technique for clustering in which a genetic algorithm is used for optimal centroid selection; these centroids then act as starting points for k-means. Previous research used GA-initialized k-means (GAIK) for clustering. Here a modification is introduced: a partition-based GA-initialized k-means (PGAIK) technique that improves clustering performance, with a within-cluster scatter criterion used to measure cluster compactness. The initialization step is very important for any clustering algorithm, and the experimental results show that the partition-based initialization performs well, yielding more compact clusters than both simple GAIK and normal random selection [10].
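The full PGAIK procedure is not reproduced in this summary; as a hedged sketch of the underlying idea (score candidate initial centroid sets by within-cluster scatter, then hand the best set to k-means), consider the following, where random sampling stands in for the GA search:

```python
import numpy as np
from sklearn.cluster import KMeans

def scatter(points, centroids):
    # Within-cluster scatter: squared distance of each point to its
    # nearest candidate centroid, summed over all points.
    d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1).sum()

rng = np.random.default_rng(5)
points = np.vstack([rng.normal(0.0, 0.4, (15, 2)),
                    rng.normal(3.0, 0.4, (15, 2))])

# Stand-in for the GA: sample candidate centroid sets and keep the one
# with the smallest scatter (a GA would evolve these sets instead).
candidates = [points[rng.choice(len(points), 2, replace=False)]
              for _ in range(20)]
best = min(candidates, key=lambda c: scatter(points, c))

km = KMeans(n_clusters=2, init=best, n_init=1).fit(points)
print(km.inertia_)
```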

2.5 Ant-based Clustering Algorithms

Ant-based clustering is a biologically inspired data clustering technique. The clustering task aims at the unsupervised classification of patterns into different groups, and the problem has been approached from different disciplines over the years. In recent years many algorithms have been developed for solving numerical and combinatorial optimization problems, and swarm intelligence algorithms are among the most promising; clustering with swarm-based algorithms is emerging as an alternative to more conventional clustering techniques and has recently been shown to produce good results in a wide variety of real-world applications. This paper presents a brief survey of ant-based clustering algorithms together with some of their applications. These algorithms have a number of features that make them an interesting subject for cluster analysis: they can automatically discover the number of clusters, they scale linearly with the dimensionality of the data, and their nature makes them fairly robust to the effects of outliers. Research on ant-based clustering is still ongoing, and the survey lists several future directions:

- a comparative study of ant clustering performance against other clustering algorithms;
- applying ant clustering to real-world applications and to multi-objective optimization problems;
- studying the effects of user-defined parameters on performance, their sensitivity, and their optimal values beyond pick and drop policies;
- a hierarchical analysis of the input data by varying some of the user-defined parameters;
- developing new probabilistic rules for picking and dropping objects (the basic rules are sketched below);
- using good validity index functions to judge the fitness of candidate partitionings and validating them mathematically;
- studying dynamic clustering with ant clustering in data mining applications;
- transforming ant clustering algorithms into supervised algorithms;
- new theoretical results on the behavior of ant clustering, study of hierarchical ant-based clustering, analysis of the working principles ant-based clustering shares with other methods, and hybridization with alternative clustering methods [11].
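The survey does not reproduce formulas, but the classic pick/drop probabilities of Deneubourg et al.'s basic model, on which most ant-based clustering algorithms build, can be sketched as follows (the k1/k2 constants are conventional example values):

```python
def pick_probability(f, k1=0.1):
    # An unladen ant picks up an object more readily when the perceived
    # local similarity f (in [0, 1]) of its neighborhood is low.
    return (k1 / (k1 + f)) ** 2

def drop_probability(f, k2=0.15):
    # A laden ant drops its object more readily where similarity is high.
    return (f / (k2 + f)) ** 2

for f in (0.05, 0.5, 0.9):
    print(f, pick_probability(f), drop_probability(f))
```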
2.6 Fuzzy Kernel K-Means Clustering Method Based on Immune Genetic Algorithm

A fuzzy kernel k-means clustering method based on an immune genetic algorithm (IGA-FKKM) is proposed in this paper to overcome the fuzzy k-means algorithm's dependence on the shape of the sample space and its tendency toward local optima. By mapping samples from the low-dimensional space into a high-dimensional feature space with a Mercer kernel, the method eliminates the influence of the shape of the sample space on clustering accuracy; meanwhile, the probability of reaching the global optimum is increased by the immune genetic algorithm, which suppresses the fluctuation that occurs in later evolution and avoids local optima. Compared with the fuzzy k-means clustering method (FKM) and the fuzzy k-means clustering method based on a genetic algorithm (GA-FKM), IGA-FKKM is validated by experimental results to obtain the global optimum and achieve higher classification accuracy. Further study will focus on the sensitivity of the clustering algorithm to initial values [12].
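The kernel trick that removes the dependence on sample-space shape can be sketched as follows. With an RBF kernel (an assumption; the paper only requires a Mercer kernel), the squared distance between a mapped sample and a cluster mean in feature space is computed from kernel values alone:

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    # Mercer (RBF) kernel: K(x, y) = exp(-gamma * ||x - y||^2).
    return np.exp(-gamma * np.sum((x - y) ** 2))

def feature_space_dist2(x, cluster, gamma=0.5):
    """Squared distance ||phi(x) - m||^2 to the mean m of a cluster in the
    induced feature space, expanded via the kernel trick:
    K(x,x) - (2/|C|) sum_c K(x,c) + (1/|C|^2) sum_{c,c'} K(c,c')."""
    n = len(cluster)
    term1 = rbf(x, x, gamma)   # equals 1 for the RBF kernel
    term2 = 2.0 / n * sum(rbf(x, c, gamma) for c in cluster)
    term3 = sum(rbf(c, d, gamma) for c in cluster for d in cluster) / n**2
    return term1 - term2 + term3

rng = np.random.default_rng(6)
cluster = list(rng.normal(0.0, 0.3, (8, 2)))
print(feature_space_dist2(np.array([0.1, 0.0]), cluster))
```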

2.7 An Improved Genetic Algorithm for Document Clustering with Semantic Similarity Measure

This paper proposes a self-organized genetic algorithm for document clustering based on a semantic similarity measure. A common problem in text clustering is that a document is traditionally represented as a bag of words, while the conceptual similarity between pairs of documents is ignored; the authors take advantage of a thesaurus-based ontology to overcome this problem. To investigate how the ontology method can be used effectively in document clustering, a hybrid strategy is implemented that combines the thesaurus-based semantic similarity measure with the vector space model (VSM) measure to provide a more accurate assessment of similarity between documents. Considering the interplay between population diversity and selective pressure, an approach with dynamic evolution operators is also put forward. In the experiments, data sets of 200 documents (four topics) and 600 documents (six topics) excerpted from the Reuters-21578 corpus are used for testing. The results show that the genetic algorithm in conjunction with the hybrid semantic strategy, i.e., the combination of the thesaurus-based and VSM-based measures, outperforms the sole VSM measure and achieves the best clustering performance in terms of precision and recall; the proposed self-organized genetic algorithm also evolves the document clustering more efficiently than the standard k-means algorithm under the same similarity strategy. As discussed in the paper, some important words that are reduced to incomplete forms after stemming are not included in the WordNet lexicon and are therefore not considered as concepts for similarity evaluation. In the future the authors will refine the algorithm by using a better parser, for example Text Analyst, or combine it with a corpus-based method to overcome this problem [13].
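The hybrid strategy combines a VSM (cosine) measure with a thesaurus-based concept similarity. A minimal sketch of such a combination follows; the weighting scheme, the tiny thesaurus, and the stand-in concept_sim convention are all assumptions, since the paper's exact formula is not reproduced here:

```python
import numpy as np

def cosine_sim(a, b):
    # Vector space model similarity between two term-frequency vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def concept_sim(terms_a, terms_b, thesaurus):
    # Stand-in for a thesaurus-based measure: fraction of term pairs that
    # the thesaurus maps to the same concept (hypothetical convention).
    pairs = [(t1, t2) for t1 in terms_a for t2 in terms_b]
    hits = sum(1 for t1, t2 in pairs
               if thesaurus.get(t1) is not None
               and thesaurus.get(t1) == thesaurus.get(t2))
    return hits / len(pairs) if pairs else 0.0

def hybrid_sim(a, b, terms_a, terms_b, thesaurus, alpha=0.5):
    # Weighted combination of the two measures (alpha is an assumption).
    return (alpha * cosine_sim(a, b)
            + (1 - alpha) * concept_sim(terms_a, terms_b, thesaurus))

thesaurus = {'car': 'vehicle', 'automobile': 'vehicle'}
a, b = np.array([2.0, 1.0, 0.0]), np.array([1.0, 0.0, 3.0])
print(hybrid_sim(a, b, ['car'], ['automobile'], thesaurus))
```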
2.8 Hierarchical Clustering for Adaptive Refactoring Identification

This paper studies an adaptive refactoring problem. It is well known that improving the design of software systems through refactoring is one of the most important issues during the evolution of object-oriented software. The authors focus on identifying, in an adaptive manner, the refactorings needed to improve the class structure of a software system when new application classes are added to it. They propose an adaptive clustering method (HAR) based on a hierarchical agglomerative approach that adjusts the structure of the system established by a hierarchical agglomerative clustering (HAC) algorithm before the set of application classes changed. The adaptive method identifies the refactorings that would improve the structure of the extended system more efficiently, without decreasing the accuracy of the results; the reported experiment confirms that the result is reached more efficiently with HAR than by running HAC again from scratch on the extended system. Further work will be done in the following directions: isolating conditions for deciding when it is more effective to adapt the partitioning of the extended system with HAR than to recalculate it from scratch with HAC; applying the adaptive HAR algorithm to open-source case studies and real software systems; and identifying adaptive extensions of other existing automatic methods for refactoring identification [14].


3. Conclusion and Future Research


This paper has described the different methodologies and parameters associated with clustering algorithms, with particular attention to partition clustering. The main drawbacks of the k-means algorithm are finding the optimal k value and the initial centroid of each cluster; these are overcome by applying concepts such as genetic algorithms, simulated annealing, harmony search techniques, and ant colony optimization.

References
[1] http://home.dei.polimi.it/matteucc/Clustering/tutorial_html
[2] Santos, J.M., de Sá, J.M., Alexandre, L.A., 2008. LEGClust: A Clustering Algorithm Based on Layered Entropic Subgraphs. IEEE Transactions on Pattern Analysis and Machine Intelligence: 62-75.
[3] Zhang, T., Ramakrishnan, R., Livny, M., 1996. BIRCH: An Efficient Clustering Method for Very Large Databases. Proc. ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery: 103-114.
[4] Guha, S., Rastogi, R., Shim, K., 1998. CURE: An Efficient Clustering Algorithm for Large Databases. Proc. ACM Int'l Conf. on Management of Data: 73-84.
[5] Meila, M., Verma, D., 2001. A Comparison of Spectral Clustering Algorithms. University of Washington, Technical Report.
[6] Cai, X.Y., et al., 2008. Survey on Spectral Clustering Algorithms. Computer Science: 14-18.
[7] Zhang, Q., Chen, X., 2010. Agglomerative Hierarchical Clustering based on Affinity Propagation Algorithm. International Symposium on Knowledge Acquisition and Modeling.
[8] Lu, Y., Lu, S., Fotouhi, F., 2004. FGKA: A Fast Genetic K-means Clustering Algorithm. SAC '04, March 14-17.
[9] Dhiraj, K., Rath, S.K., 2009. Gene Expression Analysis Using Clustering. International Journal of Computer and Electrical Engineering, Vol. 1, No. 2.
[10] Chander, K., Kumar, D., Kumar, V., 2011. Enhancing Cluster Compactness using Genetic Algorithm Initialized K-means. International Journal of Software Engineering Research & Practices, Vol. 1, Issue 1.
[11] Mohamed Jafar, O.A., Sivakumar, R., 2010. Ant-based Clustering Algorithms: A Brief Survey. International Journal of Computer Theory and Engineering, Vol. 2, No. 5.
[12] Gu, C., Zhang, S., Liu, K., Huang, H., 2011. Fuzzy Kernel K-Means Clustering Method Based on Immune Genetic Algorithm. Journal of Computational Information Systems.
[13] Song, W., Park, S.C., 2008. An Improved Genetic Algorithm for Document Clustering with Semantic Similarity Measure. Fourth International Conference on Natural Computation.
[14] Czibula, I.G., Czibula, G., 2008. Hierarchical Clustering for Adaptive Refactoring Identification.
