Clustering in Machine Learning

Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters consisting of similar data points. The objects with possible similarities remain in a group that has few or no similarities with another group." It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, and behavior, and divides the data according to the presence and absence of those patterns.

After applying a clustering technique, each cluster or group is assigned a cluster ID. The ML system can use this ID to simplify the processing of large and complex datasets. The clustering technique is commonly used for statistical data analysis.

Note: Clustering is somewhat similar to classification, but the difference is the type of dataset being used. In classification, we work with a labeled dataset, whereas in clustering, we work with an unlabelled dataset.

Types of Clustering Methods

Clustering methods are broadly divided into Hard clustering (each data point belongs to only one group) and Soft clustering (a data point can belong to more than one group). Beyond these, various other approaches to clustering exist. The main clustering methods used in machine learning are:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Applications of Clustering

Clustering is an unsupervised machine learning technique with many applications in pattern recognition, image analysis, customer analytics, market segmentation, social network analysis, and more. A broad range of industries use clustering, from airlines to healthcare and beyond.
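The hard/soft distinction above can be made concrete with a small sketch. This is an illustrative example, not part of the original text: it assumes scikit-learn is available and uses K-means as a hard-clustering method and a Gaussian mixture model as a soft-clustering method on the same synthetic data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Hard clustering: each point receives exactly one cluster ID.
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Soft clustering: each point receives a probability of belonging
# to each of the three clusters (the rows sum to 1).
soft_probs = GaussianMixture(n_components=3, random_state=42).fit(X).predict_proba(X)

print(hard_labels[:5])         # one integer ID per point
print(soft_probs[0].round(3))  # three membership probabilities for the first point
```

The contrast is the point here: the hard labels are single integers, while the soft output lets a point on the boundary between two groups carry partial membership in both.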
Clustering is a type of unsupervised learning, meaning that we do not need labeled data for clustering algorithms; this is one of the biggest advantages of clustering over supervised learning techniques such as classification.

K-means Algorithm

The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points belonging to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.

The K-means algorithm works as follows:

1. Specify the number of clusters K.
2. Initialize the centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
3. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing:
   - Compute the sum of the squared distances between the data points and all centroids.
   - Assign each data point to the closest cluster (centroid).
   - Compute the centroids of the clusters by taking the average of all data points that belong to each cluster.

K-means follows an Expectation-Maximization approach to solve the problem. The Expectation step assigns the data points to the closest cluster, and the Maximization step computes the centroid of each cluster.
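The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the algorithm as described (random initialization from the data, alternating assignment and centroid updates until convergence), not a production implementation; the function name and the toy data are invented for the example.

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as initial centroids (without replacement).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # E-step: squared distance from every point to every centroid ...
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # ... and assignment of each point to its closest centroid.
        labels = d2.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs, around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

On this toy data the two recovered centroids land near (0, 0) and (5, 5), and each blob ends up in its own cluster.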
While working with the K-means algorithm, we need to take care of the following things:

- While working with clustering algorithms, including K-means, it is recommended to standardize the data, because such algorithms use distance-based measures to determine the similarity between data points.
- Due to the iterative nature of K-means and the random initialization of centroids, K-means may get stuck in a local optimum and may not converge to the global optimum. That is why it is recommended to try different initializations of the centroids.

Advantages and Disadvantages

Advantages

The following are some advantages of the K-means clustering algorithm:

- It is very easy to understand and implement.
- If we have a large number of variables, K-means is faster than hierarchical clustering.
- On re-computation of the centroids, an instance can change its cluster.
- Tighter clusters are formed with K-means compared to hierarchical clustering.

Disadvantages

The following are some disadvantages of the K-means clustering algorithm:

- It is difficult to predict the number of clusters, i.e. the value of K.
- The output is strongly impacted by the initial inputs, such as the number of clusters (value of K).
- The order of the data can have a strong impact on the final output.
- It is very sensitive to rescaling. If we rescale our data by means of normalization or standardization, the output will change completely.
- It does not do a good job of clustering when the clusters have a complicated geometric shape.

Applications of K-Means Clustering Algorithm

The main goals of cluster analysis are:

- To get a meaningful intuition from the data we are working with.
- Cluster-then-predict, where different models are built for different subgroups.

To fulfill the above-mentioned goals, K-means clustering performs well enough.
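The standardization advice can be illustrated with a short sketch. The features (income and age) and their scales are invented for this example; the point is that without standardization, Euclidean distance is dominated by whichever feature has the largest spread.

```python
import numpy as np

# Two features on very different scales: income in dollars, age in years.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(60_000, 15_000, 200),  # income
                     rng.normal(40, 12, 200)])         # age

# Before standardization, the income axis dwarfs the age axis, so any
# distance-based clustering would effectively ignore age.
raw_spread = X.std(axis=0)

# Rescale each feature to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0).round(6))  # approximately [0, 0]
print(X_std.std(axis=0).round(6))   # approximately [1, 1]
```

After this transformation, both features contribute comparably to the distances that K-means minimizes. (The same rescaling is available as `StandardScaler` in scikit-learn.)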
K-means can be used in the following applications:

- Market segmentation
- Document clustering
- Image segmentation
- Image compression
- Customer segmentation
- Analyzing trends in dynamic data

Using Clustering for Image Segmentation

Introduction

Image segmentation is the process of partitioning an image into multiple regions based on the characteristics of the pixels in the original image. Clustering is a technique for grouping similar entities and labeling them. For image segmentation using clustering, we can therefore cluster similar pixels with a clustering algorithm and treat each cluster of pixels as a single segment. Let's explore image segmentation using clustering in more detail.

Image Segmentation

The process of image segmentation by clustering can be carried out using two methods:

- Agglomerative clustering
- Divisive clustering

In agglomerative clustering, we assign each pixel to a nearby cluster and then grow the clusters iteratively. The following steps outline the process:

- Each pixel starts as an individual cluster.
- Similar clusters with small inter-cluster distances (WCSS) are merged.
- These steps are repeated.

In divisive clustering, the following process is followed:

- All pixels start in a single cluster.
- The cluster is split in two along the largest inter-cluster distance, over some number of iterations.
- These steps are repeated until the optimal number of clusters is reached.

Using Clustering for Image Preprocessing

Before we can extract features from images, we need to perform some preprocessing steps to make sure the images are comparable in color, value range, and size. The preprocessing steps use OpenCV and are pipelined in clustimage:

1. colorscale: conversion of the image into, e.g., grayscale (2-D) or color (3-D).
2. scale: normalization of all pixel values into the range [0, 255].
3. dim: resizing of each image so that the number of features is the same.

Using Clustering for Semi-Supervised Learning

Clustering methods that can be applied to partially labeled data, or to data with other types of outcome measures, are known as semi-supervised clustering methods (or sometimes as supervised clustering methods). They are examples of semi-supervised learning methods, which use both labeled and unlabeled data.

Why DBSCAN?

Partitioning methods (K-means, PAM clustering) and hierarchical clustering work well for finding spherical-shaped or convex clusters. In other words, they are suitable only for compact and well-separated clusters. Moreover, they are severely affected by the presence of noise and outliers in the data. Real-life data may contain irregularities, such as:

1. Clusters of arbitrary shape, such as those shown in the figure below.
2. Noise in the data.
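The pixel-clustering idea behind image segmentation can be sketched compactly. The text above describes agglomerative and divisive (hierarchical) approaches; for a short runnable illustration this sketch instead clusters pixel intensities with K-means, on a synthetic image standing in for one loaded with OpenCV. The image contents and k=2 are invented for the example, and scikit-learn is assumed to be available.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 40x40 grayscale "image": a bright square on a dark background,
# with a little noise -- a stand-in for a real image.
rng = np.random.default_rng(0)
img = rng.normal(30, 5, (40, 40))
img[10:30, 10:30] = rng.normal(200, 5, (20, 20))

# Treat each pixel intensity as a one-feature data point and cluster into k=2.
pixels = img.reshape(-1, 1)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pixels)

# Reshape the cluster IDs back into image form: each cluster is one segment.
segments = labels.reshape(img.shape)
print(np.unique(segments))  # two segment IDs
```

For a color image the same idea applies with three features per pixel (R, G, B) instead of one; adding the pixel coordinates as extra features makes the segments spatially coherent.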
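The point about arbitrarily shaped clusters can be demonstrated with the classic two-moons dataset, where K-means fails but a density-based method succeeds. This sketch assumes scikit-learn is available; the dataset and the `eps`/`min_samples` values are illustrative choices, not from the text.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex clusters.
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-means draws a straight boundary and cuts each moon in half.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN grows clusters through dense regions, so each moon becomes
# one cluster regardless of its shape; the label -1 marks noise points.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_db_clusters = len(set(db_labels) - {-1})
print(n_db_clusters)
```

With these settings DBSCAN recovers the two moons without being told the number of clusters in advance, which is exactly the kind of irregular data the paragraph above has in mind.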
