A Comparison of Various Distance Functions on the K-Means Clustering Algorithm
Solid State Technology
Volume: 63, Issue: 6
Publication Year: 2020
Ahmedalshmirty@gmail.com; irtefaa.radhi@uokufa.edu.iq
Abstract
In this paper, the k-means algorithm is studied through a simulation-based comparison. The performance
of the algorithm is compared for different distance formulas (Euclidean, cityblock, and cosine
distance) and for different numbers of clusters. The numerical simulation shows that the algorithm
performed best with the cityblock distance as the number of clusters increased.
Keywords: K-means algorithm, clustering, Euclidean distance, Cityblock distance, Cosine distance.
1. Introduction
Clustering is the method of classifying data by collecting similar data in one group and dissimilar data in
different groups [1]. A cluster is formed based on the greatest similarity among the characteristics of
its data and their difference from the characteristics of the data in other clusters. Clustering can
be considered one of the most important unsupervised learning techniques [2].
K-means clustering is the most popular clustering algorithm, introduced by MacQueen in 1967 [3]. It is used to
automatically divide the data into k groups: random centers are chosen for each group, then the
process is repeated to choose optimal centers, so that the members of each cluster are similar to
each other and the clusters are independent of one another [4].
In 2012, the k-means clustering algorithm was used with an effective distance measure to choose the
optimal distance between the data accurately and in the shortest time [5]. In the same year,
work on the effect of distance functions on the k-means clustering algorithm covered the Euclidean
distance function and the cityblock distance function, showing the effect of the distances on the size
of the groups formed by the algorithm [6]. In 2013, a comparison of three distances, Euclidean, Manhattan
and Minkowski, was made with the k-means clustering algorithm by [7]; the comparison was made
using graphs, and it was found that the distance measure plays an important role in clustering. The author
concluded that the Euclidean distance gives the best results and the Manhattan distance the worst.
Meanwhile, [8] used a new distance measure with the k-means clustering algorithm, describing
a basic problem in data clustering and its solution by developing a distance measure based
on the symmetry of points, thereby improving the traditional k-means algorithm. In 2015, the performance
of the k-means clustering algorithm was tested and evaluated with different distance measures
applied to a set of different datasets; the best-performing measure was identified
and recommended for each specific application [9].
In addition, Awawdeh and others improved the k-means clustering algorithm to
choose the best convergence of the midpoints within each cluster in the least computation time
while handling all the data [10].
In this paper, we discuss a set of distance measures and their effect on the k-means clustering
algorithm by applying them to a set of random data, comparing the results for each distance
measure, and identifying the best and the worst. In this way, the best performance of the algorithm
is determined with respect to the measures.
2. The K-Means Algorithm and Distance Measures
In 1967, MacQueen first proposed the k-means algorithm, an unsupervised learning
clustering algorithm [11]. The algorithm groups objects by a set of numerical
characteristics so that the objects within a group are more similar to one another than the objects
in the remaining groups. It therefore requires a criterion for measuring the similarity of objects,
and it groups objects or points into clusters through the following steps:
1. Choose k initial cluster centers at random.
2. Assign each data point to its nearest center.
3. Recompute each center as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the assignments no longer change.
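The steps above can be sketched in plain Python (an illustrative sketch, not the authors' Matlab code; the toy data and the `seed` parameter are arbitrary assumptions):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means with squared-Euclidean assignment (illustrative sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: random initial centers
    clusters = []
    for _ in range(iters):
        # step 2: assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # step 3: recompute each center as the mean of its cluster
        new_centers = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        # step 4: stop once the centers no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# two well-separated groups should be recovered as two clusters of three points
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(data, 2)
```

Random initialization means different seeds can yield different final clusterings; practical implementations usually restart several times and keep the best result.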
The standard distance for the k-means algorithm is the Euclidean distance. This is one of the oldest
methods of measuring the distance between two points, and corresponds to the straight-line distance
between two points that one could measure with a ruler [6][13].
Suppose V = {v_1, v_2, ..., v_c} is the set of centers and X = {x_1, x_2, ..., x_n} is the set of data points, each with m features. The Euclidean distance between points x_i and x_j is

D(x_i, x_j) = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}
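As a quick check, the Euclidean formula can be written directly in Python (a minimal sketch; the example points are arbitrary):

```python
import math

def euclidean(x, y):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean((0, 0), (3, 4)))  # the classic 3-4-5 triangle gives 5.0
```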
The cityblock distance between two points is the sum of the absolute differences of their coordinates, i.e., the sum of the lengths of the projections of the line segment between the points onto the coordinate axes [6].
Suppose V = {v_1, v_2, ..., v_c} is the set of centers and X = {x_1, x_2, ..., x_n} is the set of data points. The cityblock distance is then

D(x_i, x_j) = \sum_{k=1}^{m} |x_{ik} - x_{jk}|
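The cityblock (Manhattan) distance is equally short to express in code (a minimal sketch; the example points are arbitrary):

```python
def cityblock(x, y):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

print(cityblock((0, 0), (3, 4)))  # |3 - 0| + |4 - 0| = 7
```

Note how the same pair of points that is 5.0 apart under the Euclidean distance is 7 apart under cityblock; the two metrics rank distances differently, which is exactly why the choice of metric changes the clusters.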
Suppose V = {v_1, v_2, ..., v_c} is the set of centers and X = {x_1, x_2, ..., x_n} is the set of data points. The distance between cluster centers and each data point can also be calculated using the cosine distance metric as follows:

D(x_i, x_j) = \frac{\sum_{k=1}^{m} x_{ik} x_{jk}}{\sqrt{\sum_{k=1}^{m} x_{ik}^2}\,\sqrt{\sum_{k=1}^{m} x_{jk}^2}}
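The cosine ratio can be computed as follows (a minimal sketch; the example vectors are arbitrary). Note that this ratio is the cosine of the angle between the two vectors; clustering software typically uses one minus this value as the distance, so that smaller values mean more similar points:

```python
import math

def cosine_measure(x, y):
    """Cosine of the angle between vectors x and y (dot product over norms)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

print(cosine_measure((1, 0), (0, 1)))  # orthogonal vectors give 0.0
```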
3. Results of Simulation
Random data were generated and the k-means clustering algorithm was applied with different numbers
of clusters and different distance formulas, in order to assess the performance of the
algorithm and the effect of both the number of clusters and the distance measure, and to identify the most
efficient results.
After implementing the code in Matlab, the following results were obtained.
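A rough Python analogue of this experiment might look as follows (a sketch under assumptions, not the authors' Matlab code: the data range, sample size, and seed are arbitrary, and the update step uses the coordinate-wise mean for every metric, as in standard k-means):

```python
import random

def nearest(p, centers, dist):
    """Index of the center closest to p under the given distance function."""
    return min(range(len(centers)), key=lambda c: dist(p, centers[c]))

def cluster(points, k, dist, iters=50, seed=1):
    """k-means-style clustering with a pluggable distance function."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    groups = []
    for _ in range(iters):
        # assignment step under the chosen metric
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers, dist)].append(p)
        # update step: coordinate-wise mean (empty groups keep their old center)
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups

random.seed(7)
data = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(100)]
cityblock = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))
groups = cluster(data, 3, cityblock)
```

Swapping `cityblock` for a Euclidean or cosine function reproduces the paper's three experimental conditions on the same data.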
[Figure 1: four scatter plots A, B, C, and D of the data on 0-10 axes]
In Figure 1 above, the Euclidean distance was tested for several numbers of clusters on random
data: plot A shows the original data, plot B shows two clusters, plot C three clusters, and plot D
five clusters.
[Figure 2: four scatter plots A, B, C, and D of the data on 0-10 axes]
In Figure 2 above, the cityblock distance was tested for several numbers of clusters on random
data: plot A shows the original data, plot B shows two clusters, plot C three clusters, and plot D
five clusters.
[Figure 3: four scatter plots A, B, C, and D of the data on 0-10 axes]
In Figure 3 above, the cosine distance was tested for several numbers of clusters on random
data: plot A shows the original data, plot B shows two clusters, plot C three clusters, and plot D
five clusters.
Table 1: Silhouette values for each distance with different numbers of clusters

Distance     Clusters    Silhouette Value
Euclidean    2           0.5149
Euclidean    3           0.5701
Euclidean    5           0.5701
Cityblock    2           0.3821
Cityblock    3           0.3915
Cityblock    5           0.3586
Cosine       2           0.7584
Cosine       3           0.7430
Cosine       5           0.7268
The results in the table above were obtained by testing each distance with the same numbers of
clusters after applying the k-means clustering algorithm, and can be read as follows.
When the number of clusters is two, the cityblock distance is more efficient for the k-means
clustering algorithm than the Euclidean distance and the cosine distance, and when the number of clusters
is three, the cityblock distance remains the most efficient. Increasing the number of clusters
to five increases the efficiency of the cityblock distance further relative to the smaller cluster
counts, making it the best and most efficient across all results and all distances. The differences
in silhouette values are shown and explained below.
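The silhouette value reported in Table 1 can be computed per point as s = (b - a) / max(a, b), where a is the mean distance to the other members of the point's own cluster and b is the smallest mean distance to any other cluster. A self-contained sketch (the toy clusters and the Euclidean default are assumptions for illustration):

```python
import math

def euclidean(x, y):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def silhouette(point, own, others, dist=euclidean):
    """Silhouette value s = (b - a) / max(a, b) for a single point."""
    # a: mean distance to the other members of the point's own cluster
    a = sum(dist(point, q) for q in own if q != point) / max(len(own) - 1, 1)
    # b: smallest mean distance to any other cluster
    b = min(sum(dist(point, q) for q in c) / len(c) for c in others)
    return (b - a) / max(a, b)

# a point in a tight cluster far from the other cluster scores near 1
s = silhouette((0, 0), [(0, 0), (0, 1)], [[(10, 10), (10, 11)]])
```

Values near 1 indicate a point well matched to its own cluster; values near 0 or below indicate points sitting between clusters, which is how the plots in Figure 4 are read.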
[Figure 4: silhouette plots per cluster for each distance and each number of clusters, with the silhouette value on the 0-1 horizontal axis]
Figure 4: Silhouette plots for each distance with 2, 3, and 5 clusters respectively, where A
represents the Euclidean distance, B the cityblock distance, and C the cosine distance.
4. Conclusion
Since the k-means clustering algorithm has proven its importance and effectiveness in many
applications, this algorithm was addressed in this study to determine the effect of different distance
formulas on its performance.
After conducting a simulation study, it was found that the cityblock distance is more efficient and
effective than the others as the number of clusters increases.
References
[1] K. Wagstaff, C. Cardie, S. Rogers and S. Schroedl, "Constrained K-means Clustering with
Background Knowledge," in Proceedings of the Eighteenth International Conference on
Machine Learning, 2001.
[2] M. Verma, M. Srivastava, N. Chack, A. K. Diswar and N. Gupta, "A Comparative Study of
Various Clustering Algorithms in Data Mining," International Journal of Engineering
Research and Applications, 2012.
[3] P. Vora and B. Oza, "A Survey on K-mean Clustering and Particle Swarm Optimization,"
International Journal of Science and Modern Engineering, 2013.
[4] S. Naeem and A. Wumaier, "Study and Implementing K-mean Clustering Algorithm on
English Text and Techniques to Find the Optimal Value of K," International Journal of
Computer Applications, pp. 7-14, 2018.
[5] B. Shanmugapriya and M. Punithavalli, "A Modified Projected K-Means Clustering Algorithm
with Effective Distance Measure," International Journal of Computer Applications, 2012.
[6] R. Loohach and K. Garg, "Effect of Distance Functions on K-Means Clustering Algorithm,"
International Journal of Computer Applications, 2012.
[7] A. Singh, A. Yadav and A. Rana, "K-means with Three different Distance Metrics."
[8] S. I. Abudalfa and M. Mikki, "K-means algorithm with a novel distance measure,"
Turkish Journal of Electrical Engineering & Computer Sciences, 2013.
[10] S. Awawdeh, A. Edinat and A. Sleit, "An Enhanced K-means Clustering Algorithm for Multi-
attributes Data," International Journal of Computer Science and Information Security, 2019.
[11] J. Xie and S. Jiang, "A Simple and Fast Algorithm for Global K-means Clustering," Second
International Workshop on Education Technology and Computer Science, 2010.
[12] Kanika and G. Narula, "Contrasting Different Distance Functions Using K-Means,"
International Journal of Computer Science Trends and Technology, 2015.