
Solid State Technology
Volume: 63 Issue: 6
Publication Year: October 2020

A Comparison of Various Distance Functions on the K-Means Clustering Algorithm

1 Ahmed Raad Radhi; 2 Irtefaa A. Neamah

1,2 Department of Mathematics, Faculty of Computer Science and Mathematics, University of Kufa, Iraq

Ahmedalshmirty@gmail.com; irtefaa.radhi@uokufa.edu.iq

Abstract

In this paper, the k-means algorithm is studied through a comparison in a simulation. The performance of the algorithm is compared under different distance formulas (Euclidean, cityblock and cosine distance) for different numbers of clusters. As a result, the numerical simulation shows that the performance of the algorithm was better with the cityblock distance, the larger the number of clusters.

Keywords: K-means algorithm, clustering, Euclidean distance, Cityblock distance, Cosine distance.

1. Introduction

Clustering is the method of classifying data by collecting similar data in one group and dissimilar data in different groups [1]. A cluster is formed based on the greatest similarity between the data characteristics within it and their difference from the data characteristics of other clusters. Clustering can be considered one of the most important unsupervised learning techniques [2].

K-means clustering is the most popular clustering algorithm, used by MacQueen in 1967 [3]. It is used to automatically divide the data into k groups: random centers are chosen for each group, then the process is repeated to choose the optimal centers, through which the members of each cluster become similar to one another and independent of the other clusters [4].

In 2012, a modified k-means clustering algorithm with an effective distance measure was used to choose the optimal distance between the data accurately and in the shortest time [5]. In the same year, the effect of distance functions on the k-means clustering algorithm was studied, covering the effect of the Euclidean and cityblock distance functions on the algorithm and showing how the distances affect the size of the groups formed by k-means [6]. In 2013, a comparison of three distances, Euclidean, Manhattan and Minkowski, with the k-means clustering algorithm was carried out by [7]; the comparison was made using graphs, and it was found that the distance measure plays an important role in clustering. It was concluded that the Euclidean distance gives the best results and the Manhattan distance the worst. Meanwhile, [8] used a new distance measure with the k-means clustering algorithm, describing a basic problem in data clustering and solving it by developing a distance measure based on the symmetry of points, thereby improving the traditional k-means algorithm. In 2015, the performance of the k-means clustering algorithm was tested and evaluated with different distance measures applied to a set of different data; the best performance of the algorithm was determined based on the metrics, and thus the best metric was selected for a specific application [9]. In addition to the above, Awawdeh et al. improved the k-means clustering algorithm to choose the best convergence of the midpoints within the cluster with the least computation time, so that all data are dealt with [10].

In this paper, we discuss a set of distance measures and their effect on the k-means clustering algorithm by applying them to a set of random data, comparing the results for each distance measure, and choosing the best and worst. In this way, the best performance of the algorithm is determined based on the measures.

2. k-means clustering algorithm

In 1967, MacQueen first proposed the k-means algorithm, an unsupervised learning clustering algorithm [11]. This algorithm groups objects by a set of numerical characteristics so that the objects within a group are more similar to one another than to the objects in the other groups. The algorithm therefore needs a standard for measuring the similarity of objects; objects or points are grouped into clusters through the following steps (a code sketch follows the list):

1- The number of groups K must be known or chosen in advance.
2- Select 'c' centers for the K groups randomly, so that the points are somewhat apart.
3- The algorithm calculates the distance between each point and all centers and assigns the point to the closest group.
4- The new center of mass of each group is calculated.
5- The steps are repeated a specified number of times, recalculating the distance between each data point and the new centers of mass, until stable memberships are obtained in which points no longer move between groups.
6- End of the algorithm (obtain the results) [12].
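As a hedged illustration of these steps, the following Python sketch implements k-means with a pluggable distance function. It is our own illustrative rendering, not the authors' implementation (the paper's experiments used MATLAB), and the helper name `kmeans` and its arguments are assumptions for exposition:

```python
import numpy as np

def kmeans(X, k, distance, n_iter=100, seed=0):
    """Plain k-means following steps 1-6 above.

    X        -- (n, d) array of data points
    k        -- number of clusters (step 1)
    distance -- callable mapping (X, center) to an (n,) array of distances
    """
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign every point to its closest center.
        d = np.stack([distance(X, c) for c in centers], axis=1)  # (n, k)
        labels = d.argmin(axis=1)
        # Step 4: recompute each center of mass (assumes no cluster goes empty).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: repeat until the centers, and hence memberships, stabilize.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```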
2.1 Euclidean distance

The standard distance for the k-means algorithm is the Euclidean distance. This distance is one of the oldest methods of measuring the distance between two points; it is the distance between two points that one could measure with a ruler [6][13].

Suppose 𝑉 = {𝑣1, 𝑣2, 𝑣3, …, 𝑣𝑐} is the set of centers and 𝑋 = {𝑥1, 𝑥2, 𝑥3, …, 𝑥𝑛} is the data set.


1- Randomly select 'c' cluster centers.

2- Calculate the distance between the cluster centers and each data point using the following formula:

D_{xy} = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^{2}}

3- Assign each data point to its closest center.

4- Calculate the new cluster centers using:

v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_j

where c_i denotes the number of data points in the i-th cluster.

5- Recalculate the distance between each data point and the new cluster centers.

6- Repeat steps 3 to 5; if no data point is reassigned, stop.
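A minimal sketch of this formula in code, matching the pluggable-distance interface assumed in the Section 2 sketch (an illustration, not the authors' implementation):

```python
import numpy as np

def euclidean(X, c):
    # D_xy = sqrt(sum_k (x_ik - c_k)^2), computed for every row of X at once.
    return np.sqrt(((X - c) ** 2).sum(axis=1))
```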

2.2 Cityblock distance

The cityblock distance between two points is the sum of the absolute differences of their coordinates; that is, the sum of the lengths of the projections of the line segment between the points onto the coordinate axes [6].

Suppose 𝑉 = {𝑣1, 𝑣2, 𝑣3, …, 𝑣𝑐} is the set of centers and 𝑋 = {𝑥1, 𝑥2, 𝑥3, …, 𝑥𝑛} is the set of data points.

1- Randomly select 'c' cluster centers.

2- Calculate the distance between the cluster centers and each data point using the following formula:

D_{xy} = \sum_{k=1}^{m} |x_{ik} - x_{jk}|

3- The same as step 3 for the Euclidean distance.

4- Calculate the new cluster centers using:

v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_j

where c_i denotes the number of data points in the i-th cluster.

5- Recalculate the distance between each data point and the new cluster centers.

6- Repeat steps 3 to 5; if no data point is reassigned, stop [7].
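A corresponding sketch for the cityblock formula, using the same illustrative interface:

```python
import numpy as np

def cityblock(X, c):
    # D_xy = sum_k |x_ik - c_k|, computed for every row of X at once.
    return np.abs(X - c).sum(axis=1)
```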

2.3 Cosine distance

Suppose 𝑉 = {𝑣1, 𝑣2, 𝑣3, …, 𝑣𝑐} is the set of centers and 𝑋 = {𝑥1, 𝑥2, 𝑥3, …, 𝑥𝑛} is the set of data points.

1- Randomly select 'c' cluster centers.


2- Calculate the distance between the cluster centers and each data point using the cosine distance metric as follows:

D_{xy} = \frac{\sum_{k=1}^{n} x_{ik}\, x_{jk}}{\sqrt{\sum_{k=1}^{n} x_{ik}^{2}}\; \sqrt{\sum_{k=1}^{n} x_{jk}^{2}}}

(Strictly, this ratio is the cosine similarity; the cosine distance is commonly taken as one minus this value.)

3- The same as step 3 for the Euclidean distance.

4- Calculate the new cluster centers using:

v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_j

where c_i denotes the number of data points in the i-th cluster.

5- Recalculate the distance between each data point and the new cluster centers.

6- Repeat steps 3 to 5; if no data point is reassigned, stop [7][14].
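A minimal sketch of this metric in code, following the usual convention noted above that the cosine distance is one minus the similarity ratio (again an illustration using the assumed interface):

```python
import numpy as np

def cosine(X, c):
    # 1 - (x . c) / (||x|| ||c||) for every row x of X; assumes no zero vectors.
    num = X @ c
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(c)
    return 1.0 - num / den
```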

3. Results of Simulation

Random data were generated and the steps of the k-means clustering algorithm were applied, choosing different numbers of clusters for the different distance formulas, in order to assess the performance of the algorithm and the effect of both the number of clusters and the distance on it, and to select the most efficient results.

After implementing the code in MATLAB, the following results were obtained.
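As a hedged illustration of this setup in Python (the data size, range and seed here are arbitrary choices, not the authors' exact configuration; the `kmeans`, `euclidean`, `cityblock` and `cosine` helpers are the sketches defined in Section 2):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))  # random 2-D points in [0, 10]^2

for name, dist in [("euclidean", euclidean),
                   ("cityblock", cityblock),
                   ("cosine", cosine)]:
    for k in (2, 3, 5):  # the cluster counts used in the figures
        labels, centers = kmeans(X, k, dist)
        print(name, k, "cluster sizes:", np.bincount(labels, minlength=k))
```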

Figure 1 : Shows the effect of the Euclidean distance for different clusters (panels A-D)

In Figure 1 above, the Euclidean distance was tested on random data for different numbers of clusters: panel A shows the original data, panel B uses two clusters, panel C uses three clusters, and panel D uses five clusters.

Figure 2 : Shows the effect of the cityblock distance for different clusters

In Figure 2 above, the cityblock distance was tested on random data for different numbers of clusters: panel A shows the original data, panel B uses two clusters, panel C uses three clusters, and panel D uses five clusters.


Figure 3 : Shows the effect of the cosine distance for different clusters

In Figure 3 above, the cosine distance was tested on random data for different numbers of clusters: panel A shows the original data, panel B uses two clusters, panel C uses three clusters, and panel D uses five clusters.

Table 1 : Silhouette values for the different distances with different numbers of clusters

Distance     Clusters    Silhouette Value
Euclidean    2           0.5149
Euclidean    3           0.5701
Euclidean    5           0.5701
Cityblock    2           0.3821
Cityblock    3           0.3915
Cityblock    5           0.3586
Cosine       2           0.7584
Cosine       3           0.7430
Cosine       5           0.7268


The results shown in the table above were obtained by testing the distances, with the same numbers of clusters for every distance, after applying the k-means clustering algorithm. They can be explained as follows.

First, when the number of clusters is two, the cityblock distance is more efficient for the k-means clustering algorithm than the Euclidean and cosine distances, and when the number of clusters is three, the cityblock distance remains the most efficient when applying the algorithm. In addition to the above, increasing the number of clusters to five increased the efficiency of the cityblock distance relative to the smaller cluster counts, making it the best and most efficient across all results and all distances. The differences in silhouette values are illustrated in Figure 4 below.


Figure 4 : Silhouette plots for the distances (Euclidean, Cityblock, Cosine) with 2, 3 and 5 clusters respectively, where A represents the Euclidean distance, B represents the cityblock distance, and C represents the cosine distance.
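For readers who want to reproduce this kind of evaluation, a sketch using scikit-learn's silhouette_score together with the illustrative helpers from the earlier sketches (the random data X and the kmeans, euclidean, cityblock and cosine functions are our assumptions; the resulting scores will not match Table 1 exactly, since the original data and MATLAB code are not available):

```python
from sklearn.metrics import silhouette_score

# Assumes X, kmeans, euclidean, cityblock and cosine from the sketches above.
for name, dist in [("euclidean", euclidean),
                   ("cityblock", cityblock),
                   ("cosine", cosine)]:
    for k in (2, 3, 5):
        labels, _ = kmeans(X, k, dist)
        # Evaluate with the same metric that was used for clustering.
        score = silhouette_score(X, labels, metric=name)
        print(f"{name:9s} k={k}  silhouette = {score:.4f}")
```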

4. Conclusion

Since the k-means clustering algorithm has proven its importance and effectiveness in many applications, it was addressed in this study to determine the effect of different distance formulas on its performance.

After conducting the simulation study, it was found that the cityblock distance is more efficient and effective than the others when the number of clusters increases.

References

[1] K. Wagsta, C. Cardie, S. Rogers and S. Schroedl, "Constrained K-means Clustering with
Background Knowledge," in Proceedings of the Eighteenth International Conference on
Machine Learning, 2001.

[2] M. Verma, M. Srivastava, N. Chack, A. K. Diswar and N. Gupta, "A Comparative Study of
Various Clustering Algorithms in Data Mining," International Journal of Engineering
Research and Applications, 2012.

[3] P. Vora and B. Oza, "A Survey on K-mean Clustering and Particle Swarm Optimization," International Journal of Science and Modern Engineering, 2013.

[4] S. Naeem and A. Wumaier, "Study and Implementing K-mean Clustering Algorithm on
English Text and Techniques to Find the Optimal Value of K," International Journal of
Computer Applications, pp. 7-14, 2018.

[5] B. Shanmugapriya and M. Punithavalli, "A Modified Projected K-Means Clustering Algorithm
with Effective Distance Measure," International Journal of Computer Applications, 2012.

[6] R. Loohach and K. Garg, "Effect of Distance Functions on K-Means Clustering Algorithm,"
International Journal of Computer Applications, 2012.

[7] A. Singh, A. Yadav and A. Rana, "K-means with Three different Distance Metrics," International Journal of Computer Applications, 2013.

[8] S. I. Abudalfa and M. Mikki, "K-means algorithm with a novel distance measure," Turkish Journal of Electrical Engineering & Computer Sciences, 2013.

[9] Y. S. Thakare and S. B. Bagal, "Performance Evaluation of K-means Clustering Algorithm with Various Distance Metrics," International Journal of Computer Applications, 2015.

[10] S. Awawdeh, A. Edinat and A. Sleit, "An Enhanced K-means Clustering Algorithm for Multi-
attributes Data," International Journal of Computer Science and Information Security, 2019.

[11] J. Xie and S. Jiang, "A Simple and Fast Algorithm for Global K-means Clustering," Second
International Workshop on Education Technology and Computer Science, 2010.

[12] Kanika and G. Narula, "Contrasting Different Distance Functions Using K-Means," International Journal of Computer Science Trends and Technology, 2015.

[13] N. Bhargava, A. Kumawat and R. Bhargava, "Fingerprint Matching of Normalized Image based on Euclidean Distance," International Journal of Computer Applications, 2015.

[14] D. Stanikunas, J. Mandravickait and T. Krilavicius, "Comparison of distance and similarity measures for stylometric analysis of Lithuanian texts," in Automatic extraction of style applied to individual authors and groups of authors, Kaunas, Lithuania, 2017.
