Special Topics in Communication Networks - 5G/6G Technologies

A Distance Metric for Uneven Clusters of Unsupervised K-Means Clustering

Prof. MSc Sérgio Vieira
Connected Autonomous Vehicles
Dynamic Environment
Similar velocity
Support for Offloading

New metric for K-Means algorithm
Useful for cases that require unequal size clusters
This metric can be used in autonomous vehicles'
wireless communication to distinguish low-velocity
pedestrians from high-velocity vehicles.

Unequal size

Clustering is a technique in
unsupervised learning.
It works by automatically
grouping data points
together based on their
similarities, revealing
hidden patterns and
structures within the data.

The distance metric is a crucial element in
clustering algorithms.
It defines how we measure the similarity or
difference between data points and the centers
(centroids) of their assigned clusters.
Clustering vs Classification

Unequal size clusters
Clusters of unequal sizes can arise due to various
factors such as natural groupings within data,
differences in density or dispersion, or inherent
variability in the data (Dynamic Environments).

Unequal size clusters

Each component of X is a data point.

Cluster set
Each component of C is a centroid position with the
same dimension as the data points.

Distance Metrics



Distance Metrics

The Canberra distance, proposed by Lance and
Williams in [1], is a common distance metric that
can be used in the K-Means algorithm to form
unequal-size clusters.

[1] Lance and Williams, "Mixed-data classificatory programs I. Agglomerative systems" (1967)

Canberra Distance
The Canberra distance has a problem with numbers of
opposite sign along an axis: the distance over that
dimension is always 1, regardless of their absolute
values.

Canberra Distance

Regardless of the absolute values of the two
coordinates, if they have opposite signs, the
Canberra distance will always be 1 in that dimension.
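
The opposite-sign behaviour can be checked numerically. The sketch below uses the standard Canberra definition, the sum over dimensions of |x_i - c_i| / (|x_i| + |c_i|). The function name and the sample values are illustrative assumptions and are not taken from the slides.

```python
def canberra(x, c):
    """Canberra distance: sum over dimensions of |x_i - c_i| / (|x_i| + |c_i|)."""
    total = 0.0
    for xi, ci in zip(x, c):
        denom = abs(xi) + abs(ci)
        if denom > 0:  # common convention: a 0/0 term contributes 0
            total += abs(xi - ci) / denom
    return total

# Illustrative one-dimensional values (assumed, not from the slides).
near = canberra([0.5], [-0.5])         # opposite signs, close together
far = canberra([1000.0], [-0.001])     # opposite signs, far apart
same_sign = canberra([1.0], [3.0])     # same sign, for comparison

print(near, far, same_sign)
```

Both opposite-sign pairs give a distance of 1 even though their separations differ by three orders of magnitude, while the same-sign pair gives 0.5.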

Distance Metrics

If and , the proposed distance would


If and , the proposed distance
would be:

Proposed Metric
K-Means with the Euclidean metric places the
decision boundary between two adjacent clusters at
the midpoint of the imaginary line connecting their
centroids.
In contrast, K-Means with the proposed metric places
the decision boundary closer to the centroid that is
nearer to the origin.

Proposed Metric
Compared to the Canberra metric, the proposed
metric is computationally more intensive because it
has an extra square root in the denominator.
In terms of cluster sizes, the cluster areas of the
proposed metric get wider as the centroids get
farther away from the origin, which does not happen
with the Euclidean metric, but they remain smaller
than those of the Canberra metric.
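
A rough one-dimensional check of these boundary positions is sketched below. The formula of the proposed metric is not reproduced in these slides, so `assumed_proposed` is only an assumption inferred from the remark about an extra square root in the denominator (a Canberra-style term with the denominator under a square root). The centroid values and all function names are illustrative as well.

```python
import math

def euclidean(x, c):
    return abs(x - c)

def canberra(x, c):
    return abs(x - c) / (abs(x) + abs(c))

def assumed_proposed(x, c):
    # ASSUMPTION: Canberra-style term with a square root over the denominator.
    # This form is inferred from the text, not confirmed by the slides.
    return abs(x - c) / math.sqrt(abs(x) + abs(c))

def boundary(metric, c1, c2, steps=60):
    """Bisect for the point between c1 < c2 that is equally far from both."""
    lo, hi = c1, c2
    for _ in range(steps):
        mid = (lo + hi) / 2
        if metric(mid, c1) < metric(mid, c2):
            lo = mid  # still closer to c1, so the boundary lies to the right
        else:
            hi = mid
    return (lo + hi) / 2

c1, c2 = 1.0, 9.0  # illustrative centroids on the positive axis
b_euclid = boundary(euclidean, c1, c2)
b_canberra = boundary(canberra, c1, c2)
b_assumed = boundary(assumed_proposed, c1, c2)

print(b_euclid, b_canberra, b_assumed)
```

Under this assumption the Euclidean boundary sits at the midpoint (5), the Canberra boundary at 3, and the assumed metric lands in between, so the cluster of the far centroid is wider than with the Euclidean metric and narrower than with the Canberra metric.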

Proposed Metric
This metric can be used in autonomous vehicles'
wireless communication to distinguish low-velocity
pedestrians from high-velocity vehicles.

K-Means Algorithm
The K-Means algorithm is used to group the points of
a dataset into K subgroups, based on their
similarities, in an iterative approach.

This algorithm consists of four steps that
iteratively search for a locally optimal position of
the cluster centres, also known as centroids, as
listed below.
i. Select the number of clusters K
ii. Initialize the centroids
iii. Calculate distances and assign points
iv. Update the centroid locations
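
The four steps above can be sketched in plain Python as follows. This is a minimal illustration with the Euclidean metric, not the implementation used in the paper. The function names, the random choice of initial centroids among the data points, the stopping rule, and the sample data are all assumptions made for the example.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, metric=euclidean, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step ii: initialize centroids by picking k distinct data points.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step iii: assign every point to its closest centroid.
        new_labels = [min(range(k), key=lambda j: metric(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments are stable: stop
            break
        labels = new_labels
        # Step iv: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, labels

# Step i: choose K. The illustrative data has two separated groups, so K = 2.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

The `metric` parameter is kept separate so that another distance function can be passed in without changing the loop.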

K-Means - Select number of clusters K
Selecting the appropriate number of clusters K in
K-Means is crucial for achieving meaningful results.
Elbow Method

K-Means - Elbow Method
Within Cluster Sum of Squares
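
The Within Cluster Sum of Squares (WCSS) is the sum, over all clusters, of the squared Euclidean distances between each point and the centroid of its cluster. The elbow method plots WCSS against K and picks the K after which the curve flattens. The sketch below is a minimal illustration: the deterministic farthest-first initialization, the function names, and the three-group sample data are assumptions chosen to keep the example reproducible.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first(points, k):
    """Illustrative deterministic start: take the first point, then keep
    adding the point farthest from the centroids chosen so far."""
    chosen = [points[0]]
    while len(chosen) < k:
        chosen.append(max(points,
                          key=lambda p: min(sq_dist(p, c) for c in chosen)))
    return chosen

def kmeans(points, k, max_iter=100):
    centroids = farthest_first(points, k)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(col) / len(members)
                                     for col in zip(*members)))
            else:
                updated.append(centroids[j])  # empty cluster: keep centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

def wcss(points, centroids, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

# Three illustrative groups of four points each.
points = [(cx + dx, cy + dy)
          for cx, cy in [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
          for dx, dy in [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]]

curve = {k: wcss(points, *kmeans(points, k)) for k in range(1, 6)}
for k, value in curve.items():
    print(k, round(value, 3))
```

For this data the WCSS drops sharply up to K = 3 and changes little afterwards, which is the "elbow" that suggests three clusters.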

K-Means - Initialize centroids
The initial positions of the centroids influence the
configuration of the clusters.
When the centroids are placed differently at the
start, the algorithm can converge to distinct
cluster arrangements.
Running the algorithm multiple times with different
initializations is important to obtain meaningful
results.
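
A small example makes the effect of initialization concrete. In the sketch below the same four points are clustered twice with K = 2, starting from two different pairs of initial centroids. The data, the starting positions, and the function names are illustrative assumptions.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def run_kmeans(points, centroids, max_iter=100):
    """Iterate assignment and update from the given initial centroids."""
    k = len(centroids)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(col) / len(members)
                                     for col in zip(*members)))
            else:
                updated.append(centroids[j])  # empty cluster: keep centroid
        if updated == centroids:
            break
        centroids = updated
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, wcss

# Corners of a 4 x 1 rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start A: centroids on the bottom and top edges.
labels_a, wcss_a = run_kmeans(points, [(2.0, 0.0), (2.0, 1.0)])
# Start B: centroids on the left and right edges.
labels_b, wcss_b = run_kmeans(points, [(0.0, 0.5), (4.0, 0.5)])

# Keep the run with the lowest within-cluster sum of squares.
best_labels = labels_a if wcss_a < wcss_b else labels_b
print(labels_a, wcss_a)
print(labels_b, wcss_b)
```

Start A converges immediately to a bottom/top split with a sum of squares of 16, while start B finds the left/right split with a sum of squares of 1. Both are stable solutions, which is why several runs are compared and the best one is kept.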

N: number of observations (rows)
P: number of features (cols)
K: number of clusters
i: number of iterations

Fast runtime performance [2]
i7 8th Gen processor, 16GB RAM
56 features, max_iter=300
K-Means (50K): 3.14 seconds
K-Means (100K): 4.66 seconds
K-Means (250K): 13.04 seconds
K-Means (500K): 26.48 seconds
K-Means (1M): 27.23 seconds

2.
K-Means - Distance calculation and point assignment
The distances from all the points to all centroids
are calculated, and each point is assigned to its
closest centroid.
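
The assignment step depends directly on the distance metric. The sketch below assigns the same one-dimensional points to the same two centroids, once with the Euclidean distance and once with the Canberra distance. The function names and the sample values are illustrative assumptions.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def canberra(p, q):
    total = 0.0
    for a, b in zip(p, q):
        denom = abs(a) + abs(b)
        if denom > 0:  # common convention: a 0/0 term contributes 0
            total += abs(a - b) / denom
    return total

def assign(points, centroids, metric):
    """Return, for each point, the index of its closest centroid."""
    labels = []
    for p in points:
        distances = [metric(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

points = [(2.0,), (4.0,), (6.0,)]
centroids = [(1.0,), (9.0,)]

labels_euclidean = assign(points, centroids, euclidean)
labels_canberra = assign(points, centroids, canberra)
print(labels_euclidean, labels_canberra)
```

The point at 4 is closer to the centroid at 1 in Euclidean terms, but the Canberra distance assigns it to the centroid at 9, so the cluster farther from the origin becomes larger.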

K-Means - Update centroid
New centroids are calculated as the mean value
of the points that belong to each cluster.
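
The update step can be sketched as a per-cluster mean. The function name, the sample data, and the rule that an empty cluster keeps its previous centroid are assumptions made for this example. The slides do not say how empty clusters are handled.

```python
def update_centroids(points, labels, old_centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(old_centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centroids.append(tuple(sum(col) / len(members)
                                       for col in zip(*members)))
        else:
            # ASSUMPTION: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
    return new_centroids

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 4.0), (12.0, 6.0)]
labels = [0, 0, 1, 1]  # no point is assigned to cluster 2
old_centroids = [(5.0, 5.0), (6.0, 6.0), (7.0, 7.0)]

new_centroids = update_centroids(points, labels, old_centroids)
print(new_centroids)
```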

Compared to the Euclidean distance, our metric
generates unequal cluster sizes, with smaller
clusters closer to the origin and larger clusters
for centroids farther away from the origin.
Simulation results show the effectiveness of the
proposed metric in applications with non-linear
distance requirements, such as clustering datasets
with unequal-size clusters in wireless and
autonomous network applications.

