
Special Topics in Communication Networks - 5G/6G Technologies

A Distance Metric for Uneven Clusters of Unsupervised K-Means Clustering Algorithm

Prof. MSc Sérgio Vieira


sergio.vieira@ifce.edu.br
sergio.vieira@aluno.uece.br
Página 1 / 37
Motivation
Connected Autonomous Vehicles
Dynamic Environment
Decision-making
Similar velocity
Support for Offloading

Página 2 / 37
New metric for K-Means algorithm
Useful for cases that require unequal size clusters
This metric can be used in autonomous vehicles'
wireless communication to distinguish low-velocity
pedestrians from high-speed vehicles.

Página 3 / 37
Cluster?
Definition
Unequal size

Página 4 / 37
Clustering
Clustering is a technique in unsupervised learning.
It works by automatically grouping data points together based on their
similarities, revealing hidden patterns and structures within the data.

Página 5 / 37
Clustering
The distance metric is a crucial element in
clustering algorithms.
It defines how we measure the similarity or
difference between data points and the centers
(centroids) of their assigned clusters.
Clustering vs Classification

Página 6 / 37
Unequal size clusters
Clusters of unequal sizes can arise due to various
factors such as natural groupings within data,
differences in density or dispersion, or inherent
variability in the data (Dynamic Environments).

Página 8 / 37
Unequal size clusters

Página 9 / 37
Notations
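Dataset: $X = \{x_1, x_2, \ldots, x_N\}$
Each component of X is equal to $x_i = (x_{i1}, x_{i2}, \ldots, x_{iP})$, a point with P features:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1P} \\ x_{21} & x_{22} & \cdots & x_{2P} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NP} \end{bmatrix}$$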

Página 10 / 37
Notations
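Cluster set: $C = \{c_1, c_2, \ldots, c_K\}$
Each component of C is a centroid position of dimension P, equal to $c_k = (c_{k1}, c_{k2}, \ldots, c_{kP})$:

$$C = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1P} \\ c_{21} & c_{22} & \cdots & c_{2P} \\ \vdots & \vdots & \ddots & \vdots \\ c_{K1} & c_{K2} & \cdots & c_{KP} \end{bmatrix}$$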

Página 11 / 37
Distance Metrics

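Euclidean: $d(x_i, c_k) = \sqrt{\sum_{j=1}^{P} (x_{ij} - c_{kj})^2}$

Manhattan: $d(x_i, c_k) = \sum_{j=1}^{P} |x_{ij} - c_{kj}|$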
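A minimal Python sketch of the Euclidean and Manhattan distances between a point and a centroid, both given as equal-length sequences. The function names `euclidean` and `manhattan` and the sample values are illustrative choices, not taken from the slides.

```python
import math

def euclidean(x, c):
    """Straight-line distance between point x and centroid c."""
    return math.sqrt(sum((xj - cj) ** 2 for xj, cj in zip(x, c)))

def manhattan(x, c):
    """Sum of the absolute per-axis differences between x and c."""
    return sum(abs(xj - cj) for xj, cj in zip(x, c))

point, centroid = (1.0, 2.0), (4.0, 6.0)
print(euclidean(point, centroid))  # 5.0
print(manhattan(point, centroid))  # 7.0
```

Both metrics depend only on the difference between the point and the centroid, so the same offset costs the same anywhere in the feature space.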

Página 12 / 37
Distance Metrics
Canberra: $d(x_i, c_k) = \sum_{j=1}^{P} \frac{|x_{ij} - c_{kj}|}{|x_{ij}| + |c_{kj}|}$

The Canberra distance, proposed by Lance and Williams in [1], is a common
distance metric that can be used in the K-Means algorithm to form
unequal-size clusters.

1. Lance and Williams, Mixed-data classificatory programs I. Agglomerative systems (1967)
Página 13 / 37


Canberra Distance
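The Canberra distance has a problem when the point and the centroid have
opposite signs along an axis: the numerator $|x_{ij} - c_{kj}|$ then equals
the denominator $|x_{ij}| + |c_{kj}|$, so the distance is always 1 over that
dimension, regardless of their absolute values.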

Página 14 / 37
Canberra Distance
(one-dimension)

If $x > 0$ and $c < 0$, the Canberra distance would be:

$$d(x, c) = \frac{|x - c|}{|x| + |c|} = \frac{|x| + |c|}{|x| + |c|} = 1$$

Página 15 / 37
Canberra Distance
If $x < 0$ and $c > 0$, the Canberra distance would be:

$$d(x, c) = \frac{|x - c|}{|x| + |c|} = \frac{|x| + |c|}{|x| + |c|} = 1$$

Regardless of the absolute values of $x$ and $c$, if they have opposite
signs, the Canberra distance will always be 1 in that dimension.
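The saturation at 1 can be checked numerically. The sketch below assumes the usual per-axis Canberra definition, |x_j - c_j| / (|x_j| + |c_j|), and treats a 0/0 term as 0, which is a common convention rather than something stated in the slides. The sample values are illustrative and are not the ones from the original slides.

```python
def canberra(x, c):
    """Canberra distance: sum over axes of |x_j - c_j| / (|x_j| + |c_j|)."""
    total = 0.0
    for xj, cj in zip(x, c):
        denom = abs(xj) + abs(cj)
        if denom > 0:  # assumed convention: a 0/0 term contributes 0
            total += abs(xj - cj) / denom
    return total

print(canberra([0.5], [-0.5]))     # 1.0 (opposite signs, small magnitudes)
print(canberra([1000.0], [-0.5]))  # 1.0 (opposite signs, very different magnitudes)
print(canberra([2.0], [6.0]))      # 0.5 (same sign: the magnitudes matter)
```

A point at 0.5 and a point at 1000 are therefore equally far from a centroid at -0.5, which makes the metric uninformative across the origin.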

Página 16 / 37
Distance Metrics
Proposed

If and , the proposed distance would


be:

Página 17 / 37
If and , the proposed distance
would be:

Página 18 / 37
Proposed Metric
K-Means using the Euclidean metric places the decision boundary between two
adjacent clusters at the midpoint of the imaginary line connecting their
centroids.
In contrast, K-Means with the proposed metric places the decision boundary
closer to the centroid that is nearer to the origin.
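The boundary shift can be illustrated in one dimension. The sketch below is built on an assumption: the proposed metric's formula is not reproduced in these slides, so `assumed_proposed` uses a hypothetical form, the Canberra ratio with a square root over the denominator, chosen only because the slides mention an extra square root in the denominator. The centroid positions 1 and 9 and all function names are illustrative.

```python
import math

def euclidean(x, c):
    return abs(x - c)

def canberra(x, c):
    return abs(x - c) / (abs(x) + abs(c))

def assumed_proposed(x, c):
    # ASSUMPTION: hypothetical form, not confirmed by the slides.
    return abs(x - c) / math.sqrt(abs(x) + abs(c))

def boundary(metric, c_near, c_far, steps=60):
    """Bisection for the point between two centroids that is equidistant to both."""
    lo, hi = c_near, c_far
    for _ in range(steps):
        mid = (lo + hi) / 2
        if metric(mid, c_near) < metric(mid, c_far):
            lo = mid  # mid still belongs to the near centroid: boundary lies further out
        else:
            hi = mid
    return (lo + hi) / 2

c_near, c_far = 1.0, 9.0
for name, metric in [("euclidean", euclidean),
                     ("canberra", canberra),
                     ("assumed proposed", assumed_proposed)]:
    print(name, round(boundary(metric, c_near, c_far), 3))
```

With centroids at 1 and 9, the boundary lands at the midpoint 5 for the Euclidean metric, at 3 for Canberra, and at roughly 4.07 for the assumed form, that is, between the other two and closer to the centroid nearer the origin.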

Página 19 / 37
Proposed Metric
Compared to the Canberra metric, the proposed metric is computationally
more intensive because it has an extra square root in the denominator.
In terms of cluster sizes, cluster areas under the proposed metric get wider
as the centroids get farther from the origin, relative to the Euclidean
metric, but the effect is smaller than with the Canberra metric.

Página 20 / 37
Proposed Metric
This metric can be used in autonomous vehicles’
wireless communication to distinguish low-velocity
pedestrians from high-speed vehicles.

Página 21 / 37
K-Means Algorithm
The K-Means algorithm is used to partition the points of a
dataset into K subgroups, based on their
similarities, in an iterative approach.

Página 22 / 37
K-Means
This algorithm consists of four steps that iteratively search for a
local optimum for the cluster centres, also known as centroids, as
listed below.
i. Select the number of clusters K
ii. Initialize the centroids
iii. Calculate distances and assign points
iv. Update the centroid locations
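The four steps can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the slides: picking the initial centroids by sampling data points, stopping when assignments no longer change, and the names `k_means` and `squared_euclidean` are all assumptions.

```python
import random

def squared_euclidean(x, c):
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def k_means(points, k, distance=squared_euclidean, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Steps i and ii: K is given; pick K distinct data points as initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step iii: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: a local optimum was reached
        labels = new_labels
        # Step iv: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(axis) / len(members) for axis in zip(*members)]
    return centroids, labels

data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(data, k=2)
print(centroids)
print(labels)
```

The `distance` parameter is where a different metric, such as Canberra, could be plugged in. Note that the mean-based update in step iv is the natural minimizer only for the squared Euclidean distance.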

Página 23 / 37
K-Means - Select number of
clusters K
Selecting the appropriate number of clusters K in
K-Means is crucial for achieving meaningful results.
Elbow Method

Página 25 / 37
K-Means - Elbow Method
Within Cluster Sum of Squares
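The Within-Cluster Sum of Squares (WCSS) adds up, over all clusters, the squared distances between each point and the centroid of its cluster. The sketch below computes it for one-dimensional toy data; the hand-made partitions for K = 1, 2, 3 are illustrative values chosen for this example, not data from the slides.

```python
def wcss(clusters):
    """Within-Cluster Sum of Squares for 1-D clusters given as lists of numbers."""
    total = 0.0
    for members in clusters:
        centroid = sum(members) / len(members)
        total += sum((x - centroid) ** 2 for x in members)
    return total

# Hand-made partitions of the same toy data for K = 1, 2, 3.
partitions = {
    1: [[1.0, 2.0, 3.0, 11.0, 12.0, 13.0]],
    2: [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]],
    3: [[1.0, 2.0, 3.0], [11.0, 12.0], [13.0]],
}
for k, clusters in partitions.items():
    print(k, wcss(clusters))
# 1 154.0
# 2 4.0
# 3 2.5
```

The WCSS drops sharply from K = 1 to K = 2 and barely improves afterwards. The Elbow Method picks the K at that bend, here K = 2.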

Página 26 / 37
K-Means
The initial position of the centroids influences the
configuration of the clusters.
When the centroids are placed differently at the
start, it can lead to distinct cluster arrangements.
Running the algorithm multiple times with different
initializations is important to obtain meaningful results.
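One common way to act on this is to run K-Means several times from different random starts and keep the run with the lowest within-cluster sum of squares. The sketch below does that for one-dimensional data; the helper `k_means_1d`, the number of restarts, and the sample data are assumptions made for illustration.

```python
import random

def k_means_1d(xs, k, rng, max_iter=100):
    """Plain 1-D K-Means from a random start; returns (centroids, cost)."""
    centroids = rng.sample(xs, k)  # random initial centroids drawn from the data
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # An empty cluster keeps its previous centroid (assumption).
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break  # centroids stopped moving
        centroids = updated
    # Cost: squared distance from each point to its nearest centroid.
    cost = sum(min((x - c) ** 2 for c in centroids) for x in xs)
    return centroids, cost

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.4, 9.1]
rng = random.Random(42)
runs = [k_means_1d(data, 3, rng) for _ in range(10)]
best_centroids, best_cost = min(runs, key=lambda run: run[1])
print(sorted(best_centroids), best_cost)
```

Keeping the lowest-cost run does not guarantee the global optimum, but it reduces the chance of reporting a poor local one.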

Página 31 / 37
K-Means
Complexity: $O(N \cdot P \cdot K \cdot i)$
N: number of observations (rows)
P: number of features (columns)
K: number of clusters
i: number of iterations

Página 32 / 37
K-Means
Fast runtime performance [2]
i7 8th Gen processor, 16GB RAM
56 features, max_iter=300
K-Means (50K): 3.14 seconds
K-Means (100K): 4.66 seconds
K-Means (250K): 13.04 seconds
K-Means (500K): 26.48 seconds
K-Means (1M): 27.23 seconds

2. https://www.aifinesse.com/k-means/k-means-complexity/
Página 33 / 37
K-Means - Distance calculation
and point assignment
The distances from all the points to all centroids
are calculated, and each point is assigned to its
closest centroid.
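The assignment step can be sketched with a pluggable distance function, which also shows how the choice of metric changes the result. The function names and the one-dimensional sample values below are illustrative assumptions.

```python
def assign(points, centroids, distance):
    """Return, for each point, the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: distance(p, centroids[j]))
        for p in points
    ]

def euclidean(x, c):
    return abs(x - c)

def canberra(x, c):
    denom = abs(x) + abs(c)
    return abs(x - c) / denom if denom else 0.0

points = [2.0, 4.0, 7.0]
centroids = [1.0, 9.0]
print(assign(points, centroids, euclidean))  # [0, 0, 1]
print(assign(points, centroids, canberra))   # [0, 1, 1]
```

The point at 4 is nearer to the centroid at 1 in Euclidean terms, but the Canberra distance assigns it to the centroid at 9, so the cluster farther from the origin covers a wider range.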

Página 34 / 37
K-Means - Update centroid
location
New centroids are calculated using the mean value
of the points that belong to each cluster.
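A minimal sketch of the update step, given the points and their current cluster labels. Keeping the previous centroid when a cluster has no members is an assumption of this example; the slides do not say how that case is handled.

```python
def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points currently assigned to it."""
    updated = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            # Average each axis separately across the cluster's members.
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            updated.append(old)  # assumption: an empty cluster keeps its centroid
    return updated

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
labels = [0, 0, 1]
print(update_centroids(points, labels, [(5.0, 5.0), (6.0, 6.0)]))
# [(1.0, 0.0), (10.0, 10.0)]
```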

Página 35 / 37
Conclusion
Our metric generates unequal cluster sizes, with
smaller clusters closer to the origin and larger
clusters for centroids farther away from
the origin, compared to the Euclidean distance.
Simulation results show the effectiveness of the
proposed metric in applications with non-linear
distance requirements, such as clustering datasets
with unequal-size clusters in wireless and
autonomous network applications.

Página 37 / 37
