Lesson 4
AI AND MACHINE LEARNING
UNDERSTANDING CLUSTERING
• What is clustering
• Types of clustering algorithms
• Applications of clustering
WHAT IS CLUSTERING
Interesting history: in 1854 in London, Dr John Snow used a special map plot to study an outbreak of cholera cases. It zeroed in on the source – a well pump.
Source : https://www.cs.toronto.edu/~periklis/pubs/depth.pdf
INTRO TO AI & ML | DIP IN BORDER SECURITY | SINGAPORE POLYTECHNIC
DISTANCE OR PROXIMITY
Euclidean distance
▪ A non-negative measure of the distance between 2 points
▪ Based on Pythagoras' Theorem
▪ Used by
  ▪ K-Means
  ▪ K-nearest neighbors
Source: https://lzpdatascience.wordpress.com/2019/11/17/distance-measures-and-linkage-methods-in-hierarchical-clustering/
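The formula can be sketched in a few lines of Python (the function name is illustrative):

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences (Pythagoras' Theorem)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean((0, 0), (3, 4)))  # 5.0 – the classic 3-4-5 right triangle
```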
TYPES OF DISTANCE
Manhattan distance
▪ Sums the absolute differences between the pairs of coordinates
▪ Similar to travelling along the grid of a street map
Minkowski distance
▪ A generalised form of the Euclidean and Manhattan distances
▪ The computed distance lies in between the two
▪ The exponent p can be tweaked as needed (p = 1 gives Manhattan, p = 2 gives Euclidean)
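A quick sketch of how p generalises the other two distances (function name illustrative):

```python
def minkowski(a, b, p):
    # Generalised distance: p = 1 reduces to Manhattan, p = 2 to Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (1, 1), (4, 5)
print(minkowski(a, b, 1))  # 7.0  (Manhattan: |3| + |4|)
print(minkowski(a, b, 2))  # 5.0  (Euclidean: sqrt(9 + 16))
```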
Mahalanobis distance
▪ A statistical distance measure used to compute the distance from a point to the centre of a distribution
▪ Takes the normalization and dispersion of the data into account
▪ Useful for non-spherical distributions
▪ (Figure) The red and green crosses have the same Euclidean distance from the centre, but the red one is an outlier
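The idea can be illustrated with an elongated (non-spherical) cloud of points, where two points at the same Euclidean distance from the centre get very different Mahalanobis distances (a minimal sketch using NumPy; the function name is illustrative):

```python
import numpy as np

def mahalanobis(x, data):
    # Distance from point x to the centre of the sample `data`,
    # normalised by the sample covariance (the spread in each direction)
    mu = data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Elongated cloud: spread along x is about 10x the spread along y
rng = np.random.default_rng(0)
cloud = rng.normal(size=(500, 2)) * [10.0, 1.0]

# Same Euclidean distance from the centre, very different Mahalanobis distance
print(mahalanobis(np.array([5.0, 0.0]), cloud))  # small: along the long axis
print(mahalanobis(np.array([0.0, 5.0]), cloud))  # large: across the short axis
```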
Clustering types: Divisive, Agglomerative, Centroid, Model-Based, Graph-Theoretic, Spectral
Source: https://www.researchgate.net/figure/An-UPGMA-cluster-dendrogram-showing-the-genetic-relationships-of-the-12-rice-cultivars_fig2_283481744
TYPES OF CLUSTERING ALGORITHMS
HIERARCHICAL
Divisive Agglomerative
▪ 1st step – split ▪ 1st step – merge
▪ Have access to all data ▪ Do not consider global data,
but pairwise data
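The agglomerative (bottom-up) approach can be sketched in plain Python – a minimal single-linkage version, not a production implementation:

```python
import math

def agglomerative(points, k):
    """Agglomerative clustering: start with one cluster per point,
    repeatedly merge the closest pair until k clusters remain."""
    clusters = [[p] for p in points]

    def single_link(c1, c2):
        # Only pairwise distances are used: the nearest pair across the clusters
        return min(math.dist(a, b) for a in c1 for b in c2)

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(agglomerative(pts, 2))  # [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
```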
K-MEANS
Strengths
1. Simple
2. Efficient (very fast)
3. Popular (if not the most popular)
Note: K-means terminates at a local optimum, not the global optimum; finding the global optimum is computationally intractable.
Weaknesses
▪ Only applicable if a mean can be defined; for categorical data, k-modes is used instead (based on the most frequent values)
▪ Need to specify k in advance
▪ Sensitive to outliers – data points recorded wrongly, or data that are very far away
▪ Output is categorical
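The points above can be seen in a minimal sketch of Lloyd's k-means algorithm (illustrative only, not the scikit-learn implementation): k must be fixed up front, and the result depends on the random initial centroids, i.e. it is only a local optimum.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster.
    Converges to a LOCAL optimum that depends on the initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k must be specified in advance
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)  # an outlier drags its cluster mean along
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # assignments stable: converged
            break
        centroids = new
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(pts, k=2)
print(sorted(centroids))  # two centres, near (1.33, 1.33) and (8.33, 8.33)
```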