Lesson 4

INTRODUCTION TO AI AND MACHINE LEARNING
UNDERSTANDING CLUSTERING
• What is clustering
• Types of clustering algorithms
• Application of clustering
WHAT IS CLUSTERING
▪ The organization of unlabeled data into similarity groups called clusters.
▪ Hence, a cluster is a collection of similar items drawn from a bigger set of dissimilar items.
Interesting history: in 1854 in London, Dr John Snow used a special map plot to study the
outbreak of cholera cases. It narrowed the search down to the source: a well pump.
Source : https://www.cs.toronto.edu/~periklis/pubs/depth.pdf
INTRO TO AI & ML | DIP IN BORDER SECURITY | SINGAPORE POLYTECHNIC
WHAT IS CLUSTERING
DISTANCE OR PROXIMITY
▪ Since a “cluster is a collection of similar items”, we need a way to measure similarity
▪ Similarity is measured by distance (proximity)
▪ There are 4 common examples of distance measures used for the proximity matrix
▪ Euclidean distance
▪ Manhattan distance
▪ Minkowski distance
▪ Mahalanobis distance
WHAT IS CLUSTERING
TYPES OF DISTANCE
Euclidean distance
▪ A non-negative measure of the straight-line distance between 2 points
▪ Based on Pythagoras Theorem
▪ Used by
▪ K-Means,
▪ K-nearest neighbors
Source : https://lzpdatascience.wordpress.com/2019/11/17/distance-measures-and-linkage-methods-in-hierarchical-clustering/
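As a minimal sketch of the Euclidean distance described above, the function below applies Pythagoras' theorem across any number of dimensions. The function name and the sample points are illustrative assumptions, not part of the lesson material.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of dimensions."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Square each coordinate difference, add them up, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# A 3-4-5 right-angled triangle: the two points are 5 units apart.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```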
WHAT IS CLUSTERING
TYPES OF DISTANCE
Manhattan distance
▪ Sums the absolute differences between the coordinates of a pair of points
▪ Similar to travelling along the grid of a street map
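A minimal sketch of the Manhattan distance follows, using the same two sample points as a street-grid walk of 3 blocks east and 4 blocks north. The function name and points are illustrative assumptions.

```python
def manhattan_distance(p, q):
    """Sum of the absolute coordinate differences between two points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(a - b) for a, b in zip(p, q))


# Walking along the grid: 3 blocks in one direction plus 4 in the other.
print(manhattan_distance((0, 0), (3, 4)))  # 7
```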

WHAT IS CLUSTERING
TYPES OF DISTANCE
Minkowski distance
▪ Is a generalized form of Euclidean and Manhattan distance
▪ For an order p between 1 and 2, the computed distance lies in between the 2 types of distance
▪ The order p can be tuned as needed: p = 1 gives Manhattan distance, p = 2 gives Euclidean distance
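The sketch below shows one way to compute the Minkowski distance and how its order controls the result. The parameter is named `order` here (an assumption, chosen to avoid clashing with the point name `p`), and the sample points are illustrative.

```python
def minkowski_distance(p, q, order):
    """Minkowski distance of the given order (order >= 1) between two points."""
    if order < 1:
        raise ValueError("order must be at least 1")
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1 / order)


a, b = (0, 0), (3, 4)
print(minkowski_distance(a, b, 1))    # 7.0, same as Manhattan
print(minkowski_distance(a, b, 2))    # 5.0, same as Euclidean
print(minkowski_distance(a, b, 1.5))  # a value between 5 and 7
```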

WHAT IS CLUSTERING
TYPES OF DISTANCE
Mahalanobis distance
▪ A statistical distance measure used to compute the distance from a point to the centre of a distribution
▪ Takes normalization and dispersion of the data into account
▪ Useful for non-spherical-shaped distributions
▪ The red and green crosses have the same Euclidean distance from the centre, but the red cross is an outlier
Source : https://vitalflux.com/different-types-of-distance-measures-in-machine-learning/
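The following is a minimal two-dimensional sketch of the Mahalanobis distance, written without external libraries. The data set and the two test points are made-up assumptions that mimic the green and red crosses: both are equally far from the centre in Euclidean terms, but only one follows the trend of the data.

```python
import math


def mahalanobis_2d(point, data):
    """Mahalanobis distance from a 2-D point to the centre of a 2-D data set."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Sample variances and covariance describe the dispersion of the data.
    sxx = sum((x - mean_x) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # d^T * inverse(covariance) * d, expanded by hand for the 2x2 case.
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)


# Hypothetical data stretched along the diagonal; its centre is (3, 3).
data = [(1, 2), (2, 1), (3, 3), (4, 5), (5, 4)]
green = (5, 5)  # lies along the trend of the data
red = (5, 1)    # same Euclidean distance from the centre, but against the trend

print(mahalanobis_2d(green, data))  # roughly 1.33
print(mahalanobis_2d(red, data))    # roughly 4.0, flagged as much further away
```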

WHAT IS CLUSTERING
CLUSTER EVALUATION
Need to check quality of the model

▪ Intra-cluster cohesion (compactness)
▪ Measures how near the data points in a cluster are to its centroid
▪ Common choice: use the sum of squared error (SSE)
▪ Inter-cluster separation (isolation)
▪ The different cluster centroids should be as far apart as possible
▪ Expert judgment is the most commonly used method
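As a rough sketch of the two measures above, the code below computes the SSE (cohesion) of a clustering and the distance between two centroids (separation). The helper names and the two sample clusters are assumptions made for illustration.

```python
import math


def centroid(points):
    """Mean of a list of points, computed one dimension at a time."""
    n = len(points)
    return tuple(sum(values) / n for values in zip(*points))


def sum_of_squared_error(clusters):
    """Total squared distance from every point to the centroid of its own cluster."""
    total = 0.0
    for points in clusters:
        centre = centroid(points)
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, centre))
    return total


# Two hypothetical, well-separated clusters.
cluster_a = [(1, 1), (1, 2), (2, 1), (2, 2)]
cluster_b = [(8, 8), (8, 9), (9, 8), (9, 9)]

cohesion = sum_of_squared_error([cluster_a, cluster_b])        # lower is more compact
separation = math.dist(centroid(cluster_a), centroid(cluster_b))  # higher is more isolated
print(cohesion)    # 4.0
print(separation)  # roughly 9.9
```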

WHAT IS CLUSTERING
CLUSTER EVALUATION
▪ Given the above

▪ It might be difficult to decide based on the data and the error (SSE) alone
▪ The domain expert may recommend 2 or 3 clusters (e.g. 3 sub-species)

TYPES OF CLUSTERING ALGORITHMS
Clustering
▪ Hierarchical
  ▪ Divisive
  ▪ Agglomerative
▪ Partitional
  ▪ Centroid
  ▪ Model Based
  ▪ Graph Theoretic
  ▪ Spectral
▪ Others

HIERARCHICAL
▪ Find successive clusters using previously established clusters.
▪ Agglomerative algorithms – Bottom-up approach: small clusters are formed first, then merged into larger clusters
▪ Divisive algorithms – Top-down approach: start from the main cluster, then proceed to divide it into smaller clusters

HIERARCHICAL
▪ Well represented by a taxonomy tree or a dendrogram, which show the hierarchical relationship
Source : https://www.researchgate.net/figure/An-UPGMA-cluster-dendrogram-showing-the-genetic-relationships-of-the-12-rice-cultivars_fig2_283481744
HIERARCHICAL
Steps for clustering involve:
▪ Divisive clustering (taxonomy tree like)
▪ Starts with all data points in 1 cluster – the root
▪ Splits the root into a set of clusters
▪ Each cluster is recursively split
▪ Stops when there is only 1 data point in each cluster
▪ Agglomerative clustering (dendrogram like)
▪ Starts with each data point as its own cluster
▪ Merges the most similar pair of clusters
▪ Stops when all data points are merged into 1 cluster (i.e. the root)
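The agglomerative steps above can be sketched as follows. This is a simplified illustration, not an efficient implementation: it assumes that "most similar" means the smallest distance between the closest members of two clusters, and the function names and the one-dimensional sample points are made up.

```python
import math


def closest_member_distance(cluster_a, cluster_b):
    """Distance between the two closest points of two clusters."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerate(points):
    """Merge the most similar pair of clusters until only one cluster (the root) is left."""
    clusters = [[p] for p in points]  # every data point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = closest_member_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical 1-D data: two tight pairs and one far-away point.
for left, right, distance in agglomerate([(0,), (1,), (5,), (6,), (20,)]):
    print(left, "+", right, "at distance", distance)
```

The recorded merge distances are what a dendrogram plots on its height axis: the far-away point joins last, at the largest distance.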
HIERARCHICAL | AGGLOMERATIVE
Considerations when forming clusters with:
▪ Minimum distance (single linkage)
▪ Generates a minimum spanning tree
▪ Encourages the growth of elongated clusters
▪ BUT: very sensitive to noise
▪ Farthest neighbor (complete linkage)
▪ Encourages compact clusters
▪ Does not work well with elongated clusters (data points can cross into
another cluster)
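To make the difference between the two linkage choices concrete, the sketch below measures the distance between the same two clusters both ways. The function names and the sample clusters (one elongated, one small) are illustrative assumptions.

```python
import math


def single_linkage(cluster_a, cluster_b):
    """Minimum distance: the gap between the two nearest members."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def complete_linkage(cluster_a, cluster_b):
    """Farthest neighbor: the gap between the two most distant members."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)


elongated = [(0, 0), (1, 0), (2, 0)]  # a chain of points
neighbour = [(3, 0), (4, 0)]

print(single_linkage(elongated, neighbour))    # 1.0, the chain looks close
print(complete_linkage(elongated, neighbour))  # 4.0, the chain looks far
```

Single linkage only sees the nearest ends, so it keeps extending chains; complete linkage is penalised by the far end of the chain, which is why it favours compact clusters.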
HIERARCHICAL | SUMMARY
• Agglomerative is faster to compute
• Divisive may be less “blind” to the global structure of the data

Divisive                      Agglomerative
▪ 1st step – split            ▪ 1st step – merge
▪ Has access to all data      ▪ Does not consider global data,
                                only pairwise data

PARTITIONAL
▪ Determines all clusters at once; can also be used as a divisive algorithm to split the clusters more finely.

PARTITIONAL | CENTROID | K-MEANS
▪ K-means (MacQueen 1967) algorithm
▪ The user specifies the value of k
▪ K-means partitions the data into k clusters
▪ Each cluster has a centre called the centroid
▪ Each data point could have multiple dimensions (x1, x2, .. xn)

K-means works as follows:

1. Choose k data points (seeds) to be the initial centroids
2. Assign each data point to the closest centroid
3. Re-compute the centroids using the current cluster membership
4. Repeat steps 2 & 3 until a convergence criterion is met
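The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the seeds are passed in by the caller, convergence is taken to mean "no re-assignments", and the function name and sample data are assumptions.

```python
import math


def k_means(points, seeds, max_iterations=100):
    """Cluster points around len(seeds) centroids; returns (centroids, assignment)."""
    centroids = list(seeds)  # Step 1: the chosen seeds are the initial centroids
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each data point to the closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once no data point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: re-compute each centroid from its current members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment


# Hypothetical data with two obvious groups, k = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, assignment = k_means(points, seeds=[(1, 1), (8, 8)])
print(assignment)  # [0, 0, 0, 1, 1, 1]
print(centroids)
```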

K-means convergence criterion: ANY 1 of these conditions is sufficient

1. No / minimum re-assignment of data points to different clusters
2. No / minimum change of centroids
3. Minimum decrease in the sum of squared error (SSE)
* When a data point is moved to another cluster, the centroids
might move. A cluster might then give up or take up more data points from
other clusters, and hence its centroid moves again.
Strength
1. Simple
2. Efficient (very fast)
3. Popular (if not the most popular)

Weakness
▪ Only applicable if the mean can be defined; hence for categorical data, k-modes is used (based on the most frequent values)
▪ Need to specify k
▪ Sensitive to outliers, for example:
▪ data points recorded wrongly
▪ data that are very far away

K-means will terminate at a local optimum, not necessarily the global optimum, because finding the global optimum is computationally hard.

Outliers force the wrong grouping:
• since k is 2, the outlier is grouped with one of the groups
• it should be left out, as shown by the diagram on the right
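A small sketch of why an outlier distorts the grouping: because a centroid is a mean, one far-away point drags it away from the real members of the cluster. The function name and the data are made up for illustration.

```python
def mean_point(points):
    """Centroid of a list of points, computed one dimension at a time."""
    n = len(points)
    return tuple(sum(values) / n for values in zip(*points))


cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
outlier = (50.0, 50.0)  # e.g. a wrongly recorded data point

print(mean_point(cluster))              # (1.5, 1.5)
print(mean_point(cluster + [outlier]))  # (11.2, 11.2), far from every real member
```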

Possible solutions to handle outliers
• Remove outliers (data points); preferably monitor these outliers over a few iterations before removal
• Perform random sampling: choose the centroids from a small random sample first, then assign the main bulk of the data afterwards
• Outliers are rare, hence unlikely to be picked (randomly) initially.
K-means is sensitive to initial seeds

• Different clusters can be formed from the same data, depending on the seeds chosen
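The sketch below illustrates this sensitivity on four made-up points arranged as a wide rectangle. It runs a compact K-means loop twice with different seeds; the function name, data and seeds are assumptions chosen so that the second run gets stuck in a poorer grouping.

```python
import math


def k_means_from_seeds(points, seeds, max_iterations=100):
    """Run K-means from the given seeds; returns (assignment, SSE)."""
    centroids = list(seeds)
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # no re-assignments: converged
            break
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    sse = sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))
    return assignment, sse


# Corners of a rectangle that is 4 units wide and 1 unit tall.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

good = k_means_from_seeds(points, seeds=[(0, 0), (4, 0)])  # one seed on each side
poor = k_means_from_seeds(points, seeds=[(0, 0), (0, 1)])  # both seeds on the left
print(good)  # left/right split with a small SSE
print(poor)  # bottom/top split with a much larger SSE
```

Both runs satisfy the convergence criterion, yet the second stops at a grouping with a far higher SSE. Running K-means several times with different seeds and keeping the lowest-SSE result is one common way to reduce this risk.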

K-means is not suitable for clusters whose shapes are not hyper-spheres

▪ The diagram on the right shows the “correct clusters”, but K-means got it wrong

OTHERS
▪ Bayesian-based algorithms, which are based on probability, could also be used.
▪ They generate a posterior distribution over the collection of all partitions of the data
▪ They need not be based on parameters (nonparametric)

APPLICATION OF CLUSTERING
Clustering could be used for the following:

▪ Identifying fake news – fake news usually contains certain words that appear more commonly
in sensationalized click-bait articles
▪ Spam filter – K-means clustering has been reported to be effective in identifying
spam; it analyses different sections of the email separately.
▪ Marketing and sales – Group people with similar traits and likelihood to
purchase, to make marketing more effective
▪ Classifying network traffic – Use K-means to group traffic sources and
types. No precise information about the traffic source is needed, as the source
changes frequently (e.g. hackers use VPNs)
APPLICATION OF CLUSTERING
Clustering could also be used for the following:

▪ Document analysis – Able to organize similar documents quickly using
the characteristics identified in the paragraphs.
▪ Computer vision – Segmentation of images

ADDITIONAL NOTES FOR CLUSTERING
▪ Used for unsupervised learning (you do not need a labeled “target”)
▪ Multi-feature input; for addresses, use postal code or geolocation
▪ Output is categorical
▪ If unsure which to use, start with K-means (most popular)
▪ Try different k values and methods as needed.

END OF LESSON 4
