Lesson 4

INTRODUCTION TO AI AND MACHINE LEARNING

UNDERSTANDING CLUSTERING
• What is clustering
• Types of clustering algorithms
• Application of clustering
WHAT IS CLUSTERING

▪ The organization of unlabeled data into similarity groups called clusters.

▪ Hence, a cluster is a collection of similar items drawn from a bigger set of dissimilar items.

Interesting history: in London in 1854, Dr John Snow used a special map plot to study the outbreak of cholera cases. It narrowed the search down to the source of the outbreak – a well pump.

Source : https://www.cs.toronto.edu/~periklis/pubs/depth.pdf
INTRO TO AI & ML | DIP IN BORDER SECURITY | SINGAPORE POLYTECHNIC
WHAT IS CLUSTERING
DISTANCE OR PROXIMITY

▪ Since a “cluster is a collection of similar items”, we need a way to measure similarity

▪ Similarity is measured by distance (proximity)

▪ There are 4 common distance measures used to build a proximity matrix
▪ Euclidean distance
▪ Manhattan distance
▪ Minkowski distance
▪ Mahalanobis distance
WHAT IS CLUSTERING
TYPES OF DISTANCE

Euclidean distance
▪ A non-negative measure of the straight-line distance between 2 points
▪ Based on Pythagoras' Theorem
▪ Used by
▪ K-Means
▪ K-nearest neighbors

Source : https://lzpdatascience.wordpress.com/2019/11/17/distance-measures-and-linkage-methods-in-hierarchical-clustering/
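
As a rough illustration of the Euclidean distance described on this slide, the sketch below applies Pythagoras' Theorem to two points given as tuples. The function name `euclidean_distance` and the sample points are made up for this example.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of dimensions."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Square the difference along each dimension, add them up, take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# A 3-4-5 right-angled triangle: the distance is the hypotenuse.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```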
WHAT IS CLUSTERING
TYPES OF DISTANCE

Manhattan distance
▪ Sums the absolute differences between the coordinates of a pair of points
▪ Similar to travelling along the grid of a street map
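
A minimal sketch of the Manhattan distance, assuming points are given as tuples of numbers. The function name `manhattan_distance` and the sample points are illustrative only.

```python
def manhattan_distance(a, b):
    """Sum of the absolute coordinate differences between two points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(x - y) for x, y in zip(a, b))


# Walking 3 blocks east and 4 blocks north on a street grid.
print(manhattan_distance((0, 0), (3, 4)))  # 7
```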



WHAT IS CLUSTERING
TYPES OF DISTANCE

Minkowski distance
▪ Is a generalized form of Euclidean and Manhattan distance
▪ The order p can be tweaked as needed: p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance
▪ For p between 1 and 2, the computed distance is in-between the 2 types of distance
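
The sketch below shows how one function can cover both earlier measures through the order p. It assumes the usual definition, the p-th root of the sum of absolute differences raised to the power p; the function name `minkowski_distance` is made up for this example.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p (p >= 1) between two points."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


a, b = (0, 0), (3, 4)
print(minkowski_distance(a, b, 1))    # same value as the Manhattan distance
print(minkowski_distance(a, b, 2))    # same value as the Euclidean distance
print(minkowski_distance(a, b, 1.5))  # lies between the two values above
```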



WHAT IS CLUSTERING
TYPES OF DISTANCE

Mahalanobis distance
▪ A statistical distance measure used to compute the distance from a point to the centre of a distribution
▪ Takes the normalization (scale) and dispersion of the data into account
▪ Useful for non-spherical-shaped distributions
▪ The red and green crosses have the same Euclidean distance from the centre, but the red one is an outlier

Source : https://vitalflux.com/different-types-of-distance-measures-in-machine-learning/
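
To make the idea concrete, here is a minimal sketch restricted to 2-D data. It assumes the centre is the sample mean and the dispersion is the sample covariance matrix; the function name `mahalanobis_2d` and the sample data are made up for illustration.

```python
import math


def mahalanobis_2d(point, data):
    """Mahalanobis distance from `point` to the centre of a 2-D data set."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T, written out for the 2x2 case.
    d_squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d_squared)


# Data stretched along the diagonal (a non-spherical distribution).
data = [(2, 2), (-2, -2), (1, 1), (-1, -1), (0.5, -0.5), (-0.5, 0.5)]

# Both test points are equally far from the centre in Euclidean terms.
along = mahalanobis_2d((1.5, 1.5), data)    # follows the spread of the data
across = mahalanobis_2d((1.5, -1.5), data)  # goes against the spread
print(along < across)  # True: the second point looks like an outlier
```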


WHAT IS CLUSTERING
CLUSTER EVALUATION

The quality of the clustering model needs to be checked

▪ Intra-cluster cohesion (compactness)
▪ Measures how near the data points in a cluster are to its centroid
▪ Common choice – use the sum of squared error (SSE)
▪ Inter-cluster separation (isolation)
▪ The different cluster centroids should be as far apart as possible

▪ In practice, expert judgment is the most used method
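
A minimal sketch of the SSE as a compactness measure, assuming each cluster is given as a list of points and the centroid is the mean of its points. The function names `centroid` and `sse` and the sample clusters are illustrative only.

```python
def centroid(points):
    """Mean of the points, computed dimension by dimension."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sse(clusters):
    """Sum of squared distances from every point to its own cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((x - m) ** 2 for x, m in zip(p, c))
    return total


two_clusters = [[(1, 1), (1, 3)], [(8, 8), (10, 8)]]
one_cluster = [[(1, 1), (1, 3), (8, 8), (10, 8)]]
print(sse(two_clusters))                     # 4.0
print(sse(one_cluster) > sse(two_clusters))  # True: one big cluster is less compact
```

A lower SSE means more compact clusters, but the SSE keeps dropping as more clusters are added, so it cannot be the only criterion.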



WHAT IS CLUSTERING
CLUSTER EVALUATION

▪ Given the above


▪ It might be difficult to decide the number of clusters based on the data and the error (SSE) alone
▪ The domain expert may recommend 2 or 3 clusters (e.g. 3 sub-species)



TYPES OF CLUSTERING ALGORITHMS

Clustering
▪ Hierarchical
  ▪ Divisive
  ▪ Agglomerative
▪ Partitional
  ▪ Centroid
  ▪ Model Based
  ▪ Graph Theoretic
  ▪ Spectral
▪ Others



TYPES OF CLUSTERING ALGORITHMS
HIERARCHICAL

▪ Find successive clusters using previously established clusters.

▪ Agglomerative algorithms – Bottom-up approach: start with small clusters and successively merge them into larger clusters

▪ Divisive algorithms – Top-down approach: start from one main cluster and successively divide it into smaller clusters



TYPES OF CLUSTERING ALGORITHMS
HIERARCHICAL

▪ Represented well by a taxonomy tree or a dendrogram, which shows the hierarchical relationship

Source : https://www.researchgate.net/figure/An-UPGMA-cluster-dendrogram-showing-the-genetic-relationships-of-the-12-rice-cultivars_fig2_283481744
TYPES OF CLUSTERING ALGORITHMS
HIERARCHICAL

Steps for clustering:

▪ Divisive clustering (taxonomy tree like)
▪ Starts with all data points in 1 cluster – the root
▪ Splits the root into a set of clusters
▪ Each cluster is recursively split
▪ Stops when there is only 1 data point in each cluster
▪ Agglomerative clustering (dendrogram like)
▪ Starts with each data point as its own cluster
▪ Merges the most similar pair of clusters
▪ Stops when all data points are merged into 1 cluster (i.e. the root)
TYPES OF CLUSTERING ALGORITHMS
HIERARCHICAL | AGGLOMERATIVE

Considerations when choosing how to measure the distance between clusters:

▪ Minimum distance (single linkage)
▪ Generates a minimum spanning tree
▪ Encourages the growth of elongated clusters
▪ BUT: very sensitive to noise
▪ Farthest neighbor (complete linkage)
▪ Encourages compact clusters
▪ Does not work well with elongated clusters (data points can cross into another cluster)
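
The following sketch shows agglomerative clustering with a choice of single or complete linkage. It is a simple illustration, not an efficient implementation: it assumes Euclidean distance between points and stops at a requested number of clusters instead of merging all the way to the root. The names `cluster_distance`, `agglomerative` and `num_clusters` are made up for this example.

```python
import math


def cluster_distance(c1, c2, linkage):
    """Distance between two clusters under single or complete linkage."""
    pairwise = [math.dist(a, b) for a in c1 for b in c2]
    if linkage == "single":
        return min(pairwise)  # closest pair of points
    if linkage == "complete":
        return max(pairwise)  # farthest pair of points
    raise ValueError("linkage must be 'single' or 'complete'")


def agglomerative(points, num_clusters, linkage="single"):
    """Bottom-up clustering: merge the closest pair until num_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the most similar pair
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerative(points, 2, linkage="single"))
# [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```

On well-separated data like this, both linkages give the same result; they start to differ on elongated or noisy data.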
TYPES OF CLUSTERING ALGORITHMS
HIERARCHICAL | SUMMARY

• Agglomerative is faster to compute

• Divisive may be less “blind” to the global structure of the data

Divisive                    Agglomerative
▪ 1st step – split          ▪ 1st step – merge
▪ Has access to all data    ▪ Does not consider global data, only pairwise data



TYPES OF CLUSTERING ALGORITHMS
PARTITIONAL

▪ Determines all clusters at once; it can also be used as a divisive algorithm to split the clusters more finely.



TYPES OF CLUSTERING ALGORITHMS
PARTITIONAL | CENTROID | K-MEANS

▪ K-means (MacQueen 1967) algorithm

▪ The user specifies the value of k
▪ K-means partitions the data into k clusters
▪ Each cluster has a centre called the centroid
▪ Each data point could have multiple dimensions (x1, x2, .. xn)



TYPES OF CLUSTERING ALGORITHMS
PARTITIONAL | CENTROID | K-MEANS

K-means works as follows:

1. Choose k data points (seeds) to be the initial centroids
2. Assign each data point to the closest centroid
3. Re-compute the centroids using the current cluster membership
4. Repeat steps 2 & 3 until a convergence criterion is met
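
The four steps can be sketched in plain Python as below. This is an illustrative version only: it assumes Euclidean distance, picks the seeds at random, and stops when no data point changes cluster. The names `mean_point`, `k_means`, `seed` and `max_iterations` are made up for this example.

```python
import math
import random


def mean_point(points):
    """Centroid of a list of points, computed dimension by dimension."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: pick k data points as seeds
    labels = None
    for _ in range(max_iterations):
        # Step 2: assign each data point to the closest centroid.
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:  # convergence: no re-assignments
            break
        labels = new_labels
        # Step 3: re-compute each centroid from its current members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = mean_point(members)
    return centroids, labels


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)  # the first three points share one label, the last three the other
```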



TYPES OF CLUSTERING ALGORITHMS
PARTITIONAL | CENTROID | K-MEANS

K-means convergence criterion, ANY 1 condition required

1. No / minimum re-assignments of data points to different clusters
2. No / minimum change of centroids
3. No / minimum decrease in the sum of squared error (SSE)
* When a data point is moved to another cluster, the centroids might move. A cluster might then give up or take up more data points from other clusters, and hence its centroid moves again.
TYPES OF CLUSTERING ALGORITHMS
PARTITIONAL | CENTROID | K-MEANS

Strength
1. Simple
2. Efficient (very fast)
3. Popular (if not the most popular)

Weakness
▪ Applicable only if the mean can be defined; hence for categorical data k-modes is used (based on the most frequent values)
▪ Need to specify k
▪ Sensitive to outliers
▪ Data points recorded wrongly
▪ Data that are very far away

K-means will terminate at a local optimum, not necessarily the global optimum, due to the complexity of the problem



TYPES OF CLUSTERING ALGORITHMS
PARTITIONAL | CENTROID | K-MEANS

Outliers force the wrong grouping

• Since k is 2, the outlier is forced into one of the two groups

• It should be left out, as shown by the diagram on the right



TYPES OF CLUSTERING ALGORITHMS
PARTITIONAL | CENTROID | K-MEANS

Possible solutions to handle outliers

• Remove the outliers (data points); preferably monitor these outliers over a few iterations before removal

• Perform random sampling to choose the centroids first, then assign the main bulk of the data afterwards
• Outliers are rare, hence unlikely to be picked (randomly) initially.
TYPES OF CLUSTERING ALGORITHMS
PARTITIONAL | CENTROID | K-MEANS

K-means is sensitive to initial seeds

• Different initial seeds can lead to different clusters being formed
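
A small sketch of this sensitivity, assuming Euclidean distance and seeds supplied by hand instead of chosen at random. The function name `k_means_from_seeds` and the four sample points (the corners of a wide rectangle) are made up for illustration.

```python
import math


def k_means_from_seeds(points, seeds, max_iterations=100):
    """K-means started from explicitly given seeds; returns the cluster labels."""
    centroids = list(seeds)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        if new_labels == labels:  # no re-assignments: converged
            break
        labels = new_labels
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


# Corners of a rectangle that is 4 units wide and 1 unit tall.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

print(k_means_from_seeds(points, [(0, 0), (4, 0)]))  # [0, 0, 1, 1]  left / right
print(k_means_from_seeds(points, [(0, 0), (0, 1)]))  # [0, 1, 0, 1]  bottom / top
```

The second run converges to a bottom/top split with a much larger SSE, which is an example of terminating at a local optimum. Running K-means several times with different seeds and keeping the result with the lowest SSE is a common way to reduce this risk.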



TYPES OF CLUSTERING ALGORITHMS
PARTITIONAL | CENTROID | K-MEANS

K-means is not suitable for clusters whose structure is not hyper-spherical

▪ The diagram on the right shows the “correct clusters”, but K-means got them wrong



TYPES OF CLUSTERING ALGORITHMS
OTHERS

▪ Bayesian-based algorithms, which are based on probability, could also be used.

▪ They generate a posterior distribution over the collection of all partitions of the data

▪ They need not be based on parameters (nonparametric)



APPLICATION OF CLUSTERING

Clustering could be used for the following:

▪ Identifying fake news – fake news usually contains certain words that are more common in sensationalized click-bait articles
▪ Spam filter – K-Means clustering has been shown to be effective in identifying spam; it analyses different sections of the email separately.
▪ Marketing and Sales – Group people with similar traits and likelihood to purchase, making marketing more effective
▪ Classifying network traffic – Use K-Means to group traffic sources and types – No precise information about the traffic source is needed, as the source will change frequently (e.g. hackers use VPNs)
APPLICATION OF CLUSTERING

Clustering could also be used for the following:

▪ Document analysis – Able to organize similar documents quickly using the characteristics identified in the paragraphs.
▪ Computer vision – Segmentation of images



ADDITIONAL NOTES FOR CLUSTERING

▪ Used for unsupervised learning (you do not need a labeled “target”)

▪ Accepts multi-feature input; for an address, use the postal code or geolocation

▪ Output is categorical

▪ If unsure which to use, start with K-means (most popular)

▪ Try different k values and methods as needed.



END OF LESSON 4
