
Clustering

AN UNSUPERVISED MACHINE LEARNING TASK
Why use clustering?
Utilizes a broad set of techniques to find subgroups of observations within a data set.
Simplifies extremely large datasets by grouping features with similar values.
Uses simple, non-statistical principles.
Very flexible and adaptable set of methods.
Wide set of real-world applications, e.g., segmenting consumers into groups with similar demographics or buying patterns.
Types of Clustering
Good Clustering
Good clustering will produce clusters with:
High intra-class similarity
Low inter-class similarity
Similarity is a measure of “alikeness” of instances; it is sometimes expressed as a distance function.
K-Means Clustering
Simplest and most commonly used clustering method for splitting a dataset into a set of k groups.
k-Means is very sensitive to the value of k and to the initial randomly chosen cluster centers.
Choosing the Right k
Ideally, the appropriate value for k should be determined based on a priori knowledge or on business requirements.
Rule of thumb: k ≈ √(n/2), where n is the number of observations (this is only a starting point).
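As a quick illustrative sketch (not from the original slides), the rule of thumb can be computed directly; the dataset size n = 200 is an arbitrary assumption.

```python
import math

n = 200                      # assumed number of observations
k = round(math.sqrt(n / 2))  # rule of thumb: k ≈ √(n/2)
print(k)                     # 10 -- use only as a starting point for tuning
```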
Potential Methods of Choosing k
Elbow Method (most common): choose k such that there are diminishing returns beyond that point (see the sketch after this list).
Information Criterion Approach
Silhouette Method
Jump Method
Gap Statistic
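A minimal sketch of the elbow method using scikit-learn's KMeans; the make_blobs toy data and the range of k values are assumptions for illustration, not part of the slides.

```python
# Minimal sketch of the elbow method with scikit-learn (illustrative only).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # assumed toy data

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares for this k

# Plot k vs. inertia and look for the "elbow" where gains start to diminish.
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("Within-cluster sum of squares (inertia)")
plt.show()
```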
k-Means Clustering Example

Suppose we would like to cluster these instances. Randomly pick k initial cluster centers or centroids.
k-Means Clustering Example
 
Assign each instance to the closest cluster centroid.
The distance between each instance and the centroid is measured by the Euclidean distance:
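For an instance x and a centroid c measured over p features (this notation is assumed here, not taken from the slide), the standard formula is:

dist(x, c) = √( (x₁ − c₁)² + (x₂ − c₂)² + … + (xₚ − cₚ)² )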
k-Means Clustering Example

Move each cluster centroid to the mean of each cluster. Reassign instances closest to a different centroid to the appropriate cluster centroid.
k-Means Clustering Example

Recompute cluster centroid means. Reassign instances to clusters.
If there is no change, then the algorithm has converged (finished)!
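The assign/update loop described across these example slides can be written in a few lines; this is a minimal NumPy sketch, not the slides' own code, and the toy data and k = 3 are assumptions.

```python
# Minimal NumPy sketch of the k-means loop described above (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # assumed toy data: 100 points, 2 features
k = 3                                  # assumed number of clusters

centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
while True:
    # Assignment step: each instance goes to the closest centroid (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # Update step: move each centroid to the mean of its cluster
    # (empty clusters are not handled in this sketch).
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    # Convergence: stop when the centroids no longer change.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```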
Hierarchical Clustering
We do not need to know in advance how many clusters we want.
We end up with a tree-like visual representation of the observations, called a dendrogram.

Two Main Types of Hierarchical Clustering:
Agglomerative Clustering
Divisive Clustering

The (dis)similarity of observations is measured by their Euclidean distance, as in k-means clustering.

A linkage criterion then specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.
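A minimal sketch of how this could look with SciPy; the Ward linkage choice and the toy data are assumptions, not from the slides.

```python
# Minimal sketch: pairwise Euclidean distances + a linkage criterion -> dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(11, 2))        # assumed toy data: eleven instances, two features

# 'ward' is one possible linkage criterion; 'single', 'complete', and 'average'
# are other common choices. Distances are Euclidean by default.
Z = linkage(X, method="ward")

dendrogram(Z)                       # tree-like representation of the observations
plt.show()
```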
Agglomerative Hierarchical Clustering

Suppose we begin with eleven different instances we would like to cluster.

Here, we take a "bottom-up" approach:
1. Each observation starts in its own cluster.
2. Pairs of clusters are merged together based on similar characteristics as one moves up the hierarchy.
3. At each step of the algorithm, the two clusters that are the most similar are combined into a new, bigger cluster, until there is only one cluster.
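The same bottom-up procedure is available in scikit-learn when a flat clustering is wanted; a minimal sketch, assuming three clusters, Ward linkage, and toy data (none of which come from the slides):

```python
# Minimal sketch of bottom-up (agglomerative) clustering with scikit-learn.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(11, 2))       # assumed toy data: eleven instances

# Observations start in their own clusters and the closest pairs are merged
# until the requested number of clusters remains.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)
print(labels)                      # cluster label for each of the eleven instances
```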
Divisive Hierarchical Clustering

Suppose we begin with the same eleven different instances we would like to cluster. Here, we take a "top-down" approach:
1. All observations start in one cluster.
2. Splits are performed recursively as one moves down the hierarchy based on similarities between the observations.
3. At each step of the iteration, the most heterogeneous cluster is divided into two until all observations are in their own cluster.
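Divisive clustering has no direct implementation in the common Python libraries; one common approximation (an assumption here, not the slides' method) is to repeatedly bisect the most heterogeneous cluster, i.e., the one with the largest within-cluster sum of squares, using 2-means. A rough sketch under that assumption:

```python
# Rough sketch of a divisive ("top-down") strategy via repeated 2-means splits.
import numpy as np
from sklearn.cluster import KMeans

def divisive_clusters(X, n_clusters):
    clusters = [np.arange(len(X))]          # all observations start in one cluster
    while len(clusters) < n_clusters:
        # Pick the most heterogeneous cluster: largest within-cluster sum of squares.
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        target = clusters.pop(int(np.argmax(sse)))

        # Split it into two with 2-means and keep both halves.
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[target])
        clusters.append(target[labels == 0])
        clusters.append(target[labels == 1])
    return clusters

rng = np.random.default_rng(0)
X = rng.normal(size=(11, 2))                # assumed toy data: eleven instances
print(divisive_clusters(X, 3))              # three index arrays, one per cluster
```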
Works Cited
James, Gareth, et al. An Introduction to Statistical Learning: with Applications in R. Springer, 2017.
