
Clustering

What Is Clustering?
• Group data objects into clusters
• Objects are similar to one another within the same cluster
• Objects are dissimilar to the objects in other clusters
• Unsupervised learning: no predefined classes

Outliers
• Outliers are objects that do not belong to any cluster or form clusters of very small cardinality
[Figure: two clusters (Cluster 1, Cluster 2) with outliers lying outside them]
• In some applications we are interested in discovering outliers, not clusters (outlier analysis)
Why do we cluster?
• Clustering: given a collection of data objects, group them so that they are
• Similar to one another within the same cluster
• Dissimilar to the objects in other clusters

• Clustering results are used:
• As a stand-alone tool to get insight into data distribution
• Visualization of clusters may unveil important information
• As a preprocessing step for other algorithms
• Efficient indexing or compression often relies on clustering
Applications of clustering?
• Image Processing
• Cluster images based on their visual content
• Web
• Cluster groups of users based on their access patterns on webpages
• Cluster webpages based on their content
• Bioinformatics
• Cluster similar proteins together (similarity with respect to chemical structure and/or functionality, etc.)
• Many more…
Observations to cluster
• Real-value attributes/variables
• e.g., salary, height

• Binary attributes
• e.g., gender (M/F), has_cancer(T/F)

• Nominal (categorical) attributes
• e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)

• Ordinal/Ranked attributes
• e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)

• Variables of mixed types
• Multiple attributes with various types
What Is A Good Clustering?
• High intra-class similarity and low inter-class similarity
• Depending on the similarity measure
• The ability to discover some or all of the hidden patterns
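One rough way to put numbers on "high intra-class similarity and low inter-class similarity" is to compare the average distance between points in the same cluster with the average distance between points in different clusters. The sketch below assumes Euclidean distance as the measure; the function name `mean_intra_inter` and the toy data are hypothetical, chosen only for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def mean_intra_inter(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Pairs sharing a label count as intra-cluster, all others as inter-cluster.
        (intra if lp == lq else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical toy data: two tight groups far apart.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]

intra, inter = mean_intra_inter(points, labels)
print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```

A small intra-cluster value next to a large inter-cluster value suggests compact, well-separated clusters under the chosen distance; a different distance function may rank the same clustering differently.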
How Good Is A Clustering?
• Dissimilarity/similarity depends on distance function
• Different applications have different functions
• Judgment of clustering quality is typically highly subjective
Similarity and Dissimilarity Between Objects
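• Distances are the measures normally used
• Distance: a generalization (Minkowski distance) over p attributes
$d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$
• If q = 2, d is Euclidean distance
• If q = 1, d is Manhattan distance
• Weighted distance: each attribute's term is scaled by a weight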
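The generalized distance with parameter q can be sketched in a few lines of plain Python. The function name `minkowski` and the `weights` argument are assumptions made for this illustration; in particular, multiplying each term by its weight inside the sum is one common convention for a weighted distance, not necessarily the one intended on the slide.

```python
def minkowski(x, y, q=2, weights=None):
    """Minkowski distance of order q between equal-length vectors x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if weights is None:
        weights = [1.0] * len(x)  # unweighted case
    # Assumed convention: each attribute's term is multiplied by its weight.
    total = sum(w * abs(a - b) ** q for w, a, b in zip(weights, x, y))
    return total ** (1.0 / q)


x, y = (1, 2), (4, 6)
print(minkowski(x, y, q=1))  # Manhattan: |1-4| + |2-6| = 7
print(minkowski(x, y, q=2))  # Euclidean: sqrt(9 + 16) = 5
print(minkowski(x, y, q=2, weights=(4, 1)))  # weighted: sqrt(4*9 + 16)
```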
K-means algorithm
• Step-1: Choose K, the number of clusters to be formed.
• Step-2: Select K random points to act as the initial centroids.
• Step-3: Assign each data point to its nearest centroid; these assignments form the K clusters.
• Step-4: Recompute the centroid of each cluster as the mean of the points assigned to it.
• Step-5: Repeat Step-3, reassigning each data point to the closest of the new centroids.
• Step-6: If any reassignment occurred, go to Step-4; otherwise go to Step-7.
• Step-7: FINISH
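The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, picks the initial centroids from the data points themselves, and the `seed` and `max_iter` parameters as well as the handling of empty clusters are choices made here for the example.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # Step 2: K random points as centroids
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:           # Step 6: no reassignment, so stop
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                    # assumption: keep old centroid if empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical toy data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which cluster gets index 0 depends on the random initial centroids, so only the grouping of the points is meaningful, not the label values themselves.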
K-Means: Example
[Figure: K-means with K=2 on scatter plots with both axes running from 0 to 10. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]
K-means clustering with K=4
Implementation in Python
• from sklearn.cluster import KMeans
• kmeans = KMeans(n_clusters=2)
• kmeans.fit(X)  # X: array-like of shape (n_samples, n_features)
• https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
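A fuller version of the scikit-learn calls above, with data supplied so that it runs end to end, might look like this. The toy data in `X` is hypothetical, and `n_init` and `random_state` are set here only to make repeated runs reproducible.

```python
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of 2-D points.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

# n_init and random_state are set explicitly so repeated runs give the same result.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_)            # cluster index assigned to each row of X
print(kmeans.cluster_centers_)   # one centroid per cluster
print(kmeans.predict([[0, 0], [10, 10]]))  # nearest centroid for new points
```

As with any K-means run, the numeric label given to each cluster is arbitrary; what matters is which points share a label.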
