Lecture 23 - Clustering
What Is Clustering?
• Group data into clusters
• Objects in the same cluster are similar to one another
• Objects in different clusters are dissimilar to each other
• Unsupervised learning: no predefined classes
[Figure: two groups of points labeled Cluster 1 and Cluster 2, with outliers lying outside both]
Outliers
• Outliers are objects that do not belong to any cluster or
form clusters of very small cardinality
[Figure: a cluster with outliers lying outside it]
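The "very small cardinality" criterion can be sketched in a few lines. This is only an illustration: the function name `flag_small_clusters` and the threshold `min_size` are assumptions, not something defined in the lecture, and what counts as "very small" depends on the data.

```python
from collections import Counter

def flag_small_clusters(labels, min_size=2):
    """Return one boolean per point: True if the point's cluster
    has fewer than min_size members (hypothetical threshold)."""
    sizes = Counter(labels)  # cluster label -> number of members
    return [sizes[label] < min_size for label in labels]

# Hypothetical cluster labels for eight points.
labels = [0, 0, 0, 1, 1, 1, 1, 2]
print(flag_small_clusters(labels))  # only the lone member of cluster 2 is flagged
```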
• Binary attributes
• e.g., gender (M/F), has_cancer(T/F)
• Ordinal/Ranked attributes
• e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
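Before distances can be computed, attributes like these are usually turned into numbers. The sketch below shows one simple way to do that, assuming binary values map to 0/1 and ordinal values map to their position in the ordering. The helper names and the sample record are hypothetical.

```python
# Ordered levels of the ordinal attribute, lowest rank first.
RANKS = ["soldier", "sergeant", "lieutenant", "captain"]

def encode_binary(value, true_value):
    """Map a two-valued attribute to 1 (matches true_value) or 0."""
    return 1 if value == true_value else 0

def encode_ordinal(value, ordered_levels):
    """Map an ordinal value to its 0-based position in the ordering."""
    return ordered_levels.index(value)

# Hypothetical record using the attribute examples above.
record = {"gender": "F", "has_cancer": "T", "rank": "lieutenant"}
encoded = [
    encode_binary(record["gender"], "M"),
    encode_binary(record["has_cancer"], "T"),
    encode_ordinal(record["rank"], RANKS),
]
print(encoded)  # [0, 1, 2]
```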
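• Minkowski distance between objects i and j with p attributes: d(i, j) = (|x_i1 - x_j1|^q + ... + |x_ip - x_jp|^q)^(1/q)
• If q = 2, d is Euclidean distance
• If q = 1, d is Manhattan distance
• Weighted distance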
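The exponent q in the bullets above belongs to what is commonly called the Minkowski distance. A minimal sketch follows, assuming points are given as equal-length sequences of numbers. The optional `weights` argument is one common way to form a weighted distance and is an assumption here, since the lecture does not give its weighting formula.

```python
def minkowski(x, y, q=2, weights=None):
    """Minkowski distance of order q between points x and y.
    q=1 gives Manhattan distance, q=2 gives Euclidean distance.
    weights (assumed form) scales each attribute's contribution."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of attributes")
    if weights is None:
        weights = [1.0] * len(x)
    total = sum(w * abs(a - b) ** q for w, a, b in zip(weights, x, y))
    return total ** (1.0 / q)

a, b = (0, 0), (3, 4)
print(minkowski(a, b, q=2))                   # Euclidean: 5.0
print(minkowski(a, b, q=1))                   # Manhattan: 7.0
print(minkowski(a, b, q=2, weights=(1, 0)))   # second attribute ignored: 3.0
```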
K-means algorithm
• Step-1: Choose K, the number of clusters to form.
• Step-2: Select K random points to act as the initial centroids.
• Step-3: Assign each data point to its nearest centroid, measured by distance. These assignments form the K clusters.
• Step-4: Compute a new centroid for each cluster (the mean of the points assigned to it).
• Step-5: Repeat Step-3, reassigning each data point to the closest of the new centroids.
• Step-6: If any reassignment occurred, go to Step-4; otherwise go to Step-7.
• Step-7: Finish.
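The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes squared Euclidean distance, picks the initial centroids from the data points themselves, keeps the old centroid if a cluster ends up empty, and adds a `max_iter` safety cap that is not part of the steps. The names `k_means`, `seed`, and `max_iter` are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step-2: K random points as centroids
    assignments = None
    for _ in range(max_iter):
        # Step-3 / Step-5: assign each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step-6: stop when no point changed cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step-4: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = mean_point(members)
    return centroids, assignments

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Which cluster gets index 0 depends on the random initial centroids, so compare points by whether they share a label rather than by the label's value.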
K-Means: Example
[Figure: K-means example with K=2 on a scatter plot, both axes running from 0 to 10. Panels in order: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]
K-means clustering with K=4
Implementation in Python
• from sklearn.cluster import KMeans
• kmeans = KMeans(n_clusters=2)
• kmeans.fit(X)  # X: array of shape (n_samples, n_features)
• https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
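For a complete run, the calls above need a data array and a way to read the result. The sketch below assumes a small made-up array `X`; `n_init` and `random_state` are set explicitly only to make the run repeatable, and the lecture does not prescribe them.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two visibly separate groups of 2-D points.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

labels = kmeans.labels_             # cluster index assigned to each row of X
centers = kmeans.cluster_centers_   # one centroid per cluster, shape (2, 2)

# Assign new, unseen points to the nearest learned centroid.
pred = kmeans.predict(np.array([[0.0, 0.0], [10.0, 10.0]]))

print(labels)
print(centers)
print(pred)
```

Cluster indices are arbitrary, so the first group may come out as 0 or 1. What matters is which points share a label.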