
Hierarchical Clustering:

In data mining and statistics, hierarchical cluster analysis is a method of cluster analysis
that seeks to build a hierarchy of clusters, i.e. a tree-type structure based on the hierarchy.
In machine learning, clustering is an unsupervised learning technique that groups data based
on the similarity between data points. There are several types of clustering algorithms in
machine learning:
 Connectivity-based clustering: This type of clustering algorithm builds clusters based
on the connectivity between the data points. Example: Hierarchical clustering
 Centroid-based clustering: This type of clustering algorithm forms clusters around the
centroids of the data points. Example: K-Means clustering, K-Modes clustering
 Distribution-based clustering: This type of clustering algorithm is modeled using
statistical distributions. It assumes that the data points in a cluster are generated from a
particular probability distribution, and the algorithm aims to estimate the parameters of
the distribution to group similar data points into clusters. Example: Gaussian Mixture
Models (GMM)
 Density-based clustering: This type of clustering algorithm groups together data
points that are in high-density regions and separates points in low-density
regions. The basic idea is that it identifies regions in the data space that
have a high density of data points and groups those points together into clusters.
Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
In this article, we will discuss connectivity-based clustering algorithms, i.e. hierarchical
clustering.
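As a quick illustration of these four families, the minimal sketch below fits one representative
algorithm from each category on the same toy data. It assumes scikit-learn is installed; the
dataset and parameter values (for example eps=2 for DBSCAN) are arbitrary choices for
demonstration, not tuned settings.

 Python3
# Sketch: one representative algorithm from each clustering family
# (toy data and parameters are illustrative only).
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Connectivity-based: hierarchical (agglomerative) clustering
print(AgglomerativeClustering(n_clusters=2).fit_predict(X))

# Centroid-based: K-Means
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))

# Distribution-based: Gaussian Mixture Model
print(GaussianMixture(n_components=2, random_state=0).fit_predict(X))

# Density-based: DBSCAN (eps and min_samples chosen for this tiny dataset)
print(DBSCAN(eps=2, min_samples=2).fit_predict(X))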

Hierarchical clustering
Hierarchical clustering is a connectivity-based clustering model that groups the data points
together that are close to each other based on the measure of similarity or distance. The
assumption is that data points that are close to each other are more similar or related than data
points that are farther apart.
A dendrogram, a tree-like figure produced by hierarchical clustering, depicts the hierarchical
relationships between groups. Individual data points are located at the bottom of the dendrogram,
while the largest clusters, which include all the data points, are located at the top. In order to
generate different numbers of clusters, the dendrogram can be sliced at various heights.
The dendrogram is created by iteratively merging or splitting clusters based on a measure of
similarity or distance between data points. Clusters are divided or merged repeatedly until all
data points are contained within a single cluster, or until the predetermined number of clusters is
attained.
To estimate the ideal number of clusters, we can look at the dendrogram and measure the height
at which its branches form distinct clusters. The dendrogram can then be sliced at this height
to obtain that number of clusters.
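For example, a minimal sketch of this slicing idea using SciPy's fcluster function (the toy
dataset and the cut height of 5 are arbitrary, illustrative choices):

 Python3
# Sketch: cut a dendrogram at a chosen height to obtain flat clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

Z = linkage(X, method='ward')                    # build the hierarchy
labels = fcluster(Z, t=5, criterion='distance')  # slice the dendrogram at height 5
print(labels)                                    # flat cluster label for each data point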
Types of Hierarchical Clustering

There are two types of hierarchical clustering:


1. Agglomerative Clustering
2. Divisive Clustering
Hierarchical Agglomerative Clustering
Agglomerative clustering is also known as the bottom-up approach or hierarchical agglomerative
clustering (HAC). It produces a structure that is more informative than the unstructured set of
clusters returned by flat clustering. This clustering algorithm does not require us to prespecify
the number of clusters. Bottom-up algorithms treat each data point as a singleton cluster at the
outset and then successively agglomerate pairs of clusters until all clusters have been merged
into a single cluster that contains all the data.
Algorithm :
given a dataset (d1, d2, d3, ..., dN) of size N
# compute the distance matrix
for i = 1 to N:
    # as the distance matrix is symmetric about
    # the primary diagonal, we compute only the lower
    # part of the primary diagonal
    for j = 1 to i:
        dis_mat[i][j] = distance(di, dj)
each data point is a singleton cluster
repeat
    merge the two clusters having minimum distance
    update the distance matrix
until only a single cluster remains
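As a complement to the library call shown later, here is a minimal from-scratch sketch of the
bottom-up procedure above. It assumes single-linkage (the minimum pairwise distance) as the merge
criterion and recomputes distances on the fly rather than maintaining an explicit distance matrix:

 Python3
# Naive single-linkage agglomerative clustering, following the pseudocode above.
import math

def agglomerative(data, n_clusters=1):
    # start with every data point in its own singleton cluster
    clusters = [[i] for i in range(len(data))]

    def dist(p, q):
        # Euclidean distance between two data points (referenced by index)
        return math.dist(data[p], data[q])

    while len(clusters) > n_clusters:
        # find the pair of clusters with the minimum single-linkage distance
        best = None
        for a in range(len(clusters)):
            for b in range(a):
                d = min(dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # merge the two closest clusters and drop the old one
        clusters[b].extend(clusters[a])
        clusters.pop(a)
    return clusters


# toy dataset: indices 0-2 form one group, 3-5 another
data = [[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]]
print(agglomerative(data, n_clusters=2))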
Hierarchical Agglomerative Clustering

Steps:
 Consider each alphabet as a single cluster and calculate the distance of one cluster
from all the other clusters.
 In the second step, comparable clusters are merged together to form a single cluster.
Let’s say cluster (B) and cluster (C) are very similar to each other, therefore we merge
them in the second step; similarly for clusters (D) and (E). At last, we get the clusters
[(A), (BC), (DE), (F)].
 We recalculate the proximity according to the algorithm and merge the two nearest
clusters ([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)].
 Repeating the same process, the clusters DEF and BC are comparable and merged
together to form a new cluster. We’re now left with clusters [(A), (BCDEF)].
 At last, the two remaining clusters are merged together to form a single cluster
[(ABCDEF)].
Python implementation of the above algorithm using the scikit-learn library:

 Python3
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# randomly chosen dataset
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# here we need to mention the number of clusters
# otherwise the result will be a single cluster
# containing all the data
clustering = AgglomerativeClustering(n_clusters=2).fit(X)

# print the class labels
print(clustering.labels_)

Output:
[1 1 1 0 0 0]
Hierarchical Divisive clustering
It is also known as the top-down approach. This algorithm also does not require us to prespecify
the number of clusters. Top-down clustering requires a method for splitting a cluster that
contains the whole data and proceeds by splitting clusters recursively until individual data
points have been split into singleton clusters.
Algorithm :
given a dataset (d1, d2, d3, ..., dN) of size N
at the top we have all the data in one cluster
the cluster is split using a flat clustering method, e.g. K-Means
repeat
    choose the best cluster among all the clusters to split
    split that cluster by the flat clustering algorithm
until each data point is in its own singleton cluster
Hierarchical Divisive clustering
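A minimal sketch of this top-down idea, using K-Means with k = 2 as the flat-clustering
subroutine. The rule of always splitting the largest remaining cluster is just one simple
heuristic for choosing the "best" cluster to split; the dataset is the same toy data as above.

 Python3
# Sketch: divisive clustering with K-Means (k = 2) as the splitting subroutine.
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, n_clusters):
    # start with all the data in a single cluster (stored as arrays of indices)
    clusters = [np.arange(len(X))]
    while len(clusters) < n_clusters:
        # simple heuristic: split the largest remaining cluster
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        # flat clustering subroutine
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters


X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
print(divisive(X, 2))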

Computing Distance Matrix


While merging two clusters, we check the distance between every pair of clusters and merge
the pair with the least distance/most similarity. But the question is: how is that distance
determined? There are different ways of defining inter-cluster distance/similarity. Some of them
are:
1. Min Distance: Find the minimum distance between any two points of the clusters.
2. Max Distance: Find the maximum distance between any two points of the clusters.
3. Group Average: Find the average distance between every two points of the clusters.
4. Ward’s Method: The similarity of two clusters is based on the increase in squared error
when two clusters are merged.
For example, if we group a given dataset using different methods, we may get different results:
Distance Matrix Comparison in Hierarchical Clustering
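A short sketch of this comparison using SciPy's linkage function is given below; the toy data and
the choice of cutting the hierarchy into two flat clusters are illustrative only.

 Python3
# Sketch: the same data grouped with different linkage criteria can give
# different merge orders and cluster assignments.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# build the hierarchy with each linkage criterion and cut it into two clusters
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion='maxclust')
    print(method, labels)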

Implementation code

 Python3
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# randomly chosen dataset
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Perform hierarchical clustering
Z = linkage(X, 'ward')

# Plot dendrogram
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data point')
plt.ylabel('Distance')
plt.show()

Output:
Hierarchical Clustering Dendrogram

Hierarchical Agglomerative vs Divisive Clustering


 Divisive clustering is more complex as compared to agglomerative clustering, as in the
case of divisive clustering we need a flat clustering method as a “subroutine” to split
each cluster until every data point has its own singleton cluster.
 Divisive clustering is more efficient if we do not generate a complete hierarchy all the
way down to individual data leaves. The time complexity of naive agglomerative
clustering is O(n³) because we exhaustively scan the N x N matrix dist_mat for the
lowest distance in each of N-1 iterations. Using a priority queue data structure we can
reduce this complexity to O(n² log n). By using some more optimizations it can be
brought down to O(n²). Whereas for divisive clustering, given a fixed number of top
levels and using an efficient flat algorithm like K-Means, divisive algorithms are linear in
the number of patterns and clusters.
 A divisive algorithm is also more accurate. Agglomerative clustering makes decisions
by considering the local patterns or neighbor points without initially taking into
account the global distribution of data; these early decisions cannot be undone.
Divisive clustering, in contrast, takes into consideration the global
distribution of data when making top-level partitioning decisions.
Kohonen Network
The Kohonen neural network differs from the feedforward back propagation neural
network.

- Kohonen neural networks are used because they are a relatively simple
network to construct that can be trained very rapidly.
- The Kohonen neural network contains only an input and output layer of
neurons. There is no hidden layer in a Kohonen neural network.
- It differs both in how it is trained and how it recalls a pattern. The Kohonen
neural network does not use any sort of activation function or bias
weights.
- Output from the Kohonen neural network does not consist of the output of
several neurons.
- When a pattern is presented to a Kohonen network, one of the output
neurons is selected as a "winner". This "winning" neuron is the output from
the Kohonen network.
- Often these "winning" neurons represent groups in the data that is
presented to the Kohonen network.
- The most significant difference between the Kohonen neural
network and the feedforward back propagation neural
network is that the Kohonen network is trained in an
unsupervised mode.

Difference between the Kohonen neural network and the feedforward back
propagation neural network:

 The Kohonen network is trained in an unsupervised mode.
 This means that the Kohonen network is presented with data, but the
correct output that corresponds to that data is not specified.
 Using the Kohonen network, this data can be classified into groups.

Limitations of the Kohonen neural network:

 Neural networks with only two layers can only be applied to
linearly separable problems, and this is the case with the Kohonen neural
network.

Self Organizing Maps – Kohonen Maps





A Self Organizing Map (or Kohonen Map or SOM) is a type of Artificial Neural
Network which is also inspired by biological models of neural systems from the
1970s. It follows an unsupervised learning approach and trains its network through a
competitive learning algorithm. SOM is used for clustering and mapping (or
dimensionality reduction) to map multidimensional data onto a lower-dimensional
space, which allows people to reduce complex problems for easy interpretation.
SOM has two layers: one is the Input layer and the other one is the Output layer.
The architecture of the Self Organizing Map with two clusters and n input features of
any sample is given below:
How does SOM work?
Let’s say we have input data of size (m, n), where m is the number of training examples and
n is the number of features in each example. First, it initializes the weights of size (n,
C), where C is the number of clusters. Then, iterating over the input data, for each
training example it updates the winning vector (the weight vector with the shortest
distance (e.g. Euclidean distance) from the training example). The weight update rule is
given by:
wij = wij(old) + alpha(t) * (xik - wij(old))
where alpha is the learning rate at time t, j denotes the winning vector, i denotes the
ith feature of the training example and k denotes the kth training example from the input
data. After training the SOM network, the trained weights are used for clustering new
examples. A new example falls in the cluster of its winning vector.

Algorithm

Training:
Step 1: Initialize the weights wij; random values may be assumed. Initialize the learning
rate α.
Step 2: Calculate the squared Euclidean distance for each cluster unit j (j = 1 to C):
D(j) = Σ (wij – xi)^2, where i = 1 to n
Step 3: Find the index J for which D(J) is minimum; J is the winning index.
Step 4: For each unit j within a specific neighborhood of J, and for all i, calculate the new
weight:
wij(new) = wij(old) + α[xi – wij(old)]
Step 5: Update the learning rate using:
α(t+1) = 0.5 * α(t)
Step 6: Test the stopping condition.
Below is the implementation of the above approach:
 Python3
import math


class SOM:

    # Function here computes the winning vector
    # by Euclidean distance
    def winner(self, weights, sample):
        D0 = 0
        D1 = 0
        for i in range(len(sample)):
            D0 = D0 + math.pow((sample[i] - weights[0][i]), 2)
            D1 = D1 + math.pow((sample[i] - weights[1][i]), 2)
        # Selecting the cluster with smallest distance as winning cluster
        if D0 < D1:
            return 0
        else:
            return 1

    # Function here updates the winning vector
    def update(self, weights, sample, J, alpha):
        # Here iterating over the weights of winning cluster and modifying them
        for i in range(len(weights[0])):
            weights[J][i] = weights[J][i] + alpha * (sample[i] - weights[J][i])
        return weights


# Driver code
def main():
    # Training Examples ( m, n )
    T = [[1, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 1]]
    m, n = len(T), len(T[0])

    # weight initialization ( n, C )
    weights = [[0.2, 0.6, 0.5, 0.9], [0.8, 0.4, 0.7, 0.3]]

    # training
    ob = SOM()
    epochs = 3
    alpha = 0.5

    for i in range(epochs):
        for j in range(m):
            # training sample
            sample = T[j]
            # Compute winner vector
            J = ob.winner(weights, sample)
            # Update winning vector
            weights = ob.update(weights, sample, J, alpha)

    # classify test sample
    s = [0, 0, 0, 1]
    J = ob.winner(weights, s)

    print("Test Sample s belongs to Cluster : ", J)
    print("Trained weights : ", weights)


if __name__ == "__main__":
    main()

Output:
Test Sample s belongs to Cluster :  0
Trained weights :  [[0.003125, 0.009375, 0.6640625, 0.9984375], [0.996875, 0.334375, 0.0109375, 0.0046875]]

Implementation of unsupervised algorithms: prepare lab programs for K-Means (a sketch is
given below); the code for hierarchical clustering and the Kohonen network is given above.
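A minimal K-Means lab sketch using scikit-learn; the toy dataset and the choice of two clusters
are arbitrary and for illustration only.

 Python3
# Sketch: K-Means clustering lab program using scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

# randomly chosen dataset (same toy data as the hierarchical examples)
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# fit K-Means with 2 clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("Cluster labels   :", kmeans.labels_)
print("Cluster centroids:", kmeans.cluster_centers_)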
