
10/18/23, 10:41 PM FMLASS3Q7 - Jupyter Notebook

In [49]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist, squareform  # used for the distance matrix below
from sklearn.cluster import AgglomerativeClustering

In [50]: # Load the Online Retail data
df = pd.read_csv('OnlineRetailPreProcessed.csv')

In [51]: df

Out[51]:
      Unnamed: 0  CustomerID      Price   Quantity      Delay
0              0     12346.0  -0.233043  -0.391699   2.322023
1              1     12347.0   0.292637   0.378244  -0.893733
2              2     12348.0  -0.008126  -0.258357  -0.169196
3              3     12349.0  -0.018406  -0.082001  -0.725005
4              4     12350.0  -0.196820  -0.340082   2.163220
...          ...         ...        ...        ...        ...
4367        4367     18280.0  -0.208356  -0.357288   1.845615
4368        4368     18281.0  -0.222891  -0.387397   0.882873
4369        4369     18282.0  -0.209090  -0.348685  -0.834182
4370        4370     18283.0   0.026963   2.868727  -0.873883
4371        4371     18287.0  -0.007600  -0.090604  -0.486801

4372 rows × 5 columns

In [52]: def plot_clusters(data, labels=None, title_cluster="Agglomerative Clustering"):
    # 3D scatter of the first three feature columns, colored by cluster label
    fig = plt.figure(figsize=(16, 9))
    ax = plt.axes(projection="3d")
    ax.scatter3D(data[:, 0], data[:, 1], data[:, 2], c=labels)
    ax.set_title(title_cluster)
    plt.show()

In [53]: # Select a subset of features for clustering
features = ['Price', 'Quantity', 'Delay']

# Compute the distance matrix
X = df[features].values
D = squareform(pdist(X))
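The two distance formats used above can be told apart with a small sketch. It assumes a toy array `X_demo` (hypothetical values, not the retail data): `pdist` returns the condensed 1-D vector of pairwise distances, and `squareform` expands it into the symmetric square matrix and converts it back.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy stand-in for df[features].values (hypothetical values).
X_demo = np.array([[0.0, 0.0, 0.0],
                   [3.0, 4.0, 0.0],
                   [0.0, 0.0, 12.0]])

# Condensed form: one entry per pair (0,1), (0,2), (1,2).
condensed = pdist(X_demo)

# Square form: symmetric n x n matrix with zeros on the diagonal.
square = squareform(condensed)

# squareform also converts a square matrix back to the condensed vector.
round_trip = squareform(square)

print(condensed.shape, square.shape)  # (3,) (3, 3)
```

For n rows the condensed vector has n*(n-1)/2 entries, so it needs roughly half the memory of the square matrix.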

localhost:8888/notebooks/FMLASS3Q7.ipynb 1/6

In [54]: agglo_cluster_single = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='single')
agglo_cluster_single.fit(X)

Out[54]: AgglomerativeClustering(linkage='single', metric='euclidean', n_clusters=3)

In [55]: plot_clusters(X,agglo_cluster_single.labels_,title_cluster="Agglomerative C

In [56]: agglo_cluster_comp = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='complete')
agglo_cluster_comp.fit(X)

Out[56]: AgglomerativeClustering(linkage='complete', metric='euclidean', n_clusters=3)


In [57]: plot_clusters(X,agglo_cluster_comp.labels_,title_cluster="Agglomerative Clu

In [58]: agglo_cluster_avg = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='average')
agglo_cluster_avg.fit(X)

Out[58]: AgglomerativeClustering(linkage='average', metric='euclidean', n_clusters=3)


In [59]: plot_clusters(X,agglo_cluster_avg.labels_,title_cluster="Agglomerative Clus

In [60]: # Perform hierarchical clustering with different linkage methods
linkage_methods = ['single', 'complete', 'average']
# Pass the observations X, not the square matrix D: linkage() expects raw
# observations or a condensed distance vector, and would treat D as observations.
linkages = [linkage(X, method) for method in linkage_methods]
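SciPy's `linkage` accepts either an n-by-m array of observations or a condensed distance vector from `pdist`; a square n-by-n matrix is read as observations rather than as distances. The sketch below, on hypothetical random data (`X_demo` is not the retail data), checks that the two valid inputs give the same tree and then cuts that tree into at most three flat clusters with `fcluster`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical stand-in for the standardized feature matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))

# Two equivalent ways to build the same average-linkage tree.
Z_obs = linkage(X_demo, method="average")           # raw observations
Z_cond = linkage(pdist(X_demo), method="average")   # condensed distances
same_tree = np.allclose(Z_obs, Z_cond)

# Cut the tree into at most three flat clusters (labels start at 1).
flat_labels = fcluster(Z_obs, t=3, criterion="maxclust")

print(same_tree, Z_obs.shape, np.unique(flat_labels))
```

Each row of the linkage matrix records one merge: the two cluster indices, the merge distance, and the size of the new cluster, so n observations give n-1 rows.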


In [61]: # Plot the dendrograms
# Z is the loop variable so the imported linkage() function is not shadowed
for i, Z in enumerate(linkages):
    fig = plt.figure(figsize=(10, 6))
    ax = fig.add_subplot(111)
    dendrogram(Z, ax=ax)
    # Leaves are labeled by row index in df, not by the CustomerID column
    ax.set_xlabel('Customer (row index)')
    ax.set_ylabel('Distance')
    ax.set_title('Dendrogram for {} linkage'.format(linkage_methods[i]))
    plt.show()
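With several thousand leaves, a full dendrogram is hard to read. One option, sketched here on hypothetical random data, is to draw only the top of the tree with `truncate_mode="lastp"`. The value `p=12` is an arbitrary choice for illustration, and `no_plot=True` is used only so the sketch runs without a display; drop it and pass `ax=ax` to draw the figure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical stand-in for the standardized feature matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
Z = linkage(X_demo, method="average")

# Keep only the last p merges; earlier merges are collapsed into single leaves.
info = dendrogram(Z, truncate_mode="lastp", p=12, no_plot=True)

# 'ivl' holds the leaf labels; collapsed leaves show their size as "(n)".
n_leaves = len(info["leaves"])
print(n_leaves, info["ivl"])
```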


Use cases
The choice of linkage method depends on the application. Single linkage merges the two
clusters whose closest members are nearest, so it follows chains of neighboring points and
tends to leave outliers as small clusters that merge last, which is why it is often used in
anomaly detection. Complete linkage merges on the farthest pair of members, which produces
compact clusters, and is often used in image segmentation. Average linkage merges on the
mean pairwise distance and is a general-purpose method between these two extremes.
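The difference between the linkage methods can be seen on a toy example. The sketch below uses six hypothetical 1-D points whose gaps widen steadily from 1.0 to 1.4, and asks each method for two clusters. It relies on scikit-learn's default Euclidean distance rather than passing `metric` explicitly.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical 1-D points; consecutive gaps are 1.0, 1.1, 1.2, 1.3, 1.4 (no ties).
points = np.array([0.0, 1.0, 2.1, 3.3, 4.6, 6.0]).reshape(-1, 1)

sizes = {}
for method in ["single", "complete", "average"]:
    model = AgglomerativeClustering(n_clusters=2, linkage=method)
    labels = model.fit_predict(points)
    # Record the sorted cluster sizes for this linkage method.
    sizes[method] = sorted(np.bincount(labels).tolist())

print(sizes)
# {'single': [1, 5], 'complete': [2, 4], 'average': [2, 4]}
```

Single linkage chains the first five points together and splits only at the widest gap, isolating the last point. Complete and average linkage penalize the growing spread of the chain and return a more balanced split.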

For the Online Retail data, the average linkage dendrogram appears to be the most
informative. It suggests that the customers fall into three distinct clusters, each with a
different purchasing behavior. For example, one cluster may consist of customers who
frequently purchase large quantities of low-priced items, while another may consist of
customers who infrequently purchase high-priced items.
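One way to check an interpretation like this is to profile each cluster by its mean feature values. The sketch below uses synthetic data with three well-separated groups (the centers and noise level are hypothetical, chosen only for illustration); on the real data, `demo` would be replaced by `df[features]` with the fitted labels attached.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in: three hypothetical customer groups of 10 rows each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 0.0], [0.0, 5.0, 5.0]])
X_demo = np.vstack([c + 0.1 * rng.normal(size=(10, 3)) for c in centers])

features = ["Price", "Quantity", "Delay"]
demo = pd.DataFrame(X_demo, columns=features)

# Attach average-linkage cluster labels to each row.
model = AgglomerativeClustering(n_clusters=3, linkage="average")
demo["cluster"] = model.fit_predict(X_demo)

# Per-cluster mean of each feature and number of customers per cluster.
profile = demo.groupby("cluster")[features].mean()
cluster_sizes = demo["cluster"].value_counts().sort_index()

print(profile.round(2))
print(cluster_sizes)
```

Because the features are standardized, a cluster mean well above or below zero marks the feature on which that group differs from the typical customer.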
