10/18/23, 10:41 PM FMLASS3Q7 - Jupyter Notebook

In [49]: import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist, squareform  # needed to build the distance matrix D
from sklearn.cluster import AgglomerativeClustering

In [50]: # Load the Online Retail data

df = pd.read_csv('OnlineRetailPreProcessed.csv')

In [51]: df

Out[51]:       Unnamed: 0  CustomerID      Price   Quantity      Delay
         0              0     12346.0  -0.233043  -0.391699   2.322023
         1              1     12347.0   0.292637   0.378244  -0.893733
         2              2     12348.0  -0.008126  -0.258357  -0.169196
         3              3     12349.0  -0.018406  -0.082001  -0.725005
         4              4     12350.0  -0.196820  -0.340082   2.163220
         ...          ...         ...        ...        ...        ...
         4367        4367     18280.0  -0.208356  -0.357288   1.845615
         4368        4368     18281.0  -0.222891  -0.387397   0.882873
         4369        4369     18282.0  -0.209090  -0.348685  -0.834182
         4370        4370     18283.0   0.026963   2.868727  -0.873883
         4371        4371     18287.0  -0.007600  -0.090604  -0.486801

4372 rows × 5 columns

In [52]: def plot_clusters(data,labels=None,title_cluster="Agglomerative Clustering"

    fig = plt.figure(figsize=(16, 9))
    ax = plt.axes(projection="3d")
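The plotting helper above is cut off in this export, so its full body is unknown. The following is only a minimal sketch of what such a helper could look like, assuming it draws a 3D scatter of the three features colored by cluster label. The name `plot_clusters_sketch`, the axis labels, the colormap, and the synthetic data are all assumptions, not the original code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np


def plot_clusters_sketch(data, labels=None, title_cluster="Agglomerative Clustering"):
    """Hypothetical helper: 3D scatter of the first three columns of `data`."""
    fig = plt.figure(figsize=(16, 9))
    ax = plt.axes(projection="3d")
    # Color each point by its cluster label when labels are supplied.
    ax.scatter(data[:, 0], data[:, 1], data[:, 2], c=labels, cmap="viridis", s=10)
    ax.set_xlabel("Price")
    ax.set_ylabel("Quantity")
    ax.set_zlabel("Delay")
    ax.set_title(title_cluster)
    return fig, ax


# Synthetic stand-in data, not the Online Retail dataset.
rng = np.random.default_rng(0)
demo_data = rng.normal(size=(50, 3))
demo_labels = rng.integers(0, 3, size=50)
fig, ax = plot_clusters_sketch(demo_data, demo_labels, title_cluster="Demo")
```

Returning `fig` and `ax` lets the caller adjust the view angle or save the figure afterwards.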

In [53]: # Select a subset of features for clustering

features = ['Price','Quantity','Delay']

# Compute the distance matrix
X = df[features].values
D = squareform(pdist(X))
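`pdist` returns a condensed distance vector and `squareform` expands it into a full square matrix. The difference between the two forms matters for the SciPy clustering calls later in the notebook. A small illustration with four made-up points (the array `X_demo` is hypothetical):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Four hypothetical 3D points standing in for (Price, Quantity, Delay) rows.
X_demo = np.array([[0.0, 0.0, 0.0],
                   [3.0, 4.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [3.0, 4.0, 1.0]])

condensed = pdist(X_demo)        # Euclidean by default; length n*(n-1)/2 = 6
square = squareform(condensed)   # 4 x 4 symmetric matrix with a zero diagonal

print(condensed.shape, square.shape)
print(square[0, 1])              # distance between rows 0 and 1
```

For the 4372 customers here, the condensed form holds about 9.6 million distances, while the square form stores each of them twice plus the diagonal.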

In [54]: agglo_cluster_single=AgglomerativeClustering(n_clusters=3,metric='euclidean

Out[54]: AgglomerativeClustering(linkage='single', metric='euclidean', n_clusters=

In [55]: plot_clusters(X,agglo_cluster_single.labels_,title_cluster="Agglomerative C

In [56]: agglo_cluster_comp=AgglomerativeClustering(n_clusters=3,metric='euclidean',

Out[56]: AgglomerativeClustering(linkage='complete', metric='euclidean', n_clusters

In [57]: plot_clusters(X,agglo_cluster_comp.labels_,title_cluster="Agglomerative Clu

In [58]: agglo_cluster_avg=AgglomerativeClustering(n_clusters=3,metric='euclidean',l

Out[58]: AgglomerativeClustering(linkage='average', metric='euclidean', n_clusters=

In [59]: plot_clusters(X,agglo_cluster_avg.labels_,title_cluster="Agglomerative Clus
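Besides the 3D plots, the three fits can be compared by counting how many points land in each cluster. The sketch below assumes synthetic data and a hypothetical helper named `cluster_sizes`. It leaves the distance measure at scikit-learn's default (Euclidean) so it does not depend on the `metric` keyword.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def cluster_sizes(X, n_clusters=3, methods=("single", "complete", "average")):
    """Fit one model per linkage method and return cluster sizes, largest first."""
    sizes = {}
    for method in methods:
        model = AgglomerativeClustering(n_clusters=n_clusters, linkage=method).fit(X)
        counts = np.bincount(model.labels_, minlength=n_clusters)
        sizes[method] = np.sort(counts)[::-1]
    return sizes


# Synthetic stand-in: two blobs of different size plus one distant point.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(150, 3)),
    rng.normal(loc=4.0, scale=1.0, size=(50, 3)),
    np.array([[12.0, 12.0, 12.0]]),
])

demo_sizes = cluster_sizes(X_demo)
for method, counts in demo_sizes.items():
    print(method, counts)
```

Very unbalanced sizes, such as one large cluster plus a few near-singletons, often indicate that a linkage method is isolating outliers rather than separating groups.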

In [60]: # Perform hierarchical clustering with different linkage methods

linkage_methods = ['single', 'complete', 'average']
# linkage() expects raw observations or a condensed distance vector, not the
# square matrix D, so pass X directly (Euclidean distance is the default).
linkages = [linkage(X, method) for method in linkage_methods]
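SciPy's `linkage` accepts either a 2D array of raw observations or a 1D condensed distance vector. A square distance matrix is read as observations, which is what the ClusterWarning about an uncondensed matrix refers to. The sketch below uses synthetic data (an assumption) to check that the two valid input forms give the same result, and then cuts the tree into at most three flat clusters with `fcluster`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 3))  # synthetic stand-in for the feature matrix

Z_from_obs = linkage(X_demo, method="average")               # raw observations
Z_from_condensed = linkage(pdist(X_demo), method="average")  # condensed distances

# Both calls describe the same merge tree.
same_tree = np.allclose(Z_from_obs, Z_from_condensed)
print(same_tree)

# Cut the tree so that at most three flat clusters remain (labels start at 1).
flat_labels = fcluster(Z_from_obs, t=3, criterion="maxclust")
print(np.unique(flat_labels))
```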

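In [61]: # Plot the dendrograms

# Use Z as the loop variable so the imported linkage() function is not shadowed.
for i, Z in enumerate(linkages):
    fig = plt.figure(figsize=(10, 6))
    ax = fig.add_subplot(111)
    dendrogram(Z, ax=ax)
    # Leaves are labeled by row position in X, not by the CustomerID column.
    ax.set_xlabel('Customer (row index)')
    ax.set_ylabel('Merge distance')
    ax.set_title('Dendrogram for {} linkage'.format(linkage_methods[i]))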
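With several thousand leaves, a full dendrogram is hard to read. One option is the `truncate_mode` argument of `dendrogram`, which collapses the lower part of the tree. This is a minimal sketch on synthetic data; the value `p=20` is an arbitrary choice for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(500, 3))  # synthetic stand-in for the feature matrix
Z = linkage(X_demo, method="average")

fig, ax = plt.subplots(figsize=(10, 6))
# 'lastp' keeps only the last p merges, so the plot has p leaves.
# A collapsed leaf is labeled with its member count in parentheses.
result = dendrogram(Z, truncate_mode="lastp", p=20, ax=ax)
ax.set_xlabel("Cluster size (in parentheses) or row index")
ax.set_ylabel("Merge distance")
ax.set_title("Truncated dendrogram, average linkage")
```

The heights of the top merges are unchanged by truncation, so the plot can still be used to judge where a cut into a few clusters would fall.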

Use cases
The choice of linkage method depends on the application. Single linkage merges clusters by
their closest pair of points, so it can follow elongated, chain-like structures and tends to
leave outliers as singletons that merge late; this is why it is sometimes used for anomaly
detection, although the same chaining can also lump distinct groups together. Complete
linkage merges by the farthest pair of points, which favors compact clusters, and is often
used in image segmentation. Average linkage uses the mean pairwise distance, a compromise
between the two that makes it a common general-purpose choice.
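The three criteria differ only in how they reduce the pairwise distances between two clusters to a single number. A small worked example with two made-up clusters (the arrays `A` and `B` are hypothetical):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters on a line, two points each.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

pairwise = cdist(A, B)         # all point-to-point distances: [[4, 6], [3, 5]]

single_d = pairwise.min()      # closest pair   -> 3.0
complete_d = pairwise.max()    # farthest pair  -> 6.0
average_d = pairwise.mean()    # mean of pairs  -> 4.5

print(single_d, complete_d, average_d)
```

At each step, agglomerative clustering merges the two clusters for which the chosen quantity is smallest.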

For the Online Retail data, the average linkage dendrogram appears to be the most
informative. Cut at three clusters (the n_clusters=3 used above), it separates the customers
into groups that differ in Price, Quantity, and Delay, which suggests different purchasing
behavior. For example, one cluster may consist of customers who purchase large quantities of
low-priced items, while another may consist of customers who purchase high-priced items in
small quantities. Confirming such an interpretation requires inspecting the per-cluster
feature means.
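Two checks can back up this kind of reading. The cophenetic correlation measures how faithfully each dendrogram preserves the original pairwise distances, and per-cluster feature means show how the groups differ. The sketch below runs on synthetic data with the same column names, so the printed numbers say nothing about the retail customers. `demo`, `coph`, and `profile` are hypothetical names.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster, cophenet
from scipy.spatial.distance import pdist

features = ["Price", "Quantity", "Delay"]
rng = np.random.default_rng(3)
demo = pd.DataFrame(rng.normal(size=(120, 3)), columns=features)  # synthetic data
distances = pdist(demo[features].values)

# Cophenetic correlation per linkage method (closer to 1 means less distortion).
coph = {}
for method in ["single", "complete", "average"]:
    Z = linkage(distances, method=method)
    coph[method], _ = cophenet(Z, distances)
print(coph)

# Profile the three-cluster cut of the average linkage tree.
Z_avg = linkage(distances, method="average")
demo["cluster"] = fcluster(Z_avg, t=3, criterion="maxclust")
profile = demo.groupby("cluster")[features].mean()
profile["n_customers"] = demo.groupby("cluster").size()
print(profile)
```

On the standardized features used in this notebook, a cluster mean well above or below zero marks the feature that sets that group apart.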

