FMLASS3Q7 - Jupyter Notebook

10/18/23, 10:41 PM FMLASS3Q7 - Jupyter Notebook
In [49]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist, squareform  # used for the distance matrix in In [53]
from sklearn.cluster import AgglomerativeClustering

In [50]: # Load the Online Retail data

df = pd.read_csv('OnlineRetailPreProcessed.csv')
In [51]: df
Out[51]: Unnamed: 0 CustomerID Price Quantity Delay
0 0 12346.0 -0.233043 -0.391699 2.322023
1 1 12347.0 0.292637 0.378244 -0.893733
2 2 12348.0 -0.008126 -0.258357 -0.169196
3 3 12349.0 -0.018406 -0.082001 -0.725005
4 4 12350.0 -0.196820 -0.340082 2.163220
... ... ... ... ... ...
4367 4367 18280.0 -0.208356 -0.357288 1.845615
4368 4368 18281.0 -0.222891 -0.387397 0.882873
4369 4369 18282.0 -0.209090 -0.348685 -0.834182
4370 4370 18283.0 0.026963 2.868727 -0.873883
4371 4371 18287.0 -0.007600 -0.090604 -0.486801
4372 rows × 5 columns
In [52]: def plot_clusters(data, labels=None, title_cluster="Agglomerative Clustering"):
    # 3D scatter plot of the three feature columns, colored by cluster label
    fig = plt.figure(figsize=(16, 9))
    ax = plt.axes(projection="3d")
    ax.scatter3D(data[:, 0], data[:, 1], data[:, 2], c=labels)
    ax.set_title(title_cluster)
    plt.show()
In [53]: # Select a subset of features for clustering

features = ['Price','Quantity','Delay']

# Compute the distance matrix
X = df[features].values
D = squareform(pdist(X))
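The distance-matrix step can be checked on a tiny example. The sketch below uses a hypothetical four-point array (`X_toy`, not the retail data) to show the two layouts SciPy works with: `pdist` returns a condensed vector of the n(n-1)/2 pairwise distances, and `squareform` expands it into the symmetric n × n matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: 4 points with 3 features each (not the retail data).
X_toy = np.array([[0.0, 0.0, 0.0],
                  [3.0, 4.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 1.0, 1.0]])

condensed = pdist(X_toy)        # condensed vector, length n*(n-1)/2 = 6
square = squareform(condensed)  # symmetric 4 x 4 matrix with a zero diagonal

print(condensed.shape, square.shape)
print(square[0, 1])             # Euclidean distance between rows 0 and 1
```

For the 4372 customers in this notebook, the square matrix holds about 19 million float64 values (roughly 150 MB), about twice the size of the condensed vector.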

In [54]: agglo_cluster_single = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='single')
agglo_cluster_single.fit(X)
Out[54]: AgglomerativeClustering(linkage='single', metric='euclidean', n_clusters=3)
In [55]: plot_clusters(X,agglo_cluster_single.labels_,title_cluster="Agglomerative C
In [56]: agglo_cluster_comp = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='complete')
agglo_cluster_comp.fit(X)
Out[56]: AgglomerativeClustering(linkage='complete', metric='euclidean', n_clusters=3)
In [57]: plot_clusters(X,agglo_cluster_comp.labels_,title_cluster="Agglomerative Clu
In [58]: agglo_cluster_avg = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='average')
agglo_cluster_avg.fit(X)
Out[58]: AgglomerativeClustering(linkage='average', metric='euclidean', n_clusters=3)
In [59]: plot_clusters(X,agglo_cluster_avg.labels_,title_cluster="Agglomerative Clus
In [60]: # Perform hierarchical clustering with different linkage methods.
# linkage() expects raw observations or a condensed distance vector. Passing the
# square matrix D raises a ClusterWarning and treats each row of D as an observation,
# so the feature matrix X is passed instead (Euclidean distance is the default).
linkage_methods = ['single', 'complete', 'average']
linkages = [linkage(X, method) for method in linkage_methods]
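The input format matters for `scipy.cluster.hierarchy.linkage`: it accepts either a 2-D array of observations or a 1-D condensed distance vector. A square distance matrix is read as a set of observations, which is what the ClusterWarning points at. The following sketch, on hypothetical random data (`X_toy`), suggests that the first two inputs agree while the square matrix gives a different hierarchy.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))  # hypothetical stand-in for the feature matrix

Z_obs = linkage(X_toy, method='average')          # raw observations
Z_cond = linkage(pdist(X_toy), method='average')  # condensed distance vector

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the ClusterWarning for this demo
    # Square matrix: each row is treated as a 20-dimensional observation.
    Z_square = linkage(squareform(pdist(X_toy)), method='average')

print(np.allclose(Z_obs, Z_cond))                # same hierarchy
print(np.allclose(Z_obs[:, 2], Z_square[:, 2]))  # merge heights differ
```

If a square matrix `D` is already in memory, `squareform(D)` converts it back to the condensed form that `linkage` expects.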
In [61]: # Plot the dendrograms
# Z is one linkage matrix per method; the name avoids shadowing the imported linkage()
for method, Z in zip(linkage_methods, linkages):
    fig = plt.figure(figsize=(10, 6))
    ax = fig.add_subplot(111)
    dendrogram(Z, ax=ax)
    ax.set_xlabel('Customer (row index)')
    ax.set_ylabel('Distance')
    ax.set_title('Dendrogram for {} linkage'.format(method))
    plt.show()
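With several thousand leaves, a full dendrogram is hard to read. One option is SciPy's `truncate_mode='lastp'`, which keeps only the last `p` merged clusters as leaves. This is a minimal sketch on hypothetical random data (`X_toy`); `no_plot=True` returns the layout as a dictionary without drawing, and dropping it (and passing `ax=ax`) would draw the condensed tree.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))  # hypothetical stand-in for the feature matrix

Z = linkage(X_toy, method='average')

# Collapse the tree to its last 12 merged clusters; leaf labels in
# parentheses give the number of original points inside each collapsed leaf.
summary = dendrogram(Z, truncate_mode='lastp', p=12, no_plot=True)

print(len(summary['ivl']))  # number of leaves in the truncated tree
print(summary['ivl'])
```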
Use cases
The choice of linkage method depends on the specific application. Single linkage merges
clusters by their closest pair of points, so it follows chains of nearby points and tends to
leave outliers as singletons that merge late, which is why it is often used in anomaly
detection. Complete linkage merges by the farthest pair, which favors compact clusters and
makes it a common choice in image segmentation. Average linkage uses the mean pairwise
distance and is a general-purpose compromise between the two.
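The behavior described above can be explored on synthetic data. The sketch below builds a hypothetical blob of 200 points plus one far-away outlier (`X_toy`, not the retail data) and prints the cluster sizes each linkage produces. The outlier is far enough away that every method isolates it as a singleton; how the remaining 200 points are split is what differs between methods.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
blob = rng.normal(size=(200, 3))          # hypothetical dense group of points
outlier = np.array([[50.0, 50.0, 50.0]])  # one point far from the blob
X_toy = np.vstack([blob, outlier])

sizes = {}
for method in ['single', 'complete', 'average']:
    model = AgglomerativeClustering(n_clusters=3, linkage=method)
    labels = model.fit_predict(X_toy)
    # Sorted cluster sizes, smallest first
    sizes[method] = sorted(np.bincount(labels).tolist())

for method, counts in sizes.items():
    print(method, counts)
```

Comparing the printed sizes gives a quick feel for whether a method is peeling off fringe points or producing groups of comparable size.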
In the context of the Online Retail data, the average linkage dendrogram appears to be the
most informative. Cut at three clusters, it groups customers by their standardized Price,
Quantity, and Delay values, so each cluster may reflect a different purchasing behavior. For
example, one cluster may consist of customers who purchase large quantities of low-priced
items, while another may consist of customers who purchase high-priced items in small
quantities.
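One way to check how clusters differ is to cut the hierarchy into three groups and compare per-cluster feature means. The sketch below does this on hypothetical random data in a DataFrame named `toy` with the same column names as the notebook; it is not the Online Retail file, so the numbers it prints say nothing about the real customers.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

features = ['Price', 'Quantity', 'Delay']

rng = np.random.default_rng(0)
# Hypothetical stand-in for the standardized customer features
toy = pd.DataFrame(rng.normal(size=(300, 3)), columns=features)

Z = linkage(toy[features].values, method='average')

# Cut the tree so that at most three flat clusters remain
toy['cluster'] = fcluster(Z, t=3, criterion='maxclust')

# Mean of each feature per cluster, plus the cluster size
profile = toy.groupby('cluster')[features].mean()
profile['size'] = toy.groupby('cluster').size()

print(profile)
```

On the real data, the same summary could be built from `df[features]`; clusters whose mean Price and Quantity have opposite signs would support the reading given above.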
