Data Science Project VI - Ipynb - Colaboratory

02/09/22 13.41 Data Science Project VI.ipynb - Colaboratory
Dataset Description
Customer Segmentation Clustering
The goal of this project is to find patterns in customer behavior and group the customers into
several clusters that can be used as insight.
Dataset notes
CustomerID = Unique identifier for each customer
Gender = Customer's gender
Age = Customer's age
Annual Income (k$) = Customer's annual income, in thousands of dollars
Spending Score (1-100) = Score for how readily the customer spends money (1-100)
Import Library
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
df = pd.read_csv(r'/content/drive/MyDrive/digital skola/Dataset18_Clustering_Customer.csv')
df.head()
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
df.shape
(200, 5)
df.isnull().sum()
CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64
df.describe()
CustomerID Age Annual Income (k$) Spending Score (1-100)
count 200.000000 200.000000 200.000000 200.000000
mean 100.500000 38.850000 60.560000 50.200000
std 57.879185 13.969007 26.264721 25.823522
min 1.000000 18.000000 15.000000 1.000000
25% 50.750000 28.750000 41.500000 34.750000
50% 100.500000 36.000000 61.500000 50.000000
75% 150.250000 49.000000 78.000000 73.000000
max 200.000000 70.000000 137.000000 99.000000
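# Correlate numeric columns only; Gender is still a string column at this point
sns.heatmap(df.select_dtypes("number").corr(), annot=True)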
<matplotlib.axes._subplots.AxesSubplot at 0x7f6b9768b450>
Initial hypotheses
The younger the customer, the more often they spend; the older the customer, the less often they shop.
The higher the income, the more often the customer shops.
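A quick way to probe these hypotheses numerically is to look at the Pearson correlation between Spending Score and each of Age and Annual Income. The sketch below is not part of the original notebook: the helper name hypothesis_correlations is hypothetical, and the ten sample rows are copied from the first and last rows displayed in this notebook, which is far too few to draw conclusions from. In the notebook itself, the full df would be passed instead.

```python
import pandas as pd

# Ten rows copied from the notebook's displayed output (customers 1-5 and 196-200).
# This is only a stand-in so the sketch runs on its own; use the full `df` in practice.
sample = pd.DataFrame(
    {
        "Age": [19, 21, 20, 23, 31, 35, 45, 32, 32, 30],
        "Annual Income (k$)": [15, 15, 16, 16, 17, 120, 126, 126, 137, 137],
        "Spending Score (1-100)": [39, 81, 6, 77, 40, 79, 28, 74, 18, 83],
    }
)


def hypothesis_correlations(frame: pd.DataFrame) -> dict:
    """Return the Pearson correlation of Spending Score with Age and with Annual Income."""
    score = frame["Spending Score (1-100)"]
    return {
        "age_vs_score": score.corr(frame["Age"]),
        "income_vs_score": score.corr(frame["Annual Income (k$)"]),
    }


result = hypothesis_correlations(sample)
print(result)
```

A clearly negative age_vs_score would be consistent with the first hypothesis, and a clearly positive income_vs_score with the second. Values near zero would suggest that no linear relationship is visible.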
Data visualization
sns.set(style="whitegrid")
# histplot replaces the deprecated distplot; kde=True keeps the density curve
sns.histplot(df['Age'], color = "blue", bins=20, kde=True)
plt.title("Age Distribution Plot", fontsize=14)
plt.xlabel("Age", fontsize=14)
plt.ylabel("count", fontsize=14)
plt.show()
plt.figure(figsize=(14,8))
sns.countplot(x=df['Age'])
plt.title("Age countplot", fontsize=14)
plt.show()
sns.boxplot(x=df['Age'])
plt.title("Age box plot",fontsize=14)
plt.show()
sns.violinplot(y="Age", x ="Gender", data = df)
plt.title("Age Violin plot with Gender", fontsize=14)
plt.xlabel("Gender", fontsize=14)
plt.ylabel("Age", fontsize=14)
plt.show()
Spending score
sns.set(style="whitegrid")
# histplot replaces the deprecated distplot; kde=True keeps the density curve
sns.histplot(df['Spending Score (1-100)'], color = "blue", bins=20, kde=True)
plt.title("Spending Score (1-100) Distribution Plot", fontsize=14)
plt.xlabel("Spending Score (1-100)", fontsize=14)
plt.show()
sns.boxplot(x=df['Spending Score (1-100)'])
plt.title("Spending Score (1-100) box plot",fontsize=14)
plt.show()
sns.countplot(x=df['Spending Score (1-100)'])
plt.title("Spending Score (1-100) countplot", fontsize=14)
plt.show()
sns.violinplot(y="Spending Score (1-100)", x ="Gender", data = df)
plt.title("Spending Score Violin plot with Gender", fontsize=14)
plt.xlabel("Gender", fontsize=14)
plt.ylabel("Spending Score", fontsize=14)
plt.show()
Women aged 30-40 are the most frequent spenders, with an average spending score of 40-50.
Men aged 25-35 have spending scores between 40 and 60.

sns.scatterplot(df['Age'], df['Spending Score (1-100)'], hue=df['Gender'], palette=['blue'
plt.title("Scatter plot distribution of gender based on Age and spending score",fontsize=1
plt.ylabel("Spending Score", fontsize=14)
plt.show()
Data preprocessing
df.set_index('CustomerID',inplace=True)
df
            Gender  Age  Annual Income (k$)  Spending Score (1-100)
CustomerID
1             Male   19                  15                      39
2             Male   21                  15                      81
3           Female   20                  16                       6
4           Female   23                  16                      77
5           Female   31                  17                      40
...            ...  ...                 ...                     ...
196         Female   35                 120                      79
197         Female   45                 126                      28
198           Male   32                 126                      74
199           Male   32                 137                      18
200           Male   30                 137                      83
200 rows × 4 columns
df_ss = df.copy()
from sklearn.preprocessing import LabelEncoder

le=LabelEncoder()
df_ss['Gender'] = le.fit_transform(df_ss['Gender'])
df_ss.head()
Gender Age Annual Income (k$) Spending Score (1-100)
CustomerID
1 1 19 15 39
2 1 21 15 81
3 0 20 16 6
4 0 23 16 77
5 0 31 17 40
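The encoded output above shows Male as 1 and Female as 0. That follows from LabelEncoder sorting the distinct labels before numbering them. The small sketch below illustrates this on the Gender values of the first five customers; the variable names demo_le and encoded are introduced here for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

# Gender values of the first five customers, as shown in df.head()
genders = ["Male", "Male", "Female", "Female", "Female"]

demo_le = LabelEncoder()
encoded = demo_le.fit_transform(genders)

# classes_ holds the sorted labels; the position of each label is its integer code
print(list(demo_le.classes_))
print(encoded.tolist())

# inverse_transform maps integer codes back to the original labels
print(demo_le.inverse_transform([0, 1]).tolist())
```

Because the codes are arbitrary integers, they carry no ordering or distance meaning, which matters if the encoded Gender column is ever fed to a distance-based method such as K-Means.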
Age_Spend = df_ss[['Age','Spending Score (1-100)']].iloc[:,:].values
inertia_list=[]
for i in range(2,9):
    kmeans_us = KMeans(n_clusters=i,n_init=10,max_iter=100, random_state=0)
    kmeans_us.fit(Age_Spend)
    inertia_list.append(kmeans_us.inertia_)
plt.plot(range(2,9),inertia_list)
plt.xlabel("Num of clusters")
plt.ylabel("Distortion")
plt.show()
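The quantity plotted in the elbow curve above is the K-Means inertia: the sum of squared distances from each point to the centroid of its assigned cluster. The sketch below recomputes that value by hand and compares it with inertia_ for each k. It runs on synthetic data from make_blobs, which is an assumption made so the example is self-contained; it is not the customer dataset, and manual_inertia is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points standing in for the (Age, Spending Score) array
X, _ = make_blobs(n_samples=200, centers=4, n_features=2, random_state=0)


def manual_inertia(points, model):
    """Sum of squared distances from each point to its assigned centroid."""
    assigned_centers = model.cluster_centers_[model.labels_]
    return float(((points - assigned_centers) ** 2).sum())


inertias = []
checks = []  # (hand-computed value, sklearn value) for each k
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, max_iter=100, random_state=0).fit(X)
    inertias.append(km.inertia_)
    checks.append((manual_inertia(X, km), km.inertia_))

for k, (manual, reported) in zip(range(2, 9), checks):
    print(k, round(manual, 3), round(reported, 3))
```

Inertia generally keeps shrinking as k grows, so the elbow is read as the k after which the decrease flattens out rather than as the minimum of the curve.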
!pip install scikit-learn-extra
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/p

Collecting scikit-learn-extra
Downloading scikit_learn_extra-0.2.0-cp37-cp37m-manylinux2010_x86_64.whl (1.7 MB)
|████████████████████████████████| 1.7 MB 5.2 MB/s
Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.7/dist-package

Requirement already satisfied: scikit-learn>=0.23.0 in /usr/local/lib/python3.7/dist-
Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.7/dist-package
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-
Installing collected packages: scikit-learn-extra
Successfully installed scikit-learn-extra-0.2.0
from sklearn_extra.cluster import KMedoids
labels_kmedoid = KMedoids(n_clusters=4).fit_predict(Age_Spend)
labels_kmedoid
array([1, 2, 0, 2, 1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 1, 1, 0, 2, 1, 2,
0, 2, 0, 2, 0, 1, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 3, 2, 3, 1,
0, 1, 3, 1, 1, 1, 3, 1, 1, 3, 3, 3, 3, 3, 1, 3, 3, 1, 3, 3, 3, 1,
3, 3, 1, 1, 3, 3, 3, 3, 3, 1, 3, 1, 1, 3, 3, 1, 3, 3, 1, 3, 3, 1,
1, 3, 3, 1, 3, 3, 1, 1, 3, 1, 3, 1, 1, 3, 3, 1, 3, 1, 3, 3, 3, 3,
3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 1, 1, 1, 2, 1, 2, 3, 2, 0, 2, 0, 2,
1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 3, 2, 0, 2, 0, 2, 0, 2,
0, 2, 0, 2, 0, 2, 3, 2, 0, 2, 0, 2, 0, 2, 0, 1, 0, 2, 0, 2, 0, 2,
0, 2, 0, 2, 0, 2, 0, 2, 3, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2,
0, 2])
labels_kmean_pp = KMeans(init='k-means++',n_clusters=4).fit_predict(Age_Spend)
labels_kmean_pp
array([3, 1, 0, 1, 3, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 3, 3, 0, 1, 3, 1,
0, 1, 0, 1, 0, 3, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 2, 1, 2, 3,
0, 3, 2, 3, 3, 3, 2, 3, 3, 2, 2, 2, 2, 2, 3, 2, 2, 3, 2, 2, 2, 3,
2, 2, 3, 3, 2, 2, 2, 2, 2, 3, 2, 3, 3, 2, 2, 3, 2, 2, 3, 2, 2, 3,
3, 2, 2, 3, 2, 3, 3, 3, 2, 3, 2, 3, 3, 2, 2, 3, 2, 3, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 1, 3, 1, 2, 1, 0, 1, 0, 1,
3, 1, 0, 1, 0, 1, 0, 1, 0, 1, 3, 1, 0, 1, 2, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 2, 1, 0, 1, 0, 1, 0, 1, 0, 3, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 3, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1], dtype=int32)
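The K-Medoids and K-Means++ label arrays above cannot be compared element by element, because each algorithm numbers its clusters arbitrarily. One option, which the original notebook does not use, is the adjusted Rand index from sklearn.metrics: it compares the two groupings and ignores how the clusters are numbered. The sketch below applies it to the first 22 labels of each printed array.

```python
from sklearn.metrics import adjusted_rand_score

# First 22 labels of each array, copied from the printed outputs above
kmedoid_head = [1, 2, 0, 2, 1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 1, 1, 0, 2, 1, 2]
kmeans_head = [3, 1, 0, 1, 3, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 3, 3, 0, 1, 3, 1]

# 1.0 means the two groupings are identical up to a renaming of the cluster ids
ari = adjusted_rand_score(kmedoid_head, kmeans_head)
print(ari)
```

In the notebook, the same call on the full labels_kmedoid and labels_kmean_pp arrays would show how closely the two methods agree across all 200 customers.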
import scipy.cluster.hierarchy as shc
# Draw each linkage method on its own axes so the four dendrograms do not overlap
fig, axes = plt.subplots(2, 2, figsize=(14,8))
for ax, method in zip(axes.ravel(), ["complete", "single", "average", "ward"]):
    shc.dendrogram(shc.linkage(Age_Spend, method=method), ax=ax)
    ax.set_title("Dendrogram (" + method + " linkage)")
plt.show()
labels_cluster_hierarchical_Ward = AgglomerativeClustering(n_clusters=4, linkage="ward").f
labels_cluster_hierarchical_Complete = AgglomerativeClustering(n_clusters=4, linkage="comp
labels_cluster_hierarchical_Ward
array([0, 3, 2, 3, 0, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 0, 0, 0, 3, 0, 3,
2, 3, 2, 3, 0, 1, 0, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 1, 3, 0, 1,
0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0,
1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1,
1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 3, 0, 3, 0, 3, 2, 3, 2, 3,
0, 3, 2, 3, 2, 3, 2, 3, 2, 3, 0, 3, 2, 3, 0, 3, 2, 3, 2, 3, 2, 3,
2, 3, 2, 3, 2, 3, 1, 3, 2, 3, 0, 3, 2, 3, 2, 1, 2, 3, 2, 3, 2, 3,
2, 3, 2, 3, 0, 3, 2, 3, 0, 3, 0, 3, 2, 3, 2, 3, 2, 3, 2, 3, 0, 3,
2, 3])
labels_cluster_hierarchical_Complete
array([2, 1, 0, 1, 2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 2, 1, 0, 1, 2, 1,
0, 1, 0, 1, 0, 2, 2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 3, 1, 0, 2,
0, 1, 3, 2, 2, 2, 3, 2, 2, 3, 3, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 2,
3, 3, 2, 2, 3, 3, 3, 3, 3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 2,
2, 3, 3, 2, 3, 2, 2, 2, 3, 2, 3, 2, 2, 3, 3, 2, 3, 2, 3, 3, 3, 3,
3, 2, 2, 2, 2, 2, 3, 3, 3, 3, 2, 2, 2, 1, 2, 1, 0, 1, 0, 1, 0, 1,
2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 3, 1, 0, 1, 0, 1, 0, 1, 0, 2, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 2, 1, 0, 1, 2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1])
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score
print("Silhouette score of K-Medoids: ", silhouette_score(Age_Spend, labels_kmedoid), "\n"
print("Silhouette score of K-Means++: ", silhouette_score(Age_Spend, labels_kmean_pp), "\n
print("Silhouette score of Agglo Hierarchical Ward: ", silhouette_score(Age_Spend, labels_
print("Silhouette score of Agglo Hierarchical Complete: ", silhouette_score(Age_Spend, lab
print("Davies Bouldin score of K-Medoids: ", davies_bouldin_score(Age_Spend, labels_kmedoi
print("Davies Bouldin score of K-Means++: ", davies_bouldin_score(Age_Spend, labels_kmean_
print("Davies Bouldin score of Agglo Hierarchical Ward: ", davies_bouldin_score(Age_Spend,
print("Davies Bouldin score of Agglo Hierarchical Complete: ", davies_bouldin_score(Age_Sp
Silhouette score of K-Medoids: 0.49888640369265486
Silhouette score of K-Means++: 0.49973941540141753
Silhouette score of Agglo Hierarchical Ward: 0.4602496389565028
Silhouette score of Agglo Hierarchical Complete: 0.49294328457852726
Davies Bouldin score of K-Medoids: 0.6886500456702762
Davies Bouldin score of K-Means++: 0.6869328339833629
Davies Bouldin score of Agglo Hierarchical Ward: 0.8629286547656256
Davies Bouldin score of Agglo Hierarchical Complete: 0.6912818137334898
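The two metrics point in opposite directions: a higher silhouette score indicates better-separated clusters, while a lower Davies-Bouldin score does. The sketch below takes the scores printed above and picks the best method under each metric. The dictionary and variable names are introduced here for illustration.

```python
# Scores copied from the printed output above
silhouette = {
    "K-Medoids": 0.49888640369265486,
    "K-Means++": 0.49973941540141753,
    "Agglo Hierarchical Ward": 0.4602496389565028,
    "Agglo Hierarchical Complete": 0.49294328457852726,
}
davies_bouldin = {
    "K-Medoids": 0.6886500456702762,
    "K-Means++": 0.6869328339833629,
    "Agglo Hierarchical Ward": 0.8629286547656256,
    "Agglo Hierarchical Complete": 0.6912818137334898,
}

# Silhouette: higher is better. Davies-Bouldin: lower is better.
best_by_silhouette = max(silhouette, key=silhouette.get)
best_by_davies_bouldin = min(davies_bouldin, key=davies_bouldin.get)

print("Best by silhouette:", best_by_silhouette)
print("Best by Davies-Bouldin:", best_by_davies_bouldin)
```

Both metrics favor K-Means++ here, though its margin over K-Medoids and complete-linkage clustering is small.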
plt.scatter(Age_Spend[labels_kmean_pp == 0,0], Age_Spend[labels_kmean_pp == 0,1], c = 'pin
plt.scatter(Age_Spend[labels_kmean_pp == 1,0], Age_Spend[labels_kmean_pp == 1,1], c = 'ora
plt.scatter(Age_Spend[labels_kmean_pp == 2,0], Age_Spend[labels_kmean_pp == 2,1], c = 'gre
plt.scatter(Age_Spend[labels_kmean_pp == 3,0], Age_Spend[labels_kmean_pp == 3,1], c = 'red
plt.legend()
plt.title('Customer Segmentation using Age and spending score', fontsize=14)
plt.xlabel('Age', fontsize=14)
plt.ylabel('Spending Score', fontsize=14)
plt.show()