
Customer Segmentation in

Python: A Practical Approach

Image by Author | Created Using Excalidraw and Flaticon

Customer segmentation can help businesses tailor their marketing efforts and
improve customer satisfaction. Here’s how.

Functionally, customer segmentation involves dividing a customer base into distinct groups or segments based on shared characteristics and behaviors.
By understanding the needs and preferences of each segment, businesses
can deliver more personalized and effective marketing campaigns, leading to
increased customer retention and revenue.
In this tutorial, we’ll explore customer segmentation in Python by combining
two fundamental techniques: RFM (Recency, Frequency, Monetary)
analysis and K-Means clustering. RFM analysis provides a structured
framework for evaluating customer behavior, while K-means clustering offers
a data-driven approach to group customers into meaningful segments. We’ll
work with a real-world dataset from the retail industry: the Online Retail
dataset from UCI machine learning repository.
From data preprocessing to cluster analysis and visualization, we’ll code our
way through each step. So let’s dive in!

Our Approach: RFM Analysis and K-Means Clustering

Let’s start by stating our goal: By applying RFM analysis and K-means
clustering to this dataset, we’d like to gain insights into customer behavior and
preferences.

RFM Analysis is a simple yet powerful method to quantify customer behavior.


It evaluates customers based on three key dimensions:

 Recency (R): How recently did a particular customer make a purchase?


 Frequency (F): How often do they make purchases?
 Monetary Value (M): How much money do they spend?
We’ll use the information in the dataset to compute the recency, frequency,
and monetary values. Then, we’ll map these values to the generally used
RFM score scale of 1 - 5.
If you’d like, you can explore and analyze further using these RFM scores. But
we’ll try to identify customer segments with similar RFM characteristics. And
for this, we’ll use K-Means clustering, an unsupervised machine learning
algorithm that groups similar data points into clusters.

So let’s start coding!

🔗 Link to Google Colab notebook.

Step 1 – Import Necessary Libraries and Modules

First, let’s import the necessary libraries and the specific modules as needed:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

We need pandas and matplotlib for data exploration and visualization, and
the KMeans class from scikit-learn’s cluster module to perform K-Means
clustering.

Step 2 – Load the Dataset

As mentioned, we’ll use the Online Retail dataset. The dataset contains
customer records: transactional information, including purchase dates,
quantities, prices, and customer IDs.
Let's read in the data, which is originally in an Excel file, from its URL into a pandas dataframe.
# Load the dataset from UCI repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
data = pd.read_excel(url)

Alternatively, you can download the dataset and read the Excel file into a pandas dataframe.
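If you go the local-file route, the call is the same, just pointed at your downloaded copy (the filename below is whatever you saved the file as):

# Load the dataset from a local copy of the Excel file
data = pd.read_excel("Online Retail.xlsx")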

Step 3 – Explore and Clean the Dataset

Now let’s start exploring the dataset. Look at the first few rows of the dataset:
data.head()

Output of data.head()

Now call the describe() method on the dataframe to understand the numerical
features better:
data.describe()
We see that the “CustomerID” column is currently a floating point value. When
we clean the data, we’ll cast it into an integer:

Output of data.describe()

Also note that the dataset is quite noisy. The “Quantity” and “UnitPrice”
columns contain negative values:
Output of data.describe()

Let’s take a closer look at the columns and their data types:
data.info()

We see that the dataset has over 541K records and the “Description” and
“CustomerID” columns contain missing values:
Let’s get the count of missing values in each column:

# Check for missing values in each column


missing_values = data.isnull().sum()
print(missing_values)

As expected, the “CustomerID” and “Description” columns contain missing values:
For our analysis, we don’t need the product description contained in the
“Description” column. However, we need the “CustomerID” for the next steps
in our analysis. So let’s drop the records with missing “CustomerID”:
# Drop rows with missing CustomerID
data.dropna(subset=['CustomerID'], inplace=True)

Also recall that the values in the “Quantity” and “UnitPrice” columns should be strictly positive, but they contain negative values. So let's also drop the records with non-positive values for “Quantity” and “UnitPrice”:
# Keep only rows with positive Quantity and UnitPrice
data = data[(data['Quantity'] > 0) & (data['UnitPrice'] > 0)]

Let’s also convert the “CustomerID” to an integer:


data['CustomerID'] = data['CustomerID'].astype(int)

# Verify the data type conversion


print(data.dtypes)
Step 4 – Compute Recency, Frequency, and Monetary Value

Let’s start out by defining a reference date snapshot_date that’s a day later than
the most recent date in the “InvoiceDate” column:
snapshot_date = max(data['InvoiceDate']) + pd.DateOffset(days=1)

Next, create a “Total” column that contains Quantity*UnitPrice for all the
records:
data['Total'] = data['Quantity'] * data['UnitPrice']

To calculate Recency, Frequency, and MonetaryValue, we compute the following, grouped by CustomerID:
 For recency, we’ll calculate the difference between the most recent
purchase date and a reference date (snapshot_date). This gives the number
of days since the customer's last purchase. So smaller values indicate
that a customer has made a purchase more recently. But when we talk
about recency scores, we’d want customers who bought recently to have a
higher recency score, yes? We’ll handle this in the next step.
 Because frequency measures how often a customer makes purchases,
we’ll calculate it as the total number of unique invoices or transactions
made by each customer.
 Monetary value quantifies how much money a customer spends. So we'll sum up the total monetary value across each customer's transactions.
rfm = data.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'InvoiceNo': 'nunique',
    'Total': 'sum'
})

Let’s rename the columns for readability:


rfm.rename(columns={'InvoiceDate': 'Recency', 'InvoiceNo': 'Frequency', 'Total': 'MonetaryValue'}, inplace=True)
rfm.head()

Step 5 – Map RFM Values onto a 1-5 Scale

Now let's map the “Recency”, “Frequency”, and “MonetaryValue” columns to values on a scale of 1-5, i.e., one of {1, 2, 3, 4, 5}.

We’ll essentially assign the values to five different bins, and map each bin to a
value. To help us fix the bin edges, let’s use the quantile values of the
“Recency”, “Frequency”, and “MonetaryValue” columns:
rfm.describe()

Here’s how we define the custom bin edges:


# Calculate custom bin edges for Recency, Frequency, and Monetary scores
recency_bins = [rfm['Recency'].min() - 1, 20, 50, 150, 250, rfm['Recency'].max()]
frequency_bins = [rfm['Frequency'].min() - 1, 2, 3, 10, 100, rfm['Frequency'].max()]
monetary_bins = [rfm['MonetaryValue'].min() - 3, 300, 600, 2000, 5000, rfm['MonetaryValue'].max()]

Now that we’ve defined the bin edges, let’s map the scores to corresponding
labels between 1 and 5 (both inclusive):
# Calculate Recency score based on custom bins
rfm['R_Score'] = pd.cut(rfm['Recency'], bins=recency_bins, labels=range(1, 6), include_lowest=True)

# Reverse the Recency scores so that higher values indicate more recent purchases
rfm['R_Score'] = 5 - rfm['R_Score'].astype(int) + 1

# Calculate Frequency and Monetary scores based on custom bins


rfm['F_Score'] = pd.cut(rfm['Frequency'], bins=frequency_bins, labels=range(1, 6), include_lowest=True).astype(int)
rfm['M_Score'] = pd.cut(rfm['MonetaryValue'], bins=monetary_bins, labels=range(1, 6), include_lowest=True).astype(int)

Notice that the R_Score, based on the bins, is 1 for recent purchases and 5 for purchases made over 250 days ago. But we'd like the most recent purchases to have an R_Score of 5 and purchases made over 250 days ago to have an R_Score of 1.

To achieve the desired mapping, we do: 5 - rfm['R_Score'].astype(int) + 1.
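For example, a raw score of 1 (the most recent bin) becomes 5 - 1 + 1 = 5, and a raw score of 5 (the oldest bin) becomes 5 - 5 + 1 = 1, which is exactly the reversal we want.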

Let’s look at the first few rows of the R_Score, F_Score, and M_Score
columns:
# Print the first few rows of the RFM DataFrame to verify the scores
print(rfm[['R_Score', 'F_Score', 'M_Score']].head(10))
If you’d like, you can use these R, F, and M scores to carry out an in-depth
analysis. Or use clustering to identify segments with similar RFM
characteristics. We’ll choose the latter!

Step 6 – Perform K-Means Clustering

K-Means clustering is sensitive to the scale of features. Because the R, F, and M scores are all on the same 1-5 scale, we can proceed to perform clustering without further scaling the features.
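If you were clustering on the raw Recency, Frequency, and MonetaryValue columns instead, you would want to standardize them first, since they live on very different scales. A minimal sketch using scikit-learn's StandardScaler (not needed for the 1-5 scores we use below) might look like this:

from sklearn.preprocessing import StandardScaler

# Only needed when clustering on the raw RFM values rather than the 1-5 scores
scaler = StandardScaler()
X_raw_scaled = scaler.fit_transform(rfm[['Recency', 'Frequency', 'MonetaryValue']])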

Let’s extract the R, F, and M scores to perform K-Means clustering:


# Extract RFM scores for K-means clustering
X = rfm[['R_Score', 'F_Score', 'M_Score']]
Next, we need to find the optimal number of clusters. For this let’s run the K-
Means algorithm for a range of K values and use the elbow method to pick the
optimal K:
# Calculate inertia (sum of squared distances) for different values of k
inertia = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

# Plot the elbow curve


plt.figure(figsize=(8, 6),dpi=150)
plt.plot(range(2, 11), inertia, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Curve for K-means Clustering')
plt.grid(True)
plt.show()

We see that the curve elbows out at 4 clusters. So let’s divide the customer
base into four segments.
We’ve fixed K to 4. So let’s run the K-Means algorithm to get the cluster
assignments for all points in the dataset:
# Perform K-means clustering with best K
best_kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
rfm['Cluster'] = best_kmeans.fit_predict(X)

Step 7 – Interpret the Clusters to Identify Customer Segments

Now that we have the clusters, let’s try to characterize them based on the
RFM scores.
# Group by cluster and calculate mean values
cluster_summary = rfm.groupby('Cluster').agg({
    'R_Score': 'mean',
    'F_Score': 'mean',
    'M_Score': 'mean'
}).reset_index()
The average R, F, and M scores for each cluster should already give you an
idea of the characteristics.
print(cluster_summary)

But let’s visualize the average R, F, and M scores for the clusters so it’s easy
to interpret:
colors = ['#3498db', '#2ecc71', '#f39c12','#C9B1BD']

# Plot the average RFM scores for each cluster


plt.figure(figsize=(10, 8),dpi=150)

# Plot Avg Recency


plt.subplot(3, 1, 1)
bars = plt.bar(cluster_summary.index, cluster_summary['R_Score'], color=colors)
plt.xlabel('Cluster')
plt.ylabel('Avg Recency')
plt.title('Average Recency for Each Cluster')

plt.grid(True, linestyle='--', alpha=0.5)


plt.legend(bars, cluster_summary.index, title='Clusters')

# Plot Avg Frequency


plt.subplot(3, 1, 2)
bars = plt.bar(cluster_summary.index, cluster_summary['F_Score'], color=colors)
plt.xlabel('Cluster')
plt.ylabel('Avg Frequency')
plt.title('Average Frequency for Each Cluster')
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend(bars, cluster_summary.index, title='Clusters')

# Plot Avg Monetary


plt.subplot(3, 1, 3)
bars = plt.bar(cluster_summary.index, cluster_summary['M_Score'], color=colors)
plt.xlabel('Cluster')
plt.ylabel('Avg Monetary')
plt.title('Average Monetary Value for Each Cluster')
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend(bars, cluster_summary.index, title='Clusters')

plt.tight_layout()
plt.show()

Notice how the customers in each of the segments can be characterized based on the recency, frequency, and monetary values:
 Cluster 0: Of all four clusters, this cluster has the highest recency, frequency, and monetary scores, meaning these customers bought most recently, buy most often, and spend the most. Let's call the customers in this cluster champions (or power shoppers).
 Cluster 1: This cluster is characterized by moderate recency, frequency,
and monetary values. These customers still spend more and purchase
more frequently than clusters 2 and 3. Let’s call them loyal customers.
 Cluster 2: Customers in this cluster tend to spend less. They don’t buy
often, and haven’t made a purchase recently either. These are
likely inactive or at-risk customers.
 Cluster 3: This cluster is characterized by high recency and relatively lower
frequency and moderate monetary values. So these are recent
customers who can potentially become long-term customers.
Here are some examples of how you can tailor marketing efforts—to target
customers in each segment—to enhance customer engagement and
retention:

 For Champions/Power Shoppers: Offer personalized special discounts, early access, and other premium perks to make them feel valued and appreciated.
 For Loyal Customers: Appreciation campaigns, referral bonuses, and
rewards for loyalty.
 For At-Risk Customers: Re-engagement efforts that include running
discounts or promotions to encourage buying.
 For Recent Customers: Targeted campaigns educating them about the
brand and discounts on subsequent purchases.
It’s also helpful to understand what percentage of customers are in the
different segments. This will further help streamline marketing efforts and
grow your business.

Let’s visualize the distribution of the different clusters using a pie chart:
cluster_counts = rfm['Cluster'].value_counts().sort_index()  # sort by cluster number so the labels below line up

colors = ['#3498db', '#2ecc71', '#f39c12','#C9B1BD']


# Calculate the total number of customers
total_customers = cluster_counts.sum()

# Calculate the percentage of customers in each cluster


percentage_customers = (cluster_counts / total_customers) * 100

labels = ['Champions (Power Shoppers)', 'Loyal Customers', 'At-risk Customers', 'Recent Customers']

# Create a pie chart


plt.figure(figsize=(8, 8),dpi=200)
plt.pie(percentage_customers, labels=labels, autopct='%1.1f%%', startangle=90, colors=colors)
plt.title('Percentage of Customers in Each Cluster')
plt.legend(cluster_summary['Cluster'], title='Cluster', loc='upper left')

plt.show()
Here we go! For this example, we have quite an even distribution of
customers across segments. So we can invest time and effort in retaining
existing customers, re-engaging with at-risk customers, and educating recent
customers.

Wrapping Up

And that's a wrap! We went from hundreds of thousands of raw transaction records to 4 customer segments in 7 easy steps. I hope you understand how customer segmentation allows you to make data-driven decisions that influence business growth and customer satisfaction by allowing for:
 Personalization: Segmentation allows businesses to tailor their marketing
messages, product recommendations, and promotions to each customer
group's specific needs and interests.
 Improved Targeting: By identifying high-value and at-risk customers,
businesses can allocate resources more efficiently, focusing efforts where
they are most likely to yield results.
 Customer Retention: Segmentation helps businesses create retention
strategies by understanding what keeps customers engaged and satisfied.
As a next step, try applying this approach to another dataset, document your journey, and share it with the community! But remember, effective customer segmentation and targeted campaigns require a good understanding of your customer base and how it evolves, so periodic analysis is needed to refine your strategies over time.

Dataset Credits

The Online Retail dataset used in this tutorial is available on the UCI Machine Learning Repository.

Bala Priya C is a developer and technical writer from India. She likes working
at the intersection of math, programming, data science, and content creation.
Her areas of interest and expertise include DevOps, data science, and natural
language processing. She enjoys reading, writing, coding, and coffee!
Currently, she's working on learning and sharing her knowledge with the
developer community by authoring tutorials, how-to guides, opinion pieces,
and more. Bala also creates engaging resource overviews and coding
tutorials.

Customer segmentation with Python


Introduction

Customer segmentation is important for businesses to understand their target audience. Different
advertisements can be curated and sent to different audience segments based on their
demographic profile, interests, and affluence level.

There are many unsupervised machine learning algorithms that can help companies identify their
user base and create consumer segments.

In this article, we will be looking at a popular unsupervised learning technique called K-Means
clustering.

This algorithm can take in unlabelled customer data and assign each data point to clusters.

The goal of K-Means is to group all the data available into non-overlapping sub-groups that are
distinct from each other.

That means each sub-group/cluster will consist of features that distinguish them from other
clusters.
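To make the mechanics concrete, here is a minimal from-scratch sketch of the K-Means loop in NumPy. It is illustrative only; in the rest of this article we rely on scikit-learn's KMeans, which adds smarter initialization (k-means++) and convergence checks:

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # 1. Initialize centroids by picking k distinct random points from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Move each centroid to the mean of the points assigned to it
        #    (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids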

K-Means clustering is a commonly used technique by data scientists to help companies with
customer segmentation. It is an important skill to have, and most data science interviews will test
your understanding of this algorithm/your ability to apply it to real life scenarios.

In this article, you will learn the following:


 Data pre-processing for K-Means clustering
 Building a K-Means clustering algorithm from scratch
 The metrics used to evaluate the performance of a clustering model
 Visualizing clusters built
 Interpretation and analysis of clusters built

Pre-requisites

You can download the dataset for this tutorial here.

Make sure to have the following libraries installed before getting started: pandas, numpy,
matplotlib, seaborn, scikit-learn, kneed.

Once you're done, we can start building the model!

Imports and reading the data frame

Run the following lines of code to import the necessary libraries and read the dataset:
# Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from kneed import KneeLocator
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D

# reading the data frame


df = pd.read_csv('Mall_Customers.csv')

Now, let's take a look at the head of the data frame:

df.head()

There are five variables in the dataset. CustomerID is the unique identifier of each customer in
the dataset, and we can drop this variable. It doesn't provide us with any useful cluster
information.

Since gender is a categorical variable, it needs to be encoded and converted into a numeric one.

All other variables will be scaled to follow a normal distribution before being fed into the model.
We will standardize these variables with a mean of 0 and a standard deviation of 1.

Standardizing variables

First, let's standardize all variables in the dataset to get them around the same scale.

You can learn more about standardization here.

col_names = ['Annual Income (k$)', 'Age', 'Spending Score (1-100)']


features = df[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)
scaled_features = pd.DataFrame(features, columns = col_names)
scaled_features.head()

Now, let's take a look at the head of the data frame:

We can see that all the variables have been transformed, and are now centered around zero.

One hot encoding

The variable 'gender' is categorical, and we need to transform this into a numeric variable.

This means that we need to substitute numbers for each category. We can do this with Pandas
using pd.get_dummies().

gender = df['Gender']
newdf = scaled_features.join(gender)

newdf = pd.get_dummies(newdf, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)

newdf = newdf.drop(['Gender_Male'],axis=1)

newdf.head()

Let's take a look at the head of the data frame again:


We can see that the gender variable has been transformed. You might have noticed that we
dropped 'Gender_Male' from the data frame. This is because there is no need to keep the
variable anymore.

The values for 'Gender_Male' can be inferred from 'Gender_Female,' (that is, if
'Gender_Female' is 0, then 'Gender_Male' will be 1 and vice versa).

To learn more about one-hot encoding on categorical variables, you can watch this YouTube
video.

Building the clustering model


SSE = []

for cluster in range(1, 10):
    # note: older scikit-learn versions accepted an n_jobs argument here; it has since been removed
    kmeans = KMeans(n_clusters=cluster, init='k-means++', n_init=10)
    kmeans.fit(newdf)
    SSE.append(kmeans.inertia_)

# converting the results into a dataframe and plotting them

frame = pd.DataFrame({'Cluster':range(1,10), 'SSE':SSE})


plt.figure(figsize=(12,6))
plt.plot(frame['Cluster'], frame['SSE'], marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
Visualizing the model's performance:

We can see that the optimal number of clusters is 4.

Now, let's take a look at another clustering metric.

Silhouette coefficient

A silhouette coefficient, or a silhouette score is a metric used to evaluate the quality of clusters
created by the algorithm.

Silhouette scores range from -1 to +1. The higher the silhouette score, the better the model.
The silhouette score measures the distance between all the data points within the same cluster.
The lower this distance, the better the silhouette score.

It also measures the distance between an object and the data points in the nearest cluster. The
higher this distance, the better.

A silhouette score closer to +1 indicates good clustering performance, and a silhouette score
closer to -1 indicates a poor clustering model.
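Concretely, for a single point i, the silhouette value is usually written as s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance from i to the points in the nearest neighbouring cluster. The model's silhouette score is simply the average of s(i) over all points.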

Let's calculate the silhouette score of the model we just built:

# First, build a model with 4 clusters

kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10)


kmeans.fit(newdf)

# Now, print the silhouette score of this model

print(silhouette_score(newdf, kmeans.labels_, metric='euclidean'))

The silhouette score of this model is about 0.35.

This isn't a bad model, but we can do better and try getting higher cluster separation.

Before we try doing that, let's visualize the clusters we just built to get an idea of how well the model is doing:

# Predict clusters on the encoded, scaled data frame (newdf), which is what the model was trained on
clusters = kmeans.fit_predict(newdf)
newdf["label"] = clusters

fig = plt.figure(figsize=(21,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(newdf.Age[newdf.label == 0], newdf["Annual Income (k$)"][newdf.label == 0], newdf["Spending Score (1-100)"][newdf.label == 0], c='blue', s=60)
ax.scatter(newdf.Age[newdf.label == 1], newdf["Annual Income (k$)"][newdf.label == 1], newdf["Spending Score (1-100)"][newdf.label == 1], c='red', s=60)
ax.scatter(newdf.Age[newdf.label == 2], newdf["Annual Income (k$)"][newdf.label == 2], newdf["Spending Score (1-100)"][newdf.label == 2], c='green', s=60)
ax.scatter(newdf.Age[newdf.label == 3], newdf["Annual Income (k$)"][newdf.label == 3], newdf["Spending Score (1-100)"][newdf.label == 3], c='orange', s=60)

ax.view_init(30, 185)
plt.show()

The output of the above code is as follows:

From the above diagram, we can see that cluster separation isn't too great.
The red points are mixed with the blue, and the green points overlap the orange.

This, along with the silhouette score shows us that the model isn't performing too well.

Now, let's create a new model that has better cluster separability than this one.

Building clustering model #2

For this model, let's do some feature selection.

We can use a technique called Principal Component Analysis (PCA).

PCA is a technique that helps us reduce the dimension of a dataset. When we run PCA on a data
frame, new components are created. These components explain the maximum variance in the
model.

We can select a subset of these variables and include them into the K-means model.

Now, let's run PCA on the dataset:

pca = PCA(n_components=4)
principalComponents = pca.fit_transform(newdf)

features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_ratio_, color='black')
plt.xlabel('PCA features')
plt.ylabel('variance %')
plt.xticks(features)

PCA_components = pd.DataFrame(principalComponents)

The above code will render the following chart:


This chart shows us each PCA component, along with its explained variance.

Based on this visualization, we can see that the first two PCA components explain around 70%
of the dataset variance.

We can feed these two components into the model.

Let's build the model again with the first two principal components, and decide on the number of clusters to use:

ks = range(1, 10)
inertias = []

for k in ks:
    model = KMeans(n_clusters=k)
    model.fit(PCA_components.iloc[:,:2])
    inertias.append(model.inertia_)

plt.plot(ks, inertias, '-o', color='black')


plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

The code above will render the following chart:

Again, it looks like the optimal number of clusters is 4.

We can calculate the silhouette score for this model with 4 clusters:

model = KMeans(n_clusters=4)
model.fit(PCA_components.iloc[:,:2])

# silhouette score
print(silhouette_score(PCA_components.iloc[:,:2], model.labels_, metric='euclidean'))

The silhouette score of this model is 0.42, which is better than the previous model we created.

We can visualize the clusters for this model just like we did earlier:
model = KMeans(n_clusters=4)

clusters = model.fit_predict(PCA_components.iloc[:,:2])
newdf["label"] = clusters

fig = plt.figure(figsize=(21,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(newdf.Age[newdf.label == 0], newdf["Annual Income (k$)"][newdf.label == 0], newdf["Spending Score (1-100)"][newdf.label == 0], c='blue', s=60)
ax.scatter(newdf.Age[newdf.label == 1], newdf["Annual Income (k$)"][newdf.label == 1], newdf["Spending Score (1-100)"][newdf.label == 1], c='red', s=60)
ax.scatter(newdf.Age[newdf.label == 2], newdf["Annual Income (k$)"][newdf.label == 2], newdf["Spending Score (1-100)"][newdf.label == 2], c='green', s=60)
ax.scatter(newdf.Age[newdf.label == 3], newdf["Annual Income (k$)"][newdf.label == 3], newdf["Spending Score (1-100)"][newdf.label == 3], c='orange', s=60)

ax.view_init(30, 185)
plt.show()

Model 1 vs Model 2

Let's compare the cluster separability of this model to that of the first model:
Model 1 (left) vs Model 2 (right)

Notice that the clusters in the second model are much better separated than that in the first
model.

Furthermore, the silhouette score of the second model is a lot higher.

For these reasons, we can pick the second model to go forward with our analysis.

Cluster Analysis

Now that we're done building these different clusters, let's try to interpret them and look at the different customer segments.
First, let's map the clusters back to the dataset and take a look at the head of the data frame.

df = pd.read_csv('Mall_Customers.csv')
df = df.drop(['CustomerID'],axis=1)

# map back clusters to dataframe

pred = model.predict(PCA_components.iloc[:,:2])
frame = pd.DataFrame(df)
frame['cluster'] = pred
frame.head()

Notice that each row in the data frame is now assigned to a cluster.

To compare the attributes of the different clusters, let's find the average of all numeric variables across each cluster:

# Use the frame dataframe, which carries the cluster assignments
avg_df = frame.groupby(['cluster'], as_index=False).mean(numeric_only=True)

avg_df
We can interpret these clusters more easily if we visualize them. Run these lines of code to come up with different visualizations of each variable:

sns.barplot(x='cluster',y='Age',data=avg_df)
sns.barplot(x='cluster',y='Spending Score (1-100)',data=avg_df)
sns.barplot(x='cluster',y='Annual Income (k$)',data=avg_df)

Spending Score vs Annual Income vs Age


Gender Breakdown
df2 = pd.DataFrame(frame.groupby(['cluster','Gender'])['Gender'].count())
df2.head()
Main attributes of each segment

Cluster 0:
 High average annual income, low spending.
 Mean age is around 40 and gender is predominantly male.

Cluster 1:

 Low to mid average income, average spending capacity.


 Mean age is around 50 and gender is predominantly female.

Cluster 2:

 Low average income, high spending score.


 Mean age is around 25 and gender is predominantly female.

Cluster 3:

 High average income, high spending score.


 Mean age is around 30 and gender is predominantly female.
It is important to note that calculating the median age would provide better insight on the
distribution of age within each cluster.

Also, females are more highly represented in the entire dataset, which is why most clusters
contain a larger number of females than males. We can find the percentage of each gender
relative to the numbers in the entire dataset to give us a better idea of gender distribution.
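As a sketch, both ideas can be computed from the frame dataframe that already carries the cluster assignments (hypothetical follow-up code, not part of the walkthrough above):

# Median age per cluster (more robust to outliers than the mean)
print(frame.groupby('cluster')['Age'].median())

# Percentage of each gender that falls into each cluster (each gender column sums to 100%)
print(pd.crosstab(frame['cluster'], frame['Gender'], normalize='columns') * 100)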
Building personas around each cluster
Photo by h heyerlein on Unsplash

Now that we know the attributes of each cluster, we can build personas around them.

Being able to tell a story around your analysis is an important skill to have as a data scientist.

This will help your clients or stakeholders understand your findings more easily.

Here is an example of building consumer personas based on the clusters created:

Cluster 0: The frugal spender

This persona consists of middle-aged individuals who are very careful with money.

Despite having the highest average income compared to individuals in all other clusters, they
spend the least.

This might be because they have financial responsibilities - like saving up for their kid's higher
education.

Recommendation: Promos, coupons, and discount codes will attract individuals in this segment
due to their tendency to spend less.

Cluster 1: Almost retired

This segment consists of an older group of people.

They earn less and spend less, and are probably saving up for retirement.

Recommendation: Marketing to these individuals can be done through Facebook, which appeals
to an older demographic. Promote healthcare related products to people in this segment.

Cluster 2: The careless buyer

This segment is made up of a younger age group.


Individuals in this segment are most likely first jobbers. They make the least amount of money
compared to all other segments.

However, they are very high spenders.

These are enthusiastic young individuals who enjoy living a good lifestyle, and tend to spend
above their means.

Recommendation: Since these are young individuals who spend a lot, providing them with travel
coupons or hotel discounts might be a good idea. Providing them with discounts off top clothing
and makeup brands would also work well for this segment.

Cluster 3: Highly affluent individuals

This segment is made up of middle-aged individuals.

These are individuals who have worked hard to build up a significant amount of wealth.

They also spend large amounts of money to live a good lifestyle.

These individuals have likely just started a family, and are leading baby or family-focused
lifestyles. It is a good idea to promote baby or child related products to these individuals.

Recommendation: Due to their large spending capacity and their demographic, these individuals
are likely to be looking for properties to buy or invest in. They are also more likely than all other
segments to take out housing loans and make serious financial commitments.

Conclusion

We have successfully built a K-Means clustering model for customer segmentation. We also
explored cluster interpretation, and analyzed the behaviour of individuals in each cluster.

Finally, we took a look at some business recommendations that could be provided based on the
attributes of each individual in the cluster.
You can use the analysis above as starter code for any clustering or segmentation project in the
future.

Step by Step Customer Segmentation using K-Means in Python
Segment your customers for better marketing.

Table of Contents

1. Introduction — What is Customer Segmentation?


2. Business Scenario

3. Explore the Dataset

4. Data Preprocessing

5. K-Means for Segmentation

6. PCA with K-Means for better visualization

7. Conclusion

Introduction

Let's say you decided to buy a t-shirt from a brand online. Have you ever wondered who else bought the same t-shirt?

People who are similar to you, right? Same age, same hobbies, same gender, etc.

In marketing, companies basically try to find your t-shirt on other people!

But wait, how? Of course, with data!

Customer segmentation is that simple!


We try to find and group customers based on common characteristics such as age, gender, living area, and spending behavior, so that we can market to them effectively.

Let’s dive into our segmentation project!

Business Scenario

Suppose we are working as a data scientist for an FMCG company and want to segment our customers to help the marketing department launch new products and sales based on the segmentation. This way, we save time and money by marketing selected products to specific groups of customers.

How did we collect the data, by the way?

All data has been collected through the loyalty cards they use at
checkout :)

We will utilize the K-Means and PCA algorithms for this project and see how we define our new customer groups!

Understanding Data is Important!

Before starting any project, we need to understand the business problem and the dataset first.
Let's see the variables (features) in the dataset.

Variable Description

ID: Shows a unique identification of a customer.

Sex: Biological sex (gender) of a customer. In this dataset, there are only 2 different options.

0: male

1: female

Marital status: Marital status of a customer.

0: single

1: non-single (divorced / separated / married / widowed)

Age: The age of the customer in years, calculated as the current year minus the customer's year of birth at the time the dataset was created.

18 Min value (the lowest age observed in the dataset)

76 Max value (the highest age observed in the dataset)


Education: Level of education of the customer.

0:other / unknown

1: high school

2: university

3: graduate school

Income: Self-reported annual income in US dollars of the customer.

35832 Min value (the lowest income observed in the dataset)

309364 Max value (the highest income observed in the dataset)

Occupation: Category of occupation of the customer.

0: unemployed/unskilled

1: skilled employee / official

2: management / self-employed / highly qualified employee / officer

Settlement size: The size of the city that the customer lives in.
0: small city

1: mid-sized city

2: big city

We have the dataset and know the business problem. Now, let's start coding!

Importing Libraries

In this project, we will need some friends that help you along the way!

Let me introduce them below,


### Data Analysis and Manipulation
import pandas as pd
import numpy as np

### Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()  ## this is for styling

### Data Standardization and Modeling with K-Means and PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

3. Explore the Dataset


df= pd.read_csv('segmentation data.csv', index_col = 0)

This part consists of understanding the data with the help of descriptive analysis and visualization.
df.head()
df.head() output

We can also apply the describe method to see descriptive statistics about the columns.
df.describe()

We see that the means of Age and Income are 35.90 and 120,954, respectively. The describe method is very useful for numerical columns.
df.info()
df.info() method returns information about the DataFrame including
the index data type and columns, non-null values, and memory usage.

We see that there is no missing value in the dataset and all the
variables are integer.

A good way to get an initial understanding of the relationships between the different variables is to explore how they correlate.

We calculate the correlation between our variables using the corr method from the pandas library.
plt.figure(figsize=(12,9))
sns.heatmap(df.corr(),annot=True,cmap='RdBu')
plt.title('Correlation Heatmap',fontsize=14)
plt.yticks(rotation =0)
plt.show()
Let’s explore the correlation.

We see that there is a strong correlation between Education and Age.


In other words, older people tend to be more highly educated.

How about income and occupation?

Their correlation is 0.68. That means if you have a higher salary, you are more likely to have a higher-level occupation, such as a manager.

The correlation matrix is a very useful tool to analyze the relationships between features.
Now, we understand our dataset and have a general idea of it.

The next section will be the segmentation. But before that, we need to scale our data first.

4. Data Preprocessing

We need to apply standardization to our features before using any distance-based machine learning model such as K-Means or KNN.

In general, we want to treat all the features equally, and we can achieve that by transforming them so that they are on a comparable scale, with a mean of 0 and a standard deviation of 1. This process is commonly referred to as standardization.
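In formula form, standardization replaces each value x of a feature with z = (x - mean) / standard deviation, computed per feature, so every standardized feature ends up with a mean of 0 and a standard deviation of 1.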

Standardization

Now that we've cleared that up, it's time to perform standardization in Python.
scaler = StandardScaler()
df_std = scaler.fit_transform(df)

Now, we are all set to start building our K-Means segmentation model!

df_std = pd.DataFrame(data = df_std, columns = df.columns)

Building Our Segmentation Model


Before applying the K-Means algorithm we need to choose how many
clusters we would like to have.

But How?

There are two tools that help here: the Within-Cluster Sum of Squares (WCSS) and the elbow method.
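WCSS, which scikit-learn exposes as the inertia_ attribute, is the sum of squared distances between each point and the centroid of its assigned cluster: WCSS = sum over clusters k of the sum over points x in cluster k of ||x - centroid_k||^2. It always decreases as we add clusters, so the elbow method looks for the number of clusters after which the decrease flattens out.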
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(df_std)
    wcss.append(kmeans.inertia_)

We stored each within-cluster sum of squares value in the wcss list.

Let’s visualize them.


plt.figure(figsize = (10,8))
plt.plot(range(1, 11), wcss, marker = 'o', linestyle = '-.', color='red')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('K-means Clustering')
plt.show()

The elbow in the graph is at the four-cluster mark. This is the point up to which the curve declines steeply and after which it flattens out.

Let’s perform K-Means clustering with 4 clusters.


kmeans = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)

Fitting Our Model to the Dataset

kmeans.fit(df_std)

# We create a new data frame with the original features and add a new column with the assigned cluster for each point.
df_segm_kmeans = df_std.copy()
df_segm_kmeans['Segment K-means'] = kmeans.labels_

We now see the segments with our dataset.

Let's group the customers by cluster and see the average values for each variable.
df_segm_analysis = df_segm_kmeans.groupby(['Segment K-means']).mean()
df_segm_analysis

It’s time to interpret our new dataset,


Let’s start with the first segment,

It has almost the same number of men and women with an average age
of 56. Compared to other clusters, we realize that this is the oldest
segment.

For the second segment, we can say,

This segment has the lowest values for the annual salary.

They live almost exclusively in small cities

With low income living in small cities, it seems that this is a segment of
people with fewer opportunities.

Let’s carry on with the third segment,

This is the youngest segment, with an average age of 29. They have a medium level of education and an average income.

They also seem average on every parameter, so we can label this segment average or standard.

Finally, we come to the fourth segment,


It is composed almost entirely of men, less than 20 percent of whom are in relationships.

Looking at the numbers, we observe relatively low values for education, paired with high values for income and occupation.

The majority of this segment lives in big or middle-sized cities.

Let's label the segments according to their characteristics.


df_segm_analysis = df_segm_analysis.rename({0:'well-off',
                                            1:'fewer-opportunities',
                                            2:'standard',
                                            3:'career focused'})

Finally, we can create our plot to visualize each segment.


x_axis = df_segm_kmeans['Age']
y_axis = df_segm_kmeans['Income']
plt.figure(figsize = (10, 8))
sns.scatterplot(x = x_axis, y = y_axis, hue = df_segm_kmeans['Segment K-means'], palette = ['g', 'r', 'c', 'm'])
plt.title('Segmentation K-means')
plt.show()
We can see that the green segment (well-off) is clearly separated, as it is highest in both age and income. But the other three are grouped together.

We can conclude that K-Means did a decent job! However, it’s hard to
separate segments from each other.
In the next section, we will combine PCA and K-Means to try to get a
better result.

PCA with K-Means for Better Visualization

What we will do here is apply dimensionality reduction to simplify our problem.

We will choose a reasonable number of components in order to obtain a better clustering solution than with standard K-Means, and we aim to see a nice and clear plot of our segmented groups.
pca = PCA()
pca.fit(df_std)

Now, let's see the explained variance ratio of each component.


pca.explained_variance_ratio_

We observe that the first component explains around 36% of the variability of the data, the second one around 26%, and so on.

We now can plot the cumulative sum of explained variance.


plt.figure(figsize = (12,9))
plt.plot(range(1,8), pca.explained_variance_ratio_.cumsum(), marker = 'o', linestyle = '--')
plt.title('Explained Variance by Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
Well, how do we choose the right number of components? There is no single right answer.

But a rule of thumb is to keep at least 70 to 80 percent of the explained variance.

80% of the variance of the data is explained by the first 3 components.


Let’s keep the first 3 components for our further analysis.
pca = PCA(n_components = 3)
pca.fit(df_std)
pca.components_

The result is a 3 by 7 array: we reduced our seven original features to three components. The values in the array are the so-called loadings.

Hey, just a minute, what is a loading then?

Loadings are correlations between an original variable and a component.

For instance, the first value of the array shows the loading of the first
feature on the first component.

Let’s put this information in a pandas data frame so that we can see
them nicely. Columns are seven original features and rows are three
components that PCA gave us.
df_pca_comp = pd.DataFrame(data = pca.components_,
                           columns = df.columns,
                           index = ['Component 1', 'Component 2', 'Component 3'])
df_pca_comp
plt.figure(figsize=(12,9))
sns.heatmap(df_pca_comp,
vmin = -1,
vmax = 1,
cmap = 'RdBu',
annot = True)
plt.yticks([0, 1, 2],
['Component 1', 'Component 2', 'Component 3'],
rotation = 45,
fontsize = 12)
plt.title('Components vs Original Features',fontsize = 14)
plt.show()
We see that there is a positive correlation between Component 1 and Age, Income, Occupation, and Settlement size. These are strictly related to the career of a person, so this component shows the career focus of the individual.

For the second component Sex, Marital status and Education are by
far the most prominent determinants.
For the final component, we realize that Age, Marital Status, and
Occupation are the most important features. We observed that marital
status and occupation load negatively but are still important.

Now we have an idea about our new variables (components). We can clearly see the relationship between the components and the original variables.

Let's transform our data and save it as scores_pca.

scores_pca = pca.transform(df_std)

K-means clustering with PCA

Our new dataset is ready! It’s time to apply K-Means to our brand new
dataset with 3 components.

It is as simple as before! We follow the same steps as with standard K-Means.
wcss = []
for i in range(1,11):
    kmeans_pca = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans_pca.fit(scores_pca)
    wcss.append(kmeans_pca.inertia_)
We see that the optimal number of clusters by the within-cluster sum of squares is again 4.
kmeans_pca = KMeans(n_clusters = 4, init = 'k-means++', random_state = 42)
kmeans_pca.fit(scores_pca)

The K-Means algorithm has learned from our new components and created 4 clusters. Let's look at the original dataset together with the new components and labels.
df_segm_pca_kmeans = pd.concat([df.reset_index(drop = True), pd.DataFrame(scores_pca)], axis = 1)
df_segm_pca_kmeans.columns.values[-3:] = ['Component 1', 'Component 2', 'Component 3']
df_segm_pca_kmeans['Segment K-means PCA'] = kmeans_pca.labels_
df_segm_pca_kmeans.head()

# We calculate the means by segments.


df_segm_pca_kmeans_freq = df_segm_pca_kmeans.groupby(['Segment K-means PCA']).mean()
df_segm_pca_kmeans_freq

Above we see our data grouped by K-Means segment. We can also convert the segment numbers to labels and look at the number of observations and the proportion of each segment relative to the total.
df_segm_pca_kmeans_freq['N Obs'] = df_segm_pca_kmeans[['Segment K-means PCA','Sex']].groupby(['Segment K-means PCA']).count()
df_segm_pca_kmeans_freq['Prop Obs'] = df_segm_pca_kmeans_freq['N Obs'] / df_segm_pca_kmeans_freq['N Obs'].sum()
df_segm_pca_kmeans_freq = df_segm_pca_kmeans_freq.rename({0:'standard',
                                                          1:'career focused',
                                                          2:'fewer opportunities',
                                                          3:'well-off'})
df_segm_pca_kmeans_freq
We added the new columns and renamed the segments with a few lines of code. Now, let's plot our new segments and see the differences.

As you can probably recall, our four clusters are standard, career focused, fewer opportunities, and well-off.
df_segm_pca_kmeans['Legend'] = df_segm_pca_kmeans['Segment K-means PCA'].map({0:'standard',
                                                                              1:'career focused',
                                                                              2:'fewer opportunities',
                                                                              3:'well-off'})
x_axis = df_segm_pca_kmeans['Component 2']
y_axis = df_segm_pca_kmeans['Component 1']
plt.figure(figsize = (10, 8))
sns.scatterplot(x = x_axis, y = y_axis, hue = df_segm_pca_kmeans['Legend'], palette = ['g', 'r', 'c', 'm'])
plt.title('Clusters by PCA Components')
plt.show()
When we plotted the K means clustering solution without PCA, we
were only able to distinguish the green segment, but the division based
on the components is much more pronounced.

That was one of the biggest goals of PCA to reduce the number of
variables by combining them into bigger ones.

“Don’t find customers for your products, find products for your
customers.”
— Seth Godin

Conclusion

We segmented our customers into 4 groups. We are ready to choose our target groups based on our aims and start marketing to them!

Segmentation helps marketers to be more efficient in terms of time, money, and other resources.

They gain a better understanding of customers' needs and wants, and can therefore tailor campaigns to the customer segments most likely to purchase their products.

If you want to see the entire code in Jupyter notebook, it can be found
on my Github.
