Clustering - Case Study 4


Dataset: Attribute Information

1. InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c',
it indicates a cancellation.

2. StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

3. Description: Product (item) name. Nominal.

4. Quantity: The quantities of each product (item) per transaction. Numeric.

5. InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.

6. UnitPrice: Unit price. Numeric, Product price per unit in sterling.

7. CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

8. Country: Country name. Nominal, the name of the country where each customer resides.
Understand Data & Case
This is behavioral data: each row records what a customer actually bought. How might customers be segmented in this situation? This
depends on the business objective, of course, but a common goal is to identify high- and low-value customers for marketing purposes.

Step 1 :

Flow: calculate and work with three metrics for each customer: how recently they last purchased, how frequently they purchase, and the
monetary value of their purchases. These three variables, collectively known as RFM, are often used in customer segmentation for marketing
purposes. RFM stands for Recency, Frequency, and Monetary value, each corresponding to a key customer trait.
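
A minimal sketch of how the three RFM metrics could be computed per customer, written in plain Python rather than the R tooling the slides appear to use. The sample rows, the `compute_rfm` helper, the `as_of` reference date, and the choice to skip cancelled invoices are all assumptions made for illustration; the field names follow the attribute list above.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (InvoiceNo, InvoiceDate, CustomerID, Quantity, UnitPrice)
rows = [
    ("536365", "2010-12-01 08:26", "17850", 6, 2.55),
    ("536366", "2010-12-01 08:28", "17850", 2, 1.85),
    ("536367", "2010-12-05 10:00", "13047", 10, 1.00),
    ("536400", "2011-01-10 09:00", "17850", 4, 2.50),
    ("C536401", "2011-01-11 09:00", "13047", -2, 1.00),  # cancellation
]

def compute_rfm(rows, as_of):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    last_purchase = {}
    invoices = defaultdict(set)
    monetary = defaultdict(float)
    for invoice_no, invoice_date, customer_id, quantity, unit_price in rows:
        if invoice_no.upper().startswith("C"):
            continue  # assumption: cancelled invoices are simply skipped
        when = datetime.strptime(invoice_date, "%Y-%m-%d %H:%M")
        if customer_id not in last_purchase or when > last_purchase[customer_id]:
            last_purchase[customer_id] = when
        invoices[customer_id].add(invoice_no)      # frequency = distinct invoices
        monetary[customer_id] += quantity * unit_price
    return {
        c: ((as_of - last_purchase[c]).days, len(invoices[c]), round(monetary[c], 2))
        for c in last_purchase
    }

rfm = compute_rfm(rows, as_of=datetime(2011, 2, 1))
for customer, (recency, frequency, value) in sorted(rfm.items()):
    print(customer, recency, frequency, value)
```

Recency here is the number of whole days between the customer's last purchase and the reference date, so a smaller number means a more recent customer.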

Step 2:

Lastly, use the k-means clustering technique, because it can efficiently handle large datasets and iterates quickly to good solutions.
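
To show why k-means iterates quickly, here is a small sketch of the standard assign-and-update loop (Lloyd's algorithm) in plain Python. It is an illustration only, not the kmeans function the slides rely on: the `kmeans` helper, the sample points, and the fixed starting centers are assumptions, and real implementations usually choose the starting centers at random.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and center update until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical, already-standardized 2-D points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels, centers)
```

Each pass costs one distance computation per point per center, which is why the method scales well to large customer tables.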
Bring in your raw data (CSV file)
80/20 Rule

• If you aren’t familiar with the 80/20 Rule (also known as the Pareto Principle), it’s the concept that 80% of the
results generally come from 20% of the causes. In this context, it implies that ~80% of sales would be
produced by the top ~20% of customers. These 20% represent the high-value, important customers a business
would want to protect.

• To make a point about outliers below, let's create some simple segments here by looking at the top customers who produced 80% of
annual sales for the year. In this dataset, 80% of the annual sales are produced by the top 29% of customers, so the percentage
isn't quite 20%, but it's not that far off, and it does illustrate that a small segment produces the bulk
of the value.
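
The 80/20 check described above can be sketched as a cumulative sum over customers sorted by sales. The helper name `share_of_customers_for_sales` and the sample sales figures are made up for illustration; with the real dataset the slides report a result of about 29%.

```python
def share_of_customers_for_sales(sales_by_customer, target=0.80):
    """Fraction of customers, largest first, needed to reach `target` of total sales."""
    amounts = sorted(sales_by_customer.values(), reverse=True)
    total = sum(amounts)
    running = 0.0
    for count, amount in enumerate(amounts, start=1):
        running += amount
        if running >= target * total:
            return count / len(amounts)
    return 1.0

# Hypothetical annual sales per customer.
sales = {
    "A": 500.0, "B": 320.0, "C": 60.0, "D": 50.0, "E": 30.0,
    "F": 20.0, "G": 10.0, "H": 5.0, "I": 3.0, "J": 2.0,
}
share = share_of_customers_for_sales(sales)
print(f"{share:.0%} of customers produce 80% of sales")
```

Customers counted before the threshold is reached form the high-value segment; everyone else falls into the remaining segment.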
EDA Ends Here
K Means – Case Study 4
Processing Data for K Means Clustering (Final move)
k-means clustering requires continuous variables and works best with relatively normally-distributed, standardized input variables.

Standardizing the input variables is quite important; otherwise, input variables with larger variances will have commensurately greater
influence on the results.

Let's transform our three input variables to reduce positive skew and then standardize them as z-scores.

🡺 Positive skewness means the tail on the right side of the distribution is longer or fatter; the mean and median will typically be greater
than the mode. Negative skewness means the tail on the left side of the distribution is longer or fatter than the tail on the right side; the
mean and median will typically be less than the mode.
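
As a rough illustration of the transform-then-standardize step, the sketch below measures skewness, applies a log transform, and converts the result to z-scores. The choice of `log1p`, the use of the population standard deviation, and the sample values are assumptions; the slides do not say which transform or which standard deviation they use.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def log_transform(values):
    # log1p keeps zero values valid; assumes all values are non-negative.
    return [math.log1p(v) for v in values]

def zscores(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical monetary values with one very large customer (long right tail).
monetary = [10.0, 12.0, 15.0, 20.0, 30.0, 50.0, 400.0]
logged = log_transform(monetary)
standardized = zscores(logged)

print("skew before:", round(skewness(monetary), 2))
print("skew after: ", round(skewness(logged), 2))
```

The log transform pulls in the long right tail, and the z-scores put all three RFM variables on the same scale so that none dominates the distance calculation.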
Last preprocessing step: make sure to delete records with NA values.
Otherwise the kmeans function will produce an error.

• If your case study doesn't support the decision to delete records,
talk to your team leader about replacing the missing values with the average.
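
The two options for NA records, deleting them or replacing them with the average, might look like this in plain Python. The helper names `drop_missing` and `impute_mean`, the `monetary` field, and the sample records are hypothetical, and `None` stands in for NA.

```python
from statistics import mean

def drop_missing(records):
    """Keep only records in which no field is None."""
    return [r for r in records if all(v is not None for v in r.values())]

def impute_mean(records, field):
    """Return copies of the records with None in `field` replaced by the observed mean."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = mean(observed)
    return [dict(r, **{field: fill if r[field] is None else r[field]}) for r in records]

# Hypothetical per-customer records; None marks a missing value.
records = [
    {"CustomerID": "A", "monetary": 100.0},
    {"CustomerID": "B", "monetary": None},
    {"CustomerID": "C", "monetary": 300.0},
]

dropped = drop_missing(records)
imputed = impute_mean(records, "monetary")
print(len(dropped), imputed[1]["monetary"])
```

Both helpers return new lists and leave the input untouched, so the two approaches can be compared on the same data before choosing one.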
NbClust Package – Determine no. of Clusters
Extra Code