
3/13/24, 8:10 PM SAP HANA PAL – K-Means Algorithm or How to do Cust...

- SAP Community

SAP HANA PAL – K-Means Algorithm or How to do Customer Segmentation for the
Telecommunications Industry

Former Member


‎03-28-2013 3:11 PM


https://community.sap.com/t5/technology-blogs-by-members/sap-hana-pal-k-means-algorithm-or-how-to-do-customer-segmentation-for-the/ba-p/12976696/page/2



PAL is an optional component of SAP HANA, and its main purpose is to enable modelers to perform predictive analysis
over big volumes of data. If this is the first time you have heard of PAL, I would recommend reading the official documentation.
You can also take a look at my prior post, where I talk about the Apriori algorithm.

In this post I’m going to focus on how to use the K-Means clustering algorithm included in PAL, because it’s one of the most
popular and widely used algorithms in data mining. But before we jump into the code, let’s talk about how the algorithm
works.

According to Wikipedia, “clustering is the task of grouping a set of objects in a way that objects in the same group (called
cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)”. In other words,
you group your data into multiple clusters. The most common use case for a clustering algorithm is customer segmentation,
meaning you use a clustering algorithm to divide your customer database into multiple groups (or clusters) based on how
similar the customers are or how similarly they behave, e.g., by age, gender, interests, spending habits and so on.

The K-Means algorithm works in a very simple way, at least for me, since I don’t have to code it in C++ :-) The first step is
to plot all the database objects into a space where each attribute is a dimension. So if we use a two-attribute data set, the
resulting chart would look something like this:


After all the objects are plotted, the algorithm calculates the distances between them, and the ones that are close to each
other are grouped into the same cluster. So if we go back to our previous example, we can create 4 different clusters:

Each cluster is associated with a centroid and each point is assigned to the cluster with the closest centroid. The centroid is
the mean of the points in the cluster. The closeness can be measured using:

Manhattan Distance
Euclidean Distance (most commonly used)
Minkowski Distance
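
To make the three measures concrete, here is a minimal Python sketch (not PAL code; the function names and the two sample customers are made up for illustration). It relies on the fact that Manhattan and Euclidean distance are the Minkowski distance with order p = 1 and p = 2 respectively.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def manhattan(a, b):
    """Manhattan distance: Minkowski with p = 1."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Euclidean distance: Minkowski with p = 2."""
    return minkowski(a, b, 2)


# Hypothetical two-attribute customers (values invented for this example).
customer_a = (1.0, 2.0)
customer_b = (4.0, 6.0)

print(manhattan(customer_a, customer_b))  # 7.0
print(euclidean(customer_a, customer_b))  # 5.0
```

The same pair of points is 7 units apart by Manhattan distance and 5 by Euclidean distance, so the choice of measure can change which centroid counts as "closest".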

Every time a point is assigned to a cluster, the centroid is recalculated. This is repeated over multiple iterations until the
centroids don’t change anymore (meaning all points have been assigned to a corresponding cluster) or until relatively few
points change clusters. Usually most of the centroid movement happens in the first few iterations.
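
The loop can be sketched in a few lines of Python. This is only an illustration, not how PAL implements it: it uses the common batch variant, which recomputes the centroids after a full pass over all points rather than after every single assignment, it always uses Euclidean distance, and it assumes the starting centroids are passed in by the caller. The names `kmeans`, `closest` and `mean`, and the sample data, are made up for this example.

```python
import math


def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(points, initial_centroids, max_iterations=100):
    """Batch K-Means: assign every point, then recompute every centroid."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iterations):
        labels = [closest(p, centroids) for p in points]
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            # An empty cluster keeps its previous centroid.
            updated.append(mean(members) if members else old)
        if updated == centroids:  # centroids stopped moving
            break
        centroids = updated
    return centroids, labels


# Hypothetical two-attribute data with two obvious groups.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(data, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

On this toy data the centroids settle after the first update, which matches the observation that most of the movement happens early.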

One of the main drawbacks of the K-Means algorithm is that you need to specify the number of clusters (K) upfront as
an input parameter. Knowing this value in advance is usually very hard, which is why it is important to run quality
measurement functions to check the quality of your clustering. We will talk about this later in this post.
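
As a rough illustration of what a quality measure can look like, the Python sketch below computes the within-cluster sum of squared distances for a given assignment of points to clusters. This is one common, generic measure chosen for the example; it is not necessarily the one used later in this post or the one PAL provides, and the function name and data are made up.

```python
def within_cluster_ss(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(axis) / len(members) for axis in zip(*members)]
        for point in members:
            total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total


# Hypothetical data: the same four points grouped two different ways.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_cluster = within_cluster_ss(data, [0, 0, 0, 0])
two_clusters = within_cluster_ss(data, [0, 0, 1, 1])
print(one_cluster, two_clusters)  # 104.0 4.0
```

A lower value means tighter clusters. Because adding clusters can always lower this value, it is more useful to compare several values of K and look for the point where the improvement levels off than to simply pick the minimum.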

I came across a very interesting paper that talks about segmentation in the telecommunication industry, so I thought it
would be a very nice use case to demo the K-Means algorithm in HANA (if you are interested in this topic, I very much
recommend reading this paper). These are the steps I followed:

Prepare the Data

The first step is to create a table that will contain information on customers’ mobile phone usage habits, with the following
structure:

CREATE COLUMN TABLE "TELCO" (
    "ID" INTEGER NOT NULL,              -- Customer ID
    "AVG_CALL_DURATION" DOUBLE,         -- Average Call Duration
    "AVG_NUMBER_CALLS_RCV_DAY" DOUBLE,  -- Average Calls Received per Day
    "AVG_NUMBER_CALLS_ORI_DAY" DOUBLE,  -- Average Calls Originated per Day
    "DAY_TIME_CALLS" DOUBLE,            -- Percentage of Calls made during day time hours (9 a.m. - 6 p.m.)
    "WEEK_DAY_CALLS" DOUBLE,            -- Percentage of Calls made during week days (Monday thru Friday)
    "CALLS_TO_MOBILE" DOUBLE,           -- Percentage of Calls made to mobile phones
    "SMS_RCV_DAY" DOUBLE,               -- Number of SMSs received per day
    "SMS_ORI_DAY" DOUBLE,               -- Number of SMSs sent per day
    PRIMARY KEY ("ID")
);
