
https://www.kaggle.com/ambarish/ml-kaggler-types-using-kmeans-and-pca
Therefore, we scale our data before employing a distance-based algorithm so that all the
features contribute equally to the result.
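
A minimal sketch of that idea with scikit-learn, assuming df is a pandas DataFrame of numeric features (the name df and n_clusters=3 are placeholders, not taken from any of the linked posts):

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Scale every feature to zero mean and unit variance so that no single feature
# dominates the Euclidean distances used by K-means.
X_scaled = StandardScaler().fit_transform(df)

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_scaled)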

https://medium.com/@16611050/k-means-clustering-8476c74ad462 (very important)

https://towardsdatascience.com/segmenting-customers-using-k-means-and-transaction-records-76f4055d856a

https://www.quora.com/Should-you-standardize-binary-categorical-and-indicator-primary-key-variables-before-performing-K-means-clustering

https://github.com/adelweiss/RFM_Kmeans

https://medium.com/@16611050/k-means-clustering-8476c74ad462

https://heartbeat.fritz.ai/understanding-the-mathematics-behind-k-means-clustering-40e1d55e2f4c

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/

https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py

https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html

https://www.guru99.com/r-k-means-clustering.html ( R )

https://www.geeksforgeeks.org/k-means-clustering-introduction/

Kaggle project with K-means

https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a

https://www.slideshare.net/kasunrangawijeweera/kmeans-example

http://people.csail.mit.edu/dsontag/courses/ml13/slides/lecture14.pdf

Seraj k-means

https://www.youtube.com/watch?edufilter=NULL&v=9991JlKnFmk

https://www.kaggle.com/isaikumar/credit-card-fraud-detection-using-k-means-and-knn

An Improved Credit Card Fraud Detection Using K-Means Clustering Algorithm Paper

Genetic K-means Algorithm for Credit Card Fraud Detection paper


http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.735.9906&rep=rep1&type=pdf

A Fast Fraud Detection Approach Clustering Based Method

https://www.krishisanskriti.org/vol_image/03Jul201510071210.pdf

A Cluster Based Approach for Credit Card Fraud Detection System using Hmm with the Implementation
of Big Data Technology

https://www.ripublication.com/ijaer19/ijaerv14n2_08.pdf

Grouping of Retail Items by Using K-Means Clustering Paper


https://www.sciencedirect.com/science/article/pii/S1877050915035929

Analyzing Inventory Data Using K-Means Clustering


https://csce.ucmss.com/cr/books/2018/LFS/CSREA2018/ICD8072.pdf

https://towardsdatascience.com/clustering-machine-learning-combination-in-sales-prediction-330a7a205102

Sales Prediction using Clustering & Machine Learning (ARIMA & Holt’s Winter Approach) (R programming)

https://www.slideshare.net/annafensel/kmeans-clustering-122651195
import seaborn as sns
import matplotlib.pyplot as plt

# 'data' is a DataFrame with a 'cluster' column of K-means labels
unique_vals = data['cluster'].unique()  # e.g. [0, 1, 2]

# Use a list comprehension to create a list of sliced dataframes, one per cluster
targets = [data.loc[data['cluster'] == val] for val in unique_vals]

# Iterate through the list and plot the distribution of each cluster
# (sns.distplot is deprecated in newer seaborn; kdeplot is the modern equivalent)
for i, target in enumerate(targets):
    sns.distplot(target[["Traffic Level Average (E)"]], hist=False, rug=True, label="Cluster " + str(i))

plt.legend()
plt.show()

Subplots with Seaborn + Matplotlib: one plot for every column


sklearn.cluster.KMeans

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=None, algorithm='auto')

K-Means clustering.

Read more in the User Guide.

Parameters

n_clusters : int, default=8

The number of clusters to form as well as the number of centroids to generate.

init : {‘k-means++’, ‘random’} or ndarray of shape (n_clusters, n_features), default=’k-means++’

Method for initialization:

‘k-means++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

‘random’: choose k observations (rows) at random from data for the initial centroids.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial
centers.

n_init : int, default=10

Number of time the k-means algorithm will be run with different centroid seeds. The
final results will be the best output of n_init consecutive runs in terms of inertia.

max_iter : int, default=300

Maximum number of iterations of the k-means algorithm for a single run.

tol : float, default=1e-4

Relative tolerance with regards to inertia to declare convergence.

precompute_distances : ‘auto’ or bool, default=’auto’


Precompute distances (faster but takes more memory).

‘auto’ : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.

True : always precompute distances.

False : never precompute distances.

verbose : int, default=0

Verbosity mode.

random_state : int, RandomState instance, default=None

Determines random number generation for centroid initialization. Use an int to make the
randomness deterministic. See Glossary.

copy_x : bool, default=True

When pre-computing distances it is more numerically accurate to center the data first. If
copy_x is True (default), then the original data is not modified, ensuring X is C-
contiguous. If False, the original data is modified, and put back before the function
returns, but small numerical differences may be introduced by subtracting and then
adding the data mean, in this case it will also not ensure that data is C-contiguous which
may cause a significant slowdown.

n_jobs : int, default=None

The number of jobs to use for the computation. This works by computing each of the
n_init runs in parallel.

None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

algorithm : {“auto”, “full”, “elkan”}, default=”auto”

K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation
is more efficient by using the triangle inequality, but currently doesn’t support sparse
data. “auto” chooses “elkan” for dense data and “full” for sparse data.

Attributes

cluster_centers_ : ndarray of shape (n_clusters, n_features)

Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.

labels_ : ndarray of shape (n_samples,)

Labels of each point

inertia_ : float

Sum of squared distances of samples to their closest cluster center.

n_iter_ : int

Number of iterations run.

See also

MiniBatchKMeans

Alternative online implementation that does incremental updates of the centers positions using mini-batches. For large scale learning (say n_samples > 10k), MiniBatchKMeans is probably much faster than the default batch implementation.

Notes

The k-means problem is solved using either Lloyd’s or Elkan’s algorithm.

The average complexity is given by O(k n T), where n is the number of samples and T is the number of iterations.

The worst case complexity is given by O(n^(k+2/p)) with n = n_samples, p = n_features.


(D. Arthur and S. Vassilvitskii, ‘How slow is the k-means method?’ SoCG2006)

In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it can fall into local minima. That’s why it can be useful to restart it several times.

If the algorithm stops before fully converging (because of tol or max_iter), labels_ and cluster_centers_ will not be consistent, i.e. the cluster_centers_ will not be the means of the points in each cluster. Also, the estimator will reassign labels_ after the last iteration to make labels_ consistent with predict on the training set.

Examples

>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10.,  2.],
       [ 1.,  2.]])

Methods

fit(self, X[, y, sample_weight])              Compute k-means clustering.
fit_predict(self, X[, y, sample_weight])      Compute cluster centers and predict cluster index for each sample.
fit_transform(self, X[, y, sample_weight])    Compute clustering and transform X to cluster-distance space.
get_params(self[, deep])                      Get parameters for this estimator.
predict(self, X[, sample_weight])             Predict the closest cluster each sample in X belongs to.
score(self, X[, y, sample_weight])            Opposite of the value of X on the K-means objective.
set_params(self, **params)                    Set the parameters of this estimator.
transform(self, X)                            Transform X to a cluster-distance space.

__init__(self, n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=None, algorithm='auto')

Initialize self. See help(type(self)) for accurate signature.

fit(self, X, y=None, sample_weight=None)

Compute k-means clustering.

Parameters

X : array-like or sparse matrix, shape=(n_samples, n_features)

Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like, shape (n_samples,), optional

The weights for each observation in X. If None, all observations are assigned equal weight (default: None).

Returns

self

Fitted estimator.

fit_predict(self, X, y=None, sample_weight=None)

Compute cluster centers and predict cluster index for each sample.

Convenience method; equivalent to calling fit(X) followed by predict(X).

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data to transform.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like, shape (n_samples,), optional

The weights for each observation in X. If None, all observations are assigned equal weight (default: None).

Returns

labels : array, shape [n_samples,]

Index of the cluster each sample belongs to.

fit_transform(self, X, y=None, sample_weight=None)

Compute clustering and transform X to cluster-distance space.

Equivalent to fit(X).transform(X), but more efficiently implemented.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data to transform.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like, shape (n_samples,), optional

The weights for each observation in X. If None, all observations are assigned equal weight (default: None).

Returns

X_new : array, shape [n_samples, k]

X transformed in the new space.

get_params(self, deep=True)

Get parameters for this estimator.

Parameters

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : mapping of string to any

Parameter names mapped to their values.

predict(self, X, sample_weight=None)

Predict the closest cluster each sample in X belongs to.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data to predict.

sample_weight : array-like, shape (n_samples,), optional

The weights for each observation in X. If None, all observations are assigned equal weight (default: None).

Returns

labels : array, shape [n_samples,]

Index of the cluster each sample belongs to.

score(self, X, y=None, sample_weight=None)

Opposite of the value of X on the K-means objective.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like, shape (n_samples,), optional

The weights for each observation in X. If None, all observations are assigned equal weight (default: None).

Returns

score : float

Opposite of the value of X on the K-means objective.

set_params(self, **params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params : dict

Estimator parameters.

Returns

self : object

Estimator instance.
transform(self, X)

Transform X to a cluster-distance space.

In the new space, each dimension is the distance to the cluster centers. Note that even if X is sparse, the array returned by transform will typically be dense.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features)

New data to transform.

Returns

X_new : array, shape [n_samples, k]

X transformed in the new space.
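
A short usage sketch of transform and fit_transform, continuing the toy example from the Examples section above:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

# Each row of D holds the Euclidean distance from that sample to each of the
# 2 cluster centers (the "cluster-distance space").
D = kmeans.transform(X)
print(D.shape)  # (6, 2)

# fit_transform(X) is equivalent to fit(X).transform(X), but implemented more efficiently.
D2 = KMeans(n_clusters=2, random_state=0).fit_transform(X)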

K-means Clustering: Algorithm, Applications, Evaluation Methods, and Drawbacks

Imad Dabbura


Sep 17, 2018 · 13 min read


Clustering

Clustering is one of the most common exploratory data analysis techniques used to get an
intuition about the structure of the data. It can be defined as the task of identifying subgroups in
the data such that data points in the same subgroup (cluster) are very similar while data points in
different clusters are very different. In other words, we try to find homogeneous subgroups
within the data such that data points in each cluster are as similar as possible according to a
similarity measure such as Euclidean distance or correlation-based distance. The decision
of which similarity measure to use is application-specific.

Clustering analysis can be done on the basis of features, where we try to find subgroups of
samples based on features, or on the basis of samples, where we try to find subgroups of features
based on samples. We’ll cover here clustering based on features. Clustering is used in market
segmentation, where we try to find customers that are similar to each other in terms of
behaviors or attributes; image segmentation/compression, where we try to group similar regions
together; document clustering based on topics; etc.

Unlike supervised learning, clustering is considered an unsupervised learning method since we


don’t have the ground truth to compare the output of the clustering algorithm to the true labels to
evaluate its performance. We only want to try to investigate the structure of the data by grouping
the data points into distinct subgroups.

In this post, we will cover only kmeans, which is considered one of the most used clustering
algorithms due to its simplicity.

Kmeans Algorithm

The kmeans algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined,
distinct, non-overlapping subgroups (clusters) where each data point belongs to only one
group. It tries to make the intra-cluster data points as similar as possible while also keeping the
clusters as different (far) as possible. It assigns data points to a cluster such that the sum of the
squared distance between the data points and the cluster’s centroid (arithmetic mean of all the
data points that belong to that cluster) is at the minimum. The less variation we have within
clusters, the more homogeneous (similar) the data points are within the same cluster.

The kmeans algorithm works as follows:

1. Specify number of clusters K.

2. Initialize centroids by first shuffling the dataset and then randomly selecting K data
points for the centroids without replacement.

3. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to
clusters isn’t changing.

 Compute the sum of the squared distance between data points and all centroids.

 Assign each data point to the closest cluster (centroid).

 Compute the centroids for the clusters by taking the average of all the data points that
belong to each cluster.

The approach kmeans follows to solve the problem is called Expectation-Maximization. The
E-step is assigning the data points to the closest cluster. The M-step is computing the centroid of
each cluster. Below is a breakdown of how we can solve it mathematically (feel free to skip it).
The objective function is:
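
In standard form, using the w_ik and μ_k defined just below, the objective can be written as:

J = \sum_{i=1}^{m} \sum_{k=1}^{K} w_{ik} \, \lVert x_i - \mu_k \rVert^2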

where w_ik = 1 for data point x_i if it belongs to cluster k; otherwise, w_ik = 0. Also, μ_k is the centroid
of x_i’s cluster.

It’s a minimization problem of two parts. We first minimize J w.r.t. w_ik and treat μ_k as fixed. Then
we minimize J w.r.t. μ_k and treat w_ik as fixed. Technically speaking, we differentiate J w.r.t. w_ik
first and update the cluster assignments (E-step). Then we differentiate J w.r.t. μ_k and recompute the
centroids after the cluster assignments from the previous step (M-step). Therefore, the E-step is:
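
That is, using the standard assignment rule:

w_{ik} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert x_i - \mu_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}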

In other words, assign the data point x_i to the closest cluster as judged by its sum of squared
distance from the cluster’s centroid.

And M-step is:
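
That is, the standard centroid update:

\mu_k = \frac{\sum_{i=1}^{m} w_{ik} \, x_i}{\sum_{i=1}^{m} w_{ik}}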


Which translates to recomputing the centroid of each cluster to reflect the new assignments.

A few things to note here:

 Since clustering algorithms, including kmeans, use distance-based measurements to
determine the similarity between data points, it’s recommended to standardize the data to
have a mean of zero and a standard deviation of one, since almost always the features in any
dataset will have different units of measurement, such as age vs. income.

 Given kmeans’ iterative nature and the random initialization of centroids at the start of
the algorithm, different initializations may lead to different clusters, since the kmeans algorithm
may get stuck in a local optimum and may not converge to the global optimum. Therefore, it’s
recommended to run the algorithm using different initializations of centroids and pick the
result of the run that yielded the lowest sum of squared distance.

 The assignment of examples not changing is the same thing as no change in within-cluster
variation:

Implementation

We’ll use a simple implementation of kmeans here just to illustrate some concepts. Then we will
use the sklearn implementation, which is more efficient and takes care of many things for us.
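
A minimal from-scratch sketch of the algorithm described above (the function and variable names are my own, not the article’s notebook code):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: stop when the centroids (and hence the assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids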

Applications

The kmeans algorithm is very popular and used in a variety of applications such as market
segmentation, document clustering, image segmentation, and image compression. The goal
when we undertake a cluster analysis is usually either to:

1. Get a meaningful intuition of the structure of the data we’re dealing with.

2. Cluster-then-predict, where different models will be built for different subgroups if we
believe there is a wide variation in the behaviors of different subgroups. An example of that
is clustering patients into different subgroups and building a model for each subgroup to
predict the risk of having a heart attack.

In this post, we’ll apply clustering on two cases:

 Geyser eruptions segmentation (2D dataset).

 Image compression.

Kmeans on Geyser’s Eruptions Segmentation

We’ll first implement the kmeans algorithm on a 2D dataset and see how it works. The dataset has
272 observations and 2 features. The data covers the waiting time between eruptions and the
duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming,
USA. We will try to find K subgroups within the data points and group them accordingly. Below is
the description of the features:

 eruptions (float): Eruption time in minutes.

 waiting (int): Waiting time to next eruption.

Let’s plot the data first:


We’ll use this data because it’s easy to plot and visually spot the clusters since it’s a 2-dimensional
dataset. It’s obvious that we have 2 clusters. Let’s standardize the data first and run the kmeans
algorithm on the standardized data with K=2.
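
A sketch of this step with sklearn, assuming the Old Faithful data sits in a DataFrame df with columns 'eruptions' and 'waiting' (the variable and column names are assumptions):

import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X_std = StandardScaler().fit_transform(df[["eruptions", "waiting"]])
km = KMeans(n_clusters=2, random_state=42).fit(X_std)

plt.scatter(X_std[:, 0], X_std[:, 1], c=km.labels_, s=10)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            marker="*", s=300, c="red")  # '*' marks each cluster centroid
plt.xlabel("eruptions (standardized)")
plt.ylabel("waiting (standardized)")
plt.show()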
The above graph shows the scatter plot of the data colored by the cluster they belong to. In this
example, we chose K=2. The symbol ‘*’ is the centroid of each cluster. We can think of those 2
clusters as the geyser having different kinds of behavior under different scenarios.

Next, we’ll show that different initializations of centroids may yield different results. I’ll use 9
different values of random_state to change the initialization of the centroids and plot the results. The
title of each plot will be the sum of squared distance of each initialization.

As a side note, this dataset is considered very easy and converges in less than 10 iterations.
Therefore, to see the effect of random initialization on convergence, I am going to go with 3
iterations to illustrate the concept. However, in real-world applications, datasets are not at all
that clean and nice!

As the graph above shows, we only ended up with two different clusterings based on
different initializations. We would pick the one with the lowest sum of squared distance.

Kmeans on Image Compression

In this part, we’ll implement kmeans to compress an image. The image that we’ll be working on is
396 x 396 x 3. Therefore, for each pixel location we have 3 8-bit integers that specify the
red, green, and blue intensity values. Our goal is to reduce the number of colors to 30 and
represent (compress) the photo using those 30 colors only. To pick which colors to use, we’ll run the
kmeans algorithm on the image and treat every pixel as a data point. That means reshaping the
image from height x width x channels to (height * width) x channels, i.e. we would have 396 x 396
= 156,816 data points in 3-dimensional space, which are the RGB intensities. Doing so will allow
us to represent the image using the 30 centroids for each pixel and would significantly reduce the
size of the image by a factor of 6. The original image size was 396 x 396 x 24 = 3,763,584 bits;
however, the new compressed image would be 30 x 24 + 396 x 396 x 4 = 627,984 bits. The huge
difference comes from the fact that we’ll be using the centroids as a lookup for the pixels’ colors, which
reduces the size of each pixel location to 4 bits instead of 8 bits.

From now on we will be using the sklearn implementation of kmeans. A few things to note here:

 n_init is the number of times kmeans is run with different centroid initializations. The
result of the best one will be reported.

 tol is the within-cluster variation metric used to declare convergence.

 The default for init is k-means++, which is supposed to yield better results than plain
random initialization of centroids.
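
A sketch of the compression step described above (the image path is a placeholder; any RGB image works):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

img = plt.imread("image.png")        # placeholder path
h, w, c = img.shape
pixels = img.reshape(-1, c)          # (height * width) x channels

km = KMeans(n_clusters=30, n_init=10, random_state=0).fit(pixels)
# Replace every pixel by the centroid of its cluster, i.e. a 30-color palette.
compressed = km.cluster_centers_[km.labels_].reshape(h, w, c)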
We can see the comparison between the original image and the compressed one. The compressed
image looks close to the original one, which means we’re able to retain the majority of the
characteristics of the original image. With a smaller number of clusters, we would have a higher
compression rate at the expense of image quality. As a side note, this image compression method
is called lossy data compression because we can’t reconstruct the original image from the
compressed image.

Evaluation Methods

Contrary to supervised learning, where we have the ground truth to evaluate the model’s
performance, clustering analysis doesn’t have a solid evaluation metric that we can use to
evaluate the outcome of different clustering algorithms. Moreover, since kmeans requires k as an
input and doesn’t learn it from data, there is no right answer in terms of the number of clusters
that we should have in any problem. Sometimes domain knowledge and intuition may help, but
usually that is not the case. In the cluster-then-predict methodology, we can evaluate how well the
models are performing based on different values of K, since the clusters are used in the downstream
modeling.

In this post we’ll cover two metrics that may give us some intuition about k:
 Elbow method

 Silhouette analysis

Elbow Method

The elbow method gives us an idea of what a good number of clusters k would be based on the sum
of squared distances (SSE) between data points and their assigned clusters’ centroids. We pick k at
the spot where the SSE starts to flatten out and form an elbow. We’ll use the geyser dataset,
evaluate the SSE for different values of k, and see where the curve might form an elbow and flatten
out.

The graph above shows that k=2 is not a bad choice. Sometimes it’s still hard to figure out a good
number of clusters to use because the curve is monotonically decreasing and may not show any
elbow or an obvious point where the curve starts flattening out.

Silhouette Analysis

Silhouette analysis can be used to determine the degree of separation between clusters. For
each sample:

 Compute the average distance from all data points in the same cluster (ai).

 Compute the average distance from all data points in the closest cluster (bi).

 Compute the coefficient:
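
Using a_i and b_i as defined above, the silhouette coefficient for sample i is:

s_i = \frac{b_i - a_i}{\max(a_i, b_i)}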


The coefficient can take values in the interval [-1, 1].

 If it is 0 –> the sample is very close to the neighboring clusters.

 If it is 1 –> the sample is far away from the neighboring clusters.

 If it is -1 –> the sample is assigned to the wrong cluster.

Therefore, we want the coefficients to be as big as possible and close to 1 to have good clusters.
We’ll use the geyser dataset here again because it’s cheaper to run the silhouette analysis and it is
actually obvious that there are most likely only two groups of data points.
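
A minimal sketch of such an analysis with sklearn, assuming X_std is the standardized geyser data from the earlier sketch:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X_std)
    # Mean silhouette coefficient over all samples; closer to 1 is better.
    print(k, silhouette_score(X_std, labels))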
As the above plots show, n_clusters=2 has the best average silhouette score of around 0.75,
and all clusters being above the average shows that it is actually a good choice. Also, the thickness
of the silhouette plot gives an indication of how big each cluster is. The plot shows that cluster 1
has almost double the samples of cluster 2. However, as we increase n_clusters to 3 and 4,
the average silhouette score decreases dramatically to around 0.48 and 0.39 respectively.
Moreover, the thickness of the silhouette plot starts showing wide fluctuations. The bottom line is:
a good n_clusters will have an average silhouette score well above 0.5, with all of the clusters
scoring higher than the average.

Drawbacks

The kmeans algorithm is good at capturing the structure of the data if clusters have a spherical-like
shape. It always tries to construct a nice spherical shape around the centroid. That means that the
minute the clusters have complicated geometric shapes, kmeans does a poor job of clustering
the data. We’ll illustrate three cases where kmeans will not perform well.

First, the kmeans algorithm doesn’t let data points that are far away from each other share the same
cluster even though they obviously belong to the same cluster. Below is an example of data points
on two different horizontal lines that illustrates how kmeans tries to group half of the data points
of each horizontal line together.

Kmeans considers the point ‘B’ closer to point ‘A’ than point ‘C’ since they have a non-spherical
shape. Therefore, points ‘A’ and ‘B’ will be in the same cluster but point ‘C’ will be in a different
cluster. Note that the Single Linkage hierarchical clustering method gets this right because it
doesn’t separate similar points.

Second, we’ll generate data from multivariate normal distributions with different means and
standard deviations. So we would have 3 groups of data where each group was generated from a
different multivariate normal distribution (different mean/standard deviation). One group will
have a lot more data points than the other two combined. Next, we’ll run kmeans on the data with
K=3 and see if it will be able to cluster the data correctly. To make the comparison easier, I am
going to first plot the data colored based on the distribution it came from. Then I will plot the
same data but now colored based on the cluster it has been assigned to.

It looks like kmeans couldn’t figure out the clusters correctly. Since it tries to minimize the within-
cluster variation, it gives more weight to bigger clusters than smaller ones. In other words, data
points in smaller clusters may be left away from the centroid in order to focus more on the larger
cluster.

Last, we’ll generate data that has complicated geometric shapes, such as moons and circles
within each other, and test kmeans on both of the datasets.

As expected, kmeans couldn’t figure out the correct clusters for either dataset. However, we can
help kmeans perfectly cluster these kinds of datasets if we use kernel methods. The idea is that we
transform the data to a higher-dimensional representation that makes it linearly separable (the same
idea that we use in SVMs). Different kinds of algorithms work very well in such scenarios, such
as SpectralClustering; see below:
Conclusion

Kmeans clustering is one of the most popular clustering algorithms and is usually the first thing
practitioners apply when solving clustering tasks to get an idea of the structure of the dataset.
The goal of kmeans is to group data points into distinct, non-overlapping subgroups. It does a
very good job when the clusters have a roughly spherical shape. However, it suffers as the
geometric shapes of clusters deviate from spherical shapes. Moreover, it also doesn’t learn the
number of clusters from the data and requires it to be pre-defined. To be a good practitioner, it’s
good to know the assumptions behind algorithms/methods so that you have a pretty good
idea about the strengths and weaknesses of each method. This will help you decide when to use each
method and under what circumstances. In this post, we covered the strengths, weaknesses, and
some evaluation methods related to kmeans.

Below are the main takeaways:


 Scale/standardize the data when applying kmeans algorithm.

 The elbow method for selecting the number of clusters doesn’t always work because the error
function is monotonically decreasing for all k.

 Kmeans gives more weight to the bigger clusters.

 Kmeans assumes spherical shapes of clusters (with radius equal to the distance between
the centroid and the furthest data point) and doesn’t work well when clusters are in different
shapes such as elliptical clusters.

 If there is overlap between clusters, kmeans doesn’t have an intrinsic measure of
uncertainty for the examples that belong to the overlapping region in order to determine
which cluster to assign each data point to.

 Kmeans may still cluster the data even if it can’t be meaningfully clustered, such as data that comes
from uniform distributions.

The notebook that created this post can be found here.

Originally published at imaddabbura.github.io on September 17, 2018.

https://towardsdatascience.com/machine-learning-algorithms-part-9-k-means-example-in-python-f2ad05ed5203

K-Means Clustering is an unsupervised machine learning algorithm. In contrast to traditional supervised machine learning algorithms, K-Means attempts to classify data without having first been trained with labeled data. Once the algorithm has been run and the groups are defined, any new data can easily be assigned to the most relevant group.

The real world applications of K-Means include:

 customer profiling

 market segmentation

 computer vision

 search engines

 astronomy

How it works
1. Select K (i.e. 2) random points as cluster centers called centroids
2. Assign each data point to the closest cluster by calculating its distance with respect to each
centroid
3. Determine the new cluster center by computing the average of the assigned points
4. Repeat steps 2 and 3 until none of the cluster assignments change
Choosing the right number of clusters

Oftentimes the data you’ll be working with will have multiple dimensions, making it difficult to
visualize. As a consequence, the optimum number of clusters is no longer obvious. Fortunately, we
have a way of determining this mathematically.

We graph the relationship between the number of clusters and Within Cluster Sum of Squares
(WCSS) then we select the number of clusters where the change in WCSS begins to level off
(elbow method).
WCSS is defined as the sum of the squared distance between each member of the cluster and its
centroid.
For example, the computed WCSS for figure 1 would be greater than the WCSS calculated
for figure 2.
Figure 1
Figure 2

Code

Let’s take a look at how we could go about classifying data using the K-Means algorithm with
python. As always, we need to start by importing the required libraries.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs  # formerly sklearn.datasets.samples_generator, now removed
from sklearn.cluster import KMeans

In this tutorial, we’ll generate our own data using the make_blobs function from
the sklearn.datasets module. The centers parameter specifies the number of clusters.
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1])

Even though we already know the optimal number of clusters, I figured we could still benefit
from determining it using the elbow method. To get the values used in the graph, we train
multiple models using a different number of clusters and store the value of
the inertia_ property (WCSS) every time.
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300,
                    n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Next, we’ll categorize the data using the optimum number of clusters (4) we determined in the
last step. k-means++ ensures that you don’t fall into the random initialization trap.

kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10,
                random_state=0)
pred_y = kmeans.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1])
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='red')
plt.show()
https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/

Here, we see that there is a lot of variation in the magnitude of the data. Variables like Channel
and Region have low magnitude whereas variables like Fresh, Milk, Grocery, etc. have a higher
magnitude.

Since K-Means is a distance-based algorithm, this difference of magnitude can create a


problem. So let’s first bring all the variables to the same magnitude:

The Most Comprehensive Guide to K-Means Clustering You’ll Ever Need

PULKIT SHARMA, AUGUST 19, 2019

Overview

 K-Means Clustering is a simple yet powerful algorithm in data science


 There are a plethora of real-world applications of K-Means Clustering (a few of which we
will cover here)
 This comprehensive guide will introduce you to the world of clustering and K-Means
Clustering along with an implementation in Python on a real-world dataset

Introduction

I love working on recommendation engines. Whenever I come across any recommendation


engine on a website, I can’t wait to break it down and understand how it works underneath. It’s
one of the many great things about being a data scientist!

What truly fascinates me about these systems is how we can group similar items, products, and
users together. This grouping, or segmenting, works across industries. And that’s what makes
the concept of clustering such an important one in data science.

Clustering helps us understand our data in a unique way – by grouping things together into –
you guessed it – clusters.
In this article, we will cover k-means clustering and its components comprehensively. We’ll look
at clustering, why it matters, its applications, and then deep dive into k-means clustering
(including how to perform it in Python on a real-world dataset).


Learn more about clustering and other machine learning algorithms (both supervised and
unsupervised) in the comprehensive ‘Applied Machine Learning‘ course.

Table of Contents

1. What is Clustering?
2. How is Clustering an Unsupervised Learning Problem?
3. Properties of Clusters
4. Applications of Clustering in Real-World Scenarios
5. Understanding the Different Evaluation Metrics for Clustering
6. What is K-Means Clustering?
7. Implementing K-Means Clustering from scratch in Python
8. Challenges with K-Means Algorithm
9. K-Means ++ to choose initial cluster centroids for K-Means Clustering
10. How to choose the Right Number of Clusters in K-Means?
11. Implementing K-Means Clustering in Python

What is Clustering?

Let’s kick things off with a simple example. A bank wants to give credit card offers to its
customers. Currently, they look at the details of each customer and based on this information,
decide which offer should be given to which customer.

Now, the bank can potentially have millions of customers. Does it make sense to look at the
details of each customer separately and then make a decision? Certainly not! It is a manual
process and will take a huge amount of time.

So what can the bank do? One option is to segment its customers into different groups. For
instance, the bank can group the customers based on their income:

Can you see where I’m going with this? The bank can now make three different strategies or
offers, one for each group. Here, instead of creating different strategies for individual customers,
they only have to make 3 strategies. This will reduce the effort as well as the time.

The groups I have shown above are known as clusters and the process of creating these
groups is known as clustering. Formally, we can say that:

Clustering is the process of dividing the entire data into groups (also known as clusters) based
on the patterns in the data.
Can you guess which type of learning problem clustering is? Is it a supervised or unsupervised
learning problem?
Think about it for a moment and make use of the example we just saw. Got it? Clustering is an
unsupervised learning problem!

How is Clustering an Unsupervised Learning Problem?

Let’s say you are working on a project where you need to predict the sales of a big mart:

Or, a project where your task is to predict whether a loan will be approved or not:

We have a fixed target to predict in both of these situations. In the sales prediction problem, we
have to predict the Item_Outlet_Sales based on outlet_size, outlet_location_type, etc. and in the
loan approval problem, we have to predict the Loan_Status depending on the Gender, marital
status, the income of the customers, etc.

So, when we have a target variable to predict based on a given set of predictors or
independent variables, such problems are called supervised learning problems.
Now, there might be situations where we do not have any target variable to predict.

Such problems, without any fixed target variable, are known as unsupervised learning
problems. In these problems, we only have the independent variables and no target/dependent
variable.
In clustering, we do not have a target to predict. We look at the data and then try to club
similar observations and form different groups. Hence it is an unsupervised learning
problem.
We now know what are clusters and the concept of clustering. Next, let’s look at the properties
of these clusters which we must consider while forming the clusters.

Properties of Clusters

How about another example? We’ll take the same bank as before who wants to segment its
customers. For simplicity purposes, let’s say the bank only wants to use the income and debt to
make the segmentation. They collected the customer data and used a scatter plot to visualize it:

On the X-axis, we have the income of the customer and the y-axis represents the amount of
debt. Here, we can clearly visualize that these customers can be segmented into 4 different
clusters as shown below:
This is how clustering helps to create segments (clusters) from the data. The bank can further
use these clusters to make strategies and offer discounts to its customers. So let’s look at the
properties of these clusters.

Property 1

All the data points in a cluster should be similar to each other. Let me illustrate it using the
above example:

If the customers in a particular cluster are not similar to each other, then their requirements
might vary, right? If the bank gives them the same offer, they might not like it and their interest
in the bank might reduce. Not ideal.

Having similar data points within the same cluster helps the bank to use targeted marketing.
You can think of similar examples from your everyday life and think about how clustering will (or
already does) impact the business strategy.

 
Property 2

The data points from different clusters should be as different as possible. This will
intuitively make sense if you grasped the above property. Let’s again take the same example to
understand this property:

Which of these cases do you think will give us the better clusters? If you look at case I:

Customers in the red and blue clusters are quite similar to each other. The top four points in the
red cluster share similar properties as that of the top two customers in the blue cluster. They
have high income and high debt value. Here, we have clustered them differently. Whereas, if
you look at case II:
Points in the red cluster are completely different from the customers in the blue cluster. All the
customers in the red cluster have high income and high debt and customers in the blue cluster
have high income and low debt value. Clearly we have a better clustering of customers in this
case.

Hence, data points from different clusters should be as different from each other as possible to
have more meaningful clusters.

So far, we have understood what clustering is and the different properties of clusters. But why
do we even need clustering? Let’s clear this doubt in the next section and look at some
applications of clustering.

Applications of Clustering in Real-World Scenarios

Clustering is a widely used technique in the industry. It is actually being used in almost every
domain, ranging from banking to recommendation engines, document clustering to image
segmentation.

Customer Segmentation

We covered this earlier – one of the most common applications of clustering is customer
segmentation. And it isn’t just limited to banking. This strategy is used across functions, including
telecom, e-commerce, sports, advertising, sales, etc.

 
Document Clustering

This is another common application of clustering. Let’s say you have multiple documents and
you need to cluster similar documents together. Clustering helps us group these documents
such that similar documents are in the same clusters.

Image Segmentation

We can also use clustering to perform image segmentation. Here, we try to club similar pixels in
the image together. We can apply clustering to create clusters having similar pixels in the same
group.

You can refer to this article to see how we can make use of clustering for image segmentation
tasks.

 
Recommendation Engines

Clustering can also be used in recommendation engines. Let’s say you want to recommend
songs to your friends. You can look at the songs liked by that person and then use clustering to
find similar songs and finally recommend the most similar songs.

There are many more applications which I’m sure you have already thought of. You can share
these applications in the comments section below. Next, let’s look at how we can evaluate our
clusters.

Understanding the Different Evaluation Metrics for Clustering

The primary aim of clustering is not just to make clusters, but to make good and meaningful
ones. We saw this in the below example:
Here, we used only two features and hence it was easy for us to visualize and decide which of
these clusters is better.

Unfortunately, that’s not how real-world scenarios work. We will have a ton of features to work
with. Let’s take the customer segmentation example again – we will have features like
customer’s income, occupation, gender, age, and many more. Visualizing all these features
together and deciding better and meaningful clusters would not be possible for us.

This is where we can make use of evaluation metrics. Let’s discuss a few of them and
understand how we can use them to evaluate the quality of our clusters.

Inertia

Recall the first property of clusters we covered above. This is what inertia evaluates. It tells us
how far apart the points within a cluster are. So, inertia actually calculates the sum of distances of
all the points within a cluster from the centroid of that cluster.

We calculate this for all the clusters and the final inertia value is the sum of all these distances.
This distance within the clusters is known as the intracluster distance. So, inertia gives us the sum
of intracluster distances:

Now, what do you think should be the value of inertia for a good cluster? Is a small inertia value
good or do we need a larger value? We want the points within the same cluster to be similar to
each other, right? Hence, the distance between them should be as low as possible.

Keeping this in mind, we can say that the lesser the inertia value, the better our clusters are.
 

Dunn Index

We now know that inertia tries to minimize the intracluster distance. It is trying to make more
compact clusters.

Let me put it this way – if the distance between the centroid of a cluster and the points in that
cluster is small, it means that the points are closer to each other. So, inertia makes sure that the
first property of clusters is satisfied. But it does not care about the second property – that
different clusters should be as different from each other as possible.

This is where Dunn index can come into action.

Along with the distance between the centroid and points, the Dunn index also takes into
account the distance between two clusters. This distance between the centroids of two
different clusters is known as inter-cluster distance. Let’s look at the formula of the Dunn
index:

The Dunn index is the ratio of the minimum of the inter-cluster distances to the maximum of the
intracluster distances.
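
In symbols, following the definitions above (inter-cluster distances measured between centroids, intracluster distances from a centroid to its points):

\text{Dunn index} = \frac{\min(\text{inter-cluster distances})}{\max(\text{intracluster distances})}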
We want to maximize the Dunn index. The more the value of the Dunn index, the better will be
the clusters. Let’s understand the intuition behind Dunn index:

In order to maximize the value of the Dunn index, the numerator should be maximum. Here, we
are taking the minimum of the inter-cluster distances. So, the distance between even the closest
clusters should be more which will eventually make sure that the clusters are far away from
each other.
Also, the denominator should be minimum to maximize the Dunn index. Here, we are taking the
maximum of intracluster distances. Again, the intuition is the same here. The maximum distance
between the cluster centroids and the points should be minimum which will eventually make
sure that the clusters are compact.

 
Introduction to K-Means Clustering

We have finally arrived at the meat of this article!

Recall the first property of clusters – it states that the points within a cluster should be similar to
each other. So, our aim here is to minimize the distance between the points within a
cluster.

There is an algorithm that tries to minimize the distance of the points in a cluster with their
centroid – the k-means clustering technique.
K-means is a centroid-based algorithm, or a distance-based algorithm, where we calculate the
distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid.

The main objective of the K-Means algorithm is to minimize the sum of distances
between the points and their respective cluster centroid.
Let’s now take an example to understand how K-Means actually works:
We have these 8 points and we want to apply k-means to create clusters for these points.
Here’s how we can do it.

Step 1: Choose the number of clusters k

The first step in k-means is to pick the number of clusters, k.

Step 2: Select k random points from the data as centroids

Next, we randomly select the centroid for each cluster. Let’s say we want to have 2 clusters, so
k is equal to 2 here. We then randomly select the centroid:

Here, the red and green circles represent the centroid for these clusters.

Step 3: Assign all the points to the closest cluster centroid

Once we have initialized the centroids, we assign each point to the closest cluster centroid:
Here you can see that the points which are closer to the red point are assigned to the red
cluster whereas the points which are closer to the green point are assigned to the green cluster.

Step 4: Recompute the centroids of newly formed clusters

Now, once we have assigned all of the points to either cluster, the next step is to compute the
centroids of newly formed clusters:

Here, the red and green crosses are the new centroids.

Step 5: Repeat steps 3 and 4

We then repeat steps 3 and 4:


The step of computing the centroid and assigning all the points to the cluster based on their
distance from the centroid is a single iteration. But wait – when should we stop this process? It
can’t run till eternity, right?

 
Stopping Criteria for K-Means Clustering

There are essentially three stopping criteria that can be adopted to stop the K-means algorithm:

1. Centroids of newly formed clusters do not change


2. Points remain in the same cluster
3. Maximum number of iterations are reached

We can stop the algorithm if the centroids of newly formed clusters are not changing. Even after
multiple iterations, if we are getting the same centroids for all the clusters, we can say that the
algorithm is not learning any new pattern and it is a sign to stop the training.

Another clear sign that we should stop the training process if the points remain in the same
cluster even after training the algorithm for multiple iterations.

Finally, we can stop the training if the maximum number of iterations is reached. Suppose we
have set the number of iterations to 100. The process will repeat for 100 iterations before
stopping.

 
Implementing K-Means Clustering in Python from Scratch

Time to fire up our Jupyter notebooks (or whichever IDE you use) and get our hands dirty in
Python!

We will be working on the loan prediction dataset that you can download here. I encourage you
to read more about the dataset and the problem statement here. This will help you visualize
what we are working on (and why we are doing this). Two pretty important questions in any data
science project.

First, import all the required libraries:

Now, we will read the CSV file and look at the first five rows of the data:

For this article, we will be taking only two variables from the data – “LoanAmount” and
“ApplicantIncome”. This will make it easy to visualize the steps as well. Let’s pick these two
variables and visualize the data points:

Steps 1 and 2 of K-Means were about choosing the number of clusters (k) and selecting random
centroids for each cluster. We will pick 3 clusters and then select random observations from the
data as the centroids:

Here, the red dots represent the 3 centroids for each cluster. Note that we have chosen these
points randomly and hence every time you run this code, you might get different centroids.

Next, we will define some conditions to implement the K-Means Clustering algorithm. Let’s first
look at the code:

These values might vary every time we run this. Here, we are stopping the training when the
centroids are not changing after two iterations. We have initially defined the diff as 1 and inside
the while loop, we are calculating this diff as the difference between the centroids in the
previous iteration and the current iteration.

When this difference is 0, we are stopping the training. Let’s now visualize the clusters we have
got:
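
A rough sketch of the loop described above (the DataFrame name data, the two column names, and K=3 are assumptions taken from the text above):

import numpy as np

X = data[["ApplicantIncome", "LoanAmount"]].values
K = 3
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), size=K, replace=False)]  # random rows as initial centroids

diff = 1.0
while diff != 0:
    # Assign each point to its nearest centroid.
    labels = np.linalg.norm(X[:, None] - centroids[None, :], axis=2).argmin(axis=1)
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Stop once the centroids no longer move between iterations.
    diff = np.abs(new_centroids - centroids).sum()
    centroids = new_centroids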
Awesome! Here, we can clearly visualize three clusters. The red dots represent the centroid of
each cluster. I hope you now have a clear understanding of how K-Means works.

However, there are certain situations where this algorithm might not perform as well. Let’s look
at some challenges which you can face while working with k-means.

Challenges with the K-Means Clustering Algorithm

One of the common challenges we face while working with K-Means is that the size of
clusters is different. Let’s say we have the below points:
The left and the rightmost clusters are of smaller size compared to the central cluster. Now, if
we apply k-means clustering on these points, the results will be something like this:

Another challenge with k-means is when the densities of the original points are
different. Let’s say these are the original points:
Here, the points in the red cluster are spread out whereas the points in the remaining clusters
are closely packed together. Now, if we apply k-means on these points, we will get clusters like
this:

We can see that the compact points have been assigned to a single cluster, whereas the points
that are spread loosely but were in the same cluster have been assigned to different clusters.
Not ideal, so what can we do about this?

One of the solutions is to use a higher number of clusters. So, in all the above scenarios,
instead of using 3 clusters, we can have a bigger number. Perhaps setting k=10 might lead to
more meaningful clusters.

Remember how we randomly initialize the centroids in k-means clustering? Well, this is also
potentially problematic because we might get different clusters every time. So, to solve this
problem of random initialization, there is an algorithm called K-Means++ that can be used to
choose the initial values, or the initial cluster centroids, for K-Means.
 

K-Means++ to Choose Initial Cluster Centroids for K-Means Clustering

In some cases, if the initialization of clusters is not appropriate, K-Means can result in arbitrarily
bad clusters. This is where K-Means++ helps. It specifies a procedure to initialize the cluster
centers before moving forward with the standard k-means clustering algorithm.

Using the K-Means++ algorithm, we optimize the step where we randomly pick the cluster
centroid. We are more likely to find a solution that is competitive to the optimal K-Means solution
while using the K-Means++ initialization.

The steps to initialize the centroids using K-Means++ are:

1. The first cluster is chosen uniformly at random from the data points that we want to
cluster. This is similar to what we do in K-Means, but instead of randomly picking all the
centroids, we just pick one centroid here
2. Next, we compute the distance (D(x)) of each data point (x) from the cluster center that
has already been chosen
3. Then, choose the new cluster center from the data points with the probability of x being
proportional to (D(x))^2
4. We then repeat steps 2 and 3 until k clusters have been chosen
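
A simplified sketch of this initialization procedure (my own helper, not sklearn’s internal implementation):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick the first centroid uniformly at random from the data points.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: squared distance of every point to its closest already-chosen centroid.
        d2 = np.min(np.linalg.norm(X[:, None] - np.array(centroids)[None, :], axis=2) ** 2, axis=1)
        # Step 3: sample the next centroid with probability proportional to D(x)^2.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)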

Let’s take an example to understand this more clearly. Let’s say we have the following points
and we want to make 3 clusters here:

Now, the first step is to randomly pick a data point as a cluster centroid:
Let’s say we pick the green point as the initial centroid. Now, we will calculate the distance
(D(x)) of each data point with this centroid:

The next centroid will be the one whose squared distance (D(x)2) is the farthest from the current
centroid:

In this case, the red point will be selected as the next centroid. Now, to select the last centroid,
we will take the distance of each point from its closest centroid and the point having the largest
squared distance will be selected as the next centroid:
We will select the last centroid as:

We can continue with the K-Means algorithm after initializing the centroids. Using K-Means++ to
initialize the centroids tends to improve the clusters. Although it is computationally costly relative
to random initialization, subsequent K-Means often converge more rapidly.

I’m sure there’s one question which you’ve been wondering about since the start of this article –
how many clusters should we make? Aka, what should be the optimum number of clusters to
have while performing K-Means?

How to Choose the Right Number of Clusters in K-Means Clustering?

One of the most common doubts everyone has while working with K-Means is selecting the right
number of clusters.

So, let’s look at a technique that will help us choose the right value of clusters for the K-Means
algorithm. Let’s take the customer segmentation example which we saw earlier. To recap, the
bank wants to segment its customers based on their income and amount of debt:
Here, we can have two clusters which will separate the customers as shown below:

All the customers with low income are in one cluster whereas the customers with high income
are in the second cluster. We can also have 4 clusters:
Here, one cluster might represent customers who have low income and low debt, other cluster
is where customers have high income and high debt, and so on. There can be 8 clusters as
well:

Honestly, we can have any number of clusters. Can you guess what would be the maximum
number of possible clusters? One thing which we can do is to assign each point to a separate
cluster. Hence, in this case, the number of clusters will be equal to the number of points or
observations. So,

the maximum possible number of clusters will be equal to the number of observations in the dataset.
But then how can we decide the optimum number of clusters? One thing we can do is plot a
graph, also known as an elbow curve, where the x-axis will represent the number of
clusters and the y-axis will be an evaluation metric. Let’s say inertia for now.
You can choose any other evaluation metric like the Dunn index as well:

Next, we will start with a small cluster value, let’s say 2. Train the model using 2 clusters,
calculate the inertia for that model, and finally plot it in the above graph. Let’s say we got an
inertia value of around 1000:

Now, we will increase the number of clusters, train the model again, and plot the inertia value.
This is the plot we get:
When we changed the cluster value from 2 to 4, the inertia value reduced very sharply. The rate of
this decrease then slows down, and the inertia eventually becomes almost constant as we increase
the number of clusters further.

So,

the cluster value where this decrease in inertia value becomes constant can be
chosen as the right cluster value for our data.

Here, we can choose any number of clusters between 6 and 10. We can have 7, 8, or even 9
clusters. You must also look at the computation cost while deciding the number of
clusters. If we increase the number of clusters, the computation cost will also increase. So, if
you do not have high computational resources, my advice is to choose a smaller number of
clusters.

Let’s now implement the K-Means Clustering algorithm in Python. We will also see how to use
K-Means++ to initialize the centroids and will also plot this elbow curve to decide what should be
the right number of clusters for our dataset.

Implementing K-Means Clustering in Python

We will be working on a wholesale customer segmentation problem. You can download the
dataset using this link. The data is hosted on the UCI Machine Learning repository.

The aim of this problem is to segment the clients of a wholesale distributor based on
their annual spending on diverse product categories, like milk, grocery, frozen items, etc. So,
let's start coding!

We will first import the required libraries:

Next, let’s read the data and look at the first five rows:

We have the spending details of customers on different products like Milk, Grocery, Frozen,
Detergents, etc. Now, we have to segment the customers based on the provided details. Before
doing that, let’s pull out some statistics related to the data:
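The original post shows these steps as screenshots; a minimal sketch of the equivalent code might look like this (the CSV filename is an assumption based on the UCI repository page):

import pandas as pd

# Read the wholesale customer data (filename assumed from the UCI repository)
data = pd.read_csv("Wholesale customers data.csv")

print(data.head())       # first five rows
print(data.describe())   # summary statistics: note the differing magnitudes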
Here, we see that there is a lot of variation in the magnitude of the data. Variables like Channel
and Region have low magnitude whereas variables like Fresh, Milk, Grocery, etc. have a higher
magnitude.

Since K-Means is a distance-based algorithm, this difference of magnitude can create a
problem. So let's first bring all the variables to the same magnitude:
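A sketch of one way to do this, assuming standardization with scikit-learn's StandardScaler (the article may have used a different scaler):

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("Wholesale customers data.csv")   # assumed filename

# Standardize every column to zero mean and unit variance
data_scaled = pd.DataFrame(StandardScaler().fit_transform(data),
                           columns=data.columns)
print(data_scaled.describe())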

The magnitude looks similar now. Next, let’s create a kmeans function and fit it on the data:
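A sketch of this fit, passing the k-means++ initialization explicitly (it is also scikit-learn's default) and printing the inertia evaluated just below; the exact value will differ from the article's 2599 depending on the scaling choices and random_state:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data_scaled = StandardScaler().fit_transform(
    pd.read_csv("Wholesale customers data.csv"))     # assumed filename

# Two clusters, seeded with k-means++
kmeans = KMeans(n_clusters=2, init="k-means++", random_state=0)
kmeans.fit(data_scaled)

print(kmeans.inertia_)   # within-cluster sum of squared distances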

We have initialized two clusters and, note, the initialization is not random here: we have used
the k-means++ initialization, which generally produces better results, as discussed in the
previous section.

Let's evaluate how good the formed clusters are. To do that, we will calculate the inertia of the
clusters:

Output: 2599.38555935614
We got an inertia value of almost 2600. Now, let’s see how we can use the elbow curve to
determine the optimum number of clusters in Python.

We will first fit multiple k-means models and in each successive model, we will increase the
number of clusters. We will store the inertia value of each model and then plot it to visualize the
result:
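A sketch of this loop and the resulting elbow plot, again assuming the standardized wholesale data from the earlier sketches:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data_scaled = StandardScaler().fit_transform(
    pd.read_csv("Wholesale customers data.csv"))     # assumed filename

sse = []
cluster_range = range(2, 20)
for k in cluster_range:
    model = KMeans(n_clusters=k, init="k-means++", random_state=0).fit(data_scaled)
    sse.append(model.inertia_)       # store the inertia of each model

plt.plot(list(cluster_range), sse, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()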

Can you tell the optimum cluster value from this plot? Looking at the above elbow curve, we
can choose any number of clusters between 5 to 8. Let’s set the number of clusters as 6 and
fit the model:

Finally, let’s look at the value count of points in each of the above-formed clusters:
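A sketch of the final model and the cluster counts; the exact counts, such as the 234 and 125 quoted below, depend on the scaling and random_state used:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("Wholesale customers data.csv")   # assumed filename
data_scaled = StandardScaler().fit_transform(data)

kmeans = KMeans(n_clusters=6, init="k-means++", random_state=0)
pred = kmeans.fit_predict(data_scaled)

frame = pd.DataFrame(data_scaled)
frame["cluster"] = pred
print(frame["cluster"].value_counts())   # number of customers in each cluster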

So, there are 234 data points belonging to cluster 4 (index 3), then 125 points in cluster 2 (index
1), and so on. This is how we can implement K-Means Clustering in Python.

 
End Notes

In this article, we discussed one of the most famous clustering algorithms – K-Means. We
implemented it from scratch and looked at its step-by-step implementation. We looked at the
challenges which we might face while working with K-Means and also saw how K-Means++ can
be helpful when initializing the cluster centroids.

Finally, we implemented k-means and looked at the elbow curve which helps to find the
optimum number of clusters in the K-Means algorithm.

If you have any doubts or feedback, feel free to share them in the comments section below. And
make sure you check out the comprehensive ‘Applied Machine Learning‘ course that takes
you from the basics of machine learning to advanced algorithms (including an entire module on
deploying your machine learning models!)

K-means Clustering¶

The plots display firstly what a K-means algorithm would yield using three clusters. It is then
shown what the effect of a bad initialization is on the classification process: By setting n_init to
only 1 (default is 10), the amount of times that the algorithm will be run with different centroid
seeds is reduced. The next plot displays what using eight clusters would deliver and finally the
ground truth.

  
  

  

print(__doc__)

# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
# Though the following import is not directly being used, it is required
# for 3D projection to work
from mpl_toolkits.mplot3d import Axes3D

from sklearn.cluster import KMeans
from sklearn import datasets

np.random.seed(5)

iris = datasets.load_iris()
X = iris.data
y = iris.target

estimators = [('k_means_iris_8', KMeans(n_clusters=8)),
              ('k_means_iris_3', KMeans(n_clusters=3)),
              ('k_means_iris_bad_init', KMeans(n_clusters=3, n_init=1,
                                               init='random'))]

fignum = 1
titles = ['8 clusters', '3 clusters', '3 clusters, bad initialization']
for name, est in estimators:
    fig = plt.figure(fignum, figsize=(4, 3))
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    est.fit(X)
    labels = est.labels_

    ax.scatter(X[:, 3], X[:, 0], X[:, 2],
               c=labels.astype(float), edgecolor='k')

    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    ax.set_title(titles[fignum - 1])
    ax.dist = 12
    fignum = fignum + 1

# Plot the ground truth
fig = plt.figure(fignum, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

for name, label in [('Setosa', 0),
                    ('Versicolour', 1),
                    ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean(),
              X[y == label, 2].mean() + 2, name,
              horizontalalignment='center',
              bbox=dict(alpha=.2, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor='k')

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title('Ground Truth')
ax.dist = 12

fig.show()
Selecting the number of clusters with silhouette analysis on KMeans clustering¶

Silhouette analysis can be used to study the separation distance between the resulting clusters.
The silhouette plot displays a measure of how close each point in one cluster is to points in the
neighboring clusters and thus provides a way to assess parameters like number of clusters
visually. This measure has a range of [-1, 1].

Silhouette coefficients (as these values are referred to as) near +1 indicate that the sample is far
away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to
the decision boundary between two neighboring clusters and negative values indicate that
those samples might have been assigned to the wrong cluster.

In this example the silhouette analysis is used to choose an optimal value for n_clusters. The
silhouette plot shows that the n_clusters value of 3, 5 and 6 are a bad pick for the given data
due to the presence of clusters with below average silhouette scores and also due to wide
fluctuations in the size of the silhouette plots. Silhouette analysis is more ambivalent in deciding
between 2 and 4.

Also from the thickness of the silhouette plot the cluster size can be visualized. The silhouette
plot for cluster 0 when n_clusters is equal to 2, is bigger in size owing to the grouping of the
3 sub clusters into one big cluster. However when the n_clusters is equal to 4, all the plots are
more or less of similar thickness and hence are of similar sizes as can be also verified from the
labelled scatter plot on the right.

 

 

 

 

Out:
For n_clusters = 2 The average silhouette_score is : 0.7049787496083262
For n_clusters = 3 The average silhouette_score is : 0.5882004012129721
For n_clusters = 4 The average silhouette_score is : 0.6505186632729437
For n_clusters = 5 The average silhouette_score is : 0.5745566973301872
For n_clusters = 6 The average silhouette_score is : 0.43902711183132426
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

print(__doc__)

# Generating the sample data from make_blobs
# This particular setting has one distinct cluster and 3 clusters placed close
# together.
X, y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)  # For reproducibility

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html

This is an excerpt from the Python Data Science Handbook by Jake VanderPlas; Jupyter
notebooks are available on GitHub.
The text is released under the CC-BY-NC-ND license, and code is released under the MIT
license. If you find this content useful, please consider supporting the work by buying the book!

In Depth: k-Means Clustering

https://heartbeat.fritz.ai/k-means-clustering-using-sklearn-and-python-4a054d67b187

K-means clustering using sklearn and Python

Did you know that 60% of newly-launched products may not perform well because they fail to
represent, or actually offer, something their customers really want?

This is the era of personalization. Using personalization you can efficiently attract new customers
and retain existing customers. These days, a one-size-fits-all approach generally doesn’t work.

Personalization starts with customer segmentation, which is the practice of grouping customers
based on features like age, gender, interests, and spending habits. We do this so we can customize
our marketing approaches for each customer group.

In the realm of machine learning, k-means clustering can be used to segment customers (or other
data) efficiently.

K-means clustering is one of the simplest unsupervised machine learning algorithms. Here,
we’ll explore what it can do and work through a simple implementation in Python.

Some facts about k-means clustering:

1. K-means converges in a finite number of iterations. Since the algorithm iterates a
function whose domain is a finite set, the iteration must eventually converge.

2. The computational cost of the k-means algorithm is O(k*n*d), where n is the number of
data points, k the number of clusters, and d the number of attributes.

3. Compared to other clustering methods, the k-means clustering technique is fast and
efficient in terms of its computational cost.

4. It's difficult to predict the optimal number of clusters or the value of k. To find the
number of clusters, we need to run the k-means clustering algorithm for a range of k values
and compare the results.


Example Implementation

Let’s implement k-means clustering using a famous dataset: the Iris dataset. This dataset
contains 3 classes of 50 instances each and each class refers to a type of iris plant. The dataset has
four features: sepal length, sepal width, petal length, and petal width. The fifth column is for
species, which holds the value for these types of plants. For example, one of the types is
a setosa, as shown in the image below.
iris dataset for k-means clustering

To start Python coding for k-means clustering, let’s start by importing the required libraries.
Apart from NumPy, Pandas, and Matplotlib, we’re also importing KMeans from sklearn.cluster,
as shown below.
k-means clustering with python

We’re reading the Iris dataset using the read_csv Pandas method and storing the data in a data
frame df. After populating the data frame df, we use the head() method on the dataset to see
its first 10 records.
read iris dataset using pandas

Now we select all four features (sepal length, sepal width, petal length, and petal width) of the
dataset in a variable called x so that we can train our model with these features. For this, we use
the iloc function on df, and the column indices (0, 1, 2, 3) for the above four columns, as
shown below:
select iris dataset features into variable x
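Since the original code appears only as screenshots, here is a minimal sketch of these steps; the CSV filename and column order are assumptions:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv("iris.csv")      # assumed filename for the Iris CSV
print(df.head(10))                # first 10 records

# First four columns: sepal length, sepal width, petal length, petal width
x = df.iloc[:, 0:4].values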

To start, let’s arbitrarily assign the value of k as 5. We will implement k-means clustering
using k=5. For this we will instantiate the KMeans class and assign it to the variable kmeans5:
k-means clustering with k = 5

Below, you can see the output of the k-means clustering model with k=5. Note that we can find
the centers of 5 clusters formed from the data:
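A sketch of that step; scikit-learn's built-in copy of Iris is used here so the snippet runs standalone, whereas the article loads the CSV shown above:

from sklearn import datasets
from sklearn.cluster import KMeans

x = datasets.load_iris().data            # the four numeric Iris features

kmeans5 = KMeans(n_clusters=5, random_state=0)
y_kmeans5 = kmeans5.fit_predict(x)       # cluster label for each sample
print(kmeans5.cluster_centers_)          # coordinates of the 5 cluster centers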
There’s a method called the Elbow method, which is designed to help find the optimal number of
clusters in a dataset. So let’s use this method to calculate the optimum value of k. To implement
the Elbow method, we need to create some Python code (shown below), and we’ll plot a graph
between the number of clusters and the corresponding error value.
This graph generally ends up shaped like an elbow, hence its name:
elbow method to calculate the optimum value of k
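A sketch of the Elbow-method loop; the k range and random_state are arbitrary choices for illustration:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans

x = datasets.load_iris().data

wcss = []                                 # within-cluster sum of squares per k
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=0).fit(x)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Error (WCSS)")
plt.show()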

The output graph of the Elbow method is shown below. Note that the shape of elbow is
approximately formed at k=3.

As you can see, the optimal value of k is between 2 and 4, as the elbow-like shape is formed
at k=3 in the above graph.

Let's implement k-means again using k=3.

Finally, it's time to visualize the three clusters that were formed with the optimal k value. You can
clearly see three clusters in the image below, with each cluster represented by a different color.
visualizing k means clustering
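A sketch of the refit with k=3 and a simple scatter plot of the clusters; it plots the first two features, while the original figure may have used a different pair:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans

x = datasets.load_iris().data
kmeans3 = KMeans(n_clusters=3, random_state=0)
labels = kmeans3.fit_predict(x)

# Color each point by its cluster and mark the centroids
plt.scatter(x[:, 0], x[:, 1], c=labels, cmap="viridis", s=30)
centers = kmeans3.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c="red", marker="x", s=120)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.show()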

Closing comments
I hope you learned how to implement k-means clustering using sklearn and Python. Finding the
optimal k value is an important step here. In case the Elbow method doesn't work, there
are several other methods that can be used to find the optimal value of k.

Happy Machine Learning!

https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-
methods/

https://heartbeat.fritz.ai/understanding-the-mathematics-behind-k-means-clustering-40e1d55e2f4c

Understanding the Mathematics behind K-Means Clustering


Exploring K-means Clustering: Mathematical foundations, classification, and benefits and limitations

In this post, we’re going to dive deep into one of the most influential unsupervised learning
algorithms—k-means clustering. K-means clustering is one of the simplest and most popular
unsupervised machine learning algorithms, and we’ll be discussing how the algorithm works,
distance and accuracy metrics, and a lot more.
What is meant by unsupervised learning?

Unsupervised learning is a type of self-organized learning that aids us in discovering patterns
in our data related to various features. It is one of the three main categories of machine learning,
along with supervised and reinforcement learning.
Source: https://datafloq.com/read/machine-learning-explained-understanding-learning/4478

Two of the main methods used in unsupervised learning are principal component analysis and
cluster analysis. To learn more about principal component analysis, refer to this article.

What is Clustering?

Clustering is the process of dividing the data space or data points into a number of groups, such
that data points in the same group are more similar to one another than to the data points in
other groups.
Clustering Objectives

The major objective of clustering is to find patterns (i.e. similarities within data points) in an
unlabeled dataset and cluster them together. But how do we decide what constitutes a good
clustering? There isn't a definitive best way of clustering that is independent of the final aim of
the clustering. The end result usually depends on the user and the parameters they select,
focusing on the most important features used for clustering.


Applications of Clustering in Real-World problems

Vector quantization

K-means originates from signal processing, but it’s also used for vector quantization. For
example, color quantization is the task of reducing the color palette of an image to a fixed
number of colors k. The k-means algorithm can easily be used for this task.

Psychology and Medicine

An illness or condition frequently has a number of variations, and cluster analysis can be used to
identify these different subcategories. For example, clustering has been used to identify different
types of depression. Cluster analysis can also be used to detect patterns in the spatial or temporal
distribution of a disease.

Recommender Systems

Clustering can also be used in recommendation engines. In the case of recommending movies to
someone, you can look at the movies enjoyed by a user and then use clustering to find similar
movies.

For a detailed discussion on recommender systems, refer to this series.


Document Clustering

This is another common application of clustering. Let’s say you have multiple documents and you
need to cluster similar documents together. Clustering helps us group these documents such that
similar documents are in the same clusters.
Image Segmentation

Image segmentation is a wide-spread application of clustering. Similar pixels in the image are
grouped together. We can apply this technique to create clusters having similar pixels in the same
group.
The k-means clustering algorithm

K-means clustering is a prototype-based, partitional clustering technique that attempts to find a
user-specified number of clusters (k), which are represented by their centroids.

Procedure

We first choose k initial centroids, where k is a user-specified parameter; namely, the number of
clusters desired. Each point is then assigned to the closest centroid, and each collection of points
assigned to a centroid is called a cluster. The centroid of each cluster is then updated based on the
points assigned to the cluster. We repeat the assignment and update steps until no point changes
clusters, or similarly, until the centroids remain the same.
Source: https://www.researchgate.net/figure/The-pseudo-code-for-K-means-clustering-
algorithm_fig2_273063437

Proximity Measures

For clustering, we need to define a proximity measure for two data points. Proximity here means
how similar/dissimilar the samples are with respect to each other.

 Similarity measure is large if features are similar.

 Dissimilarity measure is small if features are similar.

Data in Euclidean Space

Consider data whose proximity measure is Euclidean distance. For our objective function,
which measures the quality of a clustering, we use the sum of the squared error (SSE), which
is also known as scatter.
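The original figure with the objective function is not reproduced here; the standard definition, for centroids c_1, ..., c_K of clusters C_1, ..., C_K, is:

$$ \mathrm{SSE} \;=\; \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c_i \rVert^{2} $$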

In other words, we calculate the error of each data point, i.e., its Euclidean distance to the closest
centroid, and then compute the total sum of the squared errors. Given two different sets of
clusters that are produced by two different runs of K-means, we prefer the one with the smallest
squared error, since this means that the prototypes (centroids) of this clustering are a better
representation of the points in their cluster.
Document Data

To illustrate that K-means is not restricted to data in Euclidean space, we consider document
data and the cosine similarity measure:

Implementation in scikit-learn

It merely takes four lines to apply the algorithm in Python with sklearn: import the estimator,
create an instance, fit it on the data, and predict the cluster of new samples:
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
... [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
>>> kmeans.cluster_centers_
array([[10., 2.],
[ 1., 2.]])

Parameter tuning in scikit-learn

1. n_clusters: int, default=8. Defines the number of clusters to form, as well as the number of
centroids to generate.

2. max_iter: int, default=300. The maximum number of iterations of the k-means clustering
algorithm for a single run.

3. algorithm: {"auto", "full", "elkan"}, default="auto". Which K-means algorithm to use. The
classical EM-style algorithm is "full". The "elkan" variation is more efficient by using the
triangle inequality, but it currently doesn't support sparse data. "auto" chooses "elkan" for
dense data and "full" for sparse data.

Time and Space Complexity

The space requirements for k-means clustering are modest, because only the data points and
centroids are stored. Specifically, the storage required is O((m + K)n), where m is the number of
points and n is the number of attributes. The time requirements for k-means are also modest —
basically linear in terms of the number of data points. In particular, the time required is
O(I∗K∗m∗n), where I is the number of iterations required for convergence.


Choosing Initial Centroids

When random initialization of centroids is used, different runs of K-means typically produce
different total SSEs. Choosing the proper initial centroids is the key step of the basic K-means
procedure. A common approach is to choose the initial centroids randomly, but the resulting
clusters are often poor.
Another technique that’s commonly used to address the problem of choosing initial centroids is
to perform multiple runs, each with a different set of randomly-chosen initial centroids, and then
select the set of clusters with the minimum SSE.

But often, random initialization leads to sub-optimal results, and may not work well in cases with
clusters of different shapes and densities, or centroids located too far from or too close to each
other. This can result in overlapping clusters of different classes, or in points from the same
class being split across several clusters.

Bisecting k-means: An Improvement

The bisecting k-means algorithm is a straightforward extension of the basic k-means
algorithm that's based on a simple idea: to obtain K clusters, split the set of all points into two
clusters, select one of these clusters to split, and so on, until K clusters have been produced.
This tends to reduce the SSE and usually yields a better clustering than a single randomly
initialized run.
Choosing K

There can be various methods to determine the optimal value of k for convergence of the
algorithm and to make clear distinction between clusters or different classes in a dataset.

Elbow Method

There's a popular method known as the elbow method, which is used to determine the optimal
value of k for clustering. The basic idea behind this method is to plot the cost (distortion) for a
range of values of k. The point after which the distortion stops declining sharply is the elbow
point, which serves as the optimal value of k.
Silhouette Method

In the silhouette method, we assume that the data has already been clustered into k clusters by k-
means clustering. Then for each data point, we define the following:

 C(i): The cluster assigned to the ith data point

 |C(i)|: The number of data points in the cluster assigned to the ith data point

 a(i): The average dissimilarity of the ith data point to the other points in its own cluster (a measure of how well the point is assigned)
 b(i): Defined as the average dissimilarity to the closest cluster which is not its own cluster

The silhouette coefficient s(i) is given by:
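The formula image from the original article is not reproduced here; the standard definition is:

$$ s(i) \;=\; \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$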

We compute the average silhouette over all data points for each value of k, and the value of k
with the maximum average silhouette is considered the optimal number of clusters for the
unsupervised learning algorithm.

The Curse of Dimensionality

The curse of dimensionality refers to various phenomena that arise when analyzing and
organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions)
that do not occur in low-dimensional settings, such as the three-dimensional physical space of
everyday experience.

The common theme of these problems is that when the dimensionality increases, the volume of
the space increases so fast that the available data become sparse. This sparsity is problematic for
any method that requires statistical significance.

In order to obtain a statistically sound and reliable result, the amount of data needed to support
the result often grows exponentially with dimensionality. Also, organizing and searching data
often relies on detecting areas where objects form groups with similar properties; in high-
dimensional data, however, all objects appear to be sparse and dissimilar in many ways, which
prevents common data organization strategies from being efficient.
In the case of k-means clustering, the curse of dimensionality makes it difficult to cluster data
because of the vastness of the data space. For example, with Euclidean distance as the proximity
measure, two data points that are actually very dissimilar could be grouped together because,
across many dimensions, their net distances from the centroid happen to be comparable.

Advantages of k-means clustering

1. K-means clustering is relatively simple to implement, and can be implemented without
using frameworks, with just a simple programming language and one's own proximity measures.

2. The algorithm is known to easily adapt to new examples.

3. It guarantees convergence by trying to minimize the total SSE as an objective function
over a number of iterations.

4. The algorithm is fast and efficient in terms of computational cost, which is typically
O(K*n*d).

Disadvantages of k-means clustering

1. Choosing k manually. This is the greatest factor in the convergence of the algorithm
and can provide widely different results for different values of k.

2. Clustering data of varying sizes and density. K-means doesn't perform well with
clusters of different sizes, shapes, and density. To cluster such data, you need to generalize k-means.

3. Clustering outliers. Outliers must be removed before clustering, or they may affect the
position of the centroid or make a new cluster of their own.

4. Being dependent on initial values. As the value of k increases, other algorithms (i.e.
k-means seeding) need to be applied to give better values for the initial centroids.

5. Scaling with number of dimensions. As the number of dimensions increases, the
difficulty in getting the algorithm to converge increases due to the curse of dimensionality,
discussed above.

6. If there is overlapping between clusters, k-means doesn't have an intrinsic measure
for uncertainty; thus it's difficult to identify which points in the overlapping region should be
assigned to which class.

How to prepare your data for k-means clustering

1. The algorithm provides the best results when the data points are well separated from each
other; thus, we must ensure that all the data points are as similar as possible to their own
centroid and as different as possible from the other centroids. Several iterations are required for
convergence, and we can also use methods like splitting clusters, choosing one centroid
randomly, and placing the next centroid as far from the previously chosen one as possible.
All of these techniques can help reduce the overall SSE.

2. Scale/standardize the data when applying the k-means algorithm. Because the algorithm
depends on the distances of the data points from the centroid, if the features are not all scaled,
some features may dominate the data space and lead to biased results.

Sources to get started with K-means clustering

Here are a few sources (Kaggle notebooks, all on www.kaggle.com) which will help you to implement
k-means on your dataset:

 K-Means Clustering + PCA (using data from Simplified Human Activity Recognition)
 K-Means Clustering Implementation in Python (using data from Iris Species)
 Tutorial: Clustering wines with k-means (using data from Wine_pca)

Conclusion

In this post, we read about k-means clustering in detail and gained insights about the
mathematics behind it. Despite being widely used and strongly supported, it has its share of
advantages and disadvantages.

Let me know if you liked the article and how I can improve it. All feedback is welcome. Check out
my other articles in the series: Understanding the mathematics behind Naive
Bayes, Support Vector Machines and Principal Component Analysis.

I’ll be exploring the mathematics involved in other foundational machine learning algorithms in
future posts, so stay tuned.


https://www.geeksforgeeks.org/k-means-clustering-introduction/

K means Clustering – Introduction

We are given a data set of items, with certain features and values for these features (like a
vector). The task is to categorize those items into groups. To achieve this, we will use the
kMeans algorithm, an unsupervised learning algorithm.
Overview
(It will help if you think of items as points in an n-dimensional space.) The algorithm will
categorize the items into k groups of similarity. To calculate that similarity, we will use the
Euclidean distance as the measure.

The algorithm works as follows:

1. First, we initialize k points, called means, randomly.
2. We categorize each item to its closest mean and we update the mean's coordinates,
which are the averages of the items categorized in that mean so far.
3. We repeat the process for a given number of iterations and at the end, we have our
clusters.
The “points” mentioned above are called means, because they hold the mean values of the
items categorized in it. To initialize these means, we have a lot of options. An intuitive method is
to initialize the means at random items in the data set. Another method is to initialize the means
at random values between the boundaries of the data set (if for a feature x the items have
values in [0,3], we will initialize the means with values for x at [0,3]).
The above algorithm in pseudocode:
Initialize k means with random values

For a given number of iterations:


Iterate through items:
Find the mean closest to the item
Assign item to mean
Update mean
Read Data
We receive input as a text file (‘data.txt’). Each line represents an item, and it contains
numerical values (one for each feature) split by commas. You can find a sample data set here.
We will read the data from the file, saving it into a list. Each element of the list is another list
containing the item values for the features. We do this with the following function:
# Imports used by the functions below
import math
import sys
from random import shuffle, uniform

def ReadData(fileName):

  
    # Read the file, splitting by lines
    f = open(fileName, 'r');
    lines = f.read().splitlines();
    f.close();

  
    items = [];

  
    for i in range(1, len(lines)):
        line = lines[i].split(',');
        itemFeatures = [];

  
        for j in range(len(line)-1):
            v = float(line[j]); # Convert feature value to float
            itemFeatures.append(v); # Add feature value to dict

  
        items.append(itemFeatures);

  
    shuffle(items);

  
    return items;
Initialize Means
We want to initialize each mean’s values in the range of the feature values of the items. For
that, we need to find the min and max for each feature. We accomplish that with the following
function:
def FindColMinMax(items):
    n = len(items[0]);
    minima = [sys.maxsize for i in range(n)];
    maxima = [-sys.maxsize - 1 for i in range(n)];

    for item in items:
        for f in range(len(item)):
            if (item[f] < minima[f]):
                minima[f] = item[f];

            if (item[f] > maxima[f]):
                maxima[f] = item[f];

    return minima, maxima;
The variables minima, maxima are lists containing the min and max values of the items
respectively. We initialize each mean’s feature values randomly between the corresponding
minimum and maximum in those above two lists:
def InitializeMeans(items, k, cMin, cMax):

    # Initialize means to random numbers between
    # the min and max of each column/feature
    f = len(items[0]); # number of features
    means = [[0 for i in range(f)] for j in range(k)];

    for mean in means:
        for i in range(len(mean)):

            # Set value to a random float
            # (adding +-1 to avoid a wide placement of a mean)
            mean[i] = uniform(cMin[i]+1, cMax[i]-1);

    return means;
Euclidean Distance

We will be using the euclidean distance as a metric of similarity for our data set (note:
depending on your items, you can use another similarity metric).
def EuclideanDistance(x, y):
    S = 0; #  The sum of the squared differences of the elements
    for i in range(len(x)):
        S += math.pow(x[i]-y[i], 2);

  
    return math.sqrt(S); #The square root of the sum
Update Means
To update a mean, we need to find the average value for its feature, for all the items in the
mean/cluster. We can do this by adding all the values and then dividing by the number of items,
or we can use a more elegant solution. We will calculate the new average without having to re-
add all the values, by doing the following:
m = (m*(n-1)+x)/n
where m is the mean value for a feature, n is the number of items in the cluster and x is the
feature value for the added item. We do the above for each feature to get the new mean.
def UpdateMean(n,mean,item):
    for i in range(len(mean)):
        m = mean[i];
        m = (m*(n-1)+item[i])/float(n);
        mean[i] = round(m, 3);

      
    return mean;
Classify Items
Now we need to write a function to classify an item to a group/cluster. For the given item, we will
find its similarity to each mean, and we will classify the item to the closest one.
def Classify(means,item):

    # Classify item to the mean with minimum distance
    minimum = sys.maxsize;
    index = -1;

    for i in range(len(means)):

        # Find distance from item to mean
        dis = EuclideanDistance(item, means[i]);

        if (dis < minimum):
            minimum = dis;
            index = i;

    return index;
Find Means
To actually find the means, we will loop through all the items, classify them to their nearest
cluster and update the cluster’s mean. We will repeat the process for some fixed number of
iterations. If between two iterations no item changes classification, we stop the process as the
algorithm has found the optimal solution.
The below function takes as input k (the number of desired clusters), the items and the number
of maximum iterations, and returns the means and the clusters. The classification of an item is
stored in the array belongsTo and the number of items in a cluster is stored in clusterSizes.
def CalculateMeans(k,items,maxIterations=100000):

  
    # Find the minima and maxima for columns
    cMin, cMax = FindColMinMax(items);

      
    # Initialize means at random points
    means = InitializeMeans(items,k,cMin,cMax);

      
    # Initialize clusters, the array to hold
    # the number of items in a class
    clusterSizes= [0 for i in range(len(means))];

  
    # An array to hold the cluster an item is in
    belongsTo = [0 for i in range(len(items))];

  
    # Calculate means
    for e in range(maxIterations):

  
        # If no change of cluster occurs, halt
        noChange = True;
        for i in range(len(items)):

  
            item = items[i];

  
            # Classify item into a cluster and update the
            # corresponding means.        
            index = Classify(means,item);

  
            clusterSizes[index] += 1;
            cSize = clusterSizes[index];
            means[index] = UpdateMean(cSize,means[index],item);

  
            # Item changed cluster
            if(index != belongsTo[i]):
                noChange = False;

  
            belongsTo[i] = index;

  
        # Nothing changed, return
        if (noChange):
            break;

  
    return means;
Find Clusters

Finally we want to find the clusters, given the means. We will iterate through all the items and
we will classify each item to its closest cluster.
def FindClusters(means,items):
    clusters = [[] for i in range(len(means))]; # Init clusters

      
    for item in items:

  
        # Classify item into a cluster
        index = Classify(means,item);

  
        # Add item to cluster
        clusters[index].append(item);

  
    return clusters;
The other popularly used similarity measures are:-
1. Cosine distance: It determines the cosine of the angle between the point vectors of the two
points in the n dimensional space

2. Manhattan distance: It computes the sum of the absolute differences between the co-
ordinates of the two data points.

3. Minkowski distance: It is also known as the generalised distance metric. It can be used for
both ordinal and quantitative variables
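As a rough sketch (plain Python, not taken from the article) of how these measures could be implemented alongside the Euclidean distance above:

import math

def CosineDistance(x, y):
    # 1 minus the cosine of the angle between the two point vectors
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)

def ManhattanDistance(x, y):
    # Sum of the absolute differences between coordinates
    return sum(abs(a - b) for a, b in zip(x, y))

def MinkowskiDistance(x, y, p=3):
    # Generalised metric: p=1 gives Manhattan, p=2 gives Euclidean
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)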

You can find the entire code on my GitHub, along with a sample data set and a plotting function.
Thanks for reading.
This article is contributed by Antonis Maronikolakis.

Book: Hands-On Unsupervised Learning (Ankur A. Patel, O'Reilly):


k-Means

The objective of clustering is to identify distinct groups in a dataset such that the observations within a
group are similar to each other but different from observations in other groups. In k-means clustering, we
specify the number of desired clusters k, and the algorithm will assign each observation to exactly one of
these k clusters. The algorithm optimizes the groups by minimizing the within-cluster variation (also
known as inertia) such that the sum of the within-cluster variations across all k clusters is as small as
possible.

Different runs of k-means will result in slightly different cluster assignments because k-means randomly
assigns each observation to one of the k clusters to kick off the clustering process. k-means does this
random initialization to speed up the clustering process. After this random initialization, k-means
reassigns the observations to different clusters as it attempts to minimize the Euclidean distance between
each observation and its cluster’s center point, or centroid. This random initialization is a source of
randomness, resulting in slightly different clustering assignments, from one k-means run to another.
Typically, the k-means algorithm does several runs and chooses the run that has the best separation,
defined as the lowest total sum of within-cluster variations across all k clusters.

k-Means Inertia

Let’s introduce the algorithm. We need to set the number of clusters we would like (n_clusters), the
number of initializations we would like to perform (n_init), the maximum number of iterations the
algorithm will run to reassign observations to minimize inertia (max_iter), and the tolerance to declare
convergence (tol).

We will keep the default values for number of initializations (10), maximum number of iterations (300),
and tolerance (0.0001). Also, for now, we will use the first 100 principal components from PCA
(cutoff). To test how the number of clusters we designate affects the inertia measure, let's run
k-means for cluster sizes 2 through 20 and record the inertia for each.

Here is the code:

# k-means - Inertia as the number of clusters varies

from sklearn.cluster import KMeans

n_clusters = 10
n_init = 10
max_iter = 300
tol = 0.0001
random_state = 2018
n_jobs = 2

kMeans_inertia = pd.DataFrame(data=[], index=range(2, 21), \
                              columns=['inertia'])

for n_clusters in range(2, 21):
    kmeans = KMeans(n_clusters=n_clusters, n_init=n_init, \
                    max_iter=max_iter, tol=tol, \
                    random_state=random_state, n_jobs=n_jobs)

    cutoff = 99
    kmeans.fit(X_train_PCA.loc[:, 0:cutoff])
    kMeans_inertia.loc[n_clusters] = kmeans.inertia_

As Figure 5-1 shows, the inertia decreases as the number of clusters increases. This makes sense. The
more clusters we have, the greater the homogeneity among observations within each cluster. However,
fewer clusters are easier to work with than more, so finding the right number of clusters to generate is an
important consideration when running k-means.
Figure 5-1. k-means inertia for cluster sizes 2 through 20
Evaluating the Clustering Results

To demonstrate how k-means works and how increasing the number of clusters results in more
homogeneous clusters, let’s define a function to analyze the results of each experiment we do. The cluster
assignments—generated by the clustering algorithm—will be stored in a Pandas DataFrame called
clusterDF.

Let’s count the number of observations in each cluster and store these in a Pandas DataFrame called
countByCluster:

def analyzeCluster(clusterDF, labelsDF):
    countByCluster = \
        pd.DataFrame(data=clusterDF['cluster'].value_counts())
    countByCluster.reset_index(inplace=True, drop=False)
    countByCluster.columns = ['cluster', 'clusterCount']

Next, let’s join the clusterDF with the true labels array, which we will call labelsDF:

    preds = pd.concat([labelsDF, clusterDF], axis=1)
    preds.columns = ['trueLabel', 'cluster']

Let’s also count the number of observations for each true label in the training set (this won’t change but is
good for us to know):
    countByLabel = pd.DataFrame(data=preds.groupby('trueLabel').count())

Now, for each cluster, we will count the number of observations for each distinct label within a cluster.
For example, if a given cluster has three thousand observations, two thousand may represent the number
two, five hundred may represent the number one, three hundred may represent the number zero, and the
remaining two hundred may represent the number nine.

Once we calculate these, we will store the count for the most frequently occurring number for each
cluster. In the example above, we would store a count of two thousand for this cluster:

    countMostFreq = \
        pd.DataFrame(data=preds.groupby('cluster').agg( \
            lambda x: x.value_counts().iloc[0]))
    countMostFreq.reset_index(inplace=True, drop=False)
    countMostFreq.columns = ['cluster', 'countMostFrequent']

Finally, we will judge the success of each clustering run based on how tightly grouped the observations
are within each cluster. For example, in the example above, the cluster has two thousand observations that
have the same label out of a total of three thousand observations in the cluster.

This cluster is not great since we ideally want to group similar observations together in the same cluster
and exclude dissimilar ones.

Let’s define the overall accuracy of the clustering as the sum of the counts of the most frequently
occuring observations across all the clusters divided by the total number of observations in the training set
(i.e., 50,000):

    accuracyDF = countMostFreq.merge(countByCluster, \
        left_on="cluster", right_on="cluster")
    overallAccuracy = accuracyDF.countMostFrequent.sum() / \
        accuracyDF.clusterCount.sum()

We can also assess the accuracy by cluster:

    accuracyByLabel = accuracyDF.countMostFrequent / \
        accuracyDF.clusterCount

For the sake of conciseness, we have all this code in a single function, available on GitHub.

k-Means Accuracy

Let’s now perform the experiments we did earlier, but instead of calculating inertia, we will calculate the
overall homogeneity of the clusters based on the accuracy measure we’ve defined for this MNIST digits
dataset:

# k-means - Accuracy as the number of clusters varies

n_clusters = 5
n_init = 10
max_iter = 300
tol = 0.0001
random_state = 2018
n_jobs = 2

kMeans_inertia = \
    pd.DataFrame(data=[], index=range(2, 21), columns=['inertia'])
overallAccuracy_kMeansDF = \
    pd.DataFrame(data=[], index=range(2, 21), columns=['overallAccuracy'])

for n_clusters in range(2, 21):
    kmeans = KMeans(n_clusters=n_clusters, n_init=n_init, \
                    max_iter=max_iter, tol=tol, \
                    random_state=random_state, n_jobs=n_jobs)

    cutoff = 99
    kmeans.fit(X_train_PCA.loc[:, 0:cutoff])
    kMeans_inertia.loc[n_clusters] = kmeans.inertia_

    X_train_kmeansClustered = kmeans.predict(X_train_PCA.loc[:, 0:cutoff])
    X_train_kmeansClustered = \
        pd.DataFrame(data=X_train_kmeansClustered, index=X_train.index, \
                     columns=['cluster'])

    countByCluster_kMeans, countByLabel_kMeans, countMostFreq_kMeans, \
        accuracyDF_kMeans, overallAccuracy_kMeans, accuracyByLabel_kMeans \
        = analyzeCluster(X_train_kmeansClustered, y_train)

    overallAccuracy_kMeansDF.loc[n_clusters] = overallAccuracy_kMeans

Figure 5-2 shows the plot of the overall accuracy for different cluster sizes.
https://medium.com/rahasak/k-means-clustering-with-apache-spark-cab44aef0a16

Happy ML

This is the first part of my Happy ML blog series. In this post I will discuss Machine
Learning basics and the K-Means unsupervised machine learning algorithm with an example. The
second part of this blog series, which discusses the Logistic Regression algorithm, can be
found here.

About Machine Learning

Machine learning uses algorithms to find patterns in data. It first builds a model based on the
patterns in existing/historical data, then uses this model to make predictions on newly generated
live data. In general, machine learning can be categorized into three main
categories: Supervised, Unsupervised and Reinforcement machine learning.

Supervised machine learning, also identified as Predictive Modeling, builds on labeled
data (data with defined categories or groups). Classification and Regression are two types of
problems in supervised machine learning. Decision Tree, Linear Regression and Logistic
Regression are some examples of supervised machine learning algorithms. Unsupervised
machine learning finds patterns in unlabeled data (data without defined categories or groups). It
deals with two types of problems, Clustering and Dimensionality Reduction. Examples of
Unsupervised machine learning algorithms are K-Means, K-Medoids and Feature
Selection. Reinforcement machine learning uses a combination of labeled and unlabeled data.
Since there are several machine learning algorithms available, we have to choose the right
algorithm to solve our problem. This article describes the available machine learning algorithms
and their application scenarios.

In this post I'm gonna use the K-Means algorithm to build a machine learning model with Apache
Spark (if you are new to Apache Spark, please find more information here). The K-Means
model clusters the uber trip data based on the trip attributes. Then this model can be used to do
real-time analysis of new uber trips. All the source code and the dataset related to this post are
available on gitlab. Please clone the repo and continue with the post.

About K-Means

K-Means clustering is one of the simplest and most popular unsupervised machine learning
algorithms. The goal of this algorithm is to find groups in the data, with the number of
groups/clusters represented by the variable K. The K-Means algorithm iteratively allocates every
data point to the nearest cluster based on the features. In every iteration of the algorithm, each
data point is assigned to its nearest cluster based on some distance metric, which is
usually Euclidean distance. The outputs of the K-means clustering algorithm are the
centroids of the K clusters and the labels of the training data. Once the algorithm has run and
identified the groups in a data set, any new data can easily be assigned to a group.
The K-Means algorithm can be used to identify unknown groups in complex and unlabeled data
sets. Following are some business use cases of K-Means clustering.

1. Customer segmentation based on purchase history

2. Customer segmentation based on interest

3. Insurance fraud detection

4. Transaction fraud detection

5. Detect unauthorized IoT devices based on network traffic

6. Identify crime localities

7. Group inventory by sales

Uber data set

As mentioned previously, I'm gonna use K-Means to build a model from the uber trip data. This
model clusters the uber trips based on trip attributes/features (lat, lon, etc). The uber trip data
set exists in the gitlab repo as a .CSV file. Following is the structure/schema of a single uber
trip record.

Load data set

To build a K-Means model from this data set, first we need to load the data set into a
Spark DataFrame. Following is the way to do that; it loads the data into a DataFrame
from the .CSV file based on the schema.

Add feature column

We need to transform the features in the DataFrame records (the lat, lon values of each record)
into a feature vector. In order for the features to be used by a machine learning algorithm, this
vector needs to be added as a feature column to the DataFrame. Following is the way to do that
with VectorAssembler.

Build K-Means model

Next we can build the K-Means model by defining the number of clusters, the feature column and
the output prediction column. In order to train and test the K-Means model, the data set needs to
be split into a training data set and a test data set. 70% of the data is used to train the model,
and 30% will be used for testing.
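The original post shows these steps in Scala; as a loosely equivalent PySpark sketch (the CSV path, the lat/lon column names and the choice of k are assumptions for illustration):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("uber-kmeans").getOrCreate()

# Load the trip data into a DataFrame (path assumed)
df = spark.read.csv("uber.csv", header=True, inferSchema=True)

# Assemble the lat/lon columns into a single feature vector column
assembler = VectorAssembler(inputCols=["lat", "lon"], outputCol="features")
features_df = assembler.transform(df)

# 70/30 train/test split, then fit a K-Means model with an assumed k
train, test = features_df.randomSplit([0.7, 0.3], seed=42)
kmeans = KMeans(k=8, featuresCol="features", predictionCol="prediction")
model = kmeans.fit(train)

# Save the model for later use and predict clusters for new data
model.write().overwrite().save("kmeans-uber-model")
model.transform(test).select("lat", "lon", "prediction").show(5)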

Save K-Means model


The built model can be persisted to disk in order to use it later, for example with a Spark
Streams application to detect the clusters of real-time uber trips.

Use K-Means model

Finally, the K-Means model can be used to detect the cluster/category of new data (e.g. real-time
uber trip data). The following example shows detecting the clusters of sample records in a DataFrame.

Reference

1. https://www.quora.com/What-is-machine-learning-in-laymans-terms-1

2. https://www.goodworklabs.com/machine-learning-algorithm/

3. https://mapr.com/blog/apache-spark-machine-learning-tutorial/

4. https://mapr.com/blog/fast-data-processing-pipeline-predicting-flight-delays-using-
apache-apis-pt-1/

5. https://www.datascience.com/blog/k-means-clustering

6. https://medium.com/rahasak/hacking-with-apache-spark-f6b0cabf0703

7. https://medium.com/rahasak/hacking-with-spark-dataframe-d717404c5812

https://www.kaggle.com/xvivancos/tutorial-clustering-wines-with-k-means ( R Analysis )

Clustering wines with k-means


Xavier Vivancos García
2020-03-25

 1 Introduction
 2 Loading data
 3 Data analysis
 4 Data preparation
 5 k-means execution
 6 How many clusters?
 7 Results
 8 Summary
 9 Citations for used packages
1 Introduction
k-means is an unsupervised machine learning algorithm used to find groups of observations
(clusters) that share similar characteristics. What is the meaning of unsupervised learning? It
means that the observations given in the data set are unlabeled, there is no outcome to be
predicted. We are going to use a Wine data set to cluster different types of wines. This data set
contains the results of a chemical analysis of wines grown in a specific area of Italy.

2 Loading data
First we need to load some libraries and read the data set.
# Load libraries

library(tidyverse)

library(corrplot)

library(gridExtra)

library(GGally)

library(knitr)

# Read the stats

wines <- read.csv("../input/Wine.csv")

We don’t need the Customer_Segment column. As we have said before, k-means is an
unsupervised machine learning algorithm and works with unlabeled data.
# Remove the Type column

wines <- wines[, -14]

Let’s get an idea of what we’re working with.

 2.1 First rows
 2.2 Last rows
 2.3 Summary
 2.4 Structure

# First rows

kable(head(wines))
Alcohol  Malic_Acid  Ash   Ash_Alcanity  Magnesium  Total_Phenols  Flavanoids  Nonflavanoid_Phenols  Proanthocyanins  Color_Intensity  Hue   OD280
14.23    1.71        2.43  15.6          127        2.80           3.06        0.28                  2.29             5.64             1.04  3.9
13.20    1.78        2.14  11.2          100        2.65           2.76        0.26                  1.28             4.38             1.05  3.4
13.16    2.36        2.67  18.6          101        2.80           3.24        0.30                  2.81             5.68             1.03  3.1
14.37    1.95        2.50  16.8          113        3.85           3.49        0.24                  2.18             7.80             0.86  3.4
13.24    2.59        2.87  21.0          118        2.80           2.69        0.39                  1.82             4.32             1.04  2.9
14.20    1.76        2.45  15.2          112        3.27           3.39        0.34                  1.97             6.75             1.05  2.8
(The OD280 values are truncated to one decimal and the Proline column was cut off in extraction.)

3 Data analysis
First we have to explore and visualize the data.
# Histogram for each Attribute

wines %>%

gather(Attributes, value, 1:13) %>%

ggplot(aes(x=value, fill=Attributes)) +

geom_histogram(colour="black", show.legend=FALSE) +
facet_wrap(~Attributes, scales="free_x") +

labs(x="Values", y="Frequency",

title="Wines Attributes - Histograms") +

theme_bw()

# Density plot for each Attribute

wines %>%

gather(Attributes, value, 1:13) %>%

ggplot(aes(x=value, fill=Attributes)) +

geom_density(colour="black", alpha=0.5, show.legend=FALSE) +

facet_wrap(~Attributes, scales="free_x") +

labs(x="Values", y="Density",

title="Wines Attributes - Density plots") +

theme_bw()

# Boxplot for each Attribute

wines %>%

gather(Attributes, values, c(1:4, 6:12)) %>%

ggplot(aes(x=reorder(Attributes, values, FUN=median), y=values,


fill=Attributes)) +

geom_boxplot(show.legend=FALSE) +

labs(title="Wines Attributes - Boxplots") +

theme_bw() +

theme(axis.title.y=element_blank(),

axis.title.x=element_blank()) +

ylim(0, 35) +

coord_flip()
We haven’t included magnesium and proline, since their values are very high and worsen the
visualization.
What is the relationship between the different attributes? We can use the corrplot() function
to create a graphical display of a correlation matrix.
# Correlation matrix

corrplot(cor(wines), type="upper", method="ellipse", tl.cex=0.9)

There is a strong linear correlation between Total_Phenols and Flavanoids. We can model
the relationship between these two variables by fitting a linear equation.
# Relationship between Phenols and Flavanoids

ggplot(wines, aes(x=Total_Phenols, y=Flavanoids)) +

geom_point() +

geom_smooth(method="lm", se=FALSE) +

labs(title="Wines Attributes",

subtitle="Relationship between Phenols and Flavanoids") +

theme_bw()

Now that we have done an exploratory data analysis, we can prepare the data in order to execute
the k-means algorithm.

4 Data preparation
We have to normalize the variables to express them in the same range of values. In other
words, normalization means adjusting values measured on different scales to a common scale.
# Normalization

winesNorm <- as.data.frame(scale(wines))

# Original data

p1 <- ggplot(wines, aes(x=Alcohol, y=Malic_Acid)) +

geom_point() +
labs(title="Original data") +

theme_bw()

# Normalized data

p2 <- ggplot(winesNorm, aes(x=Alcohol, y=Malic_Acid)) +

geom_point() +

labs(title="Normalized data") +

theme_bw()

# Subplot

grid.arrange(p1, p2, ncol=2)

The points in the normalized data are the same as in the original data; the only thing that changes
is the scale of the axes.

5 k-means execution
In this section we are going to execute the k-means algorithm and analyze the main
components that the function returns.
# Execution of k-means with k=2

set.seed(1234)

wines_k2 <- kmeans(winesNorm, centers=2)

The kmeans() function returns an object of class “kmeans” with information about the partition:

 cluster. A vector of integers indicating the cluster to which each point is allocated.
 centers. A matrix of cluster centers.
 size. The number of points in each cluster.
# Cluster to which each point is allocated

wines_k2$cluster

## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 2
1

## [71] 2 1 2 1 1 2 1 2 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 1 2 1 1 1 1 2 2 2
1

## [106] 2 2 2 2 1 1 2 2 2 2 1 2 2 2 2 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2

## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2

## [176] 2 2 2

# Cluster centers

wines_k2$centers

## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols

## 1 0.3248845 -0.3529345 0.05207966 -0.4899811 0.3206911 0.7826625

## 2 -0.3106038 0.3374209 -0.04979045 0.4684435 -0.3065948 -0.7482598

## Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity

## 1 0.8235093 -0.5921337 0.6378483 -0.1024529

## 2 -0.7873111 0.5661058 -0.6098110 0.0979495

## Hue OD280 Proline

## 1 0.5633135 0.7146506 0.6051873

## 2 -0.5385525 -0.6832374 -0.5785857

# Cluster size

wines_k2$size

## [1] 87 91

Additionally, the kmeans() function returns some quantities that let us know how compact a
cluster is and how different the clusters are from one another.

 betweenss. The between-cluster sum of squares. In an optimal segmentation, one expects this value to be as high as possible, since we would like the clusters to be well separated from each other (heterogeneous).
 withinss. Vector of within-cluster sums of squares, one component per cluster. In an optimal segmentation, one expects this value to be as low as possible for each cluster, since we would like to have homogeneity within the clusters.
 tot.withinss. Total within-cluster sum of squares.
 totss. The total sum of squares.
# Between-cluster sum of squares

wines_k2$betweenss

## [1] 651.56

# Within-cluster sum of squares

wines_k2$withinss

## [1] 765.0965 884.3435

# Total within-cluster sum of squares

wines_k2$tot.withinss

## [1] 1649.44

# Total sum of squares

wines_k2$totss

## [1] 2301

6 How many clusters?


To study graphically which value of k gives us the best partition, we can
plot betweenss and tot.withinss vs Choice of k.
bss <- numeric()
wss <- numeric()

# Run the algorithm for different values of k
set.seed(1234)

for(i in 1:10){

  # For each k, calculate betweenss and tot.withinss
  bss[i] <- kmeans(winesNorm, centers=i)$betweenss
  wss[i] <- kmeans(winesNorm, centers=i)$tot.withinss
}

# Between-cluster sum of squares vs Choice of k
p3 <- qplot(1:10, bss, geom=c("point", "line"),
            xlab="Number of clusters", ylab="Between-cluster sum of squares") +
  scale_x_continuous(breaks=seq(0, 10, 1)) +
  theme_bw()

# Total within-cluster sum of squares vs Choice of k
p4 <- qplot(1:10, wss, geom=c("point", "line"),
            xlab="Number of clusters", ylab="Total within-cluster sum of squares") +
  scale_x_continuous(breaks=seq(0, 10, 1)) +
  theme_bw()

# Subplot
grid.arrange(p3, p4, ncol=2)

Which is the optimal value for k? One should choose a number of clusters such that adding
another cluster doesn't give a much better partition of the data. At some point the gain will drop,
producing an angle in the graph (the elbow criterion), and the number of clusters is chosen at this point. In
our case, it is clear that 3 is the appropriate value for k.

7 Results
# Execution of k-means with k=3

set.seed(1234)
wines_k3 <- kmeans(winesNorm, centers=3)

# Mean values of each cluster

aggregate(wines, by=list(wines_k3$cluster), mean)

## Group.1 Alcohol Malic_Acid Ash Ash_Alcanity Magnesium

## 1 1 13.67677 1.997903 2.466290 17.46290 107.96774

## 2 2 12.25092 1.897385 2.231231 20.06308 92.73846

## 3 3 13.13412 3.307255 2.417647 21.24118 98.66667

## Total_Phenols Flavanoids Nonflavanoid_Phenols Proanthocyanins

## 1 2.847581 3.0032258 0.2920968 1.922097

## 2 2.247692 2.0500000 0.3576923 1.624154

## 3 1.683922 0.8188235 0.4519608 1.145882

## Color_Intensity Hue OD280 Proline

## 1 5.453548 1.0654839 3.163387 1100.2258

## 2 2.973077 1.0627077 2.803385 510.1692

## 3 7.234706 0.6919608 1.696667 619.0588

# Clustering

ggpairs(cbind(wines, Cluster=as.factor(wines_k3$cluster)),

columns=1:6, aes(colour=Cluster, alpha=0.5),

lower=list(continuous="points"),

upper=list(continuous="blank"),

axisLabels="none", switch="both") +

theme_bw()

8 Summary
In this entry we have learned about the k-means algorithm, including the data normalization
before we execute it, the choice of the optimal number of clusters (elbow criterion) and the
visualization of the clustering.
It has been a pleasure to make this post, I have learned a lot! Thank you for reading and if you
like it, please upvote it.

9 Citations for used packages


Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository
[http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and
Computer Science.
Hadley Wickham (2017). tidyverse: Easily Install and Load the ‘Tidyverse’. R package version
1.2.1. https://CRAN.R-project.org/package=tidyverse
Taiyun Wei and Viliam Simko (2017). R package “corrplot”: Visualization of a Correlation Matrix
(Version 0.84). Available from https://github.com/taiyun/corrplot
Baptiste Auguie (2017). gridExtra: Miscellaneous Functions for “Grid” Graphics. R package
version 2.3. https://CRAN.R-project.org/package=gridExtra
Barret Schloerke, Jason Crowley, Di Cook, Francois Briatte, Moritz Marbach, Edwin Thoen,
Amos Elberg and Joseph Larmarange (2017). GGally: Extension to ‘ggplot2’. R package version
1.3.2. https://CRAN.R-project.org/package=GGally
Yihui Xie (2018). knitr: A General-Purpose Package for Dynamic Report Generation in R. R
package version 1.20.
Yihui Xie (2015) Dynamic Documents with R and knitr. 2nd edition. Chapman and Hall/CRC.
ISBN 978-1498716963
Yihui Xie (2014) knitr: A Comprehensive Tool for Reproducible Research in R. In Victoria
Stodden, Friedrich Leisch and Roger D. Peng, editors, Implementing Reproducible
Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595

https://www.geeksforgeeks.org/dbscan-clustering-in-ml-density-based-clustering/?ref=rp

Now the question to be raised is: why should we use DBSCAN when K-Means is the
widely used method in clustering analysis?
Disadvantages of K-Means:
1. K-Means forms spherical clusters only. The algorithm fails when the data is not spherical
(i.e. does not have the same variance in all directions).

2. The K-Means algorithm is sensitive to outliers. Outliers can skew the K-Means clusters
to a very large extent.

3. The K-Means algorithm requires one to specify the number of clusters a priori.
The DBSCAN algorithm overcomes all of the above-mentioned drawbacks of K-Means.
DBSCAN identifies dense regions by grouping together data points that
are close to each other based on a distance measurement.
A Python implementation of the algorithm without using the sklearn library can be found
here: dbscan_in_python.
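As a complement, a minimal scikit-learn sketch of DBSCAN on synthetic non-spherical data (the eps and min_samples values are illustrative and not taken from the article):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaving half-moons: a shape K-Means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# eps: neighborhood radius, min_samples: points needed to form a dense region
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

labels = db.labels_                       # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated clusters:", n_clusters)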
 
References :
https://en.wikipedia.org/wiki/DBSCAN
https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
https://www.geeksforgeeks.org/ml-determine-the-optimal-value-of-k-in-k-means-clustering/?ref=rp
ML | Determine the optimal value of K in K-Means Clustering

Prerequisite: K-Means Clustering | Introduction


There is a popular method known as the elbow method which is used to determine the optimal
value of K for the K-Means clustering algorithm. The basic idea behind this method is to
plot the cost for various values of K. As the value of K increases, there will be
fewer elements in each cluster, so the average distortion will decrease; fewer elements per
cluster means the elements are closer to their centroid. The point where this distortion stops
declining sharply is the elbow point.
3 clusters are forming
In the above figure, it is clearly observed that the points form 3 clusters.
Now, let's see the plot of the squared error (cost) for different values of K.
Elbow is forming at K=3
Clearly the elbow forms at K=3, so the optimal value for performing K-Means will be 3.

Another Example with 4 clusters.


4-clusters
Corresponding Cost graph-
Elbow is forming at K=4
In this case the optimal value for k would be 4 (observable from the scatter of the points).
Below is the Python implementation:
import matplotlib.pyplot as plt 
from matplotlib import style
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # the samples_generator path is deprecated in newer scikit-learn

  
style.use("fivethirtyeight")

  
# make_blobs() is used to generate sample points
# around c centers (randomly chosen)
X, y = make_blobs(n_samples = 100, centers = 4, 
                cluster_std = 1, n_features = 2)

                  
plt.scatter(X[:, 0], X[:, 1], s = 30, color ='b')

  
# label the axes
plt.xlabel('X')
plt.ylabel('Y')

  
plt.show()
plt.clf() # clear the figure
Output:

cost =[]
for i in range(1, 11):
    KM = KMeans(n_clusters = i, max_iter = 500)
    KM.fit(X)

      
    # calculates squared error
    # for the clustered points
    cost.append(KM.inertia_)     

  
# plot the cost against K values
plt.plot(range(1, 11), cost, color ='g', linewidth ='3')
plt.xlabel("Value of K")
plt.ylabel("Squared Error (Cost)")
plt.show() # display the plot
  
# the point of the elbow is the 
# most optimal value for choosing k
Output:

https://www.kaggle.com/ruslankl/k-means-clustering-pca ( Human activity)

K-Means Clustering and PCA of Human Activity Recognition


Ruslan Klymentiev

Date created: July 21st, 2018

Intro
Clustering was always a subject I tried to avoid (for no reason). In this project I will finally use
my knowledge of clustering and PCA algorithms to explore the Human Activity Recognition
dataset.

I would love to point to the resources I have learned from:

1. DataCamp Tutorial: Python Machine Learning: Scikit-Learn Tutorial;


2. DataCamp course: Unsupervised Learning in Python;
3. Cognitive Class course: Machine Learning with Python;
4. And of course Prof. Google!
Dataset info
Human Activity Recognition database built from the recordings of 30 subjects performing
activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded
inertial sensors. The experiments have been carried out with a group of 30 volunteers within an
age bracket of 19-48 years. Each person performed six activities (WALKING,
WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a
smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and
gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant
rate of 50Hz. The experiments have been video-recorded to label the data manually.

In [1]:
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from IPython.display import display
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
from sklearn.metrics import homogeneity_score, completeness_score, \
    v_measure_score, adjusted_rand_score, adjusted_mutual_info_score, \
    silhouette_score
%matplotlib inline

np.random.seed(123)
In [2]:
Data = pd.read_csv('../input/train.csv')
In [3]:
Data.sample(5)
Out[3]:
[Output truncated: a sample of 5 rows from the 563-column data frame. The columns are the rn index, the activity label, and the sensor features (e.g. tBodyAcc.mean.X/Y/Z, tBodyAcc.std.X/Y/Z, tBodyAcc.mad.X/Y/Z, tBodyAcc.arCoeff.*, and the frequency-domain fBody* features); the wide table did not survive extraction.]
In [4]:
print('Shape of the data set: ' + str(Data.shape))
Shape of the data set: (3609, 563)
In [5]:
#save labels as string
Labels = Data['activity']
Data = Data.drop(['rn', 'activity'], axis = 1)
Labels_keys = Labels.unique().tolist()
Labels = np.array(Labels)
print('Activity labels: ' + str(Labels_keys))
Activity labels: ['STANDING', 'SITTING', 'LAYING', 'WALKING',
'WALKING_DOWNSTAIRS', 'WALKING_UPSTAIRS']
In [6]:
#check for missing values
Temp = pd.DataFrame(Data.isnull().sum())
Temp.columns = ['Sum']
print('Amount of rows with missing values: ' + str(len(Temp.index[Temp['Sum']
> 0])) )
Amount of rows with missing values: 0
In [7]:
#normalize the dataset
scaler = StandardScaler()
Data = scaler.fit_transform(Data)
In [8]:
#check the optimal k value
ks = range(1, 10)
inertias = []

for k in ks:
    model = KMeans(n_clusters=k)
    model.fit(Data)
    inertias.append(model.inertia_)

plt.figure(figsize=(8,5))
plt.style.use('bmh')
plt.plot(ks, inertias, '-o')
plt.xlabel('Number of clusters, k')
plt.ylabel('Inertia')
plt.xticks(ks)
plt.show()

Looks like the best value ("elbow" of the line) for k is 2 (two clusters).

In [9]:
def k_means(n_clust, data_frame, true_labels):
    """
    Function k_means applies the k-means clustering algorithm on a dataset and
    prints the crosstab of cluster and actual labels
    and clustering performance parameters.

    Input:
    n_clust - number of clusters (k value)
    data_frame - dataset we want to cluster
    true_labels - original labels

    Output:
    1 - crosstab of cluster and actual labels
    2 - performance table
    """
    k_means = KMeans(n_clusters=n_clust, random_state=123, n_init=30)
    k_means.fit(data_frame)
    c_labels = k_means.labels_
    df = pd.DataFrame({'clust_label': c_labels, 'orig_label': true_labels.tolist()})
    ct = pd.crosstab(df['clust_label'], df['orig_label'])
    y_clust = k_means.predict(data_frame)
    display(ct)
    print('% 9s' % 'inertia    homo   compl  v-meas     ARI     AMI  silhouette')
    print('%i   %.3f   %.3f   %.3f   %.3f   %.3f    %.3f'
          % (k_means.inertia_,
             homogeneity_score(true_labels, y_clust),
             completeness_score(true_labels, y_clust),
             v_measure_score(true_labels, y_clust),
             adjusted_rand_score(true_labels, y_clust),
             adjusted_mutual_info_score(true_labels, y_clust),
             silhouette_score(data_frame, y_clust, metric='euclidean')))
More on clustering metrics can be found in DataCamp Tutorial.

In [10]:
k_means(n_clust=2, data_frame=Data, true_labels=Labels)
[Crosstab of cluster label vs. original activity label: one cluster contains almost all of the LAYING, SITTING and STANDING records, while the other contains almost all of the WALKING, WALKING_DOWNSTAIRS and WALKING_UPSTAIRS records; the wide table did not survive extraction.]

inertia   homo   compl  v-meas    ARI    AMI  silhouette
1156484   0.378  0.981   0.546  0.329  0.378       0.390
It looks like the algorithm found the patterns for moving vs. not-moving activity with a high level
of accuracy.
Let's check how it clusters with 6 clusters (the original number of classes).

In [11]:
k_means(n_clust=6, data_frame=Data, true_labels=Labels)

[Crosstab of cluster label vs. original activity label for 6 clusters; the wide table did not survive extraction.]

inertia   homo   compl  v-meas    ARI    AMI  silhouette
 895967   0.548  0.589   0.568  0.429  0.547       0.113
There doesn't seem to be a good correspondence between the clusters and the original labels, so I will
stick with 2 clusters.

In [12]:
#change labels into binary: 0 - not moving, 1 - moving
Labels_binary = Labels.copy()
for i in range(len(Labels_binary)):
    if (Labels_binary[i] == 'STANDING' or Labels_binary[i] == 'SITTING' or
            Labels_binary[i] == 'LAYING'):
        Labels_binary[i] = 0
    else:
        Labels_binary[i] = 1
Labels_binary = np.array(Labels_binary.astype(int))
In [13]:
k_means(n_clust=2, data_frame=Data, true_labels=Labels_binary)

orig_label 0 1

clust_labe
l

0 1970 6

1 2 1631

inertia homo compl v-meas ARI AMI silhouette


1156484 0.977 0.978 0.978 0.991 0.977 0.390
Principal component analysis (PCA)
Principal Component Analysis is a dimension-reduction tool that can be used to reduce a large
set of variables to a small set that still contains most of the information in the large set.
The 2-cluster algorithm seems to be able to find the patterns for the moving/not-moving labels
almost perfectly so far, but let's see if it can still be improved by dimensionality reduction.

In [14]:
#check for optimal number of features
pca = PCA(random_state=123)
pca.fit(Data)
features = range(pca.n_components_)

plt.figure(figsize=(8,4))
plt.bar(features[:15], pca.explained_variance_[:15], color='lightskyblue')
plt.xlabel('PCA feature')
plt.ylabel('Variance')
plt.xticks(features[:15])
plt.show()

One feature seems to be the best fit for our algorithm.

In [15]:
def pca_transform(n_comp):
    pca = PCA(n_components=n_comp, random_state=123)
    global Data_reduced
    Data_reduced = pca.fit_transform(Data)
    print('Shape of the new Data df: ' + str(Data_reduced.shape))
In [16]:
# pca_transform(n_comp=3)
# k_means(n_clust=2, data_frame=Data_reduced, true_labels=Labels)

In [18]:
pca_transform(n_comp=1)
k_means(n_clust=2, data_frame=Data_reduced, true_labels=Labels_binary)
Shape of the new Data df: (3609, 1)

orig_label 0 1

clust_labe
l

0 1971 8

1 1 1629

inertia homo compl v-meas ARI AMI silhouette


168716 0.976 0.976 0.976 0.990 0.976 0.794
Inertia and silhouette seem to be much better now after the reduction.
Let's also check the clustering model with 2 components.

In [19]:
pca_transform(n_comp=2)
k_means(n_clust=2, data_frame=Data_reduced, true_labels=Labels_binary)
Shape of the new Data df: (3609, 2)
orig_label 0 1

clust_labe
l

0 1969 6

1 3 1631

inertia homo compl v-meas ARI AMI silhouette


295753 0.975 0.975 0.975 0.990 0.975 0.694
No improvements here.
So far it seems like this was the best I could do. I'm still learning clustering algorithms and I
might come back to this project later.

If you know any interesting dataset to practice clustering on (not the Iris dataset, haha),
please suggest it!
https://mubaris.com/posts/kmeans-clustering/

All Articles

K-Means Clustering in Python

Clustering is a type of unsupervised learning. It is very often used when
you don't have labeled data. K-Means clustering is one of the most popular
clustering algorithms. The goal of this algorithm is to find groups (clusters) in
the given data. In this post we will implement the K-Means algorithm in
Python from scratch.

K-Means Clustering

K-Means is a very simple algorithm which clusters the data into K
clusters. The following image from PyPR is an example of K-Means
clustering.

Use Cases

K-Means is widely used for many applications.

 Image Segmentation
 Clustering Gene Segmentation Data
 News Article Clustering
 Clustering Languages
 Species Clustering
 Anomaly Detection

Algorithm

Our algorithm works as follows, assuming we have inputs $x_1, x_2, x_3, \dots, x_n$ and a value of K:

 Step 1 - Pick K random points as cluster centers, called centroids.

 Step 2 - Assign each $x_i$ to the nearest cluster by calculating its
distance to each centroid.
 Step 3 - Find the new cluster centers by taking the average of the assigned
points.
 Step 4 - Repeat Steps 2 and 3 until none of the cluster assignments
change.
The above animation is an example of running K-Means clustering on
two-dimensional data.

Step 1

We randomly pick K cluster centers (centroids). Let's assume these are
$c_1, c_2, \dots, c_k$, and we can say that

$$C = \{c_1, c_2, \dots, c_k\}$$

where $C$ is the set of all centroids.

Step 2

In this step we assign each input value to its closest center. This is done by
calculating the Euclidean (L2) distance between the point and each centroid:

$$\underset{c_i \in C}{\arg\min} \; dist(c_i, x)^2$$

where $dist(\cdot)$ is the Euclidean distance.

Step 3

In this step, we find the new centroid by taking the average of all the points
assigned to that cluster:

$$c_i = \frac{1}{\lvert S_i \rvert} \sum_{x_i \in S_i} x_i$$

where $S_i$ is the set of all points assigned to the $i$-th cluster.

Step 4

In this step, we repeat steps 2 and 3 until none of the cluster assignments
change. That is, we repeat the algorithm until the clusters remain stable.

Choosing the Value of K

We often know the value of K; in that case we simply use it. Otherwise, we
use the Elbow Method.
We run the algorithm for different values of K (say K = 10 to 1) and plot the K
values against the SSE (Sum of Squared Errors), then select the value of K at the
elbow point, as shown in the figure and sketched below.
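A minimal scikit-learn sketch of this elbow computation on synthetic data (not part of the original post, which implements K-Means from scratch in the next section):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true clusters
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# SSE (inertia) for K = 1..10
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in range(1, 11)]

plt.plot(range(1, 11), sse, '-o')
plt.xlabel('K')
plt.ylabel('SSE')
plt.show()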

Implementation using Python

The dataset we are gonna use has 3000 entries with 3 clusters. So we
already know the value of K.

Checkout this Github Repo for full code and dataset.

We will start by importing the dataset.

%matplotlib inline
from copy import deepcopy
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')
# Importing the dataset
data = pd.read_csv('xclara.csv')
print(data.shape)
data.head()
(3000, 2)
V1 V2

0 2.072345 -3.241693

1 17.936710 15.784810

2 1.083576 7.319176

3 11.120670 14.406780

4 23.711550 2.557729

# Getting the values and plotting it


f1 = data['V1'].values
f2 = data['V2'].values
X = np.array(list(zip(f1, f2)))
plt.scatter(f1, f2, c='black', s=7)

# Euclidean Distance Calculator
def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)
# Number of clusters
k = 3
# X coordinates of random centroids
C_x = np.random.randint(0, np.max(X)-20, size=k)
# Y coordinates of random centroids
C_y = np.random.randint(0, np.max(X)-20, size=k)
C = np.array(list(zip(C_x, C_y)), dtype=np.float32)
print(C)
[[ 11. 26.]
[ 79. 56.]
[ 79. 21.]]
# Plotting along with the Centroids
plt.scatter(f1, f2, c='#050505', s=7)
plt.scatter(C_x, C_y, marker='*', s=200, c='g')

# To store the value of centroids when it updates
C_old = np.zeros(C.shape)
# Cluster Labels (0, 1, 2)
clusters = np.zeros(len(X))
# Error func. - Distance between new centroids and old centroids
error = dist(C, C_old, None)
# Loop will run till the error becomes zero
while error != 0:
    # Assigning each value to its closest cluster
    for i in range(len(X)):
        distances = dist(X[i], C)
        cluster = np.argmin(distances)
        clusters[i] = cluster
    # Storing the old centroid values
    C_old = deepcopy(C)
    # Finding the new centroids by taking the average value
    for i in range(k):
        points = [X[j] for j in range(len(X)) if clusters[j] == i]
        C[i] = np.mean(points, axis=0)
    error = dist(C, C_old, None)

colors = ['r', 'g', 'b', 'y', 'c', 'm']
fig, ax = plt.subplots()
for i in range(k):
    points = np.array([X[j] for j in range(len(X)) if clusters[j] == i])
    ax.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
ax.scatter(C[:, 0], C[:, 1], marker='*', s=200, c='#050505')
From this visualization it is clear that there are 3 clusters, with the black stars as
their centroids.

If you run K-Means with wrong values of K, you will get completely
misleading clusters. For example, if you run K-Means on this with values 2, 4,
5 and 6, you will get the following clusters.

Now we will see how to implement K-Means Clustering using scikit-learn


The scikit-learn approach
Example 1

We will use the same dataset in this example.

from sklearn.cluster import KMeans

# Number of clusters
kmeans = KMeans(n_clusters=3)
# Fitting the input data
kmeans = kmeans.fit(X)
# Getting the cluster labels
labels = kmeans.predict(X)
# Centroid values
centroids = kmeans.cluster_centers_
# Comparing with scikit-learn centroids
print(C) # From Scratch
print(centroids) # From sci-kit learn
[[ 9.47804546 10.68605232]
[ 40.68362808 59.71589279]
[ 69.92418671 -10.1196413 ]]
[[ 9.4780459 10.686052 ]
[ 69.92418447 -10.11964119]
[ 40.68362784 59.71589274]]

You can see that the centroid values are equal, but in a different order.

Example 2

We will generate a new dataset using make_blobs function.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
plt.rcParams['figure.figsize'] = (16, 9)

# Creating a sample dataset with 4 clusters


X, y = make_blobs(n_samples=800, n_features=3, centers=4)
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X[:, 0], X[:, 1], X[:, 2])

# Initializing KMeans
kmeans = KMeans(n_clusters=4)
# Fitting with inputs
kmeans = kmeans.fit(X)
# Predicting the clusters
labels = kmeans.predict(X)
# Getting the cluster centers
C = kmeans.cluster_centers_
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y)
ax.scatter(C[:, 0], C[:, 1], C[:, 2], marker='*', c='#050505', s=1000)

In the above image, you can see 4 clusters and their centroids as stars.
The scikit-learn approach is very simple and concise.
More Resources

 K-Means Clustering Video by Siraj Raval


 K-Means Clustering Lecture Notes by Andrew Ng
 K-Means Clustering Slides by David Sontag (New York University)
 Programming Collective Intelligence Chapter 3
 The Elements of Statistical Learning Chapter 14
 Pattern Recognition and Machine Learning Chapter 9

Checkout this Github Repo for full code and dataset.

Conclusion

Even though it works very well, K-Means clustering has its own issues. These
include:

 If you run K-Means on uniform data, you will still get clusters.
 It is sensitive to scale due to its reliance on Euclidean distance.
 Even on perfect data sets, it can get stuck in a local minimum.
http://www.patterns7tech.com/customer-segmentation-using-machine-learning-k-means-clustering/

Customer segmentation using Machine Learning K-Means Clustering


by Rajshekhar Bodhale | Nov 17, 2017 | Machine Learning | 0 comments
Most platforms built with information technology generate huge amounts of data. This
data is called Big Data and it carries a lot of business intelligence. This data crosses
system boundaries to serve different goals and opportunities, and there is an opportunity to apply Machine
Learning to it to create value for clients.
Problems
1. We have big-data-based platforms in the Accounting and IoT domains that keep on
generating customer behavior and device monitoring data.
2. Identifying the targeted customer base, or deriving patterns along different dimensions, is
key and really provides an edge to these platforms.
Idea
Imagine you have thousands of customers using your platform and a vast amount of big data that
keeps being generated; any insight into it is going to add real value.
As part of the Machine Learning initiatives and innovative things that the Patterns7 team keeps trying,
we experimented with K-Means Clustering, and the value it brings to our clients is substantial.
Solution
Clustering is the process of partitioning a group of data points into a small number of clusters. In
this part, you will understand and learn how to implement K-Means clustering.
K-Means Clustering
K-means clustering is a method commonly used to automatically partition a data set into k
groups. It is an unsupervised learning algorithm.

K-Means Objective
 The objective of k-means is to minimize the total sum of the squared distances of every
point to its corresponding cluster centroid. Given a set of observations (x1, x2, …, xn), where
each observation is a d-dimensional real vector, k-means clustering aims to partition the n
observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of
squares, where µi is the mean of the points in Si (the objective is written out below).
 The k-means algorithm is guaranteed to converge to a local optimum.
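The within-cluster sum-of-squares objective referred to above (the formula was an image in the source and did not survive extraction) can be written as:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$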
Business Uses
This is a versatile algorithm that can be used for any type of grouping. Some examples of use
cases are:
 Behavioral segmentation: segment by purchase history, segment by activities on an
application, website, or platform.
 Inventory categorization: group inventory by sales activity.
 Sorting sensor measurements: detect activity types in motion sensors, group images.
 Detecting bots or anomalies: separate valid activity groups from bots.
K-Means Clustering Algorithm
 Step 1: Choose the number K of clusters.
 Step 2: Select at random K points, the centroids.(not necessarily from your dataset)
 Step 3: Assign each data point to the closest centroid -> That forms K clusters.
 Step 4: Compute and place the new centroid of each cluster.
 Step 5: Reassign each data point to the new closest centroid. If any reassignment took
place, go to Step 4, otherwise go to FIN.
Example: Applying K-Means Clustering to Customer Expenses and Invoices Data in
Python.
For Python I am using the Spyder editor. As an example, we'll show how the K-means algorithm
works with customer expenses and invoices data. We have data for 500 customers and we'll be looking
at two customer features: customer invoices and customer expenses. In general, this algorithm
can be used for any number of features, so long as the number of data samples is much greater
than the number of features.
Step 1: Clean and Transform Your Data
For this example, we’ve already cleaned and completed some simple data transformations. A
sample of the data as a pandas DataFrame is shown below. Import the libraries in Python, i.e.:
1. numpy, a mathematical toolkit for the numerical computations in our code.
2. matplotlib.pyplot, which helps to plot nice charts.
3. pandas, for importing and managing the dataset.
 

Step 2: We want to apply clustering on Total Expenses and Total Invoices. So select
required columns in X.

The chart below shows the dataset for 500 customers, with the Total Invoices on the x-axis and
Total Expenses on the y-axis.
 
 
Step 3: Choose K and Run the Algorithm
Choosing K
The algorithm described above finds the clusters and data set labels for a particular pre-chosen
K. To find the number of clusters in the data, the user needs to run the K-means clustering
algorithm for a range of K values and compare the results. In general, there is no method for
determining exact value of K, but an accurate estimate can be obtained using the following
techniques.
One of the metrics that is commonly used to compare results across different values of K is the
mean distance between data points and their cluster centroid. Since increasing the number of
clusters will always reduce the distance to data points, increasing K will always decrease this
metric, to the extreme of reaching zero when K is the same as the number of data points. Thus,
this metric cannot be used as the sole target. Instead, mean distance to the centroid as function
of K is plotted and the “elbow point,” where the rate of decrease sharply shifts, can be used to
roughly determine K.
Using the elbow method we find the optimal number of clusters i.e. K=3. For this example, use
the Python packages scikit-learn for computations as shown below:
# K-Means Clustering

# importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# importing tha customer Expenses Invoices dataset with pandas


dataset=pd.read_csv('Expense_Invoice.csv')
X=dataset.iloc[: , [3,2]].values

# Using the elbow method to find the optimal number of clusters


from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters K')
plt.ylabel('Average Within-Cluster distance to Centroid (WCSS)')
plt.show()

# Applying k-means to the customer expenses & invoices dataset

kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Visualizing the clusters

plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='Careful (c1)')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Standard (c2)')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Target (c3)')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=250, c='yellow',
            label='Centroids')
plt.title('Clusters of customer Invoices & Expenses')
plt.xlabel('Total Invoices')
plt.ylabel('Total Expenses')
plt.legend()
plt.show()

Step 4: Review the Results


The chart below shows the results. Visually, you can see that the K-means algorithm splits the
data into three groups based on the invoice feature. Each cluster centroid is marked with a yellow circle.
The customers are now divided into:
1. "Careful": customers whose income is low and who spend less.
2. "Standard": customers whose income is average and who spend less.
3. "Target": customers whose income is higher and who spend more.
https://towardsdatascience.com/segmenting-customers-using-k-means-and-transaction-records-
76f4055d856a

Segmenting Customers using K-Means, RFM and Transaction Records

Adeline Ong

Follow

Mar 16 · 6 min read


Photo by rupixen.com on Unsplash

In this article, I will walk you through how I applied K-means and RFM segmentation to
cluster online gift shop customers based on their transaction records.

Introduction

When I was in college, I started a simple e-store selling pet products. Back then, I only collected
enough customer information to make the sale, and get my products to them. Simply put, I only
had their transaction records and addresses.

Back then, I didn’t think I had enough information to perform any useful segmentation.
However, I recently came across an intuitive segmentation approach
called RFM (Recency Frequency Monetary Value), which can be easily applied to basic customer
transaction records.

About RFM Segmentation

Here’s what each letter of RFM means:

 Recency: How long has it been since the customer last purchased from you (e.g. in days,
in months)?

 Frequency: How many times has the customer purchased from you within a fixed period
(e.g. past 3 months, past year)

 Monetary Value: How much has the customer spent at your store within a fixed period
(which should be the same period set for Frequency).

We can group customers, and come up with business recommendations based on RFM scores.
For example, you could offer promotions to reengage customers who have not bought from your
store recently. You could further prioritize your promotional strategy by focusing on customers
who used to buy frequently and spend at least average monetary value.

Using K-Means Instead of the Traditional Approach

The traditional RFM approach requires you to manually rank customers from 1 to 5 on each of
their RFM features. Two ways to define ranks would be to create groups of equal intervals (e.g.
range/5), or categorize them based on percentiles (those up to 20th percentile would form a
rank).
Since we are data scientists, why not use an unsupervised learning model to do the job? In fact,
our model might perform better than the traditional approach since it groups customers based on
their RFM values, instead of their ranking.

The Data

The dataset was from UCL’s machine learning repository. The file contained 1 million customer
transaction records for a UK-based online gift store for the period between 2009 to 2010, and
2010 to 2011. There were two sheets in the excel file (one for each year), and each sheet had the
same 8 features:

 Customer ID

 Country (I didn’t really look at this since most customers were UK-based as well)

 Invoice Code

 Invoice Date

 Stock Code

 Stock Description

 Unit Price

 Unit Quantity

Data Cleaning

Since both datasets contained the same features, I appended one to the other. Following this, I
dropped rows that had:

 Missing Customer ID

 Missing Stock Description

 Abnormal Stock Codes that did not conform to the expected format, such as Stock Codes
that started with letters, and had less than 5 digits. These tended to be from manual entries
(Stock Code ‘M’), postage costs (Stock Code ‘DOT’) and cancelled orders (Stock Codes
starting with ‘C’). However, I retained Stock Codes that ended with letters, as these tended to
indicate product variations (e.g. pattern, color).

After creating RFM features for each customer (see Feature Engineering), I also removed
extreme outliers that were more than 4 standard deviations away from the mean. Removing
extreme outliers is important because they can skew unsupervised learning models that use
distance-based measures.
Feature Engineering

To derive a customer’s Recency, I calculated the time difference (in days) between the latest
purchase in the combined dataset, and the customer’s last purchase. Lower scores indicate a
more recent purchase, which is better for the store.

I created features that corresponded to each customer's frequency of purchase (over the 2-year
period) and total spend (Monetary Value) through aggregation, as sketched after the list below:

 Frequency: Count the number of unique Invoice Codes per customer

 Monetary Value: Sum the price of all items purchased

I also created other features, which I thought would be useful cluster descriptors:

 Total spend per invoice

 Time (in days) between orders
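A minimal pandas sketch of this feature engineering, using hypothetical column names (CustomerID, InvoiceNo, InvoiceDate, UnitPrice, Quantity) and a hypothetical file path rather than the author's exact ones:

import pandas as pd

# Hypothetical path to the combined transaction file
df = pd.read_excel("online_retail.xlsx")
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
df["LineTotal"] = df["UnitPrice"] * df["Quantity"]
snapshot = df["InvoiceDate"].max()

rfm = df.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "nunique"),                            # number of unique invoices
    Monetary=("LineTotal", "sum"),                                 # total spend
)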

Choosing an Unsupervised Learning Model using Silhouette Score

Silhouette score can be used to evaluate the quality of unsupervised learning models where the
ground truth is unknown. Silhouette score measures how similar an observation is to its own
cluster, as compared to other clusters.

Values closer to 1 indicate better cluster separation, while values near 0 indicate overlapping
clusters. Avoid values that are negative.

I applied 3 unsupervised learning models to the data, and chose to go with K-Means because it had
the best silhouette scores regardless of the number of clusters (a sketch of this comparison follows the figure caption below).
Silhouette scores of unsupervised learning models by number of clusters
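A sketch of that kind of comparison for the K-Means model alone, scoring the engineered RFM features with scikit-learn's silhouette_score (the other two models the author tried are not named in the article, so only K-Means is shown here):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# rfm is the per-customer table sketched earlier
X = StandardScaler().fit_transform(rfm[["Recency", "Frequency", "Monetary"]])

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))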

Choosing the Number of K-Means Clusters

To choose the number of clusters (n_clusters), I took into account each cluster’s silhouette score.
Optimally, every cluster’s coefficient value should be higher than the mean silhouette score (in
the graph, each cluster’s peak should exceed the red dotted line). I also took into account the
RFM values of each cluster.

I varied the number of K-Means clusters and examined the RFM values and silhouette scores of
the models. I decided to go with n_clusters = 5 instead of anything less, despite a lower silhouette
score, because an important customer segment with good RFM values only appeared when
n_clusters = 5. Clusters that appeared beyond n_clusters = 5 were less critical because they had
poorer RFM scores.
Table depicting silhouette scores across n number of clusters, and whether each cluster’s coefficient
value was higher than the mean silhouette score within each model

Visualizing and Describing the Clusters

Having chosen an unsupervised learning model and a suitable number of clusters, I visualized the
clusters using a 3D plot.
3D plot depicting customer segments derived using RFM segmentation and K-Means

Clusters 4 and 2 have better RFM scores and represent the store's core customers. The other 3
clusters appear to be more casual customers who purchase less frequently.

Core Customers

Based on this dataset, 18% of customers are core customers, and they contributed 62% of
revenue over the past two years. They spend a lot, purchase frequently (every one or two months),
and are still engaged with the store. As the typical price of the online store's products tends to be
low, the clusters' average spend suggests that they are purchasing in large quantities, so they are
probably wholesalers and smaller shops that resell the store's goods.
Table describing key features of core customers. Non-percentage figures represent averages.

Casual Customers

As for casual customers, I'd like to highlight Cluster 0 (which I've called Gift Hunters) as they are
the most critical to the store. They contributed about a quarter of revenue, which is a lot more than
the other casual clusters. They tended to purchase from the store once every quarter in small
amounts, which suggests that they are individuals buying for special occasions.
Table describing the key features of casual customers. Non-percentage figures represent averages.

Possible Promotional Strategies to Pursue

Given the features of the clusters, I propose the following promotional strategies for key groups:

 Wholesalers: Given their small numbers, it might make sense to engage them directly
to build goodwill and loyalty. It would be best to lock them in with a custom solution.

 Small Shops: Explore cashback discounts that can be used during subsequent
purchases. This will also lower their cost and encourage them to spend more.

 Gift Hunters: Engage them just before special occasions and encourage them to spend
more by giving them free gifts for a minimum spend that is higher than their current mean
spend of 347 pounds.
To End Off…

I think RFM segmentation pairs very well with unsupervised learning models, as they remove the
need for marketers to manually segment their customer records. I hope I’ve illustrated how
meaningful customer segments can be created from very basic customer information. For more
details, you can look at my notebook. It contains code and details about the other models that I
explored.

Thanks for reading!


https://medium.com/@16611050/k-means-clustering-8476c74ad462 (very important)

K-Means Clustering

FATA MUKHAMMAD IZZADIN

Follow

Jun 26, 2019 · 7 min read

Cryotherapy Dataset user agreement in python

Introduction To K-Means Clustering

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled
data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups
in the data, with the number of groups represented by the variable K. The algorithm works
iteratively to assign each data point to one of K groups based on the features that are provided.
Data points are clustered based on feature similarity. The results of the K-means clustering
algorithm are:

1. The centroids of the K clusters, which can be used to label new data

2. Labels for the training data (each data point is assigned to a single cluster)

Rather than defining groups before looking at the data, clustering allows you to find and analyze
the groups that have formed organically. The “Choosing K” section below describes how the
number of groups can be determined.

Each centroid of a cluster is a collection of feature values which define the resulting groups.
Examining the centroid feature weights can be used to qualitatively interpret what kind of group
each cluster represents.

This introduction to the K-means clustering algorithm covers:

 Common business cases where K-means is used

 The steps involved in running the algorithm


 A Python example using delivery fleet data

Business Uses

The K-means clustering algorithm is used to find groups which have not been explicitly labeled in
the data. This can be used to confirm business assumptions about what types of groups exist or to
identify unknown groups in complex data sets. Once the algorithm has been run and the groups
are defined, any new data can be easily assigned to the correct group.

This is a versatile algorithm that can be used for any type of grouping. Some examples of use
cases are:

 Behavioral segmentation:

 Segment by purchase history

 Segment by activities on application, website, or platform

 Define personas based on interests

 Create profiles based on activity monitoring

 Inventory categorization:

 Group inventory by sales activity

 Group inventory by manufacturing metrics

 Sorting sensor measurements:

 Detect activity types in motion sensors

 Group images

 Separate audio

 Identify groups in health monitoring

 Detecting bots or anomalies:

 Separate valid activity groups from bots

 Group valid activity to clean up outlier detection

In addition, monitoring if a tracked data point switches between groups over time can be used to
detect meaningful changes in the data.
Algorithm

The Κ-means clustering algorithm uses iterative refinement to produce a final result. The
algorithm inputs are the number of clusters Κ and the data set. The data set is a collection of
features for each data point. The algorithm starts with initial estimates for the Κ centroids,
which can either be randomly generated or randomly selected from the data set. The algorithm
then iterates between two steps:

1. Data assignment step:

Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest
centroid, based on the squared Euclidean distance. More formally, if ci is the collection of
centroids in set C, then each data point x is assigned to a cluster based on

where dist( · ) is the standard (L2) Euclidean distance. Let the set of data point assignments for
each ith cluster centroid be Si.
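The assignment rule itself was an image in the source article and did not survive extraction; in the usual notation it is:

$$\underset{c_i \in C}{\arg\min} \; dist(c_i, x)^2$$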

2. Centroid update step:

In this step, the centroids are recomputed. This is done by taking the mean of all data points
assigned to that centroid’s cluster.
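The update formula, likewise lost in extraction, is the centroid mean:

$$c_i = \frac{1}{\lvert S_i \rvert} \sum_{x_i \in S_i} x_i$$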

The algorithm iterates between steps one and two until a stopping criterion is met (i.e., no data
points change clusters, the sum of the distances is minimized, or some maximum number of
iterations is reached).

This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e. not
necessarily the best possible outcome), meaning that assessing more than one run of the
algorithm with randomized starting centroids may give a better outcome.

Choosing K

The algorithm described above finds the clusters and data set labels for a particular pre-chosen K.
To find the number of clusters in the data, the user needs to run the K-means clustering
algorithm for a range of K values and compare the results. In general, there is no method for
determining exact value of K, but an accurate estimate can be obtained using the following
techniques.

One of the metrics that is commonly used to compare results across different values of K is the
mean distance between data points and their cluster centroid. Since increasing the number of
clusters will always reduce the distance to data points, increasing K will always decrease this
metric, to the extreme of reaching zero when K is the same as the number of data points. Thus,
this metric cannot be used as the sole target. Instead, mean distance to the centroid as a function
of K is plotted and the “elbow point,” where the rate of decrease sharply shifts, can be used to
roughly determine K.

A number of other techniques exist for validating K, including cross-validation, information


criteria, the information theoretic jump method, the silhouette method, and the G-means
algorithm. In addition, monitoring the distribution of data points across groups provides insight
into how the algorithm is splitting the data for each K.

Example

This write-up uses secondary data obtained from the website https://archive.ics.uci.edu.

This dataset contains information about the wart treatment results of 90 patients using cryotherapy.

The attributes of the cryotherapy data are sex, age, time, number of warts, type, area, and result of
treatment.
“import pandas as pd” calls the “pandas” library and initializes it as “pd”, which is used for
processing data in data frames; “import numpy as np” calls the “numpy” library and
initializes it as “np”, used to build multidimensional arrays; “import matplotlib.pyplot as plt”
calls the “matplotlib.pyplot” library and initializes it as “plt”, which is used to visualize the data
in the form of lines, bar charts, histograms, boxplots, and pie charts; and “import seaborn as sns”
calls the “seaborn” library and initializes it as “sns”, which is used to draw attractive and
informative statistical graphics.

Read data

input data in python using the syntax in the image above


Data exploration

The attributes in the file use integer and float data types; non-null indicates that none of the data
is missing, and there are 90 rows in total.
Plot distribution

The distribution to be plotted is that of the age attribute/variable broken down by the sex
attribute/variable.
The largest group is under 20 years old, while the smallest group is in the 51–60 age
range.

K-Means Cluster

Use the “sklearn” (scikit-learn) package for clustering by importing KMeans, and use the
“sklearn” preprocessing package by importing MinMaxScaler.
The first thing to do for K-Means clustering is to convert the data frame variables into an array. The
variables in the array are then rescaled using “MinMaxScaler()”. This alternative
standardization scales the features to lie between a given minimum and maximum value,
often between zero and one, or so that the maximum absolute value of each feature is
scaled to unit size.
To build the K-Means model, the first step is to configure the KMeans function on all the
variables using the “n_clusters” argument set to 5, and the “random_state” argument set to 123,
which fixes the random seed.
The final step is to take the cluster results, add them to the data frame in a column named
“cluster”, and then visualize them in a scatterplot, as sketched below.
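A minimal sketch of the steps just described, assuming the Cryotherapy file from the UCI page (the exact path and column labels, such as "age" and "Time", are assumptions):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical path; the UCI file ships as an Excel sheet
df = pd.read_excel("Cryotherapy.xlsx")

# Scale all features to the [0, 1] range
X = MinMaxScaler().fit_transform(df.values)

# Fit K-Means with 5 clusters and a fixed random seed, as in the article
km = KMeans(n_clusters=5, random_state=123)
df["cluster"] = km.fit_predict(X)

# Visualize two of the attributes colored by cluster
plt.scatter(df["age"], df["Time"], c=df["cluster"])
plt.xlabel("age")
plt.ylabel("Time")
plt.show()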

Reference:

Trevino, Andrea (2016, December 6). Introduction to K-means Clustering. Retrieved from datascience: https://www.datascience.com/blog/k-means-clustering

https://archive.ics.uci.edu/ml/datasets/Cryotherapy+Dataset+#

Additional Notes and Alternatives

Feature Engineering

Feature engineering is the process of using domain knowledge to choose which data metrics to
input as features into a machine learning algorithm. Feature engineering plays a key role in K-
means clustering; using meaningful features that capture the variability of the data is essential
for the algorithm to find all of the naturally-occurring groups.  
Categorical data (i.e., category labels such as gender, country, browser type) needs to be
encoded or separated in a way that can still work with the algorithm.  
Feature transformations, particularly to represent rates rather than measurements, can help to
normalize the data. For example, in the delivery fleet example above, if total distance driven had
been used rather than mean distance per day, then drivers would have been grouped by how
long they had been driving for the company rather than rural vs. urban.  
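As a small illustration of that kind of rate transformation (the drivers table and its columns total_distance and days_employed are hypothetical, not from an actual fleet dataset):

import pandas as pd

drivers = pd.DataFrame({
    'driver_id':      [1, 2, 3],
    'total_distance': [12000.0, 90000.0, 15000.0],  # km driven overall (hypothetical values)
    'days_employed':  [100, 900, 120],
})

# total distance mostly reflects tenure; distance per day reflects rural vs. urban driving patterns
drivers['distance_per_day'] = drivers['total_distance'] / drivers['days_employed']
print(drivers)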
Alternatives

A number of alternative clustering algorithms exist, including DBScan, spectral clustering, and modeling with Gaussian mixtures. A dimensionality reduction technique, such as principal component analysis, can be used to separate groups of patterns in data. You can read more about alternatives to K-means in this post.
One possible outcome is that there are no organic clusters in the data; instead, all of the data
fall along the continuous feature ranges within one single group. In this case, you may need to
revisit the data features to see if different measurements need to be included or a feature
transformation would better represent the variability in the data. In addition, you may want to
impose categories or labels based on domain knowledge and modify your analysis approach.
For more information on K-means clustering, visit the scikit learn site. 
https://datascience.stackexchange.com/questions/22795/do-clustering-algorithms-need-feature-
scaling-in-the-pre-processing-stage

Do Clustering algorithms need feature scaling in the pre-processing stage?


Clustering algorithms are certainly affected by feature scaling.

Example:

Let's say that you have two features:

1. weight (in Lbs)
2. height (in Feet)

... and we are using these to predict whether a person needs an 'S' or 'L' size shirt.

We are using weight+height for that, and in our training set let's say we have two people already in clusters:

1. Adam (175Lbs+5.9ft) in 'L'
2. Lucy (115Lbs+5.2ft) in 'S'.

We have a new person - Alan (140Lbs+6.1ft.), and your clustering algorithm will put him in the nearest cluster. So, if we don't scale the features here, the height has very little effect and Alan will be assigned to the 'S' cluster.

So, we need to scale it. Scikit Learn provides many functions for scaling. One you can use
is sklearn.preprocessing.MinMaxScaler.
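A small sketch of that example with sklearn.preprocessing.MinMaxScaler (the numbers are the ones from the answer above; the nearest-point check only illustrates the effect of scaling, it is not a full clustering run):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# [weight in lbs, height in ft]
adam = np.array([175.0, 5.9])   # 'L'
lucy = np.array([115.0, 5.2])   # 'S'
alan = np.array([140.0, 6.1])   # new person

X = np.vstack([adam, lucy, alan])

def nearest(points, query):
    # index of the existing point closest to the query, by Euclidean distance
    return int(np.argmin(np.linalg.norm(points - query, axis=1)))

print(nearest(X[:2], X[2]))                # unscaled: weight dominates -> index 1 (Lucy, 'S')

X_scaled = MinMaxScaler().fit_transform(X)
print(nearest(X_scaled[:2], X_scaled[2]))  # scaled: height now counts -> index 0 (Adam, 'L')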

Yes. Clustering algorithms such as K-means do need feature scaling before they are fed to the
algorithm. Since clustering techniques use Euclidean distance to form the cohorts, it is wise,
for example, to scale variables measured in metres (heights) and kilograms (weights) before
calculating the distance.

If your variables are of incomparable units (e.g. height in cm and weight in kg) then you should
standardize variables, of course. Even if variables are of the same units but show quite different
variances it is still a good idea to standardize before K-means. You see, K-means clustering is
"isotropic" in all directions of space and therefore tends to produce more or less round (rather
than elongated) clusters. In this situation leaving variances unequal is equivalent to putting more
weight on variables with smaller variance, so clusters will tend to be separated along variables
with greater variance.
Another thing worth remembering is that K-means clustering results are potentially sensitive
to the order of objects in the data set [1]. A justified practice would be to run the analysis several
times, randomizing the object order; then average the cluster centres of those runs and input those
centres as the initial ones for one final run of the analysis.
Here is some general reasoning about the issue of standardizing features in cluster or other
multivariate analysis.

[1] Specifically, (1) some methods of centre initialization are sensitive to case order; (2) even
when the initialization method isn't sensitive, results might sometimes depend on the order in which the
initial centres are introduced to the program (in particular, when there are tied, equal
distances within the data); (3) the so-called running-means version of the k-means algorithm is naturally
sensitive to case order (in this version - which is not often used apart from maybe online
clustering - recalculation of centroids takes place after each individual case is re-assigned to
another cluster).

As explained in this paper, the k-means minimizes the error function using the Newton
algorithm, i.e. a gradient-based optimization algorithm. Normalizing the data improves
convergence of such algorithms. See here for some details on it.
The idea is that if different components of data (features) have different scales, then derivatives
tend to align along directions with higher variance, which leads to poorer/slower convergence.

It depends on your data.


If you have attributes with a well-defined meaning. Say, latitude and longitude, then you should
not scale your data, because this will cause distortion. (K-means might be a bad choice, too -
you need something that can handle lat/lon naturally)

If you have mixed numerical data, where each attribute is something entirely different (say, shoe
size and weight) and has different units attached (lb, tons, m, kg ...), then these values aren't really
comparable anyway; z-standardizing them is a best practice to give them equal weight.

If you have binary values, discrete attributes or categorial attributes, stay away from k-means.
K-means needs to compute means, and the mean value is not meaningful on this kind of data.
The issue is what represents a good measure of distance between cases.

If you have two features, one where the differences between cases is large and the other small,
are you prepared to have the former as almost the only driver of distance?

So for example if you clustered people on their weights in kilograms and heights in metres, is a
1kg difference as significant as a 1m difference in height? Does it matter that you would get
different clusterings on weights in kilograms and heights in centimetres? If your answers are
"no" and "yes" respectively then you should probably scale.

On the other hand, if you were clustering Canadian cities based on distances east/west and
distances north/south then, although there will typically be much bigger differences east/west,
you may be happy just to use unscaled distances in either kilometres or miles (though you might
want to adjust degrees of longitude and latitude for the curvature of the earth).

I think standard scaling mostly depends on the model being used, while normalization mostly depends on
how the data originated.

Most of distance based models e.g. k-means need standard scaling so that large-scaled
features don't dominate the variation. Same goes to PCA.

About the normalization, it mostly depends on the data. For example, if you have sensor data
(each time step being a variable) with different scaling, you need to L2 normalize the data to
bring them into the same scale. Or if you are working on customer recommendation and your
entry are the number of times they bought each item (items being variables), you might need to
L2 normalize the items if you don't want people who buy a lot to skew the feature.

Personally, I think that if the variables are well-defined, taking their log might result in losing interpretability.
So if you get good-looking clusters without the log transform, I'd stick with that.

It's simply a case of getting all your data on the same scale: if the scales for different features
are wildly different, this can have a knock-on effect on your ability to learn (depending on what
methods you're using to do it). Ensuring standardised feature values implicitly weights all
features equally in their representation.
https://www.quora.com/Should-you-standardize-binary-categorical-and-indicator-primary-key-
variables-before-performing-K-means-clustering

Should you standardize binary, categorical and indicator (primary key) variables before
performing K-means clustering?
Yes, standardizing (normalizing) the input features is an important preprocessing step for
using k-means. This is done to bring all the features onto the same scale and give equal
importance to all features during learning. You can use either min-max normalization or mean-SD
(z-score) normalization.
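Both normalizations mentioned here are available in scikit-learn; a minimal sketch comparing them on a made-up two-feature array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# two features on very different scales: income in dollars, age in years (made-up values)
X = np.array([[30000.0, 25.0],
              [52000.0, 40.0],
              [81000.0, 33.0]])

print(MinMaxScaler().fit_transform(X))    # min-max normalization: each column mapped to [0, 1]
print(StandardScaler().fit_transform(X))  # mean-SD (z-score) normalization: zero mean, unit variance per column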

Are mean normalization and feature scaling needed for k-means clustering?

Having said that, the standard k-means technique preferably should not be directly applied to
categorical data, for various reasons. This is because the sample space for categorical data is
discrete, and doesn't have a natural origin. A Euclidean distance function on such a space isn't
really meaningful.

k-modes (for only categorical data) and k-prototypes (for data with mixed data types) are more
appropriate.

For more info:

http://www.cs.ust.hk/~qyang/Teac...

Why does K-means clustering perform poorly on categorical data? The weakness of the K-
means method is that it is applicable only when the mean is defined, one needs to specify K in
advance, and it is unable to handle noisy data and outliers.

nicodv/kmodes (Source code)

K-Means clustering for mixed numeric and categorical data


http://www.cs.ust.hk/~qyang/Teaching/537/Papers/huang98extensions.pdf ( Paper )

Why does K-means clustering perform poorly on categorical data? The weakness of the K-
means method is that it is applicable only when the mean is defined, one needs to specify
K in advance, and it is unable to handle noisy data and outliers.

Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values

The K-means algorithm defines a cost function that computes Euclidean distance (or something
similar) between two numeric values. However, it is not possible to define such a distance
between categorical values. For example, if the Euclidean distance between numeric points A and
B is 25 and between A and C is 10, we know A is closer to C than to B. However, as Sean Owen and User-
9806452280263883043 suggested, categorical values are not numbers but enumerations
such as 'banana', 'apple' and 'oranges'. Euclidean distance cannot be used to compute
distances between the above fruits. We cannot say apple is closer to orange or to banana, because
Euclidean distance is not meant to handle such information. Therefore, we need to change the
cost function. In his paper, Huang proposed two things to handle this situation:

1. Use Hamming distance instead of Euclidean distance, i.e. if two categorical values are the same, make the distance 0, otherwise 1.
2. Instead of the mean, compute the mode, i.e. the most frequently occurring categorical value of a feature is used as its representative. That is how you compute the centres of a cluster.
Here you go, you have defined a new cost function that can perform partitional clustering of
categorical data and it is called K-modes clustering. The basic steps of K-modes algorithm are
the same, except for the cost function it optimizes. Here is the original paper that describes K-
modes algorithm in Section 4. http://www.cse.ust.hk/~qyang/537...

HTH
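A rough sketch of those two ingredients (0/1 mismatch distance and mode-based centres), written from the description above rather than taken from Huang's paper or any k-modes library:

from collections import Counter

def hamming(a, b):
    # count of positions where the categorical values differ (0 if identical, +1 per mismatch)
    return sum(x != y for x, y in zip(a, b))

def mode_center(records):
    # per-feature mode: the most frequent category in each column becomes the cluster representative
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

cluster = [('banana', 'red', 'small'),
           ('banana', 'green', 'small'),
           ('apple', 'red', 'small')]

center = mode_center(cluster)             # ('banana', 'red', 'small')
print(center, hamming(('apple', 'red', 'large'), center))  # distance 2: fruit and size differ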
There are lots of good answers given. Definitely, Euclidean distance between two points that
have a categorical dimension does not make sense when you want to compute the mean of a
possible centroid for a cluster.

As an alternative to all the suggestions, if you convert your categorical data to numeric values,
and if you scale your *actual* numeric features to the range of the numeric values you derived from
the converted categorical features, then you could probably run k-means (I haven't tested this
myself) using something like cosine similarity to identify the centroids.

If you google for k-means on categorical data, there are many papers that list various different
approaches; one such approach, as everybody mentioned, is http://arxiv.org/ftp/cs/papers/0...

https://github.com/nicodv/kmodes
https://datascience.stackexchange.com/questions/22/k-means-clustering-for-mixed-numeric-and-
categorical-data

K-Means clustering for mixed numeric and categorical data


The standard k-means algorithm isn't directly applicable to categorical data, for various reasons.
The sample space for categorical data is discrete, and doesn't have a natural origin. A
Euclidean distance function on such a space isn't really meaningful. As someone put it, "The
fact a snake possesses neither wheels nor legs allows us to say nothing about the relative value
of wheels and legs." (from here)
There's a variation of k-means known as k-modes, introduced in this paper by Zhexue Huang,
which is suitable for categorical data. Note that the solutions you get are sensitive to initial
conditions, as discussed here (PDF), for instance.
Huang's paper (linked above) also has a section on "k-prototypes" which applies to data with a
mix of categorical and numeric features. It uses a distance measure which mixes the Hamming
distance for categorical features and the Euclidean distance for numeric features.
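If you want to try this in Python, the nicodv/kmodes package linked above implements both k-modes and k-prototypes; a hedged usage sketch (install with pip install kmodes; the constructor arguments shown follow that project's README as I recall it, so double-check against the repo):

import numpy as np
from kmodes.kmodes import KModes
from kmodes.kprototypes import KPrototypes

# purely categorical toy data
X_cat = np.array([['red', 'S'], ['red', 'M'], ['blue', 'L'], ['blue', 'L']])
labels_cat = KModes(n_clusters=2, init='Huang', n_init=5).fit_predict(X_cat)

# mixed toy data: column 0 numeric, columns 1-2 categorical
X_mix = np.array([[25, 'red', 'S'], [31, 'red', 'M'], [58, 'blue', 'L'], [62, 'blue', 'L']], dtype=object)
labels_mix = KPrototypes(n_clusters=2, init='Cao').fit_predict(X_mix, categorical=[1, 2])

print(labels_cat, labels_mix)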

A Google search for "k-means mix of categorical data" turns up quite a few more recent papers
on various algorithms for k-means-like clustering with a mix of categorical and numeric data. (I
haven't yet read them, so I can't comment on their merits.)

Actually, what you suggest (converting categorical attributes to binary values, and then doing k-
means as if these were numeric values) is another approach that has been tried before
(predating k-modes). (See Ralambondrainy, H. 1995. A conceptual version of the k-means
algorithm. Pattern Recognition Letters, 16:1147–1157.) But I believe the k-modes approach is
preferred for the reasons I indicated above.

Euclidean distance is not defined for categorical data; therefore, K-means cannot be used
directly. You may like to read more here

Shehroz Khan's answer to Why does K-means clustering perform poorly on categorical data?
The weakness of the K-means method is that it is applicable only when the mean is defined, one
needs to specify K in advance, and it is unable to handle noisy data and outliers.

Shehroz Khan's answer to How do we apply k-means clustering algorithm for mixed data-
numeric and categorical?

Shehroz Khan's answer to About converting a categorical variable into a numeric variable: When
is it better to use dummy variables instead of a single numerical variable?

How can I perform PCA before k-means clustering?

You just apply PCA and choose the principal components with the largest eigenvalues (usually 2
or 3 for visualization purposes). The issue, though, is that you won't gain insight from the
clustering about the original features and how they drive the clusters, because you are not
clustering on the original data features but on the principal components.
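A minimal sketch of that pipeline (scale, project onto the principal components with the largest eigenvalues, then cluster the components; the iris data is only a stand-in for your own features):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)             # keep the 2 components with the largest eigenvalues
X_pca = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # share of variance carried by each component

labels = KMeans(n_clusters=3, random_state=0).fit_predict(X_pca)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)
plt.xlabel('PC 1'); plt.ylabel('PC 2')
plt.show()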

What hypothesis can I set before I do k-means clustering?

 The clusters are spherical in shape (or circular in 2 dimensions)
 The number of initial partitions (i.e. K) is known
 Initial cluster centres are either known, computed beforehand or randomly chosen

How do I do clustering for categorical data?

The apparent difficulty of clustering categorical data (nominal and ordinal, mixed with
continuous variables) is in finding an appropriate distance metric between two observations.

One standard approach is to compute a distance or dissimilarity matrix from the data and then
cluster it using hierarchical clustering, PAM, etc. (see the sketch after this list).

Here are a few methods:

 Use Gower's metric. Here is an R implementation called daisy.
 Obtain a distance matrix from Random Forest based proximity.
 kmodes
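A sketch of the "dissimilarity matrix + hierarchical clustering" route in Python with scipy. Gower's metric itself is not in scipy, so a plain Hamming dissimilarity on integer-coded categories stands in here just to show the mechanics:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

# toy categorical records encoded as integer codes per column
X = np.array([[0, 1, 2],
              [0, 1, 0],
              [1, 0, 2],
              [1, 0, 0]])

d = pdist(X, metric='hamming')          # condensed pairwise dissimilarity matrix
Z = linkage(d, method='average')        # agglomerative clustering on the precomputed distances
labels = fcluster(Z, t=2, criterion='maxclust')
print(squareform(d).round(2))
print(labels)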
Seraj:

https://www.youtube.com/watch?edufilter=NULL&v=9991JlKnFmk

https://github.com/llSourcell/k_means_clustering

Let's detect the intruder trying to break into our security system using a very popular ML
technique called K-Means Clustering! This is an example of learning from data that has no
labels (unsupervised) and we'll use some concepts that we've already learned about like
computing the Euclidean distance and a loss function to do this. Code for this video:
https://github.com/llSourcell/k_means... More learning resources: http://www.kdnuggets.com/2016/12/data...
http://opencv-python-tutroals.readthe... http://people.revoledu.com/kardi/tuto...
https://home.deib.polimi.it/matteucc/... http://mnemstudio.org/clustering-k-me...
https://www.dezyre.com/data-science-i... http://scikit-learn.org/stable/tutori...

https://medium.com/search?q=K-mean
https://www.kaggle.com/patneshubham123/k-means-clustering-and-cluster-profiling
# Clustering
In [27]:
from sklearn.cluster import KMeans
In [28]:
km_3=KMeans(n_clusters=3,random_state=123)
km_3.fit(train_num)
km_3.cluster_centers_
km_3.labels_
pd.Series(km_3.labels_).value_counts()
km_4=KMeans(n_clusters=4,random_state=123).fit(train_num)

#km_4.labels_

km_5=KMeans(n_clusters=5,random_state=123).fit(train_num)

#km_5.labels_

km_6=KMeans(n_clusters=6,random_state=123).fit(train_num)

#km_6.labels_

km_7=KMeans(n_clusters=7,random_state=123).fit(train_num)
#km_7.labels_

km_8=KMeans(n_clusters=8,random_state=123).fit(train_num)
#km_8.labels_
# save the cluster labels and sort by cluster
train_num['cluster_3'] = km_3.labels_
train_num['cluster_4'] = km_4.labels_
train_num['cluster_5'] = km_5.labels_
train_num['cluster_6'] = km_6.labels_
train_num['cluster_7'] = km_7.labels_
train_num['cluster_8'] = km_8.labels_
train_num.head()
pd.Series.sort_index(train_num.cluster_3.value_counts())
pd.Series(train_num.cluster_3.size)
size=pd.concat([pd.Series(train_num.cluster_3.size),
pd.Series.sort_index(train_num.cluster_3.value_counts()),
pd.Series.sort_index(train_num.cluster_4.value_counts()),
pd.Series.sort_index(train_num.cluster_5.value_counts()),
pd.Series.sort_index(train_num.cluster_6.value_counts()),
pd.Series.sort_index(train_num.cluster_7.value_counts()),
pd.Series.sort_index(train_num.cluster_8.value_counts())])
In [38]:
Seg_size=pd.DataFrame(size, columns=['Seg_size'])
Seg_Pct = pd.DataFrame(size/train_num.cluster_3.size, columns=['Seg_Pct'])
Seg_size.T
Seg_Pct.T

# Mean value gives a good indication of the distribution of data. So we are
# finding the mean value for each variable for each cluster
Profling_output = pd.concat([train_num.apply(lambda x: x.mean()).T,
train_num.groupby('cluster_3').apply(lambda x: x.mean()).T,
train_num.groupby('cluster_4').apply(lambda x: x.mean()).T,
train_num.groupby('cluster_5').apply(lambda x: x.mean()).T,
train_num.groupby('cluster_6').apply(lambda x: x.mean()).T,
train_num.groupby('cluster_7').apply(lambda x: x.mean()).T,
train_num.groupby('cluster_8').apply(lambda x: x.mean()).T], axis=1)

Profling_output_final=pd.concat([Seg_size.T, Seg_Pct.T, Profling_output], axis=0)
#Profling_output_final.columns = ['Seg_' + str(i) for i in Profling_output_final.columns]
Profling_output_final.columns = ['Overall', 'KM3_1', 'KM3_2', 'KM3_3',
'KM4_1', 'KM4_2', 'KM4_3', 'KM4_4',
'KM5_1', 'KM5_2', 'KM5_3', 'KM5_4', 'KM5_5',
'KM6_1', 'KM6_2', 'KM6_3', 'KM6_4',
'KM6_5','KM6_6',
'KM7_1', 'KM7_2', 'KM7_3', 'KM7_4',
'KM7_5','KM7_6','KM7_7',
'KM8_1', 'KM8_2', 'KM8_3', 'KM8_4',
'KM8_5','KM8_6','KM8_7','KM8_8',]
In [41]:
Profling_output_final
Profling_output_final.to_csv('Profiling_output.csv')
Finding Optimal number of clusters
In [43]:
# Elbow Plot
cluster_range = range( 1, 20 )
cluster_errors = []

for num_clusters in cluster_range:
    clusters = KMeans( num_clusters )
    clusters.fit( train_num )
    cluster_errors.append( clusters.inertia_ )
In [44]:
clusters_df = pd.DataFrame( { "num_clusters":cluster_range, "cluster_errors":
cluster_errors } )

clusters_df[0:10]
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(12,6))
plt.plot( clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o"
)

Note:

 The elbow diagram shows that the gain in explained variance reduces significantly after k=2. So, the optimal number of clusters is 2.
Silhouette Coefficient
In [46]:
from sklearn import metrics
# calculate SC for K=2 through K=11
k_range = range(2, 12)
scores = []
for k in k_range:
km = KMeans(n_clusters=k, random_state=1)
km.fit(train_num)
scores.append(metrics.silhouette_score(train_num, km.labels_))
In [47]:
scores
# The SC is maximum for k=2, so we will select 2 as our optimal number of clusters
# plot the results
plt.plot(k_range, scores)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Coefficient')
plt.grid(True)
Note:

 The SC plot shows that the Silhouette Coefficient is maximum at k=2. So, the optimal number of clusters is 2.

https://www.kaggle.com/karthickaravindan/k-means-clustering-project

K Means Cluster Creation


Now it is time to create the Cluster labels!

Import KMeans from SciKit Learn.

In [14]:
from sklearn.cluster import KMeans
Create an instance of a K Means model with 2 clusters.

In [15]:
kmeans=KMeans(n_clusters=2)
Fit the model to all the data except for the Private label.

In [16]:
kmeans.fit(df.drop('Private',axis=1))

What are the cluster center vectors?

In [17]:
kmeans.cluster_centers_
def converter(cluster):
    if cluster=='Yes':
        return 1
    else:
        return 0
In [19]:
df['Cluster'] = df['Private'].apply(converter)
In [20]:
df.head()
Create a confusion matrix and classification report to see how well the Kmeans
clustering worked without being given any labels.

In [21]:
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(df['Cluster'],kmeans.labels_))
print(classification_report(df['Cluster'],kmeans.labels_))
https://www.kaggle.com/sirpunch/k-means-clustering (very important)

K-Means Clustering
Python notebook using data from The Movies Dataset
import numpy as np
import pandas as pd
import pylab as pl
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Read the movies metadata csv file

In [2]:
df = pd.read_csv("../input/movies_metadata.csv")
Only keep the numeric columns for our analysis. However, we'll keep titles also to interpret the
results at the end of clustering. Note that this title column will not be used in the analysis.

In [3]:
df.drop(df.index[19730],inplace=True)
df.drop(df.index[29502],inplace=True)
df.drop(df.index[35585],inplace=True)
In [4]:
df_numeric = df[['budget','popularity','revenue','runtime','vote_average','vote_count','title']]
In [5]:
df_numeric.head()

Check if rows contain any null values

In [6]:
df_numeric.isnull().sum()
Drop all the rows with null values

In [7]:
df_numeric.dropna(inplace=True)
Normalize data
Normalize the data with MinMax scaling provided by sklearn

In [12]:
from sklearn import preprocessing
In [13]:
minmax_processed =
preprocessing.MinMaxScaler().fit_transform(df_numeric.drop('title',axis=1))
In [14]:
df_numeric_scaled = pd.DataFrame(minmax_processed, index=df_numeric.index,
columns=df_numeric.columns[:-1])
In [15]:
df_numeric_scaled.head()

Apply K-Means Clustering


What k to choose?
Let's fit cluster size 1 to 20 on our data and take a look at the corresponding score value.

In [16]:
Nc = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in Nc]
In [17]:
score = [kmeans[i].fit(df_numeric_scaled).score(df_numeric_scaled) for i in
range(len(kmeans))]
These score values signify how far our observations are from the cluster center. We want to
keep this score value around 0. A large positive or a large negative value would indicate that the
cluster center is far from the observations.
Based on these scores value, we plot an Elbow curve to decide which cluster size is optimal.
Note that we are dealing with tradeoff between cluster size(hence the computation required)
and the relative accuracy.

In [18]:
pl.plot(Nc,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()

Our elbow point is at around a cluster size of 5. We will use k=5 to further interpret our clustering
result. I'm preferring this number for ease of interpretation in this demo. We can also pick a
higher number like 9.
Fit K-Means clustering for k=5
In [19]:
kmeans = KMeans(n_clusters=5)
kmeans.fit(df_numeric_scaled)
Out[19]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
As a result of clustering, we have the clustering label. Let's put these labels back into the
original numeric data frame.

In [20]:
len(kmeans.labels_)
Out[20]:
12178
In [21]:
df_numeric['cluster'] = kmeans.labels_
In [22]:
df_numeric.head()

Interpret clustering results


Let's see cluster sizes first.

In [23]:
plt.figure(figsize=(12,7))
axis = sns.barplot(x=np.arange(0,5,1), y=df_numeric.groupby(['cluster']).count()['budget'].values)
x=axis.set_xlabel("Cluster Number")
x=axis.set_ylabel("Number of movies")
We clearly see that one cluster is the largest and one cluster has the fewest number of movies.
Let's look at the cluster statistics.

In [24]:
df_numeric.groupby(['cluster']).mean()

size_array = list(df_numeric.groupby(['cluster']).count()['budget'].values)
In [26]:
size_array
Out[26]:
[73, 253, 3744, 4801, 1107]
df_numeric[df_numeric['cluster']==size_array.index(sorted(size_array)[0])].sample(5)

We see many big movie names in this cluster. So the results are intuitive.

 The cluster that is the second smallest in the results has the 2nd highest vote count and the most highly rated movies. The runtime for these movies is on the higher end and the popularity score is also good. Let's see some of the movie names from this cluster.

df_numeric[df_numeric['cluster']==size_array.index(sorted(size_array)[1])].sample(5)

 Lastly, let's take a look at the least successful movies. This cluster represents the movies that received the fewest votes and also has the smallest runtime, revenue and popularity score.

In [29]:
df_numeric[df_numeric['cluster']==size_array.index(sorted(size_array)[-
1])].sample(5)

As we can see, this cluster also includes the movies for which our dataset has no information
about the budget and revenue, hence their corresponding fields have a value of 0. This pulls
down the average revenue of the whole cluster. If we make the number of clusters slightly larger, we might
see these movies clustered separately.

https://www.kaggle.com/rounakbanik/the-movies-dataset

https://www.kaggle.com/vjchoudhary7/kmeans-clustering-in-customer-segmentation

KMeans Clustering in Customer Segmentation


Python notebook using data from Mall Customer Segmentation Data
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in

#import the libraries


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #Data Visualization
import seaborn as sns #Python library for Vidualization

# Input data files are available in the "../input/" directory.


# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.
['Mall_Customers.csv']
In [2]:
#Import the dataset

dataset = pd.read_csv('../input/Mall_Customers.csv')

#Exploratory Data Analysis


#As this is unsupervised learning so Label (Output Column) is unknown

dataset.head(10) #Printing first 10 rows of the dataset


Out[2]:
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
5           6  Female   22                  17                      76
6           7  Female   35                  18                       6
7           8  Female   23                  18                      94
8           9    Male   64                  19                       3
9          10  Female   30                  19                      72
In [3]:
#total rows and colums in the dataset
dataset.shape
Out[3]:
(200, 5)
In [4]:
dataset.info() # there are no missing values as all the columns has 200
entries properly
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
CustomerID 200 non-null int64
Gender 200 non-null object
Age 200 non-null int64
Annual Income (k$) 200 non-null int64
Spending Score (1-100) 200 non-null int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
In [5]:
#Missing values computation
dataset.isnull().sum()
Out[5]:
CustomerID 0
Gender 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64
In [6]:
### Feature selection for the model
#Considering only 2 features (Annual Income and Spending Score) and no Label available
X= dataset.iloc[:, [3,4]].values
In [7]:
#Building the Model
#KMeans Algorithm to decide the optimum cluster number, KMeans++ using the Elbow Method
#to figure out K for KMeans, I will use the ELBOW Method on the KMEANS++ Calculation
from sklearn.cluster import KMeans
wcss=[]

#we always assume the max number of clusters would be 10
#you can judge the number of clusters by doing averaging
###Static code to get max no of clusters

for i in range(1,11):
    kmeans = KMeans(n_clusters= i, init='k-means++', random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

#inertia_ is the within-cluster sum of squared distances (WCSS) used to compare different cluster counts
In [8]:
#Visualizing the ELBOW method to get the optimal value of K
plt.plot(range(1,11), wcss)
plt.title('The Elbow Method')
plt.xlabel('no of clusters')
plt.ylabel('wcss')
plt.show()

In [9]:
#If you zoom out this curve then you will see that the last elbow comes at k=5
#no matter what range we select, e.g. (1,21), we will see the same behaviour, but if we choose a higher range it is a little difficult to visualize the ELBOW
#that is why we usually prefer the range (1,11)
##Finally we got k=5
#Model Build
kmeansmodel = KMeans(n_clusters= 5, init='k-means++', random_state=0)
y_kmeans= kmeansmodel.fit_predict(X)

#For unsupervised learning we use "fit_predict()" whereas for supervised learning we use "fit_transform()"
#y_kmeans is the final model. Now how and where we deploy this model in production depends on what tool we are using.
#This use case is very common and it is used in the BFS industry (credit cards) and retail for customer segmentation.
In [10]:
#Visualizing all the clusters

plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
# centroids of the final 5-cluster model (kmeansmodel), not the last model from the elbow loop
plt.scatter(kmeansmodel.cluster_centers_[:, 0], kmeansmodel.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

In [11]:
###Model Interpretation
#Cluster 1 (Red Color) -> earning high but spending less
#Cluster 2 (Blue Color) -> average in terms of earning and spending
#Cluster 3 (Green Color) -> earning high and also spending high [TARGET SET]
#Cluster 4 (Cyan Color) -> earning less but spending more
#Cluster 5 (Magenta Color) -> earning less, spending less

######We can put Cluster 3 into some alerting system where an email can be sent to them on a daily basis, as these are easy to convert ######
#whereas for the others we can set it to once a week or once a month



https://www.kaggle.com/thebrownviking20/cars-k-means-clustering-script

# K-Means Clustering

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset


dataset = pd.read_csv('../input/cars.csv')

X = dataset.iloc[:,:-1].values

X = pd.DataFrame(X)
X = X.apply(pd.to_numeric, errors='coerce')   # convert_objects() was removed from pandas; coerce to numeric instead
X.columns = ['mpg', ' cylinders', ' cubicinches', ' hp', ' weightlbs', ' time-to-60', 'year']

# Eliminating null values


for i in X.columns:
X[i] = X[i].fillna(int(X[i].mean()))
for i in X.columns:
print(X[i].isnull().sum())

# Using the elbow method to find the optimal number of clusters


from sklearn.cluster import KMeans
wcss = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1,11),wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Applying k-means to the cars dataset


kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
y_kmeans = kmeans.fit_predict(X)
X = X.values   # as_matrix() was removed from pandas; .values returns the same NumPy array

# Visualising the clusters


plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='US')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Japan')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Europe')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of car brands')
plt.legend()
plt.show()

https://www.kaggle.com/roshansharma/mall-customers-clustering-analysis

Clustering Analysis
In [17]:
x = data.iloc[:, [3, 4]].values

# let's check the shape of x


print(x.shape)
Kmeans Algorithm
The Elbow Method to find the No. of Optimal Clusters
Hide

In [18]:
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    km.fit(x)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss)


plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss')
plt.show()
Visualizaing the Clusters
Hide
In [19]:
km = KMeans(n_clusters = 5, init = 'k-means++', max_iter = 300, n_init = 10,
random_state = 0)
y_means = km.fit_predict(x)

plt.scatter(x[y_means == 0, 0], x[y_means == 0, 1], s = 100, c = 'pink', label = 'miser')
plt.scatter(x[y_means == 1, 0], x[y_means == 1, 1], s = 100, c = 'yellow', label = 'general')
plt.scatter(x[y_means == 2, 0], x[y_means == 2, 1], s = 100, c = 'cyan', label = 'target')
plt.scatter(x[y_means == 3, 0], x[y_means == 3, 1], s = 100, c = 'magenta', label = 'spendthrift')
plt.scatter(x[y_means == 4, 0], x[y_means == 4, 1], s = 100, c = 'orange', label = 'careful')
plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:, 1], s = 50, c = 'blue', label = 'centroid')

plt.style.use('fivethirtyeight')
plt.title('K Means Clustering', fontsize = 20)
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.grid()
plt.show()
This Clustering Analysis gives us a very clear insight about the different segments of the
customers in the Mall. There are clearly Five segments of Customers namely Miser, General,
Target, Spendthrift, Careful based on their Annual Income and Spending Score which are
reportedly the best factors/attributes to determine the segments of a customer in a Mall.
Hierarchical Clustering
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups
similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster
is distinct from each other cluster, and the objects within each cluster are broadly similar to each
other
Using Dendrograms to find the no. of Optimal Clusters

import scipy.cluster.hierarchy as sch

dendrogram = sch.dendrogram(sch.linkage(x, method = 'ward'))


plt.title('Dendrogram', fontsize = 20)
plt.xlabel('Customers')
plt.ylabel('Euclidean Distance')
plt.show()
Visualizing the Clusters of Hierarchical Clustering
Hide
In [21]:
from sklearn.cluster import AgglomerativeClustering

hc = AgglomerativeClustering(n_clusters = 5, affinity = 'euclidean', linkage = 'ward')
y_hc = hc.fit_predict(x)

plt.scatter(x[y_hc == 0, 0], x[y_hc == 0, 1], s = 100, c = 'pink', label = 'miser')
plt.scatter(x[y_hc == 1, 0], x[y_hc == 1, 1], s = 100, c = 'yellow', label = 'general')
plt.scatter(x[y_hc == 2, 0], x[y_hc == 2, 1], s = 100, c = 'cyan', label = 'target')
plt.scatter(x[y_hc == 3, 0], x[y_hc == 3, 1], s = 100, c = 'magenta', label = 'spendthrift')
plt.scatter(x[y_hc == 4, 0], x[y_hc == 4, 1], s = 100, c = 'orange', label = 'careful')
plt.scatter(km.cluster_centers_[:,0], km.cluster_centers_[:, 1], s = 50, c = 'blue', label = 'centroid')

plt.style.use('fivethirtyeight')
plt.title('Hierarchical Clustering', fontsize = 20)
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.grid()
plt.show()
Clusters of Customers Based on their Ages

In [22]:
x = data.iloc[:, [2, 4]].values
x.shape
K-means Algorithm
Hide
In [23]:
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

plt.rcParams['figure.figsize'] = (15, 5)
plt.plot(range(1, 11), wcss)
plt.title('K-Means Clustering (The Elbow Method)', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('WCSS')
plt.grid()
plt.show()

kmeans = KMeans(n_clusters = 4, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
ymeans = kmeans.fit_predict(x)

plt.rcParams['figure.figsize'] = (10, 10)


plt.title('Cluster of Ages', fontsize = 30)

plt.scatter(x[ymeans == 0, 0], x[ymeans == 0, 1], s = 100, c = 'pink', label = 'Usual Customers')
plt.scatter(x[ymeans == 1, 0], x[ymeans == 1, 1], s = 100, c = 'orange', label = 'Priority Customers')
plt.scatter(x[ymeans == 2, 0], x[ymeans == 2, 1], s = 100, c = 'lightgreen', label = 'Target Customers(Young)')
plt.scatter(x[ymeans == 3, 0], x[ymeans == 3, 1], s = 100, c = 'red', label = 'Target Customers(Old)')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s =
50, c = 'black')

plt.style.use('fivethirtyeight')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.grid()
plt.show()
According to my own intuition by looking at the above clustering plot between the age of
the customers and their corresponding spending scores, I have aggregated them into 4 different
categories namely Usual Customers, Priority Customers, Senior Citizen Target Customers,
Young Target Customers. Then after getting the results we can accordingly make different
marketing strategies and policies to optimize the spending scores of the customer in the Mall.

x = data[['Age', 'Spending Score (1-100)', 'Annual Income (k$)']].values


km = KMeans(n_clusters = 5, init = 'k-means++', max_iter = 300, n_init = 10,
random_state = 0)
km.fit(x)
labels = km.labels_
centroids = km.cluster_centers_
Hide
In [26]:
import plotly.graph_objs as go   # needed for go.Scatter3d / go.Layout below
import plotly.offline as py      # needed for py.iplot below
py.init_notebook_mode(connected=True)

data['labels'] = labels
trace1 = go.Scatter3d(
x= data['Age'],
y= data['Spending Score (1-100)'],
z= data['Annual Income (k$)'],
mode='markers',
marker=dict(
color = data['labels'],
size= 10,
line=dict(
color= data['labels'],
width= 12
),
opacity=0.8
)
)
df = [trace1]

layout = go.Layout(
title = 'Age vs Spending Score vs Annual Income',
margin=dict(
l=0,
r=0,
b=0,
t=0
),
scene = dict(
xaxis = dict(title = 'Age'),
yaxis = dict(title = 'Spending Score'),
zaxis = dict(title = 'Annual Income')
)
)

fig = go.Figure(data = df, layout = layout)


py.iplot(fig)
http://localhost:8888/notebooks/Documents/2019%20OSS/Untitled.ipynb?
kernel_name=python3#

Yousef-Project-1

%matplotlib inline

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

#data = pd.read_csv('data_visualization.csv', index_col=0)

data = pd.read_csv('MRR-24-Feb.csv', index_col=0)

corr = data.corr()

fig = plt.figure()

ax = fig.add_subplot(111)

cax = ax.matshow(corr,cmap='coolwarm', vmin=-1, vmax=1)

fig.colorbar(cax)

ticks = np.arange(0,len(data.columns),1)

ax.set_xticks(ticks)

plt.xticks(rotation=90)

ax.set_yticks(ticks)

ax.set_xticklabels(data.columns)

ax.set_yticklabels(data.columns)

plt.show()
data.describe()

data.describe()
[data.describe() output: count, mean, std, min, 25%, 50%, 75% and max for the 12 numeric MRR columns (Path Loss Diff. > 0 dB (%), Power Red. BS = 0 dB (%), Path Loss DL > 150 dB (%), RXLEV DL > -95 dBm (%), RXLEV UL > -95 dBm (%), RXQUAL DL > 4 GSM (%), RXQUAL UL > 4 GSM (%), RXQUAL UL Average (GSM), RXQUAL DL Average (GSM), RXLEV UL Average (dBm), RXLEV DL Average (dBm), Traffic Level Average (E)); each column has a count of 662, except Traffic Level Average (E) with 680.]
data.info()

data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 683 entries, EBSC05/KC2001A to TOTAL
Data columns (total 13 columns):
Path Loss Diff. > 0 dB (%) 662 non-null float64
Power Red. BS = 0 dB (%) 662 non-null float64
Path Loss DL > 150 dB (%) 662 non-null float64
RXLEV DL > -95 dBm (%) 662 non-null float64
RXLEV UL > -95 dBm (%) 662 non-null float64
RXQUAL DL > 4 GSM (%) 662 non-null float64
RXQUAL UL > 4 GSM (%) 662 non-null float64
Channel Group 683 non-null object
RXQUAL UL Average (GSM) 662 non-null float64
RXQUAL DL Average (GSM) 662 non-null float64
RXLEV UL Average (dBm) 662 non-null float64
RXLEV DL Average (dBm) 662 non-null float64
Traffic Level Average (E) 680 non-null float64
dtypes: float64(12), object(1)
memory usage: 94.7+ K

data.dtypes

Path Loss Diff. > 0 dB (%) float64


Power Red. BS = 0 dB (%) float64
Path Loss DL > 150 dB (%) float64
RXLEV DL > -95 dBm (%) float64
RXLEV UL > -95 dBm (%) float64
RXQUAL DL > 4 GSM (%) float64
RXQUAL UL > 4 GSM (%) float64
Channel Group object
RXQUAL UL Average (GSM) float64
RXQUAL DL Average (GSM) float64
RXLEV UL Average (dBm) float64
RXLEV DL Average (dBm) float64
Traffic Level Average (E) float64
dtype: object

data.shape

print("the no of rows is : {}".format(data.shape[0]))


print("the nu of columns is : {}".format(data.shape[1]))
the no of rows is : 683
the nu of columns is : 13
#empty all of them

print(data.isnull().sum() )
Path Loss Diff. > 0 dB (%) 21
Power Red. BS = 0 dB (%) 21
Path Loss DL > 150 dB (%) 21
RXLEV DL > -95 dBm (%) 21
RXLEV UL > -95 dBm (%) 21
RXQUAL DL > 4 GSM (%) 21
RXQUAL UL > 4 GSM (%) 21
Channel Group 0
RXQUAL UL Average (GSM) 21
RXQUAL DL Average (GSM) 21
RXLEV UL Average (dBm) 21
RXLEV DL Average (dBm) 21
Traffic Level Average (E) 3
dtype: int64

import seaborn as sns

sns.set(style="ticks", color_codes=True)

# sns.pairplot(data, hue='EBSC05/KC2001A', size=2.5);

sns.pairplot(data);

g = sns.pairplot(data, hue="Channel Group")


corr.head()

[corr.head() output: the first five rows of the correlation matrix between the numeric MRR columns. Notable values include Power Red. BS = 0 dB (%) vs RXQUAL DL Average (GSM) at -0.65 and vs RXLEV DL Average (dBm) at 0.59, and RXLEV UL > -95 dBm (%) vs RXLEV UL Average (dBm) at 0.94.]

# sns.heatmap(corr,annot=True)

sns.heatmap(corr,annot=True,cmap="YlGnBu")

# ax = sns.heatmap(uniform_data, vmin=0, vmax=1)


# More Heat map Learn

plt.figure(figsize=(30,10))

sns.heatmap(corr,annot=True,cmap="YlGnBu",fmt='.1g',square=True)

# plt.figure(figsize=(15,4))

# plt.savefig('medals.svg')
#very important Subplot Seaborn

fig = plt.figure(figsize = (20, 25))

# result['diagnosis'] = data.iloc[:,0]

j=0

for i in data.columns:
    plt.subplot(6, 4, j+1)
    j += 1
    sns.distplot(data[i][data['Channel Group']==0], color='g', label = 'CHGr-0')
    sns.distplot(data[i][data['Channel Group']==1], color='r', label = 'CHGr-1')
    plt.legend(loc='best')

fig.suptitle('KV2001_MRR')
fig.tight_layout()

fig.subplots_adjust(top=0.95)

plt.show()

g = sns.pairplot(data, vars=["RXLEV DL > -95 dBm (%)", "RXQUAL DL > 4 GSM (%)"])
g = sns.pairplot(data,x_vars=["Path Loss Diff. > 0 dB (%)", "Power Red. BS = 0 dB (%)"],

y_vars=["Path Loss DL > 150 dB (%)", "Traffic Level Average (E)"],kind="reg")

sns.pairplot(data, hue = 'Channel Group')

sns.pairplot(data, hue = 'Area Group')


sns.pairplot(data, hue = 'Clustering')

sns.pairplot(df, hue = 'clusters', diag_kind = 'kde',

plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'},

size = 4)

# Plot colored by continent for years 2000-2007

sns.pairplot(df[df['year'] >= 2000],

vars = ['life_exp', 'log_pop', 'log_gdp_per_cap'],

hue = 'continent', diag_kind = 'kde',

plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'},

size = 4);

# Title

plt.suptitle('Pair Plot of Socioeconomic Data for 2000-2007',

size = 28);

# Create an instance of the PairGrid class.

# per CHGR-0 and CHGr-1

grid = sns.PairGrid(data= df_log[df_log['year'] == 2007],

vars = ['life_exp', 'log_pop',

'log_gdp_per_cap'], size = 4)

# Map a histogram to the diagonal

grid = grid.map_diag(plt.hist, bins = 10, color = 'darkred',

edgecolor = 'k')

# Map a density plot to the lower triangle

grid = grid.map_lower(sns.kdeplot, cmap = 'Reds')


# https://stackoverflow.com/questions/30942577/seaborn-correlation-coefficient-on-pairgrid

import numpy as np

from scipy import stats

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

sns.set(style="white")

mean = np.zeros(3)

cov = np.random.uniform(.2, .4, (3, 3))

cov += cov.T

cov[np.diag_indices(3)] = 1

data = np.random.multivariate_normal(mean, cov, 100)

df = pd.DataFrame(data, columns=["X", "Y", "Z"])

def corrfunc(x, y, **kws):

r, _ = stats.pearsonr(x, y)

ax = plt.gca()

ax.annotate("r = {:.2f}".format(r),

xy=(.1, .9), xycoords=ax.transAxes)

g = sns.PairGrid(df, palette=["red"])

g.map_upper(plt.scatter, s=10)

g.map_diag(sns.distplot, kde=False)

g.map_lower(sns.kdeplot, cmap="Blues_d")

g.map_lower(corrfunc)
#K-means Clustering

#MRR-3-Feb

dataF = pd.read_csv('MRR-3-Feb.csv', index_col=0)

dataF.head()

sns.kdeplot(data['Path Loss Diff. > 0 dB (%)'], shade=True)

fig = plt.figure()

ax1 = fig.add_axes([0.1, 0.5, 0.8, 0.4],

xticklabels=[], ylim=(-1.2, 1.2))

ax2 = fig.add_axes([0.1, 0.1, 0.8, 0.4],

ylim=(-1.2, 1.2))

ax1.hist(data['Path Loss Diff. > 0 dB (%)'], density=True, alpha=0.5)

ax2.hist(data['Path Loss Diff. > 0 dB (%)'], density=True, alpha=0.5)


fig = plt.figure()

fig.subplots_adjust(hspace=0.4, wspace=0.4)

for i in range(1, 7):
    ax = fig.add_subplot(2, 3, i)
    ax.hist(data['Path Loss Diff. > 0 dB (%)'], density=True, alpha=0.5)   # normed= was removed from matplotlib; density= is the replacement

plt.scatter(data['RXLEV DL > -95 dBm (%)'],

data['RXQUAL DL > 4 GSM (%)'],

alpha=0.4, edgecolors='w')

plt.xlabel('RXLEV DL > -95 dBm (%)')

plt.ylabel('RXQUAL DL > 4 GSM (%)')

plt.title('Wine Sulphates - Alcohol Content', y=1.05)


plt.scatter(data['RXLEV DL Average (dBm)'],

data['RXQUAL DL Average (GSM)'],

alpha=0.4, edgecolors='w')

plt.xlabel('RXLEV DL Average (dBm)')

plt.ylabel('RXQUAL DL Average (GSM)')

plt.title('Wine Sulphates - Alcohol Content', y=1.05)


jp = sns.jointplot(data=data,

x='RXLEV DL Average (dBm)',

y='RXQUAL DL Average (GSM)',

kind='reg', # <== 😀 Add regression and kernel density fits

space=0, size=6, ratio=4)


jp = sns.jointplot(data=data,

x='RXLEV DL Average (dBm)',

y='RXQUAL DL Average (GSM)',

kind='kde', # <== 😀 Add regression and kernel density fits

space=0, size=6, ratio=4)


fig = plt.figure(figsize=(10,4))

title = fig.suptitle("Sulphates Content in Wine", fontsize=14)

fig.subplots_adjust(top=0.85, wspace=0.3)

ax1 = fig.add_subplot(1,2,1)

ax1.set_title("Red Wine")

ax1.set_xlabel("Sulphates")

ax1.set_ylabel("Density")

sns.kdeplot(data['RXLEV DL Average (dBm)'], ax=ax1, shade=True, color='r')

ax2 = fig.add_subplot(1,2,2)

ax2.set_title("White Wine")

ax2.set_xlabel("Sulphates")
ax2.set_ylabel("Density")

sns.kdeplot(data['RXQUAL DL Average (GSM)'], ax=ax2, shade=True, color='y')

# very important about subplots

# how to draw all Feature Columns of MRR

fig, axes = plt.subplots(nrows=3, ncols=3)

# xx , yy = enumerate(data.columns)

# for i, column in enumerate(data.columns):

# sns.distplot(data[column],ax=axes[i//3,i%3])

i=1

for (columnName, columnData) in data.iteritems():
    print(columnName)
    sns.distplot(data[columnName], ax=axes[i//3, i%3])
    i = i + 1
plt.show()

size = wines['residual sugar']*25

fill_colors = ['#FF9999' if wt=='red' else '#FFE888' for wt in list(wines['wine_type'])]

edge_colors = ['red' if wt=='red' else 'orange' for wt in list(wines['wine_type'])]

plt.scatter(wines['fixed acidity'], # <== 😀 1st DIMENSION

wines['alcohol'], # <== 😀 2nd DIMENSION

s=size, # <== 😀 3rd DIMENSION

color=fill_colors, # <== 😀 4th DIMENSION

edgecolors=edge_colors,

alpha=0.4)

plt.xlabel('Fixed Acidity')

plt.ylabel('Alcohol')

plt.title('Wine Alcohol Content - Fixed Acidity - Residual Sugar - Type',y=1.05)


g = sns.FacetGrid(wines,

col="wine_type", # TWO COLUMNS coz there're TWO "wine types"

col_order=['red', 'white'], # -> Specify the labels

hue='quality_label', # ADD COLOR

hue_order=['low', 'medium', 'high'],

aspect=1.2,

size=3.5)

g.map(plt.scatter,

"residual sugar", # <== x-axis

"alcohol", # <== y-axis

alpha=0.5,

edgecolor='white',

linewidth=0.5,

s=wines['total sulfur dioxide']*2) # <== 😀 Adjust the size

fig = g.fig

fig.subplots_adjust(top=0.8, wspace=0.3)

fig.suptitle('Wine Type - Sulfur Dioxide - Residual Sugar - Alcohol - Quality', fontsize=14)

g.add_legend(title='Wine Quality Class')

f, (ax) = plt.subplots(1, 1, figsize=(12, 4))

f.suptitle('Wine Quality - Alcohol Content', fontsize=14)

sns.boxplot(data=data,

x=data['RXLEV DL Average (dBm)'],

y=data['RXQUAL DL Average (GSM)'],

ax=ax)
ax.set_xlabel("Wine Quality",size=12,alpha=0.8)

ax.set_ylabel("Wine Alcohol %",size=12,alpha=0.8)

# sns.boxplot(data=data['RXLEV DL Average (dBm)'])

# Pre-format DataFrame

stats_df = data.drop(['Channel Group', 'RXQUAL UL Average (GSM)', 'RXLEV UL Average (dBm)', 'RXLEV DL Average (dBm)', 'RXQUAL DL > 4 GSM (%)', 'Traffic Level Average (E)', 'RXQUAL UL > 4 GSM (%)'], axis=1)

# New boxplot using stats_df

sns.boxplot(data=stats_df)

#Groupby # Merge

df = pd.read_csv('BlackFriday.csv', usecols = ['User_ID', 'Gender', 'Age', 'Purchase'])

df_gp_1 = df[['User_ID', 'Purchase']].groupby('User_ID').agg(np.mean).reset_index()

df_gp_2 = df[['User_ID', 'Gender', 'Age']].groupby('User_ID').agg(max).reset_index()

df_gp = pd.merge(df_gp_1, df_gp_2, on = ['User_ID'])

#Drop Column of Channel Group

#Drop Null Values and Empty Values

# data.head()

data.isnull().sum()

# #clean

data.dropna(how="any", inplace=True)

data.isnull().sum()

data.head()
#Scaling :

from sklearn import preprocessing

# minmax_processed = preprocessing.MinMaxScaler().fit_transform(data)

# data = data.drop("Cell Name", axis=1)

# print (data.columns)

# data.head()

minmax_processed = preprocessing.MinMaxScaler().fit_transform(data)

# df_numeric_scaled = pd.DataFrame(minmax_processed)

df_numeric_scaled = pd.DataFrame(minmax_processed, index=data.index, columns=data.columns)

df_numeric_scaled.head()

#Elbow

#K-means

from sklearn.cluster import KMeans

wcss = []

for i in range(1, 100):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(df_numeric_scaled)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 100), wcss)

plt.title('Elbow Method')

plt.xlabel('Number of clusters')

plt.ylabel('WCSS')

plt.show()
kmeans = KMeans(n_clusters=10, init='k-means++', max_iter=300, n_init=10, random_state=0)

kmeans.fit(df_numeric_scaled)

# pred_y = kmeans.fit_predict(X)

# plt.scatter(X[:,0], X[:,1])

# plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')

# plt.show()

kmeans.cluster_centers_

kmeans.labels_

len(kmeans.labels_)

data['cluster'] = kmeans.labels_

data.head()

# to Excel Save

data.to_excel(r'test1.xlsx', index = True)

plt.figure(figsize=(12,7))

axis = sns.barplot(x=np.arange(0,10,1), y=data.groupby(['cluster']).count()['Traffic Level Average (E)'].values)

x=axis.set_xlabel("Cluster Number")

x=axis.set_ylabel("Traffic Level Average (E)")

data.groupby(['cluster']).mean()

# tips = sns.load_dataset("tips")

# ax = sns.barplot(x="day", y="total_bill", data=tips)

data.groupby(['cluster']).mean().to_excel(r'test2.xlsx', index = True)

data.groupby(['cluster']).median()

size_array = list(data.groupby(['cluster']).count()['Traffic Level Average (E)'].values)


size_array

data[data['cluster']==size_array.index(sorted(size_array)[0])].sample(5)

data[data['cluster']==size_array.index(sorted(size_array)[1])].sample(5)

#Visualization Clusters / Charts

DF_Cluster = data.groupby(['cluster']).mean()

DF_Cluster.head()

ax = sns.barplot(x="Power Red. BS = 0 dB (%)", y=DF_Cluster.index, data=DF_Cluster,orient="h" )

x=ax.set_xlabel("Power Red. BS = 0 dB (%)")

x=ax.set_ylabel("Cluster Number")

ax = sns.barplot(x="Path Loss DL > 150 dB (%)", y=DF_Cluster.index, data=DF_Cluster,orient="h" )

x=ax.set_xlabel("Path Loss DL > 150 dB (%)")

x=ax.set_ylabel("Cluster Number")
#Hue Clusters with Them

ax = sns.distplot(DF_Cluster["Traffic Level Average (E)"])

import numpy as np

from scipy import stats

x = np.random.standard_normal(1000)  # leftover from the original example; not used below

# Overlay a normal pdf on the histogram. The standard-normal curve over [-4, 4] only lines
# up if the plotted data is standardized; otherwise pass loc/scale to stats.norm.pdf.
ax = sns.distplot(DF_Cluster["Traffic Level Average (E)"], kde=False, norm_hist=True)

# calculate the pdf over a range of values
xx = np.arange(-4, +4, 0.001)

yy = stats.norm.pdf(xx)

# and plot on the same axes that seaborn put the histogram
ax.plot(xx, yy, 'r', lw=2)

# hue usage example (Titanic dataset): split each box by a third categorical column
sns.boxplot(x='sex', y='age', data=dataset, hue="survived")
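
The same pattern applied to the clustered KPI data gives one box per cluster (a minimal sketch using the data and 'cluster' column created earlier):

import seaborn as sns
import matplotlib.pyplot as plt

ax = sns.boxplot(x='cluster', y='Traffic Level Average (E)', data=data)
ax.set_xlabel("Cluster Number")
plt.show()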

# Sort the dataframe by target

target_0 = data.loc[data['cluster'] == 0]

target_1 = data.loc[data['cluster'] == 1]

target_2 = data.loc[data['cluster'] == 2]

target_6 = data.loc[data['cluster'] == 6]

sns.distplot(target_0[["Traffic Level Average (E)"]], hist=False, rug=True)

sns.distplot(target_1[["Traffic Level Average (E)"]], hist=False, rug=True)

sns.distplot(target_2[["Traffic Level Average (E)"]], hist=False, rug=True)

sns.distplot(target_6[["Traffic Level Average (E)"]], hist=False, rug=True)

# sns.plt.show()
unique_vals = data['cluster'].unique() # [0, 1, 2]

# Sort the dataframe by target

# Use a list comprehension to create list of sliced dataframes

targets = [data.loc[data['cluster'] == val] for val in unique_vals]

# Iterate through list and plot the sliced dataframe

for i, target in enumerate(targets):
    sns.distplot(target[["Traffic Level Average (E)"]], hist=False, rug=True, label="Cluster" + str(i))

# for target in targets:

# sns.distplot(target[["Traffic Level Average (E)"]], hist=False, rug=True , label="KDE")

# fig.legend(labels=['test_label1','test_label2'])

# sns.plt.show()
Subplot Data Columns >> For Loop Clusters Distplot

I used the following code to create a synthetic dataset which appears to match yours:

import pandas
import numpy
import seaborn as sns
import matplotlib.pyplot as plt

# Generate synthetic data


omega = numpy.linspace(0, 50)

A0s = [1., 18., 40., 100.]

dfs = []
for A0 in A0s:
    V_w_dr = numpy.sin(A0*omega)
    V_w_tr = numpy.cos(A0*omega)
    dfs.append(pandas.DataFrame({'omega': omega,
                                 'V_w_dr': V_w_dr,
                                 'V_w_tr': V_w_tr,
                                 'A0': A0}))
dataframe = pandas.concat(dfs, axis=0)
Then you can do what you want (thanks to @mwaskom in the comments for sharey='row', margin_titles=True):
melted = dataframe.melt(id_vars=['A0', 'omega'], value_vars=['V_w_dr', 'V_w_tr'])
g = sns.FacetGrid(melted, col='A0', hue='A0', row='variable', sharey='row',
margin_titles=True)
g.map(plt.plot, 'omega', 'value')
You need melt to reshape the data for seaborn.factorplot (renamed to catplot in seaborn 0.9+):
df = df.melt('X_Axis', var_name='cols', value_name='vals')
#alternative for pandas < 0.20.0
#df = pd.melt(df, 'X_Axis', var_name='cols', value_name='vals')
g = sns.factorplot(x="X_Axis", y="vals", hue='cols', data=df)
Sample:

df = pd.DataFrame({'X_Axis':[1,3,5,7,10,20],
'col_2':[.4,.5,.4,.5,.5,.4],
'col_3':[.7,.8,.9,.4,.2,.3],
'col_4':[.1,.3,.5,.7,.1,.0],
'col_5':[.5,.3,.6,.9,.2,.4]})

print (df)
   X_Axis  col_2  col_3  col_4  col_5
0       1    0.4    0.7    0.1    0.5
1       3    0.5    0.8    0.3    0.3
2       5    0.4    0.9    0.5    0.6
3       7    0.5    0.4    0.7    0.9
4      10    0.5    0.2    0.1    0.2
5      20    0.4    0.3    0.0    0.4

df = df.melt('X_Axis', var_name='cols', value_name='vals')


g = sns.factorplot(x="X_Axis", y="vals", hue='cols', data=df)
Aside from cleaning up your data into a tidy format, you need to reformat the text data
(percentages) into numeric data types. Since that has nothing to do with barplots, I'll assume
you can take care of that on your own and focus on the plotting and data structures instead:

import pandas
import seaborn
from matplotlib import pyplot

df = pandas.DataFrame({
    'Factor': ['Growth', 'Value'],
    'Weight': [0.10, 0.20],
    'Variance': [0.15, 0.35]
})
fig, ax1 = pyplot.subplots(figsize=(10, 10))
tidy = df.melt(id_vars='Factor').rename(columns=str.title)
seaborn.barplot(x='Factor', y='Value', hue='Variable', data=tidy, ax=ax1)
seaborn.despine(fig)
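
The percentage-to-numeric step the answer skips can be as simple as stripping the '%' sign (a sketch; the 'Weight' column and '%' format are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({'Factor': ['Growth', 'Value'],
                   'Weight': ['10%', '20%']})        # text percentages

# strip the '%' and convert to a float fraction
df['Weight'] = df['Weight'].str.rstrip('%').astype(float) / 100
print(df.dtypes)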

Seaborn favors the "long format" as input. The key ingredient to convert your DataFrame from
its "wide format" (one column per measurement type) into long format (one column for all
measurement values, one column to indicate the type) is pandas.melt. Given
a data_preproc structured like yours, filled with random values:
num_rows = 20
years = list(range(1990, 1990 + num_rows))
data_preproc = pd.DataFrame({
    'Year': years,
    'A': np.random.randn(num_rows).cumsum(),
    'B': np.random.randn(num_rows).cumsum(),
    'C': np.random.randn(num_rows).cumsum(),
    'D': np.random.randn(num_rows).cumsum()})
A single plot with four lines, one per measurement type, is obtained with

sns.lineplot(x='Year', y='value', hue='variable',
             data=pd.melt(data_preproc, ['Year']))
(Note that 'value' and 'variable' are the default column names returned by melt, and can be
adapted to your liking.)
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
readCov = pd.DataFrame(
    [(1, '\'"ID=PBANKA_010290;Name=PBANKA_010290;descript...', 108389, 0.785456, 0.899275, 0.803017),
     (1, '\'"ID=PBANKA_010300;Name=PBANKA_010300;descript...', 117894, 1.070673, 0.964203, 0.9893719999999999),
     (1, '\'"ID=PBANKA_010310;Name=PBANKA_010310;descript...', 119281, 1.0311059999999999, 1.042189, 0.883518),
     (2, '\'"ID=PBANKA_010320;Name=PBANKA_010320;descript...', 122082, 0.880109, 1.031673, 1.0265389999999999),
     (2, '\'"ID=PBANKA_010330;Name=PBANKA_010330;descript...', 126075, 0.948105, 0.969198, 0.8492129999999999)],
    columns=[u'chr', u'description', u'pos', u'bergB7', u'bergC9', u'EvolB20'],
)

meltCov = pd.melt(readCov,id_vars=['chr','description','pos'], var_name='strain')


g = sns.FacetGrid(meltCov, col='chr', hue='strain')
g.map(plt.scatter, 'pos','value')
g.set_xticklabels(rotation=45)
g.add_legend()

# this saves a figure named after the script file automatically


from os.path import realpath, basename
s = basename(realpath(__file__))
fig = plt.gcf()
fig.savefig(s.split('.')[0])
plt.show()
Only a simple change is needed here: instead of a single g = g.map(plt.plot, 'DATE', 'IRR', 'TWR'), map each column separately:
df = pd.DataFrame({'PORTFOLIO': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
'DATE': ['01.01.2018', '01.04.2018', '01.07.2018', '01.10.2018',
'01.01.2018',
'01.04.2018', '01.07.2018', '01.10.2018', ],
'IRR': [.7, .8, .9, .4, .2, .3, .4, .9],
'TWR': [.1, .3, .5, .7, .1, .0, .4, .9],
})

print(df)
sns.set(style='ticks', color_codes=True)
g = sns.FacetGrid(df, col="PORTFOLIO", col_wrap=4, height=4)
g = g.map(plt.plot, 'DATE', 'IRR', color='#FFAA11')
g = g.map(plt.plot, 'DATE', 'TWR', color='#22AA11')
plt.show()
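
An alternative sketch for the same PORTFOLIO/DATE/IRR/TWR frame: melt the two measures into long format so a single relplot call draws both lines per facet and builds the legend automatically (the kind/height values are illustrative):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

long_df = df.melt(id_vars=['PORTFOLIO', 'DATE'], value_vars=['IRR', 'TWR'],
                  var_name='measure', value_name='value')

# one facet per portfolio, one coloured line per measure
g = sns.relplot(data=long_df, x='DATE', y='value', hue='measure',
                col='PORTFOLIO', kind='line', height=4)
plt.show()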

Subplots

https://www.kaggle.com/sohailkhan/pandas-plotting-and-visualization

We create the figure with the subplots:

f, axes = plt.subplots(1, 2)
where axes is an array containing one Axes object per subplot.

Then we tell each plot which subplot to draw into with the ax argument.
sns.boxplot( y="b", x= "a", data=df, orient='v' , ax=axes[0])
sns.boxplot( y="c", x= "a", data=df, orient='v' , ax=axes[1])
And the result is two box plots, one per subplot.
https://seaborn.pydata.org/examples/distplot_options.html


Distribution plot options


Python source code: [download source: distplot_options.py]

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="white", palette="muted", color_codes=True)


rs = np.random.RandomState(10)

# Set up the matplotlib figure


f, axes = plt.subplots(2, 2, figsize=(7, 7), sharex=True)
sns.despine(left=True)
# Generate a random univariate dataset
d = rs.normal(size=100)

# Plot a simple histogram with binsize determined automatically


sns.distplot(d, kde=False, color="b", ax=axes[0, 0])

# Plot a kernel density estimate and rug plot


sns.distplot(d, hist=False, rug=True, color="r", ax=axes[0, 1])

# Plot a filled kernel density estimate


sns.distplot(d, hist=False, color="g", kde_kws={"shade": True}, ax=axes[1,
0])

# Plot a histogram and kernel density estimate


sns.distplot(d, color="m", ax=axes[1, 1])

plt.setp(axes, yticks=[])
plt.tight_layout()



# Sort the dataframe by target

target_0 = data.loc[data['cluster'] == 0]

target_1 = data.loc[data['cluster'] == 1]

target_2 = data.loc[data['cluster'] == 2]

target_6 = data.loc[data['cluster'] == 6]

f, axes = plt.subplots(2, 3)

sns.distplot(target_0[["Traffic Level Average (E)"]], hist=False, rug=True,ax=axes[0,0])

sns.distplot(target_1[["Traffic Level Average (E)"]], hist=False, rug=True,ax=axes[0,1])

sns.distplot(target_2[["Traffic Level Average (E)"]], hist=False, rug=True,ax=axes[0,2])

sns.distplot(target_6[["Traffic Level Average (E)"]], hist=False, rug=True,ax=axes[1,0])

# sns.plt.show()

Difficult solutions:

unique_vals = data['cluster'].unique() # [0, 1, 2]

# fig = plt.figure(figsize = (20, 25))

f, axes = plt.subplots(1, 3)

j=0

targets = [data.loc[data['cluster'] == val] for val in unique_vals]


data_t1 = data.loc[:, ['RXLEV DL > -95 dBm (%)','RXQUAL DL > 4 GSM (%)','Traffic Level Average (E)']]

data_t1.head()

for column in data_t1:
    columnSeriesObj = data_t1[column]
    print('Column Name : ', column)
    # print('Column Contents : ', columnSeriesObj.values)
    for i, target in enumerate(targets):
        print(target[[column]])
        # print(target.column)
        # plt.subplots(1, 3) returns a 1-D axes array, so index it as axes[j] (not axes[0, j])
        sns.distplot(target[[column]], ax=axes[j], hist=False, rug=True, label="Cluster" + str(i))
    j += 1

Final Solution very important : idea

unique_vals = data['cluster'].unique() # [0, 1, 2]

# Sort the dataframe by target

# Use a list comprehension to create list of sliced dataframes

targets = [data.loc[data['cluster'] == val] for val in unique_vals]

# Iterate through list and plot the sliced dataframe

# data_t1 = data.loc[:, ['RXLEV DL > -95 dBm (%)','RXQUAL DL > 4 GSM (%)','Traffic Level Average (E)']]

c= ["Traffic Level Average (E)",'RXLEV DL > -95 dBm (%)']

f,axes = plt.subplots(1, 2)
for ix,cx in enumerate(c):

for i ,target in enumerate(targets):

sns.distplot(target[[cx]],hist=False,rug=True,label="Cluster" + str(i),ax=axes[ix])
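
The same idea generalizes to any list of KPI columns by sizing the subplot grid from the list length (a sketch built on the data, targets, and column names used above; the 3-per-row layout is arbitrary):

import math
import matplotlib.pyplot as plt
import seaborn as sns

cols = ["Traffic Level Average (E)", 'RXLEV DL > -95 dBm (%)', 'RXQUAL DL > 4 GSM (%)']
n_per_row = 3
n_rows = math.ceil(len(cols) / n_per_row)

f, axes = plt.subplots(n_rows, n_per_row, figsize=(5 * n_per_row, 4 * n_rows), squeeze=False)

for ix, cx in enumerate(cols):
    ax = axes[ix // n_per_row][ix % n_per_row]
    for i, target in enumerate(targets):
        sns.distplot(target[[cx]], hist=False, rug=True, label="Cluster" + str(i), ax=ax)
    ax.set_title(cx)

plt.tight_layout()
plt.show()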
