Business Analytics and Big Data
Clustering
It is often useful to partition data without having a training sample; this is also known as
unsupervised learning. For example, in business, it may be important to determine groups
of customers who have similar buying patterns, or in medicine, it may be important to
determine groups of patients who show similar reactions to prescribed drugs. The goal of
clustering is to place records into groups, such that records in a group are similar to each
other and dissimilar to records in other groups. The groups are usually disjoint.
An important aspect of clustering is the similarity function that is used. When the data is
numeric, a similarity function based on distance is typically used. For example, the
Euclidean distance can be used to measure similarity. Consider two n-dimensional data
records as points x and y in n-dimensional space. We can consider the value for the ith
dimension as xi and yi for the two records. The Euclidean distance between points
x=(x1,…,xn) and y=(y1,…,yn) in n-dimensional space is
$$\rho(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
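As an illustration, the distance formula can be sketched in Python. The function name `euclidean_distance` is an assumption made for this example and does not come from the text.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric records."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of dimensions")
    # Sum the squared per-dimension differences, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Two points whose coordinates differ by 3 and 4 are 5 apart.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```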
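The smaller the distance between two points, the greater their similarity. A classic
clustering algorithm is k-Means: starting from k initial centroids, it repeatedly assigns
each record to the cluster with the nearest centroid and then recomputes each centroid as
the mean of its cluster, until the assignments no longer change.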
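A minimal Python sketch of the standard k-means loop may help make the procedure concrete. It assumes Euclidean distance, initial centroids supplied by the caller, and termination once the assignments stop changing. The names `k_means`, `distance`, and `mean`, the `max_iterations` cap, and the sample data are all hypothetical, and this is not necessarily the exact formulation the text has in mind.

```python
import math

def distance(x, y):
    """Euclidean distance between two equal-length records."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean(points):
    """Component-wise mean of a non-empty list of records."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def k_means(records, initial_centroids, max_iterations=100):
    centroids = [tuple(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for every record.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: distance(r, centroids[j]))
            for r in records
        ]
        if new_assignment == assignment:
            break  # no record changed cluster, so the loop has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [r for r, a in zip(records, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    return assignment, centroids

# Hypothetical data, not the records used in the worked example.
records = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
assignment, centroids = k_means(records, [records[0], records[2]])
print(assignment)  # [0, 0, 1, 1]
print(centroids)   # [(1.25, 1.5), (8.5, 8.75)]
```

Ties between equally near centroids go to the lower cluster index here; that is a choice made for the sketch, not something the text specifies.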
Assume that the number of desired clusters k is 2. Let the algorithm choose RECORD 3
for cluster C1 and RECORD 6 for cluster C2 as the initial cluster centroids.
The remaining records will be assigned to one of those clusters during the first iteration of
the repeat loop.
RECORD 1 has a distance from C1 of $\sqrt{20^2 + 10^2} = 22.4$ and a distance from C2 of 32.0,
so it joins cluster C1. RECORD 2 has a distance from C1 of 10.0 and a distance from C2
of 5.0, so it joins cluster C2. RECORD 4 has a distance from C1 of 25.5 and a distance
from C2 of 36.6, so it joins cluster C1. RECORD 5 has a distance from C1 of 20.6 and a
distance from C2 of 29.2, so it joins cluster C1.
Thus we have
C1 = {RECORD 1, RECORD 3, RECORD 4, RECORD 5}
C2 = {RECORD 2, RECORD 6}
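The first assignment pass can be replayed in a few lines of Python using only the distances quoted above. The dictionary layout and the tie-breaking rule (`<=` favours C1) are assumptions made for this sketch; no tie actually occurs in these figures.

```python
# Distances to the initial centroids (C1, C2), as stated in the text.
distances = {
    "RECORD 1": (22.4, 32.0),
    "RECORD 2": (10.0, 5.0),
    "RECORD 4": (25.5, 36.6),
    "RECORD 5": (20.6, 29.2),
}

# The two seed records start out in their own clusters.
clusters = {"C1": ["RECORD 3"], "C2": ["RECORD 6"]}

for record, (d1, d2) in distances.items():
    # Each remaining record joins the cluster with the nearer centroid.
    nearest = "C1" if d1 <= d2 else "C2"
    clusters[nearest].append(record)

for name in clusters:
    clusters[name].sort()

print(clusters)
```

Running this yields RECORD 1, 3, 4, and 5 in C1, with RECORD 2 and RECORD 6 in C2.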
Now the new means (centroids) for the two clusters are computed. The mean for a cluster
Ci is a vector whose components are the means of the individual dimensions, taken over
the records in the cluster.
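The centroid computation can be sketched as a component-wise average. The function name `centroid` and the sample records are hypothetical; in particular, the values below are not the data from the worked example.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length records."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

# Hypothetical two-dimensional records, not the data from the example.
cluster = [(2.0, 4.0), (4.0, 8.0), (6.0, 6.0)]
print(centroid(cluster))  # (4.0, 6.0)
```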
Tasks
3. Re-allocate all the records according to the new means calculated in (1) and (2)
5. Re-allocate all the records according to the new means calculated in (4)