
Section 3: Data Reduction and Segmentation: Cluster Analysis and Principal Component Analysis
3.1 Data reduction and visualisation

• The best way of understanding multivariate data (m cases by p variables) is to visualise it.
• If m is large (lots of data points):
– Plotting all the data of a large data set on one diagram just produces a black mass.
– Overprinting means one cannot see what is going on.
– It is better to draw density contours, i.e. estimate probability density functions (see the sketch below).
3.2 Principal Component Analysis

• If p, the number of data characteristics, is very large, it is impossible to build visualisations except in a few dimensions.
• Which dimensions?
– Can project onto each variable (or a small subset – scatterplots, star icons).
– But maybe the interesting structure is not in the original variables.
– Look for a connection in data (x1, x2) with the points (1,2), (2,4), (3,6), (4,8), (5,10).
• The most interesting direction is x1 + 2x2, while the most boring direction is the one perpendicular to it, 2x1 − x2.
– i.e. "interesting" means the direction in which the sample variance is biggest.
– So if we plot onto two dimensions, the first is the dimension in which the sample variance is largest; the second is the dimension, perpendicular to the first, with the biggest sample variance among such perpendicular dimensions.
• In the same way one can define the 1st, 2nd, 3rd, etc. principal components.
– They are relatively easy to calculate: if V is the p×p covariance matrix of the data, they are the eigenvectors x of V (Vx = λx)
– corresponding to the highest eigenvalues (a numerical sketch follows after the figure below).
[Figure: scatter plot of the points (1,2), (2,4), (3,6), (4,8), (5,10), which all lie along the direction (1, 2).]
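A minimal numerical sketch, assuming numpy, of finding the principal components of the small example above by taking the eigenvectors of the covariance matrix V with the largest eigenvalues:

```python
import numpy as np

# The example data: points (x1, x2) lying exactly along the direction (1, 2)
X = np.array([[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]], dtype=float)

# V is the p x p sample covariance matrix of the data
V = np.cov(X, rowvar=False)

# Eigen-decomposition V x = lambda x; eigh returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(V)

# The principal components are the eigenvectors with the highest eigenvalues
order = np.argsort(eigenvalues)[::-1]
for rank, idx in enumerate(order, start=1):
    print(f"PC{rank}: direction {eigenvectors[:, idx]}, variance {eigenvalues[idx]:.3f}")
# PC1 is proportional to (1, 2); PC2 (perpendicular, ~ (2, -1)) has zero variance here.
```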
3.3 Mixtures as part of density estimation

• In some cases one might think the density function is described/explained by more than one object type.
• Example: supermarket store card usage divides into those who use the store regularly and those who use it occasionally.
• The general form of a mixture distribution is

f(x) = Σ_{k=1}^{K} p(k) f_k(x; θ_k)

• f_k(x; θ_k) can be a normal distribution.
• To estimate this, one needs to estimate both p(k) and θ_k.
• The EM (Expectation-Maximisation) algorithm does this in steps (a sketch follows below):
– First choose some arbitrary values θ_0 of θ and find the p(k)_1 that best fit the data.
– Then, fixing p(k)_1, find the values of θ, θ_1, that best fit the data.
– Repeat until there is "convergence", i.e. little change in the log-likelihood.
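A minimal sketch of the alternating EM steps described above, assuming a one-dimensional two-component normal mixture and numpy/scipy (the data and starting values are illustrative):

```python
import numpy as np
from scipy.stats import norm

# Illustrative data: a mixture of "occasional" and "regular" store-card users
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(2, 1, 300), rng.normal(8, 1.5, 700)])

# Arbitrary starting values for p(k) and theta_k = (mean_k, sd_k)
p = np.array([0.5, 0.5])
mu = np.array([1.0, 10.0])
sd = np.array([1.0, 1.0])

log_lik_old = -np.inf
for _ in range(200):
    # E-step: responsibility of each component k for each observation
    dens = np.vstack([p[k] * norm.pdf(x, mu[k], sd[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)

    # M-step: re-estimate p(k) and theta_k from the responsibilities
    p = resp.mean(axis=1)
    mu = (resp * x).sum(axis=1) / resp.sum(axis=1)
    sd = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / resp.sum(axis=1))

    # Stop when the log-likelihood barely changes ("convergence")
    log_lik = np.log(dens.sum(axis=0)).sum()
    if abs(log_lik - log_lik_old) < 1e-6:
        break
    log_lik_old = log_lik

print("p(k):", p, "means:", mu, "sds:", sd)
```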
[Figure: example of a mixture distribution density.]
3.4 Clustering and Segmentation

• The obvious question after using mixture models is: given a data set, are there different types or "clusters", so that the units in a cluster exhibit similar behaviour which is very different from that of another cluster?
• There are two types of clustering – segmentation and cluster analysis.
– Segmentation: partition the data into convenient groups.
• In behaviour scoring: less than 6 months' history / more than 6 months' history.
• In purchases: high spenders / low spenders.
– Cluster analysis: find the natural partitions in the data.
• There are many different methods for cluster analysis:
– Partition based clustering
– Hierarchical clustering
– Probabilistic model based clustering
Distance measures for clustering

[Figure: two customers plotted on Monetary (x-axis) and Recency (y-axis), at (30, 20) and (50, 10), with the Manhattan (city-block) path between them marked.]

• Euclidean distance = √((50 − 30)² + (20 − 10)²) = √500 ≈ 22.4
• Manhattan (city block) distance = |50 − 30| + |20 − 10| = 30
• Notes:
– When variables are measured on different scales (e.g. income), they need to be standardised (e.g. to mean 0 and standard deviation 1 using z-scores).
– Use dummy coding for categorical variables (see the sketch below).
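A minimal sketch, assuming numpy, of the two distance measures and z-score standardisation for the example points (30, 20) and (50, 10):

```python
import numpy as np

a = np.array([30.0, 20.0])   # customer 1: (Monetary, Recency)
b = np.array([50.0, 10.0])   # customer 2

# Euclidean distance: sqrt((50-30)^2 + (20-10)^2) ~ 22.4
euclidean = np.sqrt(((a - b) ** 2).sum())

# Manhattan (city-block) distance: |50-30| + |20-10| = 30
manhattan = np.abs(a - b).sum()
print(euclidean, manhattan)

# Standardising the columns of a data matrix to mean 0, sd 1 (z-scores)
X = np.array([[30, 20], [50, 10], [40, 15], [35, 25]], dtype=float)
Z = (X - X.mean(axis=0)) / X.std(axis=0)
```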
3.5 Partition based clustering

• Task: partition the data into k (prespecified) disjoint sets.
• Measurement: use score functions which are combinations/variants of
– within cluster variation: sum of distances from each point to its cluster centre
– between cluster variation: sum of distances between cluster centres
• Structure/Optimisation:
– One example is the k-means algorithm (a sketch follows below):
• Randomly pick k cluster centres.
• Ascribe each point to the cluster with the nearest centre.
• Compute the new centre of the points ascribed to each cluster.
• Repeat until there is no or little change.
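A minimal sketch of the k-means steps just listed, assuming numpy and Euclidean distance (the data, k = 3 and the helper name k_means are illustrative choices of mine):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick k cluster centres from the data points
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Ascribe each point to the cluster with the nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Compute the new centre of the points ascribed to each cluster
        # (empty clusters are not handled in this sketch)
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Repeat until no or little change
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

X = np.random.default_rng(2).normal(size=(300, 2))
labels, centres = k_means(X, k=3)
```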
3.6 K-Means clustering

• The user starts by specifying the number of clusters (K).
• K datapoints are randomly selected as initial centres.
• Repeat until no change:
– Hyperplanes separating the K points are generated (i.e. each point is assigned to the nearest of the K centres).
– The K centroids of each cluster are computed (a library-based sketch follows below).
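For comparison with the hand-rolled sketch in Section 3.5, a short usage sketch assuming scikit-learn's KMeans implementation is available:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(3).normal(size=(300, 2))

# The user specifies the number of clusters K up front
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_[:10])        # cluster ascribed to each point
print(km.cluster_centers_)    # the K centroids
print(km.inertia_)            # within-cluster sum of squared distances w(C)
```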
Example

[Figure: K-means clustering example.]
Measuring how good the clustering is

Suppose there are K clusters C_1, C_2, ..., C_K.

The centroid of cluster k (with n_k datapoints) is r_k = (Σ_{x∈C_k} x) / n_k.

The within-cluster variation is

w(C) = Σ_{k=1}^{K} wc(C_k) = Σ_{k=1}^{K} Σ_{x∈C_k} d(x, r_k)²

where d(x, y) is the Euclidean distance from x to y.

The between-cluster variation is

b(C) = Σ_{1 ≤ j < k ≤ K} d(r_j, r_k)²

The measure is some combination of the two: b(C)/w(C) or b(C) − a·w(C).

This extends to using the variance-covariance matrix

W(C_k)_{ij} = Σ_{x∈C_k} (x_i − r_{k,i})(x_j − r_{k,j})

Variants of this can be used, e.g. w(C) = Σ_k trace(W_k), or trace(Σ_k W_k). (A numerical sketch follows below.)
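A minimal sketch, assuming numpy and any cluster assignment (e.g. the labels from the k-means sketch above; the helper name cluster_scores is my own), of computing w(C) and b(C) as defined above:

```python
import numpy as np
from itertools import combinations

def cluster_scores(X, labels):
    ks = np.unique(labels)
    # Centroid r_k of each cluster
    centroids = {k: X[labels == k].mean(axis=0) for k in ks}
    # Within-cluster variation: sum of squared distances to the cluster centroid
    w = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in ks)
    # Between-cluster variation: sum of squared distances between centroids
    b = sum(((centroids[j] - centroids[k]) ** 2).sum()
            for j, k in combinations(ks, 2))
    return w, b

X = np.random.default_rng(4).normal(size=(200, 2))
labels = np.random.default_rng(4).integers(0, 3, size=200)  # illustrative labels
w, b = cluster_scores(X, labels)
print(w, b, b / w)   # e.g. use b(C)/w(C) as the score
```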
3.7 Hierarchical Clustering

• Clusters are created in levels, giving a set of clusters at each level, so K is not prespecified.
• Agglomerative
– Initially each item is in its own cluster.
– Clusters are iteratively merged together.
– Bottom up.
• Divisive
– Initially all items are in one cluster.
– Large clusters are successively divided.
– Top down.
– If splitting clusters on one variable, this is equivalent to a classification tree.
– If splitting on all variables, the search is very complicated – so agglomerative methods are more common (a library sketch follows below).
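A short agglomerative clustering sketch, assuming scipy's hierarchical clustering routines (the data is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.random.default_rng(5).normal(size=(30, 2))

# Agglomerative (bottom-up) clustering; 'ward' is one of several linkage choices
Z = linkage(X, method="ward")

# The dendrogram shows which clusters were merged at each level, and at what distance
dendrogram(Z)
plt.show()
```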
Agglomerative versus divisive hierarchical clustering

[Figure: five items C1–C5. Agglomerative clustering merges them step by step, from step 0 (each item in its own cluster) to step 4 (one cluster); divisive clustering runs in the reverse direction, from one cluster at step 0 to five singleton clusters at step 4.]
Measuring distance in agglomerative methods

• Take all possible pairs of clusters.
• Measure how far apart they are.
• Combine the pair which are closest together, and repeat until the distance apart is above some prespecified level.
• Distance measures (usually based on Euclidean distance d(x, y)):
Calculating distances between clusters

[Figure: illustrations of single linkage, complete linkage, average linkage and the centroid method.]

• Single link (nearest neighbour): shortest distance between any two members of the two clusters.
• Complete link (furthest neighbour): maximum distance between any two members of the two clusters.
• Average link: average distance between all pairs of members of the two clusters.
• Centroid method: distance between the two cluster centroids (see the sketch below).
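A minimal sketch, assuming numpy, of the four between-cluster distances for two small illustrative clusters A and B:

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # cluster A
B = np.array([[4.0, 4.0], [5.0, 4.0]])               # cluster B

# All pairwise Euclidean distances between members of A and members of B
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = pairwise.min()      # single link: shortest pair distance
complete = pairwise.max()    # complete link: largest pair distance
average = pairwise.mean()    # average link: mean over all pairs
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # centroid method

print(single, complete, average, centroid)
```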
Calculating distances between clusters (continued)

• Ward's minimum variance method
– First, the sum of squared distances within each cluster is calculated.
– Then, for each pair of clusters, the increase in the sum of squared distances is calculated assuming the pair is merged into one cluster.
– The pair of clusters for which the increase in variance is least is merged.
– Similar to ANOVA (a sketch follows below).
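A minimal sketch, assuming numpy, of Ward's merge criterion: compute the increase in within-cluster sum of squares when two candidate clusters are merged (the clusters and helper names within_ss and ward_increase are my own illustrations):

```python
import numpy as np

def within_ss(cluster):
    """Sum of squared distances from the points to their cluster centroid."""
    centroid = cluster.mean(axis=0)
    return ((cluster - centroid) ** 2).sum()

def ward_increase(A, B):
    """Increase in total within-cluster sum of squares if A and B are merged."""
    merged = np.vstack([A, B])
    return within_ss(merged) - within_ss(A) - within_ss(B)

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = np.array([[4.0, 4.0], [5.0, 4.0]])
C = np.array([[0.5, 2.0], [1.5, 2.0]])

# Ward merges the pair with the smallest increase; here that pair is A and C
print(ward_increase(A, B), ward_increase(A, C), ward_increase(B, C))
```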
Hierarchical versus non-hierarchical clustering

• Hierarchical clustering
– No one distance measure is the best.
• Ward's is the most common, but it tends to produce clusters of roughly the same size and is sensitive to outliers.
– Hierarchical clustering methods do not scale up well – lots of calculations.
– Once a mistaken merge has been made, it cannot be undone subsequently.
• K-means:
– Have to specify the number of clusters (K).
– Needs initial seeds.
– Fast, and works well with large data sets.
3.8 Choosing the number of clusters: the dendrogram

• A dendrogram is a tree data structure which illustrates hierarchical clustering techniques.
• Each level shows the clusters at that level.
– Leaves – individual observations
– Root – one cluster
• A cluster at level i is the union of its child clusters at level i+1.
• It helps decide how many clusters one wants, since the vertical scale gives the distance between the two clusters being amalgamated (a sketch of cutting the tree at a chosen distance follows below).
• Sometimes the diagram is drawn on its side.
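A short sketch, assuming scipy, of cutting the dendrogram at a chosen distance on the vertical scale to fix the number of clusters (the threshold t = 5.0 is an illustrative value):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(6).normal(size=(30, 2))
Z = linkage(X, method="ward")

# Cut the tree at a chosen distance threshold t; merges that would happen
# above that distance are not applied, which determines the clusters.
labels = fcluster(Z, t=5.0, criterion="distance")
print(len(set(labels)), "clusters found at distance threshold 5.0")
```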
3.8 Choosing the number of clusters: the elbow rule (1)

[Figure: within-cluster variation plotted against the number of clusters; the "elbow" of the curve suggests the number of clusters to choose.]
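A minimal sketch of the elbow rule, assuming scikit-learn's KMeans: compute the within-cluster sum of squares (inertia) for a range of K and look for the "elbow" where the curve flattens:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(7).normal(size=(300, 2))

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Within-cluster sum of squares w(C)")
plt.show()   # choose K at the "elbow", where extra clusters stop helping much
```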
Elbow rule (2): the scree diagram

[Figure: scree diagram used to choose the number of clusters.]
Levels of Clustering

[Figure: the levels of a hierarchical clustering, from individual observations up to one cluster.]
