Bf7abCluster Analysis New
Pattern Recognition
Spatial Data Analysis
detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (especially market research)
WWW
Document classification
Cluster Web log data to discover groups of similar access patterns
Clustering Applications
1. Many years ago, during a cholera outbreak in London, a physician
plotted the location of cases on a map, getting a plot that looked like
Fig. Properly visualized, the data indicated that cases clustered around
certain intersections, where there were polluted wells, not only
exposing the cause of cholera, but indicating what to do about the
problem. Alas, not all data mining is this easy, often because the
clusters are in so many dimensions that visualization is very hard.
2. Documents may be thought of as points in a high-dimensional space,
where each dimension corresponds to one possible word. The position of a
document in a dimension is the number of times the word occurs in the
document (or just 1 if it occurs, 0 if not). Clusters of documents in
this space often correspond to groups of documents on the same topic.
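The document-as-point idea can be sketched in a few lines of Python. This is only an illustration: the helper names (`build_vocabulary`, `count_vector`, `binary_vector`) and the two sample documents are made up for this sketch, and splitting on whitespace is an assumed, very crude way to find words.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return a sorted list of every distinct word; one dimension per word."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def count_vector(document, vocabulary):
    """Coordinate in each dimension = number of times that word occurs."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

def binary_vector(document, vocabulary):
    """Variant: coordinate is 1 if the word occurs in the document, 0 if not."""
    present = set(document.lower().split())
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical sample documents, invented for this sketch.
docs = ["cholera cases cluster near wells", "wells wells and more wells"]
vocab = build_vocabulary(docs)
count_vectors = [count_vector(d, vocab) for d in docs]
binary_vectors = [binary_vector(d, vocab) for d in docs]
```

Each document becomes a vector with as many coordinates as there are distinct words, which is why real document collections end up in a very high-dimensional space.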
3. Clustering Example: Clustering Houses
[Figure: groups of homes, clustered either by geographic distance or by size]
Clustering Problem
Given a database D = {t1, t2, ..., tn} of tuples and an integer
Unsupervised learning
Clustering Issues
Outlier handling
Dynamic data
Interpreting results
Evaluating results
Number of clusters
Data to be used
Scalability
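p variables (attributes or measures), stored as a relational table or n by p matrix:
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]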
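Dissimilarity Matrix
Also called object-by-object structure
Many clustering algorithms operate on a dissimilarity matrix
\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
d(i,j) is nonnegative
d(i,j) is near 0 when objects are highly similar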
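A minimal sketch of how the lower-triangular dissimilarity matrix can be built from an n by p data matrix. Euclidean distance is used only as an assumed choice of d(i,j); the function names and the three sample objects are hypothetical.

```python
import math

def euclidean(a, b):
    """Assumed dissimilarity d(i, j): straight-line distance between two rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dissimilarity_matrix(data, distance=euclidean):
    """Return the lower triangle: row i holds d(i, 0), ..., d(i, i)."""
    n = len(data)
    return [[distance(data[i], data[j]) for j in range(i + 1)] for i in range(n)]

# Hypothetical data matrix: n = 3 objects, p = 2 variables.
data_matrix = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
d = dissimilarity_matrix(data_matrix)
# d[1][0] corresponds to d(2,1) in the slide's 1-based notation.
```

Only the lower triangle is stored because d(i,j) = d(j,i) for a symmetric distance and every diagonal entry is 0.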
Types of Data
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Interval-scaled Variables
Continuous measurements on a roughly linear scale
Examples: weight, height, latitude and longitude coordinates, temperature, etc.
Standardizing Variables
\[ m_f = \frac{1}{n}\,(x_{1f} + x_{2f} + \cdots + x_{nf}) \]
\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]
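The standardization above can be sketched as follows. The surviving slide text does not define s_f, so this sketch assumes it is the mean absolute deviation of variable f; the standard deviation is another common choice. The function name and the sample heights are hypothetical.

```python
def standardize(values):
    """Return z_if = (x_if - m_f) / s_f for one variable f.

    Assumption: s_f is the mean absolute deviation, since the slide
    text does not say which measure of spread is intended.
    """
    n = len(values)
    m_f = sum(values) / n                          # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n    # assumed spread measure
    if s_f == 0:
        raise ValueError("variable is constant; cannot standardize")
    return [(x - m_f) / s_f for x in values]

# Hypothetical measurements of one interval-scaled variable (height in cm).
heights_cm = [150.0, 160.0, 170.0, 180.0]
z = standardize(heights_cm)
```

After standardization the values no longer depend on the measurement unit, so a variable recorded in centimetres does not outweigh one recorded in metres.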
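Minkowski distance:
\[ d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q} \]
where i and j are two p-dimensional data objects and q is a positive integer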
Properties of Minkowski Distance
$d(i,j) \ge 0$ (nonnegativity)
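A short sketch of the Minkowski distance, with q = 1 and q = 2 as the familiar special cases. The function name and the two sample objects are hypothetical.

```python
def minkowski(obj_i, obj_j, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of variables")
    if q < 1:
        raise ValueError("q must be at least 1")
    return sum(abs(a - b) ** q for a, b in zip(obj_i, obj_j)) ** (1.0 / q)

# Hypothetical objects with p = 2 variables.
obj_i = (1.0, 2.0)
obj_j = (4.0, 6.0)
manhattan = minkowski(obj_i, obj_j, 1)   # q = 1: sum of absolute differences
euclid = minkowski(obj_i, obj_j, 2)      # q = 2: straight-line distance
```

Because every term is an absolute value raised to a positive power, the result can never be negative, which is the nonnegativity property listed above.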
Binary Variables
0: variable absent
1: variable present
Contingency table for binary data:

                     Object j
                  1      0      sum
Object i    1     a      b      a+b
            0     c      d      c+d
          sum    a+c    b+d      p
Example

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
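The contingency counts a, b, c, d can be computed directly from the example table. The sketch below makes several assumptions that the surviving slide text does not state: Y and P are coded as 1 and N as 0, Gender is left out, and the dissimilarity is taken to be the asymmetric (Jaccard-style) coefficient (b + c) / (a + b + c). All function and variable names are hypothetical.

```python
def contingency(obj_i, obj_j):
    """Count the four cells of the 2x2 contingency table for two binary objects."""
    a = sum(1 for x, y in zip(obj_i, obj_j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(obj_i, obj_j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(obj_i, obj_j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(obj_i, obj_j) if x == 0 and y == 0)
    return a, b, c, d

def asymmetric_dissimilarity(obj_i, obj_j):
    """Assumed coefficient: (b + c) / (a + b + c), ignoring 0-0 matches."""
    a, b, c, _ = contingency(obj_i, obj_j)
    if a + b + c == 0:
        return 0.0  # no variable present in either object
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (assumed coding: Y/P -> 1, N -> 0).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

d_jack_mary = asymmetric_dissimilarity(jack, mary)
d_jack_jim = asymmetric_dissimilarity(jack, jim)
d_jim_mary = asymmetric_dissimilarity(jim, mary)
```

Under these assumptions Jack and Mary come out as the most similar pair and Jim and Mary as the least similar.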
Nominal Variables
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled:
replace x_if by its rank $r_{if} \in \{1, \ldots, M_f\}$
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
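The rank mapping above can be sketched as follows. The function name and the three-level example variable are hypothetical.

```python
def normalize_ordinal(values, ordered_levels):
    """Map each ordinal value to z_if = (r_if - 1) / (M_f - 1) in [0, 1]."""
    m_f = len(ordered_levels)                    # number of ordered states M_f
    if m_f < 2:
        raise ValueError("need at least two ordered levels")
    # Ranks r_if run from 1 (lowest level) to M_f (highest level).
    rank = {level: r for r, level in enumerate(ordered_levels, start=1)}
    return [(rank[v] - 1) / (m_f - 1) for v in values]

# Hypothetical ordinal variable with M_f = 3 states, listed lowest to highest.
levels = ["bronze", "silver", "gold"]
z = normalize_ordinal(["gold", "bronze", "silver"], levels)
```

Here the lowest level maps to 0, the highest to 1, and the middle level to 0.5, so the result can be handled like any interval-scaled variable.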
Ratio-Scaled Variables
Methods:
\[ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]
f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f is interval-based: use the normalized distance
f is ordinal or ratio-scaled:
compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled
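A minimal sketch of the weighted combination above for objects with mixed variable types. Several details are assumptions rather than something the slide states: the indicator for variable f is set to 0 only when a value is missing (None) and to 1 otherwise, the normalized interval distance is taken as the absolute difference divided by the variable's range, and ordinal variables are expected to be converted to z_if beforehand and passed in as interval variables with range 1. All names are hypothetical.

```python
def mixed_dissimilarity(obj_i, obj_j, kinds, ranges):
    """Average the per-variable dissimilarities over the variables that count."""
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        x, y = obj_i[f], obj_j[f]
        if x is None or y is None:
            continue                      # assumed: indicator is 0 when a value is missing
        if kind == "nominal":             # also covers binary variables
            d_f = 0.0 if x == y else 1.0
        elif kind == "interval":          # assumed normalization: divide by the range
            d_f = abs(x - y) / ranges[f]
        else:
            raise ValueError("unknown variable kind: " + kind)
        numerator += d_f
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no variable is available for both objects")
    return numerator / denominator

# Hypothetical objects: colour (nominal), age (interval, range 40), rank score z (range 1).
kinds = ["nominal", "interval", "interval"]
ranges = [None, 40.0, 1.0]
obj_a = ("red", 20.0, 0.0)
obj_b = ("blue", 30.0, 0.5)
obj_c = ("red", None, 0.0)

d_ab = mixed_dissimilarity(obj_a, obj_b, kinds, ranges)
d_ac = mixed_dissimilarity(obj_a, obj_c, kinds, ranges)
```

Because every per-variable term lies in [0, 1], the combined value also lies in [0, 1], whatever the mix of variable types.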
Distance Between Clusters
If $|p - p'|$ is the distance between two points or two objects, $m_i$ is the mean of cluster $C_i$ and $n_i$ is the number of objects in $C_i$, then
Minimum distance:
\[ d_{min}(C_i, C_j) = \min_{p \in C_i,\; p' \in C_j} |p - p'| \]
Maximum distance:
\[ d_{max}(C_i, C_j) = \max_{p \in C_i,\; p' \in C_j} |p - p'| \]
Mean distance:
\[ d_{mean}(C_i, C_j) = |m_i - m_j| \]
Average distance:
\[ d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'| \]
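The four inter-cluster distances can be sketched as follows, taking the point-to-point distance to be Euclidean (an assumption; any distance between objects could be substituted). The function names and the two small clusters are hypothetical.

```python
import math
from itertools import product

def dist(p, q):
    """Assumed point-to-point distance |p - p'|: Euclidean."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(cluster):
    """Coordinate-wise mean m_i of a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def d_min(ci, cj):
    """Distance between the closest pair of points, one from each cluster."""
    return min(dist(p, q) for p, q in product(ci, cj))

def d_max(ci, cj):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(dist(p, q) for p, q in product(ci, cj))

def d_mean(ci, cj):
    """Distance between the two cluster means."""
    return dist(mean_point(ci), mean_point(cj))

def d_avg(ci, cj):
    """Average over all n_i * n_j cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))

# Hypothetical clusters of two points each.
c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(4.0, 0.0), (6.0, 0.0)]
results = (d_min(c1, c2), d_max(c1, c2), d_mean(c1, c2), d_avg(c1, c2))
```

For these two clusters the cross-cluster distances are 4, 6, 3 and 5, so the minimum is 3, the maximum is 6, and both the mean distance and the average distance are 4.5. The last two coincide here only because the points lie on a line with the clusters fully separated; in general they differ.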
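Similarity Measures
If i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) are two p-dimensional data objects, then
Euclidean distance:
\[ d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \]
Manhattan distance:
\[ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]
Minkowski distance:
\[ d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q} \]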
Impact of Outliers on Clustering
clusters
Some clustering algorithms find and eliminate outliers
Statistical techniques to detect outliers
Discordancy Test
Not very realistic for real life data
Clustering Approaches
Clustering: Hierarchical, Partitional, Density-based, Grid-based