
Unit V Cluster Analysis: Basic Concepts and Methods

Clustering is the process of grouping a set of data objects into multiple groups or clusters so that
objects within a cluster have high similarity, but are very dissimilar to objects in other clusters.
What Is Cluster Analysis?
 Cluster analysis or simply clustering is the process of partitioning a set of data objects (or
observations) into subsets. Each subset is a cluster, such that objects in a cluster are
similar to one another, yet dissimilar to objects in other clusters. The set of clusters
resulting from a cluster analysis can be referred to as a clustering.
 Because a cluster is a collection of data objects that are similar to one another within the
cluster and dissimilar to objects in other clusters, a cluster of data objects can be treated
as an implicit class. In this sense, clustering is sometimes called automatic classification.
 Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity. Clustering can also be
used for outlier detection, where outliers (values that are "far away" from any cluster)
may be more interesting than common cases.
 Applications of outlier detection include the detection of credit card fraud and the
monitoring of criminal activities in electronic commerce.
 For example, exceptional cases in credit card transactions, such as very expensive and
infrequent purchases, may be of interest as possible fraudulent activities.
 Clustering is known as unsupervised learning because the class label information is not
present.
 For this reason, clustering is a form of learning by observation, rather than learning by
examples.
Requirements for Cluster Analysis
 Scalability: clustering all the data instead of only samples
 Ability to deal with different types of attributes: numerical, binary, categorical, ordinal,
linked, and mixtures of these
 Discovery of clusters with arbitrary shape
 Requirements for domain knowledge to determine input parameters
 Constraint-based clustering
 The user may give inputs on constraints
 Domain knowledge may be used to determine input parameters
 Interpretability and usability
 Others:
 Ability to deal with noisy data
 Incremental clustering and insensitivity to input order
 High dimensionality
Hierarchical Methods
 A hierarchical clustering method works by grouping data objects into a hierarchy or "tree"
of clusters.
 Representing data objects in the form of a hierarchy is useful for data summarization and
visualization.
 Hierarchical clustering methods can encounter difficulties regarding the selection of
merge or split points. Such a decision is critical, because once a group of objects is
merged or split, the process at the next step will operate on the newly generated clusters.
 It will neither undo what was done previously, nor perform object swapping between
clusters. Thus, merge or split decisions, if not well chosen, may lead to low-quality
clusters.
 Moreover, the methods do not scale well because each decision of merge or split needs
to examine and evaluate many objects or clusters.
 A promising direction for improving the clustering quality of hierarchical methods is to
integrate hierarchical clustering with other clustering techniques, resulting in multiple-
phase (or multiphase) clustering.
 There are two types of hierarchical clustering
1. Agglomerative hierarchical
clustering.
2. Divisive hierarchical clustering
Agglomerative hierarchical clustering
 Group data objects in a bottom-up fashion.
 Initially each data object is in its own cluster.
 Then we merge these atomic clusters into larger and larger clusters, until all of the objects
are in a single cluster or until certain termination conditions are satisfied.
 A user can specify the desired number of clusters as a termination condition.
Divisive hierarchical clustering
 Groups data objects in a top-down fashion.
 Initially all data objects are in one cluster.
We then subdivide the cluster into smaller and smaller clusters, until each object forms
cluster on its own or satisfies certain termination conditions, such as a desired number of
clusters is obtained
AGNES & DIANA
 Application of AGNES( AGglomerative NESting) and DIANA( Divisive ANAlysis) to a
data set of five objects, {a, b, c, d, e}.
AGNES-Explored

Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the
basic process of Johnson's (1967) hierarchical clustering is as follows:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have
N clusters, each containing just one item. Let the distances (similarities) between the
clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so
that now you have one less cluster.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
Step 3 can be done in different ways, which is what distinguishes single-link from
complete-link and average-link clustering.
 single-link clustering
distance = shortest distance from any member of one cluster to any member of the other
cluster

 complete-link clustering
distance = longest distance from any member of one cluster to any member of the other
cluster

 average-link clustering
distance = average distance from any member of one cluster to any member of the other
cluster
Dendrogram: Shows How Clusters are Merged

Decompose data objects into several levels of nested partitioning (a tree of clusters),
called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level;
then each connected component forms a cluster.
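
These linkage variants can be tried directly with SciPy's hierarchical clustering routines. The sketch below is illustrative only (the five 2-D points and the choice of three clusters are made up); it builds the merge tree under each linkage criterion and then cuts the dendrogram.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Build the merge tree under different linkage criteria.
Z_single   = linkage(X, method='single')    # shortest distance between clusters
Z_complete = linkage(X, method='complete')  # longest distance between clusters
Z_average  = linkage(X, method='average')   # average distance between clusters

# "Cut" the dendrogram so that at most 3 clusters remain.
labels = fcluster(Z_average, t=3, criterion='maxclust')
print(labels)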
 Major weakness of agglomerative clustering methods
 Can never undo what was done previously
 Do not scale well: time complexity of at least O(n^2), where n is the number of
total objects

Solution to improve Hierarchical clustering


 One promising direction for improving the clustering quality of hierarchical methods is to
integrate hierarchical clustering with other clustering techniques. A few such methods
are:
1. Birch
2. Cure
3. Chameleon
Birch (Balanced Iterative Reducing and Clustering Using Hierarchies)
Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for
multiphase clustering
o Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering structure of
the data)
o Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-
tree
 Scales linearly: finds a good clustering with a single scan and improves the quality with a
few additional scans
 Weakness: handles only numeric data, and is sensitive to the order of the data records

 A hierarchical clustering method.


 It introduces two concepts :
1. Clustering feature
2. Clustering feature tree (CF tree)
These structures help the clustering method achieve good speed and scalability in large
databases.
Clustering Feature Definition
The Birch algorithm builds a dendrogram called clustering feature tree (CF tree) while
scanning the data set.
Each entry in the CF tree represents a cluster of objects and is characterized by a 3-
tuple: (N, LS, SS), where N is the number of objects in the cluster and LS, SS are defined
in the following.

Given N d-dimensional data points in a cluster, {Xi} where i = 1, 2, …, N:

CF = (N, LS, SS)

where
N is the number of data points in the cluster,
LS is the linear sum of the N data points, LS = Σ_{i=1..N} Xi,
SS is the square sum of the N data points, SS = Σ_{i=1..N} Xi^2.
Example
Clustering feature = CF=( N, LS, SS)
N=5
LS = (16, 30)
SS = ( 54, 190)
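
As a quick check of the 3-tuple above, the following sketch computes (N, LS, SS) for five hypothetical 2-D points chosen so that their sums reproduce the numbers in the example (the points themselves are an assumption, not given in the text); SS is taken as the per-dimension square sum.

import numpy as np

# Hypothetical points whose sums match the example above.
points = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

N  = len(points)                 # number of points in the sub-cluster
LS = points.sum(axis=0)          # linear sum per dimension
SS = (points ** 2).sum(axis=0)   # square sum per dimension

print(N, LS, SS)                 # 5 [16 30] [ 54 190]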
Properties of Clustering Feature

A CF entry is compact: it stores significantly less information than all of the data points in the
sub-cluster.
A CF entry has sufficient information to calculate the distance measures D0-D4 used by BIRCH.
The additivity theorem allows us to merge sub-clusters incrementally and consistently: for two
disjoint sub-clusters, CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2).

CF-Tree
 A CF tree is a height-balanced tree that stores the clustering features for a hierarchical
clustering
 A nonleaf node in a tree has descendants or “children”
 The nonleaf nodes store sums of the CFs of their children
 A CF tree has two parameters
(i)Branching factor: maximum number of children
(ii) Threshold: max diameter of sub-clusters stored at the leaf nodes
 These two parameters implicitly control the resulting tree's size.
 CF tree can be viewed as a multilevel compression of the data that tries to preserve the
inherent clustering structure of the data
 A non-leaf node entry is a CF triplet and a child node link
 Each non-leaf node contains a number of child nodes. The number of children that a
non-leaf node can contain is limited by a threshold called the branching factor.
 A leaf node is a collection of CF triplets and links to the next and previous leaf nodes
 Each leaf node contains a number of subclusters that contains a group of data instances
 The diameter of a subcluster under a leaf node can not exceed a threshold.
 For Phase 1, the CF-tree is built dynamically as objects are inserted. Thus, the method is
incremental.
 An object is inserted into the closest leaf entry (subcluster).
 If the diameter of the subcluster stored in the leaf node after insertion is larger than the
threshold value, then the leaf node and possibly other nodes are split.
 After the insertion of the new object, information about the object is passed toward the
root of the tree.
 The size of the CF-tree can be changed by modifying the threshold.
 If the size of the memory that is needed for storing the CF-tree is larger than the size of
the main memory, then a larger threshold value can be specified and the CF-tree is
rebuilt.
 The rebuild process is performed by building a new tree from the leaf nodes of the old
tree. Thus, the process of rebuilding the tree is done without the necessity of rereading all
the objects or points.
 This is similar to the insertion and node split in the construction of B+-trees. Therefore,
for building the tree, data has to be read just once.
 Some heuristics and methods have been introduced to deal with outliers and improve the
quality of CF-trees by additional scans of the data.
 Once the CF-tree is built, any clustering algorithm, such as a typical partitioning
algorithm, can be used with the CF-tree in Phase 2.
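
For reference, scikit-learn ships a BIRCH implementation that exposes the two CF-tree parameters discussed above; the data set and parameter values below are illustrative, not prescribed by the text.

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)

# threshold bounds the diameter of the sub-clusters stored at the leaves;
# branching_factor bounds the number of children per node;
# n_clusters drives the Phase-2 global clustering of the leaf entries.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)
print(labels[:10])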
BIRCH overview

Effectiveness of BIRCH
 Given a limited amount of main memory, BIRCH can minimize the time required
for I/O.
 BIRCH is a scalable clustering algorithm with respect to the number of objects,
and it produces good-quality clustering of the data.

Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling


 Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to
determine the similarity between pairs of clusters.
 In Chameleon, cluster similarity is assessed based on (1) how well connected
objects are within a cluster and (2) the proximity of clusters. That is, two clusters
are merged if their interconnectivity is high and they are close together.
 Thus, Chameleon does not depend on a static, user-supplied model and can
automatically adapt to the internal characteristics of the clusters being merged.
 Chameleon uses a k-nearest-neighbor graph approach to construct a sparse graph,
where each vertex of the graph represents a data object, and there exists an edge
between two vertices (objects) if one object is among the k-most similar objects to
the other.
 The edges are weighted to reflect the similarity between objects. Chameleon uses
a graph partitioning algorithm to partition the k-nearest-neighbor graph into a
large number of relatively small subclusters such that it minimizes the edge cut.
 That is, a cluster C is partitioned into subclusters Ci and Cj so as to minimize the
weight of the edges that would be cut should C be bisected into Ci and Cj. The edge
cut assesses the absolute interconnectivity between clusters Ci and Cj.

Overall Framework of CHAMELEON


[Figure: Overall framework of Chameleon. Data Set → Construct a Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters]

 To determine the pairs of most similar subclusters, it takes into account both the
interconnectivity and the closeness of the clusters.
 Specifically, Chameleon determines the similarity between each pair of clusters Ci
and Cj according to their relative interconnectivity, RI(Ci, Cj), and their relative
closeness, RC(Ci, Cj).
 Relative interconnectivity:
RI(Ci, Cj) = |EC{Ci,Cj}| / ( (|EC_Ci| + |EC_Cj|) / 2 )
where EC{Ci,Cj} is the edge cut, as previously defined, for a cluster containing both
Ci and Cj. Similarly, EC_Ci (or EC_Cj) is the minimum sum of the cut edges that
partition Ci (or Cj) into two roughly equal parts.
 Relative closeness:
RC(Ci, Cj) = S_EC{Ci,Cj} / ( (|Ci| / (|Ci| + |Cj|)) S_EC_Ci + (|Cj| / (|Ci| + |Cj|)) S_EC_Cj )
where S_EC{Ci,Cj} is the average weight of the edges that connect vertices in Ci
to vertices in Cj, and S_EC_Ci (or S_EC_Cj) is the average weight of the edges that
belong to the min-cut bisector of cluster Ci (or Cj).
Probabilistic hierarchical clustering
 Use probabilistic models to measure distances between clusters
 Generative model: Regard the set of data objects to be clustered as a sample of the
underlying data generation mechanism to be analyzed
 Easy to understand, same efficiency as algorithmic agglomerative clustering
method, can handle partially observed data.
 In practice, assume the generative models adopt common distribution functions,
e.g., the Gaussian distribution or the Bernoulli distribution, governed by parameters
Generative Model
 Given a set of 1-D points X = {x1, …, xn} for clustering analysis, assume they are
generated by a Gaussian distribution N(μ, σ^2).
 The probability that a point xi ∈ X is generated by the model is
P(xi | μ, σ^2) = (1 / sqrt(2π σ^2)) exp( -(xi − μ)^2 / (2σ^2) )
 The likelihood that X is generated by the model is
P(X | μ, σ^2) = Π_{i=1..n} P(xi | μ, σ^2)
 The task of learning the generative model is to find the parameters μ and σ^2 such that
this likelihood is maximized.
 For a set of objects partitioned into m clusters C1, . . . , Cm, the quality can be
measured by
Q({C1, . . . , Cm}) = Π_{i=1..m} P(Ci)
where P(Ci) is the maximum likelihood of the objects in Ci under a model fitted to Ci.
 Distance between clusters C1 and C2:
dist(C1, C2) = −log ( P(C1 ∪ C2) / (P(C1) P(C2)) )
 Algorithm:
 Progressively merge points and clusters
Input: D = {o1, ..., on}: a data set containing n objects
Output: A hierarchy of clusters
Method
Create a cluster for each object Ci = {oi}, 1 ≤ i ≤ n;
For i = 1 to n {
  Find the pair of clusters Ci and Cj such that
    (Ci, Cj) = argmax_{i ≠ j} { log ( P(Ci ∪ Cj) / (P(Ci) P(Cj)) ) };
  If log ( P(Ci ∪ Cj) / (P(Ci) P(Cj)) ) > 0 then merge Ci and Cj
}
A probabilistic hierarchical clustering scheme can start with one cluster per object and merge
two clusters, Ci and Cj, if the distance between them is negative. In each iteration, we try to
find Ci and Cj so as to maximize log ( P(Ci ∪ Cj) / (P(Ci) P(Cj)) ). The iteration continues as
long as log ( P(Ci ∪ Cj) / (P(Ci) P(Cj)) ) > 0, that is, as long as
there is an improvement in clustering quality.
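
As a rough illustration of this criterion, the sketch below fits a 1-D Gaussian to each cluster by maximum likelihood and evaluates log ( P(Ci ∪ Cj) / (P(Ci) P(Cj)) ). The sample values and the small variance floor (used only to avoid degenerate single-point fits) are implementation assumptions, not part of the method as stated; pairs of clusters that belong together give a score much closer to zero than distant pairs.

import numpy as np

def log_likelihood(points, var_floor=1e-6):
    # Log-likelihood of a 1-D cluster under its maximum-likelihood Gaussian fit.
    x = np.asarray(points, dtype=float)
    mu, var = x.mean(), max(x.var(), var_floor)
    return float(np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)))

def merge_score(ci, cj):
    # log ( P(Ci U Cj) / (P(Ci) P(Cj)) ): the pair maximizing this is merged first.
    return log_likelihood(list(ci) + list(cj)) - log_likelihood(ci) - log_likelihood(cj)

c1 = [1.0, 1.2, 0.9]
c2 = [1.1, 0.95, 1.05]
c3 = [5.0, 5.3, 4.8]
print(merge_score(c1, c2))   # close clusters: score near zero (small distance)
print(merge_score(c1, c3))   # far-apart clusters: strongly negative score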

Density-Based Methods
•To find clusters of arbitrary shape, alternatively, we can model clusters as dense regions in
the data space, separated by sparse regions. This is the main strategy behind density-based
clustering methods, which can discover clusters of nonspherical shape.
 Several interesting methods:
 DBSCAN: Ester, et al. (KDD’96)
 OPTICS: Ankerst, et al (SIGMOD’99).
 DENCLUE: Hinneburg & D. Keim (KDD’98)

DBSCAN: Density-Based Clustering Based on Connected Regions with High Density


The density of an object o can be measured by the number of objects close to o.
 DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core
objects, that is, objects that have dense neighborhoods. It connects core objects and
their neighborhoods to form dense regions as clusters.
 A user-specified parameter ε > 0 is used to specify the radius of the neighborhood we
consider for every object. The ε-neighborhood of an object o is the space within a
radius ε centered at o.
 The density of a neighborhood can be measured simply by the number of objects in
the neighborhood.
 To determine whether a neighborhood is dense or not, DBSCAN uses another user-
specified parameter, MinPts, which specifies the density threshold of dense regions.
 An object is a core object if the ε-neighborhood of the object contains at least
MinPts objects.
 Directly density-reachable: A point p is directly density-reachable from a point q
w.r.t. Eps, MinPts if (i) p belongs to NEps(q) and (ii) core point condition:
|NEps (q)| ≥ MinPts

 Density-reachable:
 A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain
of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from
pi

 Density-connected
 A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o
such that both, p and q are density-reachable from o w.r.t. Eps and MinPts
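
A minimal sketch of the two conditions behind direct density-reachability (the point set, eps, and MinPts values are illustrative):

import numpy as np

def eps_neighborhood(D, q_idx, eps):
    # Indices of all points within distance eps of point q (q itself included).
    dists = np.linalg.norm(D - D[q_idx], axis=1)
    return np.where(dists <= eps)[0]

def directly_density_reachable(D, p_idx, q_idx, eps, min_pts):
    # p is directly density-reachable from q iff p lies in the eps-neighborhood
    # of q and q satisfies the core point condition |N_eps(q)| >= MinPts.
    n_q = eps_neighborhood(D, q_idx, eps)
    return p_idx in n_q and len(n_q) >= min_pts

D = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
print(directly_density_reachable(D, p_idx=1, q_idx=0, eps=0.2, min_pts=3))  # True
print(directly_density_reachable(D, p_idx=4, q_idx=0, eps=0.2, min_pts=3))  # False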

We can use the closure of density-connectedness to find connected dense regions as
clusters. Each closed set is a density-based cluster. A subset C ⊆ D is a cluster if (1)
for any two objects o1, o2 ∈ C, o1 and o2 are density-connected; and (2) there does
not exist an object o ∈ C and another object o' ∈ (D - C) such that o and o' are
density-connected.
"How does DBSCAN find clusters?" Initially, all objects in a given data set D are
marked as "unvisited." DBSCAN randomly selects an unvisited object p, marks p as
"visited," and checks whether the E-neighborhood of p contains at least MinPts
objects.
 If not, p is marked as a noise point. Otherwise, a new cluster C is created for
p, and all the objects in the E-neighborhood of pare added to a candidate set,
N.
 DBSCAN iteratively adds to C those objects in N that do not belong to any
cluster. In this process, for an object p' in N that carries the label "unvisited,"
DBSCAN marks it as "visited" and checks its E-neighborhood.
 If the E-neighborhood of p' has at least MinPts objects, those objects in the E-
neighborhood of p' are added to N.
 DBSCAN continues adding objects to C until C can no longer be expanded, that is, N
is empty. At this time, cluster C is completed, and thus is output.
 To find the next cluster, DBSCAN randomly selects an unvisited object from the
remaining ones.
The clustering process continues until all objects are visited. The pseudocode of the
DBSCAN algorithm is given in Figure 10.15.
Algorithm: DBSCAN: a density-based clustering algorithm.
Input:
D: a data set containing n objects,
ε: the radius parameter, and
MinPts: the neighborhood density threshold.
Output: A set of density-based clusters.
Method:
(1) mark all objects as unvisited;
(2) do
(3)   randomly select an unvisited object p;
(4)   mark p as visited;
(5)   if the ε-neighborhood of p has at least MinPts objects
(6)     create a new cluster C, and add p to C;
(7)     let N be the set of objects in the ε-neighborhood of p;
(8)     for each point p' in N
(9)       if p' is unvisited
(10)        mark p' as visited;
(11)        if the ε-neighborhood of p' has at least MinPts points,
            add those points to N;
(12)      if p' is not yet a member of any cluster, add p' to C;
(13)    end for
(14)    output C;
(15)  else mark p as noise;
(16) until no object is unvisited;
If a spatial index is used, the computational complexity of DBSCAN is O(n log n),
where n is the number of database objects. Otherwise, the complexity is O(n^2). With
appropriate settings of the user-defined parameters, ε and MinPts, the algorithm is
effective in finding arbitrary-shaped clusters.
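
For reference, the same procedure is available as sklearn.cluster.DBSCAN; the data set and parameter values below are illustrative.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps plays the role of the radius parameter and min_samples the role of MinPts.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                                   # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, int(np.sum(labels == -1)))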
OPTICS
 OPTICS does not explicitly produce a data set clustering. Instead, it outputs a
cluster ordering.
 Objects in a denser cluster are listed closer to each other in the cluster
ordering
 OPTICS does not require the user to provide a specific density
threshold. The cluster ordering can be used to extract basic clustering
information (e.g., cluster centers, or arbitrary-shaped clusters), derive the
intrinsic clustering structure, as well as provide a visualization of the
clustering.
 To construct the different clusterings simultaneously, the objects are
processed in a specific order. This order selects an object that is density-
reachable with respect to the lowest ε value so that clusters with higher
density (lower ε) will be finished first. Based on this idea, OPTICS needs two
important pieces of information per object.
 The core-distance of an object p is the smallest value ε' such that the ε'-
neighborhood of p has at least MinPts objects. That is, ε' is the minimum
distance threshold that makes p a core object. If p is not a core object with respect
to ε and MinPts, the core-distance of p is undefined.
 The reachability-distance to object p from q is the minimum radius value that makes p
density-reachable from q. According to the definition of density-reachability, q has to
be a core object and p must be in the neighborhood of q. Therefore, the reachability-
distance from q to p is max{core-distance(q), dist(p, q)}. If q is not a core object with
respect to ε and MinPts, the reachability-distance to p from q is undefined.
 An object p may be directly reachable from multiple core objects. Therefore, p may have
multiple reachability-distances with respect to different core objects. The smallest
reachability-distance of p is of particular interest because it gives the shortest path by which p
is connected to a dense cluster.
 OPTICS begins with an arbitrary object from the input database as the current
object, p. It retrieves the ε-neighborhood of p, determines the core-distance, and sets
the reachability-distance to undefined. The current object, p, is then written to output.
 If p is not a core object, OPTICS simply moves on to the next object in the OrderSeeds
list (or the input database if OrderSeeds is empty). If p is a core object, then for each
object, q, in the ε-neighborhood of p, OPTICS updates its reachability-distance from p
and inserts q into OrderSeeds if q has not yet been processed. The iteration continues
until the input is fully consumed and OrderSeeds is empty.
 A data set's cluster ordering can be represented graphically, which helps to visualize
and understand the clustering structure in a data set. For example, Figure 10.17 is the
reachability plot for a simple 2-D data set, which presents a general overview of how
the data are structured and clustered. The data objects are plotted in the clustering
order (horizontal axis) together with their respective reachability-distances (vertical
axis). The three Gaussian "bumps" in the plot reflect three clusters in the data set.
Methods have also been developed for viewing clustering structures of high-
dimensional data at various levels of detail.
 The structure of the OPTICS algorithm is very similar to that of DBSCAN.
Consequently, the two algorithms have the same time complexity. The complexity is
O(n log n) if a spatial index is used, and O(n^2) otherwise, where n is the number of objects.
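
scikit-learn also provides an OPTICS implementation whose reachability_ and ordering_ attributes correspond to the reachability plot described above; the data and parameters below are illustrative.

import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=[0.5, 1.0, 1.5], random_state=0)

opt = OPTICS(min_samples=10).fit(X)

# Reachability-distances listed in the cluster ordering: valleys correspond to
# dense clusters, peaks separate them (the "bumps" of a reachability plot).
reachability = opt.reachability_[opt.ordering_]
print(reachability[:10])
print(opt.labels_[:10])   # cluster labels extracted from the ordering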
DENCLUE: Clustering Based on Density Distribution Functions
DENCLUE (DENsity-based CLUstEring) is a clustering method based on a set of density
distribution functions.
Density estimation
 density estimation is the estimation of an unobservable underlying probability density
function based on a set of observed data.
 The observed data set is regarded as a random sample from that population.
 In DBSCAN and OPTICS, density is calculated by counting the number of objects in
a neighborhood defined by a radius parameter, ε. Such density estimates can be
highly sensitive to the radius value used.
 To overcome this problem, kernel density estimation can be used, which is a
nonparametric density estimation approach from statistics.
 A kernel can be regarded as a function modeling the influence of a sample point
within its neighborhood.
 Let x1, x2, ..., xn be an independent and identically distributed sample of a random
variable f. The kernel density approximation of the probability density function is
f̂(x) = (1 / (n h)) Σ_{i=1..n} K( (x − xi) / h )
 where K( ) is a kernel and h is the bandwidth serving as a smoothing parameter. A
frequently used kernel is a standard Gaussian function with a mean of 0 and a
variance of 1:
K(u) = (1 / sqrt(2π)) exp( −u^2 / 2 )
(a small numeric sketch of this estimate is given at the end of this section).
 A point x* is called a density attractor if it is a local maximum of the estimated
density function. To avoid trivial local maximum points, DENCLUE uses a noise
threshold, ξ, and only considers those density attractors x* such that f̂(x*) ≥ ξ.
 The nontrivial density attractors are the centers of clusters.
 Objects under analysis are assigned to clusters through density attractors using a
stepwise hill-climbing procedure. For an object x, the hill-climbing procedure starts
from x and is guided by the gradient of the estimated density function. That is, the
density attractor for x is computed iteratively as
x^(j+1) = x^j + δ ∇f̂(x^j) / |∇f̂(x^j)|
 where δ is a parameter to control the speed of convergence and ∇f̂ is the gradient of
the estimated density function.
 The hill-climbing procedure stops at step k > 0 if f̂(x^(k+1)) < f̂(x^k), and assigns x to
the density attractor x* = x^k. An object x is an outlier or noise if it converges in the
hill-climbing procedure to a local maximum x* with f̂(x*) < ξ.
 A cluster in DENCLUE is a set of density attractors X and a set of input objects C
such that each object in C is assigned to a density attractor in X, and there exists a
path between every pair of density attractors along which the density is above ξ. By using
multiple density attractors connected by paths, DENCLUE can find clusters of
arbitrary shape.
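
The kernel density estimate referred to above can be sketched numerically as follows; the sample, the bandwidth h, and the evaluation points are made up for illustration.

import numpy as np

def gaussian_kernel(u):
    # Standard Gaussian kernel: mean 0, variance 1.
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x, sample, h):
    # f_hat(x) = (1 / (n h)) * sum_i K((x - x_i) / h)
    sample = np.asarray(sample, dtype=float)
    return gaussian_kernel((x - sample) / h).sum() / (len(sample) * h)

sample = [1.0, 1.1, 0.9, 1.2, 5.0, 5.1, 4.9]
for x in (1.0, 3.0, 5.0):
    print(x, round(kde(x, sample, h=0.3), 3))
# The estimated density is high near 1 and 5 (two density attractors) and near zero at 3.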

Grid-Based Methods
 The grid-based clustering approach uses a multiresolution grid data structure.
 It quantizes the object space into a finite number of cells that form a grid structure on
which all of the operations for clustering are performed.
The main advantage of the approach is its fast processing time, which is typically
independent of the number of data objects and dependent only on the number of cells
in each dimension of the quantized space.

STING: STatistical INformation Grid


 STING is a grid-based multiresolution clustering technique in which the
embedding spatial area of the input objects is divided into rectangular cells.
 The space can be divided in a hierarchical and recursive way.
 Several levels of such rectangular cells correspond to different levels of
resolution and form a hierarchical structure
 Each cell at a high level is partitioned to form a number of cells at the next
lower level.
 Statistical information regarding the attributes in each grid cell, such as the
mean, maximum, and minimum values, is pre-computed and stored as
statistical parameters. These statistical parameters are useful for query
processing and for other data analysis tasks.
 The statistical parameters of higher-level cells can easily be computed from
the parameters of the lower-level cells.
 These parameters include the following: the attribute-independent parameter,
count; and the attribute-dependent parameters, mean, stdev (standard
deviation), min (minimum), max (maximum), and the type of distribution that
the attribute value in the cell follows such as normal, uniform, exponential, or
none (if the distribution is unknown).
 Here, the attribute is a selected measure for analysis such as price for house
objects.
 When the data are loaded into the database, the parameters count, mean,
 stdev, min, and max of the bottom-level cells are calculated directly from the
data.
 The value of distribution may either be assigned by the user if the distribution
type is known beforehand or obtained by hypothesis tests such as the χ2 test.
The type of distribution of a higher-level cell can be computed based on the
majority of distribution types of its corresponding lower-level cells in conjunction
with a threshold filtering process.
 If the distributions of the lower-level cells disagree with each other and fail
the threshold test, the distribution type of the higher-level cell is set to none.
 For any query, we begin by examining cells in a high-level layer.
 We calculate the likelihood that each such cell is relevant to the query at some
confidence level using the parameters of the cell.
 If the distribution type is none, we estimate the likelihood using
distribution-free techniques instead.
 After we obtain the confidence interval, we label the cell as relevant or not
relevant at the specified confidence level.
 We then proceed to the next layer, but consider only the children of the relevant cells of the
upper layer.
 We repeat this until we reach the bottom layer.
 Relevant cells of the bottom layer have enough statistical information to give a
satisfactory answer to the query.
 However, for accurate mining, we may retrieve the data corresponding to the relevant
cells and process them further.



A hierarchical structure for STING clustering
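
A rough sketch of the statistics STING pre-computes: count, mean, and extremes are stored per bottom-level cell and then aggregated to parent cells from the children's parameters alone (the grid layout, the attribute, and the data are made up).

import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=1000)   # e.g., house prices
cells  = rng.integers(0, 16, size=1000)            # bottom level: 16 cells (4 x 4 grid)

# Attribute-dependent parameters per bottom-level cell.
bottom = {c: {"count": int(np.sum(cells == c)),
              "mean":  float(values[cells == c].mean()),
              "min":   float(values[cells == c].min()),
              "max":   float(values[cells == c].max())}
          for c in range(16)}

def aggregate(children):
    # A higher-level cell's parameters follow from its children's parameters alone.
    n = sum(ch["count"] for ch in children)
    return {"count": n,
            "mean":  sum(ch["mean"] * ch["count"] for ch in children) / n,
            "min":   min(ch["min"] for ch in children),
            "max":   max(ch["max"] for ch in children)}

parent0 = aggregate([bottom[c] for c in (0, 1, 4, 5)])   # one 2 x 2 block of children
print(parent0)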

Finding regions
After we have obtained all the relevant cells at the final level, we need to output the regions
that satisfy the query.
We can do this using breadth-first search.
Breadth-first search
We examine cells within a certain distance from the center of the current cell.
If the average density within this small area is greater than the specified density, we mark
this area and put the relevant cells just examined into a queue.
We then take an element from the queue and repeat the same procedure, except that only those
relevant cells that have not been examined before are enqueued. When the queue is empty, we
have identified one region.

Statistical Information Grid-based Algorithm


1. Determine a layer to begin with.
2. For each cell of this layer, we calculate the confidence interval (or estimated range)
of probability that this cell is relevant to the query.
3. From the interval calculated above, we label the cell as relevant or not relevant.
4. If this layer is the bottom layer, go to Step 6; otherwise, go to Step 5.
5. We go down the hierarchy structure by one level. Go to Step 2 for those cells that
form the relevant cells of the higher level layer.
6. If the specification of the query is met, go to Step 8; otherwise, go to Step 7.
7. Retrieve the data that fall into the relevant cells and do further processing. Return the
results that meet the requirements of the query. Go to Step 9.
8. Find the regions of relevant cells. Return those regions that meet the requirement of
the query. Go to Step 9.
9. Stop
CLIQUE: An Apriori-like Subspace Clustering Method
 CLIQUE (CLustering In QUEst) is a simple grid-based method for finding
density-based clusters in subspaces.
 CLIQUE partitions each dimension into non-overlapping intervals, thereby
partitioning the entire embedding space of the data objects into cells.
 It uses a density threshold to identify dense cells and sparse ones. A cell is
dense if the number of objects mapped to it exceeds the density threshold.
 The main strategy behind CLIQUE for identifying a candidate search space
uses the monotonicity of dense cells with respect to dimensionality. This is
based on the Apriori property used in frequent pattern and association rule
mining
 In the context of clusters in subspaces, the monotonicity says the following.
A k-dimensional cell c (k > 1) can have at least l points only if every (k-1)-
dimensional projection of c, which is a cell in a (k-1)-dimensional subspace,
has at least l points.
 Consider an example where the embedding data space contains three dimensions: age, salary, and
vacation.
 A 2-D cell, say in the subspace formed by age and salary, contains l points
only if the projection of this cell onto every dimension, that is, age and salary,
respectively, contains at least l points.
 CLIQUE performs clustering in two steps.
 In the first step, CLIQUE partitions the d-dimensional data space into non
overlapping rectangular units, identifying the dense units among these.
CLIQUE finds dense cells in all of the subspaces.
 The subspaces representing these dense units are intersected to form a
candidate search space in which dense units of higher dimensionality may
exist.
 This approach to selecting candidates is quite similar to the Apriori-Gen process of
generating candidates. Here it is expected that if something is dense in a higher-dimensional
space, its projections cannot be sparse in the lower-dimensional subspaces.
 If a k-dimensional unit is dense, then so are its projections in (k-1)-
dimensional space.
 Given a k-dimensional candidate dense unit, if any of its (k-1)-dimensional projection
units is not dense, then the k-dimensional unit cannot be dense
 So, we can generate candidate dense units in k-dimensional space from the
dense units found in (k-1)-dimensional space
 The resulting space searched is much smaller than the original space.
 The dense units are then examined in order to determine the clusters.

Figure :Dense units found with respect to age for the dimensions salary and vacation are intersected in order
to provide a candidate search space for dense units of higher dimensionality
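
A highly simplified sketch of the first CLIQUE step under the Apriori-style pruning just described: dense 1-D intervals are found per dimension, and 2-D candidate cells are formed only where both 1-D projections are already dense. The grid resolution, density threshold, and synthetic data are assumptions for illustration.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
# Two synthetic clusters in a 3-D space (e.g., age, salary, vacation scaled to [0, 1]).
X = np.vstack([rng.normal(0.3, 0.05, size=(250, 3)),
               rng.normal(0.7, 0.05, size=(250, 3))])
n_intervals, density_threshold = 10, 15

def bin_index(X, dim):
    # Equal-width partition of dimension `dim` into n_intervals intervals.
    return np.floor(X[:, dim] * n_intervals).astype(int).clip(0, n_intervals - 1)

def dense_1d_units(X, dim):
    counts = np.bincount(bin_index(X, dim), minlength=n_intervals)
    return {i for i in range(n_intervals) if counts[i] >= density_threshold}

dense1 = {dim: dense_1d_units(X, dim) for dim in range(X.shape[1])}

# Candidate 2-D cells: both 1-D projections must already be dense (Apriori pruning).
dense2 = {}
for d1, d2 in combinations(range(X.shape[1]), 2):
    b1, b2 = bin_index(X, d1), bin_index(X, d2)
    for i in dense1[d1]:
        for j in dense1[d2]:
            count = int(np.sum((b1 == i) & (b2 == j)))
            if count >= density_threshold:
                dense2[(d1, d2, i, j)] = count

print(len(dense1[0]), len(dense2))   # dense 1-D units in dimension 0, dense 2-D cells overall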

2nd stage: Minimal Description

 For each cluster, CLIQUE determines the maximal region that covers the cluster
of connected dense units.
 It then determines a minimal cover (a logic description) for each cluster.
Effectiveness of Clique-
 CLIQUE automatically finds subspaces of the highest dimensionality such
that high-density clusters exist in those subspaces.
 It is insensitive to the order of input objects
 It scales linearly with the size of input
 Easily scalable with the number of dimensions in the data
Clustering High-Dimensional Data: Problems, Challenges, and Major Methodologies
The need for cluster analysis on high-dimensional data
Major challenges:
 Many irrelevant dimensions may mask clusters
 Distance measure becomes meaningless—due to equi-distance
 Clusters may exist only in some subspaces
 Clustering high-dimensional data is the search for clusters and the space in
which they exist. Thus, there are two major kinds of methods:
 Subspace clustering approaches search for clusters existing in subspaces of
the given high-dimensional data space, where a subspace is defined using a
subset of attributes in the full space.
 Dimensionality reduction approaches try to construct a much lower-
dimensional space and search for clusters in such a space. Often, a method
may construct new dimensions by combining some dimensions from the
original data
Subspace Clustering Methods
 "How can we find subspace clusters from high-dimensional data?" Many
methods have been proposed. They generally can be categorized into three
major groups: subspace search methods, correlation-based clustering methods,
and biclustering methods.
Subspace Search Methods
 A subspace search method searches various subspaces for clusters. Here, a
cluster is a subset of objects that are similar to each other in a subspace. The
similarity is often captured by conventional measures such as distance or
density.
 For example, the CLIQUE algorithm is a subspace clustering method. It
enumerates subspaces and the clusters in those subspaces in a dimensionality-
increasing order, and applies antimonotonicity to prune subspaces in which no
cluster may exist.
 A major challenge that subspace search methods face is how to search a series
of subspaces effectively and efficiently. Generally there are two kinds of
strategies:
Bottom-up approaches start from low-dimensional subspaces and search
higher-dimensional subspaces only when there may be clusters
in those higher-dimensional subspaces. CLIQUE is an example of a bottom-
up approach.
Top-down approaches start from the full space and search smaller and
smaller subspaces recursively. Top-down approaches are effective only if the
locality assumption holds, which requires that the subspace of a cluster can be
determined by the local neighborhood.
Example 11.10 PROCLUS, a top-down subspace approach. PROCLUS is a
k-medoid-like method that first generates k potential cluster centers for a
high-dimensional data set using a sample of the data set. It then refines the
subspace clusters iteratively. In each iteration, for each of the current k-
medoids, PROCLUS considers the local neighborhood of the medoid in the
whole data set, and identifies a subspace for the cluster by minimizing the
standard deviation of the distances of the points in the neighborhood to the
medoid on each dimension. Once all the subspaces for the medoids are
determined, each point in the data set is assigned to the closest medoid
according to the corresponding subspace. Clusters and possible outliers are
identified. In the next iteration, new medoids replace existing ones if doing so
improves the clustering quality
Correlation-Based Clustering Methods
While subspace search methods search for clusters with a similarity that is
measured using conventional metrics like distance or density, correlation-
based approaches can further discover clusters that are defined by advanced
correlation models.
Biclustering Methods
In some applications, we want to cluster both objects and attributes
simultaneously. The resulting clusters are known as biclusters and meet four
requirements: ( 1) only a small set of objects participate in a cluster; (2) a
cluster only involves a small number of attributes; (3) an object may
participate in multiple clusters, or does not participate in any cluster; and ( 4)
an attribute may be involved in multiple clusters, or is not involved in any
cluster.
Biclustering
Ex 1. Gene expression or microarray data: a gene sample/condition matrix.
 Each row corresponds to one gene, and each column corresponds to
one sample or condition
 Each element in the matrix, a real number, records the expression level
of a gene under a specific condition
 From the clustering viewpoint, an interesting issue is that a gene
expression data matrix can be analyzed in two dimensions-the gene
dimension and the sample/condition dimension


 When analyzing in the gene dimension, we treat each gene as an object
and treat the samples/conditions as attributes. By mining in the gene
dimension, we may find patterns shared by multiple genes, or cluster
genes into groups. For example, we may find a group of genes that
express themselves similarly, which is highly interesting in
bioinformatics, such as in finding pathways.
 When analyzing in the sample/condition dimension, we treat each
sample/condition as an object and treat the genes as attributes. In this
way, we may find patterns of samples/conditions, or cluster
samples/conditions into groups. For example, we may find the
differences in gene expression by comparing a group of tumor samples
and nontumor samples.
 Ex. 2. Clustering customers and products
 Another bi-clustering problem

Types of Biclusters
 Let A = {a1, ..., an} be a set of genes, B = {b1, …, bn} a set of conditions
 A bi-cluster: A submatrix where genes and conditions follow some consistent patterns
 4 types of bi-clusters (ideal cases)
 Bi-clusters with constant values:
 for any i in I and j in J, eij = c
 Bi-clusters with constant values on rows:
 eij = c + αi
 Also, it can be constant values on columns


 Bi-clusters with coherent values (aka. pattern-based clusters)
 eij = c + αi + βj

 Bi-clusters with coherent evolutions on rows


 (ei1j1 − ei1j2)(ei2j1 − ei2j2) ≥ 0
 i.e., only interested in the up- or down-regulated changes across genes or
conditions without constraining the exact values

Bi-Clustering Methods
In real data sets, such perfect bi-clusters rarely exist.
Methods:
(i) Optimization-based methods (ii) Enumeration methods
 Optimization-based methods
 Try to find a submatrix at a time that achieves the best significance as a bi-cluster
 Due to the cost in computation, greedy search is employed to find local optimal
bi-clusters
 Ex. δ-Cluster Algorithm (Cheng and Church, ISMB’2000)
 Enumeration methods
 Use a tolerance threshold to specify the degree of noise allowed in the bi-clusters
to be mined
 Then try to enumerate all submatrices as bi-clusters that satisfy the requirements
 Ex. δ-pCluster Algorithm (H. Wang et al.’ SIGMOD’2002, MaPle: Pei et al.,
ICDM’2003)
Optimization Using the δ-Cluster Algorithm

For a submatrix I x J, the mean of the i-th row is
eiJ = (1 / |J|) Σ_{j ∈ J} eij
The mean of the j-th column is
eIj = (1 / |I|) Σ_{i ∈ I} eij
The mean of all elements in the submatrix is
eIJ = (1 / (|I| |J|)) Σ_{i ∈ I, j ∈ J} eij
The quality of the submatrix as a bi-cluster can be measured by the mean squared residue value
H(I x J) = (1 / (|I| |J|)) Σ_{i ∈ I, j ∈ J} (eij − eiJ − eIj + eIJ)^2
where residue(eij) = eij − eiJ − eIj + eIJ.
A submatrix I x J is a δ-bi-cluster if H(I x J) ≤ δ, where δ ≥ 0 is a threshold. When δ = 0, I x J is a
perfect bi-cluster with coherent values. By setting δ > 0, a user can specify the tolerance of
average noise per element against a perfect bi-cluster.
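
The mean squared residue is straightforward to compute; the sketch below uses a made-up submatrix that follows the coherent pattern eij = c + alpha_i + beta_j (so H is about 0) and then adds noise to show H growing.

import numpy as np

def mean_squared_residue(M):
    # H(I x J) for a submatrix M: the average squared residue over all elements.
    row_means = M.mean(axis=1, keepdims=True)   # e_iJ
    col_means = M.mean(axis=0, keepdims=True)   # e_Ij
    all_mean  = M.mean()                        # e_IJ
    residue = M - row_means - col_means + all_mean
    return float((residue ** 2).mean())

alpha = np.array([[0.0], [2.0], [5.0]])          # row effects
beta  = np.array([[0.0, 1.0, 3.0, 4.0]])         # column effects
M = 10.0 + alpha + beta                          # perfect bi-cluster with coherent values

print(mean_squared_residue(M))                                                      # about 0
print(mean_squared_residue(M + np.random.default_rng(0).normal(0, 0.5, M.shape)))   # > 0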
 Maximal δ-bi-cluster is a δ-bi-cluster I x J such that there does not exist another δ-bi-
cluster I′ x J′ which contains I x J
 Computing is costly: Use heuristic greedy search to obtain local optimal clusters
 Two-phase computation: a deletion phase and an addition phase
 Deletion phase: start from the whole matrix, and iteratively remove rows and columns while
the mean squared residue of the matrix is over δ
 At each iteration, for each row (and, symmetrically, each column), compute its mean squared
residue, e.g., for row i: d(i) = (1 / |J|) Σ_{j ∈ J} (eij − eiJ − eIj + eIJ)^2
 Remove the row or column with the largest mean squared residue


 Addition phase:
 Expand iteratively the δ-bi-cluster I x J obtained in the deletion phase as long as
the δ-bi-cluster requirement is maintained
 Consider all the rows/columns not involved in the current bi-cluster I x J by
calculating their mean squared residues
 A row/column of the smallest mean squared residue is added into the current δ-bi-
cluster
 It finds only one δ-bi-cluster, thus needs to run multiple times: replacing the elements in
the output bi-cluster by random numbers
 Enumerating all bi-clusters (δ-pClusters) [H. Wang, et al., Clustering by pattern similarity
in large data sets. SIGMOD’02]
 Since a submatrix I x J is a bi-cluster with (perfect) coherent values iff ei1j1 − ei2j1 = ei1j2
− ei2j2, for any 2 x 2 submatrix of I x J we define the p-score
p-score = | (ei1j1 − ei1j2) − (ei2j1 − ei2j2) |
 A submatrix I x J is a δ-pCluster (pattern-based cluster) if the p-score of every 2 x 2
submatrix of I x J is at most δ, where δ ≥ 0 is a threshold specifying a user's tolerance of
noise against a perfect bi-cluster
 The p-score controls the noise on every element in a bi-cluster, while the mean squared
residue captures the average noise
 Monotonicity: if I x J is a δ-pCluster, every x x y (x, y ≥ 2) submatrix of I x J is also a δ-
pCluster.
 A δ-pCluster is maximal if no more rows or columns can be added into the cluster while it
remains a δ-pCluster: we only need to compute all maximal δ-pClusters.
 Pei et al., MaPle: Efficient enumerating all maximal δ-pClusters. ICDM'03
 Framework: Same as pattern-growth in frequent pattern mining (based on the downward
closure property)
 For each condition combination J, find the maximal subsets of genes I such that I x J is a
δ-pCluster
 If I x J is not a submatrix of another δ-pCluster, then I x J is a maximal δ-pCluster.
 Algorithm is very similar to mining frequent closed itemsets
Additional advantages of δ-pClusters:
 Because the δ-cluster model averages the residue, a δ-cluster may contain outliers yet still stay
within the δ threshold
 Bi-clusters for scaling patterns can also be computed: taking the logarithm of
(d_xa / d_ya) / (d_xb / d_yb) leads back to the p-score form
Dimensionality reduction: In some situations, it is more effective to construct a new space
instead of using some subspaces of the original data

 Ex. To cluster the points in the figure, no subspace of the original axes, X and Y,
can help, since all three clusters would be projected onto overlapping areas of the X
and Y axes.
 If we construct a new dimension such as the dashed one, the three clusters become apparent
when the points are projected onto the new dimension.
 Dimensionality reduction methods
 Feature selection and extraction: But may not focus on clustering structure finding
 Spectral clustering: Combining feature extraction and clustering (i.e., use the
spectrum of the similarity matrix of the data to perform dimensionality reduction
for clustering in fewer dimensions)
Normalized Cuts (Shi and Malik, CVPR’97 or PAMI’2000) The Ng-Jordan-Weiss algorithm
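
As one concrete example of this spectral route, scikit-learn's SpectralClustering clusters the spectral embedding of a similarity graph; the two-rings data set and the parameter values are illustrative.

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles

# Two concentric rings: no single original axis separates them, but clustering
# in the spectral embedding of a nearest-neighbor similarity graph can.
X, _ = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X)
print(labels[:10])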
Algorithm: SCAN for clusters on graph data.
Input: a graph G = (V, E), a similarity threshold ε, and a population threshold μ
Output: a set of clusters
Method: set all vertices in V unlabeled
for all unlabeled vertices u do
  if u is a core then
    generate a new cluster-id c
    insert all v ∈ N_ε(u) into a queue Q
    while Q ≠ ∅ do
      w ← the first vertex in Q
      R ← the set of vertices that can be directly reached from w
      for all s ∈ R do
        if s is unlabeled or labeled as nonmember then
          assign the current cluster-id c to s
        endif
        if s is unlabeled then
          insert s into queue Q
        endif
      end for
      remove w from Q
    end while
  else
    label u as nonmember
  endif
endfor
for all vertices u labeled nonmember do
  if ∃ x, y ∈ Γ(u): x and y have different cluster-ids then
    label u as hub
  else
    label u as outlier
  endif
endfor

Clustering with Constraints


Categorization of Constraints
Three types: constraints on instances, constraints on clusters, and constraints on similarity
measurement.
Constraints on instances: A constraint on instances specifies how a pair or a set of instances
should be grouped in the cluster analysis. Two common types of constraints from this category
include:
Must-link constraints. If a must-link constraint is specified on two objects x and y, then x and y
should be grouped into one cluster in the output of the cluster analysis. These must-link
constraints are transitive. That is, if must-link(x,y) and must-link(y, z), then must-link(x, z).
Cannot-link constraints. Cannot-link constraints are the opposite of must-link constraints. If a
cannot-link constraint is specified on two objects, x and y, then in the output of the cluster
analysis, x and y should belong to different clusters. Cannot-link constraints can be entailed.
That is, if cannot-link(x, y), must-link(x, x'), and must-link(y, y'), then cannot-link(x', y') (see
the sketch after this paragraph). A constraint on instances can be defined using specific
instances. Alternatively, it can also be defined using instance variables or attributes of
instances. For example, a constraint,
Constraint(x, y): must-link(x, y) if dist(x, y) ≤ ε,
uses the distance between objects to specify a must-link constraint.
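
The transitivity and entailment properties above can be maintained mechanically, as in the sketch below, which tracks must-link components with a union-find structure and records cannot-link constraints between whole components (all names and example objects are illustrative).

class ConstraintStore:
    # Must-link components are tracked with union-find; cannot-link constraints are
    # recorded between component roots, so cannot-link(x, y) together with
    # must-link(x, x') and must-link(y, y') entails cannot-link(x', y').
    def __init__(self):
        self.parent = {}
        self.cannot = set()          # pairs of component roots

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def must_link(self, x, y):
        rx, ry = self._find(x), self._find(y)
        if (rx, ry) in self.cannot or (ry, rx) in self.cannot:
            raise ValueError("inconsistent with an existing cannot-link constraint")
        if rx != ry:
            self.parent[ry] = rx
            # Re-point any cannot-link pairs that referenced the absorbed root.
            self.cannot = {(rx if a == ry else a, rx if b == ry else b)
                           for a, b in self.cannot}

    def cannot_link(self, x, y):
        self.cannot.add((self._find(x), self._find(y)))

    def allowed_together(self, x, y):
        rx, ry = self._find(x), self._find(y)
        return (rx, ry) not in self.cannot and (ry, rx) not in self.cannot

cs = ConstraintStore()
cs.must_link("x", "x2")
cs.cannot_link("x", "y")
cs.must_link("y", "y2")
print(cs.allowed_together("x2", "y2"))   # False: the cannot-link is entailed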
Constraints on clusters: A constraint on clusters specifies a requirement on the clusters,
possibly using attributes of the clusters. For example, a constraint may specify the minimum
number of objects in a cluster, the maximum diameter of a cluster, or the shape of a cluster (e.g.,
a convex). The number of clusters specified for partitioning clustering methods can be regarded
as a constraint on clusters.
Constraints on similarity measurement: Often, a similarity measure, such as Euclidean
distance, is used to measure the similarity between objects in a cluster analysis. In some
applications, exceptions apply. A constraint on similarity measurement specifies a requirement
that the similarity calculation must respect. For example, to cluster people as moving objects in
a plaza, Euclidean distance may be used to give the walking distance between two points, while
a constraint on similarity measurement is that the trajectory implementing the shortest distance
cannot cross a wall. There can be more than one way to express a constraint, depending on the
category.
For example, we can specify a constraint on clusters as Constraint1: the diameter of a cluster
cannot be larger than d. The requirement can also be expressed using a constraint on instances as
Constraint2: cannot-link(x, y) if dist(x, y) > d.
Another way to categorize clustering constraints considers how firmly the constraints
have to be respected. A constraint is hard if a clustering that violates the constraint
is unacceptable. A constraint is soft if a clustering that violates the constraint is not
preferable but acceptable when no better solution can be found. Soft constraints are also
called preferences.

Data Mining Applications

 Data mining is an interdisciplinary field with wide and diverse applications

o There exist nontrivial gaps between data mining principles and domain-specific
applications

 Some application domains

o Financial data analysis

o Retail industry

o Telecommunication industry

Biological data analysis

 Data Mining for Financial Data Analysis

 Financial data collected in banks and financial institutions are often relatively complete,
reliable, and of high quality

 Design and construction of data warehouses for multidimensional data analysis and data
mining

o View the debt and revenue changes by month, by region, by sector, and by other
factors

o Access statistical information such as max, min, total, average, trend, etc.

 Loan payment prediction/consumer credit policy analysis

o feature selection and attribute relevance ranking


o Loan payment performance

o Consumer credit rating

 Financial Data Mining

 Classification and clustering of customers for targeted marketing

o multidimensional segmentation by nearest-neighbor, classification, decision trees,


etc. to identify customer groups or associate a new customer to an appropriate
customer group

 Detection of money laundering and other financial crimes

o integration of data from multiple DBs (e.g., bank transactions, federal/state crime


history DBs)

o Tools: data visualization, linkage analysis, classification, clustering tools, outlier


analysis, and sequential pattern analysis tools (find unusual access sequences)

 Data Mining for Retail Industry

 Retail industry: huge amounts of data on sales, customer shopping history, etc.

 Applications of retail data mining

o Identify customer buying behaviors

o Discover customer shopping patterns and trends

o Improve the quality of customer service

o Achieve better customer retention and satisfaction

o Enhance goods consumption ratios

o Design more effective goods transportation and distribution policies

 Ex. 1. Design and construction of data warehouses based on the benefits of data mining

o Multidimensional analysis of sales, customers, products, time, and region

 Ex. 2. Analysis of the effectiveness of sales campaigns

 Ex. 3. Customer retention: Analysis of customer loyalty

o Use customer loyalty card information to register sequences of purchases of


particular customers
o Use sequential pattern mining to investigate changes in customer consumption or
loyalty

o Suggest adjustments on the pricing and variety of goods

 Ex. 4. Purchase recommendation and cross-reference of items

 Data Mining for Telecomm. Industry (1)

 A rapidly expanding and highly competitive industry and a great demand for data mining

o Understand the business involved

o Identify telecommunication patterns

o Catch fraudulent activities

o Make better use of resources

o Improve the quality of service

 Multidimensional analysis of telecommunication data

o Intrinsically multidimensional: calling-time, duration, location of caller, location


of callee, type of call, etc.

 Fraudulent pattern analysis and the identification of unusual patterns

o Identify potentially fraudulent users and their atypical usage patterns

o Detect attempts to gain fraudulent entry to customer accounts

o Discover unusual patterns which may need special attention

 Multidimensional association and sequential pattern analysis

o Find usage patterns for a set of communication services by customer group, by


month, etc.

o Promote the sales of specific services

o Improve the availability of particular services in a region

 Use of visualization tools in telecommunication data analysis

 Biomedical Data Analysis


 DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C),
guanine (G), and thymine (T).

 Gene: a sequence of hundreds of individual nucleotides arranged in a particular order

 Humans have around 30,000 genes

 Tremendous number of ways that the nucleotides can be ordered and sequenced to form
distinct genes

 Semantic integration of heterogeneous, distributed genome databases

o Current: highly distributed, uncontrolled generation and use of a wide variety of


DNA data

o Data cleaning and data integration methods developed in data mining will help

 DNA Analysis: Examples

 Similarity search and comparison among DNA sequences

o Compare the frequently occurring patterns of each class (e.g., diseased and
healthy)

o Identify gene sequence patterns that play roles in various diseases

 Association analysis: identification of co-occurring gene sequences

o Most diseases are not triggered by a single gene but by a combination of


genes acting together

o Association analysis may help determine the kinds of genes that are likely to
co-occur together in target samples

 Path analysis: linking genes to different disease development stages

o Different genes may become active at different stages of the disease

o Develop pharmaceutical interventions that target the different stages


separately

 Visualization tools and genetic data analysis
