
 Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
 Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
 Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
 Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
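These inter-cluster distances can be computed directly from the pairwise point distances. A minimal sketch in Python with NumPy/SciPy (the two example clusters are made-up data for illustration):

import numpy as np
from scipy.spatial.distance import cdist

# Two illustrative clusters; each row is a point
Ki = np.array([[1.0, 2.0], [2.0, 3.0], [1.5, 2.5]])
Kj = np.array([[8.0, 8.0], [9.0, 7.0]])

d = cdist(Ki, Kj)                       # all pairwise distances between Ki and Kj

single_link   = d.min()                 # smallest pairwise distance
complete_link = d.max()                 # largest pairwise distance
average_link  = d.mean()                # average pairwise distance
centroid_dist = np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))   # dist(Ci, Cj)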
 Centroid: the "middle" of a cluster
Cm = (Σ i=1..N tip) / N
 Radius: square root of the average distance from any point of the cluster to its centroid
Rm = √( Σ i=1..N (tip − cm)² / N )
 Diameter: square root of the average mean squared distance between all pairs of points in the cluster
Dm = √( Σ i=1..N Σ j=1..N (tip − tjq)² / (N(N−1)) )
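Centroid, radius, and diameter follow directly from these formulas. A small NumPy sketch, using an assumed set of cluster members:

import numpy as np

points = np.array([[3.0, 4.0], [2.0, 6.0], [4.0, 5.0]])    # assumed cluster members
N = len(points)

centroid = points.sum(axis=0) / N                          # Cm
radius = np.sqrt(((points - centroid) ** 2).sum() / N)     # Rm

# Dm: sum of squared distances over all ordered pairs, divided by N(N-1)
sq_pair_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
diameter = np.sqrt(sq_pair_dists.sum() / (N * (N - 1)))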
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
It is a scalable clustering method.

 Designed for very large data sets
 Only one scan of data is necessary
 It is based on the notion of a CF (Clustering Feature) and a CF tree.
 CF tree is a height balanced tree that stores the clustering features for a hierarchical clustering.
 A cluster of data points is represented by a triple of numbers (N, LS, SS), where
N = number of items in the subcluster
LS = linear sum of the points
SS = sum of the squares of the points
A CF tree structure is given below:
 Each non-leaf node has at most B entries.
 Each leaf node has at most L CF entries, each of which satisfies the threshold T, a maximum diameter (or radius) of the subcluster
 P(page size in bytes) is the maximum size of a node
 Compact: each leaf node is a subcluster, not a data point
Basic Algorithm:
 Phase 1: Load data into memory
Scan the DB and load the data into memory by building a CF tree. If memory is exhausted, rebuild the
tree from the leaf nodes.
 Phase 2: Condense data
Resize the data set by building a smaller CF tree
Remove more outliers
Condensing is optional
 Phase 3: Global clustering
Use an existing clustering algorithm (e.g., k-means or hierarchical clustering) on the CF entries
 Phase 4: Cluster refining
Refining is optional
Fixes the problem with CF trees where same valued data points may be assigned to different leaf
entries.
Example:
Clustering feature:
CF= (N, LS, SS)
N: number of data points
LS: Σ i=1..N Xi (linear sum of the points)
SS: Σ i=1..N Xi² (sum of the squares of the points)

(3,4), (2,6), (4,5), (4,7), (3,8)
N = 5
LS = (16, 30), i.e., 3+2+4+4+3 = 16 and 4+6+5+7+8 = 30
SS = (54, 190), i.e., 3²+2²+4²+4²+3² = 54 and 4²+6²+5²+7²+8² = 190
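The CF triple for this example can be reproduced in a few lines of Python; CFs are also additive, which is what lets BIRCH merge subclusters cheaply (function names below are illustrative):

import numpy as np

def clustering_feature(points):
    # CF = (N, LS, SS) for a set of points
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def merge_cf(cf_a, cf_b):
    # CFs are additive: (N1+N2, LS1+LS2, SS1+SS2)
    return cf_a[0] + cf_b[0], cf_a[1] + cf_b[1], cf_a[2] + cf_b[2]

N, LS, SS = clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(N, LS, SS)    # 5, [16. 30.], [54. 190.]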
 Advantages: Finds a good clustering with a single scan and improves the quality with a few
additional scans
 Disadvantages: Handles only numeric data
 Applications:
Pixel classification in images
Image compression
Works with very large data sets
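For practice, scikit-learn ships a BIRCH implementation; a minimal usage sketch (the data and parameter values are illustrative, not tuned):

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

# threshold plays the role of T (a maximum subcluster radius in this implementation),
# branching_factor corresponds to the maximum number of CF entries per node,
# and n_clusters drives the optional global clustering phase
model = Birch(threshold=0.5, branching_factor=50, n_clusters=4)
labels = model.fit_predict(X)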
CHAMELEON (Hierarchical Clustering Using Dynamic Modeling)
 Two-phase approach

 Phase -I
 Uses a graph partitioning algorithm to divide the data set into a set of individual clusters.

 Phase -II
 Uses an agglomerative hierarchical clustering algorithm to merge the clusters.

 Using KNN

 Data points that are far away are completely avoided by the algorithm (reducing the
noise in the dataset)

 captures the concept of neighbourhood dynamically by taking into account the density of
the region.
 Partition the KNN graph such that the edge cut is minimized.

 Reason: since the edge cut represents the similarity between the points on either side of the cut, a smaller edge cut means less similarity between the resulting partitions.

 Multi-level graph partitioning algorithms are used to partition the graph, e.g., the hMeTiS library.
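A hedged sketch of the k-NN graph construction and an edge-cut computation is given below; the multilevel partitioning itself (hMeTiS) is not reproduced, and the random labels only stand in for a real partition:

import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(100, 2)                      # illustrative data
k = 5

# Sparse k-NN graph; CHAMELEON weights edges by similarity, while the
# distances returned by scikit-learn are used here purely for illustration
G = kneighbors_graph(X, n_neighbors=k, mode='distance')

def edge_cut(graph, labels):
    # Total weight of edges whose endpoints lie in different partitions
    rows, cols = graph.nonzero()
    crossing = labels[rows] != labels[cols]
    return graph[rows[crossing], cols[crossing]].sum()

labels = np.random.randint(0, 2, size=len(X))   # stand-in partition
print(edge_cut(G, labels))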


Relative closeness
 Absolute closeness normalized with respect to the internal closeness of the two clusters.

 Absolute closeness is obtained from the average similarity between the points in Ci that are connected to the points in Cj, i.e., the average weight of the edges from Ci to Cj.


Internal closeness
Internal closeness of a cluster is obtained from the average of the weights of the edges inside the cluster.

 If the relative inter-connectivity measure and the relative closeness measure are the same, choose inter-connectivity.

 You can also use,


 RI(Ci, Cj) ≥ T(RI) and RC(Ci, Cj) ≥ T(RC)
 Allows multiple clusters to merge at each level.
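A rough sketch of the relative closeness computation described above, operating on a dense similarity matrix W; the size-weighted combination of the two internal closeness values follows the CHAMELEON paper, and all names here are assumptions:

import numpy as np

def avg_edge_weight(W, idx_a, idx_b):
    # Average weight of the (nonzero) edges running between the two index sets
    sub = W[np.ix_(idx_a, idx_b)]
    nz = sub[sub > 0]
    return nz.mean() if nz.size else 0.0

def relative_closeness(W, ci, cj):
    absolute = avg_edge_weight(W, ci, cj)     # avg weight of edges Ci -> Cj
    internal_i = avg_edge_weight(W, ci, ci)   # avg weight of edges inside Ci
    internal_j = avg_edge_weight(W, cj, cj)   # avg weight of edges inside Cj
    ni, nj = len(ci), len(cj)
    internal = (ni * internal_i + nj * internal_j) / (ni + nj)
    return absolute / internal if internal > 0 else 0.0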

DBSCAN
Density-Based Clustering refers to unsupervised learning methods that identify distinctive
groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region of
high point density, separated from other such clusters by contiguous regions of low point density.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a base algorithm for
density-based clustering. It can discover clusters of different shapes and sizes from a large
amount of data that contains noise and outliers.
The DBSCAN algorithm uses two parameters:
 minPts: The minimum number of points (a threshold) clustered together for a region to be
considered dense.
 eps (ε): A distance measure that will be used to locate the points in the neighborhood of any
point.
These parameters can be understood if we explore two concepts called Density Reachability and
Density Connectivity.
Reachability in terms of density establishes a point to be reachable from another if it lies within
a particular distance (eps) from it.
Connectivity, on the other hand, involves a transitivity based chaining-approach to determine
whether points are located in a particular cluster. For example, p and q points could be connected
if p->r->s->t->q, where a->b means b is in the neighborhood of a.
There are three types of points after the DBSCAN clustering is complete:

 Core — This is a point that has at least minPts points within distance ε from itself.
 Border — This is a point that has at least one Core point within distance ε.
 Noise — This is a point that is neither a Core nor a Border, and it has fewer than minPts points
within distance ε from itself.
Algorithmic steps for DBSCAN clustering
 The algorithm proceeds by arbitrarily picking up a point in the dataset (until all points have
been visited).
 If there are at least 'minPts' points within a radius of 'ε' of the point, then we consider all
these points to be part of the same cluster.
 The clusters are then expanded by recursively repeating the neighborhood calculation for each
neighboring point.
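A minimal usage sketch with scikit-learn's DBSCAN; the eps and min_samples values are illustrative and should be tuned as discussed under Parameter Estimation below:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                            # cluster id per point; -1 marks noise

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True      # core points found by the algorithm

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", int((labels == -1).sum()), "noise points")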
DBSCAN in action

Parameter Estimation

Every data mining task faces the problem of choosing parameters, and every parameter influences the
algorithm in specific ways. For DBSCAN, the parameters ε and minPts are needed.
 minPts: As a rule of thumb, a minimum minPts can be derived from the number of
dimensions D in the data set, as minPts ≥ D + 1. The low value minPts = 1 does not make
sense, as then every point on its own will already be a cluster. With minPts ≤ 2, the result will
be the same as that of hierarchical clustering with the single link metric, with the dendrogram cut
at height ε. Therefore, minPts must be at least 3. However, larger values are usually
better for data sets with noise and will yield more significant clusters. As a rule of
thumb, minPts = 2·dim can be used, but it may be necessary to choose larger values for very
large data, for noisy data or for data that contains many duplicates.
 ε: The value for ε can then be chosen by using a k-distance graph, plotting the distance to
the k = minPts-1 nearest neighbor ordered from the largest to the smallest value. Good values
of ε are where this plot shows an “elbow”: if ε is chosen much too small, a large part of the
data will not be clustered; whereas for a too high value of ε, clusters will merge and the
majority of objects will be in the same cluster. In general, small values of ε are preferable, and
as a rule of thumb, only a small fraction of points should be within this distance of each other.
 Distance function: The choice of distance function is tightly linked to the choice of ε, and has
a major impact on the outcomes. In general, it will be necessary to first identify a reasonable
measure of similarity for the data set, before the parameter ε can be chosen. There is no
estimation for this parameter, but the distance functions need to be chosen appropriately for
the data set.
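The k-distance graph described above can be produced with a few lines; minPts = 2·dim is used as the rule of thumb, and matplotlib is assumed for the plot:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

min_pts = 2 * X.shape[1]                # rule of thumb: minPts = 2 * dimensions
k = min_pts - 1

nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own nearest neighbor
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])[::-1]          # distance to the k-th neighbor, largest first

plt.plot(k_dist)
plt.xlabel("points ordered by k-distance")
plt.ylabel("distance to k-th nearest neighbor")
plt.show()                                        # pick eps near the 'elbow' of this curve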
