
Contents

List of Figures
Chapter 1: State of the Art
1 Introduction
2 Unsupervised Learning
3 Problems in Unsupervised Learning
4 Unsupervised Learning Methods
4.1 Hierarchical Methods
4.1.1 BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
4.1.2 COBWEB
4.1.3 Chameleon
4.2 Density Methods
4.2.1 DBSCAN
4.2.2 DBCLASD
4.2.3 DENCLUE
4.3 Grid Methods
4.3.1 STING
4.3.2 CLIQUE
4.3.3 WaveCluster
4.4 Partitioning Methods
4.4.1 K-means
4.4.2 CLARA
4.4.3 CLARANS (Clustering Large Applications based upon RANdomized Search)
4.5 Other Clustering Approaches
4.5.1 Approach based on graph theory
4.5.2 Competitive approach
5 Evaluation of a Partition
5.1 Dunn Index
5.2 Davies-Bouldin Index
5.3 Silhouette Index
5.4 Compaction Index of Wemmert and Gancarski
6 Conclusion

List of Figures

Chapter 1
Figure 1: Overview of BIRCH Algorithm
Figure 2: Overall framework of CHAMELEON Algorithm
Figure 3: DENCLUE process
Figure 4: Identification of clusters along the (age, salary) plane
Figure 5: Identification of clusters along the (age, vacation) plane
Figure 6: Final clusters in the three-dimensional (age, salary, vacation) space
Figure 7: Computation of s(i) for each object i, where object i belongs to cluster A

Chapter 1

State of the Art
1. Introduction
“The only way to make an instructive and natural method is to put together things that are alike
and separate those that differ from each other.”

M. Georges Buffon, Histoire naturelle, 1749.

The general process that classification in the computer field tries to apply to digital data (points, tables, images, sounds, etc.) clearly does not escape the rule stated by this famous naturalist and writer, Georges Buffon: the work of classification methods since 1749 has consisted in imitating and automating this principle, by using and inventing adequate means (computing hardware and classification theories).

Starting from this principle, this chapter first presents what classification is, together with its methods, techniques, major approaches, and fields of application; at the end, it details one of these major approaches by studying and analyzing two of its algorithms.

2. Unsupervised Learning:

Unsupervised learning consists in inferring knowledge about classes from the learning samples alone, without knowing in advance which classes they belong to. In contrast to supervised learning, only an input database is available, and the system itself must determine its outputs according to the similarities detected between the different inputs [1].

3. Problems in Unsupervised Learning:

 The number of clusters is normally not known a priori.
 For partitional clustering algorithms, such as K-means, different initial centers may lead to different clustering results; moreover, K is unknown.
 Time complexity: partitional clustering algorithms are O(N), whereas hierarchical algorithms are O(N²).
 The similarity criterion is not obvious: should we use the Euclidean, cosine, Tanimoto, or Mahalanobis distance, or some other measure?
 In hierarchical clustering, at what stage should we stop?
 Evaluating clustering results is difficult because labels are not available at the beginning.

4. Unsupervised Learning Methods:

4.1 Hierarchical Methods:

4.1.1 BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies):

BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies) is an integrated agglomerative hierarchical clustering method. It is mainly designed for clustering large amounts of metric data, and it is particularly suitable when the amount of main memory is limited and a linear I/O cost, with only one database scan, is required. It introduces two concepts, the clustering feature and the clustering feature tree (CF tree), which are used to summarize cluster representations [Tian Zhang et al., 1996].

A CF tree is a height-balanced tree that stores the clustering features of a hierarchical clustering. It is similar to a B+-tree or an R-tree. The CF tree is a balanced tree with a branching factor B (the maximum number of children per non-leaf node) and a threshold T. Each internal node contains a CF triple for each of its children. Each leaf node also represents a cluster and contains a CF entry for each sub-cluster in it. A sub-cluster in a leaf node must have a diameter no greater than the given threshold value T (the maximum diameter of sub-clusters in the leaf node) [Tian Zhang et al., 1996].

An object is inserted into the closest leaf entry (sub-cluster). A leaf node represents a cluster made up of all the sub-clusters represented by its entries. All the entries in a leaf node must satisfy the threshold requirement with respect to T, that is, the diameter of each sub-cluster must be less than T. If the diameter of the sub-cluster stored in the leaf node after insertion becomes larger than the threshold value, then the leaf node, and possibly other nodes, are split. After the insertion of the new object, the information about it is passed toward the root of the tree. The size of the CF tree can be changed by modifying the threshold. These structures help the clustering method achieve good speed and scalability on large databases. BIRCH is also effective for incremental and dynamic clustering of incoming objects, and it can find approximate solutions to combinatorial problems with very large data sets [Harrington & Salibián-Barrera, 2010].

Figure 1 gives an overview of the BIRCH algorithm, which mainly consists of four phases. The main task of phase 1 is to scan all the data and build an initial in-memory CF tree using the given amount of memory and recycling space on disk. This CF tree tries to reflect the clustering information of the dataset as finely as possible under the memory limit: crowded data points are grouped into fine sub-clusters and sparse data points are removed as outliers, so this phase creates an in-memory summary of the data. BIRCH thus first performs a pre-clustering phase in which dense regions of points are represented by compact summaries (the cluster tree), and then a hierarchical algorithm is used to cluster the set of summaries. BIRCH is appropriate for very large datasets, since it makes the time and memory constraints explicit. [2]

Figure 1: Overview of BIRCH Algorithm [2]
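To make the roles of the branching factor B and the threshold T concrete, the sketch below calls the BIRCH implementation available in scikit-learn; the data set and the parameter values are illustrative assumptions rather than part of the description above.

# A minimal sketch of BIRCH via scikit-learn (assumed available); the data and
# the parameter values are illustrative, not taken from [2].
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 3, 6)])

# threshold plays the role of T (maximum sub-cluster diameter) and
# branching_factor the role of B (maximum children per non-leaf node).
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(np.bincount(labels))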

4.1.2 COBWEB Algorithm:

COBWEB [3] is an incremental and unsupervised clustering algorithm that produces a hierarchy of classes. Its incremental nature allows new data to be clustered without having to repeat the clustering already made. The control strategy of COBWEB is described in the following pseudocode:

1 Function Cobweb(object, root)
2   Incorporate object into the root cluster;
3   If root is a leaf then
4     return expanded leaf with the object;
5   else choose the operator that results in the best clustering:
6     a) Incorporate the object into the best host;
7     b) Create a new class containing the object;
8     c) Merge the two best hosts;
9     d) Split the best host;
10  If (a) or (c) or (d) then
11    call Cobweb(object, best host);

The objective function used to evaluate the quality of the clustering produced by each of the above operators is called Category Utility (CU); it is defined in [4] by:

CU(P) = (1/N0) ∑_{k=1}^{N0} P(Ck) [ ∑_i ∑_j P(Ai = Vij | Ck)² − ∑_i ∑_j P(Ai = Vij)² ]

where N0 is the number of classes, Ck denotes the k-th class, and P(Ai = Vij) is the probability that attribute Ai takes the value Vij.

Modifications of COBWEB have been examined in order to adapt the algorithm to the needs of a classification problem. COBWEB is first used to produce a clustering from a set of training data with labeled observations, and it is then used to classify new, unseen observations. During classification, each unseen observation is inserted into the "training" COBWEB tree without altering its structure. If an unseen observation is classified into one of the existing classes, some of its attribute values can be inferred using information from the class into which it has been classified. Thus, when the observation reaches a leaf node, it is removed from the tree and given the most common label among the labels of the observations of that node.

4.1.3 Chameleon Algorithm:

CHAMELEON uses a graph-partitioning algorithm to cluster the sparse graph of data objects into a large number of relatively small sub-clusters. It then uses an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters using connectivity and closeness measures. CHAMELEON was derived from observed weaknesses of two popular hierarchical clustering algorithms, CURE and ROCK. CURE and related schemes ignore information about the aggregate inter-connectivity of objects in two different clusters: they measure the similarity between two clusters based only on the similarity of the closest pair of representative points belonging to the different clusters. ROCK and related schemes ignore information about the closeness of two clusters while emphasizing their inter-connectivity: they only consider the aggregate inter-connectivity across pairs of clusters and ignore the value of the stronger edges across clusters.

Figure 2: Overall framework of CHAMELEON Algorithm

CHAMELEON uses a k-nearest-neighbor graph to represent its objects. This graph captures the concept of neighborhood dynamically and results in more natural clusters: the neighborhood is defined narrowly in a dense region, whereas it is defined more widely in a sparse region.
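As an illustration of this starting point, the sketch below builds such a k-nearest-neighbor graph with scikit-learn; this is only the graph-construction step (an assumption for convenience, since CHAMELEON itself relies on its own graph-partitioning machinery), and the data and k are illustrative.

# A minimal sketch of the k-nearest-neighbor graph CHAMELEON operates on,
# built here with scikit-learn; data and n_neighbors are illustrative.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Sparse adjacency matrix: each object is linked to its k nearest neighbors,
# so the notion of neighborhood adapts to the local density of the data.
knn_graph = kneighbors_graph(X, n_neighbors=5, mode="distance")
print(knn_graph.shape, knn_graph.nnz)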

4.2 Density Methods:

4.2.1 DBSCAN Algorithm:

DBSCAN is a density-based algorithm in which the user needs to define two parameters: ε and the minimum number of points required to create a cluster, minPts. In the first step a random point P is selected and the algorithm computes the distance between this point and every other point in the dataset. It is worth mentioning that nothing prevents defining the distance function as an input parameter of the algorithm, but in this work we follow the originally proposed algorithm [5] and define this function inside the algorithm. If the distance between two points in the dataset is less than or equal to ε, these points become neighbors. If the ε-neighborhood of P contains at least as many points as minPts (it is called a dense region), a cluster is created. If the minPts condition is not met, i.e., there are fewer points in the ε-neighborhood of P, the points are labeled as noise. However, these noise points can later be found in a sufficiently dense ε-neighborhood of a different point and become part of another cluster. Once a cluster has been started, the next step is to verify whether the algorithm can expand this cluster or should move to the next point outside it. This is done simply by checking whether the distance and minPts conditions are met for every point P' within the cluster. If the verification is positive (the condition is fulfilled), the algorithm expands the cluster with every point in the ε-neighborhood of P'. When the cluster has been expanded to its maximum size, meaning that no more points can be found in the dense region of the cluster, the expansion is finished and every point is labeled as visited. The algorithm then randomly selects another point of the dataset that is not labeled as visited and repeats the procedure. When there are no more unvisited points, the algorithm stops. Pseudocode for the DBSCAN algorithm and for the expansion function is presented below.

Algorithm 1:
Data
1 Dataset - D,
2 distance - ɛ,
3 minimum number of points to create dense region - minPts
4 begin
5 C <--- 0
6 for each point P in dataset D do
7 if P is visited then
8 Continue to next P
9 end
10 else
11 mark P as visited
12 nbrPts <--- points in ɛ - neighborhood of P (distance function)
13 if sizeof(nbrPts) < minPts then
14 mark P as NOISE
15 end
16 else
17 C <--- NewCluster
18 Call Expand Cluster Function(P, nbrPts, C, minPts)
19 end
20 end
21 end
22 end

Algorithm 2:
1 Data
2 Point in dataset - P,
3 Neighbor points - nbrPts
4 Current cluster - C
5 distance - ɛ
6 minimum number of points to create dense region - minPts
7 begin
8 add P to cluster C
9 for each point P' in nbrPts do
10 if P' is not visited then
11 mark P' as visited
12 nbrPts' <--- points in ɛ-neighborhood of P' (distance function)
13 if sizeof(nbrPts') >= minPts then
14 nbrPts <--- nbrPts + nbrPts'
15 end
16 end
17 if P' is not a member of any cluster then

18 add P' to cluster C
19 end
20 end
21 end
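The pseudocode above maps directly onto the DBSCAN implementation of scikit-learn, as the following minimal sketch shows; the data set and the values of eps and min_samples are illustrative assumptions, not part of the text above.

# A minimal sketch of DBSCAN through scikit-learn (assumed available).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(100, 2)) for c in (0, 2)])

# eps corresponds to epsilon and min_samples to minPts; the label -1 marks noise.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(sorted(set(labels)))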

4.2.2 DBCLASD Algorithm:

Basically, DBCLASD [6] is an incremental approach. DBCLASD is based on the assumption that the points inside a cluster are uniformly distributed, and it dynamically determines the proper number and shape of clusters for a database without needing any input parameters [7]. A random point is assigned to a cluster, which is then processed incrementally without considering the cluster as a whole.

In DBCLASD, a cluster is defined by the three properties below:

1) Expected distribution condition: NNDistSet(C), the set of nearest-neighbor distances of cluster C, has the expected distribution with the required confidence level.

2) Optimality condition: each point that comes into the neighborhood of C does not fulfill condition (1).

3) Connectivity condition: each pair of points (a, b) in C is connected through the grid-cell structure.

Algorithm:

1. Build a set of candidates using a region query.
2. If the distance set of C has the expected distribution, the point remains in the cluster.
3. Otherwise, insert the point into the list of unsuccessful candidates.
4. Expand the cluster in the same way and check the condition.
5. The list of unsuccessful candidates is then checked again against the condition.
6. If a point passes, it is put in the cluster; otherwise it remains in that list.

Basically, there are two main tasks in DBCLASD. The first is candidate generation, which is done on the basis of a region query that uses a circle of some radius to accept candidates. The second is candidate testing, which is accomplished through a chi-square test. Points that lie below the threshold value are considered suitable candidates, while those above the threshold remain in the unsuccessful candidates' list. In the end, the unsuccessful candidate list is checked again: every point goes through the test, the points that pass are added to the cluster, and the remaining ones stay in the unsuccessful candidates' list.

4.2.3 DENCLUE:

DENCLUE (DENsity-based CLUstEring) can be considered a special case of Kernel Density Estimation (KDE). KDE is a non-parametric estimation technique which aims at finding dense regions of points. The authors of DENCLUE developed this algorithm to classify large multimedia databases, because this type of database contains large amounts of noise and requires clustering high-dimensional feature vectors.

DENCLUE mainly operates in two stages, a pre-clustering step and a clustering step, as illustrated in Figure 3. The first step constructs a map (a hyper-rectangle) of the database; this map is used to speed up the calculation of the density function. The second step identifies clusters from highly populated cubes (the cubes in which the number of points exceeds a threshold ξ given as a parameter) and their neighboring populated cubes.
DENCLUE is based on the calculation of the influence of points on each other; the total sum of these influence functions represents the density function. There exist many influence functions based on the distance between two points x and y, but in this work we focus on the Gaussian function. Equation (1) gives the influence function between two points x and y:

f_Gauss(x, y) = exp(−d(x, y)² / (2σ²))     (1)

where d(x, y) is the Euclidean distance between x and y, and σ represents the radius of the neighborhood containing x.
Equation (2) gives the density function:

f_D(x) = ∑_{i=1}^{N} f_Gauss(x, x_i)     (2)

where D represents the set of points of the database and N its cardinality.
To determine the clusters, DENCLUE calculates the density attractor of each point in the database. This attractor is a local maximum of the density function, found by the hill-climbing algorithm, which is based on a gradient-ascent approach, as shown in equation (3):

x⁰ = x,   x^(i+1) = x^(i) + δ · ∇f_Gauss^D(x^(i)) / ‖∇f_Gauss^D(x^(i))‖     (3)

The calculation ends at a step k ∈ N such that f_D(x^(k+1)) < f_D(x^(k)); we then take x* = x^(k) as the density attractor.
The points forming a path to a density attractor are called attracted points. Clusters are formed by taking into account the density attractors and their attracted points.
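As a concrete illustration of equations (1)–(3), the following minimal NumPy sketch evaluates the Gaussian density and performs one normalized gradient-ascent (hill-climbing) step; the data, σ and δ are illustrative assumptions, and the hyper-cube map used by the real algorithm to speed up the computation is omitted.

# A minimal sketch of DENCLUE's Gaussian density and one hill-climbing step.
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(300, 2))          # dataset
sigma, delta = 0.5, 0.1                # kernel radius and step size (assumptions)

def density(x, D, sigma):
    """f_D(x) = sum_i exp(-||x - x_i||^2 / (2 sigma^2))."""
    d2 = np.sum((D - x) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

def gradient(x, D, sigma):
    """Gradient of f_D at x: each point pulls x toward itself, weighted by its influence."""
    diff = D - x
    w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * sigma ** 2))
    return (w[:, None] * diff).sum(axis=0) / sigma ** 2

# One normalized gradient-ascent step toward the density attractor of x.
x = D[0].copy()
g = gradient(x, D, sigma)
x_next = x + delta * g / np.linalg.norm(g)
print(density(x, D, sigma), density(x_next, D, sigma))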

The strength of this algorithm resides in the choice of the structure with which the data are represented. A. Hinneburg and D. A. Keim chose to work with the concept of a hyper-rectangle. A hyper-rectangle is made up of hyper-cubes, and each hyper-cube is characterized by the dimension of the feature vectors (i.e., the number of criteria) and by a key. This structure allows DENCLUE to manipulate the data easily, by using the cube keys and considering only the populated cubes.

Figure 3: DENCLUE process

However, the use of hill climbing in DENCLUE has limitations in terms of clustering quality and execution time; in particular, hill climbing does not converge exactly to the maximum, but only comes close to it. To overcome these limits, two algorithms, DENCLUE-SA and DENCLUE-GA, have been proposed in previous work. They are based on replacing the hill-climbing algorithm with two promising metaheuristics: simulated annealing (SA) and genetic algorithms (GA). Although both algorithms achieve good clustering performance, they suffer in terms of execution time. In order to adapt the DENCLUE algorithm efficiently to a big-data framework, an improved version, called DENCLUE-IM, was developed; it is presented hereafter.

4.3 Grid Methods:

4.3.1 STING: A Statistical Information Grid Approach:

In the STING algorithm, the spatial area is divided into rectangular cells. There are several different levels of such rectangular cells, corresponding to different resolutions, and these cells form a hierarchical structure. Each cell at a high level is partitioned to form a number of cells at the next lower level. Statistical information about each cell is calculated and stored beforehand and is used to answer queries [8].

4.3.1.1 STING Algorithm


The algorithm is given below:
1. Determine a layer to begin with.

2. For each cell of this layer, we calculate the confidence interval (or estimated range) of
probability that this cell is relevant to the query.
3. From the interval calculated above, we label the cell as relevant or not relevant.
4. If this layer is the bottom layer, go to Step 6; otherwise, go to Step 5.
5. We go down the hierarchy structure by one level. Go to Step 2 for those cells that form the
relevant cells of the higher level layer.

6. If the specification of the query is met, go to Step 8; otherwise, go to Step 7.


7. Retrieve the data that fall into the relevant cells and do further processing. Return the results that meet the requirements of the query. Go to Step 9.
8. Find the regions of relevant cells. Return those regions that meet the requirement of the query.
Go to Step 9.
9. Stop [3].

4.3.2 CLIQUE: A Dimension-Growth Subspace Clustering Method

CLIQUE (Clustering in QUEst) is a bottom-up subspace clustering algorithm that constructs static grids. It uses an Apriori-style approach to reduce the search space. CLIQUE is a density- and grid-based subspace clustering algorithm that finds clusters by taking a density threshold and the number of grid intervals as input parameters. CLIQUE operates on multidimensional data not by processing all the dimensions at once, but by processing a single dimension in a first step and then growing upward to the higher ones [9].

4.3.2.1 CLIQUE Working

The clustering process in CLIQUE involves:


1. CLIQUE partitions the d-dimensional data space into non-overlapping rectangular units called grids, according to the given grid size, and then finds the dense regions according to a given threshold value. A unit is dense if the number of data points in it exceeds the threshold value.
2. Clusters are generated from all the dense subspaces by using the Apriori approach [10]. CLIQUE generates a minimal description for the clusters obtained by first determining the maximal dense regions in the subspaces and then a minimal cover of each cluster from that maximal region. It repeats the same procedure until all the dimensions have been covered.

A k-dimensional cell c (k > 1) can have at least l points only if every (k−1)-dimensional projection of c, which is a cell in a (k−1)-dimensional subspace, has at least l points, where l is the density threshold. Consider the figures below, where the embedding data space contains three dimensions: age, salary, and vacation. A 2-D cell, say in the subspace formed by age and salary, contains l points only if the projection of this cell onto each of the dimensions age and salary, respectively, contains at least l points [11].
The following figures show that the dense units found with respect to age for the dimensions salary and vacation are intersected to provide a candidate search space for dense units of higher dimensionality [12].

Figure 4: Identification of clusters along the (age, salary) plane

Figure 5: Identification of clusters along the (age, vacation) plane

Figure 6: Final clusters in the three-dimensional (age, salary, vacation) space

The subspaces representing these dense units are intersected to form a candidate search space in which dense units of higher dimensionality may exist. In the second step, CLIQUE generates a minimal description for each cluster as follows. For each cluster, it determines the maximal region that covers the cluster of connected dense units, and it then determines a minimal cover (a logic description) for each cluster. CLIQUE automatically finds the subspaces of highest dimensionality such that high-density clusters exist in those subspaces. It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases. However, obtaining meaningful clustering results depends on a proper tuning of the grid size and the density threshold, which is particularly difficult because the same grid size and density threshold are used across all combinations of dimensions in the data set. Thus, the accuracy of the clustering results may be degraded at the expense of the simplicity of the method. Moreover, for a given dense region, all projections of the region onto lower-dimensional subspaces will also be dense, which can result in a large overlap among the reported dense regions. Furthermore, it is difficult to find clusters of rather different densities within different dimensional subspaces.
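To illustrate the very first step of CLIQUE on a single dimension, the sketch below partitions one axis into equal-width units and keeps those whose point count exceeds a density threshold; the data, grid size and threshold are illustrative assumptions, and the Apriori-style combination of dense units into higher-dimensional candidates is not shown.

# A minimal sketch of 1-D dense-unit detection, the building block of CLIQUE.
import numpy as np

rng = np.random.default_rng(0)
age = np.concatenate([rng.normal(30, 2, 200), rng.normal(55, 3, 150)])

n_units, threshold = 10, 40          # grid size and density threshold l (assumptions)
counts, edges = np.histogram(age, bins=n_units)
dense_units = [(edges[i], edges[i + 1]) for i, c in enumerate(counts) if c >= threshold]
print(dense_units)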

4.3.3 WaveCluster [13]:

Given a set of spatial objects o_i, 1 ≤ i ≤ N, the goal of the algorithm is to detect clusters and assign labels to the objects based on the cluster they belong to. The main idea of WaveCluster is to transform the original feature space by applying a wavelet transform and then to find the dense regions in the new space. It yields sets of clusters at different resolutions and scales, which can be chosen according to the user's needs. The main steps are shown in the WaveCluster algorithm below.

4.3.3.1 WaveCluster Algorithm:

Input: multidimensional data objects' feature vectors
Output: clustered objects

1. Quantize feature space, then assign objects to the cells.

2. Apply wavelet transform on the quantized feature space.

3. Find the connected components (clusters) in the sub-bands of transformed feature space, at
different levels.

4. Assign labels to the cells.

5. Make the lookup table.

6. Map the objects to the clusters.

4.4 Partitioning Methods:

4.4.1 K-means Algorithm:

K-means, introduced by McQueen [14], is the simplest clustering algorithm. The main idea is to choose randomly a set of centres, whose number is fixed a priori, and to look iteratively for the optimal partition. Every individual is allocated to the closest centre; after the assignment of all the data, the mean of every group is computed and becomes the new representative of the group; when a stationary state is reached (no data point changes group), the algorithm stops.

Algorithm 1: K-means
Input:
  A data set of N objects, noted X
  The number of clusters, noted K
Output:
  A partition into K clusters {C1, C2, ..., CK}
Begin
  Random initialization of the centres µ1, ..., µK;
  Repeat
    Assignment:
      Generate a new partition by assigning every object to the group whose centre is closest:
      xi ∈ Ck if |xi − µk| = min_j |xi − µj|, with µk the centre of class k;
    Representation:
      Compute the centres associated with the new partition:
      µk = (1/Nk) ∑_{xi ∈ Ck} xi, where Nk is the number of objects in Ck;
  Until convergence of the algorithm towards a stable partition;
End.

This process tries to minimize the intra-cluster variance, represented in the form of an objective function:

J = ∑_{i=1}^{K} ∑_{xj ∈ Ci} d(xj, Ci)

In the case of the Euclidean distance, this function is called the squared-error function:

J = ∑_{i=1}^{K} ∑_{xj ∈ Ci} ‖xj − Ci‖²
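As an illustration of the assignment/representation loop and of the squared-error objective above, here is a minimal NumPy sketch of K-means; the data set, the value of K and the initialization from random data points are illustrative assumptions, and empty clusters are not handled.

# A minimal NumPy sketch of the K-means loop described above.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(100, 2)) for c in (0, 3, 6)])
K = 3

centres = X[rng.choice(len(X), size=K, replace=False)]   # random initialization
labels = None
while True:
    # Assignment: each object goes to the group whose centre is closest.
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    new_labels = dists.argmin(axis=1)
    if labels is not None and np.array_equal(new_labels, labels):
        break                                            # stable partition reached
    labels = new_labels
    # Representation: recompute each centre as the mean of its group
    # (empty clusters are not handled in this sketch).
    centres = np.array([X[labels == k].mean(axis=0) for k in range(K)])
print(centres)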

4.4.2 CLARA Algorithm:

Designed by Kaufman and Rousseeuw to handle large datasets, CLARA (Clustering LARge Applications) relies on sampling [15]. Instead of finding representative objects for the entire data set, CLARA draws a sample of the data set, applies PAM on the sample, and finds the medoids of the sample. The point is that, if the sample is drawn in a sufficiently random way, the medoids of the sample approximate the medoids of the entire data set. To come up with better approximations, CLARA draws multiple samples and gives the best clustering as the output. Here, for accuracy, the quality of a clustering is measured based on the average dissimilarity of all objects in the entire data set, and not only of the objects in the samples. Experiments reported in [15] indicate that five samples of size 40 + 2k give satisfactory results.

Algorithm CLARA:
1. For i = 1 to 5, repeat the following steps.
2. Draw a sample of 40 + 2k objects randomly from the entire data set, and call Algorithm PAM to find the k medoids of the sample.
3. For each object Oj in the entire data set, determine which of the k medoids is the most similar to Oj.
4. Calculate the average dissimilarity of the clustering obtained in the previous step. If this value is less than the current minimum, use it as the current minimum and retain the k medoids found in Step 2 as the best set of medoids obtained so far.
5. Return to Step 1 to start the next iteration.

Complementary to PAM, CLARA performs satisfactorily for large data sets (e.g., 1,000 objects in 10 clusters). Recall that each iteration of PAM is of O(k(n − k)²). For CLARA, by applying PAM just to the samples, each iteration is of O(k(40 + k)² + k(n − k)). This explains why CLARA is more efficient than PAM for large values of n.
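The sampling strategy of CLARA can be sketched in a few lines, as below. The PAM step is replaced here by a crude greedy stand-in (BUILD phase only, no swap phase), so this is only an illustration of the sampling loop rather than the actual CLARA/PAM implementation; the data and k are illustrative assumptions.

# A minimal, runnable sketch of the CLARA loop described above.
import numpy as np

def crude_pam(S, k):
    """Greedy stand-in for PAM: pick k medoids that most reduce the total dissimilarity."""
    chosen = []
    for _ in range(k):
        best_j, best_cost = None, np.inf
        for j in range(len(S)):
            if j in chosen:
                continue
            d = np.linalg.norm(S[:, None, :] - S[chosen + [j]][None, :, :], axis=2)
            cost = d.min(axis=1).sum()
            if cost < best_cost:
                best_j, best_cost = j, cost
        chosen.append(best_j)
    return S[chosen]

def clara(X, k, n_samples=5, seed=0):
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        # Draw a sample of 40 + 2k objects and run (a stand-in for) PAM on it.
        idx = rng.choice(len(X), size=min(len(X), 40 + 2 * k), replace=False)
        medoids = crude_pam(X[idx], k)
        # Quality is the average dissimilarity over the *entire* data set.
        d = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
        cost = d.min(axis=1).mean()
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(200, 2)) for c in (0, 5)])
print(clara(X, k=2))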

4.4.3 CLARANS (Clustering Large Applications based upon RANdomized Search):

Algorithm CLARANS:
1. Input parameters numlocal and maxneighbor. Initialize i to 1 and mincost to a large number.
2. Set current to an arbitrary node in Gn,k.
3. Set j to 1.
4. Consider a random neighbor S of current and calculate the cost differential of the two nodes.
5. If S has a lower cost, set current to S and go to Step 3.
6. Otherwise, increment j by 1. If j <= maxneighbor, go to Step 4.
7. Otherwise, when j > maxneighbor, compare the cost of current with mincost. If the former is less than mincost, set mincost to the cost of current and set bestnode to current.
8. Increment i by 1. If i > numlocal, output bestnode and halt. Otherwise, go to Step 2.

Steps 3 to 6 above search for nodes with progressively lower costs. However, if the current node has already been compared with the maximum number of neighbors of the node (specified by maxneighbor) and is still of the lowest cost, the current node is declared to be a "local" minimum. Then, in Step 7, the cost of this local minimum is compared with the lowest cost obtained so far, and the lower of the two costs is stored in mincost. Algorithm CLARANS then repeats the search for other local minima, until numlocal of them have been found.

As shown above, CLARANS has two parameters: the maximum number of neighbors examined (maxneighbor) and the number of local minima obtained (numlocal). The higher the value of maxneighbor, the closer CLARANS is to PAM and the longer each search for a local minimum takes; but the quality of such a local minimum is higher and fewer local minima need to be obtained. Like many applications of randomized search [15], [16], we rely on experiments to determine the appropriate values of these parameters.
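The randomized search of Steps 2–8 can be sketched as follows: a node is a set of k medoid indices and a neighbor swaps one medoid for a random non-medoid. The data, numlocal and maxneighbor values are illustrative assumptions, and the cost function is simply the total distance of every object to its closest medoid.

# A minimal sketch of the CLARANS search loop described above.
import numpy as np

def cost(X, medoid_idx):
    """Total dissimilarity of every object to its closest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def clarans(X, k, numlocal=2, maxneighbor=20, seed=0):
    rng = np.random.default_rng(seed)
    best_node, min_cost = None, np.inf
    for _ in range(numlocal):
        current = rng.choice(len(X), size=k, replace=False)    # arbitrary node in Gn,k
        j = 0
        while j < maxneighbor:
            neighbor = current.copy()
            pos = rng.integers(k)
            candidates = np.setdiff1d(np.arange(len(X)), current)
            neighbor[pos] = rng.choice(candidates)             # swap one medoid
            if cost(X, neighbor) < cost(X, current):
                current, j = neighbor, 0                       # lower cost: restart count
            else:
                j += 1
        if cost(X, current) < min_cost:                        # local minimum found
            best_node, min_cost = current, cost(X, current)
    return best_node

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, size=(100, 2)) for c in (0, 5)])
print(clarans(X, k=2))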

4.5 Other Clustering Approaches:

4.5.1 Approach Based on Graph Theory:

4.5.1.1 Self-Organizing Maps (SOM):

The self-organizing map (SOM) method is an unsupervised classification algorithm based on an artificial neural network. The neurons of the output layer are interconnected to form a network that remains fixed during learning. Each neuron has coordinates in the data space and represents the prototype of a class. Two connected neurons influence each other during learning. The SOM algorithm makes it possible to discover the topology of the data: neurons that are close to each other in this topology (often rectangular) can correspond to the same class.

This algorithm can be considered a vector quantization method that gathers the observations into their respective classes while taking into account their topography in the observation space. It consists in:

 defining a priori a notion of neighborhood between classes;
 making observations that are neighbors in the data space belong, after classification, to the same class or to neighboring classes;
 compressing multidimensional data while preserving their characteristics. [16]

The principle is based on the theory of competitive networks, that is, on the establishment of a competition link between the neurons other than the input ones. In practice, this means that inhibitory links connect the neurons. During the learning phase, the network specializes its neurons in the recognition of input categories, which is indeed a way of learning classes. A class is defined as the set of examples recognized by an output neuron of a competition network. SOMs are composed, like all competition networks, of an input layer and a competition layer. The competition layer is structured by a distance that defines a notion of neighborhood. At first, the competition is no longer between one neuron and the others, but between a neighborhood and the rest of the map. This distance is in fact a function of time, so that we end up with a competition per neuron at the end of learning. The advantage of this approach by variable neighborhoods is to separate very different examples from each other effectively; the final classes are a refinement of those obtained at the beginning.
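To make the competition and neighborhood mechanism concrete, here is a minimal NumPy sketch of one SOM training step on a rectangular map: the winning neuron is found, then it and its map neighbors are pulled toward the input. The map size, learning rate and neighborhood radius are illustrative assumptions, and the decay of the neighborhood over time is not shown.

# A minimal sketch of one SOM training step on a rectangular map.
import numpy as np

rng = np.random.default_rng(0)
rows, cols, dim = 5, 5, 2
weights = rng.random((rows, cols, dim))          # one prototype per neuron
grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"))

def train_step(x, weights, lr=0.5, radius=1.5):
    # Winning neuron: the prototype closest to the input x.
    dists = np.linalg.norm(weights - x, axis=2)
    winner = np.unravel_index(dists.argmin(), dists.shape)
    # Gaussian neighborhood on the map around the winner.
    grid_dist = np.linalg.norm(grid - np.array(winner), axis=2)
    h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
    # Move the prototypes toward x, weighted by the neighborhood function.
    weights += lr * h[..., None] * (x - weights)
    return weights

for x in rng.random((1000, dim)):                # one pass over random inputs
    weights = train_step(x, weights)
print(weights.shape)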

5 Evaluation of a Partition:

5.1 Dunn Index:

Dunn [5] proposed an index to identify compact and well-separated clusters. The main objective of the Dunn index is to maximize inter-cluster distances (separation) and to minimize intra-cluster distances (compactness). It is defined by:

D = min_{k ≠ k′} dist(Ck, Ck′) / max_k diam(Ck)

where dist(Ck, Ck′) is a dissimilarity function between the clusters Ck and Ck′, defined by:

dist(Ck, Ck′) = min_{u ∈ Ck, w ∈ Ck′} d(u, w)

d(u, w) being the Euclidean distance between u and w, and diam(Ck) the diameter of cluster Ck (the largest distance between two of its points).

An optimal value of K (the number of clusters) is the one that maximizes the Dunn index. However, this index has two major drawbacks:

 its computation is expensive;
 it is sensitive to the presence of noise.
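Under the standard definition recalled above (minimum inter-cluster distance over maximum cluster diameter), the index can be computed as in the following minimal sketch, assuming NumPy and SciPy are available; the data and labels are illustrative.

# A minimal sketch of the Dunn index.
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Smallest distance between points of two different clusters.
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:])
    # Largest diameter (maximum pairwise distance inside a cluster).
    max_diam = max(pdist(c).max() for c in clusters if len(c) > 1)
    return min_sep / max_diam

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 4)])
labels = np.repeat([0, 1], 50)
print(dunn_index(X, labels))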

5.2 Davies-Bouldin Index:

The objective of this index is to minimize the average similarity between each cluster and the cluster most similar to it; it is defined by [3]:

DB = (1/K) ∑_{k=1}^{K} max_{k′ ≠ k} ( (δk + δk′) / d(Gk, Gk′) )

where δk is the average distance of the points of cluster Ck to its barycentre Gk, and d(Gk, Gk′) is the distance between the barycentres of Ck and Ck′.

An optimal value of K is the one that minimizes DB.
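The sketch below evaluates this index with the implementation provided by scikit-learn (assumed available); the data and labels are illustrative, and lower values indicate a better partition.

# A minimal check of the Davies-Bouldin index with scikit-learn.
import numpy as np
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 4, 8)])
labels = np.repeat([0, 1, 2], 50)
print(davies_bouldin_score(X, labels))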

5.3 Silhouette Index:

The concept introduced by Rousseeuw [17] is described as follows: the silhouette is a tool used to assess the validity of a clustering. It is constructed to select the optimal number of clusters for ratio-scale data (as in the case of Euclidean distances) and is suitable for clearly separated clusters. The proximities considered can be either dissimilarities or similarities, and the index works best in situations with roughly spherical clusters.

Figure 7: Computation of s(i) for each object i, where object i belongs to cluster A

Case #1 considers dissimilarities. As illustrated in Figure 7, take an object i of the data set, assigned to cluster A, and define:

a(i) = the average dissimilarity of i to all other objects of A;
d(i, C) = the average dissimilarity of i to all objects of a cluster C ≠ A;
b(i) = the minimum of d(i, C) over all clusters C ≠ A;
B = the cluster for which this minimum is attained, called the neighbor of object i.

The cluster B is like the second-best choice for object i: if i could not be accommodated in cluster A, cluster B would be its closest competitor. The number s(i) is written as:

s(i) = (b(i) − a(i)) / max{a(i), b(i)}

The number s(i) is obtained by combining a(i) and b(i) as follows:

s(i) = 1 − a(i)/b(i)   if a(i) < b(i),
s(i) = 0               if a(i) = b(i),
s(i) = b(i)/a(i) − 1   if a(i) > b(i),

so that −1 ≤ s(i) ≤ 1.

Case #2 considers similarities. In this case, define a′(i) and d′(i, C) analogously and put b′(i) = the maximum of d′(i, C) over C ≠ A. The number s(i) is then obtained by:

s(i) = 1 − b′(i)/a′(i)   if a′(i) > b′(i),
s(i) = 0                 if a′(i) = b′(i),
s(i) = a′(i)/b′(i) − 1   if a′(i) < b′(i).
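For the dissimilarity case, the per-object values s(i) and their average over the data set can be obtained with scikit-learn (assumed available), as in the minimal sketch below; the data and labels are illustrative.

# A minimal check of the silhouette index with scikit-learn.
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 4)])
labels = np.repeat([0, 1], 50)

print(silhouette_samples(X, labels)[:5])   # individual s(i) values in [-1, 1]
print(silhouette_score(X, labels))         # mean silhouette over all objects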

5.4 Compaction Index of Wemmert and Gancarski:

The Wemmert-Gancarski index [18] is also calculated using the distances between the points and the barycentres of all the clusters. For calculating the index, we use the following terms. If M is a point of a cluster Ck, the term R(M) is defined as the quotient of the distance of this point to the barycentre of the cluster it belongs to and the smallest distance of this point to the barycentres of the other clusters:

R(M) = ‖M − Gk‖ / min_{k′ ≠ k} ‖M − Gk′‖

The mean value of this quotient is then calculated for each cluster; if the mean value is greater than 1, it is ignored, otherwise its complement to 1 is taken. This is done using the equation:

Jk = max{ 0, 1 − (1/nk) ∑_{i ∈ Ik} R(Mi) }

The Wemmert-Gancarski index is then calculated as the weighted mean of the Jk over all the clusters:

C = (1/N) ∑_{k=1}^{K} nk Jk

The complete equation can be rewritten as:

C = (1/N) ∑_{k=1}^{K} max{ 0, nk − ∑_{i ∈ Ik} R(Mi) }
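The rewritten form above translates directly into the following minimal NumPy sketch; the data and labels are illustrative assumptions.

# A minimal sketch of the Wemmert-Gancarski index as defined above.
import numpy as np

def wemmert_gancarski(X, labels):
    ks = np.unique(labels)
    G = np.array([X[labels == k].mean(axis=0) for k in ks])   # barycentres
    total = 0.0
    for j, k in enumerate(ks):
        pts = X[labels == k]
        d_own = np.linalg.norm(pts - G[j], axis=1)
        others = np.delete(G, j, axis=0)
        d_other = np.linalg.norm(pts[:, None, :] - others[None, :, :], axis=2).min(axis=1)
        R = d_own / d_other
        total += max(0.0, len(pts) - R.sum())                 # n_k * J_k
    return total / len(X)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 4)])
labels = np.repeat([0, 1], 50)
print(wemmert_gancarski(X, labels))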

6 Conclusion:

In this chapter, we have seen that unsupervised learning can be viewed from the perspective of statistical modelling. Statistics provides a coherent framework for learning from data and for reasoning under uncertainty. Many interesting statistical models used for unsupervised learning can be cast as latent variable models and graphical models, and these types of models have played an important role in defining unsupervised learning systems for a variety of different kinds of data.

References

[1] J. A. Hartigan, Clustering Algorithms, John Wiley & Sons, New York, London, Sydney, Toronto, 1975.

[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, pp. 573-592, 1996.

[3] D. H. Fisher, "Knowledge acquisition via incremental conceptual clustering," Machine Learning, vol. 2, pp. 139-172, 1987.

[4] Lei Li, De-Zhang Yang, and Fang-Cheng Shen, "A novel rule-based intrusion detection system using data mining," IEEE, 978-1-4244-5539, 2010.

[5] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in E. Simoudis, J. Han, and U. M. Fayyad (eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), AAAI Press, pp. 226-231, 1996.

[6] X. Xu, M. Ester, H.-P. Kriegel, and J. Sander, "A distribution-based clustering algorithm for mining in large spatial databases," in Proceedings of the 14th ICDE, pp. 324-331, Orlando, FL, 1998.

[7] X. Xu et al., "A distribution-based clustering algorithm for mining in large spatial databases," in Proceedings of the 14th International Conference on Data Engineering, IEEE, 1998.

[8] W. Wang, J. Yang, and R. Muntz, "STING: a statistical information grid approach to spatial data mining," in Proc. 23rd Int. Conf. on Very Large Data Bases, Morgan Kaufmann, pp. 186-195, 1997.

[9] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications," ACM SIGMOD International Conference on Management of Data, vol. 27, no. 2, pp. 94-105, June 1998.

[10] H.-P. Kriegel and A. Zimek, "Subspace clustering, ensemble clustering, alternative clustering, multiview clustering: what can we learn from each other," in Proc. 1st Int'l Workshop on Discovering, Summarizing and Using Multiple Clusterings, 2010.

[11] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications," in Proceedings of 1998 ACM-SIGMOD, pp. 94-105, 1998.

[12] E. Schikuta, "Grid clustering: an efficient hierarchical clustering method for very large data sets," in Proceedings of the 13th International Conference on Pattern Recognition, 1996.

[13] G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: a wavelet-based clustering approach for spatial data in very large databases," The VLDB Journal, vol. 8, pp. 289-304, 2000.

[14] G. Celeux, E. Diday, G. Govaert, Y. Lechevallier, and H. Ralambondrainy, Classification Automatique des Données, Bordas, Paris, 1989.

[15] Y. Ioannidis and Y. Kang, "Randomized algorithms for optimizing large join queries," in Proc. 1990 ACM Special Interest Group on Management of Data, pp. 312-321, 1990.

[16] Y. Ioannidis and E. Wong, "Query optimization by simulated annealing," in Proc. 1987 ACM Special Interest Group on Management of Data, pp. 9-22, 1987.

[17] M. D. Bounneche, Réduction de données pour le traitement d'images, Université Mentouri Constantine, pp. 4-7, 2009.

[18] P. J. Rousseeuw, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53-65, 1987.

[19] B. Desgraupes, "ClusterCrit: Clustering Indices," R package version 1.2.3, 2013. Available online: https://cran.r-project.org/web/packages/clusterCrit/ (accessed on 6 September 2017).

