Studies in Computational Intelligence 957

Reverse Clustering
Formulation, Interpretation and Case Studies
Studies in Computational Intelligence
Volume 957
Series Editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output.
Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago.
All books published in the series are submitted for consideration in Web of Science.
Jan W. Owsiński
Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland

Jarosław Stańczak
Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland

Janusz Kacprzyk
Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Under these circumstances—the data set and the partition being given—we try to reconstruct the partition on the basis of the data set, using cluster analysis. We try to find the entire clustering procedure that will yield, for this given data set, a partition that is as close to the given one as possible. Thus, the result of the procedure is both the clustering procedure, defined by a number of attributes (clustering method, its parameters, variable selection, distance definition, …), and the concrete partition found.
It is obvious that the paradigm borders upon classification (for a very specific
formulation/interpretation of the situation faced), but extends to a much broader
domain, in which the perception of the problem itself and the meaning of solutions
can vary very widely. This is, in particular, shown in the present book.
In the current stage of work, the results obtained and largely contained in this
book pertain mainly to the substantive aspect of the paradigm, while the technical
aspects of the respective algorithms are, as of now, left to future research.
The reverse clustering paradigm constitutes a new perspective on quite a broad spectrum of problems in data analysis, and, as the book shows, it can provide very interesting, instructive and significant results under a wide variety of interpretational assumptions. We sincerely hope, therefore, that this book not only gives the Readers new material and fresh insight into some problems of data analysis, but may also provoke them to deeper studies in the direction indicated here.
The above does not explicitly state the purpose of the exercise (to say nothing
of the technical details), but it can easily be deduced that what is aimed at is closely
related to the notion of classification. While the close relation with classification is not only obvious, but definitely true, the paradigm has a much wider spectrum of applications and meanings, as is explained in Chap. 2 of the book, following the more precise presentation of this paradigm, given in Chap. 1.
The paradigm is constituted, first, by the above statement of the problem, which
then has to be expressed in pragmatic technical terms, involving
(1) the space of clustering algorithms with its granularity (which algorithms are accounted for, and which parameters, defining the entire clustering procedure, are subject to the search for PB);
(2) the measure of similarity between the partition of the set X, given at the outset,
i.e. PA, and the partitions, obtained from the clustering algorithms, this measure
being maximised (or the measure of distance between them, being minimised);
and
(3) the technique of search for the PB given the data of the concrete problem.
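The three components above can be read as a search loop: enumerate candidate configurations Z, run the clustering procedure each one defines on X, score the resulting partition against PA, and keep the best. A minimal sketch in Python, under strong simplifying assumptions: toy 1-D data, a search over the number of clusters only, and plain enumeration instead of the evolutionary search used in the book (real configurations would also span algorithm choice, distance definition and variable weights):

```python
import itertools

def rand_index(p1, p2):
    """Share of object pairs on which two labelings agree (Rand index)."""
    pairs = list(itertools.combinations(range(len(p1)), 2))
    agree = sum((p1[i] == p1[j]) == (p2[i] == p2[j]) for i, j in pairs)
    return agree / len(pairs)

def kmeans_labels(xs, k, iters=50):
    """Tiny 1-D k-means with deterministic quantile initialisation."""
    s = sorted(xs)
    centroids = [s[i * (len(s) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        labels = [min(range(k), key=lambda q: abs(x - centroids[q])) for x in xs]
        for q in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == q]
            if members:
                centroids[q] = sum(members) / len(members)
    return labels

# Toy data X and a given prior partition PA (three groups).
X = [0.1, 0.2, 0.15, 5.0, 5.2, 4.9, 9.7, 10.1]
PA = [0, 0, 0, 1, 1, 1, 2, 2]

# Search the (here: very small) configuration space for the Z whose
# resulting partition PB is closest to PA in the Rand-index sense.
best_Z, best_score = None, -1.0
for k in range(2, 5):
    score = rand_index(PA, kmeans_labels(X, k))
    if score > best_score:
        best_Z, best_score = {"algorithm": "k-means", "k": k}, score

print(best_Z, round(best_score, 3))
```

On this toy data the search recovers k = 3 with a perfect Rand score; on real data the maximum is typically below 1, and its value is itself informative about how well PA can be reconstructed from X.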
This paradigm is, however, also, and perhaps even more importantly, constituted
by the interpretation of the entire setting, and the particular instances of this
interpretation—as mentioned, treated at length in Chap. 2. This is important insofar
as it places the paradigm against the background of the data analysis domain, with
special emphasis on classification and related fields. These various interpretation
instances are associated primarily with the status of the partition PA, namely its
source, the degree of credibility we assign to it, as well as its actual or presumed
connection with the data set X. Depending on these, and on the results obtained, the
status of the obtained partition PB, including validity and applicability, will also
vary significantly.
Owing to this variety of interpretations, the paradigm may find application in a
broad spectrum of analytic, but also cognitive, situations. The subsequent chapters
of the book, starting with the third one, are devoted to the presentation of the cases treated, which definitely differ not only as to their subject matter (the domain from which the data come), but, largely, as to the interpretation of the actual problem and the results obtained. The implication is that the paradigm can be used in many data analytic circumstances for diverse purposes, whenever the structuring of the data set into groups is appropriate.
The paradigm of reverse clustering has been presented already in several papers
by the same team of authors, e.g. in Owsiński et al. (2017a, b), Owsiński, Stańczak
and Zadrożny (2018). The present book aims at a more complete presentation of the
paradigm and its interpretations. The book does not go into the computational and
numerical issues and details, which are, of course, of very high importance. Rather, the main purpose of the book is to present the approach and its capacities in terms of various kinds of situations, problems and interpretations of the respective results. We do indeed hope it conveys the intended message in an effective and interesting manner.
The book is structured in the following manner: first, Chap. 1 presents the
scheme of the approach, characterised, in particular, as it has been used in the cases
illustrated in this book, along with notation used. Then, Chap. 2 outlines the context
of the reverse clustering, starting with other approaches, which concern similar
kinds of problems, related to data analysis, including also an ample reference to the
very general idea of reverse engineering, as well as explainable artificial intelligence or data analysis. Then, the context is briefly analysed in terms of more detailed specific problems, arising in connection with both the reverse clustering procedure and data analytic methods in a more general perspective (like, e.g., selection of variables, or definitions of distance). This chapter also contains a very
important section on the potential interpretations of the reverse clustering paradigm
and its results. Chapter 3 constitutes a very short introduction to the cases studied
and illustrated in the book, which are then presented in the consecutive chapters:
Chap. 4 is devoted to the motorway traffic data, Chap. 5 to environmental contamination data, Chaps. 6 and 7 to two separate cases of typologies or classifications
of administrative units in Poland, and, finally, Chap. 8 to some more academic
exercises. The book closes with Chap. 9 summarising the work done and proposing
some new vistas.
This book is intended to offer the Readers truly interesting and novel perspectives in data analysis, regarding the diverse ways of formulating and approaching problems, and understanding the results, and we shall be very satisfied if it achieves this at least to a perceptible degree.
Jan W. Owsiński
Jarosław Stańczak
Karol Opara
Sławomir Zadrożny
Janusz Kacprzyk
List of Figures
Fig. 6.2 Map of the province of Masovia with the indication of the municipalities classified in three clusters resulting from the reverse clustering according to the data from Table 6.3. Red area in the middle corresponds to Warsaw and its neighbourhood, the bigger red blobs correspond to subregional centres (Radom, Płock, Siedlce and Mińsk Mazowiecki)
Fig. 6.3 Map of Masovia province with the partition PB from Table 6.11
Fig. 6.4 Map of Masovia province with the partition PB from Table 6.12
Fig. 7.1 Two examples of the procedures, leading to the potential prior categorization of the sort similar to the one of interest here
Fig. 7.2 Map of Poland with indication of municipalities, which belonged in the solution of Table 7.2 to the “correct” categories from the initial partition and those that belonged to the other ones (“incorrect”)
Fig. 7.3 Map of Poland, showing the partition of the set of Polish municipalities obtained with the own evolutionary method and the k-means algorithm, composed of 12 clusters, corresponding to Table 7.2
Fig. 8.1 An example of the artificial data set with “nested clusters”, subject to experiments with reverse clustering
Fig. 8.2 An example of the artificial data set with “linear broken structure”, subject to experiments with reverse clustering
Fig. 9.1 Map of the province of Masovia showing the municipality types, obtained from the reverse clustering performed with DBSCAN algorithm, characterised in Table 9.1
Fig. 9.2 The meta-scheme of application of the reverse clustering paradigm
List of Tables
Table 5.6 Contingency table for the partition PA assumed and the one obtained in Series 2 of calculations, PB, with the hierarchical merger algorithm and data for all four elements
Table 6.1 Functional typology of municipalities of the province of Masovia (data as of 2009)
Table 6.2 Variables describing municipalities, accounted for in the study
Table 6.3 Contingency matrix for the administrative breakdown of municipalities of the province of Masovia in Poland and reverse clustering performed with own evolutionary algorithm using k-means
Table 6.4 Contingency matrix for the administrative breakdown of municipalities of the province of Masovia in Poland and reverse clustering performed with own evolutionary algorithm using hierarchical aggregation
Table 6.5 Contingency matrix for the administrative breakdown of municipalities of the province of Masovia in Poland and reverse clustering performed with own evolutionary algorithm using DBSCAN
Table 6.6 Contingency matrix for the administrative breakdown of municipalities of the province of Masovia in Poland and reverse clustering performed with DE algorithm using “pam”
Table 6.7 Contingency matrix for the administrative breakdown of municipalities of the province of Masovia in Poland and reverse clustering performed with DE algorithm using “agnes”
Table 6.8 Examples of variable weights for two runs of calculations, presented in Tables 6.3 and 6.4
Table 6.9 Contingency matrix for the administrative breakdown of municipalities of the province of Wielkopolska in Poland and clustering performed with the Z vector obtained for Masovia in the case shown in Table 6.3 (k-means algorithm)
Table 6.10 Contingency matrix for the administrative breakdown of municipalities of the province of Wielkopolska in Poland and clustering performed with the Z vector obtained for Masovia in the case shown in Table 6.4 (hierarchical aggregation algorithm)
Table 6.11 The contingency matrix for the functional typology of municipalities of Masovia from Table 6.1 and reverse clustering with own evolutionary method using the k-means algorithm
1 The concept of a Reverse Cluster Analysis has been introduced by Ríos and Velásquez (2011) in the case of SOM-based clustering, but it is meant there in a rather different sense than associating original data points with the nodes in the trained network.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
J. W. Owsiński et al., Reverse Clustering, Studies in Computational Intelligence 957,
https://doi.org/10.1007/978-3-030-69359-6_1
1 The Concept of Reverse Clustering
Z* = arg max_Z Q(PA, Z(X))    (1.1)
(Notice that this optimization problem is in line with the reverse engineering, or
backward engineering paradigm, i.e. a procedure that aims at finding out for some
object or process what has been the underlying design, architecture or implementation
process that led to the appearance of the object in question—more on this subject in
Chap. 2.)
Because of the irregularity of circumstances of this search (the nature of the search
space and the values of the performance criterion, see later on for some more details
on this), the solution of the optimization problem defined above is a challenging task.
In our experiments, presented later on in the book, optimisation is performed with
the use of evolutionary algorithms.
Altogether, we try to reconstruct the way in which PA has been obtained from X, this way being represented through the configuration Z, pertaining to the broadly conceived procedure of cluster analysis, for the very simple reason that clustering is a natural way to produce partitions of sets of data. In some specific circumstances one might imagine other approaches to the said reconstruction, but we stick to the apparently most natural one. A short discussion of this subject is also provided in Chap. 2 of the book.
In order to bring the problem formulated here closer to some real life situation,
let us consider the following three examples of situations:
Example 3: the actual toxicity of mushrooms. Even though this case might be
regarded as anecdotal, mushrooms do constitute an important part of cuisine and diet in many cultures, and in many of them they also lead, every year, to deaths or severe hospitalizations. It is also well known that owing to the biological properties of
mushrooms their toxicity is highly variable, and the actual effects heavily depend
upon the way they are prepared (e.g. boiling mushrooms in water and then pouring
this water away) and consumed, as well as upon the consumer, her/his general and
current characteristics (like, e.g., age, weight, or alcohol currently consumed). The
partition PA is meant to correspond to the classes of toxicity / edibility of the particular
species, with the aim of communicating these characteristics to the wide public
in a possibly clear manner. Thus, PA, prepared by the experts, is juxtaposed with the partition PB, obtained from the set X of descriptions x_i of the actual medically described poisoning cases, as well as interviews with experienced cooks specialized in mushroom dishes. The juxtaposition is intended to lead to a better justified and cogently characterized classification PB(Z*), supposedly communicated to the wide public, including general edibility assessments, cooking indications, advice as to identification and first aid, etc.
2 We do not consider here, in this book, the issue of missing data. Thus, it is assumed that for all
n objects each of m variable values is specified. Although the reverse clustering paradigm applies
also to the case of missing values, the book is devoted to the presentation of the main aspects and
implications of the paradigm, without delving into the multiple, even if important, side issues.
1.2 The Notation
We shall now give some additional details, which are associated with the concrete
implementation of the concept introduced here, according to the three aspects of the
space of configurations, specified before. Thereby, we shall be specifying the content
of the vector Z, composed of the individual parameters, subject to choice.
The choice of the clustering algorithms and their parameters.
Concerning the search with respect to the clustering algorithm, throughout this volume we confine ourselves to three families of algorithms:
1. The k-means-type algorithms with some of its varieties, like, e.g. k-medoids;
2. The classical progressive merger algorithms, such as single linkage, complete
linkage etc., and
3. A representative of the local density based algorithms, in this case the DBSCAN
algorithm.
No other kinds of clustering algorithms were accounted for in the experiments reported in this volume, but, considering the clustering algorithms proper, the ones mentioned constitute the major part of the numerous clustering algorithms that could be included in the search. It was important for us to consider approaches that are by their very nature oriented at solving the clustering problem. It should be mentioned, for clarification, that the metaheuristics, very often used for clustering purposes as well, are by no means clustering algorithms themselves: they do not contain in themselves the rationale oriented at a possibly good partitioning of a data set, but, quite generally, at finding an optimum solution.
We by no means provide here a review of clustering methods, this domain being the subject of a multitude of books and papers, both general, survey-like, and devoted to concrete methods and algorithms, to say nothing of a myriad of applications. For the sake of completeness we mention such general references, dealing with clustering, as Mirkin (1996), de Falguerolles (1977), Hayashi et al. (1996),
Banks et al. (2004, 2011), Wierzchoń and Kłopotek (2018), Bramer (2007), Owsiński
(2020), as well as, more focused on specific problems in clustering, Adolfsson et al.
(2019), Figueiredo et al. (1999), Guha et al. (2003), or Simovici and Hua (2019).
The k-means-type algorithms.
The k-means algorithms are based on the following general procedure:
1. for the given data set X = {x_i}, i ∈ I, generate in some way p points3 in E_x (centroids), denote them x_q, q = 1, …, p;
2. assign each object x_i from X to the closest centroid x_q: for each x_i the distances d(x_i, x_q) are calculated for q = 1, …, p, and x_i is assigned to x_q*, for which d(x_i, x_q*) = min_q d(x_i, x_q); thereby, the clusters A_q are formed;
3. for the obtained clusters A_q determine the new centroids x_q, being the “representatives” of the clusters, e.g. as the means of the elements assigned to the clusters in the previous step;
4. if the stopping criterion, e.g. the lack of essential changes between the centroids in subsequent steps of the algorithm, is not satisfied (yet), go to 2; otherwise terminate.
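The four steps above can be sketched directly in code. A minimal Python illustration (2-D points as tuples; for brevity the initial centroids are simply the first p objects, whereas in practice they are sampled randomly and the algorithm is restarted several times, as discussed below):

```python
import math

def kmeans(points, p, max_iter=100, tol=1e-9):
    # Step 1: generate p initial centroids (here simply the first p objects).
    centroids = [points[q] for q in range(p)]
    for _ in range(max_iter):
        # Step 2: assign each object to its closest centroid, forming clusters A_q.
        clusters = [[] for _ in range(p)]
        for x in points:
            q_star = min(range(p), key=lambda q: math.dist(x, centroids[q]))
            clusters[q_star].append(x)
        # Step 3: new centroids = coordinate-wise means of the clusters just formed.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[q]
            for q, cl in enumerate(clusters)
        ]
        # Step 4: stop once the centroids show no essential change; else go to step 2.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return [min(range(p), key=lambda q: math.dist(x, centroids[q])) for x in points]

# Two well-separated 2-D blobs, listed alternately.
data = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.0), (10.0, 9.5), (0.0, 0.5), (9.5, 10.0)]
labels = kmeans(data, 2)
print(labels)  # the two blobs end up in two clusters
```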
This simple procedure was initially formulated by Steinhaus (1956), and soon
afterwards was also developed by Lloyd (1957), but the main impact came from
Forgy (1965), Ball and Hall (1965), and MacQueen (1967). The fuzzy-set based
version of the general k-means method, which became enormously popular and
known as fuzzy c-means, was formulated by Bezdek (1981) (see also, for fuzzy
partitions, Dunn 1974, and Bezdek et al. 1999), following which quite a number of
varieties and algorithmic proposals within the k-means-like algorithm family were
forwarded (see, for instance, Lindsten et al. 2011, Dvoenko 2014, the recent work
of Kłopotek, 2020, or the discussions of equivalence with the Kohonen’s SOMs,
originally formulated by Kohonen 2001).
Nowadays, this generic procedure is implemented in a variety of manners, differing, in particular, as to the status of the x_q: whether they are chosen from among the objects x_i (the k-medoids version) or can be any elements of E_x (the classical k-means), and the way in which they are determined. It is available through a number of open access and paid libraries.
The procedure, along with its varieties, is known to converge quickly (within a few to a dozen iterations of the procedure above) to a local minimum, depending upon the starting point (the initial “centroids” from step 1) and the nature of the set X. Since it converges quickly, it remains feasible to start it many times over from diverse initial sets of centroids in order to increase the chances of finding the global optimum.
The local minimum that is reached through the functioning of the above procedure is, naturally, the minimum of the following criterion function:

Q(P) = Σ_q Σ_{i∈A_q} d(x_i, x_q).
3 Usually, instead of p we would use k, as in “k-means”, but this would overlap with an earlier assumed meaning of k as an index of variables/attributes characterizing objects to be clustered.
1.3 The Elements of Vector Z: The Dimensions of the Search Space
The distance function used is the Euclidean metric squared, in order to preserve
the properties, associated with the choice of cluster mean as the representative of the
cluster. It is obvious that the above Q(P) is monotonic with respect to p, its minimum
for consecutive p’s decreasing with the increase of p down to Q(P) = 0 for p = n.
That is why the k-means type algorithms are applied with the number of clusters, p,
specified.
In the light of the above it becomes clear that the parameters of the vector Z,
associated with the k-means algorithm are the very choice of the algorithm (k-means
or one of its varieties, usually k-medoids as an alternative) and the number of clusters.
Although the choice of the distance definition appears to influence the results obtained from the k-means algorithms, it is not treated here, being considered later on in this chapter.
The classical hierarchical merger algorithms.
The second group of algorithms accounted for in the study of reverse clustering reported here is the group of the most classical clustering algorithms, consisting in stepwise mergers of objects and then clusters. These algorithms are all constructed as follows:
1. start from the set of objects X, treating each object as a separate cluster (p = n); calculate the distances d_qq′ for all pairs of objects (indices) in I; these distances are, therefore, treated in this step as inter-cluster distances D_qq′;
2. find the minimum distance D_q*q** = min_qq′ D_qq′; join/merge the clusters, indexed by q* and q**, between which the distance is minimum, thereby forming a new partition, with p := p − 1;
3. check whether p > 1; if not, terminate the procedure (all objects have been merged into one all-embracing cluster);
4. recalculate the inter-cluster distances (i.e. the distances between the cluster resulting from the merger of A_q* and A_q** in the previous step, on the one hand, and all the other clusters on the other hand, the distance D_q*q** “disappearing”); go to 2.
This—again—very simple procedure gives rise to a variety of concrete algorithms,
which differ by the inter-cluster distance recalculation step 4. The algorithms from
this group find their ancestor in the so-called “Wrocław taxonomy” by Florek et al.
(1956), who were the first to formulate what is now called “single-linkage” algorithm,
along with some of its more general properties. The essential step in the development
of the family of these algorithms came with the papers by Lance and Williams
(1966, 1967). They introduced the general formula, according to which the distance
recalculation step is performed:
D_{q*∪q**, q} = a1 D_{q*q} + a2 D_{q**q} + b D_{q*q**} + c |D_{q*q} − D_{q**q}|,

where q* ∪ q** denotes the index of the cluster resulting from the merging of clusters q* and q**, with the values of the coefficients corresponding to the particular algorithms given in Table 1.1.
Table 1.1 Values of the Lance-Williams coefficients for the most popular of the hierarchical aggregation clustering algorithms

Algorithm                              a1                     a2                     b                              c
Single linkage (nearest neighbour)     1/2                    1/2                    0                              −1/2
Complete linkage (farthest neighbour)  1/2                    1/2                    0                              1/2
Unweighted average (UPGMA)             n_q*/(n_q* + n_q**)    n_q**/(n_q* + n_q**)   0                              0
Weighted average (WPGMA)               1/2                    1/2                    0                              0
Centroid (UPGMC)                       n_q*/(n_q* + n_q**)    n_q**/(n_q* + n_q**)   −n_q* n_q**/(n_q* + n_q**)²    0
Median (WPGMC)                         1/2                    1/2                    −1/4                           0
4 Actually, the Lance-Williams parameterisation was extended later on in order to encompass yet more similar algorithms, but this is of no interest for the main purpose of this book.
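Steps 1–4 of the merger procedure, with the Lance-Williams recalculation in step 4, can be sketched as follows. A minimal Python illustration for 1-D data, with the coefficient table restricted to single and complete linkage:

```python
import itertools

# Lance-Williams coefficients (a1, a2, b, c) for two classical linkages.
COEFFS = {
    "single":   (0.5, 0.5, 0.0, -0.5),
    "complete": (0.5, 0.5, 0.0,  0.5),
}

def hierarchical(points, target_p, method="single"):
    """Merge clusters until target_p remain, recomputing inter-cluster
    distances with the Lance-Williams update formula."""
    a1, a2, b, c = COEFFS[method]
    n = len(points)
    members = {q: [q] for q in range(n)}          # step 1: singleton clusters
    D = {frozenset((i, j)): abs(points[i] - points[j])
         for i, j in itertools.combinations(range(n), 2)}
    next_id = n
    while len(members) > target_p:
        # step 2: find and merge the closest pair of clusters
        qa, qb = min(itertools.combinations(members, 2),
                     key=lambda pair: D[frozenset(pair)])
        d_ab = D.pop(frozenset((qa, qb)))
        new = next_id
        next_id += 1
        merged = members.pop(qa) + members.pop(qb)
        # step 4: Lance-Williams recalculation against all remaining clusters
        for q in members:
            d_aq = D.pop(frozenset((qa, q)))
            d_bq = D.pop(frozenset((qb, q)))
            D[frozenset((new, q))] = (a1 * d_aq + a2 * d_bq
                                      + b * d_ab + c * abs(d_aq - d_bq))
        members[new] = merged
    return sorted(sorted(m) for m in members.values())

clusters = hierarchical([0.0, 1.0, 5.0, 6.0, 20.0], target_p=3)
print(clusters)
```

Switching to method="complete" changes only the recalculation in step 4; all of the surrounding structure is shared, which is exactly what the Lance-Williams parameterisation buys.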
1979, 1981), but then they were virtually forgotten for a long time, mainly in view of computational issues. They regained popularity when, on the one hand, the requirement of single-pass analysis of data sets became important (even before the time of data stream analysis), in view of the volumes of available data, and, on the other hand, new kinds of density techniques, much more computationally effective than the earlier ones, were proposed (see, e.g., Yager and Filev 1994, or, more recently, Rodriguez and Laio 2014). These algorithms, in principle, analyse the interrelations, based on distances/proximities, of a limited number of objects. One of the most commonly used algorithms in this group is DBSCAN,
due in its most popular form mainly to Ester et al. (1996), although it is claimed that
already Ling (1972) proposed the algorithm that was very similar to DBSCAN.
In this algorithm, the objects (points in E_x) are classified into three categories: core points (implying that they are “inside” clusters), density-reachable points (which may form the “border” or the “edges” of clusters), and outliers or noise points. This classification is based on an essentially heuristic procedure, which refers to two parameters (these two parameters being, therefore, also elements of the vector Z in our approach), namely: the radius ε, within which we look for the “closest neighbours” of a given point, and the minimum number of points required to classify a given region in E_x as “dense”, originally denoted minPts. Based on these two parameters the procedure classifies the objects into the three categories mentioned, and afterwards establishes the clusters on the basis of the notion of density connectedness.
The algorithm is popular due to its fast performance and also owing to its independence of the shape of the clusters it identifies, or forms. On the other hand, it definitely strongly depends upon the choice of the two parameters. Although a similar criticism holds for, say, k-means and its parameter p, executing k-means for a (short) series of p's is not a problem and may circumvent the arbitrariness of the choice of the value of p, while finding the right pair of ε and minPts is, in general, quite challenging.
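The classification-and-growth procedure just described can be sketched compactly (a minimal Python illustration on 1-D data; production implementations, such as those in standard machine-learning libraries, use spatial indexing to find the ε-neighbourhoods efficiently):

```python
def dbscan(points, eps, min_pts):
    """Label points by cluster id; None marks outliers / noise points."""
    n = len(points)
    # eps-neighbourhoods (a point counts as its own neighbour)
    neigh = [[j for j in range(n) if abs(points[i] - points[j]) <= eps]
             for i in range(n)]
    core = [len(neigh[i]) >= min_pts for i in range(n)]   # "dense" regions
    labels = [None] * n
    cid = 0
    for i in range(n):
        if labels[i] is not None or not core[i]:
            continue
        labels[i] = cid
        stack = [i]
        while stack:                      # grow the cluster by density connectedness
            j = stack.pop()
            if not core[j]:
                continue                  # border points join but are not expanded
            for k in neigh[j]:
                if labels[k] is None:
                    labels[k] = cid       # density-reachable from a core point
                    stack.append(k)
        cid += 1
    return labels

labels = dbscan([0.0, 0.5, 1.0, 10.0, 10.4, 10.8, 50.0], eps=1.1, min_pts=3)
print(labels)
```

Here the isolated point 50.0 never acquires a label, illustrating how noise points fall out of the procedure rather than being forced into a cluster.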
The weighing or selection of the variables.
In the search for the partition possibly similar to the given PA, operations may also be performed on the set of variables accounted for. Thus, two alternative options can be applied: (i) weighing of each of the variables, preferably on the scale between 0 (not considered at all) and 1 (considered as in the original data set), or (ii) the binary choice of variables, i.e. either considered or dropped (corresponding to the choice of weights from among 1 and 0).
It is definitely not typical for clustering to proceed explicitly with such operations on variables. Usually, such an operation is performed in the preprocessing phase, often even without explicit consideration of clustering as a possible next phase. Yet, in the framework of reverse clustering, this appears in some cases to be justified, especially as it may not be known where the partition PA comes from and what its relation is to the characterization of X.
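Weighing can be realised simply by rescaling each variable before any distance is computed; the binary selection is then the special case of 0/1 weights. A small sketch with hypothetical toy data:

```python
def apply_weights(objects, w):
    """Multiply each variable k of every object by its weight w[k] in [0, 1];
    w[k] = 0 drops the variable, w[k] = 1 keeps it unchanged."""
    return [tuple(x_k * w_k for x_k, w_k in zip(x, w)) for x in objects]

# Two objects close on variable 1 but far apart on a noisy variable 2.
X = [(1.0, 100.0), (1.1, -90.0), (5.0, 95.0)]
Xw = apply_weights(X, (1.0, 0.0))   # binary choice: drop variable 2
print(Xw)
```

Any distance-based procedure applied to Xw now groups the first two objects together, whereas on the unweighted data the noisy second variable would dominate and pair the first object with the third.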
Distance definitions.
It is well known that some of the clustering procedures depend to an extent, some-
times considerably, on the distance definitions used. This is absolutely clear for
the k-means family of algorithms, where squared Euclidean distance is virtually a
“must”, for formal reasons, although in some variations of this algorithm this is
no longer a strict requirement. Some definite implementations of specific algorithms
(e.g. from the hierarchical aggregation family) also work differently, depending upon
the distance definitions. The most important aspects in this regard are connected with the influence exerted by objects located far away from the other ones, the impact of increasing dimensionality on the significance of distance, and the differences in densities in various regions of E_x. In view of this influence, it was assumed in the exercises in reverse clustering illustrated in this book that a flexible distance definition be adopted, namely the general Minkowski distance:
h 1/ h
d xi , x j = xik − x jk ,
k
where for h = 1 we get the Manhattan (city-block) metric, and for h = 2 the Euclidean
metric. When h tends to infinity, the distance above approaches the Chebyshev
metric, according to which, simply,

d(x_i, x_j) = max_k |x_ik − x_jk|.
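The three metrics can be sketched in a few lines of Python (an illustrative sketch; the function names are ours, not from the book):

```python
# Sketch: the general Minkowski distance with exponent h, and its
# limiting case, the Chebyshev metric.
def minkowski(x, y, h):
    """d(x, y) = (sum_k |x_k - y_k|^h)^(1/h); h=1 Manhattan, h=2 Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def chebyshev(x, y):
    """Limit of the Minkowski distance as h tends to infinity."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))  # 7.0 (Manhattan)
print(minkowski(x, y, 2))  # 5.0 (Euclidean)
print(chebyshev(x, y))     # 4.0
```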
The search, realised in the space outlined in the previous section, is performed
with respect to the fundamental criterion of the difference/affinity between the
two partitions, i.e. partition PA, which is given, and P, which is produced by the
clustering procedure defined by Z, that is, the partition P = Z(X). Ultimately, for
the assessment of the clustering results, the classical Rand index (see Rand 1971)
was selected.5 The Rand index measures the similarity of two partitions, P1 and P2, of
a set of objects in the following, simplest and highly intuitive manner, based on
the categorisation of pairs of objects, which is illustrated in Table 1.2. Namely, we
consider two partitions, P1 and P2, and check, for each pair of objects from X (or I),
whether they are in the same cluster or in different clusters.
Of course, a + b + c + d = n(n−1)/2. We aim at a (pairs of objects in the same cluster
in both partitions) and d (pairs in different clusters in both partitions) being as high as
possible, with b and c as small as possible, according to the formula

Q(P1, P2) = (a + d) / (a + b + c + d).
Thus, if the two partitions are identical, then Q(P1, P2) = 1, while Q(P1, P2)
= 0 when they are "completely different" (actually, this occurs only in a single,
very specific case: that of two partitions of which one is constituted by a single,
all-embracing cluster, and the other is composed of all objects forming separate
singleton clusters).
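The pair-counting definition above translates directly into code (a minimal sketch, with partitions represented as cluster labels per object; not the authors' implementation):

```python
from itertools import combinations

# Sketch: the plain Rand index Q(P1, P2) = (a + d) / (a + b + c + d).
def rand_index(labels1, labels2):
    a = b = c = d = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1   # same cluster in both partitions
        elif same1:
            b += 1   # same in P1, different in P2
        elif same2:
            c += 1   # different in P1, same in P2
        else:
            d += 1   # different clusters in both partitions
    return (a + d) / (a + b + c + d)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 (identical partitions)
print(rand_index([0, 0, 0, 0], [0, 1, 2, 3]))  # 0.0 (the single degenerate case)
```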
In view of the probabilistic properties of the Rand index (its expected value for
two random partitions is not zero), its adjusted version (see Hubert and Arabie
1985), denoted Qa(.,.), is often used instead, accounting for the deviation of the mean
from the actual expected chance value. This adjusted Rand index is defined as:
Qa(P1, P2) = (a − Exp(a)) / (Max(a) − Exp(a)),

where Exp(a) is the expected value of a for random partitions, while the introduction
of Max(a) ensures that the maximum value of the respective measure equals 1. These two
values can be calculated for two partitions, of which one consists of p1 clusters,
having, respectively, n_11, n_12, …, n_1p1 elements (objects), while the other partition
is composed of p2 clusters, having, respectively, n_21, n_22, …, n_2p2 elements, in the
following manner (with C(m, 2) = m(m−1)/2 denoting the number of pairs among m objects):

Exp(a) = [ Σ_{q=1}^{p1} C(n_1q, 2) · Σ_{q=1}^{p2} C(n_2q, 2) ] / C(n, 2)

and

Max(a) = ½ [ Σ_{q=1}^{p1} C(n_1q, 2) + Σ_{q=1}^{p2} C(n_2q, 2) ].

5 Some more general remarks on this subject shall be forwarded in the next chapter, when discussing
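These cluster-size formulas admit a compact sketch in Python (an illustrative sketch assuming label vectors as input; not the authors' implementation):

```python
from collections import Counter
from math import comb

# Sketch: adjusted Rand index Qa = (a - Exp(a)) / (Max(a) - Exp(a)),
# built from the cluster sizes n_1q, n_2q as in the formulas above.
def pair_count(labels1, labels2):
    """a: number of pairs placed in the same cluster by both partitions."""
    joint = Counter(zip(labels1, labels2))
    return sum(comb(m, 2) for m in joint.values())

def adjusted_rand(labels1, labels2):
    n = len(labels1)
    s1 = sum(comb(m, 2) for m in Counter(labels1).values())  # sum of C(n_1q, 2)
    s2 = sum(comb(m, 2) for m in Counter(labels2).values())  # sum of C(n_2q, 2)
    exp_a = s1 * s2 / comb(n, 2)
    max_a = (s1 + s2) / 2
    return (pair_count(labels1, labels2) - exp_a) / (max_a - exp_a)

print(adjusted_rand([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
```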
Denœud and Guénoche (2006) suggested that for larger datasets this kind of
adjustment increases the discriminatory power of the Rand index. Therefore, in some
of the cases reported in this book, we use it as the similarity measure between
partitions. Likewise, in some calculations, definite penalty terms were introduced
to constrain the values of the elements of Z if the possibility arose of their
uncontrolled increase. Generally, however, the original Rand index was retained as
the main criterion of the search for PB, and it is virtually kept in all cases as the index of
quality of the solution, if not the actual optimisation criterion (in some cases boiling
down simply to the number of "wrongly classified" objects).
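The role of the partition-similarity criterion, with an optional penalty on the elements of Z, can be sketched as follows (a minimal sketch; `clusterer` and `similarity` are hypothetical stand-ins for the procedure Z(X) and the (adjusted) Rand index, and the toy functions below are ours, not from the book):

```python
# Sketch: fitness of a candidate parameter vector Z, composed of the
# similarity between the given partition PA and P = Z(X), minus an
# optional penalty on the magnitude of Z's elements.
def fitness(Z, X, PA, clusterer, similarity, penalty_weight=0.0):
    P = clusterer(X, Z)                            # P = Z(X)
    penalty = penalty_weight * sum(abs(z) for z in Z)
    return similarity(PA, P) - penalty

# Toy stand-ins: threshold the first variable at Z[0]; compare partitions
# by plain label agreement (in practice the Rand index would be used here).
toy_clusterer = lambda X, Z: [int(row[0] > Z[0]) for row in X]
toy_similarity = lambda p, q: sum(a == b for a, b in zip(p, q)) / len(p)

X = [[1.0], [2.0], [8.0], [9.0]]
PA = [0, 0, 1, 1]
print(fitness([5.0], X, PA, toy_clusterer, toy_similarity))  # 1.0
```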
Although this book is not devoted to the analysis of the numerical and computational
aspects of the reverse clustering approach (a definitely very important issue), in the
framework of presenting the gist and the interpretations of the paradigm we
shall briefly characterise the computational aspect as well.
Thus, in view of the expected very cumbersome search landscape and the highly complex
choice conditions ("constraints"), it was decided to use evolutionary algorithms
as the search tools. In the actual experiments two kinds of evolutionary algorithms were
used (see also a slightly ampler description in Sect. 4.2 of the book). The first of
them was developed by one of the authors of this book (see Stańczak 2003) and is
characterised by two-level adaptation, namely at the level of individuals, which
is standard for evolutionary algorithms, and also at the level of operators, which
are used in a highly flexible manner with respect to different individuals, depending