Studies in Big Data 34

Sławomir T. Wierzchoń
Mieczysław A. Kłopotek

Modern Algorithms of Cluster Analysis
Studies in Big Data

Volume 34

Series editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: kacprzyk@ibspan.waw.pl
The series “Studies in Big Data” (SBD) publishes new developments and advances
in the various areas of Big Data, quickly and with a high quality. The intent is to
cover the theory, research, development, and applications of Big Data, as embedded
in the fields of engineering, computer science, physics, economics and life sciences.
The books of the series refer to the analysis and understanding of large, complex,
and/or distributed data sets generated from recent digital sources coming from
sensors or other physical instruments as well as simulations, crowdsourcing, social
networks or other internet transactions, such as emails or video click streams, and
others. The series contains monographs, lecture notes and edited volumes in Big
Data spanning the areas of computational intelligence incl. neural networks,
evolutionary computation, soft computing, fuzzy systems, as well as artificial
intelligence, data mining, modern statistics and operations research, as well as
self-organizing systems. Of particular value to both the contributors and the
readership are the short publication timeframe and the world-wide distribution,
which enable both wide and rapid dissemination of research output.

More information about this series at http://www.springer.com/series/11970


Sławomir T. Wierzchoń · Mieczysław A. Kłopotek

Modern Algorithms of Cluster Analysis
Sławomir T. Wierzchoń
Institute of Computer Science
Polish Academy of Sciences
Warsaw, Poland

Mieczysław A. Kłopotek
Institute of Computer Science
Polish Academy of Sciences
Warsaw, Poland

ISSN 2197-6503                ISSN 2197-6511 (electronic)
Studies in Big Data
ISBN 978-3-319-69307-1        ISBN 978-3-319-69308-8 (eBook)
https://doi.org/10.1007/978-3-319-69308-8

Library of Congress Control Number: 2017955662

The material contained herein may not be reproduced or distributed, either in its entirety or in
fragments, by means of electronic, mechanical, copying, registering or other devices,
without the written consent of the Authors.

© Springer International Publishing AG 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... 9
2.1 Formalising the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . ... 13
2.2 Measures of Similarity/Dissimilarity . . . . . . . . . . . . . . . . . . . ... 16
2.2.1 Comparing the Objects Having Quantitative
Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Comparing the Objects Having Qualitative Features . . . . 26
2.3 Hierarchical Methods of Cluster Analysis . . . . . . . . . . . . . . . . . . 29
2.4 Partitional Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.1 Criteria of Grouping Based on Dissimilarity . . . . . . . . . 35
2.4.2 The Task of Cluster Analysis in Euclidean Space . . . . . 36
2.4.3 Grouping According to Cluster Volume . . . . . . . . . . . . . 45
2.4.4 Generalisations of the Task of Grouping . . . . . . . . . . . . 46
2.4.5 Relationship Between Partitional and Hierarchical
Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.5 Other Methods of Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . 48
2.5.1 Relational Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.5.2 Graph and Spectral Methods . . . . . . . . . . . . . . . . . . . . . 49
2.5.3 Relationship Between Clustering for Embedded
and Relational Data Representations . . . . . . . . . . . . . . . 50
2.5.4 Density-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . 51
2.5.5 Grid-Based Clustering Algorithms . . . . . . . . . . . . . . . . . 55
2.5.6 Model-Based Clustering . . . . . . . . . . . . . . . . . . . . . . . . 56
2.5.7 Potential (Kernel) Function Methods . . . . . . . . . . . . . . . 56
2.5.8 Cluster Ensembles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.6 Whether and When Grouping Is Difficult? . . . . . . . . . . . . . . . . . 64


3 Algorithms of Combinatorial Cluster Analysis . . . . . . . . . . . . . . . . . 67


3.1 k-means Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.1.1 The Batch Variant of the k-means Algorithm . . . . . . . . . 72
3.1.2 The Incremental Variant of the k-means Algorithm . . . . 72
3.1.3 Initialisation Methods for the k-means Algorithm . . . . . . 73
3.1.4 Enhancing the Efficiency of the k-means Algorithm . . . . 79
3.1.5 Variants of the k-means Algorithm . . . . . . . . . . . . . . . . 81
3.2 EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.3 FCM: Fuzzy c-means Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 100
3.3.1 Basic Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.3.2 Basic FCM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.3.3 Measures of Quality of Fuzzy Partition . . . . . . . . . . . . . 106
3.3.4 An Alternative Formulation . . . . . . . . . . . . . . . . . . . . . 110
3.3.5 Modifications of the FCM Algorithm . . . . . . . . . . . . . . 111
3.4 Affinity Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
3.5 Higher Dimensional Cluster “Centres” for k-means . . . . . . . . . . . 130
3.6 Clustering in Subspaces via k-means . . . . . . . . . . . . . . . . . . . . . 132
3.7 Clustering of Subsets—k-Bregman Bubble Clustering . . . . . . . . . 135
3.8 Projective Clustering with k-means . . . . . . . . . . . . . . . . . . . . . . 136
3.9 Random Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
3.10 Subsampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
3.11 Clustering Evolving Over Time . . . . . . . . . . . . . . . . . . . . . . . . . 142
3.11.1 Evolutionary Clustering . . . . . . . . . . . . . . . . . . . . . . . . 142
3.11.2 Streaming Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 143
3.11.3 Incremental Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 145
3.12 Co-clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
3.13 Tensor Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
3.14 Manifold Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
3.15 Semisupervised Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
3.15.1 Similarity-Adapting Methods . . . . . . . . . . . . . . . . . . . . 152
3.15.2 Search-Adapting Methods . . . . . . . . . . . . . . . . . . . . . . . 153
3.15.3 Target Variable Driven Methods . . . . . . . . . . . . . . . . . . 155
3.15.4 Weakened Classification Methods . . . . . . . . . . . . . . . . . 156
3.15.5 Information Spreading Algorithms . . . . . . . . . . . . . . . . . 157
3.15.6 Further Considerations . . . . . . . . . . . . . . . . . . . . . . . . . 159
3.15.7 Evolutionary Clustering . . . . . . . . . . . . . . . . . . . . . . . . 161
4 Cluster Quality Versus Choice of Parameters . . . . . . . . . . . . . . . . . . 163
4.1 Preparing the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4.2 Setting the Number of Clusters . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.2.1 Simple Heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.2.2 Methods Consisting in the Use of Information
Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

4.2.3 Clustergrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168


4.2.4 Minimal Spanning Trees . . . . . . . . . . . . . . . . . . . . . . . . 169
4.3 Partition Quality Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.4 Comparing Partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
4.4.1 Simple Methods of Comparing Partitions . . . . . . . . . . . . 175
4.4.2 Methods Measuring Common Parts of Partitions . . . . . . 176
4.4.3 Methods Using Mutual Information . . . . . . . . . . . . . . . . 177
4.5 Cover Quality Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
5 Spectral Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
5.2 Basic Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
5.2.1 Similarity Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
5.2.2 Graph Laplacian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
5.2.3 Eigenvalues and Eigenvectors of Graph Laplacian . . . . . 195
5.2.4 Variational Characterization of Eigenvalues . . . . . . . . . . 198
5.2.5 Random Walk on Graphs . . . . . . . . . . . . . . . . . . . . . . . 203
5.3 Spectral Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
5.3.1 Graph Bi-Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . 210
5.3.2 k-way Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
5.3.3 Isoperimetric Inequalities . . . . . . . . . . . . . . . . . . . . . . . 224
5.3.4 Clustering Using Random Walk . . . . . . . . . . . . . . . . . . 225
5.3.5 Total Variation Methods . . . . . . . . . . . . . . . . . . . . . . . . 229
5.3.6 Out of Sample Spectral Clustering . . . . . . . . . . . . . . . . 233
5.3.7 Incremental Spectral Clustering . . . . . . . . . . . . . . . . . . . 238
5.3.8 Nodal Sets and Nodal Domains . . . . . . . . . . . . . . . . . . 239
5.4 Local Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
5.4.1 The Nibble Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 242
5.4.2 The PageRank-Nibble Algorithm . . . . . . . . . . . . . . 244
5.5 Large Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
5.5.1 Using a Sampling Technique . . . . . . . . . . . . . . . . . . . . 248
5.5.2 Landmark-Based Spectral Clustering . . . . . . . . . . . . . . . 250
5.5.3 Randomized SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
5.5.4 Incomplete Cholesky Decomposition . . . . . . . . . . . . . . . 254
5.5.5 Compressive Spectral Clustering . . . . . . . . . . . . . . . . . . 256
6 Community Discovery and Identification in Empirical Graphs . . . . . 261
6.1 The Concept of the Community . . . . . . . . . . . . . . . . . . . . . . . . . 263
6.1.1 Local Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
6.1.2 Global Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
6.1.3 Node Similarity Based Definitions . . . . . . . . . . . . . . . . . 265
6.1.4 Probabilistic Labelling Based Definitions . . . . . . . . . . . . 265
6.2 Structure-based Similarity in Complex Networks . . . . . . . . . . . . 266
6.2.1 Local Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266

6.2.2 Global Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268


6.2.3 Quasi-Local Indices . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
6.3 Modularity—A Quality Measure of Division into
Communities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
6.3.1 Generalisations of the Concept of Modularity . . . . . . . . 274
6.3.2 Organised Modularity . . . . . . . . . . . . . . . . . . . . . . . . . . 275
6.3.3 Scaled Modularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
6.3.4 Community Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
6.4 Community Discovery in Undirected Graphs . . . . . . . . . . . . . . . 277
6.4.1 Clique Based Communities . . . . . . . . . . . . . . . . . . . . . . 277
6.4.2 Optimisation of Modularity . . . . . . . . . . . . . . . . . . . . . . 277
6.4.3 Greedy Algorithms Computing Modularity . . . . . . . . . . 278
6.4.4 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 280
6.4.5 Spectral Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
6.4.6 Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
6.5 Discovering Communities in Oriented Graphs . . . . . . . . . . . . . . 296
6.5.1 Newman Spectral Method . . . . . . . . . . . . . . . . . . . . . . . 297
6.5.2 Zhou/Huang/ Schölkopf Method . . . . . . . . . . . . . . . . . . 297
6.5.3 Other Random Walk Approaches . . . . . . . . . . . . . . . . . 298
6.6 Communities in Large Empirical Graphs . . . . . . . . . . . . . . . . . . 299
6.7 Heuristics and Metaheuristics Applied for Optimization
of Modularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
6.8 Overlapping Communities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
6.8.1 Detecting Overlapping Communities via Edge
Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
6.9 Quality of Communities Detection Algorithms . . . . . . . . . . . . . . 307
6.10 Communities in Multi-Layered Graphs . . . . . . . . . . . . . . . . . . . . 311
6.11 Software and Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
7 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
Appendix A: Justification of the FCM Algorithm. . . . . . . . . . . . . . . . . . . 319
Appendix B: Matrix Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Appendix C: Personalized PageRank Vector . . . . . . . . . . . . . . . . . . . . . . 339
Appendix D: Axiomatic Systems for Clustering . . . . . . . . . . . . . . . . . . . . 347
Appendix E: Justification for the k-means++ Algorithm . . . . . . . . . . . . . 381
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
Symbols

X              Set of objects (observations)
x              An element of the set X
X ∈ R^{m×n}    Matrix representing the set of objects, X = (x_1 ⋯ x_m)^T
m              Cardinality of the set X
n              Dimensionality (number of coordinates) of the vector of features of x ∈ X
C              Set of clusters
k              Number of clusters
diag(v)        Diagonal matrix having the main diagonal equal to v
G = (V, E)     Graph spanned over the set of vertices V linked with the edges indicated in the set E
N(v)           Set of neighbours of the vertex v in graph G
N_k(v)         Set of k nearest neighbours of the vertex v in graph G
∂(C)           Edge separator in graph G: ∂(C) = {{u, v} ∈ E : u ∈ C, v ∉ C}
A              Neighbourhood matrix having elements a_ij ∈ {0, 1}, A = [a_ij]_{m×m}
S              Similarity matrix having elements s_ij ∈ [0, 1], S = [s_ij]_{m×m}
m              Number of edges in graph G
d_i            Degree (weight) of the i-th object, d_i = Σ_{j=1}^{m} s_ij
D              Matrix of degrees, D = diag([d_1, …, d_m])
L              Combinatorial Laplacian, L = D − S
𝓛              Normalised Laplacian, 𝓛 = D^{−1/2} L D^{−1/2}
𝔏              A variant of the Laplacian, 𝔏 = I − P
P              Column-stochastic transition matrix, describing a walk on graph G, P = S D^{−1}
P̃              Row-stochastic transition matrix, describing a walk on graph G, P̃ = D^{−1} S
P̂              Column-stochastic transition matrix in a lazy walk on graph G, P̂ = (1/2)(I + S D^{−1})
π              Stationary distribution of the Markov chain having transition matrix P, that is, a vector such that π = Pπ
ρ(s, α)        Global PageRank with positive preference vector s and the coefficient of teleportation α: ρ(s, α) = αs + (1 − α)Pρ(s, α)
p(s, α)        Personalised PageRank vector with nonnegative preference vector s and the coefficient of teleportation α
e              Column vector having elements equal to 1
I              Unit matrix
List of Figures

Fig. 2.1 The task of cluster analysis: to break down the set of
observations into k disjoint subsets, composed of similar
elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 15
Fig. 2.2 Grouping of the set of randomly generated objects. a 100 points
randomly selected from the set [0, 1] × [0, 1]. b Division of the
objects into three groups using the k-means algorithm. Bigger
marks indicate the geometrical centers of groups . . . . . . . . . . . .. 15
Fig. 2.3 Unit circles for the selected values of the parameter p:
(a) p = 1, 2, ∞, (b) p = 2.5 (Hein’s super-ellipse) . . . . . . . . . . 20
Fig. 2.4 The influence of the parameter p on the properties of
Minkowski distance. a Average values of the difference
between the most distant points from a 100-point set in
dependence upon the number of dimensions, n, and the value
of the exponent, p. b Unit circles for p < 1 . . . . . . . . . . . . . 21
Fig. 2.5 The exemplary data set a and the corresponding dendrogram,
b obtained from the complete link method . . . . . . . . . . . . . . . . .. 32
Fig. 2.6 Voronoi tessellation, that is, the boundaries of clusters
determined by the gravity centres marked by dark dots . . . . . . .. 45
Fig. 2.7 Identification of clusters by examining mutual similarity
between objects: a a set of points. In illustrations like this we
follow the intuitive convention: The points are meant to lie in a
plane, the horizontal axis reflects the first coordinate and the
vertical one the second coordinate, b a graph obtained by
joining five nearest neighbors of each node . . . . . . . . . . . . . . . .. 49
Fig. 2.8 An example of clusters with irregular shapes and various
densities. The data set contains also outliers . . . . . . . . . . . . . . .. 51
Fig. 2.9 Diagram of k-distances between the points of an exemplary data
set X: a An exemplary data set composed of 90 points,
b increasingly ordered values of the average distance between
each of the 90 points of the data set X and their k = 4 nearest
neighbours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 53


Fig. 2.10 a The set 2spirals is composed of two spirals, situated one
inside the other. b Grouping produced by the k-means
algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 60
Fig. 2.11 Unordered (a) and ordered (b) concordance matrices, generated
with the use of the method, described in the text. Experiments
were carried out for k = 2, 3, 4, 5 groups . . . . . . . . . . . . . . 63
Fig. 2.12 Values of the cophenetic correlation coefficient for the
partitions obtained. The horizontal axis represents the number
of groups, the vertical axis shows the cophenetic correlation
coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 63
Fig. 2.13 Exemplary data having: a spherical and b non-spherical normal
distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 64
Fig. 3.1 Local optima in the problem of partitioning of the (one
dimensional) set X, consisting of 10 points. Figure to the left:
values of the indicator J (vertical axis) for the gravity centre of
one cluster placed at locations from the interval [3, 9]. Figure to
the right: a detailed look at these values when the centre
belongs to the set [5.4686, 5.7651] . . . . . . . . . . . . . . . . . . 70
Fig. 3.2 The influence of initialisation and scaling on the results of
clustering of a set of 450 observations. Initial location of cluster
centres is marked with red squares, and respective symbols
denote the assignment of objects to clusters. In case
a x_i1 ∈ {1, …, 450}, and in case b x_i1 ∈ 2.222185·10^3 +
2.034705·10^−10 · {1, …, 450}, i = 1, …, 450 . . . . . . . . . . . . 74
Fig. 3.3 The distribution of the gravity centres (denoted with squares) in
two methods of initialisation. (i)—initialisation described in
point (c), (ii)—initialisation described in point (d). In this last
case the coordinates were initially normalised in such a way that
x_ij ∈ [−1, 1], i = 1, …, m, j = 1, …, n, and afterwards—as
required by the method—each row of the matrix X was so
normalized that ‖x_i‖ = 1. The lines connecting the coordinate
system origin (0, 0) with gravity centres play here an
explanatory role only—they allow to imagine the angles
between the centres . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Fig. 3.4 Contour diagrams of the function: a min(a_1, a_2), and
b m_h(a_1, a_2). It was assumed in both cases that
a_1, a_2 ∈ (0, 400]. Horizontal axis: a_1 value, vertical axis: a_2
value, the contour lines were drawn for equidistant values . . . . 87
Fig. 3.5 Results of application of the kernel-based variant of the
k-means algorithm. Object-to-cluster allocation is indicated by
different colours. σ = 1 was assumed . . . . . . . . . . . . . . . . . 91
Fig. 3.6 Clustering of separable data sets with the k-medoids algorithm:
a the set data3_2, b the set data6_2. Cluster examples are
marked with red dots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 93

Fig. 3.7 Clustering of data with the use of the FCM algorithm: a Set
data3_2 (blue points) with added noise (black points).
b Assignments to groups generated by the FCM algorithm. The
bigger squares denote typical objects, smaller circles—the other
objects; the colour indicates the class membership. Same
convention applies to Figure (d) representing the results
obtained for the set data5_2 from Figure (c). α = 2 was
assumed in computations. Object x_i is treated as typical if
max_j u_ij ≥ 0.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Fig. 3.8 Influence of the value of fuzziness exponent α on the number of
iterations (picture to the left) and the quality of prototypes
(picture to the right), expressed in terms of the average distance
d_avg according to the Eq. (3.62). Solid line indicates the mean
value m of the respective quantity (averaged over 100 runs), and
the dotted lines mark the values m ± s, with s being the
standard deviation. ε = 10^−8 was assumed. Experiments were
performed for the data set iris.txt . . . . . . . . . . . . . . . . 105
Fig. 3.9 Influence of the α exponent (abscissa axis) on the reconstruction
error (ordinate axis). Test data: file iris.txt . . . . . . . . . . . . . . 108
Fig. 3.10 Test sets data_6_2 (Figure to the left) and data_4_2
(Figure to the right) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Fig. 3.11 Dependence of the values of quality criteria for the sets
data_6_2 (to the left) and data_4_2 (to the right) on the
assumed number of classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Fig. 3.12 Diagrams of the function (3.75) for three values of the
parameter p. The function is of the form f(x) = 0.6^α |4 − x|^p
+ 0.5^α |1 − x|^p + 0.8^α |3 − x|^p + 0.7^α |6 − x|^p, where x
stands for μ_jl, and α = 2.5 . . . . . . . . . . . . . . . . . . . . . . 113
Fig. 3.13 Comparison of partitions generated by the a FCM algorithm
and b GK algorithm for a synthetic data set consisting of three
clusters with different shapes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Fig. 4.1 Relation between an average cluster radius (left figure) and the
total distance of objects from the prototypes depending on the
number of clusters (right figure). . . . . . . . . . . . . . . . . . . . . . . . . . 167
Fig. 5.1 Compact and well separated clusters (a) vs. clusters
characterized by a similarity relationship (b) . . . . . . . . . . . . . . . . 182
Fig. 5.2 A simple graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Fig. 5.3 Graph from Fig. 5.2 with (arbitrarily) oriented edges . . . . . . . . . . 192
Fig. 5.4 Eigenvectors of the combinatorial Laplacian of the path graph
consisting of 22 nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Fig. 5.5 The number of zero crossings of: a the combinatorial and
b normalized Laplacian eigenvectors for the jazz musician
network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
Fig. 5.6 An application of the Fiedler vector to visualization of the
structure of a dataset: a an exemplary similarity matrix, b the
same matrix after its rows and columns have been reordered
according to the order obtained by sorting the entries of the
Fiedler vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
Fig. 5.7 First 20 eigenvalues of the Laplacian characterizing: linearly
separable clusters (left panel), and moon-shaped clusters that
are not linearly separable (right panel). In the first case Gaussian
kernel has been used with σ = 0.63, and in the second case the
K-NN graph has been constructed with K = 10 and σ = 1 . . . . . 213
Fig. 5.8 Exemplary graph embedding: a original graph and b its
embedding into the spectral space spanned by the eigenvectors
(ψ_2, ψ_3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
Fig. 5.9 The NJW algorithm in action. Left column: data in spectral
coordinates determined by the eigenvectors Uk . Right column:
the same data in the coordinates described by the matrix Y.
Panel (a) corresponds to the data consisting of three linearly
separable groups, and panel (c) represents iris data set . . . . . . 222
Fig. 5.10 Examples of data with clear structure, (a) and (c), and similarity
matrices representing these data, (b) and (d) . . . . . . . . . . . . . . . . 223
Fig. 5.11 An example of unbalanced clusters: a Original data with three
clusters of size 2000 (red), 1000 (green), and 100 (blue).
b Clustering obtained by the k-means algorithm, and
c clustering returned by spectral clustering algorithm . . . . . . . . . . 224
Fig. 5.12 Eigenvalues of the 60 × 60 grid graph . . . . . . . . . . . . . . . 239
Fig. 5.13 Sign pattern of the eigenvectors ψ_2, ψ_3, ψ_4 of the 4-by-4
grid graph. Circle (square) denotes positive (negative)
sign of the eigenvector at particular node. Note that
here λ_2 = λ_3 = 0.58579 . . . . . . . . . . . . . . . . . . . . . . . 240
Fig. 5.14 Partition of the set of nodes of karate network into two
groups. Black square denotes the starting node used to compute
the personalized PageRank. Grey squares denote natural
neighbors of the black square . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Fig. 5.15 The values of the ordering function q (left panel), and the values
of conductance (right panel) computed for the karate
network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Fig. 6.1 Result of 10-fold application of the NJW algorithm for a
neighbourhood matrix describing relations between jazz
musicians (Jazz network). The maximal value of Q = 0.438591
is obtained each time when splitting into k = 4 groups . . . . . . 284
Fig. 6.2 Distribution of the values of the modularity Q as a function
of the number of clusters for the sets C.elegans (to the left) and
e-mail (to the right) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
Fig. 6.3 Fiedler vector values for a graph representing friendships
among members of a karate club [525] . . . . . . . . . . . . . . . . . . . . 286
Fig. 7.1 Test data sets from the Web site http://www.isical.ac.in/
~sanghami/data.html: a data3_2, b data5_2,
c data6_2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
Fig. 7.2 Test data sets: a 2rings, b 3rings, c 2spirals . . . . . . . 317
Fig. D.1 A mixture of 5 normal distributions as clustered by k-means
algorithm (Voronoi diagram superimposed) . . . . . . . . . . . . . . . . . 360
Fig. D.2 Data from Fig. D.1 after Kleinberg’s C-transformation clustered
by k-means algorithm into two groups . . . . . . . . . . . . . . . . . . . . . 360
Fig. D.3 Data from Fig. D.1 after a corrected C-transformation, clustered
by k-means algorithm into 5 groups . . . . . . . . . . . . . . . . . . . . . . . 365
Fig. D.4 Impact of contraction towards cluster centre by a
factor lambda—local optimum maintained . . . . . . . . . . . . . . . . . . 366
List of Tables

Table 2.1 Bregman divergences generated by various convex functions
[50] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Table 3.1 Table to the left: Parameters of Gaussian distributions used to
generate samples belonging to the three groups. Table to the
right: Quality indicators of the partitions generated by the
algorithms FCM and GK (Gustafson-Kessel) . . . . . . . . . . . . . . 116
Table D.1 Variance explained (in percent) when applying k-means
algorithm with k = 2, …, 6 to data from Figs. D.1 (Original),
D.2 (Kleinberg) and D.3 (Centric) . . . . . . . . . . . . . . . . . . . . . . . 360
Table D.2 Data points to be clustered using a ridiculous clustering
quality function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
Table D.3 Partition of the best quality (the lower the value the better)
after including n first points from Table D.2 . . . . . . . . . . . . . . . 363

List of Algorithms

Algorithm 2.1 Algorithm of agglomerative cluster analysis . . . . . . . . . . .. 29


Algorithm 2.2 “Robust” Single Link . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 34
Algorithm 2.3 Iterative algorithm of cluster analysis (generalised
Lloyd’s heuristics) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Algorithm 2.4 DBSCAN algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Algorithm 2.5 Anchor grouping algorithm, [361] . . . . . . . . . . . . . . . . . . . 54
Algorithm 3.1 Incremental clustering algorithm BIMSEC . . . . . . . . . . . . . 73
Algorithm 3.2 k-means++ algorithm, [32] . . . . . . . . . . . . . . . . . . . . . . . . . 77
Algorithm 3.3 On line version of the k-means algorithm. . . . . . . . . . . . . . 81
Algorithm 3.4 Bisectional k-means algorithm . . . . . . . . . . . . . . . . . . . . . . 83
Algorithm 3.5 Spherical k-means algorithm. . . . . . . . . . . . . . . . . . . . . . . . 85
Algorithm 3.6 Kernel-based k-means algorithm . . . . . . . . . . . . . . . . . . . . . 90
Algorithm 3.7 k-medoids algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Algorithm 3.8 k-modes algorithm, [263] . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Algorithm 3.9 EM algorithm for cluster analysis. . . . . . . . . . . . . . . . . . . . 99
Algorithm 3.10 Affinity propagation, [191] . . . . . . . . . . . . . . . . . . . . . . . . . 130
Algorithm 3.11 k-means# algorithm, [11] . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Algorithm 3.12 Streaming divide-and-conquer clustering, version
after [11] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Algorithm 5.1 Repeated bisection spectral clustering [427] . . . . . . . . . . . . 218
Algorithm 5.2 A k-way partitional spectral clustering . . . . . . . . . . . . . . . . 218
Algorithm 5.3 Pivoted QR decomposition used to extract clusters from
the matrix Y [529] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Algorithm 5.4 Spectral clustering proposed by Ng et al. in [374] . . . . . . . 220
Algorithm 5.5 Spectral clustering algorithm MNCut [352] . . . . . . . . . . . . 226
Algorithm 5.6 k-way kernel spectral algorithm [17] . . . . . . . . . . . . . . . . . 238
Algorithm 5.7 Nibble—algorithm for local identification
of clusters [440] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
Algorithm 5.8 Landmark-based Spectral Clustering [111] . . . . . . . . . . . . . 252
Algorithm 5.9 Randomized SVD [226] . . . . . . . . . . . . . . . . . . . . . . . . . . . 254

Algorithm 5.10 Approximating eigenvectors of a normalized similarity
matrix via incomplete Cholesky decomposition [316] . . . . 255
Algorithm 5.11 Compressive spectral clustering [465] . . . . . . . . . . . . . . . . 259
Algorithm 6.1 Newman [366] algorithm of community detection . . . . . . . 278
Algorithm 6.2 Louvain [77] algorithm of community detection . . . . . . . . 279
Algorithm 6.3 Bisectional spectral algorithm of community
detection [370] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Algorithm B.1 Power method returning the principal eigenpair
of the matrix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
Algorithm B.2 Applying power method to determine
the first p ≤ m eigenvectors of the Laplacian . . . . . . . . . . 332
Algorithm C.1 Algorithm of determination of the approximation
to the personalised PageRank vector on the basis
of Eq. (C.5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Algorithm C.2 Fast algorithm of determining an ε-approximation of the
personalised PageRank vector [118] . . . . . . . . . . . . . . . . . . 344
Chapter 1
Introduction

Abstract This chapter characterises the scope of this book. It explains the reasons
why one should be interested in cluster analysis, lists major application areas, basic
theoretical and practical problems, and highlights open research issues and challenges
faced by people applying clustering methods.

The methods of data analysis can be roughly classified into two categories:1 (a)
the descriptive (exploratory) ones, which are recommended when, having no initial
models or hypotheses, we try to understand the general nature and structure of the
high-dimensional data, and (b) the confirmatory (inferential) methods, applied in
order to confirm the correctness of the model or of the working hypotheses
concerning the data collected. In this context, a particular role is played by various statistical
methods, such as, for instance, analysis of variance, linear regression, discriminant
analysis, multidimensional scaling, factor analysis, or, finally, the subject of our
considerations here—cluster analysis2 [237, 303].
It is really difficult to list all the domains of theoretical and practical applications
of cluster analysis. Jain [272] mentions three basic domains: image segmentation
[273], information retrieval [347], and bio-informatics [46].
The task of cluster analysis or grouping is to divide the set of objects into
homogeneous groups: two arbitrary objects belonging to the same group are more similar
to each other than two arbitrary objects belonging to different groups. If we wish to
apply this recipe in practice, we must find the answers to two basic questions: (a)
how to define the similarity between the objects, and (b) in what manner should one
make use of the thus defined similarity in the process of grouping. The fact that the
answers to these questions are provided independently of one another results in
the multiplicity of algorithms.
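
The practical weight of question (a) is easy to demonstrate: under different
dissimilarity measures the same pair of objects may appear close or distant. The
following short illustration (ours, not the book's; a minimal sketch in Python with
NumPy) contrasts the Euclidean distance with a cosine-based dissimilarity of the
kind surveyed in Sect. 2.2:

    import numpy as np

    def euclidean(x, y):
        # Straight-line distance between the feature vectors x and y.
        return np.linalg.norm(x - y)

    def cosine_dissimilarity(x, y):
        # 1 - cos(angle between x and y); insensitive to vector lengths.
        return 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

    x, y = np.array([1.0, 1.0]), np.array([10.0, 10.0])
    print(euclidean(x, y))             # about 12.73: far apart in Euclidean terms
    print(cosine_dissimilarity(x, y))  # 0.0: same direction, hence maximally similar

A grouping method fed with the first measure may separate x and y, while one fed
with the second will keep them together, so the choice made in (a) directly shapes
the outcome of (b).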
A number of natural processes can give rise to clusters in the data [195]. One
may have collected data from multiple experiments on the same or intersecting sets of
objects, or have obtained longitudinal data (same objects measured at different time
1 See, e.g., J. W. Tukey, Exploratory Data Analysis. Addison-Wesley, 1977.

2 This area is also referred to as Q-analysis, typology, grouping, partitioning, clumping, and
taxonomy, [273], depending upon the domain of application.

points), data may have been collected at different experimental sites, or data may stem
from subjects (human or animal) related to one another, like in families, etc. Seasonality,
climate, geography and measurement equipment manufacturer are further possible
hidden variables inducing cluster formation among objects of the study. If one is
unaware of such subtleties of the data collection process (no explicit information
is available) but one is suspicious of such a situation, then one is advised to apply
a clustering method prior to further analysis, because the existence of clusters may
impact the results of further analyses: an intra-cluster correlation may exist, so that
the frequent independence assumptions are violated. For example, if one intends to
perform regression analysis on the data, and there exist clusters, then one is advised
to add a factor for the cluster effect into the standard regression model (e.g. ANOVA
analysis). A number of statistical tests, like the χ2 test, the t-test, or the Wilcoxon test,
have also been adjusted for clusters. In image analysis, clustering may be performed
prior to applying more advanced methods, like object recognition.
Division of data into clusters may also by itself be a subject of interest.
An interesting example of application of cluster analysis in modern information
retrieval is constituted by the CLUSTY search engine (clusty.com). The methods
of cluster analysis play an important role in the development of recommender
systems [258, 416], and in various kinds of economic, medical and other analyses. The
application of these methods stems from at least three reasons:
(a) To gain insight into the nature of the data, and, first of all, to indicate the typical
and the atypical (outlying) data items, to uncover the potential anomalies, to find
hidden features, or, finally—to be able to formulate and verify the hypotheses
concerning the relations between the observations;
(b) To obtain a compact description of data, to select the most representative objects;
here, the classical application is image compression;
(c) To obtain a natural classification of data, for instance—by determining the
similarities between pairs of objects to establish the respective hierarchical structures.
Methods of cluster analysis are applied wherever we wish to understand
the nature of the phenomenon represented by the set of observations, to get a sort
of summary of the content of large data sets, and to process such sets effectively.
Jain and Dubes [273] list the following challenges, linked with the practical use
of clustering:

(a) What is a cluster (group, module)?


(b) What features ought to be used to analyse the data collected?
(c) Should the data be normalised?
(d) Are there outliers in the analysed dataset, and if so—how should they be treated?
(e) How to define the similarity between the pairs of objects?
(f) How many clusters are really there in the data set? Do we deal, at all, with
clusters?
(g) What method should be used in a given situation?
(h) Is the obtained partition of data justified?

Some of these questions find at least partial answers in a multiplicity of books
devoted to various aspects of cluster analysis, like [12, 19, 70, 164, 273], or [173].
Recent years, though, have brought a number of new challenges. First, cluster
analysis is nowadays being applied to the processing of huge sets of data, which makes
it necessary to develop specialised algorithms. Second, the necessity of analysing
data sets having complex topologies (data being situated in certain submanifolds)
leads to the requirement of applying more refined tools. We mean here the spectral
methods, the methods making use of kernel functions, and the relations between
these two groups of methods.
The starting point for these methods is the matrix describing the
strength of interrelations between the objects making up the data set. In the case of the
kernel-based methods, considerations are transferred to a high-dimensional
and nonlinear space of features for the purpose of strengthening the separability
of the data. Thanks to the so-called kernel trick, the knowledge of a certain kernel
function suffices to determine the distance between objects in this new space.
The elements of the matrix mentioned before are the values of the kernel function,
calculated for the respective pairs of objects. On the other hand, in the case of spectral
methods, the elements of the matrix correspond to the values of similarity of the pairs
of objects. Eigenvalues and eigenvectors of this matrix, or of a matrix being a certain
transformation of it (usually a form of the Laplacian of the graph corresponding to the
similarity matrix), provide important information on relations between objects.
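
To make the preceding description concrete, the sketch below (ours; it assumes the
popular Gaussian kernel s_ij = exp(−‖x_i − x_j‖²/(2σ²)) as the similarity, one of many
possible choices) computes such a similarity matrix together with the combinatorial
Laplacian L = D − S and its normalised form listed among the symbols of this book:

    import numpy as np

    def gaussian_similarity(X, sigma=1.0):
        # Pairwise squared Euclidean distances between the rows of X.
        sq = np.sum(X**2, axis=1)
        d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
        # Gaussian (RBF) kernel: s_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
        return np.exp(-d2 / (2.0 * sigma**2))

    def laplacians(S):
        d = S.sum(axis=1)                        # degrees d_i = sum_j s_ij
        L = np.diag(d) - S                       # combinatorial Laplacian L = D - S
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L_norm = D_inv_sqrt @ L @ D_inv_sqrt     # normalised Laplacian D^{-1/2} L D^{-1/2}
        return L, L_norm

    X = np.random.rand(100, 2)                   # 100 random points in the unit square
    S = gaussian_similarity(X, sigma=0.5)
    L, L_norm = laplacians(S)

The eigenvectors of such Laplacians associated with the smallest eigenvalues are
central objects of the spectral methods studied in Chap. 5.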
An interesting example is constituted by the problem of ordering the documents
collected by the crawlers cooperating with a search engine. We deal in such cases
with enormous collections of objects (of the order of 10^9), with the issue of their
effective representation, and, finally, with the task of indicating the important
documents. The common sense postulate that the important websites (documents) are
the ones that are referred to (linked to) by other important websites allows for
reformulating the problem of assigning ranks in the ordering into the problem of
determining the dominating eigenvector of the stochastic matrix P, being a slight
modification of the original matrix A representing connections between the websites,
that is, a_ij = 1 when the i-th website contains a link to the j-th one. The vector r,
which satisfies the matrix equation r = P^T r, is called the PageRank vector [379].
Both this vector and its diverse variants (TotalRank, HillTop, TrustRank, GeneRank,
IsoRank, etc.) are examples of so-called spectral ranking, which finds application
not only in the ordering of documents (or, more generally, in various aspects
of bibliometrics), but also in graph theory (and, consequently, in the analysis of
social networks), in bio-informatics, and so on.
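
In practice the dominating eigenvector of such a matrix is found iteratively, for
instance by the power method treated in Appendix B. The sketch below is only our
illustration: the damping factor α = 0.85 and the uniform treatment of pages without
outgoing links are conventional choices, not prescriptions of the text.

    import numpy as np

    def pagerank(A, alpha=0.85, tol=1e-10):
        # A: link matrix with a_ij = 1 when the i-th page links to the j-th one.
        m = A.shape[0]
        out = A.sum(axis=1)
        # Row-stochastic modification of A; dangling rows are spread uniformly.
        P = np.where(out[:, None] > 0, A / np.maximum(out, 1.0)[:, None], 1.0 / m)
        r = np.full(m, 1.0 / m)
        while True:
            # One power-method step with teleportation (damping).
            r_new = alpha * (P.T @ r) + (1.0 - alpha) / m
            if np.abs(r_new - r).sum() < tol:
                return r_new
            r = r_new

    A = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [0, 1, 0]], dtype=float)
    print(pagerank(A))                           # the ranks sum to 1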
A development over the latter idea is constituted by the methods based on random
walk on graphs. Namely, the task of grouping is here considered as an attempt of
finding such a division of the graph that a randomly moving wanderer shall remain for
a long time in a single cluster (subgraph), only rarely jumping between the clusters.
An interesting work on this subject is presented in, e.g., [211], while a number of
theoretical results have been shown by Chung [23, 115, 116], and by Meila [352].
The transformation of the matrices, representing relations (e.g. similarity) between
the elements of the set of objects, into the stochastic Markovian matrices leads to
yet another interesting formalism. As we treat the eigenvectors of these matrices
as forming a new system of coordinates, we transform the multidimensional data
into a cloud of points in a space with a much lower number of dimensions. This
new representation reflects the essential properties of the original multidimensional
structure. In this manner we enter the domain of the so-called diffusion maps [125],
which have two significant features, distinguishing them from the classical methods:
they are nonlinear, and, besides this, they do faithfully map the local structure of data.
From the point of view of, for instance, processing of text documents, an essential
advantage of such an approach is a relatively straightforward adaptation to the semi-
supervised clustering problems and the coclustering problems—the important, quite
recently developing, directions of machine learning.
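
As a rough illustration of this idea, diffusion coordinates can be obtained from
the leading nontrivial right eigenvectors of the row-stochastic matrix D^{−1}S, scaled
by powers of the corresponding eigenvalues. The details (kernel choice, the power t,
normalisation) differ between formulations in the literature, so the snippet below is
only a schematic sketch of ours:

    import numpy as np

    def diffusion_embedding(S, n_coords=2, t=1):
        # Row-stochastic transition matrix of a random walk on the similarity graph.
        P = S / S.sum(axis=1)[:, None]
        # P is similar to the symmetric matrix D^{-1/2} S D^{-1/2}, so its
        # eigenvalues are real; np.linalg.eig returns right eigenvectors.
        vals, vecs = np.linalg.eig(P)
        order = np.argsort(-vals.real)
        # Skip the trivial constant eigenvector (eigenvalue 1) and scale the next
        # ones by lambda^t to obtain the diffusion coordinates.
        idx = order[1:n_coords + 1]
        return vecs.real[:, idx] * (vals.real[idx] ** t)

    # S may be any symmetric nonnegative similarity matrix, e.g. a Gaussian
    # kernel matrix as in the earlier snippet.
    rng = np.random.default_rng(0)
    X = rng.random((50, 2))
    sq = (X**2).sum(axis=1)
    S = np.exp(-np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0))
    Y = diffusion_embedding(S, n_coords=2)       # 50 points in 2 diffusion coordinates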
These most up-to-date trends in grouping have not been, until now, systematically
presented nor considered in an integrated manner against the background of the
classical methods of cluster analysis. Such a survey would be very welcome indeed from
the point of view of students with an interest in the techniques of knowledge
extraction from data and computer-based modelling, in the context of their application in
pattern recognition, signal analysis, pattern and interdependence identification, and
establishment of other interesting characteristics in the large and very large
collections of data, especially those put together automatically in real time.
That is why we prepared this monograph, which:
(a) Contains the survey of the basic algorithms of cluster analysis, along with
presentation of their diverse modifications;
(b) Makes apparent the issue of evaluation and proper selection of the number of
clusters;
(c) Considers the specialisation of selected algorithms regarding the processing of
enormous data sets;
(d) Presents in a unified form a homogeneous material, constituting the basis for
developing new algorithms of spectral and diffusion data analysis;
(e) Comments upon the selected solutions for the semi-supervised learning;
(f) Provides a rich bibliography, which might also be a starting point to various
individual research undertakings.
This book has been divided into 7 chapters. Chapter 2 outlines the major steps
of cluster analysis. Chapters 3 and 5 are devoted to two distinct brands of clustering
algorithms. While Chap. 3 focusses on clustering of objects that are embedded in
some metric space, Chap. 5 concentrates on algorithms abandoning the embedding space
and concentrating on the similarities between objects alone. The former are referred
to as combinatorial algorithms, the latter are called spectral algorithms. Chapter 4
addresses the issues of parameter tuning for clustering algorithms and the assessment
of the resultant partitions. Chapter 6 discusses a specific, yet quite important
application area of cluster analysis, that is, community detection in large empirical graphs.
Finally, in Chap. 7 we present the data sets used throughout this book in illustrative
examples.
From the point of view of knowledge discovery task the clustering can be split
into the following stages:

• problem formulation,
• collecting data,
• choice and application of a clustering algorithm,
• evaluation of its results.
While the problem of clustering seems to be obvious in general terms, its
formalisation is not as obvious in a concrete case. It requires first a formal framework for the
data to be learned from, as outlined in Sect. 2.1. Furthermore, the user is requested
to define similarity between the objects (the geometry of the sample space) as well
as a cost function (a measure of value of the obtained clusters). The definition of
the geometry must be application-specific, tuned to the specific needs of the user.
Section 2.2 provides guidance in this process, listing many possibilities of
measuring similarity both in numerical and non-numerical data and explaining some
special properties of these measures. Section 2.4.1 provides a set of possible
cost functions to be used in the clustering process. Note, however, that each concrete
algorithm targets some specific cost function.
Advice on the choice of data for cluster analysis can be found in Sect. 4.1.
To enable the choice of an appropriate clustering algorithm, Sect. 2.3 provides
an overview of hierarchical clustering methods, Sect. 2.4 presents partitional
(also called combinatorial) clustering methods, and Sect. 2.5 hints at other categories of
clustering procedures like relational, spectral, density, kernel and ensemble methods.
Some of these types of algorithms are treated more deeply in Chaps. 3, 5 and 6,
as already mentioned.
The results of the clustering may be looked at from two perspectives: whether or
not the choice of parameters of the algorithm was appropriate (see Sect. 4.2 for the
choice of the number of clusters), and whether the results meet the quality expectations
(Sect. 4.3ff). These quality expectations may encompass:
• the cost function reached a sufficiently low value,
• the homogeneity of clusters and their separation are sufficient (aesthetically or
from a business point of view),
• the geometry of the clusters corresponds to what the algorithm is expected to
provide,
• the clusters correspond to some external criteria, like manually clustered subsets,
etc.
User’s intuition and/or expectations with respect to the geometry of clusters are
essential for the choice of the clustering algorithm. One should therefore always
inspect the description of the respective algorithm to see whether it can properly
respond to these expectations. In fact, the abundance of clustering algorithms is a
derivative of the fact that these expectations may differ, depending on the actual
application.
So if the geometry of the data makes them suitable for representation in a feature
space, then visit first the methods described in Chap. 3. If they are conveniently
represented as a graph, then either Chap. 5 or Chap. 6 is worth visiting.
If you work in a feature space, you may expect that ideally your clusters should form
clearly separated (hyper)balls. In this case the k-means algorithm (Sect. 3.1) may be
most appropriate. If the shape should turn out to be more ellipsoidal, see Sect. 3.3.5.4.
Should the clusters take forms like line segments, circles etc., visit Sect. 3.3.5.3. If you
need to cluster text documents in vector space representation (which lie essentially
on a sphere in this representation), then you may prefer the spherical k-means
algorithm (Sect. 3.1.5.3). If you do not expect any particular shape but rather oddly
shaped clouds of points of constant density, you may prefer the DBSCAN or
HDBSCAN algorithm (Sect. 2.5.4). If you expect that the border line between the
clusters is not expressible in simple terms, you may try kernel methods (Sects. 2.5.7
and 3.3.5.6). Last but not least, you may seek a sharp split of objects into clusters or a
fuzzier one, using e.g. the Fuzzy C-Means algorithm.
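
For the first of these options, the k-means algorithm, a minimal sketch of Lloyd's
heuristic may convey the flavour. It is our illustration only, with plain random
initialisation, whereas Sect. 3.1.3 discusses better schemes such as k-means++:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Plain random initialisation by sampling k distinct objects as centres.
        centres = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assignment step: each object goes to its nearest centre.
            d2 = ((X[:, None, :] - centres[None, :, :])**2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Update step: each centre moves to the gravity centre of its cluster.
            new_centres = np.array([X[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centres[j]
                                    for j in range(k)])
            if np.allclose(new_centres, centres):
                break
            centres = new_centres
        return labels, centres

    X = np.random.rand(300, 2)                   # 300 random points in the unit square
    labels, centres = kmeans(X, k=3)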
There also exist multiple geometry choices for graph-based data representations.
You may seek dense subgraphs, subgraphs in which a random walker would be trapped,
subgraphs separable by removal of a low number of edges, or subgraphs whose
connections evidently differ from random ones.
The main aim of this book is to provide a basic understanding of the formal
concepts of the cluster, clustering, partition, cluster analysis etc. Such an
understanding is necessary in the epoch of Big Data, because due to the data volume and
characteristics it is no longer feasible to rely solely on viewing the data when
facing the clustering problem. Automated processing to the largest possible extent is
required, and it can only be achieved if the concepts are defined in a clear-cut way. But
this is not the only contribution of this book to solving problems in the realm of
Big Data. Big Data is usually understood as large amounts of data in terms of overall
volume measured in (peta)bytes, a large number of objects described by the data,
and lengthy object descriptions in a structured, unstructured or semi-structured way.
Therefore a user interested in Big Data issues may find it helpful to get acquainted
with various measures of object similarity as described in Sect. 2.2, where measures
based on quantitative features (like numerical measurement results), on qualitative
features (like text), as well as on their mixtures are described. As Big Data encompass
hypertext data and other linked objects, graph-based similarity measures described
in Sects. 6.2 and 6.4.4 may also be of interest. For the discussion of other complex
structures in the data, like multilayered graphs, see Sect. 6.10. Various ways in which
such similarity measures can be exploited when defining clustering cost functions
are described in Sect. 2.4.1. Alternatively, they can be used in spectral analysis of
graphs, described in Chap. 5.
Last but not least, we present a number of approaches to handling a large number of
objects in reasonable time. The diversity of such algorithms ranges from so-called
grid-based methods (see Sect. 2.5.5), methods based on sampling (see Sect. 3.10),
parallelisation via map-reduce, and usage of tree structures (see Sect. 3.1.4), to various
heuristic approaches like those used for community detection (see Sect. 6.7). For
particular algorithms we also provide references to publications on variants devoted to
huge data sets. Note that in some cases, like spectral clustering, accommodation to
big data sets requires some new theoretical developments, enabling for example
sampling linked data in a reasonable way; see Sect. 5.5.
31 made of wine, Unit 69
of printer’s
water, and lemon- measure
juice, sweetened Poison
70
Employ
33 To invade
71 suddenly
Made 35of oak Desist
73
Black36 Medicinal
74 herb
Practical
38 application of
knowledge (plural)
Without
39 life

VERTICAL

God of2 midday sun Consumed


38
Aside3from a main Pertaining
40 to stars
track Guide
41
Complete
4 dress Beverage
42
Vast 5 Possessive
43 pronoun
Proviso
7 Part 45
of “be”
Put to8death Grieve46
Doctrine
9 or system African
47 river
Depart
10 Take48
A large
11 serpent of Common50 metal
America Host51
Mourn
14 Sharp 52point
Back16of the neck Element
54 of poetry
Familiar
18 pronoun Anointed
56
Lateral
19 Record58 of daily events
Elegant
20 Slender
61 plant
Male22servant Prejudice
63
Inquires
25 Resinous
66 substance
Total27 It is 68
Consume
30 Happen70
Therefore
32 Execute
72
Shallow
34
Else35
Surface
37 of fibres

[33]

[Contents]
Puzzle No. 5
WELL BALANCED
By Mr. and Mrs. H. C. Waller

This is a nice study in words of medium length, as


was dictated by the two-ply “X” construction. The
solver will observe that the central interlock is of the
good, familiar “L” joint sort.
1 2 3 4 5 6 7
8 9 10 11 12
13 14 15 16 17 18
19 20 21 22 23
24 25 26 27
28 29 30 31
32 33 34 35
36 37
38 39 40 41
42 43 44 45 46
47 48 49 50 51
52 53 54 55
56 57 58 59 60
61 62 63
64 65

[32]

HORIZONTAL
Tumult1 Gone 37
Part of
5 harness Song38
Thin metal
9 strip Organ40 pipes
Seldom
11 Small42hollows
Preposition
13 Organ43 of sight
Small15drink The 45
other
Expression
17 Recent
47
Behold
18 Drudge
48
Salt 19
(chem.) Pertaining
50 to the Celts
Proposition
21 to be Time52
proved Pertaining
53 to earth
Fabulous
23 bird disturbances
Sustain
24 Parent
55
Not of
26the city Pronoun
56
Spanish
27 dry wine A luminary
57
Burlesque
28 Dull 58
Work30with thin steel Concerning
60
instrument Ire 61
Engrave
31 Issue62
Old-womanish
32 Large64book
Miner’s
34 hand-cart Indivisible
65 particle of
Birth36 matter

VERTICAL

Part of
2 “to be” Fungus
31
Long 3practiced Constellation
33
Keen 4 Cereal
35
Injury5 Quote
38
Bodily6member American
39 Pioneer
Pronoun
7 Scars
40
File 8 Carnelian
41
Wild10goat Sheet,
42 usually paper
Dance11 Opposed
44 to No
Counterfeit
12 Sovereign
46
Sailors
14 Oil-burning
47 vessel
World16war battle Slave48
Net 17 Door49of Masonic
Body18of water Lodge
(Scotch) Paradise
51
Division
20 of Hindu Sound53
philosophy Insensibility
54
Metal22as mined Silent
57
Piquant
23 Humor59
Feather-shaped
25 Stop61
First27
of two stanzas Preposition
63
Rows 29

[35]

[Contents]
Puzzle No. 6
BABY GRAND MODEL
By Helen V. Christ

Only two long words. This should be pie for the hard-
boiled solvers and a pleasant exercise for the
beginner.
1 2 3 4 5 6 7
8 9 10
11 12 13 14
15 16 17 18
19 20 21
22
23 24
25 26 27 28 29
30 31 32 33
34 35
36 37

[34]

HORIZONTAL

To praise
1 A ram 23or male sheep
A storm
5 Girl’s24name
Part of
8 mouth (Pl.) Possessive
25 pronoun
A conjunction
11 A spasm
26
Coarse
13 material A pen28for swine
Regarding
14 The 30
sixth tone of the
Pin 15 diatonic scale
The 17
exertion of power Cardium
31
To spread
18 loosely for To such
33 a degree
drying, as newly Rabbits
34
mown grass Fatty36
tissues
To gain
19 Bird 37
The 21
young of various
carnivora
Put through
22
evolutions; intrigued

VERTICAL

Closed 1 curve Becomes


18 sour, or
Above2 spoiled, as milk; takes
Papa 3 a new direction
Inhabitants
4 of a north- That20
which is first or
east division of China highest in rank or
Precious
5 stone degree
So 6 Vegetable
21
Increased
7 till barely Ailments;
25 worries
sufficient Swamp
26
A grassy
9 field Prefix;
27 three; three
One10 of various small times; thrice
birds To bring
29 into bondage;
A boat
12 race, or a enslave
series of such races Mean31dwelling
Experiences
14 sorrow To spread
32 loosely for
for sin drying
A group
16 or class To have
34 existence
embracing
subordinate classes or Provided
35 that; on
species condition that

[37]

[Contents]

You might also like