Clustering With Dendrograms On Interpretation Variables: M. Forina, C. Armanino, V. Raggio
Abstract
Clustering techniques are used frequently in chemistry to show and to interpret similarities between objects or variables.
The results of a clustering technique are generally reported in a plot (the dendrogram of similarities) where the ordinate is the
similarity between groups and the abscissa has no specific meaning, but it is used only to separate the clusters. Here, the
use of interpretation variables for the projection of the dendrogram is suggested. Some examples show that the dendrogram
modified in this way can show the information more efficiently than the usual dendrogram. © 2002 Elsevier Science B.V.
All rights reserved.
Keywords: Clustering; Dendrograms; Chemometrics; Pattern recognition
0003-2670/02/$ – see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0003-2670(01)01517-3
14 M. Forina et al. / Analytica Chimica Acta 454 (2002) 13–19
Fig. 1. The original (univariate) information (left) and the dendrogram of similarities (right).
Fig. 2. An example of a dendrogram without information on both axes.
this quantity is the merging order (an integer between 0 and (N − 1)) or the number of clusters (from N to 1) or a mysterious quantity as in the example of Fig. 2.
When the similarity or the distance is reported, the corresponding axis is informative. The length of the vertical lines (in the orientation of Fig. 1) measures the separation between the merged clusters, so that it is common practice to "cut" the dendrogram at the similarity corresponding to the longest branches, to obtain apparently "significant" clusters. For example, the dendrogram can be cut at similarity 0.5, corresponding to the branch from 0.125 to 0.75. In this way, the cluster of objects 1 and 2 is separated from the singleton 3, in agreement with the visual inspection of the variable (x).
The second axis is not informative. On the contrary, because we are used to giving significance to the axes, the position of the objects can suggest erroneous conclusions. Fig. 3 shows the dendrogram obtained with the data SEBOU.DAT in [4], after autoscaling, with the unweighted average linkage method, as implemented in PARVUS [3]. The two dendrograms in the upper part of the figure differ simply because the order in which the objects were presented to the clustering program was modified. Obviously, the structure of the dendrogram, with three clusters, is the same. However, in one of the two dendrograms objects 1 and 8 are close and objects 5 and 9 are very far apart, whereas in the other objects 5 and 9 are close and objects 1 and 8 are very far apart.
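The cut at similarity 0.5 described for Fig. 1 can be reproduced with a small sketch. Everything specific here is an assumption made for illustration: the three values of x are hypothetical (the data behind Fig. 1 are not given in the text), and similarity is taken as 1 − d/d_max, one common convention. With these choices the two merges happen to fall at similarities 0.75 and 0.125, the levels quoted above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical univariate data for objects 1, 2 and 3 (not taken from the paper).
x = np.array([[0.0], [1.0], [4.0]])

d = pdist(x)                      # pairwise distances: d12, d13, d23
Z = linkage(d, method="average")  # unweighted average linkage
d_max = d.max()

# Assumed convention: similarity = 1 - distance / largest distance.
similarity = 1.0 - Z[:, 2] / d_max

# Cutting at similarity 0.5 is the same as cutting at distance 0.5 * d_max.
cut_similarity = 0.5
labels = fcluster(Z, t=(1.0 - cut_similarity) * d_max, criterion="distance")

print("merge similarities:", similarity)
print("cluster labels:", labels)
```

Objects 1 and 2 receive the same label and object 3 a different one, because the cut falls inside the long branch between the two merge levels.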
There are up to $2^{N-2}$ ways of arranging a dendrogram of N objects: 512 ways in the case of the 11
objects in SEBOU.DAT.
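A short sketch can illustrate both points: the count of arrangements quoted in the text, and the fact that presenting the objects in a different order leaves the merge levels unchanged while the left-to-right leaf order may differ. The data are random stand-ins (SEBOU.DAT is not reproduced here), autoscaling is assumed to mean centring each column and dividing by its standard deviation, and the helper `n_arrangements` is a hypothetical name.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def n_arrangements(n_objects):
    """Upper bound quoted in the text: 2^(N - 2) arrangements."""
    return 2 ** (n_objects - 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(11, 8))  # stand-in for 11 objects, 8 variables

# Autoscaling (assumed: column centring and unit standard deviation).
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Same objects, presented to the clustering routine in two different orders.
perm = rng.permutation(11)
Z_a = linkage(Xs, method="average")
Z_b = linkage(Xs[perm], method="average")

# The merge levels (the structure) do not depend on the presentation order.
heights_equal = np.allclose(Z_a[:, 2], Z_b[:, 2])

# Leaf orders, expressed with the original object indices.
order_a = leaves_list(Z_a)
order_b = perm[leaves_list(Z_b)]

print(n_arrangements(11), heights_equal)
print(order_a)
print(order_b)
```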
The graphical presentation of a dendrogram and
its underlying similarity matrix can be improved by
“seriation”, i.e. by optimal re-ordering of the objects.
Gale et al. [5] and Wishart [6] have considered seriation methods.
The dendrogram at the bottom of Fig. 3 was obtained after seriation. The position of the objects on
the abscissa is exactly the same as that on the first
principal component of the autoscaled data, which
accounts for 83.3% of the variance (Fig. 4).
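Placing the objects on the abscissa according to their first principal component score can be sketched as follows. This is a minimal illustration with random stand-in data, so the explained variance will not match the 83.3% reported for SEBOU.DAT; autoscaling is again assumed to mean column centring followed by division by the column standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(11, 8))  # stand-in data: 11 objects, 8 variables

# Autoscaling (assumed definition).
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Principal components from the singular value decomposition.
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
pc1_scores = U[:, 0] * s[0]          # abscissa position of each object
explained = s**2 / np.sum(s**2)      # fraction of variance per component

# Left-to-right order of the objects along the first component.
order = np.argsort(pc1_scores)

print("variance explained by PC1:", explained[0])
print("object order along PC1:", order)
```

Using the scores themselves as positions, rather than only their rank order, keeps the spacing between objects meaningful on the abscissa.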
Fig. 6. Dataset SOY.DAT. Principal components after row autoscaling followed by column autoscaling.
Fig. 7. Dataset SOY.DAT. Dendrogram projected on the content of proteins and moisture.
Fig. 8. Dataset PORT1981.DAT. Dendrogram projected on the map of Portugal, each object on the corresponding sampling site.
Fig. 9. Dataset SEBOU.DAT. Dendrogram projected on the map of Sebou valley (Morocco), each object on the corresponding sampling site.
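The projections in Figs. 7 to 9 place each object at its values of the interpretation variables (for example, the coordinates of its sampling site). How the merge nodes are positioned is not stated in this text, so the sketch below makes an explicit assumption: each node is placed at the centroid of the interpretation-variable coordinates of the objects it contains. The function name `project_dendrogram` and the toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def project_dendrogram(Z, coords):
    """Return positions of all 2N - 1 nodes (leaves first, then merges).

    Assumption: a merge node sits at the centroid of its member objects
    in the space of the interpretation variables.
    """
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    pos = np.vstack([coords, np.zeros((n - 1, coords.shape[1]))])
    size = np.concatenate([np.ones(n), np.zeros(n - 1)])
    for k, (a, b, _height, count) in enumerate(Z):
        a, b = int(a), int(b)
        size[n + k] = count
        # Size-weighted mean of the two children equals the member centroid.
        pos[n + k] = (size[a] * pos[a] + size[b] * pos[b]) / count
    return pos

# Toy clustering variables (used only to build the dendrogram).
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.0]])
# Toy interpretation variables, e.g. longitude and latitude of each object.
site = np.array([[1.0, 2.0], [3.0, 2.0], [10.0, 8.0], [12.0, 8.0]])

Z = linkage(X, method="average")
nodes = project_dendrogram(Z, site)
print(nodes)
```

Each row of `Z` also carries the merge level, so a plot could draw a segment from every merge node to its two children and use the level (or a colour) to encode similarity.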
ton are obtained. Fig. 6 shows the corresponding PC plot, with the dispersion polygons corresponding to the green cluster (lower left), the red cluster (lower right) and the singleton. The projected dendrogram shows clearly that the red cluster corresponds to samples with a large protein content and the green cluster to samples with a medium protein content.
Fig. 7 shows the projection of the same dendrogram on two interpretation variables, the content of proteins and of moisture. The red cluster corresponds to samples with large protein content and low moisture content; the green cluster to samples with large moisture content and medium-low protein content; the blue cluster to samples with low content of both proteins and moisture. The singleton (sample 28) has a very small amount of both proteins and moisture.
In the second example, the same clustering technique was applied after autoscaling to the dataset PORT1981.DAT of 151 samples of olive oil from Portugal [9,10], described by 11 fatty acids (C16:0, C16:1, C17:0, C17:1, C18:0, C18:1, C18:2, C18:3, C20:0, C22:0 and C24:0) and three sterols (β-sitosterol, campesterol and stigmasterol). The interpretation variables are the latitude and the longitude of the sampling sites. The projection of the dendrogram on the map of Portugal is shown in Fig. 8. A sample marked E (from the extreme north of Portugal) and two samples marked T (from the extreme south of Portugal, Province of Tavira) are outliers, very different from the other samples, which cluster in two main groups. With the exception of only one sample, marked A in Fig. 8, the red cluster is that of the olive oils of cold Portugal (the north and the inner valleys, especially the Douro valley). The blue cluster is that of the oils of warm Portugal.
The third example [4] refers to the eleven objects of SEBOU.DAT, collected in Morocco, in the low valley of the Sebou river and of its effluents. Each object is the mean of four samples of sediments, analysed for 8 metals.
Fig. 9 shows the dendrogram obtained (unweighted average linkage method, autoscaled data) and projected on the map of the sampling region.
The first cluster is that of low-pollution sediments, upstream of the human pollution sources (towns and industries). An exception is object 9, perhaps caused by local dilution due to an unpolluted affluent. The other two clusters are those of medium-large and large pollution, in the sampling sites downstream of large towns with polluting industries.

3. Conclusions

The examples reported here are only a sample of the variety of cases (variables for clustering and interpretation variables) where the projection on one or two interpretation variables and the use of colours can make the dendrogram more informative than the usual dendrogram.

Acknowledgements

Study developed with funds from the EU project STM4-CT98-7521 "European Network for the Intercomparison of Chemometric Software and Methods", from the University of Genova and from CNR (National Research Council of Italy).

References

[1] D.L. Massart, L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, Wiley, New York, 1983.
[2] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. De Jong, P.J. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics, Elsevier, Amsterdam, 1998.
[3] M. Forina, S. Lanteri, C. Armanino, Q-PARVUS Release 3.0, An extendable package of programs for explorative data analysis, classification and regression analysis, Dip. Chimica e Tecnologie Farmaceutiche, University of Genova, available (free, with manual and examples) http://parvus.unige.it.
[4] L. Bennasser, M. Fekaoui, O. Mameli, P. Melis, Ann. Chim. (Rome) 90 (2000) 637.
[5] N. Gale, W.C. Halperin, C.M. Costanzo, J. Classification 1 (1984) 75.
[6] D. Wishart, J. Comput. Sci. Stat. 29 (1997) 48.
[7] P.K. Hopke, J. Environ. Sci. Health 367 (1976).
[8] M. Forina, G. Drava, et al., Chemometr. Intell. Lab. Syst. 27 (1995) 189.
[9] XIV Cadastro Oleicola, Instituto do Azeite e Produtos Oleaginosos, Portugal Ministry for Commerce and Tourism, Lisbon, November 1981.
[10] M. Forina, C. Armanino, S. Lanteri, C. Calcagno, E. Tiscornia, Riv. Ital. Sostanze Grasse 60 (1983) 607.