
Analytica Chimica Acta 454 (2002) 13–19

Clustering with dendrograms on interpretation variables


M. Forina∗ , C. Armanino, V. Raggio
Department of Chemistry and Technology of Drugs and Foods, University of Genova, Via Brigata Salerno (s/n), I-16147 Genova, Italy
Received 16 July 2001; received in revised form 17 October 2001; accepted 5 November 2001

Abstract
Clustering techniques are used frequently in chemistry to show and to interpret similarities between objects or variables.
The results of a clustering technique are generally reported in a plot (the dendrogram of similarities) where the ordinate is the
similarity between groups and the abscissa has no specific meaning, but it is used only to separate the clusters. Here, the
use of interpretation variables for the projection of the dendrogram is suggested. Some examples show that the
dendrogram modified in this way sometimes presents the information more efficiently than the usual dendrogram.
© 2002 Elsevier Science B.V. All rights reserved.
Keywords: Clustering; Dendrograms; Chemometrics; Pattern recognition

∗ Corresponding author. Tel.: +39-10-3532630; fax: +39-10-3532684. E-mail address: forina@dictfa.unige.it (M. Forina).
0003-2670/02/$ – see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0003-2670(01)01517-3

1. Introduction

Cluster analysis is one of the branches of chemometrics, and its tools are applied in unsupervised pattern recognition, when the objective is to identify, within a set of objects, groups of similar objects to be interpreted as members of a "category". Clustering can also be applied to variables, to find groups of similar variables, frequently with the objective of variable selection for calibration techniques.

Clustering techniques have been classified [1–3] as:

a) Visual techniques;
b) Hierarchical methods
   b1) Agglomerative;
   b2) Divisive;
c) Non-hierarchical methods.

Visual techniques are generally based on the observation of principal component plots. Among the other techniques, hierarchical agglomerative methods are the most popular. They are based on the similarity between two objects (or between two clusters), given by the equation:

$$ s_{ij} = 1 - \frac{d_{ij}}{d_{\max}} \qquad (1) $$

where $d_{ij}$ is the distance between the two objects and $d_{\max}$ is the maximum distance between two objects in the dataset. The two objects with the maximum distance have similarity 0. The distance between two objects depends on the metric used.

Hierarchical agglomerative techniques start from as many clusters as there are objects. Objects are gradually joined into clusters, up to the final cluster containing all the objects. In each step, two objects, or one object and one cluster, or two clusters are merged. In the first step, the two objects with the largest similarity are merged. Then, in each step, the two most similar clusters are merged. The value of their similarity (or of their distance) is retained. It will be used to draw the typical result of these techniques, the dendrogram. With N objects, the final cluster is obtained after
(N − 1) steps. The hierarchy is a consequence of the fact that larger clusters are always obtained by the merger of smaller ones (with all their objects).

In the unweighted average linkage method, when two clusters A and B are joined into the new cluster C, the position of C lies between the positions of A and B, weighted by the number of objects in the two joined clusters. In the evaluation of the position of C, all the original objects have the same weight (this is the reason for the name "unweighted"). For example, in the case of three objects described by only one variable (x), the first cluster (A) is obtained by merging the two closest objects, say objects 1 and 2. The co-ordinate of the cluster A is:

$$ x_A = \tfrac{1}{2}(x_1 + x_2) \qquad (2) $$

In the next (and final) step, the cluster A and the object 3 are joined to obtain the large cluster B, with co-ordinate:

$$ x_B = \tfrac{1}{3}(2x_A + x_3) = \tfrac{1}{3}(x_1 + x_2 + x_3) \qquad (3) $$

In the position of B its three objects have the same weight.

The dendrogram is the graphical representation of the clustering. Usually, it is drawn backward, starting from the final cluster with all the objects and from similarity 0. At the similarity where two clusters were merged to form the final cluster, the final cluster splits into the two parent clusters, and so on. Fig. 1 represents the dendrogram in the simple case of three objects described by only one variable (x). Let $x_1 = 1$, $x_2 = 2$ and $x_3 = 5$. The starting similarities are shown in the similarity matrix:

$$ \begin{pmatrix} 1 & 0.75 & 0 \\ & 1 & 0.25 \\ & & 1 \end{pmatrix} $$

so that the first cluster A is formed from objects 1 and 2 at the similarity level 0.75. Its position is, with the unweighted average linkage method, $x_A = 1.5$. The similarity between A and the object 3 is $1 - (5 - 1.5)/(5 - 1) = 0.125$.

So, the dendrogram starts from similarity 0 (or from the maximum distance 4) and, at similarity 0.125 (or at distance 3.5), the tree divides into the cluster A (with objects 1 and 2) and the singleton (object 3). From similarity 0.125, the line representing the cluster A continues down to similarity 0.75, where A splits into the two objects 1 and 2. The dendrogram so obtained is represented in Fig. 1.

Fig. 1. The original (univariate) information (left) and the dendrogram of similarities (right).

In Fig. 1 the ordinate is the similarity, with the value 1 at the origin. The abscissa indicates the objects; it serves only to separate the objects and has no special significance.

Fig. 2. An example of dendrogram without information on both axes.

In past years, dendrograms were frequently obtained by means of printers without graphic capability, so that the lines were drawn with the usual alphabetic characters, as in the example of Fig. 2, which is, however, very recent [4]. The same example shows that the dendrogram can be rotated, so that the objects are reported on the ordinate, and that the abscissa can be, instead of the similarity or the merging distance, a different quantity. Often
this quantity is the merging order (an integer between 0 and (N − 1)) or the number of clusters (from N to 1) or a mysterious quantity as in the example of Fig. 2.

When the similarity or the distance is reported, the corresponding axis is informative. The length of the vertical lines (according to the scheme in Fig. 1) measures the separation between the merged clusters, so that it is common practice to "cut" the dendrogram at the similarity corresponding to the longest branches, to obtain apparently "significant" clusters. For example, the dendrogram can be cut at similarity 0.5, corresponding to the branch from 0.125 to 0.75. In this way, the cluster of the objects 1 and 2 is separated from the singleton 3, in agreement with the visual inspection of the variable (x).

The second axis is not informative. On the contrary, because we are used to giving significance to the axes, the position of the objects can suggest erroneous conclusions. Fig. 3 shows the dendrogram obtained with the data SEBOU.DAT in [4], after autoscaling, with the unweighted average linkage method, as implemented in PARVUS [3]. The two dendrograms in the upper part of the figure differ simply because the order in
which the objects were presented to the clustering
program was modified. Obviously, the structure of the
dendrogram, with three clusters, is the same. However,
in one of the two dendrograms objects 1 and 8 are close
and objects 5 and 9 are very far apart, whereas in the other
objects 5 and 9 are close and objects 1 and 8 are very far apart.
There are up to $2^{N-2}$ ways of arranging a dendrogram
of N objects: 512 ways in the case of the 11
objects in SEBOU.DAT.
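
The count of arrangements can be checked by brute force on a small tree. The sketch below is illustrative only: the nested-tuple tree representation, the function names and the four-object hierarchy are assumptions made for this example, and a layout and its mirror image are treated as the same arrangement, which is one reading consistent with the $2^{N-2}$ figure.

```python
from itertools import product


def leaf_orders(tree):
    """Return every left-to-right leaf order obtainable by flipping branches.

    A tree is either a leaf label or a (left, right) pair of subtrees.
    """
    if not isinstance(tree, tuple):
        return [[tree]]
    left, right = tree
    orders = []
    for a, b in product(leaf_orders(left), leaf_orders(right)):
        orders.append(a + b)  # branch drawn as given
        orders.append(b + a)  # same branch, flipped
    return orders


def distinct_layouts(tree):
    """Count leaf orders, treating an order and its mirror image as one layout."""
    unique = {min(tuple(o), tuple(reversed(o))) for o in leaf_orders(tree)}
    return len(unique)


# Hypothetical hierarchy of four objects: 1 and 2 merge first, then 3, then 4.
toy_tree = (((1, 2), 3), 4)
n_leaves = 4

print(distinct_layouts(toy_tree), 2 ** (n_leaves - 2))
print(2 ** (11 - 2))  # the 11 objects of SEBOU.DAT
```

Every one of these layouts represents the same hierarchy, which is why the neighbourhood of two objects on the abscissa carries no information by itself.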
The graphical presentation of a dendrogram and
its underlying similarity matrix can be improved by
“seriation”, i.e. by optimal re-ordering of the objects.
Gale et al. [5] and Wishart [6] have considered seri-
ation methods.
The dendrogram at the bottom of Fig. 3 was obtained
after seriation. The position of the objects on
the abscissa is exactly the same as that on the first
principal component of the autoscaled data, which
accounts for 83.3% of the variance (Fig. 4).
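
The text does not say which seriation algorithm produced the bottom dendrogram of Fig. 3. As a rough sketch of the idea only, the code below orders objects by their score on the first principal component of the autoscaled data. The data values, labels and function names are hypothetical, the principal component is found by simple power iteration, and the sign of a principal component is arbitrary, so the order may come out reversed.

```python
import math


def autoscale(rows):
    """Centre each column and divide it by its sample standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def first_pc(z, iters=500):
    """Leading eigenvector of the covariance matrix and its share of variance.

    Uses power iteration from a fixed start vector; this is an assumption of
    the sketch and fails if the start is orthogonal to the leading eigenvector.
    """
    n, p = len(z), len(z[0])
    cov = [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0 / (i + 1) for i in range(p)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    total = sum(cov[i][i] for i in range(p))
    return v, eigenvalue / total


def pc1_order(rows, labels):
    """Return the labels sorted by PC1 score and the variance share of PC1."""
    z = autoscale(rows)
    loading, share = first_pc(z)
    scores = [sum(a * b for a, b in zip(r, loading)) for r in z]
    order = [lab for _, lab in sorted(zip(scores, labels))]
    return order, share


# Hypothetical data: five objects described by three correlated variables.
labels = ["A", "B", "C", "D", "E"]
data = [
    [1.0, 2.0, 0.5],
    [1.2, 2.1, 0.7],
    [3.0, 4.2, 2.0],
    [3.3, 4.0, 2.4],
    [6.0, 7.5, 4.1],
]

order, share = pc1_order(data, labels)
print(order, round(share, 3))
```

An ordering of this kind is only meaningful when the first component accounts for most of the variance, as it does for SEBOU.DAT according to the text.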

Fig. 3. Dataset SEBOU.DAT. Two dendrograms obtained with different orders of the objects and a dendrogram obtained after seriation (bottom).

Fig. 4. Dataset SEBOU.DAT. Principal components of autoscaled data.

Fig. 5. Dataset SOY.DAT. Dendrogram projected on the content of proteins.

Fig. 6. Dataset SOY.DAT. Principal components after row autoscaling followed by column autoscaling.

Fig. 7. Dataset SOY.DAT. Dendrogram projected on the content of proteins and moisture.

2. Projection on interpretation variables

The statistical significance of the clusters can be evaluated by means of a Fisher test (under the hypothesis of normal distribution of the distances of the objects in the cluster from the centroid). Other methods have been suggested [1], but no satisfactory solution has been found because of the very different structures of the clusters found in practical problems.

In chemometrics, the chemical significance of a cluster is based on the possibility of interpretation. Interpretation means that we compare the clusters suggested by the dendrogram with external information, not used to compute the dendrogram. Frequently, the real problem suggests that some categories exist, and the interpretation step is to compare the number and the composition of the clusters with these categories. The category index is a discrete variable. In other cases the interpretation uses one or more continuous variables. Frequently, these variables are the co-ordinates of the sampling places. For example, Hopke [7] studied the clustering of 79 samples of lake sediments. Four clusters were obtained, and they were visualised on the map of the lake with a different symbol for each cluster.

The direct projection of the dendrogram on one or two interpretation variables does not present mathematical difficulties. In the case of one interpretation variable, the abscissa is the interpretation variable and the ordinate is the similarity. In the case of two interpretation variables, the dendrogram simulates a three-dimensional space.

Instead of the backward procedure of the usual dendrogram, the projected dendrogram starts from the objects and continues exactly as the agglomerative procedure. When two objects are merged, their position on the abscissa is the mean of the values of the interpretation variable for the two objects.

Fig. 5 shows a first example of projection of a dendrogram on one interpretation variable. The data (SOY.DAT) are spectra of 60 samples of soy flour, measured with a filter instrument with 19 filters. The response variables are moisture, protein and oil. Details are reported in the original paper [8]. The data are available [3]. Clustering (unweighted average linkage method) was performed on the spectra after row autoscaling (i.e. standard normal variate) followed by column autoscaling. The interpretation variable in Fig. 5 is the content of proteins. By cutting the dendrogram at similarity 0.35, three clusters and a singleton are obtained. Fig. 6 shows the corresponding PC plot, with the dispersion polygons corresponding to the green cluster (lower left), the red cluster (lower right) and the singleton. The projected dendrogram shows clearly that the red cluster corresponds to samples with a large protein content and the green cluster to samples with a medium protein content.

Fig. 7 shows the projection of the same dendrogram on two interpretation variables, the content of proteins and moisture. The red cluster corresponds to samples with large protein content and low moisture content; the green cluster to samples with large moisture content and medium-low protein content; the blue cluster to samples with low protein and low moisture content. The singleton (sample 28) has a very small amount of both proteins and moisture.

In the second example, the same clustering technique was applied after autoscaling to the dataset PORT1981.DAT of 151 samples of olive oil from Portugal [9,10], described by 11 fatty acids (C16:0, C16:1, C17:0, C17:1, C18:0, C18:1, C18:2, C18:3, C20:0, C22:0 and C24:0) and three sterols (β-sitosterol, campesterol and stigmasterol). The interpretation variables are the latitude and the longitude of the sampling sites in Portugal. The projection of the dendrogram on the map of Portugal is shown in Fig. 8. A sample marked E (from the extreme north of Portugal) and two samples marked T from the extreme south of Portugal (Province of Tavira) are outliers, very different from the other samples, which cluster in two main groups. With the exception of only one sample, marked A in Fig. 8, the red cluster is that of olive oils of cold Portugal (north and inner valleys, especially the Douro valley). The blue cluster is that of the oils of warm Portugal.

Fig. 8. Dataset PORT1981.DAT. Dendrogram projected on the map of Portugal, each object on the corresponding sampling site.

The third example [4] refers to the eleven objects of SEBOU.DAT, collected in Morocco, in the lower valley of the Sebou river and of its effluents. Each object is the mean of four samples of sediments, analysed for 8 metals.

Fig. 9 shows the dendrogram obtained (unweighted average linkage method, autoscaled data) and projected on the map of the sampling region.

Fig. 9. Dataset SEBOU.DAT. Dendrogram projected on the map of Sebou valley (Morocco), each object on the corresponding sampling site.

The first cluster is that of low-pollution sediments, upstream of the human pollution sources (towns and industries). An exception is object 9, perhaps caused by local dilution due to an unpolluted affluent. The other two clusters are those of medium-large and large pollution in the sampling sites downstream of large towns with polluting industries.

3. Conclusions

The examples reported here are only a sample of the variety of cases (variables for clustering and interpretation variables) where the projection on one or two interpretation variables and the use of colours can make the dendrogram more informative than the usual dendrogram.

Acknowledgements

Study developed with funds from the project UE STM4-CT98-7521 "European Network for the Intercomparison of Chemometric Software and Methods", from the University of Genova and from CNR (National Research Council of Italy).

References

[1] D.L. Massart, L. Kaufman, The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis, Wiley, New York, 1983.
[2] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. De Jong, P.J. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics, Elsevier, Amsterdam, 1998.
[3] M. Forina, S. Lanteri, C. Armanino, Q-PARVUS Release 3.0, An extendable package of programs for explorative data analysis, classification and regression analysis, Dip. Chimica e Tecnologie Farmaceutiche, University of Genova, available (free, with manual and examples) http://parvus.unige.it.
[4] L. Bennasser, M. Fekaoui, O. Mameli, P. Melis, Ann. Chim. (Rome) 90 (2000) 637.
[5] N. Gale, W.C. Halperin, C.M. Costanzo, J. Classification 1 (1984) 75.
[6] D. Wishart, J. Comput. Sci. Stat. 29 (1997) 48.
[7] P.K. Hopke, J. Environ. Sci. Health 367 (1976).
[8] M. Forina, G. Drava, et al., Chemometr. Intell. Lab. Syst. 27 (1995) 189.
[9] XIV Cadastro Oleicola, Instituto do Azeite e Produtos Oleaginosos, Portugal Ministry for Commerce and Tourism, Lisbon, November 1981.
[10] M. Forina, C. Armanino, S. Lanteri, C. Calcagno, E. Tiscornia, Riv. Ital. Sostanze Grasse 60 (1983) 607.
