PII: S0031-3203(17)30039-0
DOI: 10.1016/j.patcog.2017.01.031
Reference: PR 6033
Please cite this article as: Badih Ghattas, Pierre Michel, Laurent Boyer, Clustering nominal data using
Unsupervised Binary decision Trees: Comparisons with the state of the art methods., Pattern Recog-
nition (2017), doi: 10.1016/j.patcog.2017.01.031
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
Highlights
• New heuristics are given for tuning the parameters of CUBT.
• CUBT outperforms many of the existing approaches for nominal datasets.
• The tree structure helps for the interpretation of the obtained clusters.
• The method may be used with parallel computing and thus for Big data.
a Aix-Marseille University,
Department of Mathematics
163 Avenue de Luminy, 13288 Marseille cedex 09
Tel.: +491829089, Fax: +491829356
b Aix-Marseille University,
EA 3279 Research Unit - Public Health, Chronic Diseases and Quality of Life
13005 Marseille, France
Abstract

We extend CUBT (clustering using unsupervised binary trees) to nominal data. For this purpose, we primarily use heterogeneity criteria and dissimilarity measures based on mutual information, entropy and the Hamming distance. We show that for this type of data, CUBT outperforms most of the existing methods. We also provide and justify some guidelines and heuristics to tune the parameters of CUBT. Extensive comparisons with other well-known approaches are made using simulations, and two applications to real datasets are given.

Keywords: CUBT, Unsupervised learning, Clustering, Binary decision trees,

∗ Corresponding author
Email addresses: badih.ghattas@univ-amu.fr (Badih Ghattas), pierre.michel@univ-amu.fr (Pierre Michel), Laurent.BOYER@ap-hm.fr (Laurent Boyer)
1. Introduction
Clustering aims to group a set of observations according to some similarity or dissimilarity measure. Most clustering methods aim to construct a partition of a set of n observations into k clusters, where k may be specified a priori or be determined by the method. There are two main types of clustering methods: hierarchical and non-hierarchical. Hierarchical algorithms [32] construct a set of clusters that may be represented by a tree. The root of the tree corresponds to the entire set of observations. The tree is often binary, with each node corresponding to the union of two clusters. A recent algorithm of this type is DIVCLUS-T [6]. It is based on the minimization of an inertia criterion, as in classical hierarchical cluster analysis (HCA) with Ward's criterion. The main difference with the latter is that it produces a natural interpretation of the obtained clusters, using cut-off values on the variables that explain the clusters.
Non-hierarchical algorithms include partitioning methods. The k-means method searches for a k-cluster partition that minimizes the within-cluster variance. Density-based methods attempt to construct clusters from dense regions of observations, while model-based methods include latent class analysis (LCA, [39]). Similarly, there are fuzzy clustering methods, where each observation may belong to more than one cluster (see, for example, the fuzzy c-means algorithm [10]).
In this work, we focus on hierarchical clustering methods based on decision trees. To our knowledge, the first proposal for using decision trees for clustering came with the algorithm CO.5 [9], which uses top-down induction of logical decision trees [3]. These types of trees are called "predictive clustering trees". They are based on Quinlan's classification algorithm C4.5 [34] and have been widely used in the literature, particularly in machine learning. Another tree-based algorithm, CLTree [26], is capable of ignoring outliers using a technique that introduces "non-existing" points in the data space, but it is limited to continuous data. A more recent algorithm exclusively designed for nominal data is CG-CLUS [41]. This algorithm has good efficiency and performance properties compared to the well-known COBWEB algorithm [12]. All of these methods are related to "conceptual clustering" (i.e., a common machine learning task), which aims to cluster objects represented by attribute-value pairs (i.e., nominal descriptions) and does not take into account any ordering of the variable levels. The main advantage of this approach is the direct interpretation of the output tree. In this framework, a common information-theoretic measure is often used to assess the quality of a partition, namely the category utility [7], which is formally equivalent to the mutual information [15].
CUBT [13] is a top-down hierarchical clustering method inspired by CART [5], and it consists of three stages. The first step grows a "maximal tree" by recursively splitting the dataset into several subsets of observations so as to minimize a heterogeneity criterion, the deviance, within the final clusters. The deviance criterion is the trace of the covariance matrix within each subset of observations. The second step prunes the tree. For each pair of sibling terminal nodes (i.e., leaves with the same ascendant node), a measure of dissimilarity between the two nodes is computed. If this dissimilarity measure is lower than a fixed threshold mindist, then the two nodes are aggregated into a single node (i.e., their parent node). The dissimilarity measure used by CUBT is based on the Euclidean distance. The final step, joining, is also a node aggregation step, in which the constraint of node adjacency for cluster aggregation is not required. For the joining step, either the deviance or the dissimilarity measure may be used.
CUBT shares various advantages with CART. It is flexible and efficient, i.e., it produces good partitions for a large family of data structures; it is interpretable (because of the binary labeled splits); and it has good convergence properties. In their original work [13], the authors compared CUBT with several clustering methods and showed that it produces quite satisfactory results. Several applications of CUBT have been undertaken, mainly in social sciences and medicine; see [29, 30], for example. This approach allowed clinicians to improve interpretability and decision making with regard to cut-off scores that define the membership of individuals in different clusters.
One of the limitations of CUBT is that the criteria used to grow and prune a tree (deviance and dissimilarity measure) are specific to continuous data. Herein, we present an extension of CUBT to nominal data. For each step of CUBT, namely growing and pruning, we propose a new criterion based either on mutual information or on entropy. We provide some heuristics for selecting the parameters used at each stage. Section 2 describes the nominal version of CUBT. Section 3 presents some simulation models and comparisons with other clustering methods for nominal data. Section 4 provides two real data examples to illustrate the use of CUBT. The final section gives the conclusion and some perspectives.

2. CUBT for nominal data

We describe the three steps of CUBT using the new criteria for nominal data. The first step grows the maximal tree, while the second and third steps aggregate its nodes to form the final clusters.
Let $X = (X_{.1}, ..., X_{.p})$ be a random vector of p nominal components, where $X_{.j}$ denotes the j-th variable, and $\mathcal{X}_j$ is the set of levels (or categories) of the j-th variable. We have a set S of n random vectors, denoted as $X_i$ with $i \in \{1, ..., n\}$. Finally, $X_{ij}$ is the i-th observation of the j-th component of X. Similar notations are used with small letters to denote the realizations of these variables: $x$, $x_i$, $x_{.j}$ and $x_{ij}$.
We herein describe the method and show the specific new features for nominal data.
For a node t, let $X^{(t)} = \{X \mid X \in t\}$, and define $R(t)$, the heterogeneity measure of t, as follows:

$$R(t) = \sum_{j=1}^{p} H(X_{.j}^{(t)}) = -\sum_{j=1}^{p} \sum_{k=1}^{m_j} p_{kj}^{(t)} \log_2 p_{kj}^{(t)}$$

where $H(X_{.j}^{(t)})$ is the Shannon entropy of $X_{.j}^{(t)}$ within node t and $p_{kj}^{(t)}$ is the probability for the component j of X to take value k within node t. This is the sum of the entropies of each variable. $R(t)$ may also be written in terms of MI, the mutual information matrix of $X^{(t)}$, where the mutual information between two variables Y and Z is defined by:

$$\mathrm{MI}(Y, Z) = \sum_{y,z} p(y, z) \log_2 \frac{p(y, z)}{p(y)\,p(z)}$$

The heterogeneity R(t) decreases monotonically with the splitting process: the total entropy within subsamples is always less than or equal to the entropy of the whole sample. The mutual information of two variables measures the dependence between them; it is comparable to the covariance in the continuous case, and both statistics are related when the variables are Gaussian. Finally, one could use the continuous version of mutual information for continuous data, but its estimation is not efficient in general, as it goes through discretization and numerical approximations of double integrals [16].
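As a concrete illustration, the node deviance R(t) above can be computed directly from the empirical level frequencies within a node. The following is a minimal sketch in which a node is simply a list of observations (tuples of nominal values); the function names are ours, not those of the CUBT package.

```python
import math
from collections import Counter

def column_entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of one nominal variable."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def deviance(node):
    """R(t): sum of the entropies of the p variables over the observations in node t."""
    p = len(node[0])
    return sum(column_entropy([row[j] for row in node]) for j in range(p))

# A pure node (constant columns) has zero deviance; mixing levels increases it.
pure = [("a", "x"), ("a", "x")]
mixed = [("a", "x"), ("b", "y")]
print(deviance(pure))   # 0.0
print(deviance(mixed))  # 2.0 (two uniform binary variables, 1 bit each)
```

The plug-in estimates of the probabilities $p_{kj}^{(t)}$ are here the observed relative frequencies of the levels within the node.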
Initially, the primary node (the root) of the tree contains all the observations of S. The sample is split recursively into two disjoint samples using binary splits of the form $x_{.j} \in A_j$, where $j \in \{1, ..., p\}$ and $A_j$ is a subset of the levels of $x_{.j}$. Thus, a split of a node t into two sibling nodes $t_l$ and $t_r$ is defined by a pair $(j, A_j)$. The nodes $t_l$ and $t_r$ are defined as follows:

$$t_l = \{X \in t : X_{.j} \in A_j\}, \qquad t_r = \{X \in t : X_{.j} \notin A_j\}$$

The best split of t into two sibling nodes $t_l$ and $t_r$ is defined by:

$$\mathrm{argmax}_{(j, A_j) \in \{1,...,p\} \times \{1,...,m_j\}} \{\Delta(t, j, A_j)\}$$

where $\Delta(t, j, A_j) = R(t) - R(t_l) - R(t_r)$ is the reduction in deviance induced by the split. Splitting stops when a node contains fewer than minsize observations or when the deviance reduction falls below mindev × R(S), where minsize and mindev are fixed thresholds and R(S) is the deviance of the entire sample. Once the algorithm stops, a class label is assigned to each leaf of the maximal tree. A partition of the initial dataset is obtained, and each leaf of the tree corresponds to a cluster.
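The best-split search can be sketched by brute force: enumerate the nonempty proper subsets of each variable's levels and keep the pair (j, A_j) maximizing the deviance reduction. This is our own minimal sketch, assuming the unweighted reduction Δ(t, j, A_j) = R(t) − R(t_l) − R(t_r) and ignoring the minsize/mindev stopping thresholds.

```python
import math
from itertools import combinations
from collections import Counter

def deviance(node):
    """R(t): sum of per-variable Shannon entropies (base 2) within the node."""
    def h(vals):
        n = len(vals)
        return -sum((c / n) * math.log2(c / n) for c in Counter(vals).values())
    return sum(h([row[j] for row in node]) for j in range(len(node[0])))

def best_split(node):
    """Return (j, A_j, delta): the split x_.j in A_j maximizing R(t) - R(t_l) - R(t_r)."""
    best = (None, None, float("-inf"))
    r_t = deviance(node)
    for j in range(len(node[0])):
        levels = sorted(set(row[j] for row in node))
        # enumerate nonempty proper subsets of the observed levels
        for size in range(1, len(levels)):
            for subset in combinations(levels, size):
                left = [row for row in node if row[j] in subset]
                right = [row for row in node if row[j] not in subset]
                delta = r_t - deviance(left) - deviance(right)
                if delta > best[2]:
                    best = (j, set(subset), delta)
    return best

# Two latent groups; variable 0 and variable 1 co-vary, so splitting on either
# removes all the within-node entropy at once.
data = [("a", "u"), ("a", "u"), ("b", "v"), ("b", "v")]
j, a_j, delta = best_split(data)
print(j, a_j)  # 0 {'a'}
```

Enumerating all level subsets is exponential in the number of levels, which is why practical implementations restrict or order the candidate subsets.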
If the number of classes k of the final partition is known, the number of leaves of the maximal tree may be greater than k. Thus, it is necessary to prune the tree.
Let $n_l$ (respectively $n_r$) be the number of observations in $t_l$ (respectively $t_r$), and let $\delta \in [0, 1]$. For each $x_i \in t_r$ and $x_j \in t_l$, with $i, j \in \{1, ..., n\}$, we consider the pairwise distances $d(x_i, x_j)$ and their ordered versions $d_{(i)}$ and $d_{(j)}$. Note that $d(y, z)$ can be either the Hamming distance (denoted $d_{Ham}$) or the mutual information between two observations. The Hamming distance is defined as follows:

$$d_{Ham}(y, z) = \sum_{j=1}^{p} 1_{\{y_j \neq z_j\}}$$

We then compute

$$d_l^{\delta} = \frac{1}{\delta n_l} \sum_{i=1}^{\delta n_l} d_{(i)} \quad \text{and} \quad d_r^{\delta} = \frac{1}{\delta n_r} \sum_{j=1}^{\delta n_r} d_{(j)}$$

Thus, the empirical dissimilarity measure between $t_l$ and $t_r$ is computed as follows:

$$d^{\delta}(l, r) = d^{\delta}(t_l, t_r) = \max(d_l^{\delta}, d_r^{\delta})$$

At each step of the algorithm, the leaves $t_l$ and $t_r$ are aggregated and replaced by their parent t if $d^{\delta}(l, r) \leq \varepsilon$ with $\varepsilon > 0$. The pruning stage thus requires two parameters, $\delta$ and $\varepsilon$. The presence of outliers in a node may induce a split of that node in the growing process, as growing aims to reduce the heterogeneity of the two subnodes resulting from the split. Using this dissimilarity measure may aggregate such splits back.
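The pruning dissimilarity can be sketched as follows, using the Hamming distance between observations. The names and the pooling of the cross-node distances are our own reading of the description above: for each side, the cross-distances are sorted and the δ-fraction smallest are averaged, and the node dissimilarity is the larger of the two averages.

```python
def hamming(y, z):
    """Number of coordinates on which two nominal observations differ."""
    return sum(1 for a, b in zip(y, z) if a != b)

def node_dissimilarity(left, right, delta=0.3):
    """d^delta(l, r): max of the two one-sided means of the delta-fraction
    smallest ordered cross-distances between the two nodes."""
    def one_side(node, other, n_side):
        dists = sorted(hamming(x, y) for x in node for y in other)
        m = max(1, int(delta * n_side))   # average the m smallest ordered distances
        return sum(dists[:m]) / m
    d_l = one_side(left, right, len(left))
    d_r = one_side(right, left, len(right))
    return max(d_l, d_r)

# Two nearly identical leaves: small dissimilarity, candidates for aggregation.
left = [("a", "x", "u"), ("a", "x", "v")]
right = [("a", "x", "u"), ("a", "y", "u")]
print(node_dissimilarity(left, right))  # 0.0: the closest cross-pairs coincide
```

In CUBT the threshold on this quantity is what the mindist parameter controls; trimming to the δ smallest distances is what makes the measure robust to outliers.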
The joining stage aggregates pairs of nodes that do not share the same ascendant (non-sibling nodes), as in ascendant hierarchical clustering, successively joining the most similar pairs of clusters. Two joining criteria may be used for this step.
1. The first criterion is the same as in the growing stage. Each pair of nodes $t_l$ and $t_r$ (sibling or not) is compared using the loss of deviance $\Delta(t_l, t_r) = R(t_l \cup t_r) - R(t_l) - R(t_r)$. The pairs of nodes with minimal loss of deviance are aggregated.
2. The second criterion is the same as in the pruning stage. Pairs of nodes $t_l$ and $t_r$ are compared by computing $\Delta(t_l, t_r) = d^{\delta}(l, r)$. The pairs of nodes with minimal dissimilarity $\Delta(t_l, t_r)$ are aggregated.
For either criterion, let $N_L$ be the number of leaves of the maximal tree. Over all pairs $(i, j)$ with $i, j \in \{1, ..., N_L\}$ and $i \neq j$, let $(\tilde{i}, \tilde{j}) = \mathrm{argmin}_{i,j} \{\Delta(t_i, t_j)\}$. The pair of nodes $t_{\tilde{i}}$ and $t_{\tilde{j}}$ is replaced by its union $t_{\tilde{i}} \cup t_{\tilde{j}}$, and $N_L$ is decremented by one. There are two types of stopping rules, depending on whether the number of clusters k is known.
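A minimal sketch of the joining loop under the first (loss-of-deviance) criterion, assuming a deviance function as above and stopping when a known number of clusters k is reached; the naming is ours.

```python
import math
from collections import Counter
from itertools import combinations

def deviance(node):
    """R(t): sum of per-variable Shannon entropies (base 2) within the node."""
    def h(vals):
        n = len(vals)
        return -sum((c / n) * math.log2(c / n) for c in Counter(vals).values())
    return sum(h([row[j] for row in node]) for j in range(len(node[0])))

def join(leaves, k):
    """Repeatedly merge the pair of (not necessarily sibling) leaves with the
    smallest loss of deviance R(ti U tj) - R(ti) - R(tj) until k clusters remain."""
    leaves = [list(t) for t in leaves]
    while len(leaves) > k:
        i, j = min(combinations(range(len(leaves)), 2),
                   key=lambda p: deviance(leaves[p[0]] + leaves[p[1]])
                               - deviance(leaves[p[0]]) - deviance(leaves[p[1]]))
        leaves[i] = leaves[i] + leaves[j]
        del leaves[j]
    return leaves

# Four leaves, two natural clusters: the homogeneous pairs merge first.
leaves = [[("a", "x")], [("a", "x")], [("b", "y")], [("b", "y")]]
print(len(join(leaves, 2)))  # 2
```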
3. Experiments
We compare CUBT with several clustering methods for nominal data: k-modes, HCA, DBSCAN, COBWEB [12], DIVCLUS-T [6] and LCA [39]. The misclassification error [13] and the adjusted Rand index [20] are used for these comparisons. We have updated the existing CUBT package [14] in R to account for the criteria described in the previous section.
Hierarchical cluster analysis (HCA) may be ascendant ("bottom-up") or descendant ("top-down"). In ascendant hierarchical clustering, the hierarchy is constructed by iterative aggregation of pairs of nearest clusters. The distance between two clusters may be defined in several ways, the most popular being single, average or complete linkage. Here, we use the complete linkage option to aggregate clusters, with the mutual information as the dissimilarity measure between observations. The algorithm is run with the hclust function from R.
k-modes is an extension of the well-known k-means algorithm [27]. The k-means sequential algorithm chooses k initial cluster centers arbitrarily, assigns each observation to the closest center, and computes the new center of each class using the observations assigned to that class. Because the resulting partition depends on the initial centers, the algorithm may be run several times, keeping the run with the minimum within-cluster sum of squares, given by $\sum_{i=1}^{n} \sum_{j=1}^{k} \|X_i - c_j\|^2 1_{x_i \in G_j}$, where $G_j$ is the j-th group and $c_j$ is the corresponding center. The k-modes algorithm [19] is a clustering method for categorical data that replaces the cluster means with modes and uses a simple matching dissimilarity between pairs of observations. We use the klaR package [40] from R for this method.
DBSCAN is a density-based clustering algorithm. For a fixed point P from the data, the ε-neighborhood is constructed, that is, the set of points that are at a distance less than ε from P. If there are at least MinPts points in its ε-neighborhood, this point (said to be dense) and all its neighbors form a cluster; otherwise, it is labeled as noise. All dense points found within the ε-neighborhood are also added to the cluster, as are their own ε-neighborhoods. Once no more dense points are found, a new unvisited point is retrieved and processed to explore new clusters.
The mutual information is used as the dissimilarity measure for this method. The parameters ε and MinPts are fixed by the user. We use the dbscan function of the fpc package [18] from R.
COBWEB [12] is a conceptual clustering approach based on the category utility measure [7]. This incremental hierarchical algorithm is designed for qualitative data. It constructs a tree dynamically, inserting one individual at a time during the tree construction. At each insertion, COBWEB has four available options: inserting the individual into an existing cluster, creating a new cluster (represented by the new individual), merging two nodes, or splitting one node. For each option, the category utility of the corresponding partition is computed, and the partition that maximizes this measure is selected. Then, the next individual in the dataset is considered. The stopping criterion is a minimal value of the category utility that cannot be exceeded. We use the default threshold (0.02) for the comparison simulations. The algorithm is run with the Cobweb function [17] of the RWeka package from R.
DIVCLUS-T [6] is a top-down hierarchical clustering method. It is monothetic, i.e., subsets of observations in the dataset are split using a single variable. Its goal is to optimize the same criterion as classical HCA with Ward's method. It is designed for quantitative, qualitative and mixed data. To compute the splitting criterion, i.e., the within-class inertia, DIVCLUS-T uses the Euclidean distance in the quantitative case and the chi-square distance in the qualitative case. The main advantage of this method, in comparison with other hierarchical methods, is that in addition to providing a dendrogram, each node is labeled by its binary splitting rule, so the dendrogram provided by DIVCLUS-T can be read exactly like a binary decision tree. Thus, DIVCLUS-T is conceptually the closest method to CUBT. The algorithm is run with the divclust function of the divclust package from R.
LCA [39] is a mixture model-based clustering method in which observations belong to one of k latent classes. Observations belonging to the same class are similar with respect to the observed variables and are supposed to come from the same probability distribution (with unknown parameters). This method is well suited to the analysis of multivariate categorical data. An LCA model is a finite mixture model [1] in which the distributions are multi-way cross-classification tables. Applied to categorical data, this method is fairly similar to a mixture of item response theory (IRT) models; indeed, the model assumes that the variables are conditionally independent. Model parameters are estimated by likelihood maximization using both expectation-maximization and Newton-Raphson algorithms. The algorithm is run with the lca function [25] of the poLCA package from R.
M1: Linear combination (LC) model. In this model, each variable $X_{.j}$, $j \in \{1, ..., p = 9\}$, has m = 5 levels. We define k = 3 clusters, each characterized by one frequent level: in cluster 1, each variable takes level 1 with probability q and each level $l \neq 1$ with probability $(1 - q)/(m - 1)$. For clusters 2 and 3, the frequent levels are 3 and 5, respectively, with the same probabilities. We fix q = 0.8. This simulation model is very difficult for CUBT because $\sum_{j=1}^{p} X_{.j}$ is a perfectly discriminating variable for the clusters, whereas no single variable separates them well.
M2: Tree model 1. We use here a tree-structured model. We fix the dimension p = 3 and the number of groups k = 4. Each variable $X_{.j}$, $j \in \{1, ..., p\}$, has m = 4 levels. Each level is coded as an integer, and we distinguish odd and even levels. The partition used for the simulation is shown in figure 1. Clusters are defined as follows:
• C1: x1 and x2 have odd levels, and x3 is arbitrary
• C2: x1 has odd levels, x2 has even levels, and x3 is arbitrary
• C3: x1 has even levels, x3 has odd levels, and x2 is arbitrary
• C4: x1 and x3 have even levels, and x2 is arbitrary

[Figure 1: Tree structure used for data simulation model M3. The root splits on "x1 odd?"; the left branch ("yes") splits on "x2 odd?" into C1 (yes) and C2 (no); the right branch ("no") splits on "x3 odd?" into C3 (yes) and C4 (no).]

This model produces clusters that are expected to be easily found by CUBT, since the partition itself has a binary tree structure.
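The M2 design can be simulated in a few lines, drawing uniformly from the odd or even levels of {1, 2, 3, 4} for each cluster as read off the tree in figure 1; the function names are ours.

```python
import random

ODD, EVEN, ANY = [1, 3], [2, 4], [1, 2, 3, 4]

# (x1 levels, x2 levels, x3 levels) for clusters C1..C4 of model M2
CLUSTERS = {
    1: (ODD, ODD, ANY),
    2: (ODD, EVEN, ANY),
    3: (EVEN, ANY, ODD),
    4: (EVEN, ANY, EVEN),
}

def simulate_m2(n_per_cluster, seed=0):
    """Draw n_per_cluster observations from each of the four M2 clusters."""
    rng = random.Random(seed)
    data, labels = [], []
    for c, (l1, l2, l3) in CLUSTERS.items():
        for _ in range(n_per_cluster):
            data.append((rng.choice(l1), rng.choice(l2), rng.choice(l3)))
            labels.append(c)
    return data, labels

data, labels = simulate_m2(100)
print(len(data))  # 400
```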
M3: Tree model 2. We use the same tree-structured model. As in the previous case, we fix p = 3 and the number of groups k = 4, and each variable $X_{.j}$, $j \in \{1, ..., p\}$, has m = 4 levels. The only difference is that the variable levels are not uniformly distributed within each cluster. Here, we consider a parameter p0 that sets the probability of the most frequent level of each splitting variable within each cluster:
• C1: x1 and x2 have odd levels with P(x1 = 1) = P(x2 = 1) = p0, and x3 is arbitrary
• C2: x1 has odd levels, x2 has even levels with P(x1 = 1) = P(x2 = 2) = p0, and x3 is arbitrary
• C3: x1 has even levels, x3 has odd levels with P(x1 = 2) = P(x3 = 1) = p0, and x2 is arbitrary
• C4: x1 and x3 have even levels with P(x1 = 2) = P(x3 = 2) = p0, and x2 is arbitrary
This model should generate clusters that are easier to find than those of the previous model: because the levels within clusters are non-uniformly distributed for each splitting variable, the contribution of these variables to the global entropy is reduced.
M4: Nominal IRT-based model. We use here an IRT model designed for nominal data. We fix the dimension p = 9 and the number of groups k = 3. Each variable has $m_j = 5$ levels. We now suppose that the variables represent multiple-choice items. The nominal response model (NRM) [4] can address nominal data; it is a specialization of the general model for multinomial response relations and is defined as follows. Let θ be a level of latent ability underlying the responses to the items. The probability that a subject of ability θ responds in category $k_j$ for item j is given by

$$\Psi_{jk_j}(\theta) = \exp[z_{jk_j}(\theta)] \Big/ \sum_{h=1}^{m_j} \exp(z_{jh}(\theta))$$

where $z_{jh}(\theta) = c_{jh} + a_{jh}\theta$ with $h = 1, 2, ..., k_j, ..., m_j$, θ is a latent trait, and $c_{jh}$ and $a_{jh}$ are item parameters associated with the h-th category of item j. We generate random datasets using the NRM by simulating latent trait values for the three groups. For $c \in \{1, 2, 3\}$, we simulate a vector of latent trait values for each group c using $N(\mu_c, \sigma^2)$, with $\mu = (-3, 0, 3)$ and $\sigma^2 = 0.2$. For $j \in \{1, ..., p\}$, the values of $c_{jh}$ range uniformly between -2 and 2, while the $a_{jh}$ are distributed as $N(1, 0.1)$. Simulations are performed using the NRM.sim function of the mcIRT package [35] with R.
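The NRM sampling scheme described above can be sketched as follows; this is our own minimal implementation of the category probabilities Ψ and of the group-wise latent trait draws, not the mcIRT code.

```python
import math
import random

def nrm_probs(theta, c, a):
    """Nominal response model category probabilities for one item:
    psi_k = exp(c_k + a_k * theta) / sum_h exp(c_h + a_h * theta)."""
    z = [ck + ak * theta for ck, ak in zip(c, a)]
    m = max(z)                                  # stabilize the softmax
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def simulate_nrm(n_per_group, p=9, m=5, mus=(-3.0, 0.0, 3.0), sigma2=0.2, seed=0):
    """Draw nominal responses for three groups with latent traits N(mu_c, sigma2)."""
    rng = random.Random(seed)
    # item parameters: c_jh uniform on [-2, 2], a_jh ~ N(1, 0.1)
    c = [[rng.uniform(-2, 2) for _ in range(m)] for _ in range(p)]
    a = [[rng.gauss(1.0, math.sqrt(0.1)) for _ in range(m)] for _ in range(p)]
    data, labels = [], []
    for g, mu in enumerate(mus):
        for _ in range(n_per_group):
            theta = rng.gauss(mu, math.sqrt(sigma2))
            row = [rng.choices(range(1, m + 1), weights=nrm_probs(theta, c[j], a[j]))[0]
                   for j in range(p)]
            data.append(tuple(row))
            labels.append(g)
    return data, labels

data, labels = simulate_nrm(30)
print(len(data), len(set(labels)))  # 90 3
```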
M5: IRT-based model. We again use item response theory (IRT) models. These models assess the probability of observing a level for each variable given a latent trait level. The latent trait is an unobservable continuous variable that defines the individual's ability, as measured by the observed variables. In the IRT framework, the variables, called items, are ordinal, and the observations can be either binary or polytomous. Here, we introduce a polytomous IRT model to generate data in a probabilistic way. The generalized partial credit model (GPCM) [31] is an IRT model that can address ordinal data. It is an extension of the 2-parameter logistic model for dichotomous data. The model is defined as follows:

$$p_{jx}(\theta) = P(X_{ij} = x \mid \theta) = \frac{\exp \sum_{k=0}^{x} \alpha_j (\theta_i - \beta_{jk})}{\sum_{r=0}^{m_j} \exp \sum_{k=0}^{r} \alpha_j (\theta_i - \beta_{jk})}$$

where θ is the latent trait and $\theta_i$ represents the latent trait level of individual i. $\beta_{jk}$ is a difficulty threshold parameter for category k of item j. For $j \in \{1, ..., p\}$, $\beta_j$ is a vector of dimension m − 1, and $\alpha_j$ is a scalar discrimination parameter. We generate random datasets using the GPCM by simulating latent trait values for the three groups. For $c \in \{1, 2, 3\}$, we simulate a vector of latent trait values for each class c using $N(\mu_c, \sigma^2)$, with $\mu = (-3, 0, 3)$ and $\sigma^2 = 0.2$. For $j \in \{1, ..., p\}$, $\alpha_j$ is distributed as $N(1, 0.1)$, and $\beta_j$ is a vector of ordered values that range uniformly between -2 and 2. Simulations are performed using the rmvordlogis function of the ltm package [36] with R.
For each simulation model, we test two configurations of high and low cluster separability (or data dispersion).
For the LC model (M1), the probability q of the most frequent level in each cluster controls the separability of the clusters; we fix q to 0.8 for highly separated clusters and to 0.4 for less separated clusters.
For both tree models (M2 and M3), separability is decreased by adding "noise" variables to the dataset. We use the initial setting for well-separated clusters, and we add six variables distributed uniformly over {1, ..., 6} to the dataset for less separated clusters.
For both the NRM and GPCM (M4 and M5), the separation of the clusters is controlled by changing the latent trait distributions of the three groups. For $c \in \{1, 2, 3\}$, we simulate a vector of latent trait values for each class c using $N(\mu_c, \sigma^2)$, with $\mu = (-3, 0, 3)$ for highly separated clusters and $\mu = (-1, 0, 1)$ for less separated clusters, taking $\sigma^2 = 0.2$ in both cases. We use the same item parameters in both cases.
3.4. Prediction using the partitions
CUBT offers a direct method for predicting the cluster of any new observation, by simply passing it through the tree structure following the binary splitting rules. Any clustering approach may be used to make predictions by assigning a new observation to its closest cluster in the partition; one only needs to define the similarity measure between a new observation and a cluster. We use an average-link-type similarity: the mean similarity between the new observation and all the observations of the cluster.
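For methods without a tree, the average-link assignment described above can be sketched as follows, using the Hamming distance as the dissimilarity; the naming is ours, and any other measure could be plugged in.

```python
def hamming(y, z):
    """Number of coordinates on which two nominal observations differ."""
    return sum(1 for a, b in zip(y, z) if a != b)

def predict_cluster(x, clusters):
    """Assign x to the cluster minimizing the mean dissimilarity between x
    and all the observations of the cluster (average-link rule)."""
    def mean_dist(cluster):
        return sum(hamming(x, obs) for obs in cluster) / len(cluster)
    return min(range(len(clusters)), key=lambda c: mean_dist(clusters[c]))

clusters = [
    [("a", "x", "u"), ("a", "x", "v")],   # cluster 0
    [("b", "y", "u"), ("b", "y", "w")],   # cluster 1
]
print(predict_cluster(("a", "x", "w"), clusters))  # 0
```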
Because the true clusters are known, we assess the performance of the different algorithms using the adjusted Rand index (ARI, [20]) and the matching error [13]. Let $y_1, ..., y_n$ be the class labels of the observations, and let $\hat{y}_1, ..., \hat{y}_n$ be the labels assigned to the n observations by a clustering algorithm. ARI is a classical combinatorial index computed from the contingency table of y and ŷ. The matching error is defined as follows:

$$ME = \min_{\sigma \in \Sigma} \frac{1}{n} \sum_{i=1}^{n} 1_{\{y_i \neq \sigma(\hat{y}_i)\}}$$

where Σ is the set of permutations of the cluster labels. For more than seven categories, it may be computed efficiently using the Hungarian method [33].
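For small numbers of clusters, the matching error can be sketched by brute force over the label permutations; as noted above, the Hungarian method is preferable for many categories. The naming is ours.

```python
from itertools import permutations

def matching_error(y_true, y_pred):
    """min over label permutations sigma of the fraction of i with y_i != sigma(yhat_i)."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    best = 1.0
    for perm in permutations(labels):
        sigma = dict(zip(labels, perm))
        err = sum(1 for yt, yp in zip(y_true, y_pred) if yt != sigma[yp]) / n
        best = min(best, err)
    return best

# A perfect clustering up to relabeling has zero matching error.
print(matching_error([1, 1, 2, 2], [2, 2, 1, 1]))  # 0.0
print(matching_error([1, 1, 2, 2], [1, 2, 1, 2]))  # 0.5
```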
We first study the effect of the minsize parameter on the final partition obtained by CUBT. For each model, for each sample size, and for both the separable and non-separable cases, we tested 15 values of minsize taken over a regular grid between ⌊log(n)⌋ and ⌊n/4⌋. For each case, 20 replicas were simulated for the learning and test samples, and we computed the deviance, the category utility and the matching error for the maximal tree, the pruned tree and the joined tree. The main conclusions of these experiments are the following:
• For all cases, the optimal minsize value is the same with respect to the deviance and the category utility. In fact, the category utility and the deviance are strongly related [15].
• We compared pruning with the mutual information against pruning with the Hamming distance. The comparison showed significant differences between models but very few differences between the separable and non-separable cases. In all cases, the observed differences never exceed 5 × 10⁻². For models M1 and M5, pruning with mutual information gives better results. For the other models, pruning with the Hamming distance is better, but the differences with mutual information remain very low (lower than 1.2 × 10⁻²).
• The values of the minsize parameter minimizing the total deviance of the joined trees over the test samples are very often the same as the ones minimizing the matching error. When they differ, they are still very close. This means that, in our simulations, optimizing the total deviance over the test samples is equivalent to optimizing the matching error.
• The optimal minsize value increases with the sample size in all cases.
Finally, we choose the different parameters as follows. We use 10-fold cross-validation of the total deviance of the joined tree and select the value of minsize minimizing this criterion. The parameter mindev is fixed to 0.001. For the pruning stage, we fix δ = 0.3. In their original work, the authors showed that CUBT is considerably more sensitive to the parameter mindist of the pruning stage. Thus, the dissimilarity measure is computed for all sibling nodes in the maximal tree and, rather than choosing a value for mindist, which depends on the data scale, the user may specify the value of the parameter qdist, which corresponds to a quantile of the empirical distribution of the dissimilarities over the maximal tree. The only parameter to be fixed in the joining stage is the number of classes k, and this is done as in the original version of CUBT.
3.7. Results
Table 1 presents the prediction results (test samples) for all data simulation models. We report the matching error, as a percentage, obtained for each clustering algorithm, averaged each time over the 100 replicas. The results using ARI are omitted because they lead to exactly the same conclusions as the matching error.
In terms of prediction, CUBT is always one of the two best performing methods for models M1, M3 and M5, for all sample sizes and for both the separable and non-separable cases. For model M2, CUBT also outperforms the other methods, except when N = 500 in the high-separability case, where DIVCLUS-T and HCA are the best performing methods. For model M4, k-modes, HCA and DIVCLUS-T are among the two best performing methods, outperforming CUBT.
Table 1: Average matching errors (in percentage) over 100 test-sample replicas for all the methods, using the five simulation models and different sample sizes (N), according to the separability level (Sep) of the clusters. Values in bold correspond to the two lowest errors in each case. CUBTMI and CUBTHam correspond to CUBT using, respectively, the mutual information and the Hamming distance for pruning.

Sep   N    k-modes  HCA   DBSCAN  DIVCLUS-T  LCA   COBWEB  CUBTMI  CUBTHam
Model M1
High  100  55.4     55.8  61.3    53.6       54.7  62.2    37.0    36.7
      300  57.6     56.4  63.2    55.3       57.0  61.9    37.5    35.0
      500  59.0     53.4  61.5    55.2       58.4  62.2    37.6    33.1
Low   100  59.5     60.0  65.3    60.4       60.2  73.9    56.4    56.1
      300  58.5     62.0  65.0    60.6       59.6  73.7    58.1    58.7
      500  57.3     62.1  67.0    59.7       59.5  74.0    59.7    59.4
Model M2
High  100  64.1     63.2  75.0    60.9       62.9  68.5    57.3    61.2
      300  63.3     63.2  75.0    58.6       64.8  69.9    62.1    62.4
      500  64.6     61.9  75.0    56.4       63.0  70.4    62.4    62.7
Low   100  66.5     66.4  72.5    65.6       66.0  79.7    64.2    64.3
      300  68.0     68.8  71.6    67.7       68.0  82.3    66.4    67.4
      500  67.9     69.2  71.4    67.3       68.6  81.4    67.9    66.3
Model M3
High  100  59.4     57.6  74.9    56.3       57.6  66.5    43.2    46.1
      300  60.2     59.0  75.0    55.5       59.4  67.2    45.0    38.5
Model M4
High  100  59.3     59.2  62.3    59.6       59.5  71.2    62.5    62.7
      300  61.5     61.1  62.9    60.6       61.5  78.8    63.8    63.9
      500  61.1     60.9  62.7    61.4       61.8  81.3    64.2    64.2
Low   100  59.7     60.0  61.3    59.5       59.9  72.2    63.0    63.3
      300  61.1     61.6  63.3    62.2       61.9  79.4    64.7    64.7
      500  61.7     61.8  63.0    62.4       62.3  81.4    65.2    65.1
Model M5
High  100  56.9     39.5  58.3    55.7       56.3  60.1    31.0    29.7
      300  57.3     35.7  56.3    57.2       57.4  60.2    36.5    34.3
      500  58.3     36.1  59.3    57.2       57.7  60.2    37.6    36.7
Low   100  61.2     60.0  61.7    60.9       61.2  75.6    52.3    52.2
      300  62.5     61.7  63.4    62.1       62.7  78.1    53.2    50.2
      500  63.1     62.3  64.4    63.1       63.8  80.4    48.8    49.8
Both options for pruning, using the mutual information or the Hamming distance, give very similar results in most cases.
4. Real data applications

We present here two applications on real datasets coming from a supervised learning framework [24], where a grouping variable is also available. The first concerns soybean diseases, and the second concerns the Tic-Tac-Toe endgame dataset.

4.1. Soybean disease dataset
We use here the short version of the well-known Michalski soybean disease database [28]. This database contains 47 observations of 35 nominal attributes. Each row in this dataset represents a soybean disease case, described by attributes such as precipitation, temperature or root condition. Four types of soybean diseases are present in this dataset: Diaporthe stem canker (10 cases), charcoal rot (10 cases), Rhizoctonia root rot (10 cases) and Phytophthora rot (17 cases).
First, we run CUBT on the whole dataset, fixing the parameter minsize = 10. The four classes were discovered by CUBT as terminal nodes of the resulting maximal tree (see the left panel of figure 2). The main split found by CUBT separates the cases that differ according to the presence of stem cankers (x23 ∈ {0, 3}). The two other best splits, defined on the two resulting subsets, have the form x7 ∈ {0, 1} and x22 ∈ {1}; these splits discriminate cases according to the damaged area and the color of the canker lesion. The direct interpretation of this partition is in accordance with Fisher's findings using the COBWEB algorithm [12]. For example, the "Charcoal Rot" cluster is defined by Fisher through a larger set of attributes: CUBT uses only these two nominal attributes (stem cankers and area damaged) to retrieve the clusters, while 8 other attributes are needed in the description provided by Fisher.
[Figure 2: Tree structures discovered by CUBT (left) and DIVCLUS-T (right) for the soybean disease dataset. The CUBT tree splits on x21 ∈ {0; 3}, then on x7 ∈ {0, 1} and x22 ∈ {1} to yield clusters C1-C4; the DIVCLUS-T tree first splits on x3 ∈ {0}, then on x21 ∈ {1, 2} and x22 ∈ {1}.]
The right panel of figure 2 shows the tree obtained by DIVCLUS-T. A perfect partition is found for the same dataset, but with a different structure.
Next, we assess CUBT, using either the mutual information or the Hamming distance for the pruning stage, for both learning and prediction on the soybean disease dataset, and compare it to the other clustering methods used in Section 3, i.e., k-modes, HCA, DBSCAN, COBWEB, DIVCLUS-T and LCA. We draw 100 stratified random samples, taking at each iteration two thirds of the observations for learning and the remaining observations as the test sample. For CUBT using both the mutual information (CUBTMI) and the Hamming distance (CUBTHam) for pruning, we select the value of minsize by cross-validating the total deviance of the joined tree; the values obtained for CUBTMI and CUBTHam were 5 and 3, respectively. The learning and prediction results of these experiments are shown in Table 2. CUBTHam provides better results than CUBTMI for both learning and prediction. CUBTHam and DIVCLUS-T outperform the other methods over the learning samples, while CUBTMI and CUBTHam outperform all the other methods over the test samples, in prediction.
21
ACCEPTED MANUSCRIPT
The matching error for CUBTHam is 0 for learning and 0.39 for prediction, while
the error rates for the other methods range from 0 to 0.87 for learning and from
0.39 to 0.56 for prediction.
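The matching error used here measures the disagreement between the recovered clusters and the true classes under the best one-to-one relabeling of the clusters. A minimal sketch of this computation (our own illustration, assuming the number of clusters equals the number of classes; brute-force search over relabelings is adequate for small k):

```python
from itertools import permutations

def matching_error(clusters, classes):
    """Smallest misclassification rate over all one-to-one relabelings
    of the cluster labels onto the class labels."""
    labels = sorted(set(classes))
    best = len(clusters)
    for perm in permutations(labels):
        # Map each cluster label to a candidate class label.
        mapping = dict(zip(sorted(set(clusters)), perm))
        errors = sum(mapping[c] != y for c, y in zip(clusters, classes))
        best = min(best, errors)
    return best / len(clusters)
```

For example, a clustering that exactly mirrors the classes under a label swap has matching error 0.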
Table 2: Learning and prediction results: matching error (in %) for the soybean
disease and Tic-Tac-Toe endgame datasets.

                     k-modes   HCA   DBSCAN   COBWEB   DIVCLUS-T   LCA   CUBTMI   CUBTHam
Soybean diseases
  Learning             14.7    68.9    67.7     86.8      0        18.5    34.8      0
  Prediction           41.0    39.4    56.3     49.8     41.8      40.4    38.8     38.6
Tic-Tac-Toe endgame
  Learning             47.5    37.6    34.7     98.9     42.2      43.1    40.7     40.6
  Prediction           46.6    39.8    34.6     75.8     45.2      45.2    39.6     37.0
4.2. Tic-Tac-Toe endgame dataset
We use here Aha's Tic-Tac-Toe endgame dataset [2]. Each observation is a board
configuration, labeled 1 if the configuration represents a case where player
"x" wins, and −1 if he does not.
As in the previous section, we first optimized the choice of the growing and
pruning parameters (minsize and qdist), and then ran the different methods,
computing their performance over stratified learning and test samples.
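The resampling protocol (stratified draws keeping two thirds of each class for learning) can be sketched as follows; this is our own illustration of the scheme, and the function and variable names are ours:

```python
import random
from collections import defaultdict

def stratified_split(classes, train_frac=2/3, seed=0):
    """Return (train_idx, test_idx), sampling train_frac of the
    observations within each class (stratified sampling)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(classes):
        by_class[y].append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # random draw within the class
        cut = round(len(idx) * train_frac)     # two thirds for learning
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)
```

Repeating such a split 100 times with different seeds and averaging the matching error gives the learning and prediction figures reported in Table 2.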
The learning and prediction results of these experiments are shown in Table 2.
CUBTHam again provides better results than CUBTMI for both learning and
prediction. HCA and DBSCAN outperform the other methods over learning samples
(CUBTHam is the third best performing method), while CUBTHam and DBSCAN
outperform all the other methods over test samples. The matching error for
CUBTHam is 0.41 for the learning samples and 0.37 for prediction, while the
error rates for the other methods range from 0.35 to 0.99 for learning and
from 0.35 to 0.76 for prediction.
5. Conclusion
We have presented an extension of the CUBT algorithm to nominal data, which
uses suitable criteria to handle this type of data, based on Shannon entropy
and mutual information. We have compared this approach to other classical
methods using several data simulation models. Among the compared methods,
CUBT exhibits excellent performance and outperforms most of the methods it
is compared with, mainly over test samples. In all cases, CUBT has at least
three practical advantages over the other methods: the obtained partition is
interpretable in terms of the original variables, the obtained tree may be
used to simply assign a cluster to new observations, and the algorithm is
parallelizable and thus usable for large datasets.
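As a point of reference for the entropy-based criteria mentioned above, the Shannon entropy of a nominal variable within a cluster can be computed as follows. This is a generic illustration only; the exact growing and pruning criteria used by CUBT are those defined earlier in the paper.

```python
from collections import Counter
from math import log2

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of a
    nominal variable: 0 for a perfectly homogeneous cluster, maximal
    when all levels are equally frequent."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())
```

A split that lowers the total within-cluster entropy of the nominal variables thus produces more homogeneous children, which is the intuition behind entropy-based heterogeneity criteria.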
We also provided heuristics for selecting the two main parameters of CUBT:
the minimum size of the leaves for the maximal tree and the minimum distance
for pruning, and justified these choices using extensive simulations with
various models. These tuning methods may also be used for continuous datasets.
CUBT gives excellent results on the tested real datasets when compared with
the other methods. Future work will focus on new criteria to handle mixed
data, for example by mixing additive criteria [21, 22]. Another issue under
consideration is the importance of variables in CUBT.
Acknowledgments
The authors are grateful to the director of the public health unit at Aix
Marseille University for his financial support and to Claude Deniau for his
valuable and helpful comments concerning this work. This work is partially
supported by the program ECOS-sud U14E02.
References
[1] Agresti A. (2002) Categorical Data Analysis. John Wiley & Sons, Hoboken.
[2] Aha, D. W. (1991) Incremental constructive induction: An instance-based
approach. Proceedings of the Eighth International Workshop on Machine
Learning, 117–121. Evanston, IL: Morgan Kaufmann.
[4] Bock, R. D. (1972) Estimating item parameters and latent ability when
responses are scored in two or more nominal categories. Psychometrika, 37(1),
29–51.
[5] Breiman, L., Friedman, J., Stone, C. J. and Olshen, R. A. (1984)
Classification and Regression Trees. Chapman & Hall/CRC, Monterey, CA.
[8] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum Likelihood
from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical
Society, Series B, 39(1), 1–38.
[9] De Raedt, L. and Blockeel, H. (1997) Using Logical Decision Trees for
Clustering. In Proceedings of the 7th International Workshop on Inductive
Logic Programming, 133-140, Springer-Verlag.
[10] Dunn, J. C. (1973) A Fuzzy Relative of the ISODATA Process and Its Use
in Detecting Compact Well-Separated Clusters. Journal of Cybernetics, 3(3),
32–57. doi:10.1080/01969727308546046.
[11] Ester, M., Kriegel, H. P., Sander, J. and Xu, X. (1996) A density-based
algorithm for discovering clusters in large spatial databases with noise.
Proceedings of the Second International Conference on Knowledge Discovery
and Data Mining, 226–231.
[12] Fisher, D. H. (1987) Knowledge Acquisition Via Incremental Conceptual
Clustering. Machine Learning, 2(2), 139–172. doi:10.1023/A:1022852608280.
[13] Fraiman, R., Ghattas, B. and Svarc, M. (2013) Interpretable clustering
using unsupervised binary trees. Advances in Data Analysis and Classification,
7, 125–145.
[17] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and
Witten, I. H. (2009) The WEKA Data Mining Software: An Update. SIGKDD
Explorations, 11(1).
[18] Hennig C. (2014) fpc: Flexible procedures for clustering. R package version
2.1-9, http://CRAN.R-project.org/package=fpc.
[19] Huang, Z. (1998) Extensions to the k-modes algorithm for clustering large
data sets with categorical values. Data Mining and Knowledge Discovery,
2(3), 283–304.
[20] Hubert, L. and Arabie, P. (1985) Comparing partitions. Journal of
Classification, 2(1), 193–218.
[21] Jagannatha Reddy, M. V. and Kavitha, B. (2012) Clustering the Mixed
Numerical and Categorical Dataset using Similarity Weight and Filter Method.
International Journal of Database Theory and Application, 5(1), 121–134.
[22] … algorithms for handling numerical and categorical data: a review.
arXiv:1311.7219v1.
[23] Leisch, F. (2006) A tool box for K-centroids cluster analysis. Computational
Statistics and Data Analysis, 51(2), 526–544.
[26] Liu, B., Xia, Y. and Yu, P. S. (2000) Clustering through decision tree
construction. Proceedings of the Ninth International Conference on Information
and Knowledge Management (CIKM 2000).
[27] MacQueen, J. (1967) Some methods for classification and analysis of
multivariate observations. Proceedings of the Fifth Berkeley Symposium on
Mathematical Statistics and Probability, Eds. L. M. Le Cam and J. Neyman,
1, 281–297.
[28] Michalski, R. S. (1980) Learning by being told and learning from
examples: an experimental comparison of the two methods of knowledge
acquisition in the context of developing an expert system for soybean
disease diagnosis. International Journal of Policy Analysis and Information
Systems, 4(2), 125–161.
[29] Michel, P., Boyer, L., Baumstarck, K., Fernandez, O., Flachenecker, P.,
Pelletier, J., Loundou, A., Ghattas, B. and Auquier, P. (2014) Defining
quality of life levels to enhance clinical interpretation in multiple
sclerosis: application of a novel clustering method. Medical Care,
PMID: 24638117.
[30] Michel, P., Auquier, P., Baumstarck, K., Loundou, A., Ghattas, B.,
Lancon, C. and Boyer, L. (2015) How to interpret multidimensional quality
of life questionnaires for patients with schizophrenia? Quality of Life
Research, 24(10), 2483–2492. doi:10.1007/s11136-015-0982-y.
[34] Quinlan, J. R. (1992) C4.5: Programs for Machine Learning. Morgan
Kaufmann Publishers, Inc.
[35] Reif, M. (2014) mcIRT: IRT models for multiple choice items. R package
version 0.41. https://github.com/manuelreif/mcIRT.
(2684): 677–680.
[39] Vermunt, J. K. and Magidson, J. (2002) Latent class cluster analysis.
In: J. A. Hagenaars and A. L. McCutcheon (Eds.), Applied Latent Class
Analysis, 89–106. Cambridge: Cambridge University Press.
[40] Weihs, C., Ligges, U., Luebke, K. and Raabe, N. (2005) klaR – Analyzing
German Business Cycles. In Baier, D., Decker, R. and Schmidt-Thieme, L.
(Eds.), Data Analysis and Decision Support, 335–343, Springer-Verlag, Berlin.
doi:10.1007/s10994-009-5121-y.
Badih Ghattas
Aix Marseille University
Faculté des Sciences de Luminy
163 Avenue de Luminy
13009 Marseille
Badih Ghattas
A senior statistician with forty publications in the field of statistics. He
has several high-level theoretical papers on classification, clustering and
density estimation, and several papers with applications in medicine (public
health, neurology, oncology, MRI), ecology and bioinformatics.
Pierre Michel
A PhD student supervised by a statistician and an epidemiologist. He works
mainly in the domain of public health and contributes to the development of
new approaches for clustering and variable selection in the context of
computerized adaptive tests for quality of life questionnaires.
Laurent Boyer
A doctor in psychiatry, he works mainly on schizophrenia and multiple
sclerosis. He has contributed to the development of advanced tools for
measuring quality of life for patients suffering from such chronic diseases.
He is part of the MusiQol team