C-Fuzzy Decision Trees
Abstract—This paper introduces a concept and design of decision trees based on information granules—multivariable entities characterized by high homogeneity (low variability). As such granules are developed via fuzzy clustering and play a pivotal role in the growth of the decision trees, they will be referred to as C-fuzzy decision trees. In contrast with "standard" decision trees, in which one variable (feature) is considered at a time, this form of decision tree involves all variables at each node of the tree. Obviously, this gives rise to a completely new geometry of the partition of the feature space that is quite different from the guillotine cuts implemented by standard decision trees. The growth of the C-decision tree is realized by expanding the node of the tree characterized by the highest variability of the information granule residing there. This paper shows how the tree is grown depending on some additional node expansion criteria, such as the cardinality (number of data) at a given node and the level of structural dependencies (structurability) of the data existing there. A series of experiments is reported using both synthetic and machine learning data sets. The results are compared with those produced by the "standard" version of the decision tree (namely, C4.5).

Index Terms—Decision trees, depth-and-breadth tree expansion, experimental studies, fuzzy clustering, node variability, tree growing.

I. INTRODUCTION

… of the attributes existing in the problem at hand) is quantified by means of some criterion such as entropy, the Gini index, etc. [13], [5]. Third, decision trees in their generic version are predominantly applied to discrete-class problems (continuous prediction problems are handled by regression trees). Interestingly, these three fundamental features somewhat restrict the nature of the trees and identify a range of applications that are pertinent in this setting. When dealing with continuous attributes, it is evident that discretization is a must. As such, it directly impacts the performance of the tree. One may argue that the discretization requires some optimization that, by being guided by the classification error, can be realized once the development of the tree has been finalized. In this sense, the second phase (namely, the way in which the tree has been grown and the sequence of the attributes selected) is inherently affected by the discretization mechanism. In a nutshell, it means that these two design steps cannot be disjointed. The growth of the tree relying on a choice of a single attribute can also be treated as a conceptual drawback. While being quite simple and transparent, it could well be that considering two or more attributes as an indivisible group of variables occurring as the discrimination condition located at some node of the tree may lead to a better tree.
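To make the geometric contrast concrete, the short sketch below (illustrative only; the data, threshold, and prototypes are made up and do not come from the paper) compares an axis-parallel guillotine cut on a single attribute with a prototype-based assignment in which every attribute contributes to the decision.

```python
import numpy as np

# Hypothetical 2-D patterns (rows); used only for illustration.
X = np.array([[0.2, 3.1], [1.4, 0.5], [2.8, 2.2]])

# Axis-parallel ("guillotine") test used by a standard decision tree:
# only attribute 0 is inspected, the cut is the vertical line x0 = 1.0.
guillotine = X[:, 0] <= 1.0

# Prototype-based assignment used at a C-fuzzy tree node: every attribute
# contributes through the distance to each cluster prototype (assumed values).
prototypes = np.array([[0.5, 3.0], [2.5, 1.0]])
dists = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
cluster = dists.argmin(axis=1)                 # nearest prototype

print(guillotine)   # e.g. [ True False False]
print(cluster)      # e.g. [0 1 1]
```

The first test partitions the plane with a straight vertical line, while the second induces boundaries determined by distances to the prototypes; the latter is the geometry exploited at the nodes of the C-fuzzy tree.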
A. Fuzzy Clustering
Fuzzy clustering is a core functional part of the overall tree.
It builds the clusters and provides their full description. We con-
fine ourselves to the standard fuzzy C-means (FCM), which is
an omnipresent technique of information granulation. The de-
scription of this algorithm is well documented in the literature.
We refer the reader to [2] and [11] and revisit it here in the set-
ting of the decision trees. The FCM algorithm is an example
of an objective-oriented fuzzy clustering where the clusters are
built through a minimization of some objective function. The standard objective function assumes the format

Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \, \| \mathbf{x}_k - \mathbf{v}_i \|^{2}    (1)

where u_{ik} denotes the membership grade of the kth pattern \mathbf{x}_k in the ith cluster, \mathbf{v}_i is the prototype of this cluster, and m > 1 is the fuzzification coefficient. The iterative minimization of (1) alternates the prototype update with the partition update

u_{ik} = 1 \Big/ \sum_{j=1}^{c} \left( \frac{\| \mathbf{x}_k - \mathbf{v}_i \|}{\| \mathbf{x}_k - \mathbf{v}_j \|} \right)^{2/(m-1)}    (3)

while the nodes of the tree are described by (5). Here, \mathbf{X}_i denotes all elements of the data set that belong to the given node by virtue of the highest membership grade:

(7)

(8)

Fig. 3. Node splitting controlled by the variability criterion V.
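For readers who prefer code, a minimal FCM loop consistent with (1) and (3) can be sketched as follows; this is an illustrative implementation rather than the authors' code, and the fuzzification coefficient m = 2, the iteration cap, and the tolerance are assumed defaults.

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Minimal fuzzy C-means: returns prototypes v (c x d) and partition matrix U (c x N)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    U = rng.random((c, N))
    U /= U.sum(axis=0, keepdims=True)                 # each column sums to 1
    for _ in range(n_iter):
        Um = U ** m
        v = (Um @ X) / Um.sum(axis=1, keepdims=True)  # prototype update
        d = np.linalg.norm(X[None, :, :] - v[:, None, :], axis=2) + 1e-12
        U_new = d ** (-2.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0, keepdims=True)     # partition update, cf. (3)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return v, U
```

At a given node of the C-tree, the clustering is applied to the data residing at that node; the columns of U then provide the membership grades that decide how the data are distributed among the node's children.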
In the next step, we select the node of the tree (leaf) that has the highest value of V and expand this node by forming its children through clustering of the associated data set into c clusters. The process is then repeated: we examine the leaves of the tree and expand the one with the highest value of the diversity criterion.
We treat a C-decision tree as a regular tree structure whose nodes are described by (5) and in which each nonterminal node has c children.
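The growth strategy just described amounts to a simple loop that always splits the leaf with the largest diversity; in the sketch below, `node_variability` and `split_with_fcm` are placeholders standing for the paper's variability criterion V (not reproduced in this excerpt) and for one FCM run on the node's data.

```python
def grow_c_tree(root, n_iterations, c, node_variability, split_with_fcm):
    """Depth-and-breadth expansion: repeatedly split the leaf with the highest variability.

    `node_variability(node)` and `split_with_fcm(node, c)` are placeholders for the
    variability criterion V and for one FCM run on the data residing at the node.
    """
    leaves = [root]
    for _ in range(n_iterations):
        worst = max(leaves, key=node_variability)   # leaf with the highest diversity
        children = split_with_fcm(worst, c)         # c children, one per cluster
        leaves.remove(worst)
        leaves.extend(children)
    return leaves
```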
The growth of the tree is controlled by conditions under which the clusters can be further expanded (split). We envision two intuitively appealing conditions that tackle the nature of the data behind each node. The first one is self-evident: a given node can be expanded only if it contains enough data points. With c clusters, we require this number to be greater than the number of clusters; otherwise, the clusters cannot be formed. While this is the lower bound on the cardinality of the data, practically we would expect this number to be a multiple of c, say 2c, 3c, etc.
The second stopping condition pertains to the structure of the data that we attempt to discover through clustering. Once we approach smaller subsets of data, the dominant structure (which is strongly visible at the level of the entire, far more numerous data set) may not manifest itself that profoundly in the subset. It is likely that the smaller the data set, the less pronounced its structure. This becomes reflected in the entries of the partition matrix, which tend to be equal to each other and equal to 1/c. If no structure is present, this equal distribution of membership grades occurs across each column of the partition matrix. This lack of visible structure can be quantified by the expression (for the kth pattern)

(10)

If all entries of the partition matrix are equal to 1/c, then the result is equal to zero. If we encounter a full membership in a certain cluster, then the resulting value is equal to 1 (that is a …).

(11)

Again, with no structurability present in the data, this expression returns a zero value.
To gain a better feel for the lack of structure and the ensuing values of (11), let us consider a case where all entries in a certain column of the partition matrix (pattern) are equal to 1/c with some slight deviation ε. In 50% of the cases, we consider that these entries are higher than 1/c, and we put u = 1/c + ε; in the remaining 50%, we consider the decrease over 1/c and have u = 1/c − ε. Furthermore, let us treat ε as a fraction of the original membership grade, that is, make it equal to β(1/c), where β lies in the interval (0, 1/2). Then, (11) reads as

(12)

The plot of this relationship treated as a function of β is shown in Fig. 4. It shows how the departure from the situation where no structure has been detected (β = 0) manifests itself in the values of the structurability expression. The plot shows several curves over the number of clusters (c); higher values of c lead to a substantial drop in the values of the index (9).
The two measures introduced previously can be used as stopping criteria in the expansion (growth) of the tree. We can leave the node intact once the number of patterns falls under the assumed threshold and/or the structurability index is too low. The first index is a sort of precondition: if it is not satisfied, it prevents us from expanding the node. The second index comes in the form of a postcondition: to compute its value, we have to cluster the data first and then determine its value. It is also stronger, as one may encounter cases where there are a significant number of data points to warrant clustering in light of the first criterion; however, the second one concludes that the underlying structure is too "weak," and this may advise us to backtrack and refuse to expand this particular node.

TABLE I
EXPERIMENTAL SETTING OF THE FCM ALGORITHM
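Both conditions can be folded into one admissibility check. In the sketch below, the structurability computation is an assumed stand-in (the normalized dispersion of each partition-matrix column around 1/c, which is 0 for uniform memberships and 1 for crisp ones); the paper's exact expressions (9)–(11) are not reproduced in this excerpt, and the default thresholds are illustrative.

```python
import numpy as np

def can_expand(U, c, min_mult=2, struct_threshold=0.05):
    """Pre/postcondition check for splitting a node.

    U is the c x N partition matrix produced by clustering the node's data.
    The structurability formula below is an assumed stand-in for the paper's
    index, not the authors' exact expression.
    """
    N = U.shape[1]
    if N <= min_mult * c:                      # precondition: enough data points
        return False
    per_pattern = (c / (c - 1.0)) * ((U - 1.0 / c) ** 2).sum(axis=0)
    structurability = per_pattern.mean()       # postcondition: visible structure
    return structurability >= struct_threshold
```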
These two indexes support decision making that focuses on the structural aspects of the data (namely, we decide whether to split the data). They do not guarantee that the resulting C-tree will be the best from the point of view of classification or prediction of a continuous output variable. The diversity criterion (the sum of V at the leaves) can also be viewed as another termination criterion. While conceptually appealing, we may have difficulties in "translating" its values into a more tangible and appealing descriptor of the tree (obviously, the lower, the better). Another possible termination option (which may apply equally well to each of these three indexes) is to monitor their changes along the nodes of the tree as it is being built; an evident saturation of the values of each of them could be treated as a signal to stop growing the tree.

Fig. 6. Classification boundaries (thick lines) for some configurations of the prototypes formed by (13) and hyperboxes: (a) v1 = [1.5 1.2], v2 = [2.5 2.1], v3 = [0.6 3.5] and (b) v1 = [1.5 1.5], v2 = [1.5 2.6], v3 = [0.6 3.5]. Also shown are contour plots of the membership functions of the three clusters.

IV. USE OF THE TREE IN THE CLASSIFICATION (PREDICTION) MODE

Once the C-tree has been constructed, it can be used to classify a new input x or predict the value of the associated output variable (denoted here by y). In the calculations, we rely on the membership grades computed for each cluster as follows:

u_i(\mathbf{x}) = 1 \Big/ \sum_{j=1}^{c} \left( \frac{\| \mathbf{x} - \mathbf{v}_i \|}{\| \mathbf{x} - \mathbf{v}_j \|} \right)^{2/(m-1)}    (13)

where \| \mathbf{x} - \mathbf{v}_i \| is the distance computed between x and the prototype v_i (as a matter of fact, we have the same expression as used in the FCM method; refer to (3)). The calculations pertain to the leaves of the C-tree, so for several levels of depth we have to traverse the tree first to reach the specific leaves. This is done by computing …
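One plausible way to carry out this traversal, consistent with the membership-based assignment used when the tree is built (the exact rule is cut off in this excerpt), is to follow at each level the child with the highest membership computed via (13); the node attributes `prototypes`, `children`, and `representative` are illustrative names, not the authors' data structures.

```python
import numpy as np

def memberships(x, prototypes, m=2.0):
    """FCM-style membership of x in each cluster, cf. (13)."""
    d = np.linalg.norm(prototypes - x, axis=1) + 1e-12
    w = d ** (-2.0 / (m - 1.0))
    return w / w.sum()

def predict(node, x):
    """Traverse the C-tree to a leaf and return that leaf's representative output."""
    while node.children:                        # nonterminal node
        u = memberships(x, node.prototypes)
        node = node.children[int(u.argmax())]   # follow the highest membership
    return node.representative                  # node's predicted value, cf. (8)
```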
Fig. 9. Top level of the C-decision tree; note two clusters of different values
of the variability index; the cluster with its higher value is shaded.
Fig. 11. Complete C-tree; note that the values of the variability criterion have reached zero at all leaves, and this terminates further growth of the tree.
… several discrete classes). Experiments 1 and 2 concern two-dimensional (2-D) synthetic data sets. Data used in experiments 3 and 4 come from the Machine Learning repository (http://www.ics.uci.edu/~mlearn/MLRepository.html), which makes the experiments fully reproducible and facilitates further comparative analysis. The data sets we experimented with are as follows: (experiment 3) auto-mpg [9] and, in experiment 4, (a) pima diabetes [9], (b) ionosphere [9], (c) hepatitis [9], (d) dermatology [9], and (e) auto data [14]. The first one deals with a continuous problem, while the other ones concern discrete-class data. In all experiments, we use the FCM algorithm with the settings summarized in Table I. As far as the learning and prediction abilities of the tree are concerned, we proceed with a fivefold cross-validation that generates an 80–20 split by taking 80% of the data as a training set and testing the tree on the remaining 20% of the data set. Furthermore, the experiments are repeated five times by taking splits of the data into the training and testing part, respectively.
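This evaluation protocol can be reproduced along the following lines; `build_c_tree` and `evaluate` are placeholders for the tree construction and the error computation of (14), and scikit-learn's KFold is used only for convenience.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, build_c_tree, evaluate, repeats=5, seed=0):
    """Fivefold CV (80-20 splits), repeated `repeats` times with reshuffled data."""
    errors = []
    for r in range(repeats):
        kf = KFold(n_splits=5, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in kf.split(X):
            tree = build_c_tree(X[train_idx], y[train_idx])
            errors.append(evaluate(tree, X[test_idx], y[test_idx]))
    return float(np.mean(errors)), float(np.std(errors))
```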
A. Experiment 1

Experiment 1 uses a 2-D synthetic data set generated by uniform-distribution random generators. The training set comprises 239 data points (patterns), while the testing set consists of 61 patterns. These two data sets are visualized in Figs. 7 and 8, respectively.
The results of the FCM (with c = 2) are visualized in Fig. 9. Here, we report the prototypes of each cluster (v), the values of the splitting criterion (V), the number of data points from the training set allocated to the cluster (N), and the predicted value at the node (class), which is rounded to the nearest integer value of y; this is evident, as we are dealing with two discrete classes (labels) of patterns.
In the next step, we select the first node of the tree, which is characterized by the highest value of the variability index, and expand it by forming two children nodes by applying the FCM algorithm to the data associated with this original node. The decision tree grown in this manner is visualized in Fig. 10. At the next step, we select the second node of the tree (the one with the highest variability) and expand it in the same way as before (see Fig. 11).
As expected, the classification error is equal to zero for both the training and the testing set. This is not surprising, considering that the classes are positioned quite far apart from each other.

B. Experiment 2

The two-dimensional synthetic patterns used in this experiment are normally distributed, with some overlap between the classes (see Figs. 12 and 13). The resulting C-decision tree is visualized in Fig. 14. For this tree, the average classification error on the training data is equal to 0.001250 (with a standard deviation equal to 0.002013). For the testing data, these numbers are equal to 0.188333 and 0.043780, respectively.

Fig. 14. Complete C-tree; as the values of the variability criterion have reached zero at all leaves, this terminates further growth of the tree.
C. Experiment 3

The auto-mpg data set [9] involves a collection of vehicles described by a number of features (such as the weight of a vehicle, the number of cylinders, and fuel consumption). We complete a series of experiments in which we sweep through the number of clusters (c), varying it from 1 to 20, and carry out 20 expansions (iterations) of the tree (the tree is expanded step by step, leading either to its in-depth or breadth expansion). The variability observed at all leaves of the tree characterizes the process of the growth of the tree (refer to Figs. 1 and 2).
The variability measure is reported for the training and testing set as visualized in Figs. 15 and 16. The variability goes down with the growth of the tree, and this becomes evident for the training and testing data. It is also clear that most of the changes in the reduction of the variability occur at the early stages of the growth of the trees; afterwards, the changes are quite limited. Likewise, we note that the drop in the variability values becomes visible when moving from two to three or four clusters.
Fig. 15. Variability of the tree reported during the growth of the tree (training data) for a selected number of clusters.
Fig. 16. Variability of the tree reported during the growth of the tree (testing data) for a selected number of clusters.
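The sweep behind Figs. 15 and 16 is essentially two nested loops, over the number of clusters and over successive expansions, recording the summed leaf variability after each step; the helper functions below are placeholders, and the default ranges mirror the 1–20 clusters and 20 expansions mentioned above (a cluster count of at least two is assumed here so that the clustering is meaningful).

```python
def variability_sweep(data, build_root, expand_once, total_variability,
                      c_values=range(2, 21), n_expansions=20):
    """Record the summed leaf variability for each (c, iteration) pair."""
    results = {}
    for c in c_values:
        tree = build_root(data, c)
        for it in range(1, n_expansions + 1):
            expand_once(tree, c)                     # one more split of the worst leaf
            results[(c, it)] = total_variability(tree)
    return results
```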
Noticeably, for the increased number of clusters, the variability is practically left unaffected (we see a series of barely distinguishable plots for c greater than five). This effect is present for the training and testing data.
After taking a careful look at the variability of the tree, we conclude that the optimal configuration occurs at five clusters with the number of expansions equal to seven. In this case, the resulting tree is portrayed in Fig. 17. The arrows shown there, along with the labels (numbers), visualize the growth of the tree, namely the way in which the tree is grown (expanded in consecutive iterations). The numbers in circles denote the node number. The last digit of the node number denotes the cluster (child) number, while the preceding digits denote the parent node number; for instance, a node labeled 23 would be the third child formed when splitting node 2. A detailed description of the nodes is given in Table II. Again, we report the details of the tree, including the number of patterns residing at each node (N), as well as their variability (V) and the predicted value at the node (8).
Fig. 17. Detailed C-decision tree for the optimal number of clusters and iterations; see a detailed description in text.
While the variability criterion is an underlying measure in the design process, the predictive capabilities of the C-decision tree are quantified by the following performance index:

(14)

In the above expression, the first term denotes the predicted value occurring at the corresponding terminal node (refer to (8)); more specifically, the representative of this node in the output space is calculated as a weighted sum of those elements from the training set that contribute to this node. The second term is the output value encountered in the data set. For a discrete (classification) problem, this index is simply the classification error, determined by counting the number of patterns that have been misclassified by the C-decision tree.
The values of the error obtained on the training set for a number of different configurations of the tree (number of clusters and iterations) are shown in Fig. 18. Again, most of the changes occur for low values of the number of clusters and are characteristic of the early phases of the growth of the tree. We see that low values of c do not reduce the error even with a substantial growth (number of iterations) of the tree. Similarly, we observe that the same effect occurs for the testing set (see Fig. 19) (obviously, these results are reported for the sake of completeness; in practice, we choose the "best" tree on the basis of the training set and test it by means of the testing data).
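Since the exact form of (14) is not reproduced in this excerpt, the sketch below uses common stand-ins: a mean squared difference between the leaf representatives and the observed outputs for the continuous case, and the count of misclassified patterns for the discrete case, matching the counting procedure described above.

```python
import numpy as np

def continuous_error(y_pred, y_true):
    """Stand-in for (14): mean squared difference between predicted and observed outputs."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    return float(np.mean((y_pred - y_true) ** 2))

def classification_error(y_pred, y_true):
    """Discrete case: count of patterns misclassified by the C-decision tree."""
    return int(np.sum(np.asarray(y_pred) != np.asarray(y_true)))
```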
D. Experiment 4

Again, we use a data set from the Machine Learning repository [9]: the two-class pima-diabetes set consisting of 768 patterns distributed in an eight-dimensional feature space. In the design of the C-tree, we use the development procedure outlined in the previous section: a node with the maximal value of diversity is chosen as a potential candidate to expand (split into clusters). Prior to this node unfolding, we check whether there is a sufficient number of patterns located there (here, we consider that the criterion is satisfied when this number is greater than the number of clusters). Once this holds, we proceed with the clustering and then look at the structurability index (9), whose value should be greater than or equal to 0.05 (this threshold value has been selected arbitrarily) in order to accept the unfolding of the node. The number of iterations is set to 10.
TABLE II
DESCRIPTION OF THE NODES OF THE C-DECISION TREE INCLUDED IN FIG. 17

Fig. 18. Average error of the C-decision tree reported for the training set.
Fig. 19. Average error of the C-decision tree reported for the testing set.
The plots of the variability shown in Fig. 20 point out that the number of iterations does not make a substantial difference in the value of V; the changes occur mainly during the first few iterations (expansions of the tree). Similarly, there are changes when we increase the number of clusters from two up to six, but beyond this the changes are not very significant.
When using the C-decision tree in the predictive mode, its performance is evaluated by means of (14). The collection of pertinent plots is shown in Figs. 21 and 22 for the training data (the performance results are averaged over the series of experiments). Similarly, the results of the tree on the testing set are visualized in Fig. 23. Evidently, with the increase in the number of clusters, we note a drop in the error values; yet, values of c that are too high lead to some fluctuations of the error (so it is not evident that growing larger trees is still fully legitimate). Such fluctuations are even more pronounced when studying the plots of the error reported for the testing set. In a search for the "optimal" configuration, we have found that a number of clusters between three and five and a few iterations led to the best results (see Fig. 24). We observe an evident tendency: while growing larger trees is definitely beneficial in the case of the training data (generally, the error is reduced, with a few exceptions), the error does not change very visibly on the testing data (where the changes are in the range of 1%).
It is of interest to compare the results produced by the C-decision tree with those obtained when applying a "standard" decision tree, namely C4.5. In this analysis, we have experimented with the software available on the Web (http://www.cse.unsw.edu.au/~quinlan/), which is C4.5 revision 8 run with the standard settings (i.e., selection of the attribute that maximizes the information gain; no pruning was used). The results are summarized in Table III. Following the experimental scenario outlined at the beginning of the section, we report the mean values and the standard deviation of the error. For the C-decision trees, the number of nodes is equal to the number of clusters multiplied by the number of iterations.
Fig. 21. Error (e) as a function of iterations (expansion of the tree) for the
training set for selected number of clusters.
Fig. 24. Classification error for the C-decision tree versus successive iterations (expansions) of the tree; c = 3 and 5; both training and testing sets are included.
Overall, we note that the C-tree is more compact (in terms of the number of nodes). This is not surprising, as its nodes are more complex than those in the original decision tree. If our intent is to have smaller and more compact structures, C-trees become quite an appealing architecture. The results on the training sets are better for the C-trees, at the level of a 3%–6% improvement (for the pima data set). The standard deviation of the error is two times lower for these trees in comparison with the C4.5. For the testing set, we note that the larger of the two C-trees in Table III produces almost the same results as the C4.5. With the smaller C-tree, we note an increase in the classification rate by 1% in comparison with the larger structure; however, the size of the tree has been reduced to one half of that of the larger tree (on the pima data). The increase in the size of the tree does not dramatically affect the classification results; the classification rates tend to be better but do not differ significantly from structure to structure. In general, we note that the C-decision tree produces more consistent results in terms of the classification for the training and testing sets; these are closer when compared with the results produced by the C4.5 tree. In some cases, the results of the C-decision tree are better than those of C4.5; this happens for the hepatitis data.

Fig. 22. Error (e) as a function of the number of clusters for selected iterations.

VI. CONCLUSION

C-decision trees are classification constructs that are built on the basis of information granules—fuzzy clusters. The way in which these trees are constructed deals with successive refinements of the clusters (granules) forming the nodes of the tree. When growing the tree, the nodes (clusters) are split into granules of lower diversity (higher homogeneity). In contrast to C4.5-like trees, all features are used at once at each node, and such a development approach promotes more compact trees and a versatile geometry of the partition of the feature space. The experimental studies illustrate the main features of the C-trees.
TABLE III
C-DECISION TREE AND C4.5: A COMPARATIVE ANALYSIS FOR SEVERAL MACHINE LEARNING DATA SETS: (a) PIMA-DIABETES, (b) IONOSPHERE, (c) HEPATITIS (IN THIS DATA SET, ALL MISSING VALUES WERE REPLACED BY THE AVERAGES OF THE CORRESPONDING ATTRIBUTES), (d) DERMATOLOGY, AND (e) AUTO DATA
One of them is quite profound and highly desirable for any practical usage: the difference in the performance of the C-trees between the training and testing sets is lower than the one reported for the C4.5.
The C-tree is also sought as a certain blueprint of some detailed models that can be formed on a local basis by considering the data allocated to the individual nodes. At this stage, the models are refined by choosing their topology (e.g., linear models or neural networks) and making a decision about detailed learning.