
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 35, NO. 4, NOVEMBER 2005

C-Fuzzy Decision Trees


Witold Pedrycz, Fellow, IEEE, and Zenon A. Sosnowski, Member, IEEE

Abstract—This paper introduces a concept and design of decision trees based on information granules—multivariable entities characterized by high homogeneity (low variability). As such granules are developed via fuzzy clustering and play a pivotal role in the growth of the decision trees, they will be referred to as C-fuzzy decision trees. In contrast with "standard" decision trees, in which one variable (feature) is considered at a time, this form of decision tree involves all variables at each node of the tree. Obviously, this gives rise to a completely new geometry of the partition of the feature space that is quite different from the guillotine cuts implemented by standard decision trees. The growth of the C-decision tree is realized by expanding the node of the tree characterized by the highest variability of the information granule residing there. This paper shows how the tree is grown depending on some additional node expansion criteria, such as the cardinality (number of data) at a given node and the level of structural dependencies (structurability) of the data existing there. A series of experiments is reported using both synthetic and machine learning data sets. The results are compared with those produced by the "standard" version of the decision tree (namely, C4.5).

Index Terms—Decision trees, depth-and-breadth tree expansion, experimental studies, fuzzy clustering, node variability, tree growing.

I. INTRODUCTION

DECISION trees [12], [13] are commonly used architectures of machine learning and classification systems. They come with a comprehensive list of various training and pruning schemes, a diversity of discretization (quantization) algorithms, and a series of detailed learning refinements [1], [3]–[7], [10], [11], [15], [16]. In spite of such variety of the underlying development activities, one can easily witness several fundamental properties that cut quite evidently across the entire spectrum of decision trees. First, the trees operate on discrete attributes that assume a finite (usually quite small) number of values. Second, in the design procedure, one attribute is chosen at a time. More specifically, one selects the most "discriminative" attribute and expands (grows) the tree by adding the node whose attribute's values are located at the branches originating from this node. The discriminatory power of the attribute (which stands behind its selection out of the collection of the attributes existing in the problem at hand) is quantified by means of some criterion such as entropy, the Gini index, etc. [13], [5]. Third, decision trees in their generic version are predominantly applied to discrete class problems (continuous prediction problems are handled by regression trees). Interestingly, these three fundamental features somewhat restrict the nature of the trees and identify a range of applications that are pertinent in this setting. When dealing with continuous attributes, it is evident that discretization is a must. As such, it directly impacts the performance of the tree. One may argue that the discretization requires some optimization that, by being guided by the classification error, can be realized once the development of the tree has been finalized. In this sense, the second phase (namely, the way in which the tree has been grown and the sequence of attributes selected) is inherently affected by the discretization mechanism. In a nutshell, this means that these two design steps cannot be disjointed. The growth of the tree relying on the choice of a single attribute can also be treated as a conceptual drawback. While being quite simple and transparent, it could well be that considering two or more attributes as an indivisible group of variables occurring as the discrimination condition located at some node of the tree may lead to a better tree.

Having these shortcomings clearly identified, the objective of this study is to develop a new class of decision trees that attempts to alleviate these problems. The underlying conjecture is that data can be perceived as a collection of information granules [11]. Thus, the tree becomes spanned over these granules, treated now as fundamental building blocks. In turn, information granules and information granulation are almost a synonym of clusters and clustering [2], [8]. Subscribing to the notion of fuzzy clusters (and fuzzy clustering), the authors intend to capture the continuous nature of the classes so that there is no restriction of the use of these constructs to discrete problems. Furthermore, fuzzy granulation helps link the discretization problem with the formation of the tree in a direct and intimate manner. As it becomes evident that fuzzy clusters are the central concept behind the generalized tree, they will be referred to as cluster-oriented decision trees or C-decision trees, for short.
The material of this study is organized in the following manner. Section II provides an introduction of the architecture of the tree by discussing the underlying processes of its in-depth and in-breadth growth. Section III brings more details on the development of the C-trees, where we concentrate on the functional and algorithmic details (fuzzy clustering and various criteria of node splitting leading to the specific pattern of tree growing). Next, Section IV elaborates on the use of the trees in the classification or prediction mode. A series of comparative numeric studies is presented in Section V.

Manuscript received February 10, 2004; revised June 9, 2004 and September 9, 2004. This work was supported in part by the Canada Research Chair (CRC) Program of the Natural Science and Engineering Research Council of Canada (NSERC). The work of Z. A. Sosnowski was supported in part by the Technical University of Bialystok under Grant W/WI/8/02.

W. Pedrycz is with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2V4, Canada, and also with the Systems Research Institute, Polish Academy of Sciences, 01-447 Warsaw, Poland (e-mail: pedrycz@ee.ualberta.ca).

Z. A. Sosnowski is with the Department of Computer Science, Technical University of Bialystok, Bialystok 15-351, Poland (e-mail: ezenon@ii.pb.bialystok.pl).

Digital Object Identifier 10.1109/TSMCC.2004.843205

The study adheres to the standard notation and notions used in the literature of machine learning and fuzzy sets. The way of evaluating the performance of the tree is standard, as fivefold cross-validation is used. More specifically, in each pass, an 80–20 split of the data into the training and testing set, respectively, is generated, and the experiments are repeated for five different splits of training and testing data (rotation method), which helps us gain high confidence about the results. As to the format of the data set, it comes as a family of input–output pairs $\{(\mathbf{x}_k, y_k)\}$, $k = 1, 2, \ldots, N$, where $\mathbf{x}_k \in \mathbb{R}^n$ and $y_k \in \mathbb{R}$. Note that when we restrict the range of values assumed by $y_k$ to some finite set (say, integers), then we encounter a standard classification problem, while, in general, admitting continuous values assumed by the output, we are concerned with a (continuous) prediction problem.

Fig. 1. Growing a decision tree by expanding nodes (which are viewed as clusters located at its nodes). Shadowed nodes are those with maximal values of the diversity criterion and thus being subject to the split operation.
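As an illustration of this evaluation protocol, the following sketch generates the five rotated 80–20 splits. It is not part of the original study; the commented usage at the bottom refers to a hypothetical CFuzzyTree model with fit and error methods and is only meant to show how the splits would be consumed.

```python
import numpy as np

def fivefold_splits(n_samples, seed=0):
    """Yield five (train_idx, test_idx) pairs; each pass holds out a
    different 20% block of a shuffled index set (rotation method),
    giving an 80-20 split of the data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, 5)
    for k in range(5):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train_idx, test_idx

# Hypothetical usage:
# errors = []
# for tr, te in fivefold_splits(len(X)):
#     model = CFuzzyTree(c=2).fit(X[tr], y[tr])
#     errors.append(model.error(X[te], y[te]))
# print(np.mean(errors), np.std(errors))
```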

II. OVERALL ARCHITECTURE OF THE CLUSTER-BASED DECISION TREE

The architecture of the cluster-based decision tree develops around fuzzy clusters that are treated as generic building blocks of the tree. The training data set $\mathbf{X}$ is clustered into $c$ clusters so that the data points (patterns) that are similar are put together. These clusters are completely characterized by their prototypes (centroids). We start with them positioned at the top nodes of the tree structure. The way of building the clusters implies a specific way in which we allocate elements of $\mathbf{X}$ to each of them. In other words, each cluster comes with a subset of $\mathbf{X}$, namely $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_c$. The process of growing the tree is guided by a certain heterogeneity criterion that quantifies the diversity of the data (with respect to the output variable $y$) falling under the given cluster (node). Denote the values of the heterogeneity criterion by $V_1, V_2, \ldots, V_c$, respectively (see also Fig. 1). We then choose the node with the highest value of the criterion and treat it as a candidate for further refinement. Let $i_0$ be the one for which $V_i$ assumes a maximal value, that is, $i_0 = \arg\max_i V_i$. The $i_0$th node is refined by splitting it into $c$ clusters, as visualized in Fig. 1.

Again, the resulting nodes (children) of node $i_0$ come with their own sets of data. The process is repeated by selecting the most heterogeneous node out of all final nodes (see Fig. 2). The growth of the tree is carried out by expanding the nodes and building their consecutive levels that capture more details of the structure. It is noticeable that the node expansion leads to an increase in either the depth or the width (breadth) of the tree. The pattern of the growth is very much implied by the characteristics of the data as well as influenced by the number of the clusters.

Some typical patterns of growth are illustrated in Fig. 2. Considering the way in which the tree expands, it is easy to notice that each node of the tree has exactly zero or $c$ children.

Fig. 2. Typical growth patterns of the cluster-based trees: (a) depth intensive and (b) breadth intensive.

By looking at the way of forming the nodes of the tree and their successive splitting (refinement), we can easily observe an interesting analogy between this approach and well-known hierarchical divisive algorithms. Conceptually, they share the same principle; however, there are a number of technical differences between the two.

For the completeness of the construct, each node is characterized by the following associated components: the heterogeneity criterion $V$, the number of patterns associated with it, and a list of these patterns. Moreover, each pattern on this list comes with a degree of belongingness (membership) to that node. We provide a formal definition of the C-decision trees at a later stage, once we cover the pertinent mechanisms of their development.

III. DEVELOPMENT OF THE TREE

In this section, we concentrate on the functional details and ensuing algorithmic aspects. The crux of the overall design is the clustering mechanism and the manipulations realized at the level of information granules formed in this manner.
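To make the growth mechanism concrete, here is a minimal Python sketch of the loop described above. It is an illustration only: fcm stands for the fuzzy C-means routine of Section III-A (sketched there), the inline variability helper is a crisp stand-in for the criterion of Section III-B (it ignores the membership weighting), and the dictionary-based node structure is an assumption rather than the authors' implementation.

```python
import numpy as np

def variability(Z):
    """Crisp stand-in for the criterion V of Section III-B: spread of the
    output column (last coordinate of the concatenated [x | y] vectors)."""
    if len(Z) == 0:
        return 0.0
    y = Z[:, -1]
    return float(np.sum((y - y.mean()) ** 2))

def grow_c_tree(Z, c, max_expansions, min_size):
    """Grow a C-fuzzy tree: repeatedly split the leaf with the highest
    variability into c clusters found by fuzzy C-means (fcm, Section III-A)."""
    root = {"data": Z, "V": variability(Z), "children": []}
    leaves = [root]
    for _ in range(max_expansions):
        node = max(leaves, key=lambda n: n["V"])      # most heterogeneous leaf
        if node["V"] == 0.0 or len(node["data"]) <= max(c, min_size):
            break                                      # nothing worth splitting
        prototypes, U = fcm(node["data"], c)           # cluster the node's data
        labels = U.argmax(axis=0)                      # dominant cluster per pattern
        for i in range(c):
            Zi = node["data"][labels == i]
            node["children"].append({"data": Zi, "prototype": prototypes[i],
                                     "V": variability(Zi), "children": []})
        leaves.remove(node)
        leaves.extend(node["children"])
    return root
```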

A. Fuzzy Clustering

Fuzzy clustering is a core functional part of the overall tree. It builds the clusters and provides their full description. We confine ourselves to the standard fuzzy C-means (FCM), which is an omnipresent technique of information granulation. The description of this algorithm is well documented in the literature. We refer the reader to [2] and [11] and revisit it here in the setting of the decision trees. The FCM algorithm is an example of an objective-oriented fuzzy clustering where the clusters are built through a minimization of some objective function. The standard objective function assumes the format

$$Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \, \| \mathbf{z}_k - \mathbf{v}_i \|^2 \qquad (1)$$

with $\mathbf{U} = [u_{ik}]$, $i = 1, 2, \ldots, c$, $k = 1, 2, \ldots, N$, being a partition matrix (here $\mathbf{U}$ belongs to a family of $c$-by-$N$ matrices that satisfy a series of conditions: 1) all elements of the partition matrix are confined to the unit interval, 2) the sum over each column is equal to 1, and 3) the sum of the membership grades in each row is contained in the range $(0, N)$). The number of clusters is denoted by $c$. The data set to be clustered consists of $N$ patterns. $m$ is a fuzzification factor (usually $m = 2$), and $\|\mathbf{z}_k - \mathbf{v}_i\|$ is a distance between the $k$th data point (pattern) and the $i$th prototype. The prototype of the cluster can be treated as a typical vector or a representative of the data forming the cluster. The feature space in which the clustering is carried out requires a thorough description. Referring to the format of the data we have started with, let us note that they come as ordered pairs $(\mathbf{x}_k, y_k)$. For the purpose of clustering, we concatenate the pairs and use the notation

$$\mathbf{z}_k = [\mathbf{x}_k \;\; y_k] \qquad (2)$$

This implies that the clustering takes place in the $(n+1)$-dimensional space and involves the data distributed in the input and output space. Likewise, the resulting prototype ($\mathbf{v}_i$) is positioned in $\mathbb{R}^{n+1}$. For future use, we distinguish between the coordinates of the prototype in the input and output space by splitting them into two sections (blocks of the variables) as follows:

$$\mathbf{v}_i = [\tilde{\mathbf{v}}_i \;\; w_i]$$

where $\tilde{\mathbf{v}}_i$ collects the input-space coordinates and $w_i$ is the output coordinate. It is worth emphasizing that $\tilde{\mathbf{v}}_i$ describes a prototype located in the input space; this description will be of particular interest when utilizing the tree to carry out prediction tasks.

In essence, the FCM is an iterative optimization process in which we iteratively update the partition matrix and prototypes until some termination criterion has been satisfied. The updates of the values of $\mathbf{U}$ and the $\mathbf{v}_i$'s are governed by the following well-known expressions (cf. [2]):

partition update

$$u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{\|\mathbf{z}_k - \mathbf{v}_i\|}{\|\mathbf{z}_k - \mathbf{v}_j\|} \right)^{2/(m-1)}} \qquad (3)$$

prototype update

$$\mathbf{v}_i = \frac{\sum_{k=1}^{N} u_{ik}^{m} \, \mathbf{z}_k}{\sum_{k=1}^{N} u_{ik}^{m}} \qquad (4)$$

The series of iterations is started from a randomly initiated partition matrix and involves the calculations of the prototypes and partition matrices.
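A compact NumPy rendering of the update loop (3)–(4) is given below, assuming the usual settings (m = 2 by default, Euclidean distance, random initialization of the partition matrix). It is a sketch, not the implementation used in the experiments.

```python
import numpy as np

def fcm(Z, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Standard fuzzy C-means on data Z (N x d): returns the prototypes
    (c x d) and the partition matrix U (c x N)."""
    rng = np.random.default_rng(seed)
    N = Z.shape[0]
    U = rng.random((c, N))
    U /= U.sum(axis=0, keepdims=True)          # each column sums to 1
    for _ in range(max_iter):
        Um = U ** m
        # prototype update, cf. (4)
        V = (Um @ Z) / Um.sum(axis=1, keepdims=True)
        # distances between every prototype and every pattern
        d = np.linalg.norm(Z[None, :, :] - V[:, None, :], axis=2) + 1e-12
        # partition update, cf. (3)
        w = d ** (-2.0 / (m - 1.0))
        U_new = w / w.sum(axis=0, keepdims=True)
        if np.max(np.abs(U_new - U)) < tol:
            U = U_new
            break
        U = U_new
    return V, U
```

For the C-tree, Z would hold the concatenated vectors z_k = [x_k  y_k] of (2), so that each prototype carries both an input-space part and an output coordinate.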

B. Node Splitting Criterion

The growth process of the tree is pursued by quantifying the diversity of data located at the individual nodes of the tree and splitting the nodes that exhibit the highest diversity. This intuitively appealing criterion takes into account the variability of the data, finds the node with the highest value of the criterion, and splits it into $c$ nodes that occur at the consecutive lower level of the tree (see Fig. 3). The essence of the diversity (variability) criterion is to quantify the dispersion of the data "allocated" to the given cluster so that a higher dispersion of data results in higher values of the criterion. Recall that individual data points (patterns) belong to the clusters with different membership grades; however, for each pattern, there is a dominant cluster to which it exhibits the highest degree of belongingness (membership).

Fig. 3. Node splitting controlled by the variability criterion V.

More formally, let us represent the $i$th node of the tree as an ordered triple

$$\mathbf{N}_i = (\mathbf{X}_i, \mathbf{Y}_i, \mathbf{U}_i) \qquad (5)$$

Here, $\mathbf{X}_i$ denotes all elements of the data set that belong to this node in virtue of the highest membership grade

$$\mathbf{X}_i = \{ \mathbf{x}_k \mid u_{ik} = \max_j u_{jk} \}$$

for all $k$, where the index $j$ pertains to the nodes originating from the same parent.

The second set ($\mathbf{Y}_i$) collects the output coordinates of the elements that have already been assigned to $\mathbf{X}_i$, as follows:

$$\mathbf{Y}_i = \{ y_k \mid \mathbf{x}_k \in \mathbf{X}_i \} \qquad (6)$$

Likewise, $\mathbf{U}_i$ is a vector of the grades of membership of the elements in $\mathbf{X}_i$, as follows:

$$\mathbf{U}_i = \{ u_{ik} \mid \mathbf{x}_k \in \mathbf{X}_i \} \qquad (7)$$

We define the representative of this node positioned in the output space as the weighted sum (note that in the construct hereafter we include only those elements that contribute to the cluster, so the summation is taken over $\mathbf{X}_i$ and $\mathbf{Y}_i$), as follows:

$$m_i = \frac{\sum_{\mathbf{x}_k \in \mathbf{X}_i} u_{ik} \, y_k}{\sum_{\mathbf{x}_k \in \mathbf{X}_i} u_{ik}} \qquad (8)$$

The variability of the data in the output space existing at this node is taken as a spread around the representative ($m_i$), where again we consider a partial involvement of the elements in $\mathbf{X}_i$ by weighting the distance by the associated membership grade

$$V_i = \sum_{\mathbf{x}_k \in \mathbf{X}_i} u_{ik} \, (y_k - m_i)^2 \qquad (9)$$

In the next step, we select the node of the tree (leaf) that has the highest value of $V_i$, say $V_{i_0}$, and expand the node by forming its children by applying the clustering of the associated data set into $c$ clusters. The process is then repeated: we examine the leaves of the tree and expand the one with the highest value of the diversity criterion.

We treat a C-decision tree as a regular tree structure whose nodes are described by (5) and in which each nonterminal node has $c$ children. The growth of the tree is controlled by conditions under which the clusters can be further expanded (split). We envision two intuitively appealing conditions that tackle the nature of the data behind each node. The first one is self-evident: a given node can be expanded if it contains enough data points. With $c$ clusters, we require this number to be greater than the number of the clusters; otherwise, the clusters cannot be formed. While this is the lower bound on the cardinality of the data, practically we would expect this number to be a multiple of $c$, say $3c$, $4c$, etc. The second stopping condition pertains to the structure of the data that we attempt to discover through clustering. It becomes obvious that once we approach smaller subsets of data, the dominant structure (which is strongly visible at the level of the entire and far more numerous data set) may not manifest that profoundly in the subset. It is likely that the smaller the data set, the less pronounced its structure. This becomes reflected in the entries of the partition matrix, which tend to be equal to each other and equal to $1/c$. If no structure is present, this equal distribution of membership grades occurs across each column of the partition matrix. This lack of visible structure can be quantified by the expression (for the $k$th pattern)

$$s_k = \frac{1}{1 - 1/c} \sum_{i=1}^{c} \left( u_{ik} - \frac{1}{c} \right)^2 \qquad (10)$$

If all entries of the partition matrix are equal to $1/c$, then the result is equal to zero. If we encounter a full membership to a certain cluster, then the resulting value is equal to 1 (that is, the maximal value of the above expression). To describe the structural dependencies within the entire data set in a certain node, we carry out calculations over all patterns located at the node of the tree

$$\sigma = \frac{1}{\mathrm{card}(\mathbf{X}_i)} \sum_{\mathbf{x}_k \in \mathbf{X}_i} s_k \qquad (11)$$

Again, with no structurability present in the data, this expression returns a zero value.

To gain a better feel as to the lack of structure and the ensuing values of (11), let us consider a case where all entries in a certain column of the partition matrix (pattern) are equal to $1/c$, with some slight deviation equal to $\delta$. In 50% of the cases, we consider that these entries are higher than $1/c$, and we put $u_{ik} = 1/c + \delta$; in the remaining 50%, we consider a decrease over $1/c$ and have $u_{ik} = 1/c - \delta$. Furthermore, let us treat $\delta$ as a fraction of the original membership grade, that is, make it equal to $\gamma/c$, where $\gamma$ is in the interval (0, 1/2). Then, (11) reads as

$$\sigma = \frac{\gamma^2}{c - 1} \qquad (12)$$

Fig. 4. Structurability index (11) viewed as a function of c and plotted for several selected values of γ.

The plot of this relationship treated as a function of $\gamma$ is shown in Fig. 4. It shows how the departure from the situation where no structure has been detected ($\gamma = 0$) to the case where $\gamma \neq 0$ quantifies in the values of the structurability expression. The plot shows several curves over the number of clusters ($c$); higher values of $c$ lead to a substantial drop in the values of the index (11).

The two measures introduced previously can be used as a stopping criterion in the expansion (growth) of the tree. We can leave the node intact once the number of patterns falls under the assumed threshold and/or the structurability index is too low. The first index is a sort of precondition: if not satisfied, it prevents us from expanding the node. The second index comes in the form of a postcondition: to compute its value, we have to cluster the data first and then determine its value. It is also stronger, as one may encounter cases where there is a significant number of data points to warrant clustering in light of the first criterion; however, the second one concludes that the underlying structure is too "weak," and this may advise us to backtrack and refuse to expand this particular node.

These two indexes support decision making that focuses on the structural aspects of the data (namely, we decide whether to split the data). It does not guarantee that the resulting C-tree will be the best from the point of view of classification or prediction of a continuous output variable. The diversity criterion (the sum of $V_i$ at the leaves) can also be viewed as another termination criterion. While conceptually appealing, we may have difficulties in "translating" its values into a more tangible and appealing descriptor of the tree (obviously, the lower, the better). Another possible termination option (which may equally well apply to each of these three indexes) is to monitor their changes along the nodes of the tree once it is being built; an evident saturation of the values of each of them could be treated as a signal to stop growing the tree.
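The node-level quantities of this subsection translate directly into code. In the sketch below, the structurability helper follows the normalized form reconstructed in (10) and (11), which is an assumption about the exact normalization, and can_expand combines the two stopping conditions (the 0.05 threshold anticipates the setting used later in Section V-D).

```python
import numpy as np

def node_representative_and_variability(y, u):
    """Representative m_i (8) and variability V_i (9) of a node, given the
    outputs y and membership grades u of the patterns allocated to it."""
    m_i = np.sum(u * y) / np.sum(u)
    V_i = np.sum(u * (y - m_i) ** 2)
    return m_i, V_i

def structurability(U_node):
    """Structurability of a node, following the reconstruction in (10)-(11):
    normalized squared deviation of each membership column from the uniform
    profile 1/c, averaged over the patterns located at the node."""
    c = U_node.shape[0]
    s_k = ((U_node - 1.0 / c) ** 2).sum(axis=0) / (1.0 - 1.0 / c)
    return float(s_k.mean())

def can_expand(n_patterns, sigma, c, multiple=3, sigma_min=0.05):
    """Two stopping conditions: enough patterns at the node (a small multiple
    of c) and a structurability value above a chosen threshold."""
    return n_patterns > multiple * c and sigma >= sigma_min
```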

IV. USE OF THE TREE IN THE CLASSIFICATION (PREDICTION) MODE

Once the C-tree has been constructed, it can be used to classify a new input ($\mathbf{x}$) or predict a value of the associated output variable (denoted here by $\hat{y}$). In the calculations, we rely on the membership grades computed for each cluster as follows:

$$u_i(\mathbf{x}) = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{\|\mathbf{x} - \tilde{\mathbf{v}}_i\|}{\|\mathbf{x} - \tilde{\mathbf{v}}_j\|} \right)^{2/(m-1)}} \qquad (13)$$

where $\|\mathbf{x} - \tilde{\mathbf{v}}_i\|$ is a distance computed between $\mathbf{x}$ and $\tilde{\mathbf{v}}_i$ (as a matter of fact, we have the same expression as used in the FCM method; refer to (3)). The calculations pertain to the leaves of the C-tree, so for several levels of depth we have to traverse the tree first to reach the specific leaves. This is done by computing $u_i(\mathbf{x})$ for each level of the tree, selecting the corresponding path, and moving down (Fig. 5). At some level, we determine the path $i_0 = \arg\max_i u_i(\mathbf{x})$, $i = 1, 2, \ldots, c$. Once at the $i_0$th node, we repeat the process—that is, determine the next index of the path in the same way (here, we are dealing with the clusters at the successive level of the tree). The process repeats for each level of the tree. The predicted value occurring at the final leaf node is equal to $m_i$ (refer to (8)).

Fig. 5. Traversing a C-fuzzy tree: an implicit mode.

It is of interest to show the boundaries of the classification regions produced by the clusters (i.e., the implicit method) and contrast them with the geometry of classification regions generated by the decision trees. In the first case, we use a straightforward classification rule: assign $\mathbf{x}$ to class $i_0$ if $u_{i_0}(\mathbf{x})$ exceeds the values of the membership in all remaining clusters, that is,

$$i_0 = \arg\max_{i=1,2,\ldots,c} u_i(\mathbf{x})$$

For the decision trees, the boundaries are guillotine cuts. As a result, we get hyperboxes whose faces are parallel to the coordinates. When dealing with the FCM, we can exercise the following method. For the given prototypes, we can project them on the individual coordinates (variables), take averages of the successive projected prototypes, and build hyperboxes around the prototypes in the entire space. This approach is conceptually close to the decision trees as leading to the same geometric character of the classifier. The obvious rule holds: assign $\mathbf{x}$ to class $i_0$ if it falls into the hyperbox formed around prototype $\tilde{\mathbf{v}}_{i_0}$.

Some examples of the classification boundaries are shown in Fig. 6.

Fig. 6. Classification boundaries (thick lines) for some configurations of the prototypes formed by (13) and hyperboxes: (a) v1 = [1.5 1.2], v2 = [2.5 2.1], v3 = [0.6 3.5] and (b) v1 = [1.5 1.5], v2 = [1.5 2.6], v3 = [0.6 3.5]. Also shown are contour plots of the membership functions of the three clusters.

As Fig. 6 reveals, the hyperbox model of the classification boundaries is far more "conservative" than the one based on the maximal membership rule. This is intuitively appealing, as in the process of forming the hyperboxes we allowed only for cuts that are parallel to the coordinates. It becomes apparent that the geometry of the decision tree "induced" in this way varies substantially from the far more diversified geometry of the FCM-based class boundaries.
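A sketch of the implicit (maximal-membership) traversal is shown below. The node layout (dictionaries carrying the input-space prototype under prototype_x, the representative m of (8) under m, and a children list) is an assumed convention carried over from the earlier growth sketch, not the authors' data structure.

```python
import numpy as np

def membership(x, prototypes, m=2.0):
    """Membership of input x in each cluster via (13); 'prototypes' holds the
    input-space parts of the child prototypes at one level of the tree."""
    d = np.linalg.norm(prototypes - x, axis=1) + 1e-12
    w = d ** (-2.0 / (m - 1.0))
    return w / w.sum()

def predict(tree, x):
    """Descend the C-tree by following, at every level, the child with the
    maximal membership; return the representative m_i stored at the leaf."""
    node = tree
    while node["children"]:
        protos = np.array([ch["prototype_x"] for ch in node["children"]])
        u = membership(x, protos)
        node = node["children"][int(u.argmax())]
    return node["m"]   # cf. (8); round to the nearest label for discrete classes
```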

V. EXPERIMENTAL STUDIES

The experiments conducted in the study involve both prediction problems (in which the output variable is continuous) and those of a classification nature (where we encounter several discrete classes). Experiments 1 and 2 concern two-dimensional (2-D) synthetic data sets. Data used in experiments 3 and 4 come from the Machine Learning repository (http://www.ics.uci.edu/~mlearn/MLRepository.html), which makes the experiments fully reproducible and facilitates further comparative analysis. The data sets we experimented with are as follows: (experiment 3) auto-mpg [9] and (a) pima diabetes [9], (b) ionosphere [9], (c) hepatitis [9], (d) dermatology [9], and (e) auto data [14] in experiment 4. The first one deals with a continuous problem, while the other ones concern discrete-class data. In all experiments, we use the FCM algorithm with the settings summarized in Table I. As far as the learning and prediction abilities of the tree are concerned, we proceed with a fivefold cross-validation that generates an 80–20 split by taking 80% of the data as a training set and testing the tree on the remaining 20% of the data set. Furthermore, the experiments are repeated five times by taking different splits of the data into the training and testing parts, respectively.

TABLE I. EXPERIMENTAL SETTING OF THE FCM ALGORITHM

A. Experiment 1

Experiment 1 is a 2-D synthetic data set generated by uniform-distribution random generators. The training set comprises 239 data points (patterns), while the testing set consists of 61 patterns. These two data sets are visualized in Figs. 7 and 8, respectively.

Fig. 7. Two-dimensional training data (239 patterns).

Fig. 8. Two-dimensional testing data (61 patterns).

The results of the FCM (with c = 2) are visualized in Fig. 9. Here, we report the prototypes of each cluster ($\mathbf{v}_i$), the values of the splitting criterion ($V_i$), the number of the data points from the training set allocated to the cluster, and the predicted value at the node (class), which is rounded to the nearest integer value of $m_i$, as we are dealing with two discrete classes (labels) of the patterns.

Fig. 9. Top level of the C-decision tree; note the two clusters with different values of the variability index; the cluster with the higher value is shaded.

In the next step, we select the first node of the tree, which is characterized by the highest value of the variability index, and expand it by forming two children nodes by applying the FCM algorithm to the data associated with this original node. The decision tree grown in this manner is visualized in Fig. 10.

Fig. 10. Decision tree after the second expansion (iteration).

At the next step, we select the second node of the tree (the one with the highest variability) and expand it in the same way as before (see Fig. 11).

Fig. 11. Complete C-tree; note that the values of the variability criterion have reached zero at all leaves, and this terminates further growth of the tree.

As expected, the classification error is equal to zero both for the training and the testing set. This is not surprising considering that the classes are positioned quite apart from each other.

B. Experiment 2

Two-dimensional synthetic patterns used in this experiment are normally distributed with some overlap between the classes (see Figs. 12 and 13). The resulting C-decision tree is visualized in Fig. 14. For this tree, the average classification error on the training data is equal to 0.001250 (with a standard deviation equal to 0.002013). For the testing data, these numbers are equal to 0.188333 and 0.043780, respectively.

Fig. 12. Two-dimensional training data (240 patterns).

Fig. 13. Two-dimensional testing data (60 patterns).

Fig. 14. Complete C-tree; as the values of the variability criterion have reached zero at all leaves, this terminates further growth of the tree.

C. Experiment 3

The auto-mpg data set [9] involves a collection of vehicles described by a number of features (such as the weight of a vehicle, the number of cylinders, and fuel consumption). We complete a series of experiments in which we sweep through the number of clusters ($c$), varying it from 1 to 20, and carry out 20 expansions (iterations) of the tree (the tree is expanded step by step, leading either to its in-depth or in-breadth expansion). The variability observed at all leaves of the tree, $V = \sum_{i \in \text{leaves}} V_i$, characterizes the process of the growth of the tree (refer to Figs. 1 and 2).

The variability measure is reported for the training and testing set as visualized in Figs. 15 and 16. The variability goes down with the growth of the tree, and this becomes evident for the training and testing data. It is also clear that most of the changes in the reduction of the variability occur at the early stages of the growth of the trees; afterwards, the changes are quite limited. Likewise, we note that the drop in the variability values becomes visible when moving from two to three or four clusters. Noticeably, for an increased number of clusters, the variability is practically left unaffected (we see a series of barely distinguishable plots for $c$ greater than five). This effect is present for the training and testing data.

Fig. 15. Variability of the tree reported during the growth of the tree (training data) for a selected number of clusters.

Fig. 16. Variability of the tree reported during the growth of the tree (testing data) for a selected number of clusters.

After taking a careful look at the variability of the tree, we conclude that the optimal configuration occurs at 5 clusters with the number of expansions equal to seven. In this case, the resulting tree is portrayed in Fig. 17. The arrows shown there, along with the labels (numbers), visualize the growth of the tree, namely, the way in which the tree is grown (expanded in consecutive iterations). The numbers in circles denote the node number. The last digit of the node number denotes the cluster number, while the beginning digits denote the parent node number. A detailed description of the nodes is given in Table II. Again, we report the details of the tree, including the number of patterns residing at each node, as well as their variability ($V_i$) and the predicted value (8) at the node.

Fig. 17. Detailed C-decision tree for the optimal number of clusters and iterations; see a detailed description in the text.

TABLE II. DESCRIPTION OF THE NODES OF THE C-DECISION TREE INCLUDED IN FIG. 17

While the variability criterion is an underlying measure in the design process, the predictive capabilities of the C-decision tree are quantified by the following performance index:

$$e = \frac{1}{N} \sum_{k=1}^{N} \left( m_{i(k)} - y_k \right)^2 \qquad (14)$$

In the above expression, $m_{i(k)}$ denotes the predicted value occurring at the terminal node to which $\mathbf{x}_k$ is allocated (refer to (8)). More specifically, the representative of this node in the output space is calculated as a weighted sum of those elements from the training set that contribute to this node, while $y_k$ is the output value encountered in the data set. For a discrete (classification) problem, this index is simply a classification error that is determined by counting the number of patterns that have been misclassified by the C-decision tree.
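In code, the index can be evaluated by reusing the predict sketch of Section IV; the continuous case below uses the mean squared deviation form in which (14) was reconstructed above (an assumption), and the discrete case counts misclassifications after rounding, as described in the text.

```python
import numpy as np

def performance_index(tree, X, y, discrete=False):
    """Evaluate a grown C-tree: mean squared deviation between the leaf
    representative reached by each pattern and its target (cf. (14));
    for discrete problems, report the misclassification rate instead."""
    preds = np.array([predict(tree, x) for x in X])   # predict() from the Section IV sketch
    if discrete:
        return float(np.mean(np.rint(preds) != y))    # classification error
    return float(np.mean((preds - y) ** 2))           # continuous prediction error
```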
The values of the error obtained on the training set for a number of different configurations of the tree (number of clusters and iterations) are shown in Fig. 18. Again, most of the changes occur for low numbers of clusters and are characteristic of the early phases of the growth of the tree.

Fig. 18. Average error of the C-decision tree reported for the training set.

We see that low values of $c$ do not reduce the error even with a substantial growth (number of iterations) of the tree. Similarly, we observe that the same effect occurs for the testing set (see Fig. 19). Obviously, these results are reported for the sake of completeness; in practice, we choose the "best" tree on the basis of the training set and test it by means of the testing data.

Fig. 19. Average error of the C-decision tree reported for the testing set.

D. Experiment 4

Again, we use a data set from the Machine Learning repository [9], a two-class pima-diabetes set consisting of 768 patterns distributed in an eight-dimensional feature space. In the design of the C-tree, we use the development procedure outlined in the previous section: a node with the maximal value of diversity is chosen as a potential candidate to expand (split into $c$ clusters). Prior to this node unfolding, we check whether there is a sufficient number of patterns located there (here, we consider that the criterion is satisfied when this number is greater than the number of clusters). Once this holds, we proceed with the clustering and then look at the structurability index (11), whose value should be greater than or equal to 0.05 (this threshold value has been selected arbitrarily), to accept the unfolding of the node. The number of iterations is set to 10.

Fig. 20. Variability (V) for the pima-diabetes data (training set).

The plots of the variability shown in Fig. 20 point out that the number of iterations does not make a substantial difference to the value of $V$; the changes occur mainly during the first few iterations (expansions of the tree). Similarly, there are changes when we increase the number of clusters from two up to six, but beyond this the changes are not very significant.

When using the C-decision tree in the predictive mode, its performance is evaluated by means of (14). The collection of pertinent plots is shown in Figs. 21 and 22 for the training data (the performance results are averaged over the series of experiments). Similarly, the results of the tree on the testing set are visualized in Fig. 23. Evidently, with the increase in the number of clusters, we note a drop in the error values; yet, values of $c$ that are too high lead to some fluctuations of the error (so it is not evident that growing larger trees is still fully legitimate).

Such fluctuations are even more profound when studying the plots of error reported for the testing set. In a search for the "optimal" configuration, we have found that the number of clusters between three and five and a few iterations led to the best results (see Fig. 24). We observe an evident tendency: while growing larger trees is definitely beneficial in the case of the training data (generally, the error is reduced, with a few exceptions), the error does not change very visibly on the testing data (where the changes are in the range of 1%).

Fig. 21. Error (e) as a function of iterations (expansions of the tree) for the training set for a selected number of clusters.

Fig. 22. Error (e) as a function of the number of clusters for selected iterations.

Fig. 23. Error (e) for the pima-diabetes data (testing set).

Fig. 24. Classification error for the C-decision tree versus successive iterations (expansions) of the tree; c = 3 and 5; both training and testing sets are included.

It is of interest to compare the results produced by the C-decision tree with those obtained when applying "standard" decision trees, namely C4.5. In this analysis, we have experimented with the software available on the Web (http://www.cse.unsw.edu.au/~quinlan/), which is C4.5 revision 8 run with the standard settings (i.e., selection of the attribute that maximizes the information gain; no pruning was used). The results are summarized in Table III. Following the experimental scenario outlined at the beginning of the section, we report the mean values and the standard deviation of the error. For the C-decision trees, the number of nodes is equal to the number of clusters multiplied by the number of iterations.

TABLE III. C-DECISION TREE AND C4.5: A COMPARATIVE ANALYSIS FOR SEVERAL MACHINE LEARNING DATA SETS: (a) PIMA-DIABETES, (b) IONOSPHERE, (c) HEPATITIS (IN THIS DATA SET, ALL MISSING VALUES WERE REPLACED BY THE AVERAGES OF THE CORRESPONDING ATTRIBUTES), (d) DERMATOLOGY, AND (e) AUTO DATA

Overall, we note that the C-tree is more compact (in terms of the number of nodes). This is not surprising, as its nodes are more complex than those in the original decision tree. If our intent is to have smaller and more compact structures, C-trees become quite an appealing architecture. The results on the training sets are better for the C-trees, at the level of a 3%–6% improvement (for the pima data set). The standard deviation of the error is two times lower for these trees in comparison with C4.5. For the testing set, we note that the larger of the two C-trees in Table III produces almost the same results as C4.5. With the smaller C-tree, we note an increase in the classification rate by 1% in comparison with the larger structure; however, the size of the tree has been reduced to one half of the larger tree (on the pima data). The increase in the size of the tree does not dramatically affect the classification results; the classification rates tend to be better but do not differ significantly from structure to structure. In general, we note that the C-decision tree produces more consistent results in terms of the classification for the training and testing sets; these are closer when compared with the results produced by the C4.5 tree. In some cases, the results of the C-decision tree are better than C4.5; this happens for the hepatitis data.

VI. CONCLUSION

The C-decision trees are classification constructs that are built on the basis of information granules—fuzzy clusters. The way in which these trees are constructed deals with successive refinements of the clusters (granules) forming the nodes of the tree. When growing the tree, the nodes (clusters) are split into granules of lower diversity (higher homogeneity). In contrast to C4.5-like trees, all features are used at once at each node, and such a development approach promotes more compact trees and a versatile geometry of the partition of the feature space. The experimental studies illustrate the main features of the C-trees.

One of them is quite profound and highly desirable for any practical usage: the difference in performance of the C-trees on the training and testing sets is lower than the one reported for C4.5.

The C-tree is also sought as a certain blueprint of some detailed models that can be formed on a local basis by considering the data allocated to the individual nodes. At this stage, the models are refined by choosing their topology (e.g., linear models and neural networks) and making a decision about detailed learning.

REFERENCES

[1] W. P. Alexander and S. Grimshaw, "Treed regression," J. Computational and Graphical Statistics, vol. 5, pp. 156–175, 1996.
[2] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum, 1981.
[3] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Belmont, CA: Wadsworth, 1984.
[4] E. Cantu-Paz and C. Kamath, "Using evolutionary algorithms to induce oblique decision trees," in Proc. Genetic and Evolutionary Computation Conf. 2000, D. Whitley, D. E. Goldberg, E. Cantu-Paz, L. Spector, L. Partnee, and H.-G. Beyer, Eds., San Francisco, CA, pp. 1053–1060.
[5] A. Dobra and J. Gehrke, "SECRET: A scalable linear regression tree algorithm," in Proc. 8th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Edmonton, AB, Canada, Jul. 2002.
[6] A. Ittner and M. Schlosser, "Non-linear decision trees—NDT," in Proc. 13th Int. Conf. Machine Learning (ICML'96), Bari, Italy, Jul. 3–6, 1996.
[7] A. Ittner, J. Zeidler, R. Rossius, W. Dilger, and M. Schlosser, "Feature space partitioning by non-linear fuzzy decision trees," in Proc. Int. Fuzzy Systems Assoc., pp. 394–398.
[8] A. K. Jain et al., "Data clustering: A review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, Sep. 1999.
[9] C. J. Merz and P. M. Murphy, UCI Repository of Machine Learning Databases, Dept. of Information and Computer Science, University of California, Irvine, CA, 1996. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
[10] S. K. Murthy, S. Kasif, and S. Salzberg, "A system for induction of oblique decision trees," J. Artificial Intelligence Res., vol. 2, pp. 1–32, 1994.
[11] W. Pedrycz and Z. A. Sosnowski, "Designing decision trees with the use of fuzzy granulation," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 30, no. 2, pp. 151–159, Mar. 2000.
[12] J. R. Quinlan, "Induction of decision trees," Mach. Learn., vol. 1, pp. 81–106, 1986.
[13] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann, 1993.
[14] J. S. Siebert, "Vehicle recognition using rule-based methods," Turing Institute, Research Memo TIRM-87-017, 1987.
[15] R. Weber, "Fuzzy ID3: A class of methods for automatic knowledge acquisition," in Proc. 2nd Int. Conf. Fuzzy Logic and Neural Networks, Iizuka, Japan, Jul. 17–22, 1992, pp. 265–268.
[16] O. T. Yildiz and E. Alpaydin, "Omnivariate decision trees," IEEE Trans. Neural Netw., vol. 12, no. 6, pp. 1539–1546, Nov. 2001.

Witold Pedrycz (M'88–SM'94–F'99) is a Professor and Canada Research Chair (CRC) in the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada. He is actively pursuing research in computational intelligence, fuzzy modeling, knowledge discovery and data mining, fuzzy control (including fuzzy controllers), pattern recognition, knowledge-based neural networks, relational computation, bioinformatics, and software engineering. He has published numerous papers in these areas. He is also the author of eight research monographs covering various aspects of computational intelligence and software engineering.

Dr. Pedrycz has been a member of numerous program committees of IEEE conferences in the area of fuzzy sets and neurocomputing. He currently serves as an Associate Editor of the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, Parts A and B, and the IEEE TRANSACTIONS ON FUZZY SYSTEMS. He is the Editor-in-Chief of Information Sciences, President-Elect of the International Fuzzy Systems Association (IFSA), and President of the North American Fuzzy Information Processing Society (NAFIPS).

Zenon A. Sosnowski (M'99) received the M.Sc. degree in mathematics from the University of Warsaw, Warsaw, Poland, in 1976 and the Ph.D. degree in computer science from the Warsaw University of Technology, Warsaw, Poland, in 1986.

He has been with the Technical University of Bialystok, Bialystok, Poland, since 1976, where he is an Assistant Professor in the Department of Computer Science. In 1988–1989, he spent five months at the Delft University of Technology, the Netherlands. He spent two years (1990–1991) with the Knowledge Systems Laboratory of the National Research Council's Institute for Information Technology, Ottawa, ON, Canada. His research interests include artificial intelligence, expert systems, approximate reasoning, fuzzy sets, and knowledge engineering.

Dr. Sosnowski is a Member of the IEEE Systems, Man, and Cybernetics Society.
