
IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 4, NO. 3, AUGUST 1996

Correspondence

The Possibilistic C-Means Algorithm: Insights and Recommendations

Raghu Krishnapuram and James M. Keller

Abstract- Recently, the possibilistic C-means algorithm (PCM) was proposed to address the drawbacks associated with the constrained memberships used in algorithms such as the fuzzy C-means (FCM). In this issue, Barni et al. report a difficulty they faced while applying the PCM, and note that it exhibits an undesirable tendency to converge to coincidental clusters. The purpose of this correspondence is not just to address the issues raised by Barni et al., but to go further and analytically examine the underlying principles of the PCM and the possibilistic approach in general. We analyze the data sets used by Barni et al. and interpret the results reported by them in the light of our findings.

I. BACKGROUND

The main motivation behind the possibilistic approach to clustering in [1] was to address the problems associated with the constraint on the memberships used in fuzzy clustering algorithms such as the fuzzy C-means (FCM). As pointed out in [1], the constraint causes the FCM to generate memberships that can be interpreted as degrees of sharing but not as degrees of typicality. Thus, the memberships in a given cluster of two points that are equidistant from the prototype of the cluster can be significantly different, and the memberships of two points in a given cluster can be equal even though the two points are arbitrarily far away from each other. This gives rise to poor performance in the presence of noise and outliers. In [1], a modification to the FCM objective function was proposed, and a new clustering algorithm, namely the possibilistic C-means (PCM) algorithm, was derived. The second term in the PCM objective function contains a parameter $\eta$ whose value is to be estimated from the data. It is important to remember that the PCM is just a particular implementation of the general idea of the possibilistic approach. The possibilistic approach simply means that the membership value of a point in a cluster (or class) represents the typicality of the point in the class, or the possibility of the point belonging to the class. Typicality is one of the most commonly used interpretations of memberships in applications of fuzzy set theory. It was shown in [1] that by relaxing the constraint on memberships, one can generate memberships that represent typicality. It was also shown that since noise points or outliers are less typical, typicality-based memberships automatically reduce the effect of noise points and outliers, and improve the results considerably.

In this correspondence, we revisit the PCM algorithm to provide more insights and recommendations. This work is driven partially by the critique of Barni et al. [12] and partially by the results of our own investigations. In Section II, we reformulate the objective function of the PCM and show that the objective function consists of C independent subfunctions, and that the local minima of the subfunctions correspond to dense regions in feature space. In Section III, we will explain the significance of the parameter $\eta$ as well as the fuzzifier m in the PCM. Using the results derived in Sections II and III, we will discuss the issue of coincidental clusters raised by Barni et al. in Section IV. In Section V, we will explain and interpret the results reported by Barni et al. In Section VI, we briefly discuss the role of clustering in pattern recognition. Finally, in Section VII, we present the summary and conclusion.

Fig. 1. A plot of the PCM membership function for various values of the fuzzifier parameter m.

Manuscript received February 8, 1996. The authors are with the Department of Computer Engineering and Computer Science, University of Missouri, Columbia, MO 65211 USA. Publisher Item Identifier S 1063-6706(96)05623-S.

II. A FUNDAMENTAL DIFFERENCE BETWEEN THE FCM AND THE PCM: PARTITIONING VERSUS MODE-SEEKING

The FCM is primarily a partitioning algorithm. It will find a fuzzy C-partition of a given data set, regardless of how many "clusters" are actually present in the data set. In other words, each component of the partition may or may not correspond to a "cluster." In contrast, the PCM is a mode-seeking algorithm, i.e., each component generated by the PCM corresponds to a dense region in the data set. In the PCM, the prototypes are automatically attracted to dense regions in feature space as iterations proceed. This can be shown as follows.

In the PCM algorithm, each cluster is independent of the other clusters. Hence, the objective function corresponding to cluster i can be formulated as

$$J_i(\beta_i, U_i; X) = \sum_{j=1}^{N} u_{ij}^m \, d^2(x_j, \beta_i) + \eta_i \sum_{j=1}^{N} (1 - u_{ij})^m. \qquad (1)$$

In (1), $\beta_i$ represents the prototype associated with cluster i, $U_i$ (which contains all the memberships associated with cluster i) represents the ith row of the membership matrix U, and $\eta_i$ is the "bandwidth" or "resolution" or "scale" parameter. The membership matrix U generated by the PCM is, strictly speaking, not a "partition matrix," since its columns no longer satisfy the constraint

$$\sum_{i=1}^{C} u_{ij} = 1.$$

The parameter $\eta_i$ needs to be prespecified, and can be estimated from the distance statistics of the data set X.
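As a concrete illustration of the per-cluster decomposition in (1), the following minimal Python sketch evaluates the objective of a single cluster. It is our own illustration, not code from the correspondence: the names (pcm_cluster_objective, eta, m) are hypothetical, and squared Euclidean distance is assumed for $d^2(x_j, \beta_i)$.

```python
import numpy as np

def pcm_cluster_objective(X, beta, u, eta, m):
    """Evaluate J_i of (1) for one cluster: a membership-weighted data-fit term
    plus a penalty that discourages trivially small memberships."""
    d2 = np.sum((X - beta) ** 2, axis=1)      # squared distances to the prototype
    fit = np.sum((u ** m) * d2)               # first term of (1)
    penalty = eta * np.sum((1.0 - u) ** m)    # second term of (1)
    return fit + penalty

# Because (1) involves only cluster i, the total PCM objective is simply the
# sum of C such terms, one per prototype, with no coupling between clusters.
```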
Fig. 2. Plots of the PCM subobjective functions for the case of a one-dimensional data set consisting of two Gaussian clusters. The data points are shown as circles on the horizontal axis. Plots are shown for the following choices of parameters: (a) fuzzifier m = 1.5; (b) fuzzifier m = 2; (c) scale parameters η overestimated by an order of magnitude; (d) scale parameters η underestimated by an order of magnitude; (e) fuzzifier m = 1.5 with the clusters moved closer to each other; (f) fuzzifier m = 2.0 with the clusters moved closer to each other.
Fig. 3. The classification results of the FCM on a data set with two clusters in (a) noiseless and (b) noisy conditions. Points belonging to different classes are shown using different symbols.

The membership update equation in the PCM is

$$u_{ij} = \frac{1}{1 + \left( d^2(x_j, \beta_i)/\eta_i \right)^{1/(m-1)}}. \qquad (2)$$

Solving for $d^2(x_j, \beta_i)$ in terms of $u_{ij}$ from (2), we obtain

$$d^2(x_j, \beta_i) = \eta_i \left( \frac{1 - u_{ij}}{u_{ij}} \right)^{m-1}. \qquad (3)$$

We can now eliminate $d^2(x_j, \beta_i)$ from the objective function in (1) using (3). This gives us

$$J_i(\beta_i, U_i; X) = \eta_i \sum_{j=1}^{N} (1 - u_{ij})^{m-1}.$$

For a given value of $\eta_i$, minimizing $J_i(\beta_i, U_i; X)$ is equivalent to maximizing

$$J_i'(\beta_i, U_i; X) = \eta_i \sum_{j=1}^{N} \left[ 1 - (1 - u_{ij})^{m-1} \right] = \eta_i \sum_{j=1}^{N} u_{ij}^{*} \qquad (4)$$

where $u_{ij}^{*} = 1 - (1 - u_{ij})^{m-1}$ can be interpreted as a modified membership. It is to be noted that $u_{ij}^{*}$ is obtained from $u_{ij}$ via a monotonic mapping, since

$$\frac{d u_{ij}^{*}}{d u_{ij}} = (m-1)(1 - u_{ij})^{m-2} > 0 \quad \text{for } m > 1.$$

Hence, $u_{ij}^{*}$ varies the same way as $u_{ij}$, i.e., $u_{ij} = 0 \Rightarrow u_{ij}^{*} = 0$ and $u_{ij} = 1 \Rightarrow u_{ij}^{*} = 1$; both are monotonically decreasing functions of $d^2(x_j, \beta_i)$. Furthermore, for the special case of m = 2, (4) reduces to

$$J_i'(\beta_i, U_i; X) = \eta_i \sum_{j=1}^{N} u_{ij}. \qquad (5)$$

From (4) and (5), we see that for a given value of $\eta_i$, each of the C subobjective functions is maximized by choosing the prototype location such that the sum of the (modified) memberships is maximized. This is achieved if the prototype is located in a dense region, since the (modified) membership is a monotonically decreasing function of the distance to the prototype. If there are indeed C dense regions in feature space (corresponding to C distinct clusters), then, with proper initialization, each prototype will converge to a dense region. In such a situation, even if all $\eta_i$ are equal (and, hence, all subobjective functions become identical), each of them will still have C distinct minima corresponding to the C dense regions.

III. THE IMPORTANCE AND MEANING OF THE SCALE PARAMETER $\eta_i$ AND THE FUZZIFIER PARAMETER m

It can be shown that the parameter $\eta_i$ is related to the resolution parameter in the potential function [2] and deterministic annealing [3] approaches. It is also related to the idea of "scale" in robust statistics [4]. Moreover, the PCM is equivalent to a collection of C independent robust M-estimators [5]-[7]. A detailed discussion of these connections, as well as others, will be the subject of another paper [8]. In any case, as explained in [1], the value of $\eta_i$ determines the distance at which the membership becomes 0.5. Thus, $\eta_i$ determines the "zone of influence" of a point. A point $x_j$ will have little influence on the estimates of the prototype parameters of a cluster if $d^2(x_j, \beta_i)$ is large when compared with $\eta_i$. Therefore, a ball-park value for $\eta_i$ is the variance (average intracluster distance) of cluster i [1].

The "fuzzifier" m determines the rate of decay of the membership value. Fig. 1 shows the membership function generated by the PCM as a function of m. When m = 1, the memberships are crisp, i.e., all points with $d^2(x_j, \beta_i)$ greater than $\eta_i$ will have zero memberships. When $m \to \infty$, the membership function does not decay to zero at all. Also, in the FCM, m = 1 corresponds to the crisp case and $m \to \infty$ corresponds to the maximally fuzzy case. However, the interpretation of m is different in the FCM and the PCM. In the FCM, increasing values of m represent increased sharing of points among all clusters, whereas in the PCM, increasing values of m represent increased possibility of all points in the data set completely belonging to a given cluster. Thus, the value of m that gives satisfactory performance is different in the two algorithms. A value of m ≈ 2 is known to give good results with the FCM. However, as can be seen in Fig. 1, for this value of m the membership function decays too slowly in the case of the PCM. For example, if the clusters are Gaussian, then about 95% of the points will lie within $d^2(x_j, \beta_i) = 4\sigma^2 = 4\eta_i$. Therefore, the membership function should almost completely decay to zero at $d^2(x_j, \beta_i)/\eta_i = 4$. This is certainly not the case for m = 2. A more appropriate choice seems to be m ≈ 1.5.
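To make the decay-rate argument concrete, the short Python check below (our own illustration, not part of the original correspondence) evaluates the membership of (2) at the point where $d^2(x_j, \beta_i)/\eta_i = 4$, i.e., roughly the two-standard-deviation boundary of a Gaussian cluster, for the two fuzzifier values discussed.

```python
def pcm_membership(d2_over_eta, m):
    # Membership update of (2), written in terms of the ratio d^2/eta.
    return 1.0 / (1.0 + d2_over_eta ** (1.0 / (m - 1.0)))

for m in (2.0, 1.5):
    u = pcm_membership(4.0, m)
    print(f"m = {m}: u at d^2/eta = 4 is {u:.3f}")

# Expected output (approximately):
#   m = 2.0: u at d^2/eta = 4 is 0.200
#   m = 1.5: u at d^2/eta = 4 is 0.059
# With m = 2 a point two standard deviations away still carries membership 0.2,
# whereas with m = 1.5 it has largely been suppressed, which is the basis for
# recommending m around 1.5 for the PCM.
```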
Fig. 2 illustrates the ideas discussed in Sections II and III in the one-dimensional case. Fig. 2(a) shows the two subobjective functions $J_i(\beta_i, U_i; X)$ for a data set consisting of two Gaussian clusters with centers at 65 and 135. Each cluster has 50 points, and the variances of the two clusters are 144 and 100, respectively. The objective functions are plotted with respect to the center parameter. The variances have been used as the estimates of $\eta_i$, and a value of 1.5 was used for m. The data points are shown as small circles on the x-axis.
Fig. 4. A comparison of the classification results of the PCM and the FCM on a data set consisting of three Gaussian clusters (courtesy of Barni et al.). (a), (c), and (e) show the points belonging to the three classes according to the PCM, and (b), (d), and (f) show the points belonging to the three classes according to the FCM.

Two minima can be seen in each subobjective function. The upper curve corresponds to the first cluster, and the lower curve to the second. If the first center is initialized in the left half, and the second one in the right half (as will be the case if the FCM is used for initialization), then the two individual centers do converge to the two distinct local minima. However, if both centers are initialized in the right half, then they will both find the same local minimum, thus completely missing cluster 1. Fig. 2(b) shows the plots of the objective functions for the same data set when m = 2. It can be seen that the valleys become shallow, and the local minima move slightly toward each other. Fig. 2(c) shows the two subobjective functions when the $\eta_i$ are overestimated by an order of magnitude (i.e., $\eta_1 = 1440$ and $\eta_2 = 1000$). It can be seen that the two valleys tend to merge into one shallow valley. This illustrates the fact that at this "resolution" or "scale" the algorithm perceives only one cluster. Fig. 2(d) shows the two subobjective functions when the $\eta_i$ are underestimated by an order of magnitude (i.e., $\eta_1 = 14$ and $\eta_2 = 10$). This results in rather jagged subobjective functions with multiple local minima. For example, there is a small local minimum corresponding to the three left-most points, which form a legitimate cluster at this scale. Fig. 2(e) and (f) show the subobjective functions when m = 1.5 and when m = 2, respectively, for a data set obtained by moving the two clusters closer to each other so that the centers are now at 84 and 119, respectively. It can be seen that with m = 2 it is no longer possible to find the two distinct local minima, and no matter what the initialization, both centers will converge to approximately the same point.
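The behavior shown in Fig. 2 is easy to reproduce. The sketch below is our own illustration (not code from the correspondence): it generates a one-dimensional data set of two Gaussian clusters with the stated centers and variances, and scans the single-cluster PCM subobjective of (1) over candidate center locations so that the valleys corresponding to the dense regions become visible. The helper name scan_subobjective and the random seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian clusters as described in the text: centers 65 and 135,
# variances 144 and 100, 50 points each.
x = np.concatenate([rng.normal(65, 12, 50), rng.normal(135, 10, 50)])

def scan_subobjective(x, eta, m, centers):
    """Evaluate the one-cluster PCM objective (1) at each candidate center."""
    values = []
    for v in centers:
        d2 = (x - v) ** 2
        u = 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))   # membership update (2)
        values.append(np.sum(u ** m * d2) + eta * np.sum((1 - u) ** m))
    return np.array(values)

centers = np.linspace(0, 200, 401)
J = scan_subobjective(x, eta=144.0, m=1.5, centers=centers)

# Local minima of J mark the dense regions; with a reasonable eta and m = 1.5,
# two distinct valleys appear near 65 and 135, as in Fig. 2(a).
valleys = centers[1:-1][(J[1:-1] < J[:-2]) & (J[1:-1] < J[2:])]
print(valleys)
```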
The above examples illustrate that for the PCM to give a meaningful solution, i) we need good estimates of the scale parameters $\eta_i$; ii) we need to initialize each prototype so that we obtain C distinct local minima (if in fact there are C distinct clusters); and iii) the value of m needs to be chosen so that the decay rate of the membership function is meaningful. We now briefly address these issues.

The scale parameter $\eta_i$ of a cluster corresponds to the "zone of influence" or "size" of the cluster. Therefore, when the data set is relatively noise-free, the FCM algorithm can be used to initialize the partition and obtain a good estimate of the $\eta_i$. For example, we could use

$$\eta_i = \frac{\sum_{j=1}^{N} u_{ij}^m \, d_{ij}^2}{\sum_{j=1}^{N} u_{ij}^m}$$

or

$$\eta_i = \frac{\sum_{j : u_{ij} > \alpha} d_{ij}^2}{\left| \{ j : u_{ij} > \alpha \} \right|}$$

where $d_{ij}^2 = d^2(x_j, \beta_i)$ and $\alpha$ is an appropriate threshold. The FCM will also provide a good initialization for the prototypes, so that they will converge to C distinct local minima, provided there indeed are C distinct dense regions. However, when the data is noisy, the initial partition produced by the FCM will be poor, resulting in bad initializations for the prototypes and improper estimates for the scale parameters $\eta_i$. Fig. 3 shows an example of such a situation. Fig. 3(a) shows the "classification" generated by the FCM on a data set with two clusters. The two classes are indicated by square and cross symbols, respectively. Fig. 3(b) shows the classification when "noise" points are added. The FCM lumps the two good clusters into one class and all the noise points into the other. In such a situation, both the initial partition and the scale estimates will be unacceptable. Fortunately, there are methods in robust statistics [4] which can be used to rescue us from this situation. A detailed discussion of these issues may be found in [8].
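As a rough illustration of the first estimate above, the following sketch (our own; estimate_eta is a hypothetical name) computes $\eta_i$ as the fuzzy intracluster average of squared distances, assuming a membership matrix and prototypes from a preliminary FCM run are already available and that squared Euclidean distance is used.

```python
import numpy as np

def estimate_eta(X, prototypes, U, m=2.0):
    """Per-cluster scale estimates eta_i = sum_j u_ij^m d_ij^2 / sum_j u_ij^m,
    where U is the C x N membership matrix from a preliminary FCM run."""
    etas = []
    for beta_i, u_i in zip(prototypes, U):
        d2 = np.sum((X - beta_i) ** 2, axis=1)
        w = u_i ** m
        etas.append(np.sum(w * d2) / np.sum(w))
    return np.array(etas)

# Typical use: run the FCM first on relatively noise-free data, then feed the
# resulting prototypes and eta estimates to the PCM as its initialization.
```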
As seen in Fig. 1, a good choice for the value of the fuzzifier m for the PCM seems to be around 1.5. As an aside, we would like to remark that a similar comment applies to the Gustafson-Kessel algorithm [9] (see, for example, [10]). On the other hand, as pointed out in Section I, the PCM is a particular implementation of the possibilistic approach. In fact, we could eliminate m altogether by choosing alternative formulations of the PCM. For instance, if we use the formulation

$$J_i(\beta_i, U_i; X) = \sum_{j=1}^{N} u_{ij} \, d^2(x_j, \beta_i) + \eta_i \sum_{j=1}^{N} \left( u_{ij} \log u_{ij} - u_{ij} \right) \qquad (6)$$

then the membership update equation becomes

$$u_{ij} = \exp\left( -\frac{d^2(x_j, \beta_i)}{\eta_i} \right). \qquad (7)$$

The prototype update equations will remain unchanged. Since the exponential function decays much more rapidly for large values of $d^2(x_j, \beta_i)$, this formulation may be more appropriate when clusters are expected to be close to one another. Note that $u_{ij} \log u_{ij} - u_{ij}$ is a monotonically decreasing function in [0,1], similar to $(1 - u_{ij})^m$. If we use the Mahalanobis distance for $d^2(x_j, \beta_i)$, then we can even eliminate $\eta_i$ (i.e., set $\eta_i = 1$). A robust estimate of the covariance matrix will take its place.
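A minimal sketch of this alternative membership (our own illustration; exp_membership is a hypothetical name) shows how quickly (7) decays compared with (2) at the same scaled distance:

```python
import numpy as np

def exp_membership(d2, eta):
    # Membership update (7) of the fuzzifier-free formulation (6).
    return np.exp(-d2 / eta)

# At d^2/eta = 4 (roughly two standard deviations for a Gaussian cluster),
# (7) gives exp(-4) ~= 0.018, an even sharper cutoff than (2) with m = 1.5.
print(exp_membership(np.array([4.0]), 1.0))
```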
Fig. 5. A scatter plot of the seven-dimensional data set from the multichannel satellite image (courtesy of Barni et al.) mapped to two dimensions using Dzwinel's algorithm. Only two distinct clusters can be seen.

IV. COINCIDENT CLUSTERS: A BLESSING IN DISGUISE

As mentioned in Section II, the FCM is primarily a partitioning algorithm, whereas the PCM is primarily a mode-seeking algorithm. If we were to smooth the feature histogram by a kernel function that is shaped like the PCM membership function, then the modes corresponding to the peaks in the smoothed histogram would be the locations of convergence of the PCM. A distinguishing characteristic of the PCM is that the membership $u_{ij}$ of a point $x_j$ in cluster i is not relative and depends only on the distance of $x_j$ from $\beta_i$. Therefore, the PCM objective function can be viewed as a collection of C independent functions. This means that even if the true value of C is unknown, the outcome of the algorithm will be useful as long as the desired scale $\eta_i$ is specified or estimated accurately and a good initialization for the prototypes or memberships is provided. In other words, the algorithm can potentially find C good clusters from a data set that may have more than C clusters. (As shown in Section II, by definition, "good" clusters in the PCM correspond to dense regions.) On the other hand, if the data set has fewer than C clusters, then the algorithm can still potentially find C good clusters, out of which some may be identical. It follows that the specification of C is somewhat irrelevant in the PCM. In particular, if C is chosen to be one, then the PCM will find one good cluster. Thus, it can be seen that the PCM approach has the potential for solving one of the major problems with the FCM, namely, the need to know the number of clusters (see, also, [11]).

The behavior of the FCM is quite different. It will always find the specified number of "clusters" by arbitrarily splitting or merging clusters in the data set. In other words, the FCM makes no guarantees about the kind of "clusters" that it will find, and the partition generated by the FCM may or may not correspond to C intuitively correct clusters. To illustrate this point, let us consider the case when there is only one cluster, and let us assume that we apply the FCM with C = 2. The FCM will find two clusters by artificially splitting the single cluster in the middle. Although the FCM "fails to recognize the structure underlying the data set" (to borrow a phrase that Barni et al. attribute to the PCM), it does not mean that the FCM result is "bad." It simply means that the assumptions under which the FCM produces good clusters have been violated, because the FCM assumes that the number of clusters is correctly specified. For the situation described above, if we apply the PCM, it will find two coincident clusters. In this case, the PCM does recognize the structure of the data better.

Fig. 6. The segmentation result produced by the FCM when all seven channels are used.

Fig. 7. The segmentation result produced by the FCM when only channels five and six are used.

Similarly, if for a particular initialization the PCM gives coincidental clusters, it does not mean that the result is bad. By merging coincident clusters after overspecifying C, or by sequentially removing clusters after repeatedly applying the PCM with C = 1, one can, in fact, determine the number of clusters, thus addressing the problem of cluster validity [11].
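One simple way to exploit this property, sketched below as our own illustration (merge_coincident is a hypothetical helper and the distance threshold tol is an assumption), is to run the PCM with a deliberately large C and then collapse prototypes that have converged to essentially the same location; the number of surviving prototypes is an estimate of the number of clusters.

```python
import numpy as np

def merge_coincident(prototypes, tol=1e-3):
    """Collapse PCM prototypes that converged to (nearly) the same point."""
    merged = []
    for p in prototypes:
        if all(np.linalg.norm(p - q) > tol for q in merged):
            merged.append(p)
    return np.array(merged)

# Example: four prototypes, three of which landed on the same dense region.
protos = np.array([[0.1, 0.2], [5.0, 5.1], [5.0, 5.1], [5.0, 5.1]])
print(len(merge_coincident(protos)))   # -> 2 distinct clusters
```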
V. EXPERIMENTAL RESULTS REPORTED BY BARNI et al.: AN ALTERNATIVE INTERPRETATION

In [12], Barni et al. reported some difficulties they encountered while running the PCM. They also graciously agreed to supply us with the data sets, so we were able to run the same experiments and reproduce their results. Here, we analyze and reinterpret the results and explain why they are consistent with the derivations and explanations given in Sections II-IV.

We first discuss the example with three Gaussian clusters (Fig. 4 in [12]). This data set is relatively clean and has no noise. Barni et al. reported that even when initialized with the FCM, the PCM finds three coincidental clusters, i.e., it just finds the larger cluster in the center and ignores the smaller clusters on either side, thus "failing to recognize the structure underlying the data set." A closer examination reveals that this behavior is very easily explained by an incorrect choice of the fuzzifier parameter. Since the clusters are not well separated but actually touch one another, the membership functions need to be chosen in such a way that they decay rapidly enough outside the zone of influence. As explained in Section III, a value of two for m is simply not the best choice. Fig. 4 shows the results of the PCM on the same data set when m = 1.5. The results of the FCM are also shown for comparison. It can be seen that the two results are very similar.
Fig. 8. (a) A scatter plot of the feature points from channels five and six. (b) Clusters found by the FCM. (c) Clusters found by the PCM.

We can also completely do away with the fuzzifier parameter by formulating the PCM differently, as in (6) and (7).

We now discuss the multichannel satellite image data set. It so happens that this data set is ideal for explaining and illustrating the differences between the FCM and the PCM. The image corresponding to one of the seven channels is shown in Fig. 1 in [12]. Barni et al. applied the FCM with C = 4. Even though it may be true that there are four main classes (water, wooded areas, agricultural land, and urban areas) in this image, when viewed at the proper resolution, one can easily find many regions that do not fit into any of the four categories. For example, what about beaches? Do we distinguish between roads, bridges, and buildings, or lump them all into the category of urban areas? In the latter case, do the features allow us to do that? There is much variation in the intensity values within the so-called "urban areas" no matter which channel is viewed. Similarly, some of the water is so highly reflective that it appears white. Similar variations occur in the wooded areas. In other words, unless more sophisticated features such as texture are taken into account, it is quite impossible to use a clustering technique to "separate" these classes by intensity values alone, and one can hardly expect the seven-dimensional feature space to contain four distinct "clusters."

To verify our suspicion, we used Dzwinel's algorithm [13], based on the molecular dynamics approach, to find the global minimum of Sammon's criterion for mapping higher dimensional data to lower dimensions. The resulting two-dimensional rendition of the seven-dimensional data set is shown in Fig. 5. Since the number of data points is very large (512 x 699), to prevent clutter, a subsample of the data set is shown. Only two distinct clusters are seen. The water region appears as the smaller and denser cluster, because in this region there is relatively less variation in the intensity values in all channels. The highly reflective areas that appear white in the image show up as outliers in this mapping. The larger cluster includes samples from all the remaining regions, and it is hard, if not impossible, to distinguish the three classes within this cluster.

Another interesting fact about this data set is that the channels are highly correlated.

Fig. 9. The segmentation result produced by the PCM when only channels five and six are used, after the three coincident clusters have been merged.

To illustrate this fact, we show the segmentation result of the FCM when all seven channels are used (Fig. 6) and when only channels five and six are used (Fig. 7). It can be seen that the results are similar. Thus, the remaining channels do not contribute much to the final result. Moreover, from Fig. 6 it can be seen that the FCM misclassification rate is unacceptably high. This is mainly due to the choice of the features. They are not sufficiently homogeneous within each class and distinct between classes.

Since the channels are highly correlated, and since a two-dimensional feature space can be easily plotted and viewed, here we use the two-dimensional data (of channels five and six) to explain the results reported by Barni et al. Fig. 8(a) shows the scatter plot of the two-dimensional feature vectors. (Again, to prevent clutter, a subsample of the data set is shown.) The similarity between this plot and the one in Fig. 5 can be readily seen. The bottom left corner of the plot contains the small elongated cluster corresponding to the water region, which gradually merges into the large cluster represented by the remaining regions. The 4-partition generated by the FCM on the data set is shown in Fig. 8(b) after assigning each point to the cluster with the highest membership. The points in different "clusters" are shown using different symbols. It can be seen that the FCM arbitrarily divides the large cluster into three components which may or may not correspond to the ideal partition from a classification point of view. In this case, it apparently does not, judging from the high rate of misclassifications. Moreover, the components hardly represent "clusters." Again, this result does not mean that the FCM is not useful. It simply means that it has been misapplied. Fig. 8(c) shows the result of the PCM when it is applied with C = 4. As reported by Barni et al., the PCM identifies the smaller cluster but produces three coincidental clusters rather than splitting the bigger cluster into three. (In Fig. 8(c), the three coincidental clusters have been merged.) The resulting segmentation is shown in Fig. 9.

Whether the result in Fig. 8(b) or the one in Fig. 8(c) is better depends on the application. When the data points do not fall into the desired number of clusters, the arbitrary partition produced by the FCM may happen to be reasonably close to the ideal partition from a classification point of view. However, this is by no means guaranteed. On the other hand, from a clustering point of view, one can argue that the PCM produces a better result. In any case, one can do much better by generating more meaningful features for this application.

VI. THE (MIS)USE OF CLUSTERING IN PATTERN RECOGNITION

Clustering is a powerful technique with a wide variety of applications not only in pattern recognition, but also in areas such as probability density estimation (mixture density decomposition), membership function estimation, image coding, model identification in control systems, and neural network training. Nevertheless, it is to be remembered that clustering is an unsupervised technique, and it does not use label information. Thus, in the training phase of a pattern-recognition system, when the labels for the pattern vectors are available, one is better off using a supervised technique than a clustering algorithm. Unfortunately, too often one sees comparisons made between a supervised technique such as the Bayes method or a neural-network method and a clustering technique such as the FCM. There is, however, an interesting and useful way to use clustering methods in a supervised mode. This is done by applying a clustering algorithm to the data points belonging to each class separately. This enables us to estimate the parameters of the class prototype(s). Note that depending on the features, the data points corresponding to a particular class may or may not fall into one cluster. Therefore, a correct number of clusters needs to be chosen to extract a representation of the data points corresponding to a class in terms of prototype parameters. Once the prototype parameters have been estimated for each class, test data can be classified using a suitable rule. This procedure is roughly equivalent to estimating the probability density functions of the classes, and will yield similar results (see, for example, [14]).
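As a sketch of this supervised use of clustering (our own illustration, not from the correspondence): the helper names fit_class_prototypes and classify are hypothetical, the clustering routine itself is passed in as cluster_fn (e.g., an FCM or PCM run that returns prototype locations), and a nearest-prototype rule stands in for the "suitable rule" mentioned above.

```python
import numpy as np

def fit_class_prototypes(X, y, cluster_fn, clusters_per_class=1):
    """Run a clustering algorithm separately on the points of each class and
    keep the resulting prototypes, tagged with the class label."""
    prototypes, labels = [], []
    for c in np.unique(y):
        for proto in cluster_fn(X[y == c], clusters_per_class):
            prototypes.append(proto)
            labels.append(c)
    return np.array(prototypes), np.array(labels)

def classify(x, prototypes, labels):
    """Nearest-prototype rule applied to a test point once prototypes are estimated."""
    d2 = np.sum((prototypes - x) ** 2, axis=1)
    return labels[np.argmin(d2)]
```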
The use of clustering for classification (rather than for parameter estimation) may be justified if it is known that the classes are well separated in feature space and points belonging to different classes do form clusters. It has been argued that fuzzy clustering techniques are more meaningful when clusters touch or overlap. Hence, it might seem that they can be used for classification even when the classes are not well separated. However, the more meaningful result of fuzzy clustering is only from a clustering (or data representation) point of view, and not from a classification point of view. For example, the center point in the famous "butterfly" data set in [15] gets a membership of 0.5 in both clusters when fuzzy clustering is used.
This does make sense because this point is exactly between the two clusters. However, if we in fact knew that the center point had a label that was the same as that of the cluster on the left, then a membership of 0.5 does not make sense from a classification point of view. The partition generated by a fuzzy clustering algorithm has more to do with the distance measure chosen than with the class labels associated with the data points. Thus, although in specific cases the partition may roughly correspond with the correct classification, this is simply fortuitous unless we, in fact, know the distributions of the clusters in feature space and choose the distance measures and the clustering algorithm accordingly.

VII. CONCLUSION

Since the objective function of the PCM is a modification of that of the FCM, the PCM may appear to be a close cousin of the FCM. However, as shown in this correspondence, there are fundamental differences between the two algorithms. The FCM is primarily a partitioning algorithm, whereas the PCM is primarily a mode-seeking algorithm. The power of the PCM does not lie in creating partitions, but rather in finding meaningful clusters as defined by dense regions. Its strengths are that it overcomes the need to specify the number of clusters and it is highly robust in the presence of noise and outliers. Its weakness is that it requires a good initialization and a reliable scale estimate to function effectively. When the data is not severely contaminated, the FCM can provide a reasonable initialization and a scale estimate. Thus, with the proper choice of the scale and fuzzifier parameters, the PCM can be used to improve the results of the FCM.

The situation is quite different when the data set is highly noisy. The least squares (LS) algorithm is a general regression algorithm in statistics that minimizes the sum of the squares of the residues, and it has been used extensively in engineering applications. It has been shown [6], [7] that the FCM is a generalization of the least squares technique that uses harmonic means of distances to prototypes as residues. It is well known that LS analysis is severely compromised by a single outlier in the data set. Thus, the FCM can completely break down in the presence of a single outlier. In contrast, it has been shown [6]-[8] that the PCM is a robust parameter estimation technique that is related to the M-estimator [16], which has been widely used in robust statistics with good results. Robust techniques can tolerate up to 50% noise [4]. As shown in Fig. 3, the FCM is quite inadequate as an initialization and scale estimation tool when noise is present. Fortunately, there are many techniques in robust statistics that can help us in this regard.
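To make the breakdown-versus-robustness contrast concrete, here is a toy numeric sketch (our own; it is not the analysis of [6]-[8]). It compares the least squares location estimate with a PCM-style reweighted estimate using the exponential memberships of (7); the fixed scale η = 1 and the simple fixed-point loop are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 20), [100.0]])   # 20 inliers and one gross outlier

ls_center = x.mean()        # least-squares estimate: dragged toward the outlier

# A few PCM-style reweighting steps using the exponential memberships of (7).
# eta = 1.0 assumes the inlier variance is known; in practice it would be
# estimated robustly, as discussed in Section III.
center = ls_center
for _ in range(25):
    u = np.exp(-(x - center) ** 2 / 1.0)    # typicality weights; nearly zero for the outlier
    center = np.sum(u * x) / np.sum(u)

print(f"LS estimate: {ls_center:.2f}, PCM-style estimate: {center:.2f}")
# The LS estimate is pulled several units toward 100, while the reweighted
# estimate settles near the center of the inlier bulk.
```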
REFERENCES

[1] R. Krishnapuram and J. Keller, "A possibilistic approach to clustering," IEEE Trans. Fuzzy Syst., vol. 1, pp. 98-110, May 1993.
[2] J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles. Reading, MA: Addison-Wesley, 1974.
[3] G. Beni and X. Liu, "A least biased fuzzy clustering method," IEEE Trans. Pattern Anal. Machine Intell., vol. 16, pp. 954-960, Sept. 1994.
[4] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. New York: Wiley, 1987.
[5] R. Davé and R. Krishnapuram, "Robust algorithms for clustering," in Proc. Int. Fuzzy Syst. Assoc. Congress, São Paulo, Brazil, July 1995, vol. I, pp. 561-564.
[6] J. Kim, R. Krishnapuram, and R. Davé, "On robustifying the C-means algorithms," in Proc. North Amer. Fuzzy Informat. Processing Soc. Conf., College Park, MD, Sept. 1995, pp. 630-635.
[7] O. Nasraoui and R. Krishnapuram, "Crisp interpretations of fuzzy and possibilistic clustering algorithms," in Proc. Eur. Congress Fuzzy Intell. Technol., Aachen, Germany, Sept. 1995, pp. 1312-1318.
[8] R. Davé and R. Krishnapuram, "Robust clustering methods: a unified view," IEEE Trans. Fuzzy Syst., to be published.
[9] D. E. Gustafson and W. C. Kessel, "Fuzzy clustering with a fuzzy covariance matrix," in Proc. IEEE CDC, San Diego, CA, 1979, pp. 761-766.
[10] R. Krishnapuram and C.-P. Freg, "Fitting an unknown number of lines and planes to image data through compatible cluster merging," Pattern Recogn., vol. 25, Apr. 1992, pp. 385-400.
[11] R. Krishnapuram, "Generation of membership functions via possibilistic clustering," in 3rd IEEE Conf. Fuzzy Syst., Orlando, FL, July 1994, pp. 902-908.
[12] M. Barni, V. Cappellini, and A. Mecocci, "Comments on 'A possibilistic approach to clustering,'" IEEE Trans. Fuzzy Syst., vol. 4, pp. 393-396, Aug. 1996.
[13] W. Dzwinel, "In search of the global minimum in problems of feature extraction and selection," in Proc. Eur. Congress Intell. Tech. Soft Computing, Aachen, Germany, Sept. 1995, pp. 1326-1330.
[14] S. Medasani, J. Kim, and R. Krishnapuram, "Estimation of membership functions for pattern recognition and computer vision," in Fuzzy Logic and its Applications to Engineering, Information Sciences, and Intelligent Systems, K. C. Min and Z. Bien, Eds. Norwell, MA: Kluwer, 1995, pp. 45-54.
[15] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum, 1981.
[16] P. J. Huber, Robust Statistics. New York: Wiley, 1981.

Comments on "A Possibilistic Approach to Clustering"

M. Barni, V. Cappellini, and A. Mecocci

Abstract- In this comment, we report a difficulty with the application of the possibilistic approach to fuzzy clustering (PCM) proposed by Keller and Krishnapuram. In applying this algorithm we found that it has the undesirable tendency to produce coincident clusters. Results illustrating this tendency are reported and a possible explanation for the PCM behavior is suggested.

I. INTRODUCTION

In their paper,¹ Krishnapuram and Keller presented a new approach to fuzzy clustering [possibilistic c-means (PCM)]. By relaxing the constraint that the memberships of a data point across classes sum to one, each cluster is disentangled from the others, and the membership values are interpreted as the compatibilities of the point to the class prototypes. Besides, the possibilistic approach leads to higher noise immunity with respect to classical algorithms derived from Bezdek's fuzzy c-means (FCM) [1]. Indeed, the novel approach is very interesting since, by recasting fuzzy clustering into the framework of possibility theory, membership functions are directly related to the typicality of data points with respect to the given classes. In this way, classification tasks are made easier and the impact of spurious points on the final partition is reduced.

The purpose here is to describe our experience in applying the PCM algorithm, whose performance, we found, is severely compromised

Manuscript received March 30, 1995; revised August 1, 1995. M. Barni and V. Cappellini are with the Department of Electronic Engineering, University of Florence, Via S. Marta 3, 50139 Firenze, Italy. A. Mecocci is with the Department of Electronic Engineering, University of Pavia, Via Abbiategrasso 209, 27100 Pavia, Italy. Publisher Item Identifier S 1063-6706(96)05624-X.

¹R. Krishnapuram and J. M. Keller, IEEE Trans. Fuzzy Syst., vol. 1, pp. 98-110, May 1993.

