
Outliers in Rough k-Means Clustering

Georg Peters

Munich University of Applied Sciences,


Department of Computer Science, 80335 Munich, Germany
georg.peters@cs.fhm.edu

Abstract. Recently rough cluster algorithms were introduced and successfully applied to real-life data. In this paper we analyze the rough k-means introduced by Lingras et al. with respect to its compliance with the classical k-means, its numerical stability and its performance in the presence of outliers. We suggest a variation of the algorithm that shows improved results in these circumstances.

1 Introduction
In the past years soft computing methodologies like fuzzy sets [14], neural nets [3] or rough sets [9,10] have been proposed to handle challenges in data mining [8] like clustering. Recently rough cluster algorithms were introduced by Lingras et al. [5], do Prado et al. [2] and Voges et al. [11,12]. In rough clustering each cluster has a lower and an upper approximation. The data in a lower approximation belong exclusively to that cluster. Due to missing information the membership of data in an upper approximation is uncertain; such data objects must therefore be assigned to at least two upper approximations. The objective of this paper is to analyze Lingras et al.'s [5] cluster algorithm with respect to its compliance with the classical k-means, its numerical stability and its behavior in the presence of outliers. In an example we show that a variation of the algorithm delivers more intuitive results in these circumstances.
The structure of the paper is as follows. In Section 2 we introduce Lingras et al.'s rough k-means cluster algorithm. In Section 3 we investigate the properties mentioned above and introduce an improved version of the algorithm. The paper concludes with a summary in Section 4.

2 Lingras’ Rough k-Means Clustering Algorithm


Lingras et al. [5] assume the following set of properties for rough sets: (1) A data object can be a member of one lower approximation at most. (2) The lower approximation of a cluster is a subset of its upper approximation. (3) A data object that does not belong to any lower approximation is a member of at least two upper approximations.
Hence their cluster algorithm belongs to the branch of rough set theory with a reduced interpretation of rough sets as lower and upper approximations of data constellations [13]. The algorithm proceeds as follows.


Let:
– Data set: $X_n$ the $n$th data point and $X = (X_1, \dots, X_N)^T$, $n = 1, \dots, N$.
– $C_k$ the $k$th cluster, $\underline{C}_k$ its lower and $\overline{C}_k$ its upper approximation; $C_k^B = \overline{C}_k - \underline{C}_k$ the boundary area.
– Mean $m_k$ of cluster $C_k$; $M = (m_1, \dots, m_K)^T$ with $k = 1, \dots, K$.
– Distance between $X_n$ and $m_k$: $d(X_n, m_k) = \| X_n - m_k \|$.

– Step 0: Initialization. Randomly assign each data object to one and only
one lower approximation and its corresponding upper approximation.
– Step 1: Calculation of the new means.
$$
m_k = \begin{cases}
\displaystyle w^l \sum_{X_n \in \underline{C}_k} \frac{X_n}{|\underline{C}_k|} + w^b \sum_{X_n \in C_k^B} \frac{X_n}{|C_k^B|} & \text{for } C_k^B \neq \emptyset \\[3mm]
\displaystyle w^l \sum_{X_n \in \underline{C}_k} \frac{X_n}{|\underline{C}_k|} & \text{otherwise}
\end{cases}
\qquad (1)
$$
with the weights $w^l$ and $w^b$. $|\underline{C}_k|$ denotes the number of data objects in the lower approximation and $|C_k^B| = |\overline{C}_k - \underline{C}_k|$ the number of data objects in the boundary area of cluster $k$.
– Step 2: Assign the data objects to the approximations.
(i) For a given data object $X_n$ determine its closest mean $m_h$:
$$
d_{n,h}^{\min} = d(X_n, m_h) = \min_{k=1,\dots,K} d(X_n, m_k). \qquad (2)
$$
Assign $X_n$ to the upper approximation of cluster $h$: $X_n \in \overline{C}_h$.
(ii) Determine the set $T$ with a given threshold $\epsilon$:
$$
T = \{\, t : d(X_n, m_t) - d(X_n, m_h) \leq \epsilon \ \wedge \ h \neq t \,\}. \qquad (3)
$$
If $T \neq \emptyset$ Then $\{X_n \in \overline{C}_t, \forall t \in T\}$ Else $\{X_n \in \underline{C}_h\}$.


– Step 3: Check convergence of the algorithm.
If (Algorithm converged) Then {STOP} Else {Goto Step 1}.
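A minimal NumPy sketch of Steps 0 to 3 may look as follows. It assumes $X$ is an $(N \times D)$ array with Euclidean distances, follows Equation (1) and the absolute threshold of Equation (3) as stated above, and the function name rough_kmeans as well as the default parameter values are ours for illustration only.

```python
import numpy as np

def rough_kmeans(X, K, w_lower=0.7, w_boundary=0.3, eps=0.1,
                 max_iter=100, seed=0):
    """Sketch of the rough k-means of Section 2 (Steps 0-3)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape

    # Step 0: randomly assign each object to exactly one lower approximation
    # (and thereby to the corresponding upper approximation).
    lower = [set() for _ in range(K)]
    boundary = [set() for _ in range(K)]
    for n in range(N):
        lower[int(rng.integers(K))].add(n)

    means = np.zeros((K, D))
    for _ in range(max_iter):
        # Step 1: new means, Equation (1).  An empty lower approximation
        # gives an undefined mean (division by zero) -- exactly the
        # instability discussed in Section 3.2.
        for k in range(K):
            lo, bo = list(lower[k]), list(boundary[k])
            if bo:                                   # boundary area non-empty
                means[k] = (w_lower * X[lo].mean(axis=0)
                            + w_boundary * X[bo].mean(axis=0))
            else:                                    # boundary area empty
                means[k] = w_lower * X[lo].mean(axis=0)

        # Step 2: assign every object to the approximations.
        new_lower = [set() for _ in range(K)]
        new_boundary = [set() for _ in range(K)]
        for n in range(N):
            d = np.linalg.norm(X[n] - means, axis=1)  # distances to all means
            h = int(np.argmin(d))                     # closest cluster, Eq. (2)
            # Equation (3): further clusters that are almost as close.
            T = [k for k in range(K) if k != h and d[k] - d[h] <= eps]
            if T:                                     # membership uncertain
                for t in T + [h]:
                    new_boundary[t].add(n)
            else:                                     # membership certain
                new_lower[h].add(n)

        # Step 3: stop as soon as the assignments no longer change.
        if new_lower == lower and new_boundary == boundary:
            break
        lower, boundary = new_lower, new_boundary

    return means, lower, boundary
```

Returning the lower approximations and boundary areas as index sets keeps the interval-set character of the result explicit; the means and the two approximations per cluster correspond directly to the quantities used in Section 3.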

Lingras et al. applied their algorithm to the analysis of student web access
logs at a university. Mitra [7] presented an evolutionary version and applied it
to vowel, forest cover and colon cancer data.

3 Properties of the Rough k-Means


3.1 Compliance with the Classical k-Means
Lingras et al. state [4,5,6] that their algorithm equals the classical k-means in the case that the lower and upper approximations of the clusters are identical (the boundary areas are empty). However, Equation (1) then becomes
$$
m_k = w^l \sum_{X_n \in \underline{C}_k} \frac{X_n}{|\underline{C}_k|} \qquad (4)
$$
which still carries the factor $w^l$ and therefore differs from the classical k-means centroid.

Therefore we suggest deleting $w^l$ in the second case of Equation (1) so that we obtain compliance with the classical k-means:
$$
m_k = \begin{cases}
\displaystyle w^l \sum_{X_n \in \underline{C}_k} \frac{X_n}{|\underline{C}_k|} + w^b \sum_{X_n \in C_k^B} \frac{X_n}{|C_k^B|} & \text{for } C_k^B \neq \emptyset \\[3mm]
\displaystyle \sum_{X_n \in \underline{C}_k} \frac{X_n}{|\underline{C}_k|} & \text{otherwise}
\end{cases}
\qquad (5)
$$
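As an illustration, the Step 1 body of the sketch in Section 2 could be replaced by a small helper along the lines of Equation (5). The function name mean_eq5 and its default weights are ours; the essential point is that the update reduces to the plain centroid when the boundary area is empty.

```python
import numpy as np

def mean_eq5(X, lower_idx, boundary_idx, w_lower=0.7, w_boundary=0.3):
    """Mean of one cluster per Equation (5): the factor w_lower is dropped
    when the boundary area is empty, so the update reduces to the
    classical k-means centroid in that case."""
    if len(boundary_idx) > 0:
        return (w_lower * X[lower_idx].mean(axis=0)
                + w_boundary * X[boundary_idx].mean(axis=0))
    return X[lower_idx].mean(axis=0)        # classical k-means centroid
```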

3.2 Numerical Stability


Lingras et al.'s algorithm is numerically unstable in some circumstances. Consider the following data set:
$$
X^T = \begin{pmatrix}
0.2 & 0.1 & 0.0 & 0.4 & 0.4 & 0.5 & 0.6 & 0.8 & 1.0 & 0.7 \\
0.0 & 0.2 & 0.2 & 0.5 & 0.6 & 0.5 & 0.5 & 0.8 & 0.8 & 1.0
\end{pmatrix}
$$
For $K = 2$, $w^l = 0.7$, $w^b = 0.3$ and $\epsilon = 0.2$ a stable solution of the rough k-means is given by the following meaningful result (Figure 1):
Means:
$$
M^T = \begin{pmatrix} 0.21250 & 0.72583 \\ 0.25083 & 0.76417 \end{pmatrix}
$$
Approximations:
$$
\underline{C}^T = \begin{pmatrix} 1&1&1&0&0&0&0&0&0&0 \\ 0&0&0&0&0&0&0&1&1&1 \end{pmatrix}
\quad\text{and}\quad
\overline{C}^T = \begin{pmatrix} 1&1&1&1&1&1&1&0&0&0 \\ 0&0&0&1&1&1&1&1&1&1 \end{pmatrix}
$$
However, the algorithm terminates abnormally if $|\underline{C}_k| = 0$ for some cluster $k$, which leads to a division by zero in Equation (1). So there are not only data constellations that lead to empty boundary areas, but it is also possible to obtain empty lower approximations. Therefore Equation (1) must be modified to:
$$
m_k = \begin{cases}
\displaystyle w^l \sum_{X_n \in \underline{C}_k} \frac{X_n}{|\underline{C}_k|} + w^b \sum_{X_n \in C_k^B} \frac{X_n}{|C_k^B|} & \text{for } C_k^B \neq \emptyset \wedge \underline{C}_k \neq \emptyset \\[3mm]
\displaystyle \sum_{X_n \in \underline{C}_k} \frac{X_n}{|\underline{C}_k|} & \text{for } C_k^B = \emptyset \wedge \underline{C}_k \neq \emptyset \\[3mm]
\displaystyle \sum_{X_n \in C_k^B} \frac{X_n}{|C_k^B|} & \text{for } C_k^B \neq \emptyset \wedge \underline{C}_k = \emptyset
\end{cases}
\qquad (6)
$$
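A hedged sketch of this modified Step 1 update, covering all three cases of Equation (6), might read as follows; the helper name mean_eq6 and its defaults are again illustrative only.

```python
import numpy as np

def mean_eq6(X, lower_idx, boundary_idx, w_lower=0.7, w_boundary=0.3):
    """Mean of one cluster per Equation (6): the three cases avoid the
    division by zero that occurs when a lower approximation is empty."""
    has_lower, has_boundary = len(lower_idx) > 0, len(boundary_idx) > 0
    if has_lower and has_boundary:
        return (w_lower * X[lower_idx].mean(axis=0)
                + w_boundary * X[boundary_idx].mean(axis=0))
    if has_lower:                              # boundary area empty
        return X[lower_idx].mean(axis=0)
    return X[boundary_idx].mean(axis=0)        # lower approximation empty
```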

Now, besides the solution above we obtain the following stable result where the lower approximations of both clusters are empty:
$$
M^T = \begin{pmatrix} 0.47 & 0.47 \\ 0.51 & 0.51 \end{pmatrix}
$$
An expert now has to decide which solution is considered optimal. In the present case it would probably be the first one.

Fig. 1. Cluster results: data objects, means, lower and upper approximations plotted over the unit square.

3.3 Dealing with Outliers

To determine the set $T$, Lingras et al. suggest considering the absolute distance $d(X_n, m_t) - d(X_n, m_h)$. However, consider the following data constellation incorporating an outlier A in line with means 1 and 2 (Figure 2).

Fig. 2. Data with an outlier: mean 1, mean 2 and data object A lie on a line, with mean 2 at distance $a$ from mean 1 and A at distance $b$ from mean 2.

Let $b = za$ $(z \in \mathbb{R}^+)$; then A is an outlier for large $z$. For $a \leq \epsilon$ the data object A belongs to the upper approximations of both clusters:
$$
d(X_n, m_k) - d(X_n, m_h) \leq \epsilon \;\Rightarrow\; (a + b) - b \leq \epsilon \;\Rightarrow\; a \leq \epsilon. \qquad (7)
$$

Conversely, it is a member of the lower approximation of cluster 2 if $a > \epsilon$. Obviously this is independent of the choice of $z$. However, for large $z$ the outlier should be assigned to the upper approximations of both clusters, indicating that its actual membership to one cluster is uncertain.
We suggest using a relative distance measure instead of Lingras et al.'s absolute one to determine the set $T$:

$$
T = \left\{ t : \frac{d(X_n, m_t)}{d(X_n, m_h)} \leq \zeta \ \wedge \ h \neq t \right\}. \qquad (8)
$$

The impact on the previous example is as follows. Let $b = za$ $(z \in \mathbb{R}^+)$; then we get:
$$
T = \left\{ t : \frac{a+b}{b} = \frac{1+z}{z} \leq \zeta \ \wedge \ h \neq t \right\}. \qquad (9)
$$
Note that the set $T$ now depends on $z$. In contrast to Lingras et al.'s solution, the data object A is assigned to the upper approximations of both clusters for large $z$. This indicates its uncertain status.
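The two assignment rules can be contrasted in a short sketch. The helper assign_object below is ours; it implements the absolute threshold of Equation (3) and, as an alternative, the relative threshold of Equation (8). With the configuration of Figure 2, the absolute variant yields the same set $T$ for every $z$, whereas the relative variant places a far-away outlier into both upper approximations.

```python
import numpy as np

def assign_object(x, means, eps=0.1, zeta=1.86, relative=True):
    """Assignment step with either the absolute threshold of Equation (3)
    or the relative threshold of Equation (8).  Returns the index h of the
    closest cluster and the set T of further clusters whose upper
    approximation also receives the object."""
    d = np.linalg.norm(x - means, axis=1)
    h = int(np.argmin(d))
    K = len(means)
    if relative:
        # small guard in case the object coincides with its closest mean
        ratio = d / max(d[h], 1e-12)
        T = [k for k in range(K) if k != h and ratio[k] <= zeta]
    else:
        T = [k for k in range(K) if k != h and d[k] - d[h] <= eps]
    return h, T
```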

Fig. 3. Two clusters with absolute (left) and relative (right) thresholds: boundary area, lower approximations and means.

To test this, an experiment was set up. We generated two normally distributed two-dimensional clusters of 100 data objects each and added a data point representing an outlier (Figure 3). Let $w^l = 0.7$, $w^b = 0.3$ and Lingras et al.'s threshold $\epsilon = 0.1$. The relative threshold was set to $\zeta = 1.86$ so that the boundary area contains 130 data objects in both cases. While the outlier was assigned to a lower approximation in Lingras et al.'s case, it became a member of the boundary area when the relative threshold was applied (Figure 3). Furthermore, the Davies-Bouldin index [1], based on the data objects in the lower approximations, improved from $DB_{absolute} = 1.4137$ to $DB_{relative} = 0.9635$.
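An experiment of this kind can be mimicked with synthetic data. The helper davies_bouldin below and the generated points are our own illustration under assumed cluster centers and spreads, not the authors' original sample; the index is evaluated on groups of points (here, the lower approximations) and their means.

```python
import numpy as np

def davies_bouldin(groups, means):
    """Davies-Bouldin index [1] over the given groups of points and their
    means; here the groups would be the lower approximations only."""
    K = len(groups)
    scatter = [np.linalg.norm(g - means[k], axis=1).mean()
               for k, g in enumerate(groups)]
    db = 0.0
    for i in range(K):
        db += max((scatter[i] + scatter[j]) / np.linalg.norm(means[i] - means[j])
                  for j in range(K) if j != i)
    return db / K

# Illustrative data only: two Gaussian clusters of 100 points each
# plus a single far-away outlier.
rng = np.random.default_rng(0)
A = rng.normal([0.3, 0.3], 0.08, size=(100, 2))
B = rng.normal([0.8, 0.8], 0.08, size=(100, 2))
X = np.vstack([A, B, [[1.8, 0.2]]])
```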

4 Conclusion
In this paper we investigated the performance of Lingras et al.'s rough cluster algorithm with respect to its compliance with the classical k-means, its numerical stability and its behavior in the presence of outliers. We suggested an improved version of the algorithm that delivers more intuitive results in comparison to Lingras et al.'s approach. In future research we will evaluate further features of the algorithm, compare it to classical cluster algorithms and conduct experiments with real-life data.

References
1. D.L. Davies and D.W. Bouldin. A cluster separation measure. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 1:224–227, 1979.
2. H.A. do Prado, P.M. Engel, and H.C. Filho. Rough clustering: An alternative
to find meaningful clusters by using the reducts from a dataset. In Rough Sets
and Current Trends in Computing: Third International Conference, RSCTC 2002,
volume 2475 of LNCS, pages 234–238, Berlin, 2002. Springer.
3. S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper
Saddle River, N.J., 2nd edition, 1998.
4. P. Lingras and C. West. Interval set clustering of web users with rough k-means.
Technical Report 2002-002, Department of Mathematics and Computer Science,
St. Mary’s University, Halifax, Canada, 2002.
5. P. Lingras and C. West. Interval set clustering of web users with rough k-means.
Journal of Intelligent Information Systems, 23:5–16, 2004.
6. P. Lingras, R. Yan, and C. West. Comparison of conventional and rough k-means
clustering. In International Conference on Rough Sets, Fuzzy Sets, Data Min-
ing and Granular Computing, volume 2639 of LNAI, pages 130–137, Berlin, 2003.
Springer.
7. S. Mitra. An evolutionary rough partitive clustering. Pattern Recognition Letters,
25:1439–1449, 2004.
8. S. Mitra and T. Acharya. Data Mining: Multimedia, Soft Computing, and Bioin-
formatics. John Wiley, New York, 2003.
9. Z. Pawlak. Rough sets. International Journal of Computer and Information Sci-
ences, 11:341–356, 1982.
10. Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer
Academic Publishers, Boston, 1992.
11. K.E. Voges, N.K. Pope, and M.R. Brown. Heuristics and Optimization for Knowl-
edge Discovery, chapter Cluster Analysis of Marketing Data Examining On-Line
Shopping Orientation: A Comparison of k-Means and Rough Clustering Ap-
proaches, pages 207–224. Idea Group Publishing, Hershey PA, 2002.
12. K.E. Voges, N.K. Pope, and M.R. Brown. A rough cluster analysis of shopping
orientation data. In Proceedings Australian and New Zealand Marketing Academy
Conference, pages 1625–1631, Adelaide, 2003.
13. Y.Y. Yao, X. Li, T.Y. Lin, and Q. Liu. Representation and classification of rough
set models. In Proceedings Third International Workshop on Rough Sets and Soft
Computing, pages 630–637, San Jose, CA, 1994.
14. H.J. Zimmermann. Fuzzy Set Theory and its Applications. Kluwer Academic
Publishers, Boston, 4th edition, 2001.
