Outliers in Rough k-Means Clustering
Georg Peters
1 Introduction
In the past years, soft computing methodologies like fuzzy sets [14], neural nets [3] or rough sets [9,10] have been proposed to handle challenges in data mining [8] such as clustering. Recently, rough cluster algorithms were introduced by Lingras et al. [5], do Prado et al. [2] and Voges et al. [11,12]. In rough clustering each cluster has a lower and an upper approximation. Data in a lower approximation belong exclusively to that cluster. Due to missing information, the membership of data in an upper approximation is uncertain; such data must therefore be assigned to at least two upper approximations. The objective of this paper is to analyze Lingras et al.'s [5] cluster algorithm with respect to its compliance with the classical k-means, its numerical stability and its behavior in the presence of outliers. In an example we show that a variation of the algorithm delivers more intuitive results under these circumstances.
The structure of the paper is as follows. In Section 2 we introduce Lingras et al.'s rough k-means cluster algorithm. In Section 3 we investigate the properties mentioned above and introduce an improved version of the algorithm. The paper concludes with a summary in Section 4.
S.K. Pal et al. (Eds.): PReMI 2005, LNCS 3776, pp. 702–707, 2005.
© Springer-Verlag Berlin Heidelberg 2005
Let:
– Data set: $X_n$ the $n$th data object and $X = (X_1, ..., X_N)^T$, $n = 1, ..., N$.
– $C_k$ the $k$th cluster, $\underline{C}_k$ its lower and $\overline{C}_k$ its upper approximation; $C_k^B = \overline{C}_k - \underline{C}_k$ the boundary area.
– Mean $m_k$ of cluster $C_k$; $M = (m_1, ..., m_K)^T$ with $k = 1, ..., K$.
– Distance between $X_n$ and $m_k$: $d(X_n, m_k) = \|X_n - m_k\|$.
– Step 0: Initialization. Randomly assign each data object to one and only
one lower approximation and its corresponding upper approximation.
– Step 1: Calculation of the new means.
\[
m_k =
\begin{cases}
w_l \displaystyle\sum_{X_n \in \underline{C}_k} \frac{X_n}{|\underline{C}_k|} + w_b \displaystyle\sum_{X_n \in C_k^B} \frac{X_n}{|C_k^B|} & \text{for } C_k^B \neq \emptyset \\[2ex]
w_l \displaystyle\sum_{X_n \in \underline{C}_k} \frac{X_n}{|\underline{C}_k|} & \text{otherwise}
\end{cases}
\tag{1}
\]
with the weights $w_l$ and $w_b$. $|\underline{C}_k|$ indicates the number of data objects in the lower approximation and $|C_k^B| = |\overline{C}_k - \underline{C}_k|$ the number in the boundary area of cluster $k$.
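The mean update of Step 1 can be sketched in Python as follows. This is a minimal NumPy sketch, not the authors' code; the function and variable names are my own, and the `otherwise` branch follows Eq. (1) as printed.

```python
import numpy as np

def rough_means(X, lower, boundary, w_l=0.7, w_b=0.3):
    """Update cluster means per Eq. (1): a weighted combination of the
    average over the lower approximation and over the boundary area.

    X        -- (N, d) array of data objects
    lower    -- list of K index sets, the lower approximations
    boundary -- list of K index sets, the boundary areas
    """
    K = len(lower)
    means = np.zeros((K, X.shape[1]))
    for k in range(K):
        lo = X[list(lower[k])]     # data objects in the lower approximation
        bd = X[list(boundary[k])]  # data objects in the boundary area
        if len(bd) > 0:
            means[k] = w_l * lo.mean(axis=0) + w_b * bd.mean(axis=0)
        else:
            means[k] = w_l * lo.mean(axis=0)  # "otherwise" branch of Eq. (1)
    return means
```

When the boundary area is empty, all weight rests on the lower approximation, so the update degenerates to a (weighted) classical k-means mean.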
– Step 2: Assign the data objects to the approximations.
(i) For a given data object Xn determine its closest mean mh :
\[
d_{n,h}^{\min} = d(X_n, m_h) = \min_{k=1,...,K} d(X_n, m_k). \tag{2}
\]
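Step 2(i) amounts to a nearest-mean search over all K cluster means. A minimal sketch (the names are illustrative, not from the paper):

```python
import numpy as np

def closest_mean(x, means):
    """Step 2(i), Eq. (2): return the index h of the nearest mean
    and the corresponding minimal distance d(X_n, m_h)."""
    d = np.linalg.norm(means - x, axis=1)  # d(X_n, m_k) for all k
    h = int(np.argmin(d))
    return h, d[h]
```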
Lingras et al. applied their algorithm to the analysis of student web access
logs at a university. Mitra [7] presented an evolutionary version and applied it
to vowel, forest cover and colon cancer data.
Now, besides the solution above, we obtain the following stable result, where the lower approximations of both clusters are empty:
\[
M^T = \begin{pmatrix} 0.47 & 0.47 \\ 0.51 & 0.51 \end{pmatrix}
\]
An expert now has to decide which solution is considered optimal. In the present case it would probably be the first one.
To determine the set T, Lingras et al. suggest considering the absolute distance d(X_n, m_k) − d(X_n, m_h). However, consider the following data constellation incorporating an outlier A in line with Means 1 and 2 (Figure 2).
[Figure 2: Means 1 and 2 and data object A on a line, with a the distance between the two means and b the distance from Mean 2 to A.]
\[
d(X_n, m_k) - d(X_n, m_h) \leq \epsilon
\;\Rightarrow\; (a + b) - b \leq \epsilon
\;\Rightarrow\; a \leq \epsilon. \tag{7}
\]
\[
T = \left\{ t : \frac{d(X_n, m_k)}{d(X_n, m_h)} \leq \zeta \;\wedge\; h \neq k \right\}. \tag{8}
\]
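The contrast between the two criteria can be illustrated numerically for the constellation of Figure 2, where A lies at distance a + b from the farther mean and b from the nearer one. This is a hypothetical sketch with my own helper names and example numbers:

```python
def in_boundary_absolute(d_far, d_near, eps):
    """Lingras et al.'s absolute criterion: d(X_n, m_k) - d(X_n, m_h) <= eps."""
    return d_far - d_near <= eps

def in_boundary_relative(d_far, d_near, zeta):
    """Relative criterion of Eq. (8): d(X_n, m_k) / d(X_n, m_h) <= zeta."""
    return d_far / d_near <= zeta

# Outlier A in line with the means: a = distance between the means,
# b = distance from the nearer mean to A (Figure 2).
a, b = 0.5, 10.0
print(in_boundary_absolute(a + b, b, eps=0.1))    # False: a > eps, however far A is
print(in_boundary_relative(a + b, b, zeta=1.86))  # True: (a + b) / b -> 1 for distant A
```

The absolute criterion depends only on a, the distance between the means, so a far-away outlier can still land in a lower approximation; the relative criterion instead approaches 1 as A moves away and places it in the boundary area.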
[Figure 3: two panels showing the clustering results with the absolute and with the relative threshold. Legend: boundary area, lower approximation, mean.]
To test this, an experiment was set up. We generated two normally distributed two-dimensional clusters, each containing 100 data objects, and added a data point representing an outlier (Figure 3). Let $w_l = 0.7$, $w_b = 0.3$ and Lingras et al.'s absolute threshold $\epsilon = 0.1$. The relative threshold was set to $\zeta = 1.86$ so that the boundary area contains 130 data objects in both cases. While the outlier was assigned to a lower approximation in the Lingras et al. case, it became a member of the boundary area when applying the relative threshold (Figure 3). Furthermore, the Davies-Bouldin index [1], based on the data objects in the lower approximations, improved from $DB_{absolute} = 1.4137$ to $DB_{relative} = 0.9635$.
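The validity measure used above can be sketched as follows: a minimal implementation of the Davies-Bouldin index [1], here restricted to the data objects in the lower approximations. The names are my own; lower values indicate better separation.

```python
import numpy as np

def davies_bouldin(X, labels, means):
    """Davies-Bouldin index [1] over hard-assigned data objects.

    X      -- (N, d) array of data objects (lower approximations only)
    labels -- cluster index per data object
    means  -- (K, d) array of cluster means
    """
    K = len(means)
    # Within-cluster scatter: average distance to the cluster mean.
    s = np.array([np.linalg.norm(X[labels == k] - means[k], axis=1).mean()
                  for k in range(K)])
    db = 0.0
    for k in range(K):
        # Worst-case similarity of cluster k to any other cluster.
        ratios = [(s[k] + s[j]) / np.linalg.norm(means[k] - means[j])
                  for j in range(K) if j != k]
        db += max(ratios)
    return db / K
```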
4 Conclusion
In this paper we investigated the performance of Lingras et al.'s rough cluster algorithm with respect to its compliance with the classical k-means, its numerical stability and its behavior in the presence of outliers. We suggested an improved version of the algorithm that delivers more intuitive results in comparison to Lingras et al.'s approach. In future research we will evaluate further features of the algorithm, compare it to classical cluster algorithms and conduct experiments with real-life data.
References
1. D.L. Davies and D.W. Bouldin. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1:224–227, 1979.
2. H.A. do Prado, P.M. Engel, and H.C. Filho. Rough clustering: An alternative to find meaningful clusters by using the reducts from a dataset. In Rough Sets and Current Trends in Computing: Third International Conference, RSCTC 2002, volume 2475 of LNCS, pages 234–238, Berlin, 2002. Springer.
3. S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 1998.
4. P. Lingras and C. West. Interval set clustering of web users with rough k-means. Technical Report 2002-002, Department of Mathematics and Computer Science, St. Mary's University, Halifax, Canada, 2002.
5. P. Lingras and C. West. Interval set clustering of web users with rough k-means. Journal of Intelligent Information Systems, 23:5–16, 2004.
6. P. Lingras, R. Yan, and C. West. Comparison of conventional and rough k-means clustering. In International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, volume 2639 of LNAI, pages 130–137, Berlin, 2003. Springer.
7. S. Mitra. An evolutionary rough partitive clustering. Pattern Recognition Letters, 25:1439–1449, 2004.
8. S. Mitra and T. Acharya. Data Mining: Multimedia, Soft Computing, and Bioinformatics. John Wiley, New York, 2003.
9. Z. Pawlak. Rough sets. International Journal of Information and Computer Sciences, 11:145–172, 1982.
10. Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Boston, 1992.
11. K.E. Voges, N.K. Pope, and M.R. Brown. Heuristics and Optimization for Knowledge Discovery, chapter Cluster Analysis of Marketing Data Examining On-Line Shopping Orientation: A Comparison of k-Means and Rough Clustering Approaches, pages 207–224. Idea Group Publishing, Hershey, PA, 2002.
12. K.E. Voges, N.K. Pope, and M.R. Brown. A rough cluster analysis of shopping orientation data. In Proceedings Australian and New Zealand Marketing Academy Conference, pages 1625–1631, Adelaide, 2003.
13. Y.Y. Yao, X. Li, T.Y. Lin, and Q. Liu. Representation and classification of rough set models. In Proceedings Third International Workshop on Rough Sets and Soft Computing, pages 630–637, San Jose, CA, 1994.
14. H.J. Zimmermann. Fuzzy Set Theory and its Applications. Kluwer Academic Publishers, Boston, 4th edition, 2001.