Advances in Knowledge Discovery and Data Mining - Joshua Zhexue Huang, Longbing Cao, Jaideep Srivastava (Eds.)
Advances in Knowledge Discovery and Data Mining
Series Editors
Volume Editors
Longbing Cao
University of Technology Sydney
Faculty of Engineering and Information Technology
Advanced Analytics Institute
Center for Quantum Computation and Intelligent Systems
Sydney, NSW 2007, Australia
E-mail: longbing.cao-1@uts.edu.au
Jaideep Srivastava
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, MN 55455, USA
E-mail: Srivasta@cs.umn.edu
ISSN 0302-9743 e-ISSN 1611-3349
ISBN 978-3-642-20846-1 e-ISBN 978-3-642-20847-8
DOI 10.1007/978-3-642-20847-8
Springer Heidelberg Dordrecht London New York
Organizing Committee
Honorary Chair
Philip S. Yu University of Illinois at Chicago, USA
General Co-chairs
Jianping Fan Shenzhen Institutes of Advanced Technology,
CAS, China
David Cheung University of Hong Kong, China
Workshop Co-chairs
James Bailey The University of Melbourne, Australia
Yun Sing Koh The University of Auckland, New Zealand
Tutorial Co-chairs
Xiong Hui Rutgers, the State University of New Jersey,
USA
Sanjay Chawla The University of Sydney, Australia
Sponsorship Co-chairs
Yalei Bi Shenzhen Institutes of Advanced Technology,
CAS, China
Zhong Ming Shenzhen University, China
Publicity Co-chairs
Jian Yang Beijing University of Technology, China
Ye Li Shenzhen Institutes of Advanced Technology,
CAS, China
Yuming Ou University of Technology Sydney, Australia
Publication Chair
Longbing Cao University of Technology Sydney, Australia
Steering Committee
Co-chairs
Rao Kotagiri University of Melbourne, Australia
Graham Williams Australian National University, Australia
Life Members
David Cheung University of Hong Kong, China
Masaru Kitsuregawa Tokyo University, Japan
Rao Kotagiri University of Melbourne, Australia
Hiroshi Motoda AFOSR/AOARD and Osaka University, Japan
Graham Williams (Treasurer) Australian National University, Australia
Ning Zhong Maebashi Institute of Technology, Japan
Members
Ming-Syan Chen National Taiwan University, Taiwan, ROC
Tu Bao Ho Japan Advanced Institute of Science and
Technology, Japan
Ee-Peng Lim Singapore Management University, Singapore
Huan Liu Arizona State University, USA
Jaideep Srivastava University of Minnesota, USA
Takashi Washio Institute of Scientific and Industrial Research,
Osaka University, Japan
Thanaruk Theeramunkong Thammasat University, Thailand
Kyu-Young Whang Korea Advanced Institute of Science and
Technology, Korea
Chengqi Zhang University of Technology Sydney, Australia
Zhi-Hua Zhou Nanjing University, China
Krishna Reddy IIIT, Hyderabad, India
Program Committee
Adrian Pearce The University of Melbourne, Australia
Aijun An York University, Canada
Aixin Sun Nanyang Technological University, Singapore
Akihiro Inokuchi Osaka University, Japan
External Reviewers
Ameeta Agrawal York University, Canada
Arnaud Soulet Université Francois Rabelais Tours, France
Axel Poigne Fraunhofer IAIS, Germany
Ben Tan Fudan University, China
Bian Wei University of Technology, Sydney, Australia
Bibudh Lahiri Iowa State University
Bin Yang Aalborg University, Denmark
Bin Zhao East China Normal University, China
Bing Bai Google Inc.
Bojian Xu Iowa State University, USA
Can Wang University of Technology, Sydney, Australia
Carlos Ferreira University of Porto, Portugal
Chao Li Shenzhen Institutes of Advanced Technology,
CAS, China
Cheqing Jin East China Normal University, China
Christian Beecks RWTH Aachen University, Germany
Chun-Wei Seah Nanyang Technological University, Singapore
De-Chuan Zhan Nanjing University, China
Elnaz Delpisheh York University, Canada
Erez Shmueli The Open University, Israel
Fausto Fleites Florida International University, USA
Fei Xie University of Vermont, USA
Gaoping Zhu University of New South Wales, Australia
Gongqing Wu University of Vermont, USA
Hardy Kremer RWTH Aachen University, Germany
Hideyuki Kawashima Nanzan University, Japan
Hsin-Yu Ha Florida International University, USA
Ji Zhou Fudan University, China
Jianbo Yang Nanyang Technological University, Singapore
Jinfei Shenzhen Institutes of Advanced Technology,
CAS, China
Jinfeng Zhuang Microsoft Research Asia, China
Jinjiu Li University of Technology, Sydney
Jun Wang Southwest University, China
Jun Zhang Charles Sturt University, Australia
Ke Zhu University of New South Wales, Australia
Keli Xiao Rutgers University, USA
Ken-ichi Fukui Osaka University, Japan
Graph Mining
Spectral Analysis of k-Balanced Signed Graphs
Leting Wu, Xiaowei Ying, Xintao Wu, Aidong Lu, and Zhi-Hua Zhou
Sequence Analysis
Real-Time Change-Point Detection Using Sequentially Discounting
Normalized Maximum Likelihood Coding
Yasuhiro Urabe, Kenji Yamanishi, Ryota Tomioka, and Hiroki Iwai
Outlier Detection
Multiple Distribution Data Description Learning Algorithm for Novelty
Detection
Trung Le, Dat Tran, Wanli Ma, and Dharmendra Sharma
Agent Mining
Multi-agent Based Classification Using Argumentation from
Experience
Maya Wardeh, Frans Coenen, Trevor Bench-Capon, and
Adam Wyner
Applications
Learning to Advertise: How Many Ads Are Enough?
Bo Wang, Zhaonan Li, Jie Tang, Kuo Zhang, Songcan Chen, and
Liyun Ru
Local Feature Based Tensor Kernel for Image Manifold Learning
Yi Guo and Junbin Gao
Feature Extraction
An Instance Selection Algorithm Based on Reverse Nearest Neighbor
Bi-Ru Dai and Shu-Ming Hsu
Machine Learning
A Subpath Kernel for Rooted Unordered Trees
Daisuke Kimura, Tetsuji Kuboyama, Tetsuo Shibuya, and
Hisashi Kashima
Clustering
High-Order Co-clustering Text Data on Semantics-Based Representation
Model
Liping Jing, Jiali Yun, Jian Yu, and Joshua Huang
The Role of Hubness in Clustering High-Dimensional Data
Nenad Tomašev, Miloš Radovanović, Dunja Mladenić, and
Mirjana Ivanović
Spatial Entropy-Based Clustering for Mining Data with Spatial
Correlation
Baijie Wang and Xin Wang
Classification
Identifying Hidden Contexts in Classification
Indrė Žliobaitė
Cross-Lingual Sentiment Classification via Bi-view Non-negative
Matrix Tri-Factorization
Junfeng Pan, Gui-Rong Xue, Yong Yu, and Yang Wang
A Sequential Dynamic Multi-class Model and Recursive Filtering by
Variational Bayesian Methods
Xiangyun Qing and Xingyu Wang
Random Ensemble Decision Trees for Learning Concept-Drifting Data
Streams
Peipei Li, Xindong Wu, Qianhui Liang, Xuegang Hu, and
Yuhong Zhang
Collaborative Data Cleaning for Sentiment Classification with Noisy
Training Corpus
Xiaojun Wan
Pattern Mining
Using Constraints to Generate and Explore Higher Order Discriminative
Patterns
Michael Steinbach, Haoyu Yu, Gang Fang, and Vipin Kumar
Mining Maximal Co-located Event Sets
Jin Soung Yoo and Mark Bow
Pattern Mining for a Two-Stage Information Filtering System
Xujuan Zhou, Yuefeng Li, Peter Bruza, Yue Xu, and
Raymond Y.K. Lau
Efficiently Retrieving Longest Common Route Patterns of Moving
Objects By Summarizing Turning Regions
Guangyan Huang, Yanchun Zhang, Jing He, and Zhiming Ding
Automatic Assignment of Item Weights for Pattern Mining on Data
Streams
Yun Sing Koh, Russel Pears, and Gillian Dobbie
Prediction
Predicting Private Company Exits Using Qualitative Data
Harish S. Bhat and Daniel Zaelit
A Rule-Based Method for Customer Churn Prediction in
Telecommunication Services
Ying Huang, Bingquan Huang, and M.-T. Kechadi
Text Mining
Adaptive and Effective Keyword Search for XML
Weidong Yang, Hao Zhu, Nan Li, and Guansheng Zhu
Leting Wu1 , Xiaowei Ying1 , Xintao Wu1 , Aidong Lu1 , and Zhi-Hua Zhou2
1
University of North Carolina at Charlotte, USA
{lwu8,xying,xwu,alu1}@uncc.edu
2
National Key Lab for Novel Software Technology, Nanjing University, China
zhouzh@nju.edu.cn
1 Introduction
2 Notation
A signed graph G can be represented as the symmetric adjacency matrix $A_{n \times n}$ with $a_{ij} = 1$ if there is a positive edge between nodes i and j, $a_{ij} = -1$ if there is a negative edge between nodes i and j, and $a_{ij} = 0$ otherwise. A has n real eigenvalues. Let $\lambda_i$ be the i-th largest eigenvalue of A with eigenvector $x_i$, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. Let $x_{ij}$ denote the j-th entry of $x_i$. The spectral decomposition of A is $A = \sum_i \lambda_i x_i x_i^T$.
Arranging the eigenvectors $x_1, \dots, x_i, \dots, x_k, \dots, x_n$ as columns, the $u$-th row of the resulting matrix is denoted $\alpha_u$:
$$\begin{pmatrix} x_{11} & \cdots & x_{i1} & \cdots & x_{k1} & \cdots & x_{n1}\\ \vdots & & \vdots & & \vdots & & \vdots\\ x_{1u} & \cdots & x_{iu} & \cdots & x_{ku} & \cdots & x_{nu}\\ \vdots & & \vdots & & \vdots & & \vdots\\ x_{1n} & \cdots & x_{in} & \cdots & x_{kn} & \cdots & x_{nn} \end{pmatrix} \qquad (1)$$
and E represents the negative edges across communities. More generally, $e_{uv} = 1$ ($-1$) if a positive (negative) edge is added between nodes u and v, and $e_{uv} = 0$ otherwise.
where $x_{iu} > 0$ is the only non-zero entry of $\alpha_u$. In other words, for a graph with k disconnected comparable communities, the spectral coordinates of all nodes lie on the k positive half-axes $\xi_1, \cdots, \xi_k$, and nodes from the same community lie on the same half-axis.
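As an editorial illustration of the spectral coordinates defined above (not the authors' code), the following NumPy sketch extracts the k leading eigenvectors of the adjacency matrix and reads off row u as $\alpha_u$; for disconnected communities, each community's rows fall on a single half-axis (up to eigenvector sign):

```python
import numpy as np

def spectral_coordinates(A, k):
    """Rows of the matrix of the k leading eigenvectors of the (signed,
    symmetric) adjacency matrix A; row u is node u's spectral coordinate
    alpha_u = (x_1u, ..., x_ku), as in Eq. (1)."""
    evals, evecs = np.linalg.eigh(A)          # ascending eigenvalues
    order = np.argsort(evals)[::-1][:k]       # take the k largest
    return evecs[:, order]                    # n x k matrix; row u = alpha_u

# toy example: two disconnected cliques of different sizes, so the top two
# eigenvalues are distinct and each community occupies its own axis
A = np.zeros((7, 7))
for clique in [[0, 1, 2, 3], [4, 5, 6]]:
    for i in clique:
        for j in clique:
            if i != j:
                A[i, j] = 1
print(np.round(spectral_coordinates(A, k=2), 3))
```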
Let $\Gamma_u^i$ ($i = 1, \ldots, k$) be the set of nodes in $C_i$ that are newly connected to node u by the perturbation E: $\Gamma_u^i = \{v : v \in C_i, e_{uv} = \pm 1\}$. In [11], we derived several theoretical results on general graph perturbation. We include the approximation of spectral coordinates below as a basis for our spectral analysis of signed graphs. Please refer to [11] for proof details.
where scalar xiu is the only non-zero entry in its original spectral coordinate
shown in (4), and ri is the i-th row of matrix R in (6):
$$R = \begin{pmatrix} 1 & \frac{\beta_{12}}{\lambda_2-\lambda_1} & \cdots & \frac{\beta_{1k}}{\lambda_k-\lambda_1} \\ \frac{\beta_{21}}{\lambda_1-\lambda_2} & 1 & \cdots & \frac{\beta_{2k}}{\lambda_k-\lambda_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\beta_{k1}}{\lambda_1-\lambda_k} & \frac{\beta_{k2}}{\lambda_2-\lambda_k} & \cdots & 1 \end{pmatrix}. \qquad (6)$$
$$\tilde\alpha_u \approx x_{1u}\, r_1 + \Big(0,\ \frac{1}{\lambda_2}\sum_{v\in\Gamma_u^2} e_{uv}\, x_{2v}\Big), \qquad (7)$$
$$\tilde\alpha_v \approx x_{2v}\, r_2 + \Big(\frac{1}{\lambda_1}\sum_{u\in\Gamma_v^1} e_{uv}\, x_{1u},\ 0\Big), \qquad (8)$$
[Figure 1 panels: (a) Disconnected, (b) Add negative edges, (c) Add positive edges]
$$\langle\tilde\alpha_u, r_2\rangle = \tilde\alpha_u r_2^T = \frac{1}{\lambda_2}\sum_{v\in\Gamma_u^2} e_{uv}\, x_{2v}.$$
Figure 1 shows the scatter plot of the spectral coordinates for a synthetic graph,
Synth-2. Synth-2 is a 2-balanced graph with 600 and 400 nodes in each commu-
nity. We generate Synth-2 and modify its inter-community edges in the same way as for the synthetic data set Synth-3 in Section 5.1. As shown in Figure 1(a), when the two communities are disconnected, the nodes from C1 and C2 lie on the positive parts of axes ξ1 and ξ2, respectively. We then add a small number
of edges connecting the two communities (p = 0.05). When the added edges are
all negative, as shown in Figure 1(b), the spectral coordinates of the nodes from
the two communities form two half-lines respectively. The two quasi-orthogonal
half-lines rotate counter-clockwise from the axes. Those nodes having negative
inter-community edges lie outside the two half-lines. In contrast, if we add positive inter-community edges, as shown in Figure 1(c), the nodes from the two communities display two half-lines with a clockwise rotation from the axes, and
nodes with inter-community edges lie between the two half-lines.
$$\tilde\alpha_u \approx \alpha_u R + \sum_{v=1}^{n} e_{uv}\,\alpha_v\,\Lambda^{-1}, \qquad (9)$$
where Λ = diag(λ1 , . . . , λk ).
One property implied by (9) is that, after adding negative inter-community
edges, nodes from different communities are still separable in the spectral space.
Note that R is close to an orthogonal matrix, and hence the first part of RHS
of (9) specifies an orthogonal transformation. The second part of RHS of (9)
specifies a deviation away from the position after the transformation. Note that
when the inter-community edges are all negative ($e_{uv} = -1$), the deviation of $\alpha_u$ is just towards the negative direction of $\alpha_v$ (each dimension is weighted with $\lambda_i^{-1}$). Therefore, after perturbation, nodes u and v are further separated
from each other in the spectral space. The consequence of this repellency caused
by adding negative edges is that nodes from different communities stay away
from each other in the spectral space. As the magnitude of the noise increases,
more nodes deviate from the half-lines ri , and the line pattern eventually dis-
appears.
[Figure 2 panels: (a) Negative edges (p = 0.1), (b) Negative edges (p = 0.3), (c) Negative edges (p = 1); (d) Positive edges (p = 0.1), (e) Positive edges (p = 0.3), (f) Positive edges (p = 1)]
Positive large perturbation. When the added edges are positive, a similar analysis leads to the opposite phenomenon: more nodes from the two communities
are “pulled” closer to each other by the positive inter-community edges and are
finally mixed together, indicating that the well separable communities merge
into one community.
Figure 2 shows the spectral coordinates of Synth-2 as we increase the magnitude of inter-community edges (p = 0.1, 0.3, and 1). For the first row (Figure
2(a) to 2(c)), we add negative inter-community edges in Synth-2, and for the
second row (Figure 2(d) to 2(f)), we add positive inter-community edges. As
we add more and more inter-community edges, no matter positive or negative,
more and more nodes deviate from the two half-lines, and finally the line pattern
diminishes in Figure 2(c) or 2(f). When adding positive inter-community edges,
the nodes deviate from the lines and finally mix together as shown in Figure 2(f), indicating that the two communities merge into one community. Different
from adding positive edges, which mixes the two communities in the spectral
space, adding negative inter-community edges “pushes” the two communities
away from each other. This is because nodes with negative inter-community
edges lie outside the two half-lines as shown in Figure 2(a) and 2(b). Even when
p = 1, as shown in Figure 2(c), two communities are still clearly separable in the
spectral space.
1. Both nodes u and v move towards the negative part of axis ξi after perturbation: $y_{iu} < x_{iu}$ and $y_{iv} < x_{iv}$.
2. Node v moves farther than u after perturbation: $|y_{iv} - x_{iv}| > |y_{iu} - x_{iu}|$.
The two preceding properties imply that, for those nodes close to the origin, adding negative edges would "push" them towards the negative part of axis ξi, and a small number of nodes can thus lie on the negative part of axis ξi (i.e., $y_{iu} < 0$).
Add inter-community edges. The spectral perturbation caused by adding
$E_{out}$ onto matrix $A + E_{in}$ can be complicated. Notice that $(A + E_{in})$ is still a block-wise matrix, and we can still apply Theorem 1 and conclude that, when $E_{out}$ is moderate, the majority of nodes from the k communities form k lines in the spectral space and nodes with inter-community edges deviate from the lines. It is difficult to give the explicit form of the lines and the deviations, because $x_{iu}$ and the inter-community edges can be either positive or negative. However,
we expect that the effect of adding negative edges on positive nodes is still
dominant in determining the spectral pattern, because most nodes lie along the
positive part of the axes and the majority of inter-community edges are negative.
Communities are still distinguishable in the spectral space. The majority of nodes
in one community lie on the positive part of the line, while a small number
of nodes may lie on the negative part due to negative connections within the
community.
We make graph Synth-2 unbalanced by flipping the signs of a small proportion q of the edges. When the two communities are disconnected, as shown in Figure 3(a), after flipping a proportion q = 0.1 of the inner-community edges, a small number of nodes lie
on the negative parts of the two axes. Figure 3(b) shows the spectral coordinates
of the unbalanced graph generated from balanced graph Synth-2 (p = 0.1, q =
0.1). Since the magnitude of the inter-community edges is small, we can still
observe two orthogonal lines in the scatter plots. The negative edges within the
communities cause a small number of nodes to lie on the negative parts of the two
lines. Nodes with inter-community edges deviate from the two lines. For Figure
3(c), we flip more edge signs (p = 0.1, q = 0.2). We can observe that more nodes
lie on the negative parts of the lines, since more inner-community edges are
changed to negative. The rotation angles of the two lines are smaller than those
in Figure 3(b). This is because the positive inter-community edges “pull” the
rotation clockwise a little, and the rotation we observe depends on the effects
from both positive and negative inter-community edges.
5 Evaluation
5.1 Synthetic Balanced Graph
Data set Synth-3 is a synthetic 3-balanced graph generated from a power-law degree distribution with parameter 2.5. The three communities of Synth-3 contain 600, 500, and 400 nodes and 4131, 3179, and 2037 edges, respectively. All the 13027 inter-community edges are set to be negative. We delete the inter-community edges randomly until a proportion p of them remain in the graph. The parameter p is the ratio of the magnitude of inter-community edges to that of the inner-community edges. If p = 0, there are no inter-community edges; if p = 1, inner- and inter-community edges have the same magnitude.
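For illustration only, here is a simplified sketch of such a balanced synthetic graph: positive intra-community edges are placed uniformly at random (the paper uses a power-law degree distribution instead), and negative inter-community edges are added until their number is a proportion p of the intra-community edges. Parameter names are ours:

```python
import numpy as np

def synth_balanced(sizes, p_intra=0.02, p=0.1, seed=0):
    """Simplified k-balanced signed graph: random positive edges inside each
    community, then negative inter-community edges up to a proportion p of the
    intra-community edge count."""
    rng = np.random.default_rng(seed)
    n = sum(sizes)
    comm = np.repeat(np.arange(len(sizes)), sizes)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if comm[i] == comm[j] and rng.random() < p_intra:
                A[i, j] = A[j, i] = 1          # positive intra-community edge
    intra = int(A.sum() // 2)
    added = 0
    while added < int(p * intra):
        i, j = rng.integers(0, n, size=2)
        if comm[i] != comm[j] and A[i, j] == 0:
            A[i, j] = A[j, i] = -1             # negative inter-community edge
            added += 1
    return A

A = synth_balanced([60, 50, 40], p=0.1)
```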
Figure 4 shows the change of spectral coordinates of Synth-3, as we increase
the magnitude of inter-community edges. When there are no negative links
(p = 0), the scatter plot of the spectral coordinates is shown in Figure 4(a). The
disconnected communities display 3 orthogonal half-lines. Figure 4(b) shows the
spectral coordinates when the magnitude of inter-community edges is moderate
(p = 0.1). We can see the nodes form three half-lines that rotate a certain angle,
and some of the nodes deviate from the lines. Figures 4(c) and 4(d) show the
spectral coordinates when we increase the magnitude of inter-community edges
(p = 0.3, 1). We can observe that, as more inter-community edges are added,
more and more nodes deviate from the lines. However, nodes from different
communities are still separable from each other in the spectral space.
We also add positive inter-community edges to Synth-3 for comparison, and
the spectral coordinates are shown in Figures 4(e) and 4(f). We can observe
that, different from adding negative edges, as the magnitude of inter-community
edges (p) increases, nodes from the three communities get closer to each other,
and completely mix into one community in Figure 4(f).
Fig. 4. The spectral coordinates of the 3-balanced graph Synth-3. (b)-(d): add negative
inter-community edges; (e)-(f): add positive inter-community edges.
Fig. 5. The spectral coordinates of an unbalanced synthetic graph generated by flipping
signs of inner- and inter-community edges of Synth-3 with p = 0.1 or 1
3 communities are separable in the spectral space, indicating that the unbalanced
edges do not greatly change the patterns in the spectral space.
6 Related Work
There are several studies on community partition in social networks with negative (or negatively weighted) edges [1, 3]. In [1], Bansal et al. introduced correlation clustering and showed that finding an optimal partition of a complete signed graph is NP-hard. In [3], Demaine and Immorlica gave an approximation algorithm and showed that the problem is APX-hard. Kunegis et al. [6] presented a case study on the signed Slashdot Zoo corpus and analyzed various measures (including signed clustering coefficient and signed centrality measures). Leskovec et al. [8] studied several signed online social networks and developed a theory of status to explain the observed edge signs. Laplacian graph kernels that apply to signed graphs were described in [7]. However, the authors only focused on 2-balanced signed graphs and many results (such as signed graphs' definiteness property) do not hold for general k-balanced graphs.
7 Conclusion
Acknowledgment
This work was supported in part by U.S. National Science Foundation (CCF-
1047621, CNS-0831204) for L. Wu, X. Wu, and A. Lu and by the Jiangsu Sci-
ence Foundation (BK2008018) and the National Science Foundation of China
(61073097) for Z.-H. Zhou.
References
1. Bansal, N., Chawla, S.: Correlation clustering. Machine Learning 56, 238–247
(2002)
2. Davis, J.A.: Clustering and structural balance in graphs. Human Relations 20,
181–187 (1967)
3. Demaine, E.D., Immorlica, N.: Correlation clustering with partial information. In:
Working Notes of the 6th International Workshop on Approximation Algorithms
for Combinatorial Optimization Problems, pp. 1–13 (2003)
4. Hage, P., Harary, F.: Structural models in anthropology, pp. 56–60. Cambridge
University Press, Cambridge (1983)
5. Inohara, T.: Characterization of clusterability of signed graph in terms of new-
comb’s balance of sentiments. Applied Mathematics and Computation 133, 93–104
(2002)
6. Kunegis, J., Lommatzsch, A., Bauckhage, C.: The slashdot zoo: mining a social
network with negative edges. In: WWW 2009, pp. 741–750 (2009)
7. Kunegis, J., Schmidt, S., Lommatzsch, A., Lerner, J., Luca, E.W.D., Albayrak, S.:
Spectral analysis of signed graphs for clustering, prediction and visualization. In:
SDM, pp. 559–570 (2010)
8. Leskovec, J., Huttenlocher, D., Kleinberg, J.: Signed networks in social media. In:
CHI, pp. 1361–1370 (2010)
9. Prakash, B.A., Sridharan, A., Seshadri, M., Machiraju, S., Faloutsos, C.: Eigen-
Spokes: Surprising patterns and scalable community chipping in large graphs. In:
Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) PAKDD 2010. LNCS, vol. 6119,
pp. 435–448. Springer, Heidelberg (2010)
10. Stewart, G.W., Sun, J.: Matrix perturbation theory. Academic Press, London
(1990)
11. Wu, L., Ying, X., Wu, X., Zhou, Z.-H.: Line orthogonality in adjacency eigenspace with application to community partition. Technical Report, UNC Charlotte
(2010)
12. Ying, X., Wu, X.: On randomness measures for social networks. In: SDM, pp.
709–720 (2009)
Spectral Analysis for Billion-Scale Graphs:
Discoveries and Implementation
Abstract. Given a graph with billions of nodes and edges, how can we find pat-
terns and anomalies? Are there nodes that participate in too many or too few
triangles? Are there close-knit near-cliques? These questions are expensive to an-
swer unless we have the first several eigenvalues and eigenvectors of the graph
adjacency matrix. However, eigensolvers suffer from subtle problems (e.g., con-
vergence) for large sparse matrices, let alone for billion-scale ones.
We address this problem with the proposed HEigen algorithm, which we carefully design to be accurate, efficient, and able to run on the highly scalable MapReduce (Hadoop) environment. This enables HEigen to handle matrices more than 1000× larger than those which can be analyzed by existing algorithms. We implement HEigen and run it on the M45 cluster, one of the top 50 super-
computers in the world. We report important discoveries about near-cliques and
triangles on several real-world graphs, including a snapshot of the Twitter social
network (38Gb, 2 billion edges) and the “YahooWeb” dataset, one of the largest
publicly available graphs (120Gb, 1.4 billion nodes, 6.6 billion edges).
1 Introduction
Graphs with billions of edges, or billion-scale graphs, are becoming common; Facebook
boasts about 0.5 billion active users, who-calls-whom networks can reach similar sizes
in large countries, and web crawls can easily reach billions of nodes. Given a billion-
scale graph, how can we find near-cliques, the count of triangles, and related graph
properties? As we discuss later, triangle counting and related expensive operations can
be computed quickly, provided we have the first several eigenvalues and eigenvectors.
In general, spectral analysis is a fundamental tool not only for graph mining, but also
for other areas of data mining. Eigenvalues and eigenvectors are at the heart of numer-
ous algorithms such as triangle counting, singular value decomposition (SVD), spectral
clustering, and tensor analysis [10]. In spite of their importance, existing eigensolvers
do not scale well. As described in Section 6, the maximum order and size of input
matrices feasible for these solvers is million-scale.
In this paper, we discover patterns on near-cliques and triangles, on several real-
world graphs including a Twitter dataset (38Gb, over 2 billion edges) and the “Ya-
hooWeb” dataset, one of the largest publicly available graphs (120Gb, 1.4 billion nodes,
6.6 billion edges). To enable discoveries, we propose HEigen, an eigensolver for billion-scale, sparse symmetric matrices built on top of Hadoop, an open-source MapReduce framework. Our contributions are the following:
2 Discoveries
In a large, sparse network, how can we find tightly connected nodes, such as those
in near-cliques or bipartite cores? Surprisingly, eigenvectors can be used for this pur-
pose [14]. Given an adjacency matrix W and its SVD $W = U\Sigma V^T$, an EE-plot is defined to be the scatter plot of the vectors $U_i$ and $U_j$ for any i and j. EE-plots of some real-world graphs contain clear separate lines (or 'spokes'), and the nodes with the largest values in each spoke are separated from the other nodes by forming near-cliques or bipartite cores. Figure 1 shows several EE-plots and spyplots (i.e., the adjacency matrix of the induced subgraph) of the top 100 nodes in the top eigenvectors of the YahooWeb graph.
In Figure 1 (a) - (d), we observe clear ‘spokes,’ or outstanding nodes, in the top
eigenvectors. Moreover, the top 100 nodes with the largest values in U1, U2, and U4 form a 'bi-clique', shown in (e), (f), and (h), which is defined to be the combination of a clique and a bipartite core as depicted in Figure 1 (i). Another observation is that the top seven nodes shown in Figure 1 (g) belong to indymedia.org, which is the site with the maximum number of triangles in Figure 2.

¹ YahooWeb, LinkedIn: released under NDA. Twitter: http://www.twitter.com/ Kronecker: http://www.cs.cmu.edu/~ukang/dataset Epinions: not public data.

Fig. 1. EE-plots and spyplots from YahooWeb ((e) U1 spoke, (f) U2 spoke, (g) U3 spoke, (h) U4 spoke, (i) structure of bi-clique). (a)-(d): EE-plots showing the values of nodes in the ith eigenvector vs. in the jth eigenvector. Notice the clear 'spokes' in top eigenvectors signify the existence of a strongly related group of nodes in near-cliques or bi-cliques as depicted in (i). (e)-(h): Spyplots of the top 100 largest nodes from each eigenvector. Notice that we see a near-clique in U3, and bi-cliques in U1, U2, and U4. (i): The structure of 'bi-clique' in (e), (f), and (h).

Observation 1 (Eigenspokes). EE-plots of YahooWeb show clear spokes. Additionally, the extreme nodes in the spokes belong to cliques or bi-cliques.
(a) LinkedIn (58M edges) (b) Twitter (2.8B edges) (c) YahooWeb (6.6B edges)
Fig. 2. The distribution of the number of participating triangles of real graphs. In general, they
obey the “triangle power-law.” Moreover, well-known U.S. politicians participate in many trian-
gles, demonstrating that their followers are well-connected. In the YahooWeb graph, we observe
several anomalous spikes which possibly come from cliques.
Using the top k eigenvalues computed with HEigen, we analyze the distribution of triangle counts of real graphs including the LinkedIn, Twitter, and YahooWeb graphs in Figure 2. We first observe that there exist several nodes with extremely large
triangle counts. In Figure 2 (b), Barack Obama is the person with the fifth largest num-
ber of participating triangles, and has many more than other U.S. politicians. In Figure 2
(c), the web page lists.indymedia.org contains the largest number of triangles;
this page is a list of mailing lists which apparently point to each other.
We also observe regularities in triangle distributions and note that the beginning part
of the distributions follows a power-law.
Observation 2 (Triangle power law). The beginning part of the triangle count distri-
bution of real graphs follows a power-law.
In the YahooWeb graph in Figure 2 (c), we observe many spikes. One possible reason
for the spikes is that they come from cliques: a k-clique generates k nodes with $\binom{k-1}{2}$ triangles each.
Observation 3 (Spikes in triangle distribution). In the Web graph, there exist several
spikes which possibly come from cliques.
The rightmost spike in Figure 2 (c) contains 125 web pages that each have about 1
million triangles in their neighborhoods. They all belong to the news site ucimc.org,
and are connected to a tightly coupled group of pages.
Triangle counts exhibit even more interesting patterns when combined with the de-
gree information as shown in the degree-triangle plot of Figure 3. We see that celebrities
have high degree and mildly connected followers, while accounts for adult sites have
many fewer, but extremely well connected, followers. Degree-triangle plots can be used
to spot and eliminate harmful accounts such as those of adult advertisers and spammers.
All of the above observations need a fast, scalable eigensolver. This is exactly what HEigen does, and we describe our proposed design next.
Fig. 3. The degree vs. participating triangles of some ‘celebrities’ (rest: omitted, for clarity) in
Twitter accounts. Also shown are accounts of adult sites which have smaller degree, but belong to
an abnormally large number of triangles (= many, well connected followers - probably, ‘robots’).
4 Proposed Method
In this section we describe HEigen, a parallel algorithm for computing the top k eigenvalues and eigenvectors of symmetric matrices in MapReduce.
The main idea of Lanczos-SO is as follows: We start with a random initial basis
vector b which comprises a rank-1 subspace. For each iteration, a new basis vector
is computed by multiplying the input matrix with the previous basis vector. The new
basis vector is then orthogonalized against the last two basis vectors and is added to the
previous rank-(m − 1) subspace, forming a rank-m subspace. Let m be the number of
the current iteration, Qm be the n × m matrix whose ith column is the ith basis vector,
and A be the matrix for which we want to compute eigenvalues. We also define $T_m = Q_m^* A Q_m$, an m × m matrix. Then, the eigenvalues of $T_m$ are good approximations of the eigenvalues of A. Furthermore, multiplying $Q_m$ by the eigenvectors of $T_m$ gives good approximations of the eigenvectors of A. We refer to [17] for further details.
If we used exact arithmetic, the newly computed basis vector would be orthogonal
to all previous basis vectors. However, rounding errors from floating-point calculations
compound and result in the loss of orthogonality. This is the cause of the spurious eigen-
values in Lanczos-NO. Orthogonality can be recovered once the new basis vector is
fully re-orthogonalized to all previous vectors. However, doing this becomes expensive
as it requires O(m2 ) re-orthogonalizations, where m is the number of iterations. A bet-
ter approach uses a quick test (line 10 of Algorithm 1) to selectively choose vectors that
need to be re-orthogonalized to the new basis [6]. This selective-reorthogonalization
idea is shown in Algorithm 1.
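Since Algorithm 1 itself is not reproduced in this excerpt, the following is a compact single-machine sketch of the Lanczos iteration described above; for simplicity it performs full re-orthogonalization against all previous basis vectors rather than the selective test, and it is not the MapReduce implementation:

```python
import numpy as np

def lanczos_topk(A, k, m=50, seed=0):
    """Approximate the top-k eigenpairs of a symmetric matrix A: build an
    m-step Krylov basis Q_m, form the tridiagonal T_m = Q_m^T A Q_m, and
    diagonalize T_m (Ritz values/vectors)."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    Q, alphas, betas = [q], [], []
    for j in range(m):
        w = A @ Q[j]                      # the expensive matrix-vector product
        alpha = Q[j] @ w
        w -= alpha * Q[j]
        if j > 0:
            w -= betas[-1] * Q[j - 1]
        for qi in Q:                      # full re-orthogonalization
            w -= (qi @ w) * qi
        alphas.append(alpha)
        beta = np.linalg.norm(w)
        if beta < 1e-12:
            break
        betas.append(beta)
        Q.append(w / beta)
    T = (np.diag(alphas) + np.diag(betas[: len(alphas) - 1], 1)
         + np.diag(betas[: len(alphas) - 1], -1))
    evals, evecs = np.linalg.eigh(T)
    idx = np.argsort(evals)[::-1][:k]
    Qm = np.array(Q[: len(alphas)]).T     # n x m basis
    return evals[idx], Qm @ evecs[:, idx]

A = np.array([[2., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
vals, vecs = lanczos_topk(A, k=2, m=3)
```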
The Lanczos-SO has all the properties that we need: it finds the top k largest eigen-
values and eigenvectors, it produces no spurious eigenvalues, and its most expensive
operation, a matrix-vector multiplication, is tractable in M AP R EDUCE. Therefore, we
choose Lanczos-SO as our choice of the sequential algorithm for parallelization.
4.4 Blocking
Minimizing the volume of information sent between nodes is important for designing efficient distributed algorithms. In HEigen, we decrease the amount of network traffic by
using block-based operations. Normally, one would put each edge "(source, destination)" on one line; Hadoop treats each line as a data element for its map() function. Instead, we propose to divide the adjacency matrix into blocks (and, of course, the corresponding vectors also into blocks), put the edges of each block on a single line, and compress the source- and destination-ids. This makes the map() function a bit more complicated in order to process blocks, but it saves significant data transfer time over the network. We use these edge-blocks and vector-blocks for many parallel operations
in Table 2, including matrix-vector multiplication, vector update, vector dot product, vector scale, and vector L2 norm. Performing operations on blocks is faster than doing so on individual elements since both the input size and the key space decrease. This reduces the network traffic and sorting time in the MapReduce Shuffle stage. As we will see in Section 5, the blocking decreases the running time by more than 4×.

Table 2. Parallelization Choices. The last column (P) indicates whether the operation is parallelized in HEigen. Some operations are better run in parallel since the input size is very large, while others are better run on a single machine since the input size is small and the overhead of parallel execution overshadows its decreased running time.
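The exact block record format of HEigen is not given here, so the following toy sketch only illustrates the general idea of grouping edges into b × b blocks and emitting one compressed record per block (record layout and names are ours):

```python
from collections import defaultdict

def to_block_records(edges, b):
    """Group (src, dst) edges into b x b blocks and emit one record per block.
    Only the local (src % b, dst % b) offsets are stored inside a record,
    which is the compression idea described above."""
    blocks = defaultdict(list)
    for src, dst in edges:
        blocks[(src // b, dst // b)].append((src % b, dst % b))
    # one line per block: "blockRow blockCol off,off off,off ..."
    return [f"{br} {bc} " + " ".join(f"{i},{j}" for i, j in offs)
            for (br, bc), offs in sorted(blocks.items())]

edges = [(0, 1), (1, 0), (2, 3), (3, 2), (0, 3)]
for line in to_block_records(edges, b=2):
    print(line)
```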
The first matrix-vector operation multiplies a matrix with a large and dense vector, and thus it requires the two-stage standard MapReduce algorithm by Kang et al. [9]. In
the first stage, matrix elements and vector elements are joined and multiplied to make
partial results which are added together to get the result vector in the second stage.
The other matrix-vector operation, however, multiplies the matrix with a small vector. HEigen uses the fact that the small vector can fit in a machine's main memory, and distributes the small vector to all the mappers using the distributed cache functionality of Hadoop.
The advantage of the small vector being available in mappers is that joining edge ele-
ments and vector elements can be done inside the mapper, and thus the first stage of the
standard two-stage matrix-vector multiplication can be omitted. In this one-stage algo-
rithm the mapper joins matrix elements and vector elements to make partial results, and
the reducer adds up the partial results. The pseudo code of this algorithm, which we call
CBMV(Cache-Based Matrix-Vector multiplication), is shown in Algorithm 2. We want
to emphasize that this operation cannot be performed when the vector is large, as is
the case in the first matrix-vector multiplication. The CBMV is faster than the standard
method by 57× as described in Section 5.
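A minimal single-process emulation of the CBMV idea follows: every 'mapper' sees the broadcast small vector, joins it with a matrix entry, and emits a partial product that the 'reducer' sums per row. This is an illustrative sketch, not the Hadoop code:

```python
def cbmv(matrix_entries, small_vec):
    """Cache-based matrix-vector multiplication sketch: matrix entries are
    (row, col, value) triples; small_vec plays the role of the vector held in
    every mapper's memory via the distributed cache."""
    # map stage: join each entry with the broadcast vector, emit (row, partial)
    partials = [(i, a_ij * small_vec[j]) for i, j, a_ij in matrix_entries]
    # reduce stage: sum partial products per row index
    result = {}
    for i, p in partials:
        result[i] = result.get(i, 0.0) + p
    return result

# toy 2x3 matrix in coordinate form times a length-3 vector
entries = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0)]
print(cbmv(entries, [1.0, 2.0, 3.0]))  # {0: 7.0, 1: 6.0}
```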
4.7 Analysis
We analyze the time and the space complexity of HEigen. In the lemmas below, m is
the number of iterations, |V | and |E| are the number of nodes and edges, and M is the
number of machines.
5 Performance
In this section, we present experimental results to answer the following questions:
– Scalability: How well does HE IGEN scale up?
– Optimizations: Which of our proposed methods give the best performance?
We perform experiments on the Yahoo! M45 Hadoop cluster with a total of 480 hosts, 1.5 petabytes of storage, and 3.5 terabytes of memory. We use Hadoop 0.20.1. The scala-
bility experiments are performed using synthetic Kronecker graphs [12] since realistic
graphs of any size can be easily generated.
5.1 Scalability
Figure 4(a,b) shows the scalability of HEigen-BLOCK, an implementation of HEigen that uses blocking, and HEigen-PLAIN, an implementation which does not. Notice that the running time is near-linear in the number of edges and machines. We also note that HEigen-BLOCK performs up to 4× faster when compared to HEigen-PLAIN.
5.2 Optimizations
Figure 4(c) shows the comparison of running time of the skewed matrix-matrix mul-
tiplication and the matrix-vector multiplication algorithms. We used 100 machines for
YahooWeb data. For matrix-matrix multiplications, the best method is our proposed
CBMM, which is 76× faster than repeated naive matrix-vector multiplications (IMV). The slowest MM algorithm did not even finish and failed due to the heavy amount of data.
For matrix-vector multiplications, our proposed CBMV is faster than the naive method
(IMV) by 48×.
6 Related Works
The related works form two groups: eigensolvers and MapReduce/Hadoop.
Large-scale Eigensolvers: There are many parallel eigensolvers for large matrices: the work by Zhao et al. [21], HPEC [7], PLANO [20], PARPACK [15], SCALABLE [4], and PLAYBACK [3] are several examples. All of them are based on MPI with message
passing, which has difficulty in dealing with billion-scale graphs. The maximum order
of matrices analyzed with these tools is less than 1 million [20] [16], which is far from
web-scale data.

Fig. 4. (a) Running time vs. number of edges in 1 iteration of HEigen with 50 machines. Notice the near-linear running time proportional to the edge size. (b) Running time vs. number of machines in 1 iteration of HEigen. The running time decreases as the number of machines increases. (c) Comparison of running time between different skewed matrix-matrix and matrix-vector multiplications. For matrix-matrix multiplication, our proposed CBMM outperforms naive methods by at least 76×. The slowest matrix-matrix multiplication algorithm (MM) did not even finish, and the job failed due to excessive data. For matrix-vector multiplication, our proposed CBMV is faster than the naive method by 57×.

Very recently (March 2010), the Mahout project [2] has provided SVD on top of Hadoop. Due to insufficient documentation, we were not able to find the input
format and run a head-to-head comparison. But, reading the source code, we discov-
ered that Mahout suffers from two major issues: (a) it assumes that the vector (b, with
n=O(billion) entries) fits in the memory of a single machine, and (b) it implements the
full re-orthogonalization which is inefficient.
MapReduce and Hadoop: MapReduce is a parallel programming framework for processing web-scale data. MapReduce has two major advantages: (a) it handles data distribution, replication, and load balancing automatically, and furthermore (b) it uses familiar concepts from functional programming. The programmer needs to provide only the map and the reduce functions. The general framework is as follows [11]: The map stage processes the input and outputs (key, value) pairs. The shuffling stage sorts the map output and distributes it to the reducers. Finally, the reduce stage processes the values with the same key and outputs the final result. Hadoop [1] is the open source implementation of MapReduce. It also provides a distributed file system (HDFS) and data processing tools such as PIG [13] and Hive. Due to its extreme scalability and ease of use, Hadoop is widely used for large-scale data mining [9,8].
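A toy, single-process illustration of the map-shuffle-reduce flow just described (here counting node out-degrees from an edge list); it only mirrors the three stages and is in no way a substitute for Hadoop:

```python
from itertools import groupby
from operator import itemgetter

def map_stage(records, mapper):
    return [kv for rec in records for kv in mapper(rec)]

def shuffle(pairs):
    pairs.sort(key=itemgetter(0))                 # group (key, value) pairs by key
    return {k: [v for _, v in grp] for k, grp in groupby(pairs, key=itemgetter(0))}

def reduce_stage(grouped, reducer):
    return {k: reducer(k, vs) for k, vs in grouped.items()}

# toy job: out-degree of each node from an edge list
edges = [("a", "b"), ("a", "c"), ("b", "c")]
mapped = map_stage(edges, lambda e: [(e[0], 1)])            # map: emit (src, 1)
print(reduce_stage(shuffle(mapped), lambda k, vs: sum(vs)))  # {'a': 2, 'b': 1}
```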
7 Conclusion
In this paper we discovered patterns in real-world, billion-scale graphs. This was possible by using HEigen, our proposed eigensolver for the spectral analysis of very large-scale graphs. The main contributions are the following:
– Effectiveness: We analyze spectral properties of real-world graphs, including Twitter and one of the largest public Web graphs. We report patterns that can be used for anomaly detection and find tightly-knit communities.
– Careful Design: We carefully design HEigen to selectively parallelize operations based on how they are most effectively performed.
– Scalability: We implement and evaluate a billion-scale eigensolver. Experimentation shows that HEigen is accurate and scales linearly with the number of edges.
Future research directions include extending the analysis and the algorithms to multi-dimensional matrices, or tensors [10].
Acknowledgements
This material is based upon work supported by the National Science Foundation under
Grants No. IIS-0705359, IIS0808661, IIS-0910453, and CCF-1019104, by the Defense
Threat Reduction Agency under contract No. HDTRA1-10-1-0120, and by the Army
Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053. This
work is also partially supported by an IBM Faculty Award, and the Gordon and Betty
Moore Foundation, in the eScience project. The views and conclusions contained in
this document are those of the authors and should not be interpreted as representing
the official policies, either expressed or implied, of the Army Research Laboratory or
the U.S. Government or other funding parties. The U.S. Government is authorized to
reproduce and distribute reprints for Government purposes notwithstanding any copy-
right notation here on. Brendan Meeder is also supported by a NSF Graduate Research
Fellowship and funding from the Fine Foundation, Sloan Foundation, and Microsoft.
References
[14] Prakash, B.A., Sridharan, A., Seshadri, M., Machiraju, S., Faloutsos, C.: EigenSpokes: Sur-
prising patterns and scalable community chipping in large graphs. In: Zaki, M.J., Yu, J.X.,
Ravindran, B., Pudi, V. (eds.) PAKDD 2010. LNCS, vol. 6119, pp. 435–448. Springer, Hei-
delberg (2010)
[15] Lampe, J., Lehoucq, R.B., Sorensen, D.C., Yang, C.: ARPACK users' guide: Solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods. SIAM, Philadelphia (1998)
[16] Song, Y., Chen, W., Bai, H., Lin, C., Chang, E.: Parallel spectral clustering. In: ECML
(2008)
[17] Trefethen, L.N., Bau III., D.: Numerical linear algebra. SIAM, Philadelphia (1997)
[18] Tsourakakis, C.: Fast counting of triangles in large real networks without counting: Algo-
rithms and laws. In: ICDM (2008)
[19] Tsourakakis, C.E., Kang, U, Miller, G.L., Faloutsos, C.: Doulion: Counting triangles in
massive graphs with a coin. In: KDD (2009)
[20] Wu, K., Simon, H.: A parallel lanczos method for symmetric generalized eigenvalue prob-
lems. Computing and Visualization in Science (1999)
[21] Zhao, Y., Chi, X., Cheng, Q.: An implementation of parallel eigenvalue computation using
dual-level hybrid parallelism. LNCS (2007)
LGM: Mining Frequent Subgraphs from Linear
Graphs
1 Introduction
Frequent subgraph mining is an active research area with successful applications
in, e.g., chemoinformatics [15], software science [4], and computer vision [13].
The task is to enumerate the complete set of frequently appearing subgraphs in
a graph database. Early algorithms include AGM [8], FSG [9] and gSpan [19].
Since then, researchers have devoted considerable effort to improving the efficiency, for example, by mining closed patterns only [20], or by early pruning that sacrifices completeness (e.g., leap search [18]). However, graph mining algorithms are still too slow for large graph databases (see, e.g., [17]). The scalability of graph mining algorithms is much worse than that of algorithms for more restricted classes such as trees [1] and sequences [14]. This is due to the fact that, for trees and sequences, it is possible to design a pattern extension rule that does not create duplicate patterns (e.g., rightmost extension) [1]. For general graphs, there are multiple ways to generate the same subgraph pattern, and it is necessary to detect duplicate patterns and prune the search tree whenever duplication is detected. In
gSpan [19], a graph pattern is represented as a DFS code, and the duplication
check is implemented via minimality checking of the code. It is a very clever
mechanism, because one does not need to track back the patterns generated so far. Nevertheless, the complexity of duplication checking is exponential in the pattern size [19]. It harms efficiency substantially, especially when mining large patterns.

[Fig. 1. Example of a linear graph: vertices 1-6 with labels A, B, A, B, C, A; edge labels a, a, b, c.]
A linear graph is a graph whose vertices are totally ordered [3,5] (Figure 1). For
example, protein contact maps, RNA secondary structures, alternative splicing
patterns in molecular biology and predicate-argument structures [11] in natural
languages can be represented as linear graphs. Amino acid residues of a protein
have natural ordering from N- to C-terminus, and English words in a sentence
are ordered as well. Davydov and Batzoglou [3] addressed the problem of aligning
several linear graphs for RNA sequences, assessed the computational complexity,
and proposed an approximate algorithm. Fertin et al. assessed the complexity
of finding a maximum common pattern in a set of linear graphs [5]. In this pa-
per, we develop a novel algorithm, linear graph miner (LGM), for enumerating
frequently appearing subgraphs in a large number of linear graphs. The advan-
tage of employing linear graphs is that we can derive a pattern extension rule
that does not cause duplication, which makes LGM much more efficient than
conventional graph mining algorithms.
We design the extension rule based on the reverse search principle [2]. Perhaps
confusingly, ’reverse search’ does not refer to a particular search method, but a
guideline for designing enumeration algorithms. A pattern extension rule specifies
how to generate children from a parent in the search space. In reverse search, one
specifies a rule that generates a parent uniquely from a child (i.e., reduction map).
The pattern extension rule is obtained by ’reversing’ the reduction map: When gen-
erating children from a parent, all possible candidates are prepared and those map-
ping back to the parent by the reduction map are selected. An advantage of reverse
search is that, given a reduction map, the completeness of the resulting pattern ex-
tension rule can easily be proved [2]. In data mining, LCM, one of the fastest closed itemset miners, was designed using reverse search [16]. It has recently been applied in the design of a dense module enumeration algorithm [6] and a geometric graph mining algorithm [12]. In computational geometry and related fields, there are many
successful applications.¹ LGM's reduction map is very simple: remove the largest
edge in terms of edge ordering. Fortunately, it is not necessary to take the “can-
didate preparation and selection” approach in LGM. We can directly reverse the
reduction map to an explicit extension rule here.
¹ See a list of applications at http://cgm.cs.mcgill.ca/~avis/doc/rs/applications/index.html
2 Preliminaries
Let us first define linear graphs and associated concepts.
Definition 1 (Linear graph). Denote by $\Sigma^V$ and $\Sigma^E$ the sets of vertex and edge labels, respectively. A labeled, undirected linear graph $g = (V, E, L_V, L_E)$ consists of an ordered vertex set $V \subset \mathbb{N}$, an edge set $E \subseteq V \times V$, a vertex labeling $L_V : V \to \Sigma^V$, and an edge labeling $L_E : E \to \Sigma^E$. Let the size $|g|$ of the linear graph be the number of its edges. Let $\mathcal{G}$ denote the set of all possible linear graphs and let $\theta \in \mathcal{G}$ denote the empty graph.
The difference from ordinary graphs is that the vertices are defined as a subset
of natural numbers, introducing the total order. Notice that we do not impose
connectedness here. The order of edges is defined as follows:
Definition 2 (Total order among edges). $\forall e_1 = (i, j),\ e_2 = (k, l) \in E_g$: $e_1 <_e e_2$ if and only if (i) $i < k$, or (ii) $i = k$ and $j < l$.
Namely, one first compares the indices of the left nodes. If they are identical, the
right nodes are compared. The subgraph relationship between two linear graphs
is defined as follows.
Definition 3 (Subgraph). Given two linear graphs g1 = (V1 , E1 , LV1 , LE1 ),
g2 = (V2 , E2 , LV2 , LE2 ), g1 is a subgraph of g2 , g1 ⊆ g2 , if and only if there exists
an injective mapping m : V1 → V2 such that
1. ∀i ∈ V1 : LV1 (i) = LV2 (m(i)), vertex labels are identical,
2. ∀(i, j) ∈ E1 : (m(i), m(j)) ∈ E2 , LE1 (i, j) = LE2 (m(i), m(j)), all edges of g1
exist in g2 , and
3. ∀(i, j) ∈ E1 : i < j → m(i) < m(j), the order of vertices is conserved.
The difference from the ordinary subgraph relation is that the vertex order is
conserved. Finally, frequent subgraph mining is defined as follows.
Fig. 2. (Left) Graph-shaped search space. (Right) Search tree induced by the reduction map.

Fig. 3. Example of children patterns. There are three types of extension with respect to the number of nodes: (A) no-node-addition, (B) one-node-addition, (C) two-nodes-addition.
(C) two-nodes-addition. Let us define the largest edge of g as (i, j), i < j. Then,
the enumeration of case A is done by adding an edge which is larger than (i, j).
For case B, a node is inserted at the position after i, and this node is connected to every other node. If the new edge is smaller than (i, j), this extension is canceled. For case C, two nodes are inserted at the position after i. In that case, the added
two nodes must be connected by a new edge. All patterns of valid extensions are
shown in Figure 3. This example does not include node labels, but for actual
applications, node labels need to be enumerated as well.
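The following sketch illustrates the reduction map (remove the largest edge) and its reversal for case (A) only, on unlabeled edges over a fixed vertex set; cases (B) and (C) and label enumeration are omitted, and this is not the LGM implementation:

```python
# Edges of a linear graph are pairs (i, j) with i < j over totally ordered
# vertices; edges are compared by (left endpoint, right endpoint), as in Def. 2.

def largest_edge(edges):
    return max(edges)                      # max under the (i, j) tuple order

def reduce_pattern(edges):
    """Reduction map: remove the largest edge, giving the unique parent."""
    e = largest_edge(edges)
    return [x for x in edges if x != e]

def extend_no_node_addition(edges, n):
    """Reverse of the reduction map for case (A): add an edge strictly larger
    than the current largest edge over an n-vertex pattern."""
    last = largest_edge(edges) if edges else (-1, -1)
    children = []
    for i in range(n):
        for j in range(i + 1, n):
            if (i, j) > last and (i, j) not in edges:
                children.append(edges + [(i, j)])
    return children

pattern = [(0, 1), (1, 3)]
for child in extend_no_node_addition(pattern, n=4):
    assert reduce_pattern(child) == pattern   # every child maps back to its parent
```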
5 Complexity Analysis
The computational time of frequent pattern mining depends on the minimum
support and maximum pattern size thresholds [19]. Also, it depends on the “den-
sity” of the database: If all graphs are almost identical (i.e., a dense database),
the mining would take a prohibitive amount of time. So, conventional worst-case analysis is not well suited to mining algorithms. Instead, the delay, the interval between two consecutive solutions, is often used to describe the complexity. General graph mining algorithms including gSpan are exponential-delay algorithms, i.e., the delay is exponential in the size of the patterns [19]. The delay of our algorithm is only polynomial, because no duplication checks are necessary thanks to the vertex order.
6 Experiments
We performed a motif extraction experiment from protein 3D structures.
Frequent and characteristic patterns are often called “motifs” in molecular biol-
ogy, and we adopt that terminology here. All experiments were performed on a
Linux machine with an AMD Opteron processor (2 GHz and 4GB RAM).
Fig. 4. Examples of gap linear graphs: a 1-gap linear graph (left) and a 2-gap linear graph (right). Edges corresponding to gaps are drawn in bold.
[Figure 5: execution time (sec) vs. minimum support, for LGM and gSpan+g1.]
Fig. 5. Execution time for the protein data. The line labeled by gSpan+g1 is execution
time for gSpan on the 1-gap linear graph dataset. gSpan does not work on the 2-gap
linear graph dataset even if the minimum support threshold is 50.
We adopted Glyakina et al.'s dataset [7], which consists of pairs of homologous proteins: one is derived from a thermophilic organism and the other is
from a mesophilic organism. This dataset was made for understanding struc-
tural properties of proteins which are responsible for the higher thermostability
of proteins from thermophilic organisms compared to those from mesophilic or-
ganisms. In constructing a linear graph from a 3D structure, each amino acid is
represented as a vertex. Vertex labels are chosen from {1, . . . , 6}, which repre-
sents the following six classes: aliphatic {AVLIMC}, aromatic {FWYH}, polar
{STNQ}, positive {KR}, negative {DE}, special (reflecting their special con-
formation properties) {GP} [10]. An edge is drawn between the pair of amino
acid residues whose distance is within 5 angstrom. No edge labels are assigned.
In total, 754 graphs were made. Average number of vertices and edges are 371
and 498, respectively, and the number of labels is 6. To detect the motifs char-
acterizing the difference between two organisms, we take the following two-step
approach. First, we employ LGM to find frequent patterns from all proteins of
both organisms. In this setting, we did not use (c-6) patterns in Figure 3. Fi-
nally, the patterns significantly associated with organism difference are selected
via statistical tests.
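For illustration, a sketch of how a protein might be converted into a linear graph as described above; it assumes one 3D coordinate per residue (e.g., the C-alpha atom), which the excerpt does not specify:

```python
import numpy as np

# class labels for the six residue groups described above
CLASS = {**{a: 1 for a in "AVLIMC"}, **{a: 2 for a in "FWYH"},
         **{a: 3 for a in "STNQ"}, **{a: 4 for a in "KR"},
         **{a: 5 for a in "DE"},   **{a: 6 for a in "GP"}}

def protein_to_linear_graph(sequence, coords, threshold=5.0):
    """Vertices follow the N- to C-terminus order; labels are the six residue
    classes; an edge joins two residues whose coordinates are within
    `threshold` angstroms."""
    labels = [CLASS[a] for a in sequence]
    coords = np.asarray(coords, dtype=float)
    n = len(sequence)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if np.linalg.norm(coords[i] - coords[j]) <= threshold]
    return labels, edges

labels, edges = protein_to_linear_graph("AKDF", [(0, 0, 0), (3, 0, 0), (6, 0, 0), (3, 4, 0)])
print(labels, edges)   # [1, 4, 5, 2] [(0, 1), (0, 3), (1, 2), (1, 3), (2, 3)]
```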
We assess the execution time of our algorithm in comparison with gSpan. The
linear graphs from protein 3D structures are not always connected, and gSpan cannot be applied to such disconnected graphs. Hence, we made two kinds of gapped linear graphs: 1-gap linear graphs and 2-gap linear graphs. A 1-gap linear graph is a linear graph whose contiguous vertices in a protein sequence are connected by an edge; a 2-gap linear graph is a 1-gap linear graph whose vertices separated by one position in the protein sequence are also connected by an edge (Figure 4).
We run gSpan on two datasets: one consists of 1-gap linear graphs and the other
consists of 2-gap linear graphs. We run LGM on the original linear graphs. We
set the maximum execution time to 12 hours for both programs. Figure 5 shows
the execution time for varying minimum support thresholds. gSpan does not
work on the 2-gap linear graph dataset even if the minimum support threshold is 50. Our algorithm is faster than gSpan on the 1-gap linear graph dataset, and its execution time is reasonable.

Fig. 6. Significant subgraphs detected by LGM (residue positions and class labels are shown for each subgraph). The p-value calculated by Fisher's exact test is attached to each linear graph. The node labels 1, 2, 3, 4, and 5 represent the aliphatic, aromatic, polar, positive, and negative classes, respectively.
Then, we assess the motif extraction ability of our algorithm. To choose signif-
icant subgraphs from the enumerated subgraphs, we use Fisher’s exact test. In
this case, a significant subgraph should distinguish thermophilic proteins from
mesophilic proteins. Thus, for each frequent subgraph g, we count the number of proteins containing it among the thermophilic and the mesophilic proteins, and generate a 2×2 contingency table, which includes the number of thermophilic organisms that contain g ($n_{TP}$), the number of thermophilic organisms that do not contain g ($n_{FP}$), the number of mesophilic organisms that do not contain g ($n_{FN}$), and the number of mesophilic organisms that contain g ($n_{TN}$). The probability representing the independence of the contingency table is calculated as follows:
ng ng
nT P nF N ng !ng !nP !nN !
Pr = = ,
n n!nT P !nF P !nF N !nT N !
np
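The selection of significant patterns in this second step can be sketched with SciPy's implementation of Fisher's exact test; note that `fisher_exact` sums the probabilities of all tables at least as extreme rather than returning only the single-table probability above. The count structure and the significance threshold below are illustrative, not the authors' exact setup.

```python
from scipy.stats import fisher_exact

def significant_patterns(pattern_counts, n_thermo, n_meso, alpha=1e-3):
    """pattern_counts maps a pattern id to (n_TP, n_TN): the numbers of
    thermophilic and mesophilic proteins containing the pattern."""
    selected = []
    for pattern, (n_tp, n_tn) in pattern_counts.items():
        table = [[n_tp, n_tn],                       # proteins containing the pattern
                 [n_thermo - n_tp, n_meso - n_tn]]   # proteins not containing it
        _, p_value = fisher_exact(table)
        if p_value <= alpha:
            selected.append((pattern, p_value))
    return sorted(selected, key=lambda x: x[1])
```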
7 Conclusion
Acknowledgements
This work is partly supported by a research fellowship from JSPS for young sci-
entists, MEXT Kakenhi 21680025 and the FIRST program. We would like to
thank M. Gromiha for providing the protein 3D-structure dataset, T. Uno and
H. Kashima for fruitful discussions.
References
1. Abe, K., Kawasoe, S., Asai, T., Arimura, H., Arikawa, S.: Optimized substructure
discovery for semi-structured data. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.)
PKDD 2002. LNCS (LNAI), vol. 2431, pp. 1–14. Springer, Heidelberg (2002)
2. Avis, D., Fukuda, K.: Reverse search for enumeration. Discrete Appl. Math. 65,
21–46 (1996)
3. Davydov, E., Batzoglou, S.: A computational model for RNA multiple sequence
alignment. Theoretical Computer Science 368, 205–216 (2006)
4. Eichinger, F., Böhm, K., Huber, M.: Mining edge-weighted call graphs to localise
software bugs. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD
2008, Part I. LNCS (LNAI), vol. 5211, pp. 333–348. Springer, Heidelberg (2008)
5. Fertin, G., Hermelin, D., Rizzi, R., Vialette, S.: Common structured patterns in
linear graphs: Approximation and combinatorics. In: Ma, B., Zhang, K. (eds.) CPM
2007. LNCS, vol. 4580, pp. 241–252. Springer, Heidelberg (2007)
6. Georgii, E., Dietmann, S., Uno, T., Pagel, P., Tsuda, K.: Enumeration of condition-
dependent dense modules in protein interaction networks. Bioinformatics 25(7),
933–940 (2009)
7. Glyakina, A.V., Garbuzynskiy, S.O., Lobanov, M.Y., Galzitskaya, O.V.: Different
packing of external residues can explain differences in the thermostability of pro-
teins from thermophilic and mesophilic organisms. Bioinformatics 23, 2231–2238
(2007)
8. Inokuchi, A., Washio, T., Motoda, H.: An apriori-based algorithm for mining fre-
quent substructures from graph data. In: Zighed, D.A., Komorowski, J., Żytkow,
J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 13–23. Springer, Heidelberg
(2000)
9. Kuramochi, M., Karypis, G.: Frequent subgraph discovery. In: Proceedings of the
2001 IEEE International Conference on Data Mining (ICDM 2001), pp. 313–320
(2001)
10. Mirny, L.A., Shakhnovich, E.I.: Universally Conserved Positions in Protein Folds:
Reading Evolutionary Signals about Stability, Folding Kinetics and Function. Jour-
nal of Molecular Biology 291, 177–196 (1999)
11. Miyao, Y., Sætre, R., Sagae, K., Matsuzaki, T., Tsujii, J.: Task-oriented evaluation
of syntactic parsers and their representations. In: 46th Annual Meeting of the
Association for Computational Linguistics (ACL), pp. 46–54 (2008)
12. Nowozin, S., Tsuda, K.: Frequent subgraph retrieval in geometric graph databases.
In: Perner, P. (ed.) ICDM 2008. LNCS (LNAI), vol. 5077, pp. 953–958. Springer,
Heidelberg (2008)
13. Nowozin, S., Tsuda, K., Uno, T., Kudo, T., Bakir, G.: Weighted substructure
mining for image analysis. In: IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos
(2007)
14. Pei, J., Han, J., Mortazavi-asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., Hsu,
M.: Mining sequential patterns by pattern-growth: The prefixspan approach. IEEE
Transactions on Knowledge and Data Engineering 16(11), 1424–1440 (2004)
15. Saigo, H., Nowozin, S., Kadowaki, T., Taku, K., Tsuda, K.: gBoost: a mathematical
programming approach to graph classification and regression. Machine Learning 75,
69–89 (2008)
16. Uno, T., Kiyomi, M., Arimura, H.: LCM ver.3: collaboration of array, bitmap and
prefix tree for frequent itemset mining. In: Proceedings of the 1st International
Workshop on Open Source Data Mining: Frequent Pattern Mining Implementa-
tions, pp. 77–86 (2005)
17. Wale, N., Karypis, G.: Comparison of descriptor spaces for chemical compound
retrieval and classification. In: Proceedings of the 2006 IEEE International Con-
ference on Data Mining, pp. 678–689 (2006)
18. Yan, X., Cheng, H., Han, J., Yu, P.S.: Mining significant graph patterns by leap
search. In: Proceedings of the ACM SIGMOD International Conference on Man-
agement of Data, pp. 433–444 (2008)
19. Yan, X., Han, J.: gSpan: Graph-based substructure pattern mining. In: Proceedings
of the 2002 IEEE International Conference on Data Mining (ICDM 2002), pp. 721–
724 (2002)
20. Yan, X., Han, J.: CloseGraph: mining closed frequent graph patterns. In: Proceed-
ings of 2003 International Conference on Knowledge Discovery and Data Mining
(SIGKDD 2003), pp. 286–295 (2003)
Efficient Centrality Monitoring for
Time-Evolving Graphs
Abstract. The goal of this work is to identify the nodes that have the
smallest sum of distances to other nodes (the lowest closeness central-
ity nodes) in graphs that evolve over time. Previous approaches to this
problem find the lowest centrality nodes efficiently at the expense of
exactness. The main motivation of this paper is to answer, in the affir-
mative, the question, ‘Is it possible to improve the search time without
sacrificing the exactness?’. Our solution is Sniper, a fast search method
for time-evolving graphs. Sniper is based on two ideas: (1) It computes
approximate centrality by reducing the original graph size while guaran-
teeing the answer exactness, and (2) It terminates unnecessary distance
computations early when pruning unlikely nodes. The experimental re-
sults show that Sniper can find the lowest centrality nodes significantly
faster than the previous approaches while it guarantees answer exactness.
1 Introduction
In graph theory, the facility location problem is quite important since it involves
finding good locations for one or more facilities in a given environment. Solving
this problem starts by finding the nodes whose total distance to the other nodes is
the smallest in the graph, since the cost it takes to reach all other nodes from these
nodes is expected to be low. In graph analysis, the centrality based on this
concept is closeness centrality. In this paper, the closeness centrality of node u, Cu , is
defined as the sum of the distances from the node to the other nodes.
The naive approach, the exact computation of centrality, is impractical; it
needs distances of all node pairs. This led to the introduction of approximate
approaches, such as the annotation approach [13] and the embedding approach
[12,11], to estimate centralities. These approaches have the advantage of speed
at the expense of exactness. However, approximate algorithms are not adopted
by many practitioners. This is because the optimality of the solution is not
guaranteed; it is hard for approximate algorithms to identify the lowest centrality
node exactly. Furthermore, the focus of traditional graph theory has been limited
to just ‘static’ graphs; the implicit assumption is that nodes and edges never change.
We propose a novel method called Sniper that can efficiently identify the lowest
centrality nodes in time-evolving graphs. To the best of our knowledge, our
approach is the first solution to achieve both exactness and efficiency at the
same time in identifying the lowest centrality nodes from time-evolving graphs.
the most influential researchers from 1986 to 1988, 1989 to 1992, and 1993 to
2002, respectively. All these three are very famous and important researchers in
the database community.
The remainder of this paper is organized as follows. Section 2 describes re-
lated work. Section 3 overviews some of the background of this work. Section 4
introduces the main ideas of Sniper. Section 5 discusses some of the topics re-
lated to Sniper. Section 6 gives theoretical analyses of Sniper. Section 7 reviews
the results of our experiments. Section 8 provides our brief conclusion.
2 Related Work
Many papers have been published on approximations of node-to-node distances.
The previous distance approximation schemes can be divided into two types:
annotation types and embedding types. Rattigan et al. studied two annotation
schemes [13]. They randomly select nodes in a graph and divide the graph into
regions that are connected, mutually exclusive, and collectively exhaustive. They
give a set of annotations to every node from the regions. Distances are computed
from the annotations. They demonstrated that their method can compute node dis-
tances more accurately than the embedding approaches. However, this method
can require O(n2 ) space and O(n3 ) time to estimate the lowest centrality nodes
as described in their paper.
The Landmark technique is an embedding approach [7,12] that estimates node-
to-node distances from selected nodes in O(n) time. The minimum distance via a
landmark node is utilized as node distance in this method. Another embedding
technique is Global Network Positioning, which was studied by Ng et al. [11]. Node
distances are estimated from the Lp norm between node pairs. These embedding
techniques require O(n2 ) space since all n nodes hold distances to O(n) selected
nodes. Moreover, they require O(n3 ) time to identify the lowest centrality node.
This is because they take O(n) time to estimate a node pair distance and need the
distances of n2 node pairs to compute centralities of all nodes.
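To make the landmark idea concrete, here is a minimal sketch assuming an unweighted adjacency-list graph and randomly chosen landmarks; the actual selection strategies in [7,12] may differ, and all names are illustrative.

```python
from collections import deque
import random

def bfs_distances(adj, source):
    """Exact hop distances from `source` to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def build_landmarks(adj, k, seed=0):
    """Pick k random landmarks and precompute their distance vectors."""
    random.seed(seed)
    landmarks = random.sample(list(adj), k)
    return [bfs_distances(adj, l) for l in landmarks]

def landmark_estimate(landmark_dists, u, v):
    """Estimate d(u, v) as the minimum over landmarks l of d(u, l) + d(l, v)."""
    return min((d[u] + d[v] for d in landmark_dists if u in d and v in d),
               default=float("inf"))
```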
3 Preliminary
In this section, we introduce the background to this paper. Social networks and
other networks can be described as a graph G = (V, E), where V is the set of nodes, and
E is the set of edges. We use n and m to denote the number of nodes and edges,
respectively. That is n = |V | and m = |E|. A path from node u to v is the
sequence of nodes linked by edges, beginning with node u and ending at node
v. A path from node u to v is the shortest path if and only if the number of
nodes in the path is the smallest possible among all paths from node u to v. The
distance between node u and v, d(u, v), is the number of edges in the shortest
path connecting them in the graph. Therefore d(u, u) = 0 for every u ∈ V , and
d(u, v) = d(v, u) for u, v ∈ V . The closeness centrality of node u, C_u, is the sum
of the distances from the node to every other node, and is computed as C_u = \sum_{v \in V} d(u, v).
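This definition translates directly into one BFS per node; the following is a minimal sketch of the naive exact computation for an unweighted, connected graph (roughly O(n(n + m)) overall, which is the cost the naive approach pays). Names are illustrative.

```python
from collections import deque

def closeness_centrality(adj, u):
    """C_u = sum of hop distances from u to every other node; `adj` is an
    adjacency list of an unweighted, connected graph."""
    dist = {u: 0}
    queue = deque([u])
    total = 0
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                total += dist[y]
                queue.append(y)
    return total

def lowest_centrality_node(adj):
    """Naive exact search: one BFS per node."""
    return min(adj, key=lambda u: closeness_centrality(adj, u))
```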
4 Centrality Monitoring
In this section, we explain the two main ideas underlying Sniper. The main
advantage of Sniper is to exactly and efficiently identify the lowest closeness
centrality nodes in time-evolving graphs. While we focus on undirected and
unweighted graphs in this section, our approach can be applied to weighted
or directed graphs as described in Section 5.1. Moreover, we can handle range
queries (find the nodes whose centralities are less than a given threshold) and
K-best queries (find the K lowest centrality nodes) as described in Section 5.2.
For ease of explanation, we assume that no two nodes will have exactly the same
centrality value and that one node is added to a time-evolving graph at each
time tick. These assumptions can be eliminated easily. All proofs in this
section are omitted due to space limitations.
Our first idea involves aggregating nodes of the original graph, which enables us
to compute the lower centrality bound and thus realize reliable node pruning.
where d(u′, v′) is the node distance in the approximate graph (i.e., the number of
hops from node u′ to v′) and |v′| is the number of original nodes aggregated
within node v′.
We can provide the following lemma about the centrality approximation:
Lemma 2 (Approximate closeness centrality). For any node in the
approximate graph, Cu ≤ Cu holds.
Lemma 2 provides Sniper with the property of finding the exact answer as is
described in Section 6.
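Under the aggregation just described, the lower bound of Lemma 2 can be evaluated with a single BFS over the approximate graph. The sketch below assumes an already-built approximate graph given as an adjacency list `approx_adj` together with `size[v]`, the number of original nodes aggregated in super-node v; these names are illustrative and the aggregation step itself is not reproduced here.

```python
from collections import deque

def approx_centrality(approx_adj, size, u):
    """Lower bound for the closeness centrality of nodes aggregated in super-node u:
    sum over super-nodes v of d(u, v) * size[v] (cf. Lemma 2)."""
    dist = {u: 0}
    queue = deque([u])
    total = 0
    while queue:
        x = queue.popleft()
        for y in approx_adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                total += dist[y] * size[y]
                queue.append(y)
    return total
```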
Notation. We first give some notations for the estimation. In the search process,
we construct the BFS-tree rooted at a selected node. As a result, the selected
node forms layer 0. The direct neighbors of the node form layer 1. All nodes
that are i hops apart from the selected node form layer i. We later describe our
approach to selecting the node.
Next, we check by BFS that the exact centralities of other nodes in the tree are
lower than the exact centrality of the candidate node. We define the set of nodes
explored by BFS as Vex , and the set of unexplored nodes as Vun (= V − Vex ).
dmax (u) is the maximum distance of the explored node from node u, that is
dmax (u) = max{d(u, v) : v ∈ Vex }. Moreover, we define the explored layers of
the tree as Lex if and only if there exists at least one explored node in the layer.
Similarly we define the unexplored layers as Lun if and only if there exists no
explored node in the layer. The layer number of node u is denoted as l(u).
The estimation is the same as exact centrality if all nodes are explored (i.e.
Vex = V ) in Equation (3). To show the property of the centrality estimate, we
introduce the following lemma:
This property enables Sniper to identify the lowest centrality node exactly.
The selection of the root node of the tree is important for efficient pruning.
We select the lowest centrality node of the previous time tick as the root node.
There are two reasons for this approach. The first is that this node and nearby
nodes are expected to have the lowest centrality value, and thus are likely to be
the answer node after node addition. In the case of time-evolving graphs, small
numbers of nodes are continually being added to the large number of already
existing nodes. Therefore, there is little difference between the graphs before and
after node addition. In addition, we can more accurately estimate the centrality
value of a node if the node is close to the root node; this is the second reason.
This is because our estimation scheme is based on distances from the root node.
Efficient Centrality Monitoring for Time-Evolving Graphs 45
Algorithm 1. Sniper
Input: G[t] = (V, E), a time-evolving graph at time t
uadd , the node added at time t
ulow [t − 1], the previous lowest centrality node
Output: ulow [t]: the lowest centrality node.
1: //Update the approximate graph
2: update the approximate graph by the update algorithm;
3: //Search for the lowest centrality node
4: Vexact ← empty set;
5: compute the BFS-tree of node ulow [t − 1];
6: compute θ, the exact centrality of node ulow [t − 1];
7: for each node v′ of the approximate graph do
8: compute Cv′ in the approximate graph by the estimation algorithm;
9: if Cv′ ≤ θ then
10: for each original node v ∈ v′ do
11: append node v to Vexact ;
12: end for
13: end if
14: end for
15: for each node v ∈ Vexact do
16: compute Cv in the original graph by the estimation algorithm;
17: if Cv < θ then
18: θ ← Cv ;
19: ulow [t] ← v;
20: end if
21: end for
22: return ulow [t];
node. Otherwise, Sniper appends aggregated original nodes to Vexact (lines 9-13),
and then computes exact centralities to identify the lowest centrality node (lines
15-21).
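Putting the pieces together, the search phase of Algorithm 1 can be sketched as below, reusing the `closeness_centrality` and `approx_centrality` helpers sketched earlier. The graph-update step and the early-terminating estimation of exact centralities are omitted (plain BFS is used instead), so this is only an illustration of the pruning logic, not the full Sniper implementation; `node_of` and the other names are illustrative.

```python
def sniper_search(original_adj, approx_adj, size, node_of, u_low_prev):
    """Pruning phase of Algorithm 1 (sketch): super-nodes whose lower-bound
    centrality exceeds theta are discarded; the originals aggregated in the
    survivors are then checked exactly.  node_of[v] lists the original nodes
    aggregated in super-node v; u_low_prev is the previous time tick's answer."""
    theta = closeness_centrality(original_adj, u_low_prev)  # exact centrality of previous answer
    u_low = u_low_prev
    candidates = []
    for v_prime in approx_adj:                    # Algorithm 1, lines 7-14
        if approx_centrality(approx_adj, size, v_prime) <= theta:
            candidates.extend(node_of[v_prime])
    for v in candidates:                          # Algorithm 1, lines 15-21
        c_v = closeness_centrality(original_adj, v)
        if c_v < theta:
            theta, u_low = c_v, v
    return u_low
```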
5 Extension
In this section, we give a discussion of possible extensions to Sniper.
6 Theoretical Analysis
This section provides theoretical analyses that confirm the accuracy and com-
plexity of Sniper. Let n be the number of nodes and m the number of edges.
We prove that Sniper finds the lowest centrality node accurately (without fail)
as follows:
Efficient Centrality Monitoring for Time-Evolving Graphs 47
Theorem 1 (Find the lowest centrality node). Sniper guarantees the exact
answer when identifying the node whose centrality is the lowest.
Proof. Let ulow be the lowest centrality node in the original graph, and θlow
be the exact centrality of ulow (i.e., θlow is the lowest centrality). Also let θ be
the candidate centrality in the search process.
In the approximate graph, since θlow ≤ θ, the approximate centrality of node
ulow is never greater than θ (Lemma 2). Similarly, in the original graph, the
estimated centrality of node ulow is never greater than θ (Lemma 3). The algorithm
discards a node if (and only if) its approximate or estimated centrality is greater
than θ. Therefore, the lowest centrality node ulow can never be pruned during
the search process.
We now turn to the complexity of Sniper. Note that the previous approaches
need O(n2 ) space and O(n3 ) time to compute the lowest centrality node.
Theorem 2 (Complexity). Sniper requires O(n + m) space and O(n2 + nm) time to identify the lowest centrality node.
Proof. We first prove that Sniper requires O(n + m) space. Sniper keeps the
approximate graph and the original graph. In the approximate graph, since the
number of nodes and edges are at most n and m, respectively, Sniper needs
O(n+m) space for the approximate graph; O(n+m) space is required for keeping
the original graph. Therefore, the space complexity of Sniper is O(n + m).
Next, we prove that Sniper requires O(n2 + nm) time. To identify the lowest
centrality node, Sniper first updates the approximate graph and then computes
approximate and exact centralities. Sniper needs O(nm) time to update the
approximate graph, since it requires O(m) time to compute similarity for the
added node against each node in the original graph. It requires O(n2 + nm)
time to compute the approximate and exact centralities since the number of
nodes and edges are at most n and m, respectively. Therefore, Sniper requires
O(n2 + nm) time.
Theorem 2 shows, theoretically, that the space and time complexities of Sniper
are lower in order than those of the previous approximate approaches. In practice,
the search cost depends on the effectiveness of the approximation and estimation
techniques used by Sniper. In the next section, we show their effectiveness by
presenting the results of extensive experiments.
7 Experimental Evaluation
We performed experiments to demonstrate Sniper’s effectiveness in a comparison
to two annotation approaches: the Zone annotation scheme and the Distance to
Zone annotation scheme (abbreviated to DTZ). These were selected since they
outperform the other embedding schemes on the contents of our dataset; the
same result is reported in [13]. Zone and DTZ annotation have two parameters:
zones and dimensions. Zones are divided regions of the entire graph, and dimensions are sets of zones1 . Note that these approaches can compute centrality quickly at the expense of exactness.
(Figure: wall clock time and error ratio of Sniper, Zone, and DTZ on the P2P, Social, and WWW datasets, plotted against the number of nodes and the number of dimensions.)
We used the following three public datasets in the experiments: P2P [1], Social
[2], and WWW [3]. They are a campus P2P network for file sharing, a free on-
line social network, and web pages within the ‘nd.edu’ domain, respectively. We
extracted the largest connected component from the real data, and we added
single nodes one by one in the experiments.
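As an illustration of this preprocessing, the largest connected component can be extracted as follows; this is a small sketch using NetworkX with the graphs treated as undirected, and is not necessarily the tooling used by the authors.

```python
import networkx as nx

def largest_component(edge_list):
    """Keep only the largest connected component (graph treated as undirected)."""
    g = nx.Graph()
    g.add_edges_from(edge_list)
    nodes = max(nx.connected_components(g), key=len)
    return g.subgraph(nodes).copy()
```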
We evaluated the search performance through wall clock time. All experiments
were conducted on a Linux quad 3.33 GHz Intel Xeon server with 32GB of main
memory. We implemented our algorithms using GCC.
1 To compute the centralities of all nodes by the annotation approaches, we sampled half of the pairs from all nodes, which is the same setting used in [13].
exact centralities for nodes that cannot be pruned through approximation. This
cost, however, has no effect on the experimental results. This is because a sig-
nificant number of nodes are pruned by approximation.
8 Conclusions
This paper addressed the problem of finding the lowest closeness centrality node
from time-evolving graphs efficiently and exactly. Our proposal, Sniper, is based
on two ideas: (1) It approximates the original graph by aggregating original nodes
to compute approximate centralities efficiently, and (2) It terminates unnecessary
distance computations early in finding the answer nodes, which greatly improves
overall efficiency. Our experiments show that Sniper works as expected; it can
find the lowest centrality node at high speed; specifically, it is significantly (more
than 110 times) faster than existing approximate methods.
References
1. http://kdl.cs.umass.edu/data/canosleep/canosleep-info.html
2. http://snap.stanford.edu/data/soc-LiveJournal1.html
3. http://vlado.fmf.uni-lj.si/pub/networks/data/ND/NDnets.htm
4. Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of
the web. Computer Networks 29(8-13), 1157–1166 (1997)
5. Elmacioglu, E., Lee, D.: On six degrees of separation in dblp-db and more. SIG-
MOD Record 34(2), 33–40 (2005)
6. Garfield, E.: Citation Analysis as a Tool in Journal Evaluation. Science 178, 471–
479 (1972)
7. Goldberg, A.V., Harrelson, C.: Computing the shortest path: search meets graph
theory. In: SODA, pp. 156–165 (2005)
8. Leskovec, J., Kleinberg, J.M., Faloutsos, C.: Graph evolution: Densification and
shrinking diameters. TKDD 1(1) (2007)
9. Nascimento, M.A., Sander, J., Pound, J.: Analysis of sigmod’s co-authorship graph.
SIGMOD Record 32(3), 8–10 (2003)
10. Newman, M.E.J.: The structure and function of complex networks. SIAM Review 45, 167–256 (2003)
11. Ng, T.S.E., Zhang, H.: Predicting internet network distance with coordinates-based
approaches. In: INFOCOM (2002)
12. Potamias, M., Bonchi, F., Castillo, C., Gionis, A.: Fast shortest path distance
estimation in large networks. In: CIKM, pp. 867–876 (2009)
13. Rattigan, M.J., Maier, M., Jensen, D.: Using structure indices for efficient approx-
imation of network properties. In: KDD, pp. 357–366 (2006)
Graph-Based Clustering with Constraints
1 Introduction
One of the primary forms of adding background knowledge for clustering the data is
to provide constraints during the clustering process [1]. Recently, data clustering using
constraints has received a lot of attention. Several works in the literature have demon-
strated improved results by incorporating external knowledge into clustering in differ-
ent applications such as document clustering and text classification. The addition of some
background knowledge can sometimes significantly improve the quality of the final
results obtained. The final clusters that do not obey the initial constraints are often in-
adequate for the end-user. Hence, adding constraints and respecting these constraints
during the clustering process plays a vital role in obtaining desired results in many
practical domains. Several methods are proposed in the literature for adding instance-
level and cluster-level constraints. Constrained versions of partitional [19,1,7], hierar-
chical [5,13] and more recently, density-based [17,15] clustering algorithms have been
studied thoroughly. However, there has been little work in utilizing the constraints in
the graph-based clustering methods [14].
– Investigate the appropriate way of embedding constraints into the graph-based clus-
tering algorithm for obtaining better results.
– Propose a novel distance limit criterion for must-links and cannot-links while em-
bedding constraints.
– Study the effects of adding different types of constraints to graph-based clustering.
The remainder of the paper is organized as follows: we briefly review the current ap-
proaches for using constraints in different methods in Section 2. In Section 3, we will
describe the various notations used throughout our paper and also give an overview of
a graph-based clustering method, namely, CHAMELEON. Next, we propose our algo-
rithm and discuss our approach regarding how and where to embed constraints in Sec-
tion 4. We present several empirical results on different UCI datasets and comparisons
to the state-of-the-art methods in Section 5. Finally, Section 6 concludes our discussion.
2 Relevant Background
Constraint-based clustering has received a lot of attention in the data mining community
in the recent years [3]. In particular, instance-based constraints have been successfully
used to guide the mining process. Instance-based constraints enforce constraints on
data points as opposed to ε and δ constraints, which work on complete clusters.
The ε-constraint says that for a cluster X having more than two points, for each point
x ∈ X, there must be another point y ∈ X such that the distance between x and
y is at most ε. The δ-constraint requires the distance between any two points in different
clusters to be at least δ. This methodology has also been termed semi-supervised
clustering [9] when the cluster memberships are available for some data. As pointed out
in the literature [19,5], even adding a small number of constraints can help in improving
the quality of results.
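For illustration, the ε- and δ-constraints just described can be checked as follows; this is a small sketch with illustrative names, assuming points are given as a coordinate array and cluster memberships as an integer label array.

```python
import numpy as np

def satisfies_epsilon(points, labels, eps):
    """Epsilon-constraint: in every cluster with more than two points, each point
    has another point of the same cluster within distance eps."""
    points, labels = np.asarray(points, dtype=float), np.asarray(labels)
    for c in np.unique(labels):
        cluster = points[labels == c]
        if len(cluster) <= 2:
            continue
        d = np.linalg.norm(cluster[:, None] - cluster[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        if (d.min(axis=1) > eps).any():
            return False
    return True

def satisfies_delta(points, labels, delta):
    """Delta-constraint: any two points in different clusters are at least delta apart."""
    points, labels = np.asarray(points, dtype=float), np.asarray(labels)
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    diff = labels[:, None] != labels[None, :]
    return not (d[diff] < delta).any()
```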
Embedding instance-level constraints into the clustering method can be done in sev-
eral ways. A popular method of incorporating constraints is to compute a new distance
metric and perform clustering. Other methods directly embed constraints into optimiza-
tion criteria of the clustering algorithm [19,1,5,17]. Hybrid methods of combining these
two basic approaches are also studied in the literature [2,10]. Adding instance-level con-
straints to the density-based clustering methods had recently received some attention as
well as [17,15]. Inspite of the popularity of graph-based clustering methods, not much
attention is given to the problem of adding constraints to these methods.
3 Preliminaries
Let us consider a dataset D, whose cardinality is represented as |D|. The total number
of classes in the dataset is K. A proximity graph is constructed from this dataset by
computing the pair-wise Euclidean distance between the instances. A user-defined pa-
rameter k is used to define the number of neighbors considered for each data point. The
hyper-graph partitioning algorithm generates intermediate subgraphs (or sub-clusters),
which are represented by κ.
3.1 Definitions
Given a dataset D with each point denoted as (x, y) where x represents the point and y
represents the corresponding label, we define constraints as follows:
Definition 1: Must-Link Constraints (ML): Two instances (x1 , y1 ) and (x2 , y2 ) are said
to be must-link constraints if and only if y1 = y2 , where y1 , y2 ∈ K.
Definition 2: Cannot-Link Constraints (CL): Two instances (x1 , y1 ) and (x2 , y2 ) are
said to be cannot-link constraints if and only if y1 ≠ y2 , where y1 , y2 ∈ K.
Definition 3: Transitivity of ML-constraints: Let X, Y be two components formed
using ML-constraints. Then, a new ML-constraint (x1 , x2 ) with x1 ∈ X and x2 ∈ Y
introduces a new ML-constraint (xi , xj ) for all xi ∈ X and xj ∈ Y with i ≠ j, X ≠ Y .
Definition 4: Entailment of CL-constraints: Let X, Y be two components formed using
ML-constraints. Then, a new CL-constraint (x1 , x2 ) with x1 ∈ X and x2 ∈ Y
introduces a new CL-constraint (xi , xj ) for all xi ∈ X and xj ∈ Y with i ≠ j, X ≠ Y .
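Definitions 3 and 4 amount to closing the must-link relation under transitivity (connected components) and lifting each cannot-link to the whole pair of components. A minimal union-find sketch is given below; names are illustrative and points are assumed to be identified by indices.

```python
from itertools import combinations

def propagate_constraints(ml_pairs, cl_pairs):
    """Close must-links under transitivity and entail cannot-links between
    whole must-link components (Definitions 3 and 4)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in ml_pairs:                      # union the endpoints of each ML constraint
        parent[find(a)] = find(b)
    comps = {}
    for x in list(parent):
        comps.setdefault(find(x), set()).add(x)
    # Transitive closure of must-links: all pairs inside each component.
    ml_closed = {p for comp in comps.values() for p in combinations(sorted(comp), 2)}
    # Entailed cannot-links: every cross pair between the two components of a CL constraint.
    cl_closed = set()
    for a, b in cl_pairs:
        ca = comps.get(find(a), {a})
        cb = comps.get(find(b), {b})
        cl_closed.update((x, y) for x in ca for y in cb if x != y)
    return ml_closed, cl_closed
```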
where EC(X, Y ) is the sum of the edges that connect clusters X and Y in the k-nearest
neighbor graph, EC(X) is the minimum sum of the cut edges if we bisect cluster X,
and EC(Y ) is the minimum sum of the cut edges if we bisect cluster Y . Let lx and ly
represent the sizes of clusters X and Y , respectively. Mathematically, Relative Closeness
(RC) is defined as follows:

RC = \frac{\bar{S}_{EC}(X, Y)}{\frac{l_x}{l_x + l_y}\,\bar{S}_{EC}(X) + \frac{l_y}{l_x + l_y}\,\bar{S}_{EC}(Y)} \qquad (2)
Using a distance (or dissimilarity) metric to enforce constraints [13] was claimed to be
effective in practice, despite having some drawbacks. The main problem is caused by
setting the distance to zero between all must-linked pairs of constraints, i.e., let
(pi , pj ) be two instances in a must-link constraint; then
distance(pi , pj ) = 0.
At first look, it may seem that this minute change will not affect the results signif-
icantly. However, after running the all-pairs-shortest-path algorithm, the updated distance
matrix in this case will respect the original distance measures better than setting the dis-
tance to zero. Similarly, for cannot-link constraints, let (qi , qj ) be a pair of cannot-link
constraints; then the points qi and qj are pushed as far apart as possible (i.e., their distance is set to a very large value).
Thus, by varying the values of η and λ, we can reasonably push and pull points apart. It
may seem that this creates a problem of finding optimal values of η and λ. However,
our preliminary experiments show that the basic limiting values for these parameters are
enough in most cases. This addition of constraints (and thus the manipulation of
the distance matrix) can be performed in the CHAMELEON algorithm in either
of the two phases. We can add these constraints before (or after) the graph partitioning
step. After the graph partitioning, we can add constraints during agglomerative cluster-
ing. However, we prefer to add constraints before graph partitioning primarily due to
the following reasons:
– When the data points are already in sub-clusters, enforcing constraints through dis-
tances will not be beneficial unless we ensure that all such constraints are satis-
fied during the agglomerative clustering. However, constraint satisfaction might not
lead to convergence every time. Especially with CL constraints, even determining
whether a satisfying assignment exists is NP-complete.
– The intermediate clusters formed are based on the original distance metric. Hence, RI
and RC on the original distance metric would be undermined by the new distance
updates introduced through constraints.
Using Constraints. Our algorithm begins by using constraints to modify the distance
matrix. To utilize the properties of the constraints (Section 3.1) and to restore the metricity
of the distances, we propagate constraints. Must-links are propagated by
running a fast version of the all-pairs-shortest-path algorithm. If u, v represent the source
and destination, respectively, then the shortest path between u and v involves only the points
u, v and x, where x must belong to some pair of ML constraints. Using this modification,
the algorithm runs in O(n2 m) time (here m is the number of unique points in ML). The
complete-link clustering inherently propagates the cannot-link constraints. Thus, there
is no need to propagate CL constraints during Step 1.
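A minimal sketch of this constraint-embedding step follows, assuming a symmetric distance matrix D and constraint lists of index pairs; η and λ are the must-link and cannot-link distance limits discussed above. The sketch does not re-enforce cannot-link distances after the shortest-path pass and is only an illustration, not the authors' implementation.

```python
import numpy as np

def constrain_distances(D, ml_pairs, cl_pairs, eta, lam):
    """Pull must-linked pairs to distance eta, push cannot-linked pairs to lam,
    then restore (approximate) metricity with a Floyd-Warshall pass whose
    intermediate vertices are restricted to points appearing in ML constraints."""
    D = np.asarray(D, dtype=float).copy()
    for i, j in ml_pairs:
        D[i, j] = D[j, i] = eta
    for i, j in cl_pairs:
        D[i, j] = D[j, i] = lam
    ml_points = sorted({p for pair in ml_pairs for p in pair})
    for k in ml_points:                      # O(n^2 * m) with m unique ML points
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D
```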
5 Experimental Results
We will now present our experimental results obtained using the proposed method on
benchmark datasets from the UCI Machine Learning Repository [8]. Our results on various
versions of Constrained CHAMELEON (CC) were obtained with the same parameter set-
tings for each dataset. These parameters were not tuned particularly for CHAMELEON;
however, we did follow some specific guidelines for each dataset to obtain these param-
eters. We used the same default settings for all the internal parameters of the METIS
Table 1. Average Rand Index values for 100 ML + CL constraints on UCI datasets
We used five UCI datasets in our experiments, as shown in Table 1. The average Rand
Index values for 100 must-link and cannot-link constraints clearly show that on most
occasions, MPCK-means [4] is outperformed by both variants of CC. CC(fixed)
performed marginally better than CC(p=1). Also, we only show results for CC(p=5)
and CC(p=15), since the results of CC(p=1) and CC(p=10) are similar to the other two.
For each dataset, we randomly select constraints and run the algorithm once per con-
straint set. This is repeated 10 times and we report the average Rand Index value over
all 10 runs. We used this procedure for all the variants of CC and
MPCK-means. The results are depicted in Figs. 1-3. Recall that the distance values
for must-links and cannot-links can be varied instead of using the fixed values 0 and ∞, re-
spectively. CC(fixed) uses (0, ∞) for the distance measures. Ideally, the values of
(η, λ) could be anything close to the extreme values of 0 and ∞, yet they have to be quan-
tifiable. In order to quantify them in our experiments, we defined them as follows:
where Dmax , Dmin represent the maximum and minimum distance values in the data ma-
trix, respectively. In order to study the effect of p, we varied its values: p = 1, 5, 10 and
15. Thus, we have CC(p = 1), CC(p = 5), CC(p = 10) and CC(p = 15), referring to differ-
ent values of (η, λ). It is interesting to note that, for different values of p, the distance values
for constraints are different for each dataset due to the different minimum and maximum
distance values. In this manner, we respect the metricity of the original distances and vary
our constraint values accordingly.
We tried various parameter settings and found that only a few selected ones made a
significant difference in the quality of the final results. It is also important to note
that these settings were found by running the basic CHAMELEON algorithm rather
than CC. This is because finding optimal parameters for CC using various constraints
would be constraint-specific and would not truly represent the algorithmic aspect. We
then ran CC using a few selected settings for all the variants of CC and all constraint-set
sizes, and finally report the average values for the single set of parameters that showed
better performance on average across all CC variants. The individual settings of
parameters (k, κ, α) for each dataset shown in the results are as follows: Ion(19,10,1.2),
Iris(9,3,1), Liver(10,5,2), Sonar(6,3,1) and Wine(16,3,1). In summary, we selected the
best results obtained by the basic version of the CHAMELEON algorithm, and have
shown that these best results can be improved by adding constraints.
Across all the variants of CC and MPCK-means, and for all datasets, we consistently
observed that the performance decreases as the number of constraints increases, except
in some prominent cases (Figs. 1(d), 2(a), 2(b) and 3(d)). This observation is consistent
with the results outlined in the recent literature [6]. We stated earlier that we did not
attempt to satisfy constraints implicitly or explicitly. However, we observed that during
Step 3 of Algorithm 1, for fewer constraints, the constraint violation is most often
zero in the intermediate clusters, which is often reflected in the final partitions. As the
number of constraints increases, the number of constraint violations also increases. How-
ever, on average, violations are roughly between 10%-15% for must-link constraints,
20%-25% for cannot-link constraints, and about 15%-20% for must-links and cannot-
links combined. We also observed that, in a few cases, the constraint violations are reduced
after Step 4, i.e., after the final agglomerative clustering. Thus, we can conclude that the
effect of constraints is significant in Step 3, and we re-iterate that embedding constraints
earlier is always better for CC.
(Figures 1 and 2: Rand Index versus the number of ML constraints and the number of CL constraints, respectively, for CC(fixed), CC(p=5), CC(p=15), and MPCK on the UCI datasets; panels (a)-(d).)
(Rand Index = 1) during some of the runs for 190 constraints, with the same settings as used
in all other experiments shown. However, based on these initial results, it would be too early
to conclude that finding tuned values for (η, λ) will always increase performance;
this will need further experimental evidence.
Based on our findings, we observed that changing the values of (η, λ) did sometimes
increase performance, but not consistently, and it can also sometimes lead to a decrease in
performance. We were also surprised by this phenomenon, demonstrated by both
algorithms. In our case, carrying out more experiments with additional constraints revealed
that this decrease in performance holds only up to a particular number of constraints. Af-
ter that, we again see a rise in performance, and with a sufficient number of constraints (1%
to 5% of constraints in our case with these datasets), we are able to recover the original
clustering, or come close to it (Rand Index close to 1.0). CC(fixed), compared to the other vari-
ants of CC, was only slightly different on average. CC(fixed) performed reasonably
well across all the datasets in nearly all settings compared with MPCK-means. The other variants of
CC were also better on average than MPCK-means. Thus, our algorithm
performed better than MPCK-means in terms of handling the decrease in performance
when the number of constraints increases. Most importantly, our algorithm performed
well despite not trying to satisfy constraints implicitly or explicitly.
(Figure 3: Rand Index versus the number of combined ML and CL constraints for CC(fixed), CC(p=5), CC(p=15), and MPCK; panels (a)-(d).)
6 Conclusion
In this work, we presented a novel constrained graph-based clustering method based
on the CHAMELEON algorithm. We proposed a new framework for embedding con-
straints into the graph-based clustering algorithm to obtain promising results. Specifi-
cally, we thoroughly investigated the “how and when to add constraints” aspect of the
problem. We also proposed a novel distance limit criterion for use while em-
bedding constraints into the distance function. Our algorithm outperformed the popular
MPCK-means method on several real-world datasets under various constraint settings.
References
1. Basu, S., Banerjee, A., Mooney, R.J.: Semi-supervised clustering by seeding. In: Proceedings
of the Nineteenth International Conference on Machine Learning (ICML 2002), pp. 27–34
(2002)
2. Basu, S., Bilenko, M., Mooney, R.J.: A probabilistic framework for semi-supervised cluster-
ing. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 59–68 (2004)
3. Basu, S., Davidson, I., Wagstaff, K.: Constrained Clustering: Advances in Algorithms, The-
ory, and Applications. Chapman & Hall/CRC (2008)
4. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning in semi-
supervised clustering. In: Proceedings of the Twenty-first International Conference on Ma-
chine Learning, ICML 2004 (2004)
5. Davidson, I., Ravi, S.S.: Agglomerative Hierarchical Clustering with Constraints: Theoreti-
cal and Empirical Results. In: Jorge, A.M., Torgo, L., Brazdil, P.B., Camacho, R., Gama, J.
(eds.) PKDD 2005. LNCS (LNAI), vol. 3721, pp. 59–70. Springer, Heidelberg (2005)
6. Davidson, I., Ravi, S.S., Shamis, L.: A sat-based framework for efficient constrained cluster-
ing. In: Jonker, W., Petković, M. (eds.) SDM 2010. LNCS, vol. 6358, pp. 94–105. Springer,
Heidelberg (2010)
7. Davidson, I., Wagstaff, K.L., Basu, S.: Measuring Constraint-Set Utility for Partitional Clus-
tering Algorithms. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS
(LNAI), vol. 4213, pp. 115–126. Springer, Heidelberg (2006)
8. Frank, A., Asuncion, A.: UCI machine learning repository (2010),
http://archive.ics.uci.edu/ml
9. Gunopulos, D., Vazirgiannis, M., Halkidi, M.: From unsupervised to semi-supervised learn-
ing: Algorithms and evaluation approaches. In: SIAM International Conference on Data Min-
ing: Tutorial (2006)
10. Halkidi, M., Gunopulos, D., Kumar, N., Vazirgiannis, M., Domeniconi, C.: A framework for
semi-supervised learning based on subjective and objective clustering criteria. In: Proceed-
ings of the 5th IEEE International Conference on Data Mining (ICDM 2005), pp. 637–640
(2005)
11. Karypis, G., Han, E.-H., Kumar, V.: Chameleon: Hierarchical clustering using dynamic mod-
eling. IEEE Computer 32(8), 68–75 (1999)
12. Karypis, G., Kumar, V.: Metis 4.0: Unstructured graph partitioning and sparse matrix order-
ing system. Tech. Report, Dept. of Computer Science, Univ. of Minnesota (1998)
13. Klein, D., Kamvar, S.D., Manning, C.D.: From instance-level constraints to space-level con-
straints: Making the most of prior knowledge in data clustering. In: Proceedings of the Nine-
teenth International Conference on Machine Learning (ICML 2002), pp. 307–314 (2002)
14. Kulis, B., Basu, S., Dhillon, I.S., Mooney, R.J.: Semi-supervised graph clustering: a ker-
nel approach. In: Proceedings of the Twenty-Second International Conference on Machine
Learning (ICML 2005), pp. 457–464 (2005)
15. Lelis, L., Sander, J.: Semi-supervised density-based clustering. In: Perner, P. (ed.) ICDM
2009. LNCS, vol. 5633, pp. 842–847. Springer, Heidelberg (2009)
16. Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the Amer-
ican Statistical Association 66(336), 846–850 (1971)
17. Ruiz, C., Spiliopoulou, M., Menasalvas, E.: Density based semi-supervised clustering. Data
Mining and Knowledge Discovery 21(3), 345–370 (2009)
18. Tan, P.-N., Steinbach, M., Kumar, V.: Introduction to Data Mining, US edition. Addison
Wesley, Reading (2005)
19. Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S.: Constrained k-means clustering with back-
ground knowledge. In: Proceedings of the Eighteenth International Conference on Machine
Learning (ICML 2001), pp. 577–584 (2001)
A Partial Correlation-Based Bayesian Network
Structure Learning Algorithm under SEM
1 Introduction
Learning the structure of a Bayesian network from a dataset D is useful; unfortu-
nately, it is an NP-hard problem [2]. Consequently, many heuristic techniques
have been proposed. One of the most basic search algorithms is a local greedy
hill-climbing search over all DAG structures. The size of the search space of
greedy search is super-exponential in the number of variables. One class of ap-
proaches places constraints on the search to improve its efficiency; examples include
the K2 algorithm [3], the SC algorithm [4], the MMHC algorithm [15],
and the L1MB algorithm [7].
One drawback of the K2 algorithm is that it requires a total variable order-
ing. The SC algorithm first introduced the local learning idea and proposed a two-phase
framework consisting of a Restrict step and a Search step. In the Restrict step, the
SC algorithm uses mutual information to find a set of potential neighbors for
each node and achieves fast learning by restricting the search space. One draw-
back of the SC algorithm is that it only allows a variable to have a maximum
of k parents. However, a common parameter k for all nodes will have to sac-
rifice either efficiency or quality of reconstruction [15]. The MMHC algorithm
uses the max-min parents-children (MMPC) algorithm to identify a set of po-
tential neighbors [15]. Experiments show that the MMHC algorithm has quite
high accuracy, one drawback of which is that it needs conditional independence
tests on exponentially large conditioning sets. The L1MB algorithm introduces
L1 techniques to learn DAG structure and uses the LARS algorithm to find a
set of potential neighbors [7]. The L1MB algorithm has good time performance.
However, the L1MB algorithm can describe the correlation between a set of
variables and a variable, not the correlation between two variables. Experiments
show that the L1MB algorithm has low accuracy.
In fact, many algorithms, such as K2, SC, PC [13], TPDA [1], and MMHC, can
be implemented efficiently for discrete variables, but are not straightforwardly
applicable to continuous variables. The L1MB algorithm has been designed
for continuous variables. However, its accuracy is not very high.
The partial correlation method can reveal the true correlation between two vari-
ables by eliminating the influences of other correlative variables [16]. It has been
successfully applied to many fields such as medicine [8], economics [14], and geol-
ogy [16]. In causal discovery, it has been used (transformed by Fisher's z [12])
as a continuous replacement for CI tests in the PC algorithm. Pellet et al. introduced
the partial-correlation-based CI test into causation discovery under the assumption
that the data follow a multivariate Gaussian distribution for continuous variables [9].
However, when the data do not follow a multivariate Gaussian distribution, can
partial correlation still serve as a CI test?
Our first contribution is a proof that partial correlation can
be used as the criterion of a CI test under the linear simultaneous equation model
(SEM), which includes the multivariate Gaussian distribution as a special case. Our
second contribution is an effective algorithm, called PCB (Par-
tial Correlation-Based), which combines ideas from local learning with partial
correlation techniques in an effective way. The PCB algorithm works in the continu-
ous-variable setting under the assumption that the data are generated by a SEM. The computational
complexity of PCB is O(3mn2 + n3 ) (n is the number of variables and m is the
number of cases). The advantages of PCB are that it has quite good time perfor-
mance and quite high accuracy; its time complexity is polynomially
bounded by the number of variables. A third advantage of the PCB algorithm is
that it uses a relevance threshold to evaluate the correlation, which alleviates the
drawback of the SC algorithm (a common parameter k for all nodes), and
we also find the best relevance threshold through a series of extensive experiments.
Empirical results show that PCB outperforms the above existing algorithms in
both accuracy and time performance.
The remainder of the paper is structured as follows. In section 2, we present
PCB algorithm and give computational complexity analysis. Some empirical
results are shown and discussed in section 3. Finally, we conclude our work and
address some issues about future work in section 4.
2 PCB Algorithm
The PCB (Partial Correlation-Based) algorithm includes two steps: the Restrict step and the
Search step.
The Restrict step is analogous to the pruning step of the SC, MMHC, and
L1MB algorithms. In this paper, partial correlation is
used to identify the candidate neighbors. To a certain extent, there is a corre-
lation between any two variables, but the correlation is affected by the other
correlative variables. The simple correlation method does not consider these influences,
so it cannot reveal the true correlation between two variables. Partial correla-
tion can eliminate the influences of other correlative variables and reveal the
true correlation between two variables. A larger magnitude of the partial correlation
coefficient means a closer correlation [Xu et al., 2007]. So partial correlation is
used to select the potential neighbors. Before we give our algorithm, we give
some definitions and theorems.
Definition 2.1[9] (SEM). A SEM (structural equation model) is a set of equa-
tions describing the value of each variable Xi in X as a function fXi of its parents
Pa(Xi ) and a random disturbance term uXi :
all its descendants are not in Z. If X and Y are not d-separated by Z, they are
d-connected, denoted as Dcon(Xi , Xj |Z).
Theorem 2.1.[12] In a SEM with uncorrelated errors (that means for two ran-
dom variables Xi , Xj ∈ X, uXi and uXj are uncorrelated), Z ⊆ X \ {Xi , Xj },
a partial correlation ρ(Xi , Xj |Z) is entailed to be zero if and only if Xi and Xj
are d-separated given Z.
Definition 2.4[10] (Perfect map). If the Causal Markov and Faithfulness con-
ditions hold together, a DAG G is a directed perfect map of a joint probability
distribution P (X), and there is a bijection between d-separation in G and condi-
∀Xi , Xj ∈ X, ∀Z ⊆ X \ {Xi , Xj } : Ind(Xi , Xj |Z) ⇔ Dsep(Xi , Xj |Z) (3)
Definition 2.5[5](Linear Correlation). In a variable set X, the linear corre-
lation coefficient γXi Xj between two random variables Xi , Xj ∈ X, provides the
most commonly used measure to assess the strength of the linear relationship
between Xi and Xj is defined by
γXi Xj = σXi Xj /σXi σXj (4)
where σXi Xj ,denotes the covariance between Xi and Xj , and σXi and σXj denote
the standard deviation of Xi and Xj respectively. γxi xj is estimated by
\hat{\gamma}_{X_i X_j} = \frac{\sum_{k=1}^{m} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{m} (x_{ki} - \bar{x}_i)^2}\,\sqrt{\sum_{k=1}^{m} (x_{kj} - \bar{x}_j)^2}} \qquad (5)
Here, m is the number of instances. xki means k-th realization (or case) of Xi ,
and x̄i is the mean of Xi . xkj means k-th case of Xj , and x̄j is the mean of Xj .
Definition 2.6[10] (Partial correlation). In a variable set X, the partial
correlation between two random variables Xi , Xj ∈ X, given Z ⊆ X \ {Xi , Xj },
noted ρ(Xi , Xj |Z), is the correlation of the residuals RXi and RXj resulting from
the least-squares linear regression of Xi on Z and of Xj on Z, respectively. Partial
correlation can be computed efficiently without having to solve the regression
problem by inverting the correlation matrix R of X. With R−1 = (rij ), here
R−1 is the inverse matrix of R, we have:
\rho(X_i, X_j \mid X \setminus \{X_i, X_j\}) = -r^{ij} / \sqrt{r^{ii}\, r^{jj}} \qquad (6)
In this case, we can compute all partial correlations with a single matrix inver-
sion. This is an approach we use in our algorithm.
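A minimal sketch of this computation, and of the thresholding used in the Restrict step, is given below; names are illustrative, and a pseudo-inverse is used in case the correlation matrix is singular, a situation the paper does not discuss.

```python
import numpy as np

def partial_correlations(data):
    """All pairwise partial correlations rho(Xi, Xj | X \\ {Xi, Xj}) from a single
    inversion of the correlation matrix (equation (6)); `data` is an m-by-n array
    of cases, one column per variable."""
    R = np.corrcoef(data, rowvar=False)      # n-by-n correlation matrix
    P = np.linalg.pinv(R)                    # R^{-1} = (r^{ij})
    d = np.sqrt(np.outer(np.diag(P), np.diag(P)))
    rho = -P / d
    np.fill_diagonal(rho, 1.0)
    return rho

def restrict_step(data, k=0.1):
    """Restrict step: Xi is kept as a potential neighbor of Xj (i < j)
    whenever |rho(Xi, Xj | Z)| >= k."""
    rho = partial_correlations(data)
    n = rho.shape[1]
    PNM = (np.abs(np.triu(rho, 1)) >= k).astype(int)
    PN = {j: [i for i in range(n) if PNM[i, j] == 1] for j in range(n)}
    return PN, PNM
```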
Theorem 2.2. In a SEM with uncorrelated errors, when the data are generated by
the SEM, no matter what distribution the disturbances follow, we can use partial
correlation as the criterion of a CI test.
Proof: From Theorem 2.1, ∀Xi , Xj ∈ X, ∀Z ⊆ X \ {Xi , Xj }, the partial corre-
lation ρ(Xi , Xj |Z) is entailed to be zero if and only if Xi and Xj are d-separated
The outline of the Restrict step is shown in Fig.1. Input of the step is threshold
k and a dataset D = {x1 , · · · , xm } of instances of X, where each xi is a complete
assignment to the variables X1 , · · · , Xn in V al(X1 , · · · , Xn ). Each column of the
dataset represents one variable. Output of the step is a set of potential neighbors
PN(Xj ) of each Xj and the matrix of potential neighbors PNM. If PNM(i, j)
is 1, that means Xi is Xj ’s potential neighbor. Otherwise, if PNM(i, j) is 0,
Xi isn’t Xj ’s potential neighbor. Initially, PN(Xj )(potential neighbors of each
variable Xj ) is empty, all elements of PNM are set to 0 (step 1 ). Then we
select a set of potential neighbors for each variable and obtain the final matrix
of potential neighbors (from step 2 to step 9). For each pair of variables Xi and
Xj (Xi , Xj ∈ X, j = 1 to n, i = 1, . . . , j, i ≠ j), with Z = X \ {Xi , Xj }, we calculate
ρ(Xi , Xj |Z); if the absolute value of ρ(Xi , Xj |Z) is greater than or equal to k, then Xi is chosen
as Xj 's potential neighbor and the value of PNM(i, j) is set to 1; otherwise
the value of PNM(i, j) is set to 0. In fact, ρ(Xi , Xj |Z) (i < j) equals ρ(Xj , Xi |Z);
however, if there is strong correlation between them, we only set PNM(i, j) to
1. PNM is an upper triangular matrix whose diagonal elements are 0. Because
the Search step includes a reverse-edge operator, the greedy hill-climbing
search can orient the edges properly.
MDL(G) = \sum_{i=1}^{n} \left( NLL(i, \mathrm{Pa}(X_i), \hat{\theta}_i^{mle}) + \frac{|\hat{\theta}_i^{mle}|}{2} \log m \right) \qquad (7)

NLL(i, \mathrm{Pa}(X_i), \theta) = -\sum_{j=1}^{m} \log P(X_{j,i} \mid X_{j,\mathrm{Pa}(X_i)}, \theta) \qquad (8)
This scoring method is used in [7], where m is the number of data cases, n is the
number of variables, Pa(Xi ) are the parents of node i in G, N LL(i, Pa(Xi ), θ)
is the negative log-likelihood of node i with parents Pa(Xi ) and parameters θ,
and θ̂imle denotes the maximum likelihood estimate of the parameters of node i.
Input: a dataset D = {x^1, . . . , x^m}, threshold k
Output: a set of potential neighbors PN(Xj) of each variable Xj and the potential-neighbor matrix PNM
1. PN(Xj) = ∅ (Xj ∈ X, j = 1 to n), PNM(i, j) = 0 (i = 1 to n, j = 1 to n)
2. for Xj ∈ X, j = 1 to n, do
3.   for Xi ∈ X, i = 1 to j, i ≠ j, Z = X \ {Xi, Xj}, do
4.     calculate the partial correlation ρ(Xi, Xj |Z)
5.     if abs(ρ(Xi, Xj |Z)) >= k then PN(Xj) = PN(Xj) ∪ {Xi}, PNM(i, j) = 1
6.     else PNM(i, j) = 0
7.   end for
8. end for
9. return PN and PNM
Note that calculating the correlation coefficient of two variables needs 3 vector in-
ner products, and the correlation coefficient matrix has n2 elements, so calculating
the correlation coefficient matrix requires 3n2 inner products; for m cases, the com-
putational complexity of an inner product of length-m vectors is O(m), thus the computational
complexity of calculating the correlation coefficient matrix R is O(3mn2 ). We
know that the computation of the inverse matrix and matrix multiplication are
of equal complexity, so the computational complexity of calculating the inverse of R (n × n)
is at most O(n3 ). We can conclude that the total time complexity of
the Restrict step is O(3mn2 + n3 ).
3 Experimental Results
We first evaluate the performance of the PCB algorithm under the above two
cases, with different sample sizes, thresholds and networks. Fig. 2 shows the re-
sults under SEM (2). The X axis denotes the networks and the Y axis denotes the number of
structural errors. The results of SEM (1) are omitted for space reasons. From
Fig. 2, we can see that the threshold has a great effect on the performance of the PCB
algorithm. The results of the different SEMs are similar. When the dataset size is
1 http://www.cs.huji.ac.il/labs/compbio/Repository
Fig. 2. Structural errors of PCB algorithm. X axis denotes networks: 1.alarm 2.barley
3.carpo 4.factors 5.hailfinder 6.insurance 7.mildew 8.water. Y axis denotes the number
of structural errors. With different sample sizes(1000, 5000, 10000, 20000, 100000),
thresholds(0, 0.1, 0.3, m (m is the mean) ) and networks(the above 8 networks), PCB
algorithm has been tested.
small (1000, 5000), PCB (0.1) has the fewest structural errors on average; as
the dataset size gets larger, PCB(m) and PCB(0.1) have similar performance.
So when the threshold is 0.1, the PCB algorithm achieves the best performance and
has the fewest structural errors on average on almost all the networks. Zero par-
tial correlation is not the best choice for the CI test, for zero partial correlation
means independence; however, relevance comes in different degrees, such as strong rel-
evance and weak relevance. The threshold is hard to select, and it may depend
on the adopted networks. We have done a series of extensive experiments and
found the best threshold on average.
The second experiment compares existing structure learning methods
with the PCB algorithm. We adopt DAG, SC(5), SC(10), L1MB, PC(0.05), TPDA(0.05),
and PCB(0.1). PCB(0.1) means running DAG-Search after PCB pruning,
DAG means running DAG-Search without pruning, SC(5) and SC(10)
mean running DAG-Search after SC pruning (where we set the fan-in bound to
5 and 10), and L1MB means running DAG-Search after L1MB pruning. For DAG, SC(5),
SC(10), and L1MB, we use Murphy's DAGsearch implementation in the DAGLearn
software2. For PC(0.05) and TPDA(0.05), we used "causal explorer"3.
Fig. 3 shows the structural errors and time performance on the above networks
under SEM (2) for the seven algorithms. The results of SEM (1) are omitted
because of space. We give detailed analyses as follows.
(1) PCB (0.1) algorithm vs. DAG algorithm. DAG has worse performance on all
the networks. The PCB algorithm achieves higher accuracy on all the networks under all
2 http://people.cs.ubc.ca/~murphyk/
3 http://www.dsl-lab.org/
Fig. 3. Structural errors and run times under SEM(2): (a) structural errors with different sample sizes and networks of the seven algorithms under SEM(2); (b) run time with different sample sizes and networks of the seven algorithms under SEM(2). Under SEM(2), the 7 algorithms (DAG, SC(5), SC(10), L1MB, PCB(0.1), PC(0.05), TPDA(0.05)) have been tested with different sample sizes (1000, 5000, 10000, 20000, 100000) and networks (1. alarm, 2. barley, 3. carpo, 4. factors, 5. hailfinder, 6. insurance, 7. mildew, 8. water).
For time performance, PCB (0.1) wins 5, ties 2, and loses 1 under SEM (1), and wins 5, ties 3, and loses 0 under SEM (2); the results under the two SEMs are similar. For DAG, the potential neighbors of each variable are all the other variables. In the Search step, because we set the maximum number of iterations to 2500, which may be too small, the Search step may terminate before finding the best DAG, so the structural errors are higher. Without the pruning step, the time performance of the DAG algorithm is also worse than that of PCB (0.1). The reason is as follows: the Search step examines the change in score for each possible move; without pruning, the number of potential neighbors of each variable is large, so the number of moves under consideration is also large and the cost of the Search step is higher.
(2) PCB (0.1) algorithm vs. SC(5) and SC(10). The PCB algorithm achieves both better time performance and higher accuracy on almost all the networks under all the SEMs. The SC algorithm requires the maximum fan-in to be specified in advance; however, some nodes in the true structure may have much higher connectivity than others, so a common parameter for all nodes is not reasonable. In addition, the SC algorithm of the DAGLearn software selects the top k (maximum fan-in) candidate neighbors based on the correlation coefficient and does not consider the symmetry of the correlation coefficient; this leads to redundant information among the potential neighbors and sacrifices either efficiency or performance. The PCB algorithm does not have these problems. From Section 2.3 we know that the computational complexity of calculating the correlation coefficient matrix is O(3mn²); in order to select the top k candidate neighbors, each row of the correlation coefficient matrix must be sorted, with complexity O(n³), so the total complexity is O(3mn² + n³), which equals that of the PCB Restrict step (O(3mn² + n³)). However, the total time performance of the SC algorithm is worse than that of PCB (0.1): due to the unreasonable selection of potential neighbors and the redundant information among them, the cost of the search step is increased. So the SC algorithm has worse time performance and accuracy.
(3) PCB (0.1) algorithm vs. L1MB. The PCB algorithm achieves both better time performance and higher accuracy on all the networks under all the SEMs. The L1MB algorithm adopts the LARS algorithm to select potential neighbors. For a variable, L1MB selects the set of variables that has the best predictive accuracy as a whole; that is, L1MB evaluates the effect of a set of variables, not of a single variable. This way of selecting potential neighbors has some shortcomings: it can describe the correlation between a set of variables and a variable, but not the correlation between two variables, and there may exist variables that do not belong to the selected set of potential neighbors but have strong relevance with the target variable. In contrast, the partial correlation method can reveal the true correlation between two variables by eliminating the influence of other correlated variables. The PCB algorithm selects potential neighbors based on partial correlation and evaluates the effect of a single variable. So the PCB algorithm is more reasonable, and the experimental results also indicate that the PCB algorithm has fewer structural errors.
PCB (0.1) also has better time performance than L1MB. From the analysis above, we know that the time complexity of PCB is O(3mn² + n³) (n is the number of variables and m is the number of cases). For L1MB, the time complexity of computing the L1-regularization path is O(mn²) in the Gaussian case (SEM (1) and SEM (2)) [7]. In addition, L1MB also includes computing the maximum likelihood parameters for all non-zero sets of variables encountered along this path and selecting the set of variables that achieves the highest MDL score. So L1MB has worse time performance than PCB (0.1) under all the SEMs.
(4) PCB (0.1) algorithm vs. PC (0.05) algorithm. The PCB (0.1) algorithm achieves both better time performance and higher accuracy on all the networks under all the SEMs. The PC (0.05) algorithm has been designed for discrete variables, or imposes restrictions on which variables may be continuous. PC first identifies the skeleton of a Bayesian network and then orients the edges; however, the PC algorithm may fail to orient some edges, and in our experiments we count such edges as wrong, so the PC algorithm has more structural errors. The PC algorithm needs O(n^(k+2)) CI tests, where k is the maximum degree of any node in the true structure [13]. The time complexity of a CI test is at least O(m), so the time complexity of the PC algorithm is O(mn^(k+2)), which is exponential in the worst case. The time complexity of PCB (0.1) is O(3mn² + n³). Obviously, the PC algorithm has worse time performance.
(5) PCB (0.1) algorithm vs. TPDA (0.05) algorithm. The PCB algorithm achieves both better time performance and higher accuracy on all the networks. TPDA has been designed for discrete variables, or imposes restrictions on which variables may be continuous, so the TPDA algorithm has more structural errors. TPDA requires at most O(n⁴) CI tests to discover the edges, and in some special cases it requires only O(n²) CI tests [1]. The time complexity of a CI test is at least O(m), so the time complexity of TPDA is O(mn⁴) or O(mn²). Compared with the PC algorithm, the TPDA algorithm has better time performance; however, compared with the PCB algorithm (O(3mn² + n³)), the time complexity of TPDA is still high. So PCB (0.1) has better time performance than TPDA (0.05).
Acknowledgement
The research has been supported by the 973 Program of China under award 2009CB326203, and the National Natural Science Foundation of China under awards 61073193 and 61070131. The authors are very grateful to the anonymous reviewers for their constructive comments and suggestions, which have led to an improved version of this paper.
References
1. Cheng, J., Greiner, R., Kelly, J., Bell, D.A., Liu, W.: Learning Bayesian networks
from data: An information-theory based approach. Doctoral Dissertation. Depart-
ment of Computing Science, University of Alberta and Faculty of Informatics,
University of Ulster, November 1 (2001)
2. Chickering, D.: Learning Bayesian networks is NP-Complete. In: AI/Stats V (1996)
3. Cooper, G.F., Herskovits, E.: A Bayesian method for the induction of probabilistic
networks from data. Machine Learning 9(4), 309–347 (1992)
4. Friedman, N., Nachman, I., Peer, D.: Learning Bayesian network structure from
massive datasets: The ”sparse candidate” algorithm. In: UAI (1999)
5. Kleijnen, J.P.C., Helton, J.C.: Statistical analyses of scatterplots to identify important factors in large-scale simulations, 1: Review and comparison of techniques. Reliability Engineering and System Safety 65, 147–185 (1999)
6. Lam, W., Bacchus, F.: Learning Bayesian belief networks: An approach based on
the MDL principle. Comp. Int. 10, 269–293 (1994)
7. Schmidt, M., Niculescu-Mizil, A., Murphy, K.: Learning Graphical Model Struc-
ture Using L1-Regularization Paths. In: Proceedings of Association for the Ad-
vancement of Artificial Intelligence (AAAI), pp. 1278–1283 (2007)
8. Ogawa, T., Shimada, M., Ishida, H.: Relation of stiffness parameter b to carotid
arteriosclerosis and silent cerebral infarction in patients on chronic hemodialysis.
Int. Urol. Nephrol. 41, 739–745 (2009)
9. Pellet, J.P., Elisseeff, A.: Partial Correlation and Regression-Based Approaches to
Causal Structure Learning, IBM Research Technical Report (2007)
10. Pellet, J.P., Elisseeff, A.: Using Markov Blankets for Causal Structure Learning.
Journal of Machine Learning Research 9, 1295–1342 (2008)
11. Rissanen, J.: Stochastic complexity. Journal of the Royal Statistical Society, Series
B 49, 223–239 (1987)
12. Scheines, R., Spirtes, P., Glymour, C., Meek, C., Richardson, T.: The tetrad
project: Constraint based aids to causal model specification. Technical report,
Carnegie Mellon University, Dpt. of Philosophy (1995)
13. Spirtes, P., Glymour, C., Scheines, R.: Causation, prediction, and search, 2nd edn.
The MIT Press, Cambridge (2000)
14. Sun, Y., Negishi, M.: Measuring the relationships among university, industry and
other sectors in Japan’s national innovation system: a comparison of new ap-
proaches with mutual information indicators. Scientometrics 82, 677–685 (2010)
15. Tsamardinos, I., Brown, L., Aliferis, C.: The max-min hill-climbing bayesian net-
work structure learning algorithm. Machine Learning 65, 31–78 (2006)
16. Xu, G.R., Wan, W.X., Ning, B.Q.: Applying partial correlation method to analyz-
ing the correlation between ionospheric NmF2 and height of isobaric level in the
lower atmosphere. Chinese Science Bulletin 52(17), 2413–2419 (2007)
Predicting Friendship Links in Social Networks
Using a Topic Modeling Approach
Abstract. In recent years, the number of social network users has increased dramatically. The resulting amount of data associated with users of social networks has created great opportunities for data mining. One data mining problem of interest for social networks is the
friendship link prediction problem. Intuitively, a friendship link between
two users can be predicted based on their common friends and interests.
However, using user interests directly can be challenging, given the large
number of possible interests. In the past, approaches that make use of
an explicit user interest ontology have been proposed to tackle this prob-
lem, but the construction of the ontology proved to be computationally
expensive and the resulting ontology was not very useful. As an alterna-
tive, we propose a topic modeling approach to the problem of predicting
new friendships based on interests and existing friendships. Specifically,
we use Latent Dirichlet Allocation (LDA) to model user interests and,
thus, we create an implicit interest ontology. We construct features for
the link prediction problem based on the resulting topic distributions.
Experimental results on several LiveJournal data sets of varying sizes
show the usefulness of the LDA features for predicting friendships.
1 Introduction
Social networks such as MySpace, Facebook, Orkut, LiveJournal and Bebo have attracted millions of users [1], with some of these networks growing at a rate of more than 50 percent during the past year [2]. Recent statistics have suggested that
social networks have overtaken search engines in terms of usage [3]. This shows
how Internet users have integrated social networks into their daily practices.
Many social networks, including the LiveJournal online service [4], are focused on user interactions. Users in LiveJournal can tag other users as their friends.
In addition to tagging friends, users can also specify their demographics and
interests in this social network. We can see LiveJournal as a graph structure with
users (along with their specific information, e.g. user interests) corresponding to
nodes in the graph and edges corresponding to friendship links between the users.
In general, the graph corresponding to a social network is undirected. However,
in LiveJournal, the edges are directed, i.e., if a user ‘A’ specifies another user ‘B’ as a friend, then it is not necessary for user ‘B’ to have specified user ‘A’ as a friend. One
desirable feature of an online social network is to be able to suggest potential
friends to its users [8]. This task is known as the link prediction problem, where
the goal is to predict the existence of a friendship link from user ‘A’ to user ‘B’.
The large amounts of social network data accumulated in the recent years have
made the link prediction problem possible, although very challenging.
In this work, we aim at using the ability of machine learning algorithms to take
advantage of the content (data from user profiles) and graph structure of social
network sites, e.g., LiveJournal, to predict friendship links. User profiles in such
social networks consist of data that can be processed into useful information.
For example, interests specified by users of LiveJournal act as good indicators of whether two users can be friends or not. Thus, if two users ‘A’ and ‘B’ have similar interests, then there is a good chance that they can be friends. However,
the number of interests specified by users can be very large and similar interests
need to be grouped semantically. To achieve this, we use a topic modeling ap-
proach. Topic models provide an easy and efficient way of capturing semantics of
user interests by grouping them into categories, also known as topics, and thus
reducing the dimensionality of the problem. In addition to using user interests,
we also take advantage of the graph structure of the LiveJournal network and
extract graph information (e.g., mutual friends of two users) that is helpful for
predicting friendship links [9]. The contributions of this paper are as follows: (i)
an approach for applying topic modeling techniques, specifically LDA, on user
profile data in a social network; and (ii) experimental results on LiveJournal
datasets showing that a) the best performance results are obtained when information from interest topic modeling is combined with information from the network graph of the social network, and b) the performance of the proposed approach improves as the number of users in the social network increases.
The rest of the paper is organized as follows: We discuss related work in
Section 2. In Section 3, we review topic modeling techniques and Latent Dirichlet
Allocation (LDA). We provide a detailed description of our system’s architecture
in Section 4 and present the experimental design and results in Section 5. We
conclude the paper with a summary and discussion in Section 6.
2 Related Work
Over the past decade, social network sites have attracted many researchers, as they are sources of interesting data mining problems. Among such problems, the link
prediction problem has received a lot of attention in the social network domain
and also in other graph structured domains.
Hsu et al. [9] have considered the problems of predicting, classifying, and an-
notating friendship relations in a social network, based on the network struc-
ture and user profile data. Their experimental results suggest that features
constructed from the network graph and user profiles of LiveJournal can be
effectively used for predicting friendships. However, the interest features pro-
posed in [9] (specifically, counts of individual interests and the common interests
of two users) do not capture the semantics of the interests. As opposed to that,
in this work, we create an implicit interest ontology to identify the similarity be-
tween interests specified by users and use this information to predict unknown
links.
A framework for modeling link distributions, taking into account object fea-
tures and link features is also proposed in [5]. Link distributions describe the
neighborhood of links around an object and can capture correlations among links.
In this context, the authors have proposed an Iterative Classification Algorithm
(ICA) for link-based classification. This algorithm uses logistic regression models
over both links and content to capture the joint distributions of the links. The
authors have applied this approach on web and citation collections and reported
that using link distribution improved accuracy in both cases.
Taskar et al. [8] have studied the use of a relational Markov network (RMN)
framework for the task of link prediction. The RMN framework is used to define a
joint probabilistic model over the entire link graph, which includes the attributes
of the entities in the network as well as the links. This method is applied to
two relational datasets, one involving university web pages, and the other a
social network. The authors have reported that the RMN approach significantly
improves the accuracy of the classification task as compared to a flat model.
Castillo et al. [7] have also shown the importance of combining features
computed using the content of web documents and features extracted from the
corresponding hyperlink graph, for web spam detection. In their approach, sev-
eral link-based features (such as degree related measures) and various ranking
schemes are used together with content-based features such as corpus precision
and recall, query precision, etc. Experimental results on large public datasets of
web pages have shown that the system was accurate in detecting spam pages.
Caragea et al. [10], [11] have studied the usefulness of a user interest ontology
for predicting friendships, under the assumption that ontologies can provide a
crisp semantic organization of the user information available in social networks.
The authors have proposed several approaches to construct interest ontologies
over interests of LiveJournal users. They have reported that organizing user in-
terests in a hierarchy is indeed helpful for predicting links, but computationally
expensive in terms of both time and memory. Furthermore, the resulting ontolo-
gies are large, making it difficult to use concepts directly to construct features.
With the growth of data on the web, as new articles, web documents, social
networking sites and users are added daily, there is an increased need to ac-
curately process this data for extracting hidden patterns. Topic modeling tech-
niques are generative probabilistic models that have been successfully used to
identify inherent topics in collections of data. They have shown good perfor-
mance when used to predict word associations, or the effects of semantic as-
sociations on a variety of language-processing tasks [12], [13]. Latent Dirichlet
Allocation (LDA) [15] is one such generative probabilistic model used over dis-
crete data such as text corpora. LDA has been applied to many tasks such as
word sense disambiguation [16], named entity recognition [17], tag recommen-
dation [18], community recommendation [19], etc. In this work, we apply LDA
on user profile data with the goal of producing a reduced set of features that
capture user interests and improve the accuracy of the link prediction task in
social networks. To the best of our knowledge, LDA has not been used for this problem before.
3 Topic Models and Latent Dirichlet Allocation
Topic models [12], [13] provide a simple way to analyze and organize large vol-
umes of unlabeled text. They express semantic properties of words and docu-
ments in terms of probabilistic topics, which can be seen as latent structures
that capture semantic associations among words/documents in a corpus. Topic
models treat each document in a corpus as a distribution over topics and each
topic as a distribution over words. A topic model, in general, is a generative
model, i.e. it specifies a probabilistic way in which documents can be generated.
One such generative model is Latent Dirichlet Allocation, introduced by Blei
et al. [15]. LDA models a collection of discrete data such as text corpora. Fig-
ure 1 (adapted from [15]) illustrates a simplified graphical model representing
LDA. We assume that the corpus consists of M documents denoted by D = {d1, d2, . . . , dM}. Each document di in the corpus is defined as a sequence of Ni words denoted by di = (wi1, wi2, . . . , wiNi), where each word wij belongs to a vocabulary V. A word in a document di is generated by first choosing a topic zij according to a multinomial distribution and then choosing a word wij according to another multinomial distribution, conditioned on the topic zij. Formally, the
generative process of the LDA model can be described as follows [15]:
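A compact statement of that process (paraphrasing [15]; the document-length variable of the original model is omitted here) is: (1) for each document di, draw a topic distribution θi from a Dirichlet prior with parameter α; (2) for each word position j in di, draw a topic zij from a multinomial distribution governed by θi; (3) draw the word wij from the word distribution of topic zij, which is governed by the corpus-level parameter β.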
From Figure 1, we can see that the LDA model has a three level representation.
The parameters α and β are corpus level parameters, in the sense that they
are assumed to be sampled once in the process of generating a corpus. The
variables θi are document-level variables sampled once per document and the
variables zij and wij are at the word level. These variables will be sampled once
for each word in each document. For the work in this paper, we have used the
LDA implementation available in MALLET, A Machine Learning for Language
Toolkit [20]. MALLET uses Gibbs sampling for parameter estimation.
4 System Architecture
As can be seen in Figure 2, the architecture of the system that we have designed
is divided into two modules. The first module of the system is focused on iden-
tifying and extracting features from the interests expressed by each user of the
LiveJournal. These features are referred to as interest based features. The second
module uses the graph network (formed as a result of users tagging other users
in the network as ‘friends’) to calculate certain features which have been shown
to be helpful at the task of predicting friendship links in LiveJournal [9]. We
call these features graph-based features. We use both types of features as input
to learning algorithms (as shown in Section 5). Sections 4.1 and 4.2 describe in
detail the construction of interest based and graph based features, respectively.
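As an illustration of how interest-based features can be assembled for a user pair from the per-user topic distributions produced by the topic model, consider the following Python sketch; the exact combination used in the system is not detailed here, so the choices below (a per-topic product plus a cosine similarity) are only assumptions.

import numpy as np

def interest_pair_features(theta_a, theta_b):
    # theta_a, theta_b: length-T topic distributions of users A and B (e.g., from LDA).
    cosine = float(np.dot(theta_a, theta_b) /
                   (np.linalg.norm(theta_a) * np.linalg.norm(theta_b) + 1e-12))
    # per-topic agreement plus a single similarity summary
    return np.concatenate([theta_a * theta_b, [cosine]])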
Previous work by Hsu et al. [9] and Caragea et al. [10], [11], among others, has shown that the graph structure of the LiveJournal social network acts as a good source of information for predicting friendship links. In this work, we follow the
method described in [9] to construct graph-based features. For each user pair
(A, B) in the network graph, we calculate in-degree of ‘A’, in-degree of ‘B’, out-
degree of ‘A’, out-degree of ‘B’, mutual friends of ‘A’ and ‘B’, backward deleted
distance from ‘B’ to ‘A’ (see [9] for detailed descriptions of these features).
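A sketch of these graph-based features in Python follows (using networkx; the backward deleted distance is omitted because its precise definition is given only in [9], and treating mutual friends as common out-neighbors is an assumption).

import networkx as nx

def graph_pair_features(G, a, b):
    # G: directed friendship graph; a, b: user identifiers.
    mutual = len(set(G.successors(a)) & set(G.successors(b)))
    return {
        "in_degree_a": G.in_degree(a),
        "in_degree_b": G.in_degree(b),
        "out_degree_a": G.out_degree(a),
        "out_degree_b": G.out_degree(b),
        "mutual_friends": mutual,
    }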
This section describes the dataset used in this work and the experiments de-
signed to evaluate our approach of using LDA for the link prediction task. We
have conducted various experiments with several classifiers to investigate their
performance at predicting friendship links between the users of LiveJournal.
5.2 Experiments
The following experiments have been performed in this work.
1. Experiment 1: In the first experiment, we test the performance of several
predictive models trained on interest features constructed from topic distri-
butions. The number of topics to be modeled is varied from 20 to 200. The
1000 user dataset described above is used in this experiment.
2. Experiment 2: In the second experiment, we test several predictive models
that are trained on graph features, for the 1000 user dataset. To be able to
construct the graph features for test data, we assume that a certain per-
centage of links is known [8] (note that this is a realistic assumption, as it
is expected that some friends are already known for each user). Specifically,
we explore scenarios where 10%, 25% and 50% links are known, respectively.
Thus, we construct features for the unknown links using the known links.
3. Experiment 3: In the third experiment, graph based features are used
in combination with interest-based features to see if they can improve the
performance of the models trained with graph features only on the 1000 user
dataset. For the test set, graph features constructed by assuming 10%, 25% and 50% known links, respectively, are combined with interest features.
We repeat the above mentioned experiments for the 5000 user dataset. The
corresponding experiments are referred to as Experiment 4, Experiment 5
and Experiment 6, respectively. For the 10,000 user dataset, we build predictive
models using just interest based features (construction of graph features for the
10,000 user dataset was computationally infeasible, given our resources). This
experiment is referred to as Experiment 7. We use results from Experiments
1, 4 and 7 to study the performance and the scalability of the LDA approach
to link prediction based on interests, as the number of users increases. For all
the experiments, we used WEKA implementations of the Logistic Regression,
Random Forest and Support Vector Machine (SVM) algorithms.
5.3 Results
Importance of the Interest Features for Predicting Friendship Links.
As mentioned above, several experiments have been conducted to test the use-
fulness of the topic modeling approach on user interests for the link prediction
problem in LiveJournal. As expected, interest features (i.e., topic distributions
obtained by modeling user interests) combined with graph features produced the
most accurate models for the prediction task. This can be seen from Tables 1
and 2. In both tables, we can see that interest+graph features with 50% known
links outperform interest or graph features alone in terms of AUC values1 , for
all three classifiers used. Interesting results can be seen in Table 2, where inter-
est features alone are better than graph features alone when only 10% links are
known, and sometimes also better than interest+graph features with 10% known links, thus showing the importance of the user profile data, captured by LDA,
for link prediction in social networks. Furthermore, a comparison between our
results and the results presented in [21], which uses an ontology-based approach
to construct interest features, shows that the LDA features are better than the
ontology features on the 1,000 user dataset. As another drawback, the ontology
based approach is not scalable (no more than 4,000 users could be used) [21].
Figure 3 depicts the AUC values obtained using interest, graph and inter-
est+graph features with Logistic Regression and SVM classifiers across all num-
bers of topics modeled for the 1,000 and 5,000 user datasets, respectively. We can
see that the AUC value obtained using interest+graph features is better than
the corresponding value obtained using graph features alone across all numbers
of topics, for all scenarios of known links, in the case of the 5000 user dataset.
This shows that the contribution of interest features increases with the number
of users. Also based on Figure 3, it is worth noting that the graphs do not show
significant variation with the number of topics used.
Performance of the Proposed Approach with the Number of Users.
In addition to studying the importance of the LDA interest features for the link
prediction task, we also study the performance and scalability of the approaches
considered in this work (i.e., graph-based versus LDA interest based, and com-
binations) as the number of users increases.
1 All AUC values reported are averaged over five different train and test datasets.
Table 1. AUC values for Logistic Regression (LR), Random Forests (RF) and Sup-
port Vector Machines (SVM) classifiers with interest, graph and interest+graph based
features for the 1,000 user dataset. k% links are known in the test set, where k is 10,
25 and 50, respectively. The known links are used to construct graph features.
Table 2. AUC values similar to those in Table 1, for the 5,000 user dataset.
We are interested in both a) the quality of the predictions that we get for the LiveJournal data as the number of users increases; and b) the time and memory requirements for each approach.
From Figure 4, we can see that the prediction performance (expressed in terms
of AUC values) is improved in the 5,000 user dataset as compared to the 1,000
user dataset, across all numbers of topics modeled. Similarly, the prediction
performance for the 10,000 user dataset is better than the performance for the
5,000 user dataset, for all topics from 20 to 200. One reason for better predictions
with more users in the dataset is that, when we add more users, we also add the
interests specified by the newly added users to the interest set on which topics
are modeled using LDA. Thus, we get better LDA probability estimates for the
topics associated with each user in the dataset, as compared to the estimates that
we had for a smaller set of data, and hence better prediction results. However,
as expected, both the amount of time it takes to compute features for the larger
dataset, as well as the memory required increase with the number of users in the
data set. The amount of time it took to construct features for the 10,000 user
dataset for all numbers of topics modeled in the experiments is around 14 hours
on a system with an Intel Core 2 Duo processor running at 3.16GHz and 20GB of
RAM. This time requirement is due to our complete graph assumption (which
results in feature construction for 10,000*10,000 user pairs in the case of a 10,000
user dataset) and can be relaxed if we relax the completeness assumption. Still
the LDA feature construction is more efficient than the construction of graph
features, which was not possible for the 10,000 user dataset used in our study.
Fig. 3. Graph of reported AUC values versus number of topics used for modeling, using
Logistic Regression and SVM classifiers, for the 1,000 user dataset (top-left and top-
right, respectively) and 5,000 user dataset (bottom-left and bottom-right, respectively)
Fig. 4. AUC values versus number of topics for LR (left) and SVM (right) classifiers
for the 1,000, 5,000 and 10,000 user datasets using interest-based features
6 Summary and Discussion
In this work, we have proposed an approach for predicting friendship links in social networks that models user interests using LDA. Experimental results suggest that the usefulness of the interest features
constructed using the LDA approach increases with an increase in the number
of users. Furthermore, the results suggest that the LDA based interest features
can help improve the prediction performance when used in combination with
graph features, in the case of the LiveJournal dataset. Although in some cases
the improvement in performance due to interest features is not very significant
compared with the performance when graph features alone are used, the fact that
computation of graph features becomes intractable for 10,000 users or beyond
emphasizes the importance of the LDA based approach.
However, while the proposed approach is effective and shows improvement in
performance as the number of users increases, it also suffers from some limita-
tions. First, adding more users to the dataset increases the memory and time
requirements. Thus, as part of the future work, we plan to take advantage of
the MapReduce framework to support distributed computing for large datasets.
Secondly, our approach takes into account only a static snapshot of the LiveJournal social network. Obviously, this assumption does not hold in the real world. Based
on user interactions in the social network, the graph might change rapidly due
to the addition of more users as well as friendship links. Also, users may change
their demographics and interests regularly. Our approach does not take into ac-
count such changes. Hence, the architecture of the proposed approach has to be
changed to accommodate the dynamic nature of a social network. We also spec-
ulate that the approach of modeling user profile data using LDA will be effective
for tasks such as citation recommendation in scientific document networks, iden-
tifying groups in online scientific communities based on their research/tasks and
recommending partners in internet dating, ideas that are left as future work.
References
1. Boyd, M.D., Ellison, B.N.: Social Network Sites: Definition, History, and Scholar-
ship. Journal of Computer-Mediated Communication 13 (2007)
2. comScore Press Release, http://www.comscore.com/Press Events/Press
Releases/2007/07/Social Networking Goes Globa
3. TechCrunch Report, http://eu.techcrunch.com/2010/06/08/report-social-
networks-overtake-search-engines-in-uk-should-google-be-worried
4. Fitzpatrick, B.: LiveJournal: Online Service, http://www.livejournal.com
5. Getoor, L., Lu, Q.: Link-based Classification. In: Twentieth International Conference on Machine Learning (ICML 2003), Washington DC (2003)
6. Na, J.C., Thet, T.T.: Effectiveness of web search results for genre and sentiment
classification. Journal of Information Science 35(6), 709–726 (2009)
7. Castillo, C., Donato, D., Gionis, A., Murdock, V., Silvestri, F.: Know your Neigh-
bors: Web Spam Detection using the web Topology. In: Proceedings of SIGIR 2007,
Amsterdam, Netherlands (2007)
8. Taskar, B., Wong, M., Abbeel, P., Koller, D.: Link Prediction in Relational Data.
In: Proc. of 17th Neural Information Processing Systems, NIPS (2003)
9. Hsu, H.W., Weninger, T., Paradesi, R.S.M., Lancaster, J.: Structural link analy-
sis from user profiles and friends networks: a feature construction approach. In:
Proceedings of International Conference on Weblogs and Social Media (ICWSM),
Boulder, CO, USA (2007)
10. Caragea, D., Bahirwani, V., Aljandal, W., Hsu, H.W.: Link Mining: Ontology-
Based Link Prediction in the LiveJournal Social Network. In: Proceedings of As-
sociation of the Advancement of Artificial Intelligence, pp. 192–196 (2009)
11. Haridas, M., Caragea, D.: Link Mining: Exploring Wikipedia and DMoz as Knowl-
edge Bases for Engineering a User Interests Hierarchy for Social Network Appli-
cations. In: Proceedings of the Confederated International Conferences on On the
Move to Meaningful Internet Systems: Part II, Portugal, pp. 1238–1245 (2009)
12. Steyvers, M., Griffiths, T.: Probabilistic Topic Models. In: Landauer, T., Mcna-
mara, D., Dennis, S., Kintsch, W. (eds.) Handbook of Latent Semantic Analysis.
Lawrence Erlbaum Associates, Mahwah (2007)
13. Steyvers, M., Griffiths, T., Tenenbaum, J.B.: Topics in Semantic Representation.
American Psychological Association 114(2), 211–244 (2007)
14. Steyvers, M., Griffiths, T.: Finding Scientific Topics. Proceedings of National
Academy of Sciences, U.S.A, 5228–5235 (2004)
15. Blei, D., Ng, Y.A., Jordan, I.M.: Latent Dirichlet Allocation. Journal of Machine
Learning Research 3, 993–1022 (2003)
16. Blei, D., Boyd-Graber, J., Zhu, X.: A Topic Model for Word Sense Disambigua-
tion. In: Proc. of the 2007 Joint Conf. on Empirical Methods in Natural Language
Processing and Comp. Natural Language Learning, pp. 1024–1033 (2007)
17. Guo, J., Xu, G., Cheng, X., Li, H.: Named Entity Recognition in Query. In: Pro-
ceedings of SIGIR 2009, Boston, USA (2009)
18. Krestel, R., Fankhauser, P., Nejdl, W.: Latent Dirichlet Allocation for Tag Recom-
mendation. In: Proceedings of RecSys 2009, New York, USA (2009)
19. Chen, W., Chu, J., Luan, J., Bai, H., Wang, Y., Chang, Y.E.: Collaborative Fil-
tering for Orkut Communities: Discovery of User Latent Behavior. In: Proceedings
of International World Wide Web Conference (2009)
20. McCallum, A.K.: MALLET: A Machine Learning for Language Toolkit (2002),
http://mallet.cs.umass.edu
21. Phanse, S.: Study on the Performance of Ontology Based Approaches to Link
Prediction in Social Networks as the Number of Users Increases. M.S. Thesis (2010)
Info-Cluster Based Regional Influence Analysis
in Social Networks
1 Introduction
Web-based social networks have attracted more and more research efforts in re-
cent years. In particular, community detection is one of the major directions in
social network analysis where a community can be simply defined as a group
of objects sharing some common properties. Nowadays, with the rapid devel-
opment of positioning techniques (eg., GPS), one can easily collect and share
his/her positions. Furthermore, with a large amount of shared positions or tra-
jectories, individuals expect to form their social network based on positions.
On the other hand, a social network, the graph of relationships and interac-
tions within a group of individuals, plays a fundamental role as a medium for
disseminating information, ideas, and influence among its members. Most peo-
ple consider the problem of how to maximize influence propagation in social networks.
2 Related Work
The success of large-scale online social network sites, such as Facebook and Twitter, has attracted a large number of researchers, many of whom focus on modeling the information diffusion patterns within social networks. Domingos and Richardson [4] were the first to study information influence in social networks; they used probabilistic theory to maximize influence in a social network. Kempe, Kleinberg and Tardos [8] were the first to formulate the problem as a discrete optimization problem; they proposed the independent cascade model, the weighted cascade model, and the linear threshold model. Chen et al. [2] collected a blog dataset to identify five features (namely the number of friends, popularity of participants, number of participants, time elapsed since the genesis of the cascade, and citing factor of the blog) that may play an important role in predicting blog cascade affinity, so as to identify the most easily influenced bloggers. However, since the influence cascade models are different, these works do not directly address the efficiency issue of the greedy algorithms for the cascade models studied in [1].
With the growth of the web and social networks, community mining (community detection) has become increasingly important. In a social network graph, a community has a high concentration of edges within a particular group of vertices and a low concentration of edges between these groups [6]. Galstyan and Musoyan [5] show that simple strategies that work well for homogeneous networks can be overly sub-optimal, and suggest simple modifications for improving performance by taking the community structure into account. Spatial clustering is the process of grouping a set of objects into classes or clusters so that objects within a cluster are close to each other, but are far away from objects in other clusters.
3 Frameworks
A large volume of work has been done on community discovery, as discussed above. Most of it, however, ignores the location information of individuals. Location information often plays a very important role in community formation and evolution, and therefore deserves attention. In this paper, we take the locations of individuals into consideration to guide the detection of Info-Clusters in communities. In this section, we first give our model for social networks with location information, and then we present the problem formulation. The framework of our solution is described in Section 3.3.
Fig. 1. The model of social networks with location information. In this example, there
are 12 individuals that belong to 5 locations. The individuals are connected with each
other through 17 edges.
S_capital^i ∪ S_influence^i ∪ S_know^i
According to the above definitions, the main work of the regional influence analysis consists of location clustering, community detection, and influence propagation based Info-Cluster detection, described in the following sections.
4 Algorithms
In this section, we describe two main algorithms, which aim to solve the clustering and the Info-Cluster detection based on influence propagation. Firstly, the K-Means clustering algorithm is used to cluster locations, and a community detection algorithm based on modularity maximization is used to cluster individuals. Secondly, the influence propagation based Info-Cluster detection (IPBICD) algorithm is presented in Section 4.2.
4.1 Clustering
K-Means [10] is one of the simplest unsupervised learning algorithms and is often employed to solve the clustering problem. In this paper, we adopt the K-Means method to cluster the locations of the social network. It should be noted that other clustering methods could also be used for our location clustering; we adopt K-Means here only to show the feasibility of our framework.
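A minimal sketch of this step follows (assuming each location is represented by a pair of coordinates; the coordinate representation actually used is not specified here).

from sklearn.cluster import KMeans

def cluster_locations(coords, k):
    # coords: (num_locations, 2) array of location coordinates; k: number of clusters.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)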
For community detection, the modularity is defined as Q = Σ_i (e_ii - a_i^2), where
– v, w are vertices in V;
– i represents the i-th community;
– c_v is the community to which vertex v is assigned;
– A_vw is an element of the adjacency matrix corresponding to G = (V, E_v);
– m = (1/2) Σ_{v,w} A_vw;
– k_v = Σ_u A_vu, where u is a vertex;
– e_ij = (1/2m) Σ_{v,w} A_vw δ(c_v, i) δ(c_w, j);
– a_i = (1/2m) Σ_v k_v δ(c_v, i);
– δ(x, y) = 1 if x = y, and 0 otherwise.
We start off with each vertex being a community that contains only one member. The process then consists of computing the changes in Q for all possible merges, choosing the largest of them, and performing the corresponding merge of communities.
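This agglomerative scheme corresponds to the greedy modularity maximization of Clauset, Newman and Moore [3]; an off-the-shelf implementation can serve as a sketch (the authors' own implementation may differ).

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def detect_communities(G):
    # G: undirected friendship graph. Returns a list of vertex sets, obtained by
    # repeatedly merging the pair of communities with the largest gain in Q.
    return [set(c) for c in greedy_modularity_communities(G)]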
           v1    v2    v3    v4    ...   vn
L(vi)      l1    l2    l3    l4    ...   ln
LC(vi)     1     2     1     2     ...   k
Com(vi)    1     1     2     2     ...   m
α1              α2               β1               β2               θ
(0 < α1 < 1)    (0 < α2 < α1)    (0 < β1 < α1)    (0 < β2 < β1)    (0 < θ < 1)
0.9             0.8              0.7              0.6              0.7670
0.8             0.6              0.6              0.4              0.6560
0.3             0.2              0.2              0.1              0.3544
[Figure: illustration of the Info-Cluster detection, showing active and inactive nodes within communities, the capital sets S_capital, the boundary between capital nodes and non-capital nodes, and the dividing lines between different communities.]
where:
– N(z): the set of neighbor individuals of the z-th individual;
– Num(N(z)): the number of neighbors of the z-th individual;
– Activenum(N(z)): the number of active individuals among the neighbors of the z-th individual.
Finally, according to Y(z) and θ, we can easily generate the Info-Clusters in our social network graph. Specifically, for each location cluster i, we first set all individuals from that location cluster to be active and add them to the Capital group S_capital^i. Then, for each inactive individual z we calculate its Y(z) and compare it with θ. If Y(z) > θ, we add z to the Influence group S_influence^i. If 0 < Y(z) ≤ θ, we add z to the Know group S_know^i. However, if Y(z) = 0, we add z to the Nothing group. At last, we merge the Capital group (S_capital^i), the Influence group (S_influence^i) and the Know group (S_know^i) into one Info-Cluster, and repeat the procedure for the next location cluster. The process of the Info-Cluster detection is shown in Algorithm 1.
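A simplified Python sketch of this assignment step follows; it assumes that Y(z) is the fraction of active neighbors of z, consistent with the definitions of Num and Activenum above, and it does not model the influence probabilities α and β, so it is only an approximation of Algorithm 1.

def info_cluster(neighbors, capital_nodes, theta):
    # neighbors: dict mapping each individual to the set of its neighbors.
    # capital_nodes: individuals of one location cluster, set active initially.
    active = set(capital_nodes)                 # Capital group
    influence, know = set(), set()
    for z in neighbors:
        if z in active or not neighbors[z]:
            continue
        y = sum(1 for v in neighbors[z] if v in active) / len(neighbors[z])
        if y > theta:
            influence.add(z)                    # Influence group
        elif y > 0:
            know.add(z)                         # Know group
        # y == 0: Nothing group, not part of the Info-Cluster
    return active | influence | know            # the Info-Cluster for this location cluster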
5 Experiments
In order to test the performance of our algorithm, we conduct an experiment on
real social networks. We first obtain the information of 5000 individuals from the
Renren friend network by crawling the Renren online web site (www.renren.com, which is similar to Facebook). After preprocessing, there are 2314 circle vertices, 1400 square vertices and 56456 edges in the final Renren data set. Each circle vertex denotes an individual registered on the Renren web site, while a square vertex represents the location of the corresponding individual, and each edge between circle vertices denotes the friendship of two individuals. We then run experiments on the Renren data set. In the experiments, we set three kinds of influence probabilities, which are shown in Table 2.
Fig. 3 illustrates the change of the Average Covering Rate (ACR) with different K. From this figure, we can see that, macroscopically, the Average Covering Rate decreases as K increases. One main reason may be that a higher K value often leads to fewer people in each location cluster; that is, the information propagates from fewer sources.
In order to study the relation between the covering rate and the number of sources, we randomly select some K values and then analyze the experimental results microscopically. Suppose that all the locations are grouped into 50 clusters (K = 50). Then we get 50 capitals, denoted by S_capital^i, i = 1, 2, . . . , K, as described in Definition 3. After the experiment, we get 50 Info-Clusters, each of which is composed of capital individuals, influence individuals and know individuals. Fig. 4 shows the change of the Covering Rate with the number of individuals of the 50 capitals.
[Fig. 3: Average Covering Rate (ACR) versus the number of clusters K, for the three parameter settings (α1, α2, β1, β2) = (0.9, 0.8, 0.7, 0.6), (0.8, 0.6, 0.6, 0.4) and (0.3, 0.2, 0.2, 0.1).]
According to Fig. 4, we find that more people as sources generally result in a higher covering rate as a whole, but this is not absolutely true. With the third parameter setting, the covering rate of 30 individuals is higher than that of 100 individuals, and for the other two parameter settings, the covering rate of 50 individuals is nearly equal to that of 150 individuals. This implies that the former individuals have stronger influential power than the latter ones. Even for the same number of individuals, the covering rates are often different. One such example is when the number of individuals is 20: from Fig. 4, we can see that there are two capitals composed of 20 individuals, and each reaches a different covering rate even with the same parameter settings.
Fig. 4. The change of Covering Rate with the number of individuals of 50 capitals
[Figures 5 and 6: the regional covering rates and the Influential Power (IP) of each capital set (cluster id 1 to 50), under the three parameter settings (a) (0.9, 0.8, 0.7, 0.6), (b) (0.8, 0.6, 0.6, 0.4) and (c) (0.3, 0.2, 0.2, 0.1).]
Fig. 5 shows the regional distribution of the covering rate. According to Fig. 5, we find that the information of the eastern Info-Clusters spreads more widely than that of the western ones. In particular, the region of Beijing has the highest covering rate, which may be attributed to the higher density of its population.
The influential power of each capital set is shown in Fig. 6. From this figure, we find that the last cluster (cluster id = 50) achieves the highest influential power under the parameter settings (0.9, 0.8, 0.7, 0.6) and (0.8, 0.6, 0.6, 0.4); therefore, the region that contains those individuals is an influential region. On the contrary, the 15th cluster (cluster id = 15) shows the lowest influential power, which means the region containing those individuals is a weakly influential region.
6 Conclusion
In this paper, we propose an innovative concept, the Info-Cluster. Based on information propagation, we then present a framework for identifying Info-Clusters that uses both community and location information. With the social network data set, we first adopt the K-Means algorithm to find location clusters. Next, we identify the communities of the whole network. Given the location clusters and communities, we present the information propagation based Info-Cluster detection algorithm (IPBICD). Experiments on the Renren data sets show that the Info-Clusters have many interesting characteristics. The identified Info-Clusters have many potential applications, such as analyzing and predicting the influential range of information or advertisements originating from a certain location.
References
1. Cha, M., Mislove, A., Gummadi, K.P.: A measurement-driven analysis of informa-
tion propagation in the flickr social network. In: WWW 2009: Proceedings of the
18th International Conference on World Wide Web, pp. 721–730. ACM, New York
(2009)
2. Chen, W., Wang, Y., Yang, S.: Efficient influence maximization in social networks.
In: KDD 2009: Proceedings of the 15th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 199–208. ACM, New York (2009)
3. Clauset, A., Newman, M.E.J., Moore, C.: Finding community structure in very
large networks. Physical Review E 70, 066111 (2004)
4. Domingos, P., Richardson, M.: Mining the network value of customers. In: KDD
2001: Proceedings of the Seventh ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 57–66. ACM, New York (2001)
5. Galstyan, A., Musoyan, V., Cohen, P.: Maximizing influence propagation in net-
works with community structure. Physical Review E 79(5), 56102 (2009)
6. Girvan, M., Newman, M.E.J.: Community structure in social and biological net-
works. Proceedings of the National Academy of Sciences of the United States of
America 99, 7821 (2002)
7. Han, J., Kamber, M., Tung, A.K.H.: Spatial clustering methods in data mining:
A survey. In: Geographic Data Mining and Knowledge Discovery, Research Mono-
graphs in GIS. Taylor & Francis, Abington (2001)
8. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through
a social network. In: KDD 2003: Proceedings of the Ninth ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining, pp. 137–146. ACM,
New York (2003)
9. Leskovec, J., Adamic, L.A., Huberman, B.A.: The dynamics of viral marketing.
ACM Trans. Web 1(1), 5 (2007)
10. MacQueen, J.: Some methods for classification and analysis of multivariate obser-
vations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics
and Probability, pp. 281–297. University of California Press, Berkeley (1967)
11. Watts, D.: A simple model of global cascades on random networks. Proceedings
of the National Academy of Sciences of the United States of America 99(9), 5766
(2002)
Utilizing Past Relations and User Similarities in
a Social Matching System
Richi Nayak
Abstract. Due to higher user expectations, more and more online matching companies adopt recommender systems with content-based, collaborative filtering or hybrid techniques. However, these techniques focus on users' explicit contact behaviors and ignore the implicit relationships among users in the network. This paper proposes a personalized social matching system for generating recommendations of potential partners that exploits not only users' explicit information but also implicit relationships among users. The proposed system is evaluated on a dataset collected from an online dating network. Empirical analysis shows that the recommendation success rate increases to 31%, compared to the baseline success rate of 19%.
1 Introduction
With the improved Web technology and increased Web popularity, users commonly use online social networks to contact new friends or 'alike' users. Similarly, people from various demographics have increased the customer base of online dating networks [9]. It is reported [1] that there are around 8 million singles in Australia and 54.32% of them use online dating services. Users of online dating services are overwhelmed by the number of choices returned by these services. The process of selecting the right partner among a vast number of candidates becomes tedious and nearly ineffective if an automatic selection process is not available. Therefore, a matching system that utilizes data mining to predict behaviors and attributes that could lead to successful matches becomes a necessity.
Recommendation systems have long existed to suggest to users a product according to their web visit histories or based on the product selections of other similar users [2], [7]. In most cases, the recommendation is an item recommendation, and the item is inanimate. On the contrary, the recommendation in dating networks is about people, who are animate. Different from item recommendation, people recommendation is a form of two-way matching, where a person can refuse an invitation but a product cannot refuse to be sold. In other words, a product does not choose the buyer, but dating service users can choose the dating
sending emails or chat invitations; and (4) Measure of relationships with other
users such as willingness to initialize relationships and responding to invita-
tions, and frequency and intensity with which all relationships are maintained.
A relationship is called successful for the purpose of match making when a user
initiates a pre-typed message as a token of interest and the target user sends
back a positive reply.
Let U be the set of m users in the network, U = {u1, . . . , um}. Let X be a user's personal profile consisting of a list of personal profile attributes, X = {x1, . . . , xn}, where each attribute xi is an item such as body type, dietary preferences, political persuasion and so on. Consider the list of a user's ideal partner profile attributes as a set Y = {y1, . . . , yn}, where each attribute yi is likewise an item such as body type, dietary preferences, political persuasion and so on. For a user uj, the value of xi is unary; however, yi can take multiple values. Let P = X + Y denote a user profile containing both the personal profile attributes and the partner preference attributes. The profile vector of a user is denoted P(uj). There can be many types of user activities in a network that can be used in the matching process. Some of the main activities are "viewing profiles", "initiating and/or responding to pre-defined messages (or kisses1)", "sending and/or receiving emails" and "buying stamps". Profile viewing is a one-sided interaction from the viewer's perspective; therefore it is hard to infer the viewer's interests from it. The "kiss" interactions are more promising as an effective way to reveal the distinct interests between two potential matches. A user is able to show his/her interest by sending a "kiss". The receiver can ignore the received "kiss" or return a positive or negative reply. When a receiver replies to a kiss with a positive predefined message, it is considered a "successful" or "positive" kiss. Otherwise, it is judged an "unsuccessful" or "negative" kiss.
the lack of relationship-based users for a seed pair. The similarity between ub and users in GrB, and the similarity between ua and users in GrA, are calculated to find "closer" members in terms of profile attributes. This step determines the users whose profiles match, since they are same-gender users. Each pair in (GrA, GrB) is also checked for compatibility using two-way matching. These three similarity scores are combined using a weighted linear strategy. Finally, a ranked list of potential partner matches from GrB is formed for each member of GrA.
match fits the preferences of user ub based on profile attribute xi. That is, does a potential match's stated value for an attribute fit the user's preference yi for that attribute? If the user has explicitly stated a preference for the attribute, then the measure is trivial. If the user has not explicitly stated a preference, then a preference can be inferred from the preferences of other members in the same age and gender group: though a user may not explicitly state their preference, it can be assumed that their preference is similar to that of others in the same age and gender group. The score then becomes the likelihood that a potential match ua meets the preferences of a user ub for the attribute xi.
CS_{x_i}(u_b, u_a) =
\begin{cases}
1, & x_i(u_a) \in y_i(u_b) \\
\dfrac{N(x_i(u_b)=x,\ x_i(u_a)\in y_i(u_b) \mid x_{Gender}(u_b), x_{Age}(u_b))}{N(x_i(u_b)=x \mid x_{Gender}(u_b), x_{Age}(u_b)) - N(x_i(u_b)=x,\ y_i(u_b)=\text{"Not Specified"} \mid x_{Gender}(u_b), x_{Age}(u_b))}, & y_i(u_b) = \phi \\
0, & \text{otherwise}
\end{cases}
(3)
where xi(ub) is user ub's profile value for attribute xi and yi(ub) is user ub's preferred match value for attribute xi. By the definition in the above equation, scores range from 0 to 1. The attribute cross-match score is moderated by a comparative measure of how important the attribute xi is to users within the same age and gender demographics as user ub. This measure, called the importance, is estimated from the frequency with which users within the same age band and gender as the user specify a preference for the attribute. This is done to reflect that not all attributes are equally important when selecting a potential match. Attributes such as height, body type, and having children are specified more frequently than attributes such as nationality, industry and having pets. If a user explicitly specifies a preference, then it is assumed the attribute is highly important to them (e.g., when a user makes an explicit religious preference). When it is not specified, a good proxy for the importance of the attribute is the complement of the proportion of users in the same age and gender group who did specify a preference for the attribute. Mathematically, this is defined by
I_{x_i}(u_b) =
\begin{cases}
1, & x_i(u_a) \in y_i(u_b) \\
1 - \dfrac{N(x_i(u_b)=x,\ y_i(u_b)=\text{"Not Specified"} \mid x_{Gender}(u_b), x_{Age}(u_b))}{N(x_i(u_b)=x \mid x_{Gender}(u_b), x_{Age}(u_b))}, & y_i(u_b) = \phi \\
0, & \text{otherwise}
\end{cases}
(4)
By the definition in this equation, scores range from 0 to 1. The attribute score
for xi between potential partners is calculated as follows:
Axi (ub , ua ) = Ixi (ub ) × CSxi (ub , ua ) (5)
Including the importance information upfront may simplify the task of training an optimisation model to map the attribute scores to a target variable. By reducing the complexity of the model, accuracy may be improved. An alternative to the importance measure would be to leave the weightings for an optimisation model to estimate. It is assumed that including the importance measure as part of the score calculations will assist training of the optimisation model.
Both the attribute match score and the importance are also calculated from the perspective of the potential match ua's preference towards user ub. Finally, a single attribute match score Mxi between two users for attribute xi is obtained as follows:
Mxi(ub, ua) = Mxi(ua, ub) = 1/2 (Axi(ub, ua) + Axi(ua, ub)) (6)
By combining the four measures per attribute into one cross-match score per attribute, the search space is reduced by three quarters.
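To make the per-attribute scoring of Eqs. (3)-(6) concrete, the following Python sketch shows one possible implementation. It is not the authors' code: `profiles`, `prefs`, and `peers` are hypothetical data structures (a user's own attribute values, the set of values they accept, and the user's age/gender peer group), an empty set encodes "Not Specified", and the explicit-preference case of the importance follows the prose above.

```python
# Minimal sketch (assumptions, not the paper's code) of Eqs. (3)-(6).
# profiles[u][x]: user u's own value for attribute x
# prefs[u][x]:    the set of values u accepts for x (empty set = "Not Specified")
# peers(u):       users in u's age/gender group

def cross_match(u_b, u_a, x, profiles, prefs, peers):
    """Eq. (3): does u_a's value for x fit u_b's stated or inferred preference?"""
    if profiles[u_a][x] in prefs[u_b].get(x, set()):
        return 1.0                                    # explicit preference satisfied
    if not prefs[u_b].get(x):                         # not specified: infer from the peer group
        group = [v for v in peers(u_b) if profiles[v][x] == profiles[u_b][x]]
        specified = [v for v in group if prefs[v].get(x)]
        if not specified:
            return 0.0
        hits = sum(profiles[u_a][x] in prefs[v][x] for v in specified)
        return hits / len(specified)                  # likelihood u_a meets the group's preference
    return 0.0

def importance(u_b, x, profiles, prefs, peers):
    """Eq. (4): weight of attribute x for u_b, proxied by how often the peer group specifies it."""
    if prefs[u_b].get(x):
        return 1.0                                    # explicitly stated, so maximally important
    group = [v for v in peers(u_b) if profiles[v][x] == profiles[u_b][x]]
    if not group:
        return 0.0
    not_specified = sum(1 for v in group if not prefs[v].get(x))
    return 1.0 - not_specified / len(group)

def attribute_match(u_b, u_a, x, profiles, prefs, peers):
    """Eqs. (5)-(6): symmetric per-attribute match score."""
    a_ba = importance(u_b, x, profiles, prefs, peers) * cross_match(u_b, u_a, x, profiles, prefs, peers)
    a_ab = importance(u_a, x, profiles, prefs, peers) * cross_match(u_a, u_b, x, profiles, prefs, peers)
    return 0.5 * (a_ba + a_ab)
```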
Finding user compatibility requires a measurement that allows different po-
tential matches to be compared. The measure should allow a user’s list of po-
tential matches to be ranked in order of “closest” matches. This is achieved by combining all attribute cross-match scores into a single match score.
M(ub , ua ) = [Mx1 (ub , ua ), . . . , Mxn (ub , ua )] (7)
The goal then becomes to intelligently summarise the vector M (ub , ua ) in a way
that increases the score for matches that are likely to lead to a relationship.
Technically, this becomes a search for an optimal mapping from a user ub and a potential match ua, based on their shared attribute cross-match vector M(ub, ua), to a target variable that represents a good match. We will call this target variable the compatibility score Comp(ub, ua) such that:
Comp(ub , ua ) = f (M(ub , ua )) (8)
Putting it all together. Once the three similarity scores, SimScore(ub , GrBj )
identifying profile similarity between the seed user and a potential match,
SimScore(ua , GrAi ) identifying profile similarity between the seed partner and
a recommendation object, and the compatibility score Comp(GrAi, GrBj) between a potential match pair (GrAi, GrBj), are obtained, these scores can be combined using a weighted linear strategy.
To determine the weight settings, a decision tree model was built using 300 unique seed users, 20 profile attributes and about 300,711 recommendations generated from the developed social matching system, along with an indicator of their success. The resulting decision tree showed that a higher percentage of positive kisses is produced when w1 ≥ 0.5, w2 ≥ 0.3 and w3 ≥ 0.2. Therefore w1, w2 and w3 are set to 0.5, 0.3 and 0.2, respectively. It is interesting to note the lower value of w3: it means that when two members are interested in each other, there is a high probability that both of them are similar to their respective ex-partners.
For each recommendation object GrAi, matching partners are ranked according to their Match(GrAi, GrBj) score, and the top-n partners from GrB become the potential matches of GrAi.
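A hedged sketch of this final combination step follows: the three similarity scores are blended with the weights reported above and candidates are ranked to pick the top-n. Function and variable names are illustrative, not from the paper.

```python
# Weighted linear combination and top-n ranking (illustrative sketch).
W1, W2, W3 = 0.5, 0.3, 0.2   # weights from the decision-tree analysis above

def match_score(sim_seed_user, sim_seed_partner, compatibility):
    return W1 * sim_seed_user + W2 * sim_seed_partner + W3 * compatibility

def top_n_matches(candidates, n=20):
    """candidates: list of (partner_id, sim_seed_user, sim_seed_partner, compatibility) tuples."""
    ranked = sorted(candidates,
                    key=lambda c: match_score(c[1], c[2], c[3]),
                    reverse=True)
    return [partner for partner, *_ in ranked[:n]]
```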
3 Empirical Analysis
The proposed method is tested with a dataset collected from a real-life online dating network. There were about 2 million users in the network. We used three months of data to generate and test the networks of relationship-based users and the recommendations. The activity that measures the relationship between two users in this research is the “kiss”. The number of positive kisses is used in testing the proposed social matching system. Figure 2 lists the details of the users and kisses in the network. A user who has logged on to the website during the chosen three-month period is called an “active” user. The seed users and relationship-based users come from this set of users. A kiss sender is called “successful” when the target user sends back a positive kiss reply. There are about 50 predefined messages (short texts of up to 150 characters) used in the dating network. These kiss messages are manually labelled as positive or negative, reflecting the user's interest towards another member. A large number of kisses in the network have never been replied to by the target users; these are called “null kisses”.
Fig. 2. User and kiss statistics for the chosen three-month period
It can be noted that each kiss sender receives about 4 kiss replies (both successful and negative) on average. It can also be seen that about 75% of kiss senders have received at least one positive kiss reply. The number of successful kisses is less than one fourth of the sum of negative and null kisses. A further kiss analysis shows a strong tendency for male members in the network to initiate the first activities, such as sending kisses (78.9% vs. 21.1%); they are defined as proactive-behaviour users in this paper. Female members, who are reactive-behaviour users, usually wait to receive kisses.
$$\text{Success Rate Improvement (SRI)} = \frac{\text{Success Rate (SR)}}{\text{Baseline Success Rate (BSR)}} \tag{12}$$

$$\text{Recall} = \frac{\text{Number of (Kissed Partners} \cap \text{Recommended Partners)}}{\text{Number of (Kissed Partners)}} \tag{13}$$
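For clarity, a small Python sketch of the evaluation measures in Eqs. (12)-(13) follows. It is our own illustration: the success rate is assumed here to be the fraction of sent kisses that receive a positive reply, and all names are hypothetical.

```python
# Illustrative sketch of SR, SRI and Recall (assumptions, not the paper's code).

def success_rate(positive_replies, kisses_sent):
    return positive_replies / kisses_sent if kisses_sent else 0.0

def success_rate_improvement(sr, baseline_sr=0.19):   # BSR of the network, from the text
    return sr / baseline_sr

def recall(kissed_partners, recommended_partners):
    kissed, recommended = set(kissed_partners), set(recommended_partners)
    return len(kissed & recommended) / len(kissed) if kissed else 0.0
```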
positive and 20 negative kiss responses per user were chosen. This created a
sample training set of about 144,430 records which were used to train SVM
models. The test dataset was created with 24498 records by randomly choosing
users. Ten-fold cross-validation experiments were performed, and the average performance is shown in Figure 3. The best performing SVM model was used in the proposed matching system.
Figure 4 shows that the Success Rate (SR) decreases as the number of potential matches (GrB) offered to a user in GrA is increased. This result confirms that the higher the total score generated by the proposed matching system, Match(GrAi, GrBj), the more relevant and accurate the matches. For example, users with the higher total scores in the top-5 recommendation list received the highest percentage of positive kiss replies. There are a number of null kiss replies in the dataset. A null kiss reply could eventually turn into either a positive or a negative kiss reply. If all null kiss replies were to turn into positive kiss replies, the success rate (SR) would reach 66% for the top-20 recommendations. The BSR of the underlying online dating network is 19%; Figure 2 shows that the proposed system is always better. This result indicates that the potential matches offered by the system interest the user (as shown by Figure 5), and the receivers also show high interest towards these users by sending positive kiss messages back, as shown by the SR in Figure 4 and the increased recall (Figure 6). However, it can be seen that the value of SRI decreases as the number of recommendations increases, as shown in Figure 5. This suggests that more matching recommendations attract user attention and trigger more kisses to be sent, but more recommendations also lead to lower-quality recommendations. When recommending potential matches, the user is more interested in examining a small set of recommendations than a long list of candidates. Based on all results, a high-quality top-20 recommendation list maximizes SRI without letting recall drop unacceptably.
Experiments have also been performed to determine which kind of users is more important for generating high-quality matches for the dating network: the similar users from clusters, or the relationship-based users? Two sets of experiments are performed.
– In the first setting, the size of GrA and GrB is fixed at 200. The usual size of GrA is about 30 to 50, populated with ex-partners. More similar users obtained from the respective clusters are added into these two groups, compared to relationship-based users.
– In the second setting, the difference between the two groups, Diff(#GrA, #GrB), is covered by adding members from the respective cluster. In addition, the size of GrA and GrB is increased by only 10% through clustering, to add new members and increase user coverage.
Results show that when more similar users rather than relationship-based users are added, the success rate improvement (SRI) is lower than when more relationship-based users are added to the current pairs. The SR and SRI obtained from the first setting are 0.19 and 1.0 respectively, whereas in the second setting SR and SRI are 0.29 and 1.4 respectively, considering all suggested matching pairs.
Fig. 4. Top-n success rate and success rate improvement
Fig. 5. Sender's interests prediction accuracy
[Fig. 6: recall for the top-n recommendation lists (top-5 to All)]
the matching pair quality by measuring similarity level against seed pairs and
relationship-based users, and compatibility between the matching pair. The de-
cision tree model is used for producing the weights for these similarity scores.
4 Conclusion
The proposed system gathers relationship-based users, forms relationship-based
users networks, explores the similarity level between relationship-based users and
seed users, explores the compatibility between potential partners and then make
partner recommendations in order to increase the likelihood of successful reply.
This innovative system combines the following three algorithms to generate the
potential partners: (1) An instance-based similarity algorithm for predicting sim-
ilarity between the seed users and relationship-based user that forms potential
high quality recommendation and reduces the number of users that the matching
system needs to be considered; (2) A K-means similar user checking algorithm
that helps to overcome the problems that the standard recommender techniques
usually suffer, including the absence of knowledge, the cold-start problem and
the sparse user data; and (3) A user compatibility algorithm that conducts the
two-way matching between users by utilising the SVM predictive data mining
algorithm. Empirical analysis show that the success rate has been improved from
the baseline results of 19% to 31% by using the proposed system.
On Sampling Type Distribution from Heterogeneous
Social Networks
Abstract. Social network analysis has drawn the attention of many researchers
recently. With the advance of communication technologies, the scale of social networks grows rapidly. To capture the characteristics of very large social networks,
graph sampling is an important approach that does not require visiting the en-
tire network. Prior studies on graph sampling focused on preserving the prop-
erties such as degree distribution and clustering coefficient of a homogeneous
graph, where each node and edge is treated equally. However, a node in a social
network usually has its own attribute indicating a specific group membership or
type. For example, people are of different races or nationalities. The link between
individuals from the same or different types can thus be classified into intra- and inter-connections. Therefore, it is important to ask whether a sampling method can preserve the node and link type distribution of a heterogeneous social network. In this paper, we formally address this issue. Moreover, we apply five algorithms to real Twitter data sets to evaluate their performance. The results show that respondent-driven sampling works well even when the sample size is small, while random node sampling works best only at large sample sizes.
1 Introduction
Social network analysis has drawn more and more attention from the data mining community in recent years. By modeling the social network as a graph structure, where a node is an individual and an edge represents the relationship between individuals, many studies have addressed graph mining techniques to discover interesting knowledge in social networks. With the advance of communication technologies and the explosion of social web applications such as Facebook and Twitter, the scale of the generated network data is usually very large. It is thus often infeasible to explore and store the entire network before extracting the characteristics of these social networks. Therefore, it is critical to develop an efficient and systematic approach to gathering data of an appropriate size while keeping the properties of the original network.
To scale down the network data to be processed, there are two possible strategies:
graph summarization and graph sampling. Graph summarization [1,2,3,4,5,6] aims to
condense the original graph in a more compact form. There are lossless methods, where
the original graph can be recovered from the summary graph, and loss-tolerant methods,
where some information may be lost during the summarization. To obtain the summary
graph, these methods usually need to examine the entire network first. On the other
hand, sampling is a way of data collection by selecting a subset of the original data. By
following some rules of sampling nodes and edges, a subgraph can be constructed with
the characteristics of the original graph preserved. In contrast to graph summarization,
a big advantage of sampling is that only a controlled number of nodes, instead of the
entire network, are visited. In this work, therefore, we focus on sampling from large social networks.
Prior studies on graph sampling [7,8], however, focused only on preserving statis-
tics such as degree distribution, hop-plot, and clustering coefficient on homogeneous
graphs, where each node and link is treated equally. In reality, the social network is
heterogeneous, where each individual has its own attribute indicating a specific group
membership or type. For example, people are of different races or nationalities. The
link between individuals of the same or different types can thus be classified into intra-connections and inter-connections. The type distribution of nodes and the proportion of intra/inter-connection links are also key information that should be preserved to understand the heterogeneous social network, which, to the best of our knowledge, has not yet
been addressed in the previous graph sampling works in the data mining community.
To this end, we propose two goals on the heterogeneous social network. First is the
type distribution preserving goal. Given a desired number of nodes, i.e., the sample size, a subgraph Gs is generated by some sampling method. The type distribution of Gs, Dist(Gs), is expected to be the same as that of the original graph G. The second goal is the intra-relationship preserving goal: we expect the ratio of the number of intra-connection edges to the total number of edges of Gs to be preserved.
In search of a better solution, we adopt five possible methods: Random Node Sampling (RNS), Random Edge Sampling (RES), Ego-Centric Exploration Sampling (ECE) [9], Multiple-Ego-Centric Sampling (MES), and Respondent-Driven Sampling (RDS) [10], to see their effects on sampling the type distribution of heterogeneous social networks. RNS and RES are two methods that select nodes and edges randomly until some criteria are met. ECE is a chain-referral-based sampling proposed in [9]. Chain-referral sampling usually starts from a node called the ego and selects neighbor nodes uniformly at random, wave by wave [9]. MES is a variation of ECE we designed in which the sampling starts with multiple initial egos. Finally, we adopt RDS, a sampling method used in social science for studying hidden populations [10]. Many works on social network analysis focus on the majority, i.e., the largest or second-largest connected components, of the network. However, sometimes a small or hidden group of a network carries more interesting information. For example, the population of drug users or patients with rare diseases is usually hidden and relatively small. Essentially, RDS is a method combining snowball sampling, in which future samples are recruited from acquaintances of the current subject, with a Markov Chain model to generate unbiased samples. In our implementation, we adopt RDS to simulate the human recruiting process and indicate how the Markov Chain is computed from the collected samples.
To evaluate the sampling quality of the above five methods, we conduct experi-
ments on the Twitter data sets provided in [11]. We measure the difference of the type
distribution between the sampling results and the original network by two indexes:
error ratio and D-statistic of Kolmogorov-Smirnov Test. In addition, we measure the
difference of the intra-connection percentage between the samples and the original network. The results show that RDS works best in terms of preserving the type distribution and the intra-connection percentage when the sample size is small. MES and ECE perform next best, with MES showing a small improvement over ECE in the node type distribution. Finally, the sampling quality of RNS and RES is less stable; RNS outperforms other methods only when the sample size is large.
The remainder of the paper is organized as follows. The related work is discussed
in Section 2. The problem statement is formally defined in Section 3. The detailed
implementation of the five sampling algorithms is described in Section 4. In Section 5,
we show the experimental results. Finally, the paper is concluded in Section 6.
2 Related Work
As the scale of social network data is getting very large, graph sampling is a useful
technique to collect a smaller subgraph, without visiting the entire original network,
but with some properties of the original network preserved. Krishnamurthy et al. [12]
found that a simple random node selection to a 30% sampling size is already able to
preserve some properties of an undirected graph. Leskovec and Faloutsos [7] provided
a survey on three major kinds of sampling methods on graphs: sampling by random
node selection, sampling by random edge selection and sampling by exploration. The
sampling quality of preserving many graph properties, such as degree distribution, hop-plot, and clustering coefficient, is examined. Moreover, they proposed a Forest Fire
sampling algorithm, which preserves the power law property of a network very well
during the sampling process. They concluded that there is no perfect sampling method
for preserving every property under any conditions. The sampling performance depends
on different criteria and graph structures. Hübler et al. [8] further proposed Metropolis
algorithms to obtain representative subgraphs. However, none of these sampling works considered the heterogeneous network, where each node may have its own attribute
indicating a specific group membership or type. The link (edge) may be a connection
between two nodes of the same or different types. The type distribution of nodes and
the proportion of intra/inter-connection links is also key information to understand the
heterogeneous social network, which, to the best of our knowledge, has not yet been
addressed in the previous graph sampling works in the data mining community.
To sample the type distribution in a heterogeneous network, we further introduce
the Respondent-Driven Sampling (RDS) proposed by Heckathorn [10]. RDS is a well-
known sampling approach for studying the hidden population, which combines snow-
ball sampling and the Markov Chain process to produce unbiased sampling results for
the hidden population. Furthermore, a newer estimator for the sampling results is de-
signed [13] based on the reciprocity model assumption, i.e., the number of edges from
group A to group B is equal to that from group B to group A in a directed graph with two
groups (an undirected graph naturally complies with this assumption). In real directed heterogeneous networks, however, this assumption does not usually hold. For example, the Twitter data sets we use in the experiments do not have the reciprocity property in the tweets among users. In our study, as a result, we apply and simulate the original RDS [10] to sample the type distribution in a large heterogeneous social network.
3 Problem Statement
Given a graph G = ⟨V, E⟩, V denotes a set of n vertices (nodes, individuals) vi and E is a set of m directed edges (links, relationships) ei. First, we define the heterogeneous graph which models the heterogeneous social network we are interested in.
Definition 1. A heterogeneous graph G with k types is a graph in which each node belongs to exactly one of k types. More specifically, given a finite set L = {L1, ..., Lk} denoting the k types, the type of each node vi is T(vi) = Li, where Li ∈ L. Suppose the number of vertices of G is n and the number of nodes belonging to type Li is Ni; then the condition $\sum_{i=1}^{k} N_i = n$ must hold. In other words, (nodes ∈ Li) ∩ (nodes ∈ Lj) = ∅, where i ≠ j.
With the above two definitions, our problem statements are presented as follows.
Problem 1. Type distribution preserving goal Given a desired number of nodes, i.e.,
the sample size, a subgraph Gs is generated by some sampling method. The type distri-
bution of Gs , Dist(Gs ), is expected to be the same as that of the original graph G. That
is, d(Dist(Gs ), Dist(G)) = 0, where d() denotes the difference between two distribu-
tions. In other words, the percentage of each Ni in Gs is expected to be the same as that
of G.
d(IR(Gs ), IR(G)) = 0.
On the other hand, the inter-relationship ratio is equal to 1 − IR(Gs), which is then also preserved.
An example is given to illustrate these two problems. Consider a social network that includes 180 nodes (n = 180) and 320 edges (m = 320). Suppose there are in total 3 groups (k = 3) containing 20, 100, and 60 people respectively. Thus, the type distribution of the network, Dist(G), is (0.11, 0.56, 0.33). Also suppose there are 200 intra-connection edges; the intra-connection ratio is thus 0.625. Our goal is to find a sampling method that best preserves the type distribution and the intra-connection ratio. Suppose that a subgraph Gs is sampled under the given 10% sampling rate, i.e., 18 nodes. If the numbers of nodes of groups 1, 2, and 3 are 5, 8, and 5, then the type distribution is (0.28, 0.44, 0.28). In addition, suppose there are 30 intra-connection edges out of 50 sampled edges; then the intra-connection ratio is 0.6. In the experiment section, we will provide several indexes to compute the difference between these distributions and ratios.
4 Sampling Algorithms
The five algorithms for sampling large heterogeneous networks to be described are di-
vided into three categories: random-based sampling, chain-referral sampling and indi-
rect inference sampling. For random-based sampling, we conduct Random Node
Sampling and Random Edge Sampling. The chain-referral sampling method includes
Ego-Centric Exploration Sampling and Multiple Ego-Centric-Exploration Sampling.
Finally, we adopt Respondent-Driven Sampling, an indirect sampling method that originated in the social sciences.
Random Node Sampling (RNS) is an intuitive procedure that selects a desired number of nodes uniformly at random from the given graph. RNS first picks a set of nodes into a list; it then constructs the vertex-induced subgraph by checking whether there are edges between the selected nodes in the original graph.
The logic of Random Edge Sampling (RES) is also intuitive and similar to RNS: edges are selected uniformly at random. Once an edge is selected during the sampling process, the two nodes it connects, head and tail, are also included. Note that if the node count exceeds the desired number when the latest edge is selected, one of the two nodes, say the head node, is excluded.
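The two random baselines can be sketched in a few lines of Python; this is an illustration under our own assumptions, not the paper's implementation, with the final-edge handling in RES keeping only one endpoint as described above.

```python
# Illustrative sketch of RNS and RES on a directed graph given as an edge list.
import random

def random_node_sampling(nodes, edges, sample_size):
    sampled = set(random.sample(list(nodes), sample_size))
    # vertex-induced subgraph: keep edges whose endpoints were both selected
    sub_edges = [(u, v) for (u, v) in edges if u in sampled and v in sampled]
    return sampled, sub_edges

def random_edge_sampling(edges, node_budget):
    sampled_nodes, sub_edges = set(), []
    pool = list(edges)
    random.shuffle(pool)
    for (head, tail) in pool:
        if len(sampled_nodes) >= node_budget:
            break
        if len(sampled_nodes | {head, tail}) > node_budget:
            sampled_nodes.add(tail)        # budget would overflow: keep only one endpoint
            break
        sampled_nodes |= {head, tail}
        sub_edges.append((head, tail))
    return sampled_nodes, sub_edges
```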
Although the chain-referral sampling algorithms can both produce a reasonably connected subgraph and preserve community structure, a “rich get richer” flavor is inherent in this family of sampling techniques.
$$
\begin{aligned}
E_1 + E_2 + \dots + E_k &= 1 \\
S_{1,1} E_1 + S_{2,1} E_2 + \dots + S_{k,1} E_k &= E_1 \\
S_{1,2} E_1 + S_{2,2} E_2 + \dots + S_{k,2} E_k &= E_2 \\
&\;\vdots \\
S_{1,k-1} E_1 + S_{2,k-1} E_2 + \dots + S_{k,k-1} E_k &= E_{k-1}
\end{aligned}
$$
For instance, if there are only two groups in a social network, Male (m) and Female (f), the solution is $E_m = \frac{S_{fm}}{1 - S_{mm} + S_{fm}}$ and $E_f = 1 - E_m$, thus providing the information about
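Assuming S is the k x k recruitment proportion matrix observed during sampling (S[i, j] being the fraction of group i's recruits that belong to group j), the system above can be solved with a single least-squares call. The sketch below is our own illustration, not code from [10].

```python
# Solve the RDS equilibrium system: stationarity S^T E = E plus sum(E) = 1.
import numpy as np

def equilibrium_proportions(S):
    k = S.shape[0]
    A = np.vstack([S.T - np.eye(k), np.ones((1, k))])   # (S^T - I) E = 0 and sum(E) = 1
    b = np.zeros(k + 1)
    b[-1] = 1.0
    E, *_ = np.linalg.lstsq(A, b, rcond=None)
    return E

# Two-group check against the closed form E_m = S_fm / (1 - S_mm + S_fm):
S = np.array([[0.7, 0.3],    # male -> male, male -> female (hypothetical numbers)
              [0.4, 0.6]])   # female -> male, female -> female
E = equilibrium_proportions(S)
print(E[0], 0.4 / (1 - 0.7 + 0.4))   # both are approximately 0.571
```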
5 Evaluation
In this section, we introduce the data sets we used and present our experimental results. We show the results of all five sampling methods for both the type distribution preserving goal and the intra-relationship preserving goal. We also discuss the effects on the statistics as the number of types and the sample size vary. The sampling probability p was set to 0.8 for ECE and MES based on the suggestion in [12]. The coupon
limit for RDS was set to 5. We implemented all algorithms using VC++ and ran on a PC
equipped with 2.66GHz dual CPUs and 2G memory. Moreover, we ran each experiment
200 times for each setting and computed the average to get a stable and valid result.
group count | characteristic          | group 1 | group 2 | group 3 | group 4 | group 5 | group 6 | group 7
7           | group ratio             | 0.24    | 0.246   | 0.149   | 0.196   | 0.142   | 0.023   | 0.004
7           | node count              | 97053   | 99177   | 60318   | 79290   | 57357   | 9206    | 1473
7           | intra-connection ratio  | 0.324   | 0.332   | 0.335   | 0.265   | 0.258   | 0.209   | 0.02
7           | intra-edge count        | 55943   | 53170   | 33132   | 38360   | 21558   | 2798    | 70
7           | edge count              | 185242  | 160094  | 98819   | 144920  | 83531   | 13378   | 3530
5           | group ratio             | 0.486   | 0.149   | 0.338   | 0.023   | 0.004   | —       | —
5           | node count              | 196230  | 60318   | 136647  | 9206    | 1473    | —       | —
5           | intra-connection ratio  | 0.574   | 0.335   | 0.381   | 0.209   | 0.02    | —       | —
5           | intra-edge count        | 198306  | 33132   | 86943   | 2798    | 70      | —       | —
5           | edge count              | 345336  | 98819   | 228451  | 13378   | 3530    | —       | —
3           | group ratio             | 0.509   | 0.153   | 0.381   | —       | —       | —       | —
3           | node count              | 205436  | 61791   | 136647  | —       | —       | —       | —
3           | intra-connection ratio  | 0.598   | 0.334   | 0.381   | —       | —       | —       | —
3           | intra-edge count        | 214351  | 34149   | 86943   | —       | —       | —       | —
3           | edge count              | 358714  | 102349  | 228451  | —       | —       | —       | —
the original graph G. First, the Error Ratio (ER) sums up the proportion differences over all types. It is defined as $ER = \frac{\sum_{i=1}^{k} |O(i) - E(i)|}{2 \cdot SN}$, where O(i) is the number of nodes of the i-th group in Gs, E(i) is the number of nodes the i-th group should theoretically have in the sampled graph according to type i's real proportion in G, and SN is the sample size. Another evaluation statistic is the D-statistic of the Kolmogorov-Smirnov test. We simply use it as an index rather than conducting a hypothesis test. The D-statistic, which measures the agreement between two distributions, is defined as $D = \sup_x |F'(x) - F(x)|$, where F'(x) is the (cumulative) type distribution of Gs and F(x) is that of G. ER provides a percentage-like form of the total error between the type distributions of Gs and G, whereas the D-statistic provides information about the cumulative errors within the structures of Gs and G.
For the intra-relationship preserving goal, we use the Intra-Relation Error (IRE) to measure the difference of the intra-relationship ratio between Gs and G. It is defined as $IRE = \left|\frac{I'}{m'} - \frac{I}{m}\right|$, where I' and I denote the numbers of intra-connection edges in Gs and G respectively, and m' and m are the total numbers of edges in Gs and G respectively.
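A hedged numpy sketch of the three indexes follows. Inputs are illustrative (per-group node counts for Gs and G, plus intra-connection and total edge counts), and the cumulative distributions for the D-statistic assume a fixed group ordering.

```python
# Illustrative sketch of ER, D-statistic and IRE (assumptions, not the paper's code).
import numpy as np

def error_ratio(sample_counts, original_counts):
    sample_counts = np.asarray(sample_counts, dtype=float)
    original_counts = np.asarray(original_counts, dtype=float)
    sn = sample_counts.sum()
    expected = original_counts / original_counts.sum() * sn   # E(i) per group
    return np.abs(sample_counts - expected).sum() / (2 * sn)

def d_statistic(sample_counts, original_counts):
    f_s = np.cumsum(sample_counts) / np.sum(sample_counts)    # cumulative type distribution of Gs
    f = np.cumsum(original_counts) / np.sum(original_counts)  # cumulative type distribution of G
    return np.max(np.abs(f_s - f))

def intra_relation_error(intra_s, m_s, intra, m):
    return abs(intra_s / m_s - intra / m)
```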
[Fig. 1(a)-(f): Error Ratio (left) and D-statistic (right) versus sample size (10 to 1e+006, log scale) for the 7-, 5-, and 3-group settings]
only at small sample sizes. This was because MES could avoid getting stuck in particular group members and thus provided more accurate results. When the sample size increased, ECE had a higher chance of traveling from group to group and thus provided results similar to MES. Finally, RNS and RES behaved very unstably and were sensitive to the sample size. When the sample size was very small, both methods produced poor results. However, they improved significantly and obtained the best sampling results when the sample size was large enough. In Fig. 1(b), we found similar behavior patterns for all sampling methods except for RDS at small sample sizes. This indicated that RDS relied heavily on the information provided by the recruitment matrix: when the sample size was very small, the recruitment matrix could not provide enough information for the Markov Chain process and thus produced worse results.
For the 5-group Twitter data set, the patterns of all five methods were similar to those of the 7-group Twitter data set, as shown in Fig. 1(c) and (d). This is also true for the 3-group setting, as shown in Fig. 1(e) and (f). Only at small sample sizes did the results show that the error decreased as the number of groups got smaller. We will further discuss these results in Section 5.5.
It is noted that the results were similar for both ER and the D-statistic. This was due to the properties of the Twitter data sets. Since ER is an index measuring the total error, it is sensitive to the performance on the largest or relatively large groups. On the other hand, the D-statistic measures the cumulative error, which in most cases is also encountered on the larger groups. For these reasons, we observed similar patterns between ER and the D-statistic.
Our second goal is to preserve the relationships among different groups in a network. Fig. 2 presents the experimental results for this goal. We found that RDS produced the best result even at small sample sizes. This indicates that the sampling phase of RDS not only provided the network information to the Markov Chain process, but also somewhat preserved the relationship information (different tie types). Still, its improvement slowed down when the sample size became very large. On the other hand, MES had slightly higher errors than ECE at small sample sizes. Since the original idea of MES is to avoid sampling bias from the chain-referral procedure in the type distribution, it did not consider the relationships among individuals (edges of the graph). However, we can observe the advantage of MES as the sample size increased. RES outperformed RNS since it is an edge-based random selection and thus had an advantage over the node-based random selection. Finally, RNS failed to describe the relationships among individuals at small sample sizes, since RNS tended to produce a set of unconnected nodes, especially when the network was sparse, which drove the intra-connection ratio towards 0. However, the situation changed as the sample size increased: because RNS performed a vertex-induced procedure after sampling enough nodes into the sample pool, this process included both in-edges and out-edges between two selected nodes. Therefore, more selected edges resulted in better performance on the intra-relationship preserving goal. We omitted the results of the 5-group data set due to the space limit; its IRE values were between those of the 7-group and 3-group settings.
5.5 Analysis on the Effects of the Number of Groups and the Sample Size
Here we provide some remarks on the performance for different numbers of groups. The sample size chosen here was 100. We only present ER in Fig. 3(a) and omit the results of the D-statistic since they had similar patterns. We found that both ER and the D-statistic were positively affected by the number of groups (k). This is reasonable, since the more groups exist in a social network, the more errors we observe, leading to lower accuracy. On the other hand, in Fig. 3(b), the number of groups k was almost independent of the intra-relationship error.
[Fig. 2: Intra-Relation Error versus sample size (10 to 1e+006) for the 7-group and 3-group settings]
[Fig. 3: (a) Error Ratio and (b) Intra-Relation Error for the 3-, 5-, and 7-group settings]
It is noted that since RNS cannot sample any edge in the small-sample setting, its IRE equals the IR of the original graph. As a result, the IRE of RNS in Fig. 3(b) differed significantly across the group settings.
One of the most important issues in the sampling problem is how big the sample size should be to obtain good enough results in terms of sampling accuracy on our two goals. According to all of our experimental results (Fig. 1 and Fig. 2), we conclude that 15% is a good choice. We found that when the sample size grew to more than 15% (around 60000 nodes) of the population, all statistics were below 0.05 no matter which sampling method was used. In other words, the sampling quality improved only marginally when the sample size was even larger (up to 50% in this study). Although the purposes and research targets are different, our finding is similar to that in [7], which reached a similar conclusion.
6 Conclusion
In this study, we proposed a novel and meaningful sampling problem on heterogeneous social networks, consisting of the type distribution preserving and the intra-relationship preserving problems, and applied five algorithms to this issue. In preserving the type distribution, we found that RDS was a good method, especially at small sample sizes. MES helped ECE a little at small sample sizes. In addition, the random-based methods were sensitive to the sample size and failed to provide reasonable results at small sample sizes. In preserving the link relationship goal, we reached a similar conclusion, with some differences as discussed. Furthermore, we discussed the results under different group settings. Finally, we provided a rule of thumb that a 15% sample size should be large enough for both the type distribution preserving and the intra-relationship preserving sampling problems.
References
1. Navlakha, S., Rastogi, R., Shrivastava, N.: Graph summarization with bounded error. In:
Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 419–432 (2008)
2. Gibson, D., Kumar, R., Tomkins, A.: Discovering large dense subgraphs in massive graphs.
In: Proc. of Int. Conf. on Very Large Data Bases, p. 732 (2005)
3. Raghavan, S., Garcia-Molina, H.: Representing web graphs. In: Proc. of IEEE Int. Conf. on
Data Engineering, pp. 405–416 (2003)
4. Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Extracting large-scale knowledge
bases from the web. In: Proc. of Int. Conf. on Very Large Data Bases, pp. 639–650 (1999)
5. Li, C.T., Lin, S.D.: Egocentric Information Abstraction for Heterogeneous Social Networks.
In: Proc. of Int. Conf. on Advances in Social Network Analysis and Mining, pp. 255–260
(2009)
6. Tian, Y., Hankins, R., Patel, J.: Efficient aggregation for graph summarization. In: Proc. of
ACM SIGMOD Int. Conf. on Management of Data, pp. 567–580 (2008)
7. Leskovec, J., Faloutsos, C.: Sampling from large graphs. In: Proc. of ACM SIGKDD Int.
Conf. on Knowledge Discovery and Data Mining, p. 636 (2006)
8. Hübler, C., Kriegel, H., Borgwardt, K., Ghahramani, Z.: Metropolis algorithms for represen-
tative subgraph sampling. In: Proc. of IEEE Int. Conf. on Data Mining, pp. 283–292 (2008)
9. Ma, H., Gustafson, S., Moitra, A., Bracewell, D.: Ego-centric Network Sampling in Viral
Marketing Applications. In: Int. Conf. on Computational Science and Engineering, pp. 777–
781 (2009)
10. Heckathorn, D.: Respondent-driven sampling: a new approach to the study of hidden popu-
lations. Social problems 44, 174–199 (1997)
11. Choudhury, M.D.: Social datasets by munmun de choudhury (2010),
http://www.public.asu.edu/~mdechoud/datasets.html
12. Krishnamurthy, V., Faloutsos, M., Chrobak, M., Lao, L., Cui, J.-H., Percus, A.G.: Re-
ducing large internet topologies for faster simulations. In: Boutaba, R., Almeroth, K.C.,
Puigjaner, R., Shen, S., Black, J.P. (eds.) NETWORKING 2005. LNCS, vol. 3462, pp. 328–
341. Springer, Heidelberg (2005)
13. Heckathorn, D.: Respondent-driven sampling II: deriving valid population estimates from
chain-referral samples of hidden populations. Social Problems 49, 11–34 (2002)
14. Lovász, L.: Random walks on graphs: A survey. Combinatorics, Paul Erdos is Eighty 2, 1–46
(1993)
15. Kemeny, J.G., Snell, J.L.: Finite Markov Chains, pp. 69–72. Springer, Heidelberg (1960)
Ant Colony Optimization with Markov Random
Walk for Community Detection in Graphs
1 Introduction
Many complex systems in the real world exist in the form of networks, such as
social networks, biological networks, Web networks, etc., which are also often
classified as complex networks. Complex network analysis has been one of the
most popular research areas in recent years due to its applicability to a wide
range of disciplines [1,2,3]. While a considerable body of work addressed basic
statistical properties of complex networks such as the existence of “small world
effects” and the presence of “power laws” in the link distribution, another prop-
erty that has attracted particular attention is that of “community structure”:
nodes in a network are often found to cluster into tightly-knit groups with a
high density of within-group edges and a lower density of between-group edges
[3]. Thus, the goal of network clustering algorithms is to uncover the underlying
community structure in given complex networks.
2 Algorithm
Let N = (V, E) denote a network, where V is the set of nodes (or vertices)
and E is the set of edges (or links). Let a k-way partition of the network be defined as π = {N1, N2, ..., Nk}, where N1, N2, ..., Nk satisfy $\bigcup_{1 \le i \le k} N_i = N$ and $\bigcap_{1 \le i \le k} N_i = \emptyset$. If partition π has the property that within-community edges are dense and between-community edges are sparse, it is called a well-defined community structure of this network.
In a network, let pij be the probability that an agent freely walks from a node i to its neighbor node j within one step; this is also called the transition probability of a random walk. In terms of the adjacency matrix of N, A = (aij)n×n, pij is defined by
$$p_{ij} = \frac{a_{ij}}{\sum_{r} a_{ir}}. \tag{1}$$
Let D = diag(d1, ..., dn), where $d_i = \sum_j a_{ij}$ denotes the degree of node i. Let P be the transition probability matrix of the random walk; we have
$$P = D^{-1} A. \tag{2}$$
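As a small illustration (not from the paper), Eqs. (1)-(2) amount to row-normalising the adjacency matrix; the sketch below assumes a graph with no isolated nodes.

```python
# Row-normalise an adjacency matrix to get the random-walk transition matrix (Eqs. (1)-(2)).
import numpy as np

def transition_matrix(A):
    degrees = A.sum(axis=1)            # d_i = sum_j a_ij (assumes no isolated nodes)
    return A / degrees[:, None]        # P = D^{-1} A, i.e. p_ij = a_ij / sum_r a_ir

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
P = transition_matrix(A)               # each row of P sums to 1
```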
From the view of a Markov random walk, when a complex network has community structure, a random walk agent should find it difficult to move outside its own community boundary, whereas it should be easy for it to reach other nodes within its community, as link density within a community should be high by definition. In other words, the probability of remaining in the same community, that is, the probability that an agent starting from any node stays in its own community after freely walking a number of steps, should be greater than that of moving out to a different community.
For the reason above, in this ant colony optimization (ACO) strategy, each ant (differing from an agent only in that it can consult and update a “pheromone” variable on each link) takes the transition probability of the random walk as a heuristic rule and is directed by pheromone to find its solution. At each
iteration, the solution found by each ant only expresses its local view, but one
can derive a global solution when all of the ants’ solutions are aggregated to one
through a clustering ensemble, which will be used to update the pheromone ma-
trix. As the process evolves, the cluster characteristic of the pheromone matrix
will gradually become sharper and algorithm ACOMRW converges to a solu-
tion where the community structure can be accurately detected. In short, the
pheromone matrix can be regarded as the final clustering result which aggregates
the information of all the ants at all iterations in this algorithm.
In order to further clarify the above idea, an intuitive description is presented as follows. Given a network N that has community structure, one lets some ants crawl freely along the links of the network. The ants have a given life-cycle, and a new ant colony is generated immediately when all of the former ants die. At the beginning of the algorithm, there is as yet no impact from pheromone on network N. Solely because of the restriction imposed by the community structure, an ant's probability of remaining in its own community is greater than that of going out to other communities, but at this point there is no difference between these ants and random walk agents, since the pheromone distribution is still homogeneous. As the ants move, with the accumulation and volatilization of the pheromone left by former ants, the pheromone on within-community links becomes thicker and thicker, while the pheromone on between-community links becomes thinner and thinner. In fact, pheromone is simply a mechanism that registers past walks in the network and leads to more informed decisions for subsequent walks. This strengthens the trend that any ant will more often remain in its own community. At last, when the pheromone matrix converges, the clustering result of network N is obtained naturally. In a word, the idea behind ACOMRW is that, by strengthening within-community links and weakening between-community links, the underlying community structure of the network will gradually become visible.
$$m_{ij} = \frac{b_{ij}\, p_{ij}}{\sum_{r} b_{ir}\, p_{ir}}. \tag{3}$$
Consider the Markov dynamics of each ant above. Let the start position of an ant be node s, let the step number limit be l, and let $V_s^t$ denote the t-step (t ≤ l) transition probability distribution of the ant, in which $V_s^t(j)$ denotes the probability that this ant walks from node s to node j within t steps. Initially, $V_s^0 = (0, \dots, 0, 1, 0, \dots, 0)$, where $V_s^0(s) = 1$. If we also consider the influence of the power-law degree distribution in complex networks, directed by matrix M, $V_s^t$ is given by
$$V_s^t = V_s^{t-1} M^T. \tag{4}$$
In this algorithm, all the ants take the transition probability of random walk as
heuristic rule, and are directed by pheromone at the same time. Thus, as the
link density within a community is, in general, much higher than that between
communities, an ant that starts from the source node s should have more paths
to choose from to reach the nodes in its own community within l steps, where
the value of l can’t be too large. On the contrary, the ant should have much
lower probability to arrive the nodes outside its community. In other words, it
will be hard for an ant that falls on a community to pass those “bottleneck” links
and leave the existing community. Furthermore, with the evolution of algorithm
ACOMRW, the pheromone on within-community links will become thicker and
thicker, and the pheromone on between-community links will become thinner
and thinner. This makes the trend that any ant remains in its own community
more and more obvious.
Here we define Eq. (5), where Cs denotes the community in which node s is situated. More formally, Eq. (5) should be satisfied better and better with the evolution of the pheromone matrix. When algorithm ACOMRW finally converges, Eq. (5) will be completely satisfied, and the underlying community structure will become visible. Later we will give a detailed analysis of parameter l.
$$\forall\, i \in C_s,\ \forall\, j \notin C_s:\quad V_s^l(i) > V_s^l(j). \tag{5}$$
The algorithm that each ant adopts to compute its l-step transition probability distribution $V_s^l$ is given below, described using Matlab-style pseudocode.
Procedure Produce_V
/* Each ant has already visited t+1 nodes after any t steps, so the largest
   t+1 elements of V are set to 1 after each intermediate step. */
input:  s    /* start position of this ant */
        B    /* current pheromone matrix */
        P    /* transition probability matrix of the random walk */
        l    /* limit on the number of steps */
output: V    /* l-step transition probability distribution of this ant */
begin
    V ← zeros(1, n);
    V(s) ← 1;
    M ← P .* B;                          /* combine heuristic and pheromone */
    D ← diag(sum(M, 2));
    M ← inv(D) * M;                      /* row-normalise, cf. Eq. (3) */
    M ← M';                              /* transpose, cf. Eq. (4) */
    for i = 1 : l
        V ← V * M;
        if i ~= l                        /* mark visited nodes at intermediate steps */
            [sorted_V, ix] ← sort(V, 'descend');
            V(ix(1 : i+1)) ← 1;
        end
    end
end
After obtaining $V_s^l$, the next problem is how to find the ant's solution, which should also be a clustering solution of the network. However, each ant can only indicate that the nodes in its own community are visited with high probability; the nodes with low visiting probability are not necessarily in one community and may belong to several different communities. Therefore one ant can only find its own community from its local view.
The algorithm sorts $V_s^l$ in descending order and then calculates the differences between adjacent elements of the sorted $V_s^l$, finding the point corresponding to the maximum difference. Obviously, the point corresponding to the biggest “valley” of the sorted $V_s^l$ is the most suitable cutoff point to identify the community of this ant. Finally, we take the nodes whose visiting probability is greater than that of the cutoff point to be in the same community, but we do not consider which communities the remaining nodes belong to. Obviously, the solution produced by one ant is its own community.
Given $V_s^l$, the algorithm that divides $V_s^l$ and finds this ant's solution is as follows.
Procedure Divide_V
/* As each ant has visited at least l+1 nodes,
   the index of the cutoff point should not be less than l+1. */
input:  V         /* l-step transition probability distribution of this ant */
output: solution  /* solution of this ant */
begin
    [sorted_V, ix] ← sort(V, 'descend');
    diff_V ← -diff(sorted_V);                    /* drops between adjacent sorted values */
    diff_V ← diff_V(l+1 : length(diff_V));       /* cutoff index must be at least l+1 */
    [max_diff, cut_pos] ← max(diff_V);
    cut_pos ← cut_pos + l;
    cluster ← ix(1 : cut_pos);                   /* nodes above the biggest "valley" */
    solution ← zeros(n, n);
    solution(cluster, cluster) ← 1;
    I ← eye(cut_pos, cut_pos);
    solution(cluster, cluster) ← solution(cluster, cluster) - I;   /* zero the diagonal */
end
In the network, let the total number of nodes be n and the total number of edges be m. If the network is represented by its adjacency matrix, the time complexity of Produce_V is O(l · n²). Divide_V needs to sort all nodes according to their probability values. Because linear-time sorting algorithms (such as bin sort and counting sort) can be adopted, the time complexity of Divide_V is O(n). Thus, the overall complexity for one ant to produce its solution is O(l · n²). However, if the network is represented by an adjacency list, the time complexity can be decreased to O(l(m + n)). As most complex networks are sparse graphs, this can be very efficient.
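As an illustration of the O(l(m + n)) bound, the following Python sketch re-implements Produce_V with a scipy.sparse matrix. It is our own reading of the pseudocode above, not the authors' code, and it marks the most-visited nodes only at intermediate steps as discussed.

```python
# Sparse l-step walk: each step costs roughly one sparse matrix-vector product, O(m + n).
import numpy as np
import scipy.sparse as sp

def produce_v(P, B, s, l):
    """l-step visiting distribution of an ant starting at node s.
    P, B: sparse n x n random-walk and pheromone matrices."""
    M = P.multiply(B)                               # b_ij * p_ij, numerator of Eq. (3)
    row_sums = np.asarray(M.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0                   # guard against empty rows
    M = (sp.diags(1.0 / row_sums) @ M).tocsr()      # row-normalise: m_ij of Eq. (3)
    n = P.shape[0]
    V = np.zeros(n)
    V[s] = 1.0
    for i in range(1, l + 1):
        V = M @ V                                   # matrix-vector form of V <- V * M' above
        if i != l:                                  # mark the i+1 most-visited nodes
            top = np.argpartition(V, -(i + 1))[-(i + 1):]
            V[top] = 1.0
    return V
```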
As we can see, at each iteration the algorithm aggregates the local solutions of all ants into a global one and then uses it to update the pheromone matrix B. As the iterations increase, the pheromone matrix gradually evolves, which makes the ants more and more directed and makes the trend that any ant stays in its own community more and more obvious. When the algorithm finally converges, the pheromone matrix B can be regarded as the final clustering result, which aggregates the information of all the ants over all iterations.
The next step is how to analyze the produced pheromone matrix B in or-
der to attain the clustering solution of the network. Because of the convergence
property of ACO, the cluster characteristic of matrix B is very obvious. The
description of a simple partition phase algorithm is as follows.
3 Experiments
In order to quantitatively analyze the performance of algorithm ACOMRW,
we tested it by using both computer-generated and real-world networks. We
conclude by analyzing the parameter l defined in this algorithm.
In this experiment our algorithm is compared with the GN algorithm [3], the Fast Newman (FN) algorithm [5], and the FEC algorithm [10], which are all well-known and competitive network clustering algorithms. To compare the clustering accuracy of the different algorithms more fairly, we adopt two widely used accuracy measures: Fraction of Vertices Classified Correctly (FVCC) [5] and Normalized Mutual Information (NMI) [16].
Fig. 1. Compare ACOMRW with GN, FN and FEC against benchmark random networks (x-axis: number of inter-community edges per vertex, zout). (a) NMI as accuracy measure; (b) FVCC as accuracy measure.
Table 1. Compare ACOMRW with GN, FN and FEC on three real-world networks
[Figure panels (a)-(c): effect of the step limit l (L-steps) on the number of clusters and on the Q-value of ACOMRW, compared with the Q-value of the real community structure]
References
1. Watts, D.J., Strogatz, S.H.: Collective Dynamics of Small-World Networks. Na-
ture 393(6638), 440–442 (1998)
2. Barabási, A.L., Albert, R., Jeong, H., Bianconi, G.: Power-law distribution of the World Wide Web. Science 287(5461), 2115a (2000)
3. Girvan, M., Newman, M.E.J.: Community Structure in Social and Biological Net-
works. Proceedings of National Academy of Science 9(12), 7821–7826 (2002)
4. Santo, F.: Community Detection in Graphs. Physics Reports 486(3-5), 75–174
(2010)
5. Newman, M.E.J.: Fast Algorithm for Detecting Community Structure in Networks.
Physical Review E 69(6), 066133 (2004)
6. Guimera, R., Amaral, L.A.N.: Functional cartography of complex metabolic net-
works. Nature 433(7028), 895–900 (2005)
7. Barber, M.J., Clark, J.W.: Detecting Network Communities by Propagating Labels
under Constraints. Phys. Rev. E 80(2), 026129 (2009)
8. Jin, D., He, D., Liu, D., Baquero, C.: Genetic algorithm with local search for
community mining in complex networks. In: Proc. of the 22th IEEE International
Conference on Tools with Artificial Intelligence (ICTAI 2010), pp. 105–112. IEEE
Press, Arras (2010)
9. Palla, G., Derenyi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community
structures of complex networks in nature and society. Nature 435(7043), 814–818
(2005)
10. Yang, B., Cheung, W.K., Liu, J.: Community Mining from Signed Social Networks.
IEEE Trans. on Knowledge and Data Engineering 19(10), 1333–1348 (2007)
11. Zhang, Y., Wang, J., Wang, Y., Zhou, L.: Parallel Community Detection on Large
Networks with Propinquity Dynamics. In: Proc. the 15th ACM SIGKDD Int. Conf.
on Knowledge Discovery and Data Mining, pp. 997–1005. ACM Press, Paris (2009)
12. Morarescu, C.I., Girard, A.: Opinion Dynamics with Decaying Confidence: Appli-
cation to Community Detection in Graphs. arXiv:0911.5239v1 (2010)
13. Strehl, A., Ghosh, J.: Cluster ensembles-a knowledge reuse framework for combin-
ing partitionings. Journal of Machine Learning Research 3, 583–617 (2002)
14. Milgram, S.: The Small World Problem. Psychology Today 1(1), 60–67 (1967)
15. Albert, R., Jeong, H., Barabasi, A.L.: Diameter of the World Wide Web. Na-
ture 401, 130–131 (1999)
16. Danon, L., Duch, J., Diaz-Guilera, A., Arenas, A.: Comparing community structure
identification. J. Stat. Mech., P09008 (2005)
17. Zachary, W.W.: An Information Flow Model for conflict and Fission in Small
Groups. J. Anthropological Research 33, 452–473 (1977)
18. Lusseau, D.: The Emergent Properties of a Dolphin Social Network. Proc. Biol.
Sci. 270, S186–S188 (2003)
19. Newman, M.E.J., Girvan, M.: Finding and Evaluating Community Structure in
Networks. Phys. Rev. E 69(2), 026113 (2004)
Faster and Parameter-Free Discord Search in
Quasi-Periodic Time Series
Abstract. Time series discord has proven to be a useful concept for time-
series anomaly identification. To search for discords, various algorithms
have been developed. Most of these algorithms rely on pre-building an
index (such as a trie) for subsequences. Users of these algorithms are typ-
ically required to choose optimal values for word-length and/or alphabet-
size parameters of the index, which are not intuitive. In this paper, we
propose an algorithm to directly search for the top-K discords, without the
requirement of building an index or tuning external parameters. The al-
gorithm exploits quasi-periodicity present in many time series. For quasi-
periodic time series, the algorithm gains significant speedup by reducing
the number of calls to the distance function.
1 Introduction
Periodic and quasi-periodic time series appear in many data mining applications,
often due to internal closed-loop regulation or external phase-locking forces on
the data sources. A time series' temporary deviation from a periodic or quasi-periodic pattern constitutes a major type of anomaly in many applications. For example, an electrocardiography (ECG) recording is nearly periodic, as is one's heartbeat. Figure 1 shows an ECG signal where a disruption of periodicity is
highlighted. This disruption of periodicity actually indicates a Premature Ven-
tricular Contraction (PVC) arrhythmia [3]. As another example, Figure 4 shows
the number of beds occupied in a tertiary hospital. The time series suggests a
weekly pattern—busy weekdays followed by quieter weekends. If the weekly pat-
tern is disrupted, then chaos often follows with elective surgeries being canceled
and the emergency department being over-crowded, greatly impacting patient
satisfaction and health care quality.
Time Series Discord captures the idea of anomalous subsequences in time
series and has proven to be useful in a diverse range of applications (see for
example [5,1,11]). Intuitively, a discord of a time series is a subsequence with
the largest distance from all other non-overlapping subsequences in the time se-
ries. Similarly, the 2nd discord is a subsequence with the second largest distance
from all other non-overlapping subsequences. And more generally one can search
Fig. 1. An ECG time series that demonstrates periodicity, baseline shift, and a discord. The time series is the second-lead signal from dataset xmitdb_x108_0 of [6]. According to [3], the ECG was taken under the frequency of 360 Hz. The unit for measurement is unknown to the author.
Fig. 2. Illustration of Proposition 1. The blue solid line represents the true d for the time series xmitdb_x108_0 (with subsequence length 360). The red dashed line represents an estimate d̂ for d. Although at many locations d̂ is very different from d, the maximum of d̂ coincides with the maximum of d.
for the top-K discords [1]. Finding the discord for a time series in general requires
comparisons among O(m2 ) pair-wise distances, where m is the length of the time
series. Despite past efforts in building heuristics (e.g., [5,1]), searching for the
discord still requires expensive computation, making real-time interaction with
domain experts difficult. In addition, most of existing algorithms are based on
the idea of indexing subsequences with a data structure such as a trie. Such data
structures often have unintuitive parameters (e.g., word length and alphabet
size) to tune. This means time consuming trial-and-error that compromises the
efficiency of the algorithms.
Keogh, Lin, and Fu first defined time series discords and proposed a search
algorithm named HOT SAX in [5]. A memory efficient search algorithm was
also proposed later [11]. HOT SAX builds on the idea of discretizing and index-
ing time series subsequences. To select the lengths for index keys, wavelet decom-
position can be used ([2,1]). Most recently, adaptive discretization has been pro-
posed to improve the index for efficient discord search ([8]). In this paper, we pro-
pose a fast algorithm to find the top-K discords in a time series without prebuild-
ing an index or tuning parameters. For periodic or quasi-periodic time series, the
algorithm finds the discord with much less computation, compared to results pre-
viously reported in the literature (e.g., [5]). After finding the 1st discord, our algo-
rithm finds subsequent discords with even less computation—often 50% less. We
tested our algorithm with a collection of datasets from [6] and [4]. The diversity
of the collection shows the definition of “quasi-periodicity” can be very relaxed
for our algorithm to achieve search efficiency. Periodicity of a time series can be
easily assessed through visual inspection. The experiments with artificially gen-
erated non-periodic random walk time series showed increased running time, but
the algorithm is still hundreds of times faster than the brute-force search, without
tuning any parameter.
The paper is organized as follows. Section 2 reviews the definition of time-
series discord and existing algorithms for discord search. Section 3 introduces our
direct search algorithm and explains ideas behind it. Section 4 presents empirical
evaluation for the new algorithm and a comparison with the results of HOT SAX
from [5]. Section 5 concludes the paper.
Note that the values for both n and k should be determined by the application;
they are independent of the search algorithm. If a user were looking for the three
most unusual weeks in the bed occupancy example (Figure 4), k would be 3
and n would be 7 × 24, assuming the time series is sampled hourly. Strictly
speaking, the discord is not well defined, as there may be more than one location
p that maximizes d_p (i.e., d_{p1} = d_{p2} = max_p {d_p : 1 ≤ p ≤ m − n + 1}). But
the ambiguity rarely matters in most applications, especially when the top-K
discords are searched in a batch. In this paper, we shall follow the existing
literature [5] and assume that all dp ’s have distinct values.
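To make the definition concrete, a brute-force search follows it directly: compute d_p as the minimum distance from T[p; n] to every non-overlapping subsequence, then keep the positions with the largest d_p. The sketch below is illustrative only (Euclidean distance and all names are our assumptions, not the paper's implementation) and exhibits exactly the O(m^2) cost discussed next.

import numpy as np

def euclidean(a, b):
    # Any distance function dist_{T,n} would do; Euclidean is assumed here.
    return float(np.linalg.norm(a - b))

def brute_force_discords(T, n, K=1):
    # d[p] = min over all non-overlapping q of dist(T[p:p+n], T[q:q+n]);
    # the top-K discords are the positions with the largest d[p],
    # kept mutually non-overlapping.
    m = len(T)
    last = m - n + 1
    d = np.full(last, np.inf)
    for p in range(last):
        for q in range(last):
            if abs(p - q) >= n:                      # non-overlapping subsequences only
                d[p] = min(d[p], euclidean(T[p:p + n], T[q:q + n]))
    discords = []
    for p in np.argsort(-d):                         # largest d[p] first
        if all(abs(int(p) - s) >= n for s in discords):
            discords.append(int(p))
        if len(discords) == K:
            break
    return discords, d

# Example: the 1st discord of a noisy sine wave with an injected deviation.
t = np.linspace(0, 10 * np.pi, 1000)
x = np.sin(t) + 0.05 * np.random.randn(1000)
x[500:520] += 1.5                                    # temporary break of periodicity
print(brute_force_discords(x, n=50, K=1)[0])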
The discord has a formulation similar to the minimax problem in game theory.
Note that
max_p min_q dist_{T,n}(p, q) ≤ min_q max_p dist_{T,n}(p, q).
According to Sion’s minimax theorem [9], the equality holds if distT,n (p, ·) is
quasi-concave on q for every p and distT,n (·, q) is quasi-convex on p for every q.
Figure 3 indicates, however, that in general neither distT,n (p, ·) is quasi-concave
nor distT,n (·, q) is quasi-convex, and no global saddle point exists. That suggests
searching for discords requires a strategy different from those used in game the-
ory. In the worst case, searching for the discord has the complexity O(m2 ), essen-
tially requiring brute-force computation of the pair-wise distances of all length-n
subsequences of the time series. When m = 104 , that means 100 million calls
to the distance function. Nevertheless, the following sufficient condition for the
discord suggests a search strategy better than the brute-force computation.
Observation 1. Let T be a time series. A subsequence T [p∗ ; n] is the discord
of length n if there exists d∗ such that
In general, there are infinitely many d∗ that satisfy Clause (2) and Clause (3).
Suppose we have a good guess d∗ . Clause (3) implies that a false candidate
of the discord can be refuted, potentially in fewer than m steps. Clause (2)
implies that, given all false candidates have been refuted, the true candidate
for the discord can be verified in m − n + 1 steps. Hence in the best case,
(m − n + 1) + (m − 1) = 2m − n calls to the distance function are sufficient to
verify the discord. To estimate d∗, we can start with the value of d_p where p is
a promising candidate for the discord, and later increase the guess to a larger
value d_{p′} if p′ is not refuted (i.e., dist_{T,n}(p′, q) > d_p for every non-overlapping q)
and becomes the next candidate. This hill-climbing process goes on until all but
one of the subsequences are refuted with the updated value of d∗.
This idea forms the basis of most existing discord search algorithms (e.g., HOT
SAX in [5] and WAT in [1]); the common structure of these algorithms is shown in
Figure 5. With this base algorithm, the efficiency of a search then depends on the
order of subsequences in the Outer and Inner loops (see lines 2 and 3). Intuitively,
the Outer loop should rank p according to the singularity of subsequence T [p; n];
the Inner loop should rank q according to the proximity between subsequences
T [p; n] and T [q; n]. Both HOT SAX and WAT adopt the following strategy.
Firstly all subsequences of length n are discretized and compressed into shorter
strings. Then the strings are indexed with a suffix trie—in the ideal situation,
subsequences close in distance also share an index key or occupy neighboring
index keys in the trie. This is not so different from the idea of hashing to achieve
O(1) search time. In the end, all subsequences will be indexed into a number of
buckets on the terminal nodes. The hope is that, with careful selection of string
length and alphabet size, the discord will fall into a bucket containing very few
subsequences while a non-discord subsequence will fall into a bucket shared with
similar subsequences. Then the uneven distribution of subsequences among the
buckets can be exploited to devise efficient ordering for the Outer and Inner
loops.
This ingenious approach however has two drawbacks. Firstly, one needs to
select optimal parameters that balance the index size and the bucket size, which
are critical to the search efficiency. For example, to use HOT SAX, one needs
to set the alphabet size and the word size for the discretized subsequences [5,
Section 4.2]; WAT automates the selection of word size, but still requires setting
the alphabet size [1, Section 3.2]. Such parameters are not always intuitive to a
user; the difficulty of building a usable trie is discussed in [11, Section
2]. Secondly, the above approach uses a fixed/random order in the outer loop
to search for all top-K discords. A dynamic ordering for the outer loop could
potentially make better use of the information gained in the previous search
steps. Also it is not clear how knowledge gained in finding the k-th discord can
help finding the (k + 1)-th discord. In [1, Section 3.6], partial information about
d̂ is cached so that the inner loop may break quickly. But since caching works on
the "easy" part of the search space—where d_p is small—it is not clear how much
computation is saved.
In the following section, we address the above issues by proposing a direct
way to search for multiple discords. In particular, our algorithm requires no
ancillary index (and hence no parameters to tune), and the algorithm reuses the
knowledge gained in searching for the first k discords to speed up the search for
the (k + 1)-th discord.
1: For each p, estimate d̂_p ← min_{q∈Q_p} dist(p, q), where Q_p is a subset of {q : |p − q| > n}.
2: while the discord has not been found, do
3:   p∗ ← argmax_p {d̂_p}.
4:   Compute d_{p∗} ← min_q dist(p∗, q).
5:   if d_{p∗} > d̂_p for all p ≠ p∗ then
6:     return p∗ as the discord starting location.
7:   else
8:     Decrease d̂ by enlarging the Q_p's.
9:   end if
10: end while
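A direct Python rendering of this skeleton might look as follows. The construction of the initial estimates d̂ (Line 1, realised by the Sampling/Traversing heuristics of Figure 12) is left to the caller, and the refinement in the else-branch is simplified to recomputing exact values, so this is a sketch of the control flow rather than the authors' implementation.

def verify_discord(T, n, d_hat, dist):
    # d_hat: dict mapping a start position p to an initial estimate of
    # d_p = min_q dist(T[p:p+n], T[q:q+n]) over non-overlapping q (Line 1).
    last = len(T) - n + 1
    d_hat = dict(d_hat)                              # work on a copy

    def exact(p):
        # one full pass over all non-overlapping q
        return min(dist(T[p:p + n], T[q:q + n])
                   for q in range(last) if abs(q - p) > n)

    while True:
        p_star = max(d_hat, key=d_hat.get)           # most promising candidate (Line 3)
        d_star = exact(p_star)                       # Line 4
        best_other = max((v for p, v in d_hat.items() if p != p_star),
                         default=float("-inf"))
        if d_star > best_other:                      # Line 5: all other candidates refuted
            return p_star, d_star
        # Line 8 (simplified): tighten the estimates that block verification
        for p in d_hat:
            if p != p_star and d_hat[p] >= d_star:
                d_hat[p] = exact(p)
        d_hat[p_star] = d_star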
Fig. 7. Distance profiles dist(p, ·) of time series xmitdb_x018_0. Each line plots the sequence s_p = (dist_{T,n}(p, 1), . . . , dist_{T,n}(p, 1000)) for some p, where n = 360. The 10 lines in the plot correspond to p being 10, 20, . . . , 100 respectively.
Fig. 8. Locations of q_p's for time series xmitdb_x108_0. Each location (p, q_p) is colored according to the value d_p. Dashed lines are a period (360) apart. Hence if a location (p, q_p) falls on a dashed line, then q_p − p is a multiple of the period 360.
Fig. 9. Autocorrelation function of time series xmitdb_x108_0. The plot shows multiple peaks corresponding to multiples of the period.
Fig. 10. The density plot for gaps between local minima and the estimated period for time series xmitdb_x108_0.
Fig. 11. Estimating period with the median gap between two neighboring local minima
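Read as code, the heuristic of Figure 11 amounts to locating local minima and taking the median gap between neighbouring minima as the period estimate; the light smoothing used below is our assumption for noise robustness, not a detail taken from the paper.

import numpy as np

def estimate_period(x, smooth=5):
    # Median gap between neighbouring local minima of a (lightly smoothed) series.
    x = np.asarray(x, dtype=float)
    if smooth > 1:
        x = np.convolve(x, np.ones(smooth) / smooth, mode="same")   # suppress noise
    minima = np.where((x[1:-1] < x[:-2]) & (x[1:-1] < x[2:]))[0] + 1
    gaps = np.diff(minima)
    return int(np.median(gaps)) if len(gaps) else None

t = np.linspace(0, 40 * np.pi, 4000)
print(estimate_period(np.sin(t) + 0.1 * np.random.randn(4000)))     # roughly 200 samples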
With heuristics for both Traversing and Sampling, Figure 12 implements Line 1
in Figure 6. The procedure uses a sequential covering strategy to estimate d̂p for
each p. In each iteration (the while loop from Line 2 to Line 10), a Sampling
operation is done to find a “sweet spot”. Then a Traversing operation exploits
that location to cover as many neighboring locations as possible.
The verification stage of our algorithm (Lines 2-10 in Figure 6) consists of
a while loop which resembles the outer loop in HOT SAX and WAT. But here
the order of location is dynamic, determined by the ever-improving estimate d̂.
Line 8 in Figure 6 further improves d̂ when the initial guess for the discord
turns out to be incorrect. The improvement can be achieved by traversing with
a better starting location qp∗ produced in Line 4 of Figure 6 (see Figure 13). As
suggested by Figure 8, the “best” locations tend to cluster along the 45 degree
lines. Moreover the large value of the initial estimate dˆp∗ suggests the neighbor-
hood of p∗ is a high-payoff region for further refinement of d̂. As the traversing
is done locally, the improvement step is relatively fast compared to the initial
estimation step for d.
To sum up, we have described a new algorithm for discord search that con-
sists of an estimation stage followed by a verification stage. The estimation stage
4 Empirical Evaluation
In this section, we first compare the performance of our direct-discord-search
algorithm with the results reported for HOT SAX in [5]. We then report the
performance of our algorithm on a collection of time series which are publicly
available. Following the tradition established in [5] and [1], the efficiency of our
algorithm was measured by the number of calls to the distance function, as
opposed to wall-clock or CPU time. Since our algorithm entails no overhead of
constructing an index (in contrast to the algorithms in [5] and [1]), the number
of calls to the distance function is roughly proportional to the total computation
time involved. As shown in [2] and [1], the performance of HOT SAX depends
on the parameters selected. Here we assume that the metrics reported in [5] were
based on optimal parameter values.
To compare to HOT SAX, we use the dataset qtdbsel102 from [6]. Although
several datasets were used in [5] to evaluate the performance of HOT SAX, this is
the only one readily available to us. The dataset qtdbsel102 contains two time
series of length 45,000; we use the first one, as the two are highly correlated.
Fig. 14. Search costs for the direct search algorithm and HOT SAX. For HOT SAX, the mean numbers of distance calls were visually estimated from [5]; interval estimates were used to account for potential estimation error.
Fig. 15. Time series nprs44 and its d̂ vector.
Table 1. Numbers of calls to the distance function with random excerpts from qtdbsel102, for the direct-discord-search algorithm and HOT SAX

Time Series   Direct Search Cost (Standard Error)                              Aver. Cost for HOT SAX
Length        1st discord        2nd discord        3rd discord                (visual estimates)
1,000         4,020 (1,441)      1,072 (705)        998 (690)                  16,000 to 40,000
2,000         11,159 (4,641)     4,120 (2,532)      3,493 (2,780)              40,000 to 100,000
4,000         30,938 (12,473)    13,963 (10,633)    13,399 (12,473)            60,000 to 160,000
8,000         77,381 (33,064)    29,711 (32,651)    38,632 (40,974)            100,000 to 160,000
16,000        168,277 (70,071)   94,855 (107,128)   141,038 (143,553)          250,000 to 400,000
32,000        365,900 (184,540)  198,797 (95,960)   105,911 (107,992)          400,000 to 1×10^6
In the second set of experiments, we search for the top 3 discords for a col-
lection of time series from [6]2 and [4], using the proposed algorithm. For time
series from [6], the discord lengths are chosen to be consistent with configura-
tions used in [5]. The results are shown in Table 2. Many of these datasets, in
particular 2h_radioactivity, demonstrate little periodicity. The results show
that our algorithm has reasonable performance even for such time series.
Table 2. Numbers of calls to the distance function for top-3 discord search
In Table 2, the results for the time series nprs44 are particularly interesting.
For nprs44, no significant reduction in computation is observed for computing
the 2nd and the 3rd discords. To find out why, we plot the time series and the
estimated d vector in Figure 15. The figure shows that the 2nd and the 3rd
discords are not noticeably different from other subsequences.
Fig. 16. Random walk time series used in the experiments for completely nonperiodic
data
Table 3. Number of calls to the distance function for top-3 discord search (random
walk time series)
(see Figure 16). Random walk time series are interesting in two respects: first, a
random walk time series is completely nonperiodic; second, every subsequence
of a random walk can be regarded as equally anomalous.
We applied the algorithm to find the top-3 discords in the two random-walk
time series. The results are shown in Table 3. Without tuning any parameter,
the algorithm is still hundreds of times faster than the brute-force computation
of all pair-wise distances.
To sum up, our experiments show clear performance improvement on quasi-
periodic time series by the proposed direct discord-search algorithm. Our algo-
rithm also demonstrates consistent performance across a broad range of time
series with varying degrees of periodicity.
One limitation of the proposed algorithm is that the time series needs to fit
into main memory; hence the algorithm requires O(m) memory. One future
direction is to explore disk-aware approximations to the direct-discord-search
algorithm. When the time series is too large to fit into main memory,
one needs to minimize the number of disk scans as well as the number of calls to
the distance function (see [11]).
Another direction is to explore alternative ways of estimating the d vector so
that the number of iterations for refining d̂ is minimized. We are also looking for
ways to extend the algorithm so that the periodicity assumption can be removed.
Acknowledgment
Support for this work was provided by an Australian Research Council Linkage
Grant (LP 0776417). We would like to thank anonymous reviewers for their
helpful comments.
References
1. Bu, Y., Leung, T.W., Fu, A.W.C., Keogh, E., Pei, J., Meshkin, S.: WAT: Finding
top-k discords in time series database. In: Proceedings of 7th SIAM International
Conference on Data Mining (2007)
2. Fu, A.W.-c., Leung, O.T.-W., Keogh, E.J., Lin, J.: Finding time series discords
based on haar transform. In: Li, X., Zaı̈ane, O.R., Li, Z.-h. (eds.) ADMA 2006.
LNCS (LNAI), vol. 4093, pp. 31–41. Springer, Heidelberg (2006)
3. Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark,
R.G., Mietus, J.E., Moody, G.B., Peng, C.-K., Stanley, H.E.: PhysioBank, Phys-
ioToolkit, and PhysioNet: Components of a new research resource for complex
physiologic signals. Circulation 101(23), e215–e220 (2000), Circulation Electronic
Pages: http://circ.ahajournals.org/cgi/content/full/101/23/e215
4. Hyndman, R.J.: Time Series Data Library, http://www.robjhyndman.com/TSDL
(accessed on April 15, 2010)
5. Keogh, E., Lin, J., Fu, A.: HOT SAX: Efficiently finding the most unusual time
series subsequence. In: Proc. of the 5th IEEE International Conference on Data
Mining, pp. 226–233 (2005)
6. Keogh, E., Lin, J., Fu, A.: The UCR Time Series Discords Homepage,
http://www.cs.ucr.edu/~eamonn/discords/
7. Lindström, J., Kokko, H., Ranta, E.: Detecting periodicity in short and noisy time
series data. Oikos 78(2), 406–410 (1997)
8. Pham, N.D., Le, Q.L., Dang, T.K.: HOT aSAX: A novel adaptive symbolic repre-
sentation for time series discords discovery. In: Nguyen, N.T., Le, M.T., Światek,
J. (eds.) ACIIDS 2010. LNCS, vol. 5990, pp. 113–121. Springer, Heidelberg (2010)
9. Sion, M.: On general minimax theorems. Pacific J. Math. 8(1), 171–176 (1958)
10. Sprott, J.C.: Chaos and time-series analysis. Oxford Univ. Pr., Oxford (2003)
11. Yankov, D., Keogh, E., Rebbapragada, U.: Disk aware discord discovery: Finding
unusual time series in terabyte sized datasets. Knowledge and Information Sys-
tems 17(2), 241–262 (2008)
INSIGHT: Efficient and Effective Instance Selection for
Time-Series Classification
Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
{buza,nanopoulos,schmidt-thieme}@ismll.de
1 Introduction
Time-series classification is a widely examined data mining task with applications in
various domains, including finance, networking, medicine, astronomy, robotics, bio-
metrics, chemistry and industry [11]. Recent research in this domain has shown that the
simple nearest-neighbor (1-NN) classifier using Dynamic Time Warping (DTW) [18]
as distance measure is “exceptionally hard to beat” [6]. Furthermore, the 1-NN classifier is
easy to implement and delivers a simple model together with a human-understandable
explanation, in the form of an intuitive justification by the most similar training instances.
The efficiency of nearest-neighbor classification can be improved with several meth-
ods, such as indexing [6]. However, for very large time-series data sets, the execution
time for classifying new (unlabeled) time-series can still be affected by the significant
computational requirements posed by the need to calculate DTW distance between the
new time-series and several time-series in the training data set (O(n) in worst case,
where n is the size of the training set). Instance selection is a commonly applied ap-
proach for speeding-up nearest-neighbor classification. This approach reduces the size
of the training set by selecting the best representative instances and using only them
during the classification of new instances. Due to its advantages, instance selection has been
explored for time-series classification [20].
In this paper, we propose a novel instance-selection method that exploits the re-
cently explored concept of hubness [16], which states that some few instances tend to
be much more frequently nearest neighbors than the remaining ones. Based on hub-
ness, we propose a framework for score-based instance selection, which is combined
with a principled approach of selecting instances that optimize the coverage of training
data, in the sense that a time series x covers another time series y if y can be classi-
fied correctly using x. The proposed framework not only allows better understanding
of the instance selection problem, but helps to analyze the properties of the proposed
approach from the point of view of coverage maximization. For the above reasons, the
proposed approach is denoted as Instance Selection based on Graph-coverage and Hub-
ness for Time-series (INSIGHT). INSIGHT is evaluated experimentally with a collec-
tion of 37 publicly available time series classification data sets and is compared against
FastAWARD [20], a state-of-the-art instance selection method for time series classifi-
cation. We show that INSIGHT substantially outperforms FastAWARD both in terms
of classification accuracy and execution time for performing the selection of instances.
The paper is organized as follows. We begin with reviewing related work in section
2. Section 3 introduces score-based instance selection and the implications of hubness
to score-based instance selection. In section 4, we discuss the complexity of the in-
stance selection problem, and the properties of our approach. Section 5 presents our
experiments followed by our concluding remarks in section 6.
2 Related Work
Attempts to speed up DTW-based nearest neighbor (NN) classification [3] fall into four
major categories: i) speeding up the calculation of the distance between two time series,
ii) reducing the length of time series, iii) indexing, and iv) instance selection.
Regarding the calculation of the DTW-distance, the major issue is that implement-
ing it in the classic way [18], the comparison of two time series of length l requires the
calculation of the entries of an l × l matrix using dynamic programming, and therefore
each comparison has a complexity of O(l 2 ). A simple idea is to limit the warping win-
dow size, which eliminates the calculation of most of the entries of the DTW-matrix:
only a small fraction around the diagonal remains. Ratanamahatana and Keogh [17]
showed that such a reduction does not negatively influence classification accuracy;
instead, it leads to more accurate classification. More advanced scaling techniques include
lower-bounding, like LB Keogh [10].
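To illustrate the warping-window idea, the sketch below computes DTW under a Sakoe-Chiba band: only cells within `window` of the diagonal are filled, so the cost drops from O(l^2) towards O(l·window). The squared point-wise cost and the symmetric step pattern are assumptions for illustration, not necessarily the exact variant used in the cited works.

import numpy as np

def dtw_distance(a, b, window):
    # DTW between two 1-D series with a Sakoe-Chiba band of half-width `window`.
    la, lb = len(a), len(b)
    window = max(window, abs(la - lb))               # band must cover the length difference
    D = np.full((la + 1, lb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, la + 1):
        for j in range(max(1, i - window), min(lb, i + window) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2        # point-wise squared cost
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[la, lb]))

x = np.sin(np.linspace(0, 6.28, 100))
y = np.sin(np.linspace(0, 6.28, 100) + 0.3)
print(dtw_distance(x, y, window=5))                  # e.g. a 5% band for length-100 series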
Another way to speed-up time series classification is to reduce the length of time
series by aggregating consecutive values into a single number [13], which reduces the
overall length of time series and thus makes their processing faster.
Indexing [4], [7] aims at quickly finding the most similar training time series to a given
time series. Due to the “filtering” step that is performed by indexing, the execution
time for classifying new time series can be considerable for large time-series data sets,
since it can be affected by the significant computational requirements posed by the need
to calculate DTW distance between the new time-series and several time-series in the
training data set (O(n) in worst case, where n is the size of the training set). For this
reason, indexing can be considered complementary to instance selection, since both
these techniques can be applied to improve execution time.
Instance selection (also known as numerosity reduction or prototype selection) aims
at discarding most of the training time series while keeping only the most informative
ones, which are then used to classify unlabeled instances. While instance selection is
well explored for general nearest-neighbor classification, see e.g. [1], [2], [8], [9], [14],
there are just a few works for the case of time series. Xi et al. [20] present the Fast-
AWARD approach and show that it outperforms state-of-the-art, general-purpose in-
stance selection techniques applied for time series.
FastAWARD follows an iterative procedure for discarding time series: in each it-
eration, the rank of all the time series is calculated and the one with lowest rank is
discarded. Thus, each iteration corresponds to a particular number of kept time
series. Xi et al. argue that the optimal warping window size depends on the number of
kept time series. Therefore, FastAWARD calculates the optimal warping window size
for each number of kept time series.
FastAWARD involves some design decisions that can be considered ad hoc (such
as the application of an iterative procedure or the use of tie-breaking criteria [20]). Con-
versely, INSIGHT follows a more principled approach. In particular, INSIGHT gener-
alizes FastAWARD by being able to use several formulae for scoring instances. We
will explain that the suitability of such formulae is based on the hubness property that
holds in most time-series data sets. Moreover, we provide insights into the fact that the
iterative procedure of FastAWARD is not a well-formed decision, since its large com-
putation time can be saved by ranking instances only once. Furthermore, we observed
the warping window size to be less crucial, and therefore we simply use a fixed window
size for INSIGHT (that outperforms FastAWARD using adaptive window size).
INSIGHT performs instance selection by assigning a score to each instance and select-
ing instances with the highest scores (see Alg. 1). In this section, we examine how to
develop appropriate score functions by exploiting the property of hubness.
In order to develop a score function that selects representative instances for nearest-
neighbor time-series classification, we have to take into account the recently explored
property of hubness [15]. This property states that for data with high (intrinsic) dimen-
sionality, as most of the time-series data1 , some objects tend to become nearest neigh-
bors much more frequently than others. In order to express hubness in a more precise
way, for a data set D we define the k-occurrence of an instance x ∈ D, denoted fNk (x),
that is the number of instances of D having x among their k nearest neighbors. With
the term hubness we refer to the phenomenon that the distribution of fNk (x) becomes
1 In case of time series, consecutive values are strongly interdependent, thus instead of the length
of time series, we have to consider the intrinsic dimensionality [16].
Fig. 1. Distribution of f_G^1(x) for some time series datasets. The horizontal axis corresponds to the values of f_G^1(x), while the vertical axis shows how many instances have that value.
significantly skewed to the right. We can measure this skewness, denoted by S_{f_N^k(x)},
with the standardized third moment of f_N^k(x):

    S_{f_N^k(x)} = E[ (f_N^k(x) − μ_{f_N^k(x)})^3 ] / σ_{f_N^k(x)}^3    (1)

where μ_{f_N^k(x)} and σ_{f_N^k(x)} are the mean and standard deviation of f_N^k(x). When S_{f_N^k(x)}
is higher than zero, the corresponding distribution is skewed to the right and starts
presenting a long tail.
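In code, this is simply the standardized third moment of the sample of k-occurrence counts (a direct transcription of Eq. 1; the function name and the example values are ours):

import numpy as np

def skewness(values):
    # Standardized third moment of a sample of k-occurrence counts f_N^k(x).
    v = np.asarray(values, dtype=float)
    mu, sigma = v.mean(), v.std()
    return float(((v - mu) ** 3).mean() / sigma ** 3)

print(skewness([0, 0, 1, 1, 1, 2, 2, 3, 8, 15]))     # right-skewed sample: positive value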
In the presence of labeled data, we distinguish between good hubness and bad hub-
ness: we say that the instance y is a good (bad) k-nearest neighbor of the instance x
if (i) y is one of the k-nearest neighbors of x, and (ii) both have the same (different)
class labels. This allows us to define good (bad) k-occurrence of a time series x, fGk (x)
(and fBk (x) respectively), which is the number of other time series that have x as one
of their good (bad) k-nearest neighbors. For time series, both distributions fGk (x) and
fBk (x) are usually skewed, as is exemplified in Figure 1, which depicts the distribution
of fG1 (x) for some time series data sets (from the collection used in Table 1). As shown,
the distributions have long tails, in which the good hubs occur.
We say that a time series x is a good (bad) hub, if fGk (x) (and fBk (x) respectively)
is exceptionally large for x. For the nearest neighbor classification of time series, the
skewness of good occurrence is of major importance, because a few time series (i.e.,
the good hubs) are able to correctly classify most of the other time series. Therefore, it
is evident that instance selection should pay special attention to good hubs.
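To make the occurrence counts concrete, the sketch below computes f_N^1(x), f_G^1(x) and f_B^1(x) for every training instance from a precomputed distance matrix; using a plain distance matrix as input (rather than on-the-fly DTW), and the function and variable names, are our simplifications.

import numpy as np

def one_occurrence_scores(dist_matrix, labels):
    # Returns (f_N1, f_G1, f_B1): total, good and bad 1-occurrence counts per instance.
    n = len(labels)
    f_n, f_g, f_b = np.zeros(n, int), np.zeros(n, int), np.zeros(n, int)
    D = np.asarray(dist_matrix, dtype=float).copy()
    np.fill_diagonal(D, np.inf)                      # an instance is not its own neighbour
    for i in range(n):
        nn = int(np.argmin(D[i]))                    # 1-nearest neighbour of instance i
        f_n[nn] += 1
        if labels[nn] == labels[i]:
            f_g[nn] += 1                             # nn is a good neighbour of i
        else:
            f_b[nn] += 1                             # nn is a bad neighbour of i
    return f_n, f_g, f_b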
Good 1-occurrence score — In the light of the previous discussion, INSIGHT can use
scores that take the good 1-occurrence of an instance x into account. Thus, a simple
score function that follows directly is the good 1-occurrence score f_G(x):

    f_G(x) = f_G^1(x)    (2)
While being a good hub, x may at the same time appear as a bad neighbor of several
other instances. Thus, INSIGHT can also consider scores that take bad occurrences into
account. This leads to scores that relate the good occurrence of an instance x to either
its total occurrence or to its bad occurrence. For simplicity, we focus on the following
relative score, however other variations can be used too:
Relative score f_R(x) of a time series x is the fraction of good 1-occurrences and total
occurrences plus one (the plus one in the denominator avoids division by zero):

    f_R(x) = f_G^1(x) / (f_N^1(x) + 1)    (3)
Xi’s score — Interestingly, f_G^k(x) and f_B^k(x) allow us to interpret the ranking criterion
of Xi et al. [20], by expressing it as another form of score for relative hubness:
Algorithm 1. INSIGHT
Require: Time-series dataset D, Score Function f , Number of selected instances N
Ensure: Set of selected instances (time series) D
1: Calculate score function f (x) for all x ∈ D
2: Sort all the time series in D according to their scores f (x)
3: Select the top-ranked N time series and return the set containing them
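Algorithm 1 therefore reduces to scoring, sorting and keeping the top N. A minimal sketch, reusing the one_occurrence_scores helper sketched above and offering the good 1-occurrence score (Eq. 2) and the relative score (Eq. 3) as options, could be:

import numpy as np

def insight_select(dist_matrix, labels, N, score="good"):
    # Score every training instance, then keep the N highest-scored ones.
    f_n, f_g, f_b = one_occurrence_scores(dist_matrix, labels)
    if score == "good":
        s = f_g.astype(float)                        # good 1-occurrence score, Eq. (2)
    else:
        s = f_g / (f_n + 1.0)                        # relative score, Eq. (3)
    return np.argsort(-s)[:N]                        # indices of the selected instances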
We first examine coverage graphs for the general case of instance-based learning
methods, which include the k-NN (k ≥ 1) classifier and its generalizations, such as
adaptive k-NN classification where the number of nearest neighbors k is chosen adap-
tively for each object to be classified [12], [19].2 In this context, the contribution of
an instance x to the correct classification of an instance y refers to the case when x is
among the nearest neighbors of y and they have the same label.
Based on the definition of the coverage graph, we can next define the coverage of a
specific vertex and of a set of vertices:
Following the common assumption that the distribution of the test (unlabeled) data is
similar to the distribution of the training (labeled) data, the more vertices are covered,
the better prediction for new (unlabeled) data is expected. Therefore, the objective of an
instance-selection algorithm is to have the selected vertex-set S (i.e., selected instances)
covering the entire set of vertices (i.e., the entire training set), i.e., C(S) = VGc . This,
however, may not be always possible, such as when there exist vertices that are not
covered by any other vertex. If a vertex v is not covered by any other vertex, this means
that the out-degree of v is zero (there are no edges going from v to other vertices).
Denote the set of such vertices by V_{G_c}^0. Then, an ideal instance selection algorithm
should cover all coverable vertices, i.e., for the selected vertices S an ideal instance
selection algorithm should fulfill:
    ⋃_{v∈S} C(v) = V_{G_c} \ V_{G_c}^0    (5)
In order to achieve the aforementioned objective, the trivial solution is to select all
the instances of the training set, i.e., choose S = V_{G_c}. This, however, is not an effective
instance selection algorithm, as the major aim of discarding less important instances
is not achieved at all. Therefore, the natural requirement regarding the ideal instance
selection algorithm is that it selects the minimal amount of those instances that together
cover all coverable vertices. This way we can cast the instance selection task as a cov-
erage problem:
Instance selection problem (ISP) — We are given a coverage graph Gc = (V, E). We
aim at finding a set of vertices S ⊆ VGc so that: i) all the coverable vertices are covered
(see Eq. 5), and ii) the size of S is minimal among all those sets that cover all coverable
vertices.
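For the special case of 1-NN classification, the coverage graph and the coverage C(S) of a selected set can be built as in the sketch below; this is our own illustrative reading (with the edge pointing from the covered vertex to its coverer, as described above), not the authors' implementation.

import numpy as np

def coverage_graph_1nn(dist_matrix, labels):
    # Edge y -> x meaning "x covers y": x is y's 1-NN and shares y's label.
    D = np.asarray(dist_matrix, dtype=float).copy()
    np.fill_diagonal(D, np.inf)
    edges = {}
    for y in range(len(labels)):
        x = int(np.argmin(D[y]))
        edges[y] = x if labels[x] == labels[y] else None   # None: y is not coverable
    return edges

def coverage(edges, selected):
    # C(S): the set of vertices covered by the selected vertex set S.
    S = set(selected)
    return {y for y, x in edges.items() if x in S}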
Next we will show that this problem is NP-complete, because it is equivalent to the
set-covering problem (SCP), which is NP-complete [5]. We proceed with recalling the
set-covering problem.
2 Please notice that in the general case the resulting coverage graph has no regularity regarding
both the in- and out-degrees of the vertices (e.g., in the case of k-NN classifier with adaptive k).
Set-covering problem (SCP) — ”An instance (X, F) of the set-covering problem consists of a finite set X and a family F of subsets of X, such that every element of X belongs to at least one subset in F. (...) We say that a subset F ∈ F covers its elements. The problem is to find a minimum-size subset C ⊆ F whose members cover all of X” [5]. Formally: the task is to find C ⊆ F so that |C| is minimal and X = ⋃_{F∈C} F.
Theorem 1. ISP and SCP are equivalent. (See Appendix for the proof.)
Fig. 2. Accuracy as function of the number of selected instances (in % of the entire training data)
for some datasets for FastAWARD and INSIGHT
5 Experiments
We experimentally examine the performance of INSIGHT with respect to effectiveness,
i.e., classification accuracy, and efficiency, i.e., execution time required by instance se-
lection. As baseline we use FastAWARD [20].
We used 37 publicly available time series datasets3 [6]. We performed 10-fold-cross
validation. INSIGHT uses f_G(x) (Eq. 2) as the default score function; f_R(x)
(Eq. 3) and f_Xi(x) (Eq. 4) are also examined. The resulting combinations are
denoted as INS- fG (x), INS- fR (x) and INS- fXi (x), respectively.
The distance function for the 1-NN classifier is DTW that uses warping windows [17].
In contrast to FastAWARD, which determines the optimal warping window size ropt ,
INSIGHT sets the warping-window size to a constant of 5%. (This selection is justified
by the results presented in [17], which show that relatively small window sizes lead to
higher accuracy.) In order to speed-up the calculations, we used the LB Keogh lower
bounding technique [10] for both INSIGHT and FastAWARD.
Table 1. Accuracy ± standard deviation for INSIGHT and FastAWARD (bold font: winner)
resulting from INSIGHT (INS- fG (x)) is worse by less than 0.05 compared to using the
entire training data. For FastAWARD this number is 4, which clearly shows that INSIGHT
selects more representative instances of the training set than FastAWARD.
Next, we investigate the reasons for the presented difference between INSIGHT and
FastAWARD. In Section 3.1, we identified the skewness of good k-occurrence, f_G^k(x), as
a crucial property for instance selection to work properly, since skewness renders good
hubs to become representative instances. In our examination, we found that using the
iterative procedure applied by FastAWARD, this skewness has a decreasing trend from
iteration to iteration. Figure 3 exemplifies this by illustrating the skewness of fG1 (x) for
two data sets as a function of iterations performed in FastAWARD. (In order to quanti-
tatively measure skewness we use the standardized third moment, see Equation 1.) The
reduction in the skewness of f_G^1(x) means that FastAWARD is not able to identify
representative instances in the end, since there are no pronounced good hubs remaining.
To further understand that the reduced effectiveness of FastAWARD stems from its
iterative procedure and not from its score function, fXi (x) (Eq. 4), we compare the
accuracy of all variations of INSIGHT including INS- fXi (x), see Tab. 2. Remarkably,
INS- fXi (x) clearly outperforms FastAWARD for the majority of cases, which verifies
our previous statement. Moreover, the differences between the three variations are not
large, indicating the robustness of INSIGHT with respect to the scoring function.
Fig. 3. Skewness of the distribution of f_G^1(x) as a function of the number of iterations performed in FastAWARD. Overall, the skewness decreases from iteration to iteration.
Table 2. Number of datasets where different versions of INSIGHT win/lose against FastAWARD
Table 3. Execution times (in seconds, averaged over 10 folds) of instance selection using IN-
SIGHT and FastAWARD for some datasets
first step (leave-one-out nearest-neighbor classification of the training instances) already
requires O(n2 ) execution time. However, FastAWARD performs additional computa-
tionally expensive steps, such as determining the best warping-window size and the
iterative procedure for excluding instances. For this reason, INSIGHT is expected to
require reduced execution time compared to FastAWARD. This is verified by the re-
sults presented in Table 3, which show the execution time needed to perform instance
selection with INSIGHT and FastAWARD. As expected, INSIGHT outperforms Fast-
AWARD drastically. (Regarding the time for classifying new instances, please notice
that both methods perform 1-NN using the same number of selected instances, there-
fore the classification times are equal.)
References
1. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Machine Learn-
ing 6(1), 37–66 (1991)
2. Brighton, H., Mellish, C.: Advances in Instance Selection for Instance-Based Learning Al-
gorithms. Data Mining and Knowledge Discovery 6, 153–172 (2002)
3. Buza, K., Nanopoulos, A., Schmidt-Thieme, L.: Time-Series Classification based on Indi-
vidualised Error Prediction. In: IEEE CSE 2010 (2010)
4. Chakrabarti, K., Keogh, E., Sharad, M., Pazzani, M.: Locally adaptive dimensionality reduc-
tion for indexing large time series databases. ACM Transactions on Database Systems 27,
188–228 (2002)
5. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT
Press, Cambridge (2001)
6. Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.: Querying and Mining of
Time Series Data: Experimental Comparison of Representations and Distance Measures. In:
VLDB 2008 (2008)
7. Gunopulos, D., Das, G.: Time series similarity measures and time series indexing. ACM
SIGMOD Record 30, 624 (2001)
8. Jankowski, N., Grochowski, M.: Comparison of instances seletion algorithms I. Algorithms
survey. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC
2004. LNCS (LNAI), vol. 3070, pp. 598–603. Springer, Heidelberg (2004)
9. Jankowski, N., Grochowski, M.: Comparison of instance selection algorithms II. Results and
Comments. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC
2004. LNCS (LNAI), vol. 3070, pp. 580–585. Springer, Heidelberg (2004)
10. Keogh, E.: Exact indexing of dynamic time warping. In: VLDB 2002 (2002)
11. Keogh, E., Kasetty, S.: On the Need for Time Series Data Mining Benchmarks: A Survey
and Empirical Demonstration. In: SIGKDD (2002)
12. Ougiaroglou, S., Nanopoulos, A., Papadopoulos, A.N., Manolopoulos, Y., Welzer-Druzovec,
T.: Adaptive k-Nearest-Neighbor Classification Using a Dynamic Number of Nearest Neigh-
bors. In: Ioannidis, Y., Novikov, B., Rachev, B. (eds.) ADBIS 2007. LNCS, vol. 4690, pp.
66–82. Springer, Heidelberg (2007)
13. Lin, J., Keogh, E., Lonardi, S., Chiu, B.: A Symbolic Representation of Time Series, with
Implications for Streaming Algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop
on Research Issues in Data Mining and Knowledge Discovery (2003)
14. Liu, H., Motoda, H.: On Issues of Instance Selection. Data Mining and Knowledge Discov-
ery 6, 115–130 (2002)
15. Radovanovic, M., Nanopoulos, A., Ivanovic, M.: Nearest Neighbors in High-Dimensional
Data: The Emergence and Influence of Hubs. In: ICML 2009 (2009)
16. Radovanovic, M., Nanopoulos, A., Ivanovic, M.: Time-Series Classification in Many Intrin-
sic Dimensions. In: 10th SIAM International Conference on Data Mining (2010)
17. Ratanamahatana, C.A., Keogh, E.: Three myths about Dynamic Time Warping. In: SDM
(2005)
18. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recog-
nition. IEEE Trans. Acoustics, Speech and Signal Proc. 26, 43–49 (1978)
19. Wettschereck, D., Dietterich, T.: Locally Adaptive Nearest Neighbor Algorithms. Advances
in Neural Information Processing Systems 6 (1994)
20. Xi, X., Keogh, E., Shelton, C., Wei, L., Ratanamahatana, C.A.: Fast Time Series Classifica-
tion Using Numerosity Reduction. In: Airoldi, E.M., Blei, D.M., Fienberg, S.E., Goldenberg,
A., Xing, E.P., Zheng, A.X. (eds.) ICML 2006. LNCS, vol. 4503. Springer, Heidelberg (2007)
Multiple Time-Series Prediction through Time-Series Relationships Profiling
1 Introduction
Previous studies have found that in multiple time-series data relating to real-world
phenomena in the biological and economic domains, dynamic relationships between
series exist and, governed by these relationships, the series move together
through time. For instance, it is well known that movement of a stock market
index in a specific country is affected by the movements of other stock market
indexes across the world or in that particular region [1],[2],[3]. Likewise, in a
Gene Regulatory Network (GRN) the expression level of a Gene is determined
by its time varying interactions with other Genes [4],[5].
However, even though time-series prediction has been extensively researched,
and some prominent methods from the machine learning and data mining arenas
such as the Multi-Layer Perceptron and Support Vector Machines have been
developed, there has been no research so far into developing a method that
can predict multiple time-series simultaneously based on interactions between
the series. The closest works that take multiple time-series variables into account
are those of [6],[7],[8],[9], which generally use historical values of some
independent variables as inputs to a model that estimates future values of a
dependent variable. Consequently, these methods do not have the capability to
capture and model the dynamics of relationships in a multiple time-series dataset
and to predict their future values simultaneously.
This research proposes a new method for modeling the dynamics of relationships
in multiple time-series and for simultaneously predicting their future values
without the need to generate multiple models. The work thus focuses on the
discovery of profiles of relationships in a multiple time-series dataset and the recurring
trends of movement that occur in a specific relationship's form, to construct
a knowledge repository of the system under evaluation. The identification and
exploitation of these profiles and recurring trends is expected to provide the knowledge
needed to perform simultaneous multiple time-series prediction and to
significantly improve the accuracy of time-series prediction.
The rest of the paper is organized as follows; in the next section we briefly re-
view the issues involved in time-series modeling and cover the use of both Global
and Localized models. Section 3 describes and explains the method proposed in
this paper to discover profiles of relationships in multiple time-series dataset
and their recurring trends of movement. In section 4 we present our experimen-
tal findings; finally, in section 5 we conclude the paper, summarizing the main
achievements of the research and briefly outlining some directions for
future research.
developed for each cluster (i.e. local regressions) that will yield better accuracy
over the local problem space covered by the model in contrast to a global model.
Having a set of local models also offers greater flexibility as predictions can be
made either on the basis of a single model or, if needed, on a global level by
combining the predictions made by the individual local models [11].
Most of the work in clustering time-series data has concentrated on sample clus-
tering rather than variable clustering [13]. However, one of the key tasks in
our work is to group together series that are highly correlated and have similar
shapes of movement (variable clustering), as we believe that these local models
representing clusters of similar profiles will provide a better basis than a single
global model for predicting future movement of the multiple time-series.
The first step in extracting profiles of relationships between multiple time-
series is the computation of cross-correlation coefficients between the observed
time-series using the Pearson’s correlation analysis. Additionally, only those sta-
tistically significant correlation coefficients identified by the t-test with confi-
dence level of 95% are taken into account. The following step of the algorithm
is to measure dissimilarity between time-series from the Pearson’s correlation
Fig. 1. The Pearson’s correlation coefficient matrix is calculated from a given multiple
time-series data (TS-1,TS-2,TS-3,TS-4), and then converted to normalized correlation
[Equation 1] before the profiles are finally extracted
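A minimal sketch of this first step is given below: pairwise Pearson correlations are computed and only coefficients whose p-value passes the 95% significance test are retained; setting the non-significant entries to zero is our assumption about how they are left out of further processing.

import numpy as np
from scipy.stats import pearsonr

def significant_correlations(series, alpha=0.05):
    # series: 2-D array with one time series per row. Returns the correlation
    # matrix with statistically non-significant coefficients set to zero.
    k = series.shape[0]
    C = np.eye(k)
    for i in range(k):
        for j in range(i + 1, k):
            r, p = pearsonr(series[i], series[j])    # t-test p-value for r
            C[i, j] = C[j, i] = r if p < alpha else 0.0
    return C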
and extraction of profiles, extracted profiles of relationships are stored and updated
dynamically instead of being computed on the fly.
To detect and cluster recurring trends of movement from localized sets of time-series
data, an algorithm that extracts patterns of movement in the form of a
polynomial regression function and groups them on the basis of similarity in
the regression coefficients was proposed in a previous study [15]. However,
in the algorithm proposed here, to eliminate the assumption that the data are drawn
from a Gaussian distribution when estimating the regression function,
non-parametric regression analysis is used in place of polynomial regression.
Here x_j^(i) = (x_{j1}^(i), ..., x_{jk}^(i)) is the extended smaller value of the original data
X^(i) at domain j and a certain small step dx, where j = 1, 2, ..., (n/dx + 1).
x_j = (x_{j1}, ..., x_{jk}) is calculated using the Gaussian MF equation as follows,

    x_{jk} = K(x_j, k) = exp( −(x_j − k)^2 / (2α^2) ),    (3)

where x_j = dx × (j − 1), k = 1, 2, ..., n and α is a pre-defined kernel bandwidth.
The kernel weight w_i is estimated using common OLS (ordinary least squares)
such that the following objective function is minimized,

    SSR = Σ_{k=1}^{n} ( X_k^(i) − X̂_j^(i) ),    ∀ X̂_j^(i) where x_j = X_k,    (4)

where x_j^(u) = dx × (j^(u) − 1); j^(u) = 1, 2, ..., ((n+1)/dx + 1); k = 1, 2, ..., n + 1 and
the kernel weights w_i^(u) = (w_{i1}^(u), w_{i2}^(u), ..., w_{i(n+1)}^(u)).
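As we read Eqs. (3)–(4), the trend of a snapshot is obtained by placing Gaussian kernels on a grid of step dx and fitting the kernel weights by ordinary least squares; the sketch below follows that reading, with the grid step and bandwidth values chosen arbitrarily for illustration rather than taken from the paper.

import numpy as np

def extract_trend(X, dx=0.5, alpha=2.0):
    # Fit kernel weights w so that the smoothed curve reproduces snapshot X;
    # fitted[k-1] = sum_j w[j] * exp(-(x_j - k)^2 / (2 alpha^2)), with x_j = dx*(j-1).
    X = np.asarray(X, dtype=float)
    n = len(X)
    grid = np.arange(0.0, n + dx, dx)
    ks = np.arange(1, n + 1)
    Phi = np.exp(-(grid[None, :] - ks[:, None]) ** 2 / (2 * alpha ** 2))
    w, *_ = np.linalg.lstsq(Phi, X, rcond=None)      # ordinary least squares weights
    return grid, w, Phi @ w

X = np.sin(np.linspace(0, 3, 30)) + 0.05 * np.random.randn(30)
print(np.round(extract_trend(X)[2][:5], 2))          # start of the fitted trend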
– Step 3, if there is no more data snapshot, then the process stops (go to
Step 7); else next snapshot, X(i) , is taken. Extract trend of movement from
X(i) as in Step 2, and calculate distances between current trend and all m
already created cluster centres defined as,
where l = 1, 2, ..., m. If a cluster centre Cc_l is found for which D_{i,l} ≤ Ru_l, then
the current trend joins cluster C_l and the step is repeated; else continue to the next
step.
– Step 4, find a cluster Ca (with centre Cca and cluster radius Rua ) from all
m existing cluster centres by calculating the values of Si,a given by,
Clusters of trends of movement are then stored in each extracted profile of relationship.
This information about the profiles and the trends of movement inside
them will then be exploited as the knowledge repository for performing simultaneous
multiple time-series prediction.
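Because the distance and update formulae of Steps 3 and 4 are not reproduced above, the sketch below substitutes a generic evolving-clustering loop in their place: Euclidean distance between trend-weight vectors, a fixed distance threshold, and simple centre/radius updates are all our assumptions, so it conveys only the control flow of the clustering, not the authors' exact method.

import numpy as np

def evolve_clusters(trends, dthr=1.0):
    # Cluster trend vectors one snapshot at a time (evolving-clustering sketch).
    centres, radii, members = [], [], []
    for i, t in enumerate(trends):
        x = np.asarray(t, dtype=float)
        if not centres:
            centres.append(x.copy()); radii.append(0.0); members.append([i])
            continue
        d = [float(np.linalg.norm(x - c)) for c in centres]
        l = int(np.argmin(d))
        if d[l] <= radii[l]:                         # Step 3: covered, just join cluster l
            members[l].append(i)
        elif d[l] + radii[l] <= 2 * dthr:            # Step 4: absorb x and grow cluster l
            new_r = (d[l] + radii[l]) / 2.0
            centres[l] = x + (centres[l] - x) * (new_r / d[l])   # slide centre toward x
            radii[l] = new_r
            members[l].append(i)
        else:                                        # otherwise start a new cluster
            centres.append(x.copy()); radii.append(0.0); members.append([i])
    return centres, radii, members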
After the repository has been built, there are two further steps that need
to be performed before prediction can take place. The first step is to extract
current profiles of relationships between the multiple series. Thereafter, matches
are found between the current trajectory and previously stored profiles from the
past. Predictions are then made by implementing a weighting scheme that gives
more importance to pairs of series that belong to the same profile and retain
comparable trends of movement. The weight w_{ij} for a given pair (i, j) of series is
given by the similarity distance between them.
Air pressure data collected from various locations in New Zealand by the Na-
tional Institute of Weather and Atmosphere (NIWA, http://www.niwa.co.nz)
constitutes the multiple time-series in this research. The data covers a period of
three years, ranging from 1st January 2007 to 31st December 2009.
Findings from a previous study of the global weather system [16], which argue
that a small change to one part of the system can lead to a complete change
in the weather system as a whole, are the key reason for using this
dataset. Consequently, being able to reveal profiles of relationship between air
pressure at different locations at various time-points would help us to understand
more about the behavior of the weather system and would also facilitate
constructing a reliable prediction model.
Fig. 4. Results of 100 days (1st October 2009 to 31st December 2009) air pressure level
prediction at four observation locations (Paeroa, Auckland, Hamilton and Reefton) in
New Zealand
results confirm proposals in previous studies which state that by being able to
reveal and understand characteristics of relationships between variables in mul-
tiple time-series data, one can predict their future states or behavior accurately
[2],[3],[4].
In addition, as expected, the proposed algorithm is not only able to
provide excellent accuracy in predicting future values, but is also capable of
extracting knowledge about profiles of relationship between different locations in
New Zealand (in terms of movement of air pressure level) and of clustering recurring
trends which exist in the series, as illustrated in Figures 5 and 6. Consequently,
our study is also able to reveal that the air pressure levels in the four locations
are highly correlated and tend to move in a similar fashion through time. This
is shown by the circle in the lower left corner of Figure 5, where the number of
occurrences of this profile is 601.
Fig. 5. Extracted profiles of relationship from air pressure data. The radius represents
average normalized correlation coefficient, while N indicates number of occurrences of
a distinct profile.
Fig. 6. Created clusters of recurring trends when Paeroa, Auckland, Hamilton and
Reefton are detected to be progressing in a highly correlated similar manner
References
1. Collins, D., Biekpe, N.: Contagion and Interdependence in African Stock Markets.
The South African Journal of Economics 71(1), 181–194 (2003)
2. Masih, A., Masih, R.: Dynamic Modeling of Stock Market Interdependencies: An
Empirical Investigation of Australia and the Asian NICs. Working Papers 98-18,
pp. 1323–9244. University of Western Australia (1998)
3. Antoniou, A., Pescetto, G., Violaris, A.: Modelling International Price Relation-
ships and Interdependencies between the Stock Index and Stock Index Future Mar-
kets of Three EU Countries: A Multivariate Analysis. Journal of Business, Finance
and Accounting 30, 645–667 (2003)
4. Kasabov, N., Chan, Z., Jain, V., Sidorov, I., Dimitrov, D.: Gene Regulatory Net-
work Discovery from Time-series Gene Expression Data: A Computational Intelli-
gence Approach. In: Pal, N.R., Kasabov, N., Mudi, R.K., Pal, S., Parui, S.K. (eds.)
ICONIP 2004. LNCS, vol. 3316, pp. 1344–1353. Springer, Heidelberg (2004)
5. Friedman, L., Nachman, P.: Using Bayesian Networks to Analyze Expression Data.
Journal of Computational Biology 7, 601–620 (2000)
6. Liu, B., Liu, J.: Multivariate Time Series Prediction via Temporal Classification.
In: Proc. IEEE ICDE 2002, pp. 268–275. IEEE, Los Alamitos (2002)
7. Kim, T., Adali, T.: Approximation by Fully Complex Multilayer Perceptrons. Neu-
ral Computation 15, 1641–1666 (2003)
8. Yang, H., Chan, L., King, I.: Support Vector Machine Regression for Volatile Stock
Market Prediction. In: Yellin, D.M. (ed.) Attribute Grammar Inversion and Source-
to-source Translation. LNCS, vol. 302, pp. 143–152. Springer, Heidelberg (1988)
9. Zanghui, Z., Yau, H., Fu, A.M.N.: A new stock price prediction method based on
pattern classification. In: Proc. IJCNN 1999, pp. 3866–3870. IEEE, Los Alamitos
(1999)
10. Holland, J.H., Holyoak, K.J., Nisbett, R.E., Thagard, P.R.: Induction: Processes
of Inference, Learning and Discovery, Cambridge, MA, USA (1989)
11. Kasabov, N.: Global, Local and Personalised Modelling and Pattern Discovery in
Bioinformatics: An Integrated Approach. Pattern Recognition Letters 28, 673–685
(2007)
12. Song, Q., Kasabov, N.: ECM - A Novel On-line Evolving Clustering Method and Its
Applications. In: Posner, M.I. (ed.) Foundations of Cognitive Science, pp. 631–682.
MIT Press, Cambridge (2001)
13. Rodrigues, P., Gama, J., Pedroso, P.: Hierarchical Clustering of Time-Series Data
Streams. IEEE TKDE 20(5), 615–627 (2008)
14. Ben-Dor, A., Shamir, R., Yakhini, Z.: Clustering Gene Expression Patterns. Journal
of Computational Biology 6(3/4), 281–297 (1999)
15. Widiputra, H., Kho, H., Lukas, Pears, R., Kasabov, N.: A Novel Evolving Clus-
tering Algorithm with Polynomial Regression for Chaotic Time-Series Prediction.
In: Leung, C.S., Lee, M., Chan, J.H. (eds.) ICONIP 2009. LNCS, vol. 5864, pp.
114–121. Springer, Heidelberg (2009)
16. Vitousek, P.M.: Beyond Global Warming: Ecology and Global Change. Ecol-
ogy 75(7), 1861–1876 (1994)
Probabilistic Feature Extraction from
Multivariate Time Series Using Spatio-Temporal
Constraints
1 Introduction
A multivariate time series (MTS) is a sequential collection of high dimensional
observations generated by a dynamical system. The high dimensionality of MTS
creates challenges for machine learning and data mining algorithms. To tackle
this, feature extraction techniques are required to obtain computationally effi-
cient and compact representations.
Gaussian Process Latent Variable Model [5] (GPLVM) is one of the most
powerful nonlinear feature extraction algorithms. GPLVM emerged in 2004 and
instantly made a breakthrough in dimensionality reduction research. The novelty
of this approach is that, in addition to the optimisation of low dimensional
coordinates during the dimensionality reduction process as other methods did,
it marginalises parameters of a smooth and nonlinear mapping function from low
to high dimensional space. As a consequence, GPLVM defines a continuous low
dimensional representation of high dimensional data, which is called latent space.
Since GPLVM is a very flexible approach, it has been successfully applied in a
range of application domains including pose recovery [2], human tracking [20],
computer animation [21], data visualization [5] and classification [19].
However, extensive study of the GPLVM framework has revealed some essen-
tial limitations of the basic algorithm [5, 6, 8, 12, 19, 21, 22]. First, since GPLVM
aims at retaining the global structure of the data in the latent space, there is no guaran-
tee that local features are preserved. As a result, the natural topology of the
data manifold may not be maintained. This is particularly problematic when
data, such as MTS, have a strong and meaningful intrinsic structure. In addi-
tion, when data are captured from different sources, even after normalisation,
GPLVM tends to produce latent spaces which fail to represent common local fea-
tures [8, 21]. This prevents successful utilisation of the GPLVM framework for
feature extraction of MTS. In particular, GPLVM cannot be applied in many
classification applications such as speech and action recognition, where latent
spaces should be inferred from time series generated by different subjects and
used to classify data produced by unknown individuals. Another drawback of
GPLVM is its computationally expensive learning process [5, 6, 21] which may
converge towards a local minimum if the initialization of the model is poor [21].
Although recent extensions of GPLVM, i.e. back constrained GPLVM [12] (BC-GPLVM) and the Gaussian Process Dynamical Model [22] (GPDM), allow satisfactory representation of time series, the creation of generalised latent spaces from data obtained from several sources is still an unsolved problem which has never been addressed by the research community. In this paper, we define 'style' as the data variations between two or more datasets representing a similar phenomenon; these variations can be produced by different sources and/or by repetitions from a single source. Here, we propose an extension of the GPLVM framework, i.e. Spatio-Temporal GPLVM (ST-GPLVM), which produces a generalised and probabilistic representation of MTS in the presence of stylistic variations. Our main contribution is the integration of a spatio-temporal 'constraining' prior distribution over the latent space within the optimisation process.
After a brief review of the state of the art, we introduce the proposed methodology. Then, we qualitatively validate our method on a real dataset of human behavioural time series. Afterwards, we apply our method to a challenging view independent action recognition task. Finally, conclusions are presented.
2 Related Work
Feature extraction methods can be divided into two general categories, i.e., deter-
ministic and probabilistic frameworks. The deterministic methods can be further
classified into two main classes: linear and nonlinear methods. Linear methods
like PCA cannot model the curvature and nonlinear structures embedded in ob-
served spaces. As a consequence, nonlinear methods, such as Isomap [17], locally
linear embedding [13] (LLE), Laplacian Eigenmaps [1] (LE) and kernel PCA [14],
were proposed to address this issue. Isomap, LLE and LE aim at preserving a
specific geometrical property of the underlying manifold by constructing graphs
which encapsulate nonlinear relationships between points. However, they do not
provide any mapping function between spaces. In contrast, kernel PCA obtains the embedded space through a nonlinear kernel-based mapping from the high to the low dimensional space. In order to deal with MTS, extensions of Isomap [3], LE [8]
and the kernel based approach [15] were proposed.
3 Methodology
Let a set of multivariate time series Y consist of multiple repetitions (or cycles) of the same phenomenon from the same or different sources, and let all data points {y_i}_{i=1..N} in this set be distributed on a manifold in a high dimensional space
where the priors of the unknowns are $p(X) = N(0, I)$ and $p(\Phi) \propto \prod_i \Phi_i^{-1}$. The maximisation of the above posterior is equivalent to minimising the negative log posterior of the model:

$$-\ln p(X, \Phi \mid Y) = \frac{1}{2}\Big((DN+1)\ln 2\pi + D\ln|K| + \mathrm{tr}(K^{-1}YY^T) + \sum_i \|x_i\|^2\Big) + \sum_i \ln \Phi_i \qquad (4)$$
This optimization process can be achieved numerically using the scaled conjugate gradient [11] (SCG) method with respect to Φ and X. However, the learning process is computationally very expensive, since O(N^3) operations are required in each gradient step to invert the kernel matrix K [5]. Therefore, in practice, a sparse approximation to the full Gaussian process, such as the 'fully independent training conditional' (FITC) approximation [6] or active set selection [5], is exploited to reduce the computational complexity to a more manageable O(k^2 N), where k is the number of points involved in the lower rank sparse approximation of the covariance [6].
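To make Eq. (4) concrete, the following is a minimal numpy sketch of evaluating the GPLVM negative log posterior; the RBF-plus-noise kernel, its parameterisation and the treatment of the hyper-parameter prior term Σ_i ln Φ_i are illustrative assumptions rather than the exact choices used in the paper.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0, noise=1e-2):
    """Assumed covariance: K_ij = exp(-gamma/2 * ||x_i - x_j||^2) + noise * delta_ij."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-0.5 * gamma * np.maximum(d2, 0.0)) + noise * np.eye(X.shape[0])

def gplvm_neg_log_posterior(X, Y, gamma=1.0, noise=1e-2):
    """Eq. (4): 0.5*((DN+1) ln 2pi + D ln|K| + tr(K^{-1} Y Y^T) + sum_i ||x_i||^2) + sum_i ln Phi_i."""
    N, D = Y.shape
    K = rbf_kernel(X, gamma, noise)
    _, logdet = np.linalg.slogdet(K)
    trace_term = np.sum(Y * np.linalg.solve(K, Y))     # tr(K^{-1} Y Y^T)
    latent_prior = np.sum(X ** 2)                      # sum_i ||x_i||^2 from p(X) = N(0, I)
    hyper_prior = np.log(gamma) + np.log(noise)        # sum_i ln Phi_i over the assumed hyper-parameters
    return 0.5 * ((D * N + 1) * np.log(2 * np.pi) + D * logdet + trace_term + latent_prior) + hyper_prior
```

In practice this objective would be minimised with SCG over X and the kernel hyper-parameters, or with a sparse approximation for large N, as described above.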
Fig. 2. Temporal (a) and spatial (b) neighbours (green dots) of a given data point Pi (red dot)
$$p(X \mid L) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\mathrm{tr}(XLX^T)}{2\sigma^2}\right) \qquad (7)$$
where σ represents a global scaling of the prior and controls the ’strength’ of the
constraining prior. Note that although distance between neighbours (especially
spatial ones) may be large in L, it is infinite between unconnected points.
The maximisation of the new objective function (5) is equivalent to minimising
the negative log posterior of the model:
$$-\ln p(X, \Phi \mid Y, L) = \frac{1}{2}\Big(D\ln|K| + \mathrm{tr}(K^{-1}YY^T) + \sigma^{-2}\,\mathrm{tr}(XLX^T) + C\Big) + \sum_i \ln \Phi_i \qquad (8)$$

where C is the constant $(DN+1)\ln 2\pi + \ln\sigma^2$. Following the standard GPLVM approach, the learning process involves minimising Eq. 8 with respect to Φ and X iteratively using the SCG method [11] until convergence.
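Continuing the sketch above, Eq. (8) replaces the N(0, I) prior over X with the spatio-temporal constraining prior of Eq. (7); assuming the neighbourhood matrix L has already been built (its construction is not reproduced here), the objective can be evaluated as follows.

```python
def st_gplvm_neg_log_posterior(X, Y, L, sigma, gamma=1.0, noise=1e-2):
    """Eq. (8): 0.5*(D ln|K| + tr(K^{-1} Y Y^T) + sigma^{-2} tr(X L X^T) + C) + sum_i ln Phi_i,
    with C = (D*N + 1) ln 2pi + ln sigma^2.  X stores latent points row-wise, so tr(X L X^T)
    in the paper's column-wise notation equals np.trace(X.T @ L @ X) here.
    Reuses rbf_kernel from the previous sketch (an assumed kernel)."""
    N, D = Y.shape
    K = rbf_kernel(X, gamma, noise)
    _, logdet = np.linalg.slogdet(K)
    trace_term = np.sum(Y * np.linalg.solve(K, Y))              # tr(K^{-1} Y Y^T)
    st_prior = np.trace(X.T @ L @ X) / sigma ** 2               # the spatio-temporal constraining prior term
    C = (D * N + 1) * np.log(2 * np.pi) + np.log(sigma ** 2)
    hyper_prior = np.log(gamma) + np.log(noise)                 # sum_i ln Phi_i (assumed hyper-parameters)
    return 0.5 * (D * logdet + trace_term + st_prior + C) + hyper_prior
```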
ST-GPLVM is initialised using a nonlinear feature extraction method, i.e.
temporal LE [8] which is able to preserve the constraints L in the produced em-
bedded space. Consequently, compared to the standard usage of linear PPCA,
initialisation is more likely to be closer to the global optimum. In addition, the
enhancement of the objective function (3) with the prior (7) constrains the op-
timisation process and therefore further mitigates the problem of local minima.
The topological structure in terms of spatio-temporal dependencies is implic-
itly preserved in the latent space without enforcing any domain specific prior
knowledge.
The proposed methodology can be applied to other GPLVM based approaches,
such as BC-GPLVM [12] and GPDM [22] by integrating the prior (7) in their cost
function. The extension of BC-GPLVM results in a spatio-temporal model which
provides bidirectional mapping between latent and high dimensional spaces. Al-
ternatively, ST-GPDM produces a spatio-temporal model with an associated
nonlinear dynamical process in the latent space. Finally, the proposed extension
is compatible with a sparse approximation of the full Gaussian process [5, 6]
which allows processing complexity to be reduced further.
Fig. 3. 3D models learned from walking sequences of 3 different subjects with corre-
sponding first 2 dimensions and processing times: a) GPLVM, b) ST-GPLVM, c) BC-
GPLVM, d) ST-BC-GPLVM, e) GPDM and f) ST-GPDM. Warm-coloured regions
correspond to high reconstruction certainty.
Our evaluation is conducted using time series of MoCap data, i.e. repeated ac-
tions provided by the HumanEva dataset [16]. The MoCap time series are firstly
converted into normalized sequences of poses, i.e. invariant to subject’s rotation
and translation. Then each pose is represented as a set of quaternions, i.e. a 52-
dimension feature vector. In this experiment, we consider three different subjects performing a walking action comprising 500 frames each. The dimensionality of the walking action space is reduced to 3 dimensions [20, 22]. During the learning process, the computational complexity is reduced using FITC [6], where the number of inducing variables is set to 10% of the data. The global scaling of the constraining prior, σ, and the width of the back constrained kernel [12] were set empirically to 10^4 and 0.1 respectively. Values of all the other parameters of the
models were estimated automatically using maximum likelihood optimisation.
The back constrained models used a RBF kernel [12].
The learned latent spaces for the walking sequences with the corresponding
first two dimensions and processing times are presented in figure 3. Qualitative
analysis confirms the generalisation property of the proposed extension. Stan-
dard GPLVM based approaches discriminate between subjects in the spatially
distinct latent space regions. Moreover, action repetitions by a given subject
are represented separately. In contrast, the introduction of our spatio-temporal
Fig. 4. Pipeline for generation of probabilistic view and style invariant action
descriptor
Table 1. Left, average recognition accuracy over all cameras using either single or
multiple views for testing. Right, class-confusion matrix using multiple views.
Average accuracy:

    Method            Subjects/Actions   Single view   All views
    Weinland [23]         10 / 11           63.9          81.3
    Yan [24]              12 / 11           64.0          78.0
    Junejo [4]            10 / 11           74.1           -
    Liu [9]               12 / 13           71.7          78.5
    Liu [10]              12 / 13           73.7          82.8
    Lewandowski [7]       12 / 12           73.2          83.1
    Our                   12 / 12           76.1          85.4
Action recognition results are compared with the state of the art in table 1
(top view excluded). Examples of learned view and style invariant action descrip-
tors using ST-GPLVM are shown in figure 5. Although different approaches may
use slightly different experimental settings, table 1 shows that our framework
produces the best performance. In particular, it improves the accuracy of the
standard framework [7]. The confusion matrix of recognition for the ’all-view’
experiment reveals that our framework performed better when dealing with mo-
tions involving the whole body, i.e. ”walk”, ”sit down”, ”get up”, ”turn around”
Probabilistic Feature Extraction from Multivariate Time Series 183
Fig. 5. Probabilistic view and style invariant action descriptors obtained using ST-
GPLVM for a) sit down, b) cross arms, c) turn around and d) kick
and "pick up". As expected, the best recognition rates, 78.7% and 80.7%, are obtained for cameras 2 and 4 respectively, since those views are similar to those used for training, i.e. side views. Moreover, when dealing with either a different view, i.e. camera 1, or even a significantly different one, i.e. camera 3, our framework still achieves good recognition rates, i.e. 75.2% and 69.9% respectively.
6 Conclusion
This paper introduces a novel probabilistic approach for nonlinear feature extrac-
tion called Spatio-Temporal GPLVM. Its main contribution is the inclusion of
spatio-temporal constraints in the form of a conditioned prior into the standard
GPLVM framework in order to discover generalised latent spaces of MTS. All
conducted experiments confirm the generalisation power of the proposed concept
in the context of classification applications where marginalising style variabil-
ity is crucial. We applied the proposed extension on different GPLVM variants
and demonstrated that their Spatio-Temporal versions produce smoother, co-
herent and visually more convincing descriptors at a lower computational cost.
In addition, the methodology has been validated in a view independent action
recognition framework and produced state-of-the-art accuracy. Consequently, the concept of a consistent representation of time series should benefit many other applications beyond action recognition, such as gesture, sign-language and speech recognition.
References
1. Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding
and clustering. In: Proc. NIPS, vol. 14, pp. 585–591 (2001)
2. Ek, C., Torr, P., Lawrence, N.D.: Gaussian process latent variable models for
human pose estimation. Machine Learning for Multimodal Interaction, 132–143
(2007)
3. Jenkins, O., Matarić, M.: A spatio-temporal extension to isomap nonlinear dimen-
sion reduction. In: Proc. ICML, pp. 441–448 (2004)
4. Junejo, I., Dexter, E., Laptev, I., Pérez, P.: Cross-view action recognition from
temporal self-similarities. In: Proc. ECCV, vol. 12 (2008)
5. Lawrence, N.: Gaussian process latent variable models for visualisation of high
dimensional data. In: Proc. NIPS, vol. 16 (2004)
6. Lawrence, N.: Learning for larger datasets with the Gaussian process latent variable
model. In: Proc. AISTATS (2007)
7. Lewandowski, J., Makris, D., Nebel, J.C.: View and style-independent action man-
ifolds for human activity recognition. In: Daniilidis, K., Maragos, P., Paragios, N.
(eds.) ECCV 2010. LNCS, vol. 6316, pp. 547–560. Springer, Heidelberg (2010)
8. Lewandowski, M., Martinez-del-Rincon, J., Makris, D., Nebel, J.-C.: Temporal
extension of laplacian eigenmaps for unsupervised dimensionality reduction of time
series. In: Proc. ICPR (2010)
9. Liu, J., Ali, S., Shah, M.: Recognizing human actions using multiple features. In:
Proc. CVPR (2008)
10. Liu, J., Shah, M.: Learning human actions via information maximization. In: Proc.
CVPR (2008)
11. Möller, M.: A scaled conjugate gradient algorithm for fast supervised learning.
Neural Networks 6(4), 525–533 (1993)
12. Lawrence, N.D., Quinonero-Candela, J.: Local Distance Preservation in the
GP-LVM Through Back Constraints. In: Proc. ICML, pp. 513–520 (2006)
13. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embed-
ding. Science 290(5500), 2323–2326 (2000)
14. Schölkopf, B., Smola, A., Müller, K.: Kernel principal component analysis. In:
ICANN, pp. 583–588 (1997)
15. Shyr, A., Urtasun, R., Jordan, M.: Sufficient dimension reduction for visual se-
quence classification. In: Proc. CVPR (2010)
16. Sigal, L., Black, M.: HumanEva: Synchronized Video and Motion Capture Dataset
for Evaluation of Articulated Human Motion. Brown University (2006)
17. Tenenbaum, J., Silva, V., Langford, J.: A global geometric framework for nonlinear
dimensionality reduction. Science 290(5500), 2319–2323 (2000)
18. Tipping, M., Bishop, C.: Probabilistic principal component analysis. Journal of the
Royal Statistical Society, Series B 61, 611–622 (1999)
19. Urtasun, R., Darrell, T.: Discriminative Gaussian process latent variable model for
classification. In: Proc. ICML, pp. 927–934 (2007)
20. Urtasun, R., Fleet, D.J., Fua, P.: 3D people tracking with gaussian process dynam-
ical models. In: Proc. CVPR, vol. 1, pp. 238–245 (2006)
21. Urtasun, R., Fleet, D., Geiger, A., Popović, J., Darrell, T., Lawrence, N.:
Topologically-constrained latent variable models. In: Proc. ICML (2008)
22. Wang, J., Fleet, D., Hertzmann, A.: Gaussian process dynamical models. In: Proc.
NIPS, vol. 18, pp. 1441–1448 (2006)
23. Weinland, D., Boyer, E., Ronfard, R.: Action recognition from arbitrary views
using 3D exemplars. In: Proc. ICCV, vol. 5(7), p. 8 (2007)
24. Yan, P., Khan, S., Shah, M.: Learning 4D action feature models for arbitrary view
action recognition. In: Proc. CVPR, vol. 12 (2008)
Real-Time Change-Point Detection Using
Sequentially Discounting Normalized Maximum
Likelihood Coding
1 Introduction
1.1 Motivation
We are concerned with the issue of detecting change points in time series. Here
a change-point is the time point at which the statistical nature of time series
suddenly changes. Hence the detection of that point may lead to the discovery
of a novel event. The issue of change-point detection has recently received a great deal of attention in the area of data mining ([1], [9], [2], etc.). This is because it can be
applied to a wide variety of important data mining problems, such as the detection of failures of computer devices from performance data (e.g. CPU loads) and the detection of malicious executables from computer access logs.
We require that change-point detection be conducted in real time. This requirement is crucial in real environments such as security monitoring, system monitoring, etc. Hence we wish to design a real-time change-point detection algorithm such that every time a datum is input, it gives a score measuring to what extent that point is likely to be a change point. Further, it is desirable that such an algorithm detects change points as early as possible with as few false alarms as possible.
We attempt to design a change-point detection algorithm on the basis of data compression. The basic idea is that a change point may be considered as a time point after which the data can no longer be compressed well using the statistical nature observed so far. The important notion of sequentially normalized maximum likelihood (SNML) coding has been developed in the scenario of sequential source coding [4], [6], [5]. It has turned out to attain the shortest code-length among possible coding methods. Hence, from the information-theoretic viewpoint, it is intuitively reasonable that the time point at which the SNML code-length suddenly changes can be thought of as a change point. However, SNML coding has never been applied to the issue of change-point detection. Further, in the case where data sources are non-stationary, SNML should be extended so that the data compression is adaptive to the time-varying nature of the sources.
As for the artificial data demonstration, we evaluate the performance of our method for two types of change points: continuous change points and discontinuous ones. As for the real data demonstration, we apply our method to a real security issue, namely malware detection. We empirically demonstrate that our method is able to detect unseen security incidents or their symptoms at significantly early stages. Through this demonstration we develop a new method for discovering unseen malware by means of change-point detection from web server logs.
This section introduces SDNML coding. Suppose that we observe a discrete-time series, which we denote as {x_t : t = 1, 2, ...}. We write x^t = x_1 ... x_t. Consider the parametric class F = {p(x_t | x^{t-1} : θ) : t = 1, 2, ...} of conditional probability distributions.
We call the quantity − log p_SNML(x_t | x^{t−1}) the SNML code-length. It is known from [5], [6] that the cumulative SNML code-length, which is the sum of the SNML code-lengths over the sequence, is optimal in the sense that it asymptotically achieves the shortest code-length. According to Rissanen's MDL principle [4], SNML leads to the best statistical model for explaining the data.
We employ here the AR model as a probabilistic model and introduce SDNML (sequentially discounting normalized maximum likelihood) coding for this model by extending SNML so that the effect of past data is gradually discounted as time goes on. The "discounting" function is important in real situations where the data source is non-stationary and the coding should adapt to it.
Let X ⊂ R be 1-dimensional and let x_t ∈ X for each t. We define the k-th order auto-regression (AR) model as follows:

$$p(x_t \mid x^{t-1}_{t-k} : \theta) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left(-\frac{1}{2\sigma^2}(x_t - w)^2\right), \qquad (2)$$

where $w = \sum_{i=1}^{k} A^{(i)} x_{t-i}$ and $\theta = (A^{(1)}, \ldots, A^{(k)}, \sigma^2)$.
Let r (0 < r < 1) be the discounting coefficient. Let m be the least sample size such that Eq. (3) is uniquely solved. Let $\hat{A}_t = (\hat{A}^{(1)}_t, \ldots, \hat{A}^{(k)}_t)^T$ be the discounting maximum likelihood estimate of the parameter $A_t = (A^{(1)}_t, \ldots, A^{(k)}_t)^T$ from $x^t$, i.e.,

$$\hat{A}_t = \arg\min_{A} \sum_{j=m+1}^{t} r(1-r)^{t-j}\,(x_j - A^T \bar{x}_j)^2, \qquad (3)$$

where $\bar{x}_j = (x_{j-1}, x_{j-2}, \ldots, x_{j-k})^T$. Here the discounting maximum likelihood estimate can be thought of as a modified variant of the maximum likelihood estimate in which a weighted likelihood is maximised, where the weight of the j-th past datum is $r(1-r)^{t-j}$. Hence the larger the discounting coefficient r is, the exponentially smaller the effect of past data becomes.
We further let $\hat{e}_t \stackrel{\rm def}{=} x_t - \hat{A}_t^T \bar{x}_t$. Then let us define the discounting maximum likelihood estimate of the variance from $x^t$ by

$$\hat{\tau}_t \stackrel{\rm def}{=} \arg\max_{\sigma^2} \prod_{j=m+1}^{t} p(x_j \mid x^{j-1}_{j-k} : \hat{A}_t, \sigma^2) = \frac{1}{t-m}\sum_{j=m+1}^{t} \hat{e}_j^2.$$
Below we give a method for sequentially computing $\hat{A}_t$ and $\hat{\tau}_t$ so that they can be updated every time a datum $x_t$ is input. Let $X_t \stackrel{\rm def}{=} (\bar{x}_{k+1}, \bar{x}_{k+2}, \ldots, \bar{x}_t)$. Let us recursively define $\tilde{V}_t$ and $M_t$ as follows:

$$\tilde{V}_t^{-1} \stackrel{\rm def}{=} (1-r)\,\tilde{V}_{t-1}^{-1} + r\,\bar{x}_t\bar{x}_t^T, \qquad M_t \stackrel{\rm def}{=} (1-r)\,M_{t-1} + r\,\bar{x}_t x_t.$$

Then we obtain the following iterative relations for the parameter estimation:

$$\tilde{V}_t = \frac{1}{1-r}\tilde{V}_{t-1} - \frac{r}{1-r}\cdot\frac{\tilde{V}_{t-1}\bar{x}_t\bar{x}_t^T\tilde{V}_{t-1}}{1-r+\tilde{c}_t}, \qquad \hat{A}_t = \tilde{V}_t M_t,$$
$$\hat{e}_t = x_t - \hat{A}_t^T\bar{x}_t, \qquad \tilde{c}_t = r\,\bar{x}_t^T\tilde{V}_{t-1}\bar{x}_t, \qquad \tilde{d}_t = \frac{\tilde{c}_t}{1-r+\tilde{c}_t}. \qquad (4)$$

Setting r = 1/(t − m) yields the iteration developed by Rissanen et al. [5] and Roos et al. [6]. We employ (4) for parameter estimation. Define $s_t$ by

$$s_t \stackrel{\rm def}{=} \sum_{j=m+1}^{t} \hat{e}_j^2 = (t-m)\,\hat{\tau}_t. \qquad (5)$$
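As an illustration of how Eq. (4) is applied, here is a minimal Python sketch of the sequential discounting update; the class name, the initialisation of Ṽ (a large multiple of the identity) and the burn-in handling are assumptions, and the SDNML code-length itself (which additionally uses d̃_t and s_t) is not computed here.

```python
import numpy as np

class DiscountingAR:
    """Sequential discounting estimation for a k-th order AR model, following Eq. (4)-(5)."""

    def __init__(self, k, r, v0=1e3):
        self.k, self.r = k, r
        self.V = v0 * np.eye(k)   # assumed initialisation of V_tilde (acts as a weak prior)
        self.M = np.zeros(k)
        self.A = np.zeros(k)
        self.s = 0.0              # s_t = cumulative squared residuals, Eq. (5)

    def update(self, x_t, x_bar):
        """x_t: new observation; x_bar: (x_{t-1}, ..., x_{t-k})."""
        r, V = self.r, self.V
        c = r * x_bar @ V @ x_bar                               # c_tilde_t = r x̄_t^T V_{t-1} x̄_t
        Vx = V @ x_bar
        self.V = (V - (r / (1.0 - r + c)) * np.outer(Vx, Vx)) / (1.0 - r)
        self.M = (1.0 - r) * self.M + r * x_bar * x_t           # M_t = (1-r) M_{t-1} + r x̄_t x_t
        self.A = self.V @ self.M                                # Â_t = V_t M_t
        e = x_t - self.A @ x_bar                                # ê_t = x_t - Â_t^T x̄_t
        d = c / (1.0 - r + c)                                   # d_tilde_t
        self.s += e ** 2                                        # so tau_hat_t = s_t / (t - m)
        return e, d
```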
3 Proposed Method
The main features of our proposed method are summarized as follows:
1) Two-stage learning framework with SDNML code-length: We basically employ
the two-stage learning framework proposed in [9] to realize real-time change-
point detection. The key idea is that probabilistic models are learned at the
two stages; in the first stage a probabilistic model is learned from the original
time series and a score is given for each time on the basis of the model, and in
the second stage another probabilistic model is learned from a score sequence
obtained by smoothing scores calculated at the first stage and a change-point
score is calculated on the basis of the learned model. We use the SDNML
code-length for the scoring in each stage.
2) Efficiently computing the estimates of parameters: Although the Yule-Walker
equation must be solved for the parameter estimation in ChangeFinder [9],
we can use an iterative relation to more efficiently estimate parameters than
ChangeFinder.
Below we give details of our version of the two-stage learning framework.
$$y_t = \frac{1}{T}\sum_{i=t-T+1}^{t} \mathrm{Score}(x_i). \qquad (10)$$

This indicates how drastically the nature of the original sequence $x^t$ has changed at time point t.
In [9], [11], the score is calculated as the negative logarithm of the plug-in density, defined as − log p(x_t | θ̂^{(t−1)}), where θ̂^{(t−1)} is the estimate of θ obtained with the discounting learning algorithm from x^{t−1}. In our method, it is replaced by the SDNML code-length.
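The two-stage flow can be sketched as below; the `sdnml_score_factory` interface and the simple moving-average smoothing window of length T (Eq. (10)) are assumptions made for illustration, since the full SDNML scorer is not reproduced here.

```python
from collections import deque

def two_stage_change_scores(xs, sdnml_score_factory, T=5):
    """Two-stage learning framework: stage 1 scores the raw series, the scores are smoothed
    into y_t (Eq. (10)), and stage 2 scores the smoothed sequence to give change-point scores.
    sdnml_score_factory() returns a fresh sequential scorer whose .score(x) method emits the
    SDNML code-length of x given the data seen so far (an assumed interface)."""
    stage1, stage2 = sdnml_score_factory(), sdnml_score_factory()
    window = deque(maxlen=T)
    change_scores = []
    for x in xs:
        window.append(stage1.score(x))            # first-stage SDNML code-length of x
        y = sum(window) / len(window)             # smoothed score y_t, Eq. (10)
        change_scores.append(stage2.score(y))     # second-stage code-length = change-point score
    return change_scores
```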
In updating each parameter estimate, only one iteration is performed every time a datum is input. The computation time of our method is O(k^2 n), while that of the original two-stage learning-based method, ChangeFinder, is O(k^3 n).
$$I(x_u^t) \stackrel{\rm def}{=} -\sum_{i=u}^{t} \log p(x_i \mid x^{i-1}_{i-k} : \theta(x_u^{i-1})),$$
First we consider the case where change points are discontinuous in the sense
that the value of Δ(t) discontinuously changes at a change point t.
As for the data-generation model we employed the following AR model:
xt = A1 xt−1 + A2 xt−2 + ε,
where A_1 = 0.6, A_2 = −0.5, and ε ∼ N(μ, 1). We generated 1,000 records and set change points so that a jump of the mean value μ occurred at x × 100 (x = 1, 2, ...). Let the amount of the jump of the mean at the k-th change point be D_k. We set D_k = 10 − k. The dissimilarity at the i-th change point is given by
(Figure omitted: the generated artificial sequence and its SDNML change-point scores on a log10 scale.)
The benefit measures how early the true change-point is detected. It takes the
maximum value 1 when the true change-point is detected at that point, and is
zero when |t − t∗ | exceeds 20. The false discovery rate (FDR) is the ratio of the
number of false positive alarms over the number of total alarms. Considering the
trade-off between benefit and FDR, we used the benefit-FDR curve as proposed
in [1] for the performance comparison. It is a concept similar to ROC curve.
Figure 1(b) shows the results of the benefit-FDR curves for our method and
existing methods. The horizontal axis shows FDR while the vertical axis shows
the average benefit where the average was taken over all of the change points.
SDNML is our method, CF is ChangeFinder, the conventional two-stage learning
based method. HT is the hypothesis-testing based method in which the fourth-
degree AR model is used for model fitting and the score is measured in terms of
the logarithmic loss. We observe from Figure 1(b) that SDNML performs better
than HT and CF. The AUC (Area Under Curve) for SDNML was about 12% larger than that for CF.
Figure 2 shows the computation time (sec) of SDNML in comparison with CF
for this data set. We see that SDNML is significantly more efficient than CF.
Next we consider the case where change points are continuous in the sense that the value of the dissimilarity Δ changes continuously at each of the points. We consider the following data generation model: x_t = v(t) + ε, where ε ∼ N(0, 1), v(t) = 0 for 0 ≤ t ≤ 100, and v(t) = c(t − 100)(t − 99)/2 for t > 100.
Letting the dissimilarity at time t be Δ(t), it is calculated as follows: for a given c > 0, Δ(t) = 0 for 0 ≤ t ≤ 100, and Δ(t) = c^2(t − 100)^2/2 for t > 100.
This shows that the dissimilarity at a change point is continuous with respect to t. We call such change points continuous change points. They are more difficult to identify than discontinuous ones.
We generated six data sets of 200 records each according to the model above.
Figure 3(a) shows an example of such data sets. We evaluated the detection
accuracies for CF, HT, and SDNML for this data set. Parameter values for all of the methods were chosen systematically, as in the discontinuous case. Figure 3(b) shows the results of the benefit-FDR curves for CF and SDNML, where the average benefit was computed as the average of the benefits taken over the six randomized data generations. Note that HT was much worse than CF, and was omitted from Figure 3(b). We observe from Figure 3(b) that SDNML performs significantly better than CF. The AUC for SDNML is about 46% larger than that for CF.
Through the empirical evaluation using artificial data sets including contin-
uous and discontinuous change-points, we observe that our method performs
significantly better than the existing methods both in detection accuracy and
computational efficiency.
The superiority of SDNML over CF may be justified from the viewpoint of the minimum description length (MDL) principle. Indeed, SDNML is designed as the optimal strategy that sequentially attains the least code-length, while CF, which uses predictive coding, produces longer code-lengths than SDNML. It is theoretically guaranteed by the theory of the MDL principle that the shorter the code-length for a data sequence is, the better the model learned from the data. Hence the better strategy in the light of the MDL principle yields a better strategy for statistical
Table 1 summarizes the performance of SDNML and CF for the two time
series (IP counting data and URL counting data) in terms of alert time and
the total number of alarms. In the row of ServerError Time, for each burst of
messages 500ServerError, the starting time point and ending time point of the
burst are shown. In the table ”-” indicates the fact that the burst associated
with the message: 500ServerError was not detected.
We observe from Table 1 that our method was able to detect all of the bursts
associated with the message: 500ServerError, while CF overlooked some of
them. It was confirmed by security analysts that all of the detected bursts were
related to a backdoor, and were considered symptoms of the backdoor. Further, there were no logs related to the backdoor other than the bursts of the message 500ServerError. This implies that our method was able to detect the backdoor at an early stage, when its symptoms appeared. This demonstrates the validity of our method in the scenario of malware detection.
6 Conclusion
We have proposed a new method of real-time change point detection, in which we
employ the sequentially discounting normalized maximum likelihood (SDNML)
coding as a scoring function within the two-stage learning framework. The intu-
ition behind the design of this method is that SDNML coding, which sequentially
attains the shortest code-length, would improve the accuracy of change-point detection.
Acknowledgments
This research was supported by Microsoft Corporation (Microsoft Research
CORE Project) and NTT Corporation.
References
1. Fawcett, T., Provost, F.: Activity monitoring: noticing interesting changes in be-
havior. In: Proc. of ACM-SIGKDD Int’l Conf. Knowledge Discovery and Data
Mining, pp. 53–62 (1999)
2. Guralnik, V., Srivastava, J.: Event detection from time series data. In: Proc. ACM-
SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 33–42 (1999)
3. Hawkins, D.M.: Point estimation of parameters of piecewise regression models. J.
Royal Statistical Soc. Series C 25(1), 51–57 (1976)
4. Rissanen, J.: Information and Complexity in Statistical Modeling. Springer, Hei-
delberg (2007)
5. Rissanen, J., Roos, T., Myllymäki, P.: Model selection by sequentially normalized
least squares. Jr. Multivariate Analysis 101(4), 839–849 (2010)
6. Roos, T., Rissanen, J.: On sequentially normalized maximum likelihood models.
In: Proc. of 1st Workshop on Information Theoretic Methods in Science and En-
gineering, WITSME 2008 (2009)
7. Shtarkov, Y.M.: Universal sequential coding of single messages. Problems of Infor-
mation Transmission 23(3), 175–186 (1987)
8. Song, X., Wu, M., Jermaine, C., Ranka, S.: Statistical change detection for multi-
dimensional data. In: Proc. Fifteenth ACM-SIGKDD Int’l Conf. Knowledge Dis-
covery and Data Mining, pp. 667–675 (2009)
9. Takeuchi, J., Yamanishi, K.: A unifying framework for detecting outliers and
change-points from time series. IEEE Transactions on Knowledge and Data Engi-
neering 18(44), 482–492 (2006)
10. Wang, J., Deng, P., Fan, Y., Jaw, L., Liu, Y.: Virus detection using data mining
techniques. In: Proc. of ICDM 2003 (2003)
11. Yamanishi, K., Takeuchi, J.: A unifying approach to detecting outliers and change-
points from nonstationary data. In: Proc. of the Eighth ACM SIGKDD Int’l Conf.
Knowledge Discovery and Data Mining (2002)
12. Ye, Y., Li, T., Jiang, Q., Han, Z., Wan, L.: Intelligent file scoring system for malware
detection from the gray list. In: Proc. of the Fifteenth ACM SIGKDD Int’l Conf.
Knowledge Discovery and Data Mining (2009)
Compression for Anti-Adversarial Learning
1 Introduction
that provide a good estimate of the true entropy of the data by using symbol-level dynamic Markov modeling. PPM predicts the symbol probability conditioned on its k immediately prior symbols, forming a k-th order Markov model. For example, the context c_i^k of the i-th symbol x_i in a given sequence is {x_{i−k}, . . . , x_{i−1}}.
The total number of contexts of an order-k model is O(|Σ|^{k+1}), where Σ is the alphabet of input symbols. As the order of the model increases, the number of contexts increases exponentially. High-order models are more likely to capture longer-range correlations among adjacent symbols, if they exist; however, an unnecessarily high order can result in context dilution, leading to inaccurate probability estimates. PPM resolves this dilemma by dynamically matching contexts
between the current sequence and the ones that occurred previously. It uses
high-order predictions if they exist, otherwise “drops gracefully back to lower
order predictions” [10]. More specifically, the algorithm first looks for a match
of an order-k context. If such a match does not exist, it looks for a match of
an order k − 1 context, until it reaches order-0. Whenever a match is not found
in the current context, the model falls back to a lower-order context and the
total probability is adjusted by what is called an escape probability. The escape
probability models the probability that a symbol will be found in a lower-order
context. When an input symbol x_i is found in a context c_i^{k'} with k' ≤ k, the conditional probability of x_i given its k-th order context c_i^k is

$$p(x_i \mid c_i^k) = \Big(\prod_{j=k'+1}^{k} p(\mathrm{Esc} \mid c_i^j)\Big) \cdot p(x_i \mid c_i^{k'}),$$

where $p(\mathrm{Esc} \mid c_i^j)$ is the escape probability conditioned on context $c_i^j$. If the symbol is not predicted by the order-0 model, a probability defined by a uniform distribution is predicted.
PPMC [11] and PPMD [12] are two well known variants of the PPM algorithm.
Their difference lies in the estimate of the escape probabilities. In both PPMC
and PPMD, an escape event is counted every time a symbol occurs for the first
time in the current context. In PPMC, the escape count and the new symbol
count are each incremented by 1 while in PPMD both counts are incremented
by 1/2. Therefore, in PPMC, the total symbol count increases by 2 every time
a new symbol is encountered, while in PPMD the total count only increases by
1. When implemented on a binary computer, PPMD sets the escape probability to |d|/(2|t|), where |d| is the number of distinct symbols in the current context and |t| is the total number of symbols in the current context.
Now, given an input X = x_1 x_2 . . . x_d of length d, where x_1, x_2, . . . , x_d is a sequence of symbols, its probability given a compression model M can be estimated as

$$p(X \mid M) = \prod_{i=1}^{d} p(x_i \mid x^{i-1}_{i-k}, M)$$

where |X| is the length of the instance, x_{i−k}, . . . , x_i is a subsequence in the instance, k is the length of the context, and M_c is the compression model associated with class c.
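As an illustration of the escape mechanism and of the sequence probability above, here is a stripped-down sketch of an order-k context model with PPMD-style escape estimates; it omits PPM's exclusion mechanism and adaptive updating during compression, so it approximates the algorithms discussed above rather than reproducing a faithful PPMD implementation.

```python
import math
from collections import defaultdict, Counter

class SimplePPMD:
    """A simplified order-k context model with PPMD-style escape probabilities (illustrative only)."""

    def __init__(self, k, alphabet):
        self.k = k
        self.alphabet = set(alphabet)
        self.tables = [defaultdict(Counter) for _ in range(k + 1)]  # tables[j][context] -> symbol counts

    def train(self, sequence):
        for i, sym in enumerate(sequence):
            for j in range(self.k + 1):
                if i >= j:
                    self.tables[j][tuple(sequence[i - j:i])][sym] += 1

    def prob(self, sym, history):
        """p(sym | history): fall back from the longest matching context, multiplying escape probabilities."""
        p = 1.0
        for j in range(min(self.k, len(history)), -1, -1):
            ctx = tuple(history[len(history) - j:])
            counts = self.tables[j].get(ctx)
            if not counts:
                continue                                  # context never seen: fall back (simplification)
            n, d = sum(counts.values()), len(counts)
            if counts[sym] > 0:
                return p * (2 * counts[sym] - 1) / (2 * n)  # PPMD symbol estimate
            p *= d / (2 * n)                                # PPMD escape probability |d| / (2|t|)
        return p / len(self.alphabet)                       # order -1: uniform over the alphabet

    def log2_prob_sequence(self, sequence):
        """log2 p(X | M); -log2 p(X | M) is the code length of X in bits under this model."""
        return sum(math.log2(self.prob(s, sequence[max(0, i - self.k):i]))
                   for i, s in enumerate(sequence))
```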
When classifying an unlabeled instance, a common practice is to compress it
with both compression models and check to see which one compresses it more
efficiently. However, PPM is an incremental algorithm, which means once an
unlabeled instance is compressed, the model that compresses it will be updated
as well. This requires the changes made to both models be reverted every time
after an instance is classified. Although the PPM algorithm has linear time complexity, the constants involved are by no means trivial. It is desirable to eliminate the redundancy of updating and then reverting changes made to
the models. We propose an approximation algorithm (See Algorithm 1) that
we found works quite well in practice. Given a context C = x_{i−k} . . . x_{i−1} in an unlabeled instance, if any suffix c of C has not occurred in the context trees of the compression models, we set p(Esc|c) = 1/|A|; thus the probability of x_i is discounted by 1/|A|, where |A| ≥ |Σ|, the size of the alphabet. More aggressive discount factors set the prediction further away from the decision boundary.
Empirical results will be discussed in Section 5.
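A sketch of this discounting rule (not the paper's Algorithm 1 verbatim), reusing the SimplePPMD tables from the previous sketch; `discount_size` plays the role of |A| and is an assumed parameter.

```python
def approx_prob(model, sym, history, discount_size):
    """Approximation: whenever a context suffix is absent from the frozen context tables,
    charge a fixed escape probability 1/|A| and fall back, so the class models never need
    to be updated and then reverted at classification time."""
    p = 1.0
    for j in range(min(model.k, len(history)), -1, -1):
        ctx = tuple(history[len(history) - j:])
        counts = model.tables[j].get(ctx)
        if not counts:
            p *= 1.0 / discount_size          # unseen suffix: p(Esc | c) = 1 / |A|
            continue
        n, d = sum(counts.values()), len(counts)
        if counts[sym] > 0:
            return p * (2 * counts[sym] - 1) / (2 * n)
        p *= d / (2 * n)
    return p / len(model.alphabet)
```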
Compression-based algorithms have demonstrated superior classification per-
formance in learning tasks where the input consists of strings [3,4]. However, it
is not clear whether this type of learning algorithm is susceptible to adversarial
attacks. We investigate several ways to attack the compression-based classifier
on real data in the domain of e-mail spam filtering. We choose this domain in
our study for the following reasons: 1.) previous work has demonstrated great
success of compression-based algorithms in this domain [3]; 2.) it is conceptu-
ally simple to design various adversarial attacks and establish a ground truth;
3.) there have been studies of several variations of adversarial attacks against
standard learning algorithms in this domain [14,5].
Good word attacks are designed specifically to target the integrity of statisti-
cal spam filters. Good words are those that appear frequently in normal e-mail
202 Y. Zhou, M. Inge, and M. Kantarcioglu
but rarely in spam e-mail. Existing studies show that good word attacks are
very effective against the standard learning algorithms that are considered the
state of the art in text classification [14]. The attacks against multinomial naïve Bayes and support vector machines with 500 good words caused a more than 45% decrease in recall while the precision was fixed at 90% in a previous study [5]. We repeated the test on the 2006 TREC Public Spam Corpus [15] using our implementation of PPMD compressors. It turns out that the PPMD classifier is surprisingly robust against good word attacks. With 500 good words added to 50% of the spam e-mail, we observed no significant change in either precision or recall (see Figure 1). This remains true even when 100% of the spam is ap-
pended with 500 highly ranked good words. Its surprising resilience to good word
attacks led us to more aggressive attacks against the PPMD classifier. We ran-
domly chose a legitimate e-mail message from the training corpus and appended
it to a spam e-mail during testing. 50% of the spam was altered this way. This
time we were able to bring the average recall value down to 57.9%. However, the
precision value remains above 96%. Figure 1 shows the accuracy, precision and
recall values when there are no attacks, 500-goodword attacks, and in the worst
case, attacks with legitimate training e-mail. Details on experimental setup will
be given in Section 5.
Fig. 1. The accuracy, precision and recall values of the PPMD-based classifier with no attacks, 500-goodword attacks, and worst-case attacks
filtering, the adversary’s attempts are mostly focused on altering positive in-
stances to make them less distinguishable among ordinary data in that domain.
Now that we know the adversary can alter the "bad" data to make it appear to be good, our goal is to find a way to separate the target from its innocent-looking disguise. Suppose we have two compression models M+ and M−. Intuitively,
a positive instance would compress better with the positive model M+ than it
would with M− , the negative model. When a positive instance is altered with fea-
tures that would ordinarily appear in negative data, we expect the subsequences
in the altered data that are truly positive to retain relatively higher compression
rates when compressed against the positive model. We apply a sliding window
approach to scan through each instance and extract subsequences in the sliding
window that require a smaller number of bits when compressed against M+ than
M− . Ideally, more subsequences would be identified in a positive instance than
in a negative instance. In practice, there are surely exceptional cases where the
number of subsequences in a negative instance would exceed its normal average.
For this reason, we decide to compute the difference between the total number of
bits required to compress the extracted subsequences S using M− and M+ , re-
spectively. If an instance is truly positive, we expect Bits_{M−}(S) ≫ Bits_{M+}(S), where Bits_{M−}(S) is the number of bits needed to compress S using the negative compression model, and Bits_{M+}(S) is the number of bits needed using the positive model. For a positive instance, we thus expect not only a longer list of extracted subsequences, but also a greater discrepancy between the bits after they are compressed using the two different models.
For the adversary, any attempt to attack this counter-attack strategy will
always boil down to finding a set of “misleading” features and seamlessly blend
them into the target (positive instance). To break down the first step of our
counter-attack strategy, that is, extracting subsequences that compress better
against the positive compression model, the adversary would need to select a
set of good words {wi |BitsM+ (wi ) < BitsM− (wi )} so that the good words can
pass, together with the “bad” ones, our first round screening. To attack the
second step, the adversary must find good words that compress better against
the negative compression model, that is, {wi |BitsM+ (wi ) > BitsM− (wi )}, to
offset the impact of the “bad” words in the extracted subsequences. These two
204 Y. Zhou, M. Inge, and M. Kantarcioglu
goals inevitably contradict each other, thus making strategically attacking the
system much more difficult.
We now formally present our algorithm. Given a set of training data T , where
T = T+ ∪ T− , we first build two compression models M+ and M− from T+ and
T− , respectively. For each training instance t in T , we scan t using a sliding
window W of size n, and extract the subsequence si in the current sliding window
W if Bits_{M+}(s_i) < Bits_{M−}(s_i). This completes the first step of our algorithm, subsequence extraction. Next, for each instance t in the training set, we compute $d_t = \sum_{s_i}\big(\mathrm{Bits}_{M-}(s_i) - \mathrm{Bits}_{M+}(s_i)\big)$, where s_i is a subsequence in t that has been extracted in the first step. We then compute the classification threshold by maximizing the information gain:

$$r = \arg\max_{r \in \{d_1, \ldots, d_{|T|}\}} \mathrm{InfoGain}(T).$$
For a more accurate estimate, r should be computed using k-fold cross validation.
To classify an unlabeled instance u, we first extract the set of subsequences S from u in the same manner, then compute d_u = Bits_{M−}(S) − Bits_{M+}(S). If d_u ≤ r, u is classified as a negative instance; otherwise, u is positive. A detailed description of the algorithm is given in Algorithm 2.
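A compact sketch of this counter-attack procedure; the `bits` helper assumes the `log2_prob_sequence` interface of the earlier compression-model sketch, and the window size and threshold r are inputs rather than values prescribed here.

```python
def bits(model, seq):
    """Code length in bits of seq under a compression model (assumed interface)."""
    return -model.log2_prob_sequence(seq)

def extract_subsequences(instance, m_pos, m_neg, window=5):
    """Step 1: keep the sliding-window subsequences that compress better against M+ than M-."""
    subs = []
    for i in range(len(instance) - window + 1):
        s = instance[i:i + window]
        if bits(m_pos, s) < bits(m_neg, s):
            subs.append(s)
    return subs

def aadl_score(instance, m_pos, m_neg, window=5):
    """Step 2: d = sum_i (Bits_{M-}(s_i) - Bits_{M+}(s_i)) over the extracted subsequences."""
    return sum(bits(m_neg, s) - bits(m_pos, s)
               for s in extract_subsequences(instance, m_pos, m_neg, window))

def classify(instance, m_pos, m_neg, r, window=5):
    """Positive (e.g. spam) if the score exceeds the information-gain threshold r."""
    return aadl_score(instance, m_pos, m_neg, window) > r
```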
5 Experimental Results
We evaluate our counter-attack algorithm on e-mail data from the 2006 TREC
Public Spam Corpus [15]. The data consists of 36,674 spam and legitimate email
messages, sorted chronologically by receiving date and evenly divided into 11
subsets D1 , · · · , D11 . Experiments were run in an on-line fashion by training on
the ith subset and testing on the subset i + 1. The percentage of spam messages
in each subset varies from approximately 60% to a little bit over 70%. The
good word list consists of the top 1,000 unique words from the entire corpus
ranked according to the frequency ratio. In order to allow a true apples-to-
apples comparison among compression-based algorithms and standard learning
algorithms, we preprocessed the entire corpus by removing HTML and non-
textual parts. We also applied stemming and stop-list to all terms. The to,
from, cc, subject, and received headers were retained, while the rest of the
headers were removed. Messages that had an empty body after preprocessing
were discarded. In all of our experiments, we used a 6th-order context and a sliding window of size 5 where applicable.
We first test our counter-attack algorithm under the circumstances where
there are no attacks and there are only good word attacks. In the case of good
words attacks, the adversary has a general knowledge about word distributions in
the entire corpus, but lacks a full knowledge about the training data. As discussed
in Section 3, the PPMD-based classifier demonstrated superior performance in
these two cases. We need to ensure that our anti-adversarial classifier would
perform the same. Figure 2 shows the comparison between the PPMD-based
classifier and our anti-adversarial classifier when there are no attacks and 500-
goodword attacks on 50% of spam in each test set.
Fig. 2. The accuracy, precision and recall values of the PPMD classifier and the anti-
adversarial classifier (AADL) with no attacks and 500 good words attacks
Fig. 3. The accuracy, precision and recall values of the anti-adversarial classifier
(AADL) with no attacks, 500 good word attacks, 1 legit mail attacks, and 5 legit
mail attacks
validation. We also experimented with the number of good words varying from 50 to 500 and the percentage of altered spam varying from 50% to 100% in each test set; the results remained similar.
We conducted more aggressive attacks with exact copies of legitimate mes-
sages randomly chosen from the training set. We tried two attacks of increasing
strength by attaching one legitimate message and five legitimate messages, re-
spectively, to spam in the test set. In total 50% of the spam in the test set
was altered in each attack. Figure 3 illustrates the results. As can be observed,
our anti-adversarial classifier is very robust to any of the attacks, while the
PPMD-based classifier, for which the results are shown in Figure 4, is obviously
vulnerable to the more aggressive attacks. Furthermore, similar results were ob-
tained when: 1.) the percentage of altered spam increased to 100%; 2.) legitimate
messages used to alter spam were randomly selected from the test set, and 3.) in-
jected legtimate messages were randomly selected from data sets that are neither
training nor test sets.
To make the matter more complicated, we also tested our algorithm when
legitimate messages were randomly scattered into spam. This was done in two
different fashions. In the first case, we first randomly picked a position in spam;
then we took a random length (no greater than 10% of the total length) of a
legitimate message and inserted it to the selected position in spam. The two steps
Fig. 4. The accuracy, precision and recall values of the PPMD-based classifier with no
attacks, 500 good word attacks, 1 legit mail attacks, and 5 legit mail attacks
Fig. 5. The accuracy, precision and recall values of the PPMD-based classifier and the
anti-adversarial classifier (AADL) with 500-goodword attacks and 1 legit mail attacks
to 50% spam in both training and test data
were repeated until the entire legitimate message had been inserted. Empirical results show that random insertion does not appear to affect the classification performance. In the second case, we inserted terms from a legitimate message in a random order after every ℓ terms in spam. The process is as follows: 1.) tokenize the legitimate message and randomly shuffle the tokens; 2.) insert a random number of tokens (less than 10% of the total number of tokens) after every ℓ terms in spam. Repeat 1) and 2) until all tokens are inserted into the spam. We observed little performance change when ℓ ≥ 3. When ℓ ≤ 2, nearly all local context is completely lost in the altered spam, and the average recall values dropped to below 70%. Note that in the latter case (ℓ ≤ 2), the attack is most likely useless to the adversary in practice, since the scrambled instance would also fail to deliver the malicious attacks the adversary has set out to accomplish.
Previous studies [14,17,5] show that retraining on altered data as a result
of adversarial attacks may improve the performance of classifiers against the
attacks. This observation is further verified in our experiments, in which we randomly selected 50% of the spam in the training set and appended good words and legitimate e-mail to it, separately. Figure 5 shows the results of the PPMD classifier
and the anti-adversarial classifier with 500-goodword attacks and 1-legitimate-
mail attacks in both training and test data. As can be observed, retraining
improved, to an extent, the classification results in all cases.
6 Concluding Remarks
We demonstrate that compression-based classifiers are much more resilient to
good word attacks compared to standard learning algorithms. On the other
208 Y. Zhou, M. Inge, and M. Kantarcioglu
hand, this type of classifier is vulnerable to attacks when the adversary has a
full knowledge of the training set. We propose a counter-attack technique that
extracts and analyzes subsequences that are more informative in a given instance.
We demonstrate that the proposed technique is robust against any attacks, even
in the worst case where the adversary can alter positive instances with exact
copies of negative instances taken directly from the training set.
A fundamental theory needs to be developed to explain the strength of the
compression-based algorithm and the anti-adversarial learning algorithm. It re-
mains less clear, in theory, why the compression-based algorithms are remarkably
resilient to strategically designed attacks that would normally defeat classifiers
trained using standard learning algorithms. It is certainly of great interest to us to find out how well the proposed counter-attack strategy performs in other domains, and under what circumstances this seemingly bullet-proof algorithm would break down.
Acknowledgement
The authors would like to thank Zach Jorgensen for his valuable input. This work
was partially supported by Air Force Office of Scientific Research MURI Grant
FA9550-08-1-0265, National Institutes of Health Grant 1R01LM009989, National
Science Foundation (NSF) Grant Career-0845803, and NSF Grant CNS-0964350,
CNS-1016343.
References
1. Barreno, M., Nelson, B.A., Joseph, A.D., Tygar, D.: The security of machine learn-
ing. Technical Report UCB/EECS-2008-43, EECS Department, University of Cal-
ifornia, Berkeley (April 2008)
2. Sculley, D., Brodley, C.E.: Compression and machine learning: A new perspective
on feature space vectors. In: DCC 2006: Proceedings of the Data Compression
Conference, pp. 332–332. IEEE Computer Society, Washington, DC (2006)
3. Bratko, A., Filipič, B., Cormack, G.V., Lynam, T.R., Zupan, B.: Spam filtering
using statistical data compression models. J. Mach. Learn. Res. 7, 2673–2698 (2006)
4. Zhou, Y., Inge, W.: Malware detection using adaptive data compression. In: AISec
2008: Proceedings of the 1st ACM Workshop on Artificial Intelligence and Security,
Alexandria, Virginia, USA, pp. 53–60 (2008)
5. Jorgensen, Z., Zhou, Y., Inge, M.: A multiple instance learning strategy for com-
bating good word attacks on spam filters. Journal of Machine Learning Research 9,
1115–1146 (2008)
6. Witten, I., Neal, R., Cleary, J.: Arithmetic coding for data compression. Commu-
nications of the ACM, 520–540 (June 1987)
7. Cleary, J., Witten, I.: Data compression using adaptive coding and partial string
matching. IEEE Transactions on Communications COM-32(4), 396–402 (1984)
8. Cormack, G., Horspool, R.: Data compression using dynamic markov modeling.
The Computer Journal 30(6), 541–550 (1987)
9. Cleary, J., Witten, I.: Unbounded length contexts of ppm. The computer Jour-
nal 40(2/3), 67–75 (1997)
Anti-Adversarial Learning 209
10. Moffat, A., Turpin, A.: Compression and Coding Algorithms. Kluwer Academic
Publishers, Boston (2002)
11. Moffat, A.: Implementing the ppm data compression scheme. IEEE Trans. Comm.
38, 1917–1921 (1990)
12. Howard, P.: The design and analysis of efficient lossless data compression systems.
Technical report, Brown University (1993)
13. Teahan, W.J.: Text classification and segmentation using minimum cross-entropy.
In: RIAO 2000, 6th International Conference Recherche d'Information Assistée par
ordinateur (2000)
14. Lowd, D., Meek, C.: Good word attacks on statistical spam filters. In: Proceedings
of the 2nd Conference on Email and Anti-Spam (2005)
15. Cormack, G.V., Lynam, T.R.: Spam track guidelines – TREC 2005-2007 (2006),
http://plg.uwaterloo.ca/~gvcormac/treccorpus06/
16. Bratko, A.: Probabilistic sequence modeling shared library (2008),
http://ai.ijs.si/andrej/psmslib.html
17. Webb, S., Chitti, S., Pu, C.: An experimental evaluation of spam filter performance
and robustness against attack. In: The 1st International Conference on Collabora-
tive Computing: Networking, Applications and Worksharing, pp. 19–21 (2005)
Mining Sequential Patterns from Probabilistic
Databases
1 Introduction
2 Problem Statement
Classical SPM [18,3]. Let I = {i1 , i2 , . . . , iq } be a set of items and S =
{1, . . . , m} be a set of sources. An event e ⊆ I is a collection of items. A database
D = r1 , r2 , . . . , rn is an ordered list of records such that each ri ∈ D is of the
form (eid i , ei , σi ), where eid i is a unique event-id, including a time-stamp (events
are ordered by this time-stamp), ei is an event and σi is a source.
A sequence s = s1 , s2 , . . . , sa is an ordered list of events. The events si in the
sequence are called its elements. The length of a sequence s is the total number of items in it, i.e. $\sum_{j=1}^{a}|s_j|$; for any integer k, a k-sequence is a sequence of length k. Let s = s_1, s_2, . . . , s_q and t = t_1, t_2, . . . , t_r be two sequences. We say that s is a subsequence of t, denoted s ⪯ t, if there exist integers 1 ≤ i_1 < i_2 < · · · < i_q ≤ r such that s_k ⊆ t_{i_k} for k = 1, . . . , q. The source sequence corresponding to a source i is just the multiset {e | (eid, e, i) ∈ D}, ordered by eid. For a sequence s and source i, let X_i(s, D) be an indicator variable whose value is 1 if s is a subsequence of the source sequence for source i, and 0 otherwise. For any sequence s, define its support in D as $Sup(s, D) = \sum_{i=1}^{m} X_i(s, D)$. The objective is to find all sequences s such that Sup(s, D) ≥ θm for some user-defined threshold 0 ≤ θ ≤ 1.
Table 1. An example probabilistic database: events with their source distributions W, and the corresponding p-sequences

    eid   event    W
    e1    (a, d)   (X : 0.6)(Y : 0.4)
    e2    (a)      (Z : 1.0)
    e3    (a, b)   (X : 0.3)(Y : 0.2)(Z : 0.5)
    e4    (b, c)   (X : 0.7)(Z : 0.3)

    D^p_X : (a, d : 0.6)† (a, b : 0.3)(b, c : 0.7)
    D^p_Y : (a, d : 0.4)† (a, b : 0.2)
    D^p_Z : (a : 1.0)(a, b : 0.5)(b, c : 0.3)
The interpretation of Eq. 3 is that $c^{*}_{k}$ is the probability that the event e allows the element $s_k$ to be matched in source i; this is 0 if $s_k \not\subseteq e$, and is otherwise equal to the probability that e is associated with source i. Now we use the equation:

Table 2. Computing Pr[s ⪯ D^p_X] for s = (a)(b) using DP in the database of Table 1
4 Optimizations
Pr[t ⪯ D_i^p] for source i, we would like to exploit the similarity between s and t to compute Pr[t ⪯ D_i^p] more rapidly.
Let i be a source, let $D_i^p = \langle (e_1, c_1), \ldots, (e_r, c_r)\rangle$, and let $s = \langle s_1, \ldots, s_q\rangle$ be any sequence. Now let $A_{i,s}$ be the $(q+1) \times (r+1)$ DP matrix used to compute $\Pr[s \preceq D_i^p]$, and let $B_{i,s}$ denote the last row of $A_{i,s}$, that is, $B_{i,s}[\ell] = A_{i,s}[q, \ell]$ for $\ell = 0, \ldots, r$. We now show that if t is an extension of s, then we can quickly compute $B_{i,t}$ from $B_{i,s}$, and thereby obtain $\Pr[t \preceq D_i^p] = B_{i,t}[r]$:

Lemma 2. Let s and t be sequences such that t is an extension of s, and let i be a source whose p-sequence has r elements in it. Then, given $B_{i,s}$ and $D_i^p$, we can compute $B_{i,t}$ in O(r) time.
Proof. We only discuss the case where t is an I-extension of s, i.e. $t = \langle s_1, \ldots, s_q \cup \{x\}\rangle$ for some $x \notin s_q$. Firstly, observe that since the first q − 1 elements of s and t are pairwise equal, the first q − 1 rows of $A_{i,s}$ and $A_{i,t}$ are also equal. The (q − 1)-st row of $A_{i,s}$ is enough to compute the q-th row of $A_{i,t}$, but we only have $B_{i,s}$, the q-th row of $A_{i,s}$. If $t_q = s_q \cup \{x\} \not\subseteq e_\ell$, then $A_{i,t}[q, \ell] = A_{i,t}[q, \ell - 1]$, and we can move on to the next value of $\ell$. If $t_q \subseteq e_\ell$, then $s_q \subseteq e_\ell$ and so:

$$A_{i,s}[q, \ell] = (1 - c_\ell) \cdot A_{i,s}[q, \ell - 1] + c_\ell \cdot A_{i,s}[q - 1, \ell - 1].$$

Since we know $B_{i,s}[\ell] = A_{i,s}[q, \ell]$, $B_{i,s}[\ell - 1] = A_{i,s}[q, \ell - 1]$ and $c_\ell$, we can compute $A_{i,s}[q - 1, \ell - 1]$. But this value is equal to $A_{i,t}[q - 1, \ell - 1]$, which is the value from the (q − 1)-st row of $A_{i,t}$ that we need to compute $A_{i,t}[q, \ell]$. Specifically, we compute

$$B_{i,t}[\ell] = (1 - c_\ell) \cdot B_{i,t}[\ell - 1] + \big(B_{i,s}[\ell] - B_{i,s}[\ell - 1] \cdot (1 - c_\ell)\big)$$

if $t_q \subseteq e_\ell$ (otherwise $B_{i,t}[\ell] = B_{i,t}[\ell - 1]$). The (easier) case of S-extensions and an example illustrating the incremental computation can be found in [15].
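A sketch of the dynamic programme and of the O(r) incremental update of Lemma 2; elements are represented as Python sets, p-sequences as lists of (event, confidence) pairs, and the DP base cases are the natural ones (they are not spelled out in the excerpt above, so they are assumptions here).

```python
def prob_subsequence(s, p_seq):
    """Pr[s ⪯ D_i^p] by DP.  s: list of sets (elements); p_seq: list of (event, c_l) pairs.
    A[q][l] = probability that the first q elements of s occur within the first l events.
    Assumed base cases: A[0][l] = 1 and A[q][0] = 0 for q >= 1."""
    q_len, r = len(s), len(p_seq)
    A = [[1.0] * (r + 1)] + [[0.0] * (r + 1) for _ in range(q_len)]
    for q in range(1, q_len + 1):
        for l, (event, c) in enumerate(p_seq, start=1):
            if s[q - 1] <= event:   # s_q ⊆ e_l
                A[q][l] = (1.0 - c) * A[q][l - 1] + c * A[q - 1][l - 1]
            else:
                A[q][l] = A[q][l - 1]
    return A[q_len][r], A[q_len]    # the probability and the last row B_{i,s}

def incremental_last_row_I_ext(B_s, s_last, x, p_seq):
    """Lemma 2 (I-extension): from the last row B_{i,s} of s = <s_1..s_q>, compute in O(r)
    the last row B_{i,t} of t = <s_1, ..., s_q ∪ {x}>; Pr[t ⪯ D_i^p] is then B_t[-1]."""
    t_last = s_last | {x}
    B_t = [0.0] * (len(p_seq) + 1)
    for l, (event, c) in enumerate(p_seq, start=1):
        if t_last <= event:
            B_t[l] = (1.0 - c) * B_t[l - 1] + (B_s[l] - B_s[l - 1] * (1.0 - c))
        else:
            B_t[l] = B_t[l - 1]
    return B_t
```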
5 Candidate Generation
We now describe two candidate generation methods for enumerating all frequent
sequences, one each based on breadth-first and depth-first exploration of the
sequence lattice, which are similar to GSP [18,3] and SPAM [4] respectively. We
first note that an “Apriori” property holds in our setting:
Lemma 4. Given two sequences s and t, and a probabilistic database D^p, if s is a subsequence of t, then ERS(s, D^p) ≥ ERS(t, D^p).
Proof. In Eq. 1, note that for all D* ∈ PW(D^p), Sup(s, D*) ≥ Sup(t, D*).
BFS Algorithm:
1: j ← 1
2: L1 ← ComputeFrequent-1(D^p)
3: while Lj ≠ ∅ do
4:   C_{j+1} ← Join L_j with itself
5:   Prune C_{j+1}
6:   for all s ∈ C_{j+1} do
7:     Compute ES(s, D^p)
8:   L_{j+1} ← all sequences s ∈ C_{j+1} s.t. ES(s, D^p) ≥ θm
9:   j ← j + 1
10: Stop and output L1 ∪ . . . ∪ Lj

DFS Algorithm:
1: L1 ← ComputeFrequent-1(D^p)
2: for all sequences x ∈ L1 do
3:   Call TraverseDFS(x)
4: Output all frequent sequences
5: function TraverseDFS(s)
6:   for all x ∈ L1 do
7:     t ← s · ⟨{x}⟩   {S-extension}
8:     Compute ES(t, D^p)
9:     if ES(t, D^p) ≥ θm then
10:      TraverseDFS(t)
11:    t ← ⟨s_1, . . . , s_q ∪ {x}⟩   {I-extension}
12:    Compute ES(t, D^p)
13:    if ES(t, D^p) ≥ θm then
14:      TraverseDFS(t)
15: end function

Fig. 1. BFS (left) and DFS (right) algorithms. D^p is the input database and θ the threshold.
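A simplified skeleton of the DFS variant of Fig. 1 in Python, using the DP sketch above and computing the expected support as the sum over sources of Pr[s ⪯ D_i^p]; for brevity it recomputes each DP from scratch (ignoring the incremental update of Lemma 2) and omits the ordering restrictions a SPAM-style implementation would use to avoid generating the same sequence twice.

```python
def expected_support(s, database):
    """ES(s, D^p): sum over sources of Pr[s ⪯ D_i^p], using prob_subsequence from the DP sketch."""
    return sum(prob_subsequence(s, p_seq)[0] for p_seq in database)

def mine_dfs(database, items, theta):
    """DFS candidate generation: grow sequences by S- and I-extensions, pruning a branch
    whenever the expected support drops below θm (justified by Lemma 4)."""
    m = len(database)
    frequent = []

    def traverse(s):
        frequent.append(s)
        for x in items:
            t = s + [{x}]                       # S-extension
            if expected_support(t, database) >= theta * m:
                traverse(t)
            if x not in s[-1]:                  # I-extension: add x to the last element
                t = s[:-1] + [s[-1] | {x}]
                if expected_support(t, database) >= theta * m:
                    traverse(t)

    for x in items:
        s = [{x}]
        if expected_support(s, database) >= theta * m:
            traverse(s)
    return frequent
```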
6 Experimental Evaluation
We report on an experimental evaluation of our algorithms. Our implementations
are in C# (Visual Studio .Net), executed on a machine with a 3.2GHz Intel CPU
and 3GB RAM running XP (SP3). We begin by describing the datasets used for
experiments. Then, we demonstrate the scalability of our algorithms (reported
running times are averages from multiple runs), and also evaluate probabilis-
tic pruning. In our experiments, we use both real (Gazelle from Blue Martini
[14]) and synthetic (IBM Quest [3]) datasets. We transform these deterministic
datasets to probabilistic form in a way similar to [2,5,24,7]; we assign probabili-
ties to each event in a source sequence using a uniform distribution over (0, 1],
thus obtaining a collection of p-sequences. Note that we in fact generate ELU
data rather than SLU data: a key benefit of this approach is that it tends to
preserve the distribution of frequent sequences in the deterministic data.
We follow the naming convention of [23]: a dataset named CiDjK means that
the average number of events per source is i and the number of sources is j (in
thousands). Alphabet size is 2K and all other parameters are set to default.
We study three parameters in our experiments: the number of sources D,
the average number of events per source C, and the threshold θ. We test our
algorithms for one of the three parameters by keeping the other two fixed. Ev-
idently, all other parameters being fixed, increasing D and C, or decreasing θ,
all make an instance harder. We choose our algorithm variants according to two "axes": the lattice traversal strategy (BFS or DFS) and whether probabilistic pruning is switched on. We thus report on four variants in all; for example, "BFS+P" represents the variant with BFS lattice traversal and with probabilistic pruning ON.
Phase | Joining | Apriori | Prob. prun. | Frequent        Phase | Joining | Apriori | Prob. prun. | Frequent
  2   |  15555  |  15555  |     246     |    39             2   |  15555  |  15555  |     246     |    39
  3   |   237   |   223   |     208     |    91             3   |   334   |   234   |     175     |    91
[Figure: running time (in sec) versus threshold θ (in %age) on the C10D10K (left) and Gazelle (right) datasets for the BFS, BFS+P, DFS and DFS+P variants.]
Fig. 3. Scalability of our algorithms for increasing number of sources D (L), and for
increasing number of events per source C (R)
References
1. Aggarwal, C.C. (ed.): Managing and Mining Uncertain Data. Springer, Heidelberg
(2009)
2. Aggarwal, C.C., Li, Y., Wang, J., Wang, J.: Frequent pattern mining with uncertain
data. In: Elder et al. [9], pp. 29–38
3. Agrawal, R., Srikant, R.: Mining sequential patterns. In: Yu, P.S., Chen, A.L.P.
(eds.) ICDE, pp. 3–14. IEEE Computer Society, Los Alamitos (1995)
4. Ayres, J., Flannick, J., Gehrke, J., Yiu, T.: Sequential pattern mining using a
bitmap representation. In: KDD, pp. 429–435 (2002)
5. Bernecker, T., Kriegel, H.P., Renz, M., Verhein, F., Züfle, A.: Probabilistic frequent
itemset mining in uncertain databases. In: Elder et al. [9], pp. 119–128
6. Chui, C.K., Kao, B.: A decremental approach for mining frequent itemsets from
uncertain data. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD
2008. LNCS (LNAI), vol. 5012, pp. 64–75. Springer, Heidelberg (2008)
7. Chui, C.K., Kao, B., Hung, E.: Mining frequent itemsets from uncertain data. In:
Zhou, Z.-H., Li, H., Yang, Q. (eds.) PAKDD 2007. LNCS (LNAI), vol. 4426, pp.
47–58. Springer, Heidelberg (2007)
8. Cormode, G., Li, F., Yi, K.: Semantics of ranking queries for probabilistic data
and expected ranks. In: ICDE, pp. 305–316. IEEE, Los Alamitos (2009)
9. Elder, J.F., Fogelman-Soulié, F., Flach, P.A., Zaki, M.J. (eds.): Proceedings of the
15th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, Paris, France, June 28-July 1. ACM, New York (2009)
10. Gunopulos, D., Khardon, R., Mannila, H., Saluja, S., Toivonen, H., Sharma, R.S.:
Discovering all most specific sentences. ACM Trans. DB Syst. 28(2), 140–174 (2003)
11. Hassanzadeh, O., Miller, R.J.: Creating probabilistic databases from duplicated
data. The VLDB Journal 18(5), 1141–1166 (2009)
12. Hua, M., Pei, J., Zhang, W., Lin, X.: Ranking queries on uncertain data: a proba-
bilistic threshold approach. In: Wang [21], pp. 673–686
13. Khoussainova, N., Balazinska, M., Suciu, D.: Probabilistic event extraction from
RFID data. In: ICDE, pp. 1480–1482. IEEE, Los Alamitos (2008)
14. Kohavi, R., Brodley, C., Frasca, B., Mason, L., Zheng, Z.: KDD-Cup 2000 orga-
nizers’ report: Peeling the onion. SIGKDD Explorations 2(2), 86–98 (2000)
15. Muzammal, M., Raman, R.: Mining sequential patterns from probabilistic
databases. Tech. Rep. CS-10-002, Dept. of Comp. Sci. Univ. of Leicester, UK
(2010), http://www.cs.le.ac.uk/people/mm386/pSPM.pdf
16. Muzammal, M., Raman, R.: On probabilistic models for uncertain sequential pat-
tern mining. In: Cao, L., Feng, Y., Zhong, J. (eds.) ADMA 2010, Part I. LNCS,
vol. 6440, pp. 60–72. Springer, Heidelberg (2010)
17. Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., Hsu,
M.: Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE
Trans. Knowl. Data Eng. 16(11), 1424–1440 (2004)
18. Srikant, R., Agrawal, R.: Mining sequential patterns: Generalizations and perfor-
mance improvements. In: Apers, P.M.G., Bouzeghoub, M., Gardarin, G. (eds.)
EDBT 1996. LNCS, vol. 1057, pp. 3–17. Springer, Heidelberg (1996)
19. Suciu, D., Dalvi, N.N.: Foundations of probabilistic answers to queries. In: Özcan,
F. (ed.) SIGMOD Conference, p. 963. ACM, New York (2005)
20. Sun, X., Orlowska, M.E., Li, X.: Introducing uncertainty into pattern discovery in
temporal event sequences. In: ICDM, pp. 299–306. IEEE Computer Society, Los
Alamitos (2003)
21. Wang, J.T.L. (ed.): Proceedings of the ACM SIGMOD International Conference
on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12.
ACM, New York (2008)
22. Yang, J., Wang, W., Yu, P.S., Han, J.: Mining long sequential patterns in a noisy
environment. In: Franklin, M.J., Moon, B., Ailamaki, A. (eds.) SIGMOD Confer-
ence, pp. 406–417. ACM, New York (2002)
23. Zaki, M.J.: SPADE: An efficient algorithm for mining frequent sequences. Machine
Learning 42(1/2), 31–60 (2001)
24. Zhang, Q., Li, F., Yi, K.: Finding frequent items in probabilistic data. In: Wang
[21], pp. 819–832
Large Scale Real-Life Action Recognition Using
Conditional Random Fields with Stochastic
Training
1 Introduction
[Figure 1 plot: X/Y/Z acceleration signal strength (g) over time (seconds), with segments labelled act-0 to act-5.]
Fig. 1. An example of real-life continuous actions in our data, in which the correspond-
ing 3D acceleration signals are collected from the attached sensors. See Section 5 for
the meaning of the ‘g’ and action types, act-0 to act-5.
action signals [1,2,3]. However, this is not the case for real-life action sequences of human beings, in which different types of actions are performed one after another without an explicit segmentation of the boundaries. For example, people may first walk, then take a taxi, and then take an elevator, and the boundaries of these actions are unknown to the target action recognition system. An example of real-life actions with continuous sensor signals is shown in Figure 1. For this reason, it is necessary and important to develop a more powerful system that not only predicts the types of the actions but also disambiguates the boundaries of those actions.
With this motivation, we collected a large-scale real-life action data set (continuous sensor-based three-dimensional acceleration signals) from about one hundred people for continuous real-life action recognition. We adopt a popular structured classification model, conditional random fields (CRFs), to recognize the action types and at the same time disambiguate the action boundaries. Moreover, good online training methods are necessary for training CRFs on the large-scale data in our task. We will compare different online training methods for training CRFs on this action recognition data.
sequence. Ravi et al. [3] used decision trees, support vector machines (SVMs) and K-nearest neighbor (KNN) models for classification. Bao and Intille [1] and Pärkkä et al. [2] used decision trees for classification. A few other works treated the task as a structured classification problem. Huynh et al. [4] tried to discover latent activity patterns by using a Bayesian latent topic model.
Most of the prior work on action recognition used relatively small data sets. For example, in Ravi et al. [3], the data was collected from two persons. In Huynh et al. [4], the data was collected from only one person. In Pärkkä et al. [2], the data was collected from 16 persons.
There are two major approaches for training conditional random fields: batch
training and online training. Standard gradient descent methods are normally
batch training methods, in which the gradient computed by using all training in-
stances is used to update the parameters of the model. The batch training meth-
ods include, for example, steepest gradient descent, conjugate gradient descent
(CG), and quasi-Newton methods like Limited-memory BFGS (LBFGS) [5]. The
true gradient is usually the sum of the gradients from each individual training
instance. Therefore, batch gradient descent requires the training method to go
through the entire training set before updating parameters. Hence, the batch
training methods are slow at training CRFs.
A promising fast online training method is the stochastic gradient method, for example, stochastic gradient descent (SGD) [6,7]. The parameters of the model are updated much more frequently, and far fewer iterations are needed before convergence. For large-scale data sets, SGD can be much faster than batch gradient based training methods. However, there are problems with the current SGD literature: (1) SGD is sensitive to noise. The accuracy of SGD training is limited when the data is noisy (for example, the data inconsistency problem that we will discuss in the experiment section). (2) SGD is not robust. It contains many hyper-parameters (not only regularization, but also the learning rate) and it is quite sensitive to them. Tuning the hyper-parameters for SGD is not an easy task.
To deal with the problems of the traditional training methods, we use a new
online gradient-based learning method, the averaged SGD with feedback (ASF)
[8], for training conditional random fields. According to the experiments, the
ASF training method is quite robust for training CRFs for the action recognition
task.
Notes: m is the number of periods within which the ASF reaches convergence; b is the current period number; c is the current iteration number; n is the number of training samples. The learning rate γ ← γ0/(1 + b/Z) is only for theoretical analysis. In practice we can simply set γ ← 1, i.e., remove the learning rate.

Procedure ASF-train
  Initialize Θ with random values
  c ← 0
  for b ← 1 to m
    γ ← γ0/(1 + b/Z) with Z ≫ n, or simply γ ← 1
    for 1 to b
      Θ ← SGD-update(Θ)
    c ← c + b
    Θ ← the averaged parameters Θ̄ of Eq. 4   {feedback}
  Return Θ

Procedure SGD-update(Θ)
  for 1 to n
    select a sample j randomly
    Θ ← Θ + γ ∂Ls(j, Θ)/∂Θ
  Return Θ
the feedback, i.e., the length of each period should be adjusted reasonably as the
training goes on. For example, at the early stage of the training, the Θ is highly
noisy, so that the feedback operation to Θ should be performed more frequently.
As the training goes on, less frequent feedback operation would be better in order
to adequately optimize the parameters. In practice, the ASF adopts a schedule
of linearly slowing-down feedback : the number of iterations increases linearly in
each period, as the training goes on.
Figure 2 shows the steps of the ASF. We denote Θ^{b,c,d} as the model parameters after the d'th sample is processed in the c'th iteration of the b'th period. Equivalently, we denote Θ^{b,c,d} more simply as Θ^{b,cn+d}, where n is the number of samples in the training data. Similarly, we use g^{b,cn+d} to denote ∂Ls(d, Θ)/∂Θ in the c'th iteration of the b'th period. Let γ^{(b)} be the learning rate in the b'th period. Let Θ̄^{(b)} be the averaged parameters produced by the b'th period. We can induce the explicit form of Θ̄^{(1)}:

    Θ̄^{(1)} = Θ^{1,0} + γ^{(1)} Σ_{d=1..n} ((n − d + 1)/n) g^{1,d}.        (5)

When the 2nd period ends, the parameters are again averaged over all previous model parameters, Θ^{1,0}, . . . , Θ^{1,n}, Θ^{2,0}, . . . , Θ^{2,2n}, and the result can be expressed as:

    Θ̄^{(2)} = Θ^{1,0} + γ^{(1)} Σ_{d=1..n} ((n − d + 1)/n) g^{1,d} + γ^{(2)} Σ_{d=1..2n} ((2n − d + 1)/(3n)) g^{2,d}.        (6)

Similarly, the averaged parameters produced by the b'th period can be expressed as follows:

    Θ̄^{(b)} = Θ^{1,0} + Σ_{i=1..b} Σ_{d=1..in} γ^{(i)} ((in − d + 1)/(n i(i + 1)/2)) g^{i,d}.        (7)
The best possible convergence result for stochastic learning is "almost sure convergence": proving that the stochastic algorithm converges towards the solution with probability 1 [6]. The ASF is guaranteed to achieve almost sure convergence [8]. The averaged parameters produced at the end of each period of the optimization procedure of the ASF training are "almost surely convergent" towards the optimum Θ∗ [8]. On the implementation side, there is no need to keep all the past gradients for computing the averaged parameters Θ̄: we can compute Θ̄ on the fly, just like in the averaged perceptron case.
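A compact sketch of the ASF procedure of Figure 2, under the simplifying assumptions that the per-sample gradient grad(j, theta) is supplied by the CRF implementation and that the learning rate is fixed to 1 as suggested above; the running average of all parameter vectors seen so far is maintained on the fly, as in the averaged perceptron.

```python
import numpy as np

def asf_train(grad, n_samples, dim, n_periods, gamma=1.0, seed=0):
    """Averaged SGD with feedback (ASF), following the schedule of Fig. 2:
    period b performs b SGD passes, then feeds the running average back."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=dim)          # random initialisation
    theta_bar = theta.copy()              # running average of all parameters
    seen = 1                              # number of parameter vectors averaged
    for b in range(1, n_periods + 1):
        for _ in range(b):                # linearly slowing-down feedback
            for _ in range(n_samples):    # one SGD pass over the data
                j = int(rng.integers(n_samples))
                theta = theta + gamma * grad(j, theta)
                seen += 1
                theta_bar += (theta - theta_bar) / seen   # on-the-fly average
        theta = theta_bar.copy()          # feedback: restart SGD from the average
    return theta
```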
Table 1. Features used in the action recognition task. For simplicity, we only describe the features on the x-axis, because the features on the y-axis and z-axis are in the same setting as the x-axis. A × B means the Cartesian product of the sets A and B. The time interval feature does not record the absolute time from the beginning to the current window. This feature only records the time difference between two neighboring windows: sometimes there is a jump of time between two neighboring windows.
(in a temporal sequence). The data was collected by iPod accelerometers with a sampling frequency of 20 Hz. A sample contains 4 values: {time (the seconds elapsed since the beginning of a session), x-axis acceleration, y-axis acceleration, z-axis acceleration}, for example, {539.266(s), 0.091(g), -0.145(g), -1.051(g)}¹. There are six kinds of action labels: act-0 means "walking or running", act-1 means "on an elevator or escalator", act-2 means "taking a car or bus", act-3 means "taking a train", act-4 means "up or down stairs", and act-5 means "standing or sitting".
We split the data into training data (85%), development data for hyper-parameters (5%), and final evaluation data (10%). The evaluation metric is sample accuracy (%) (equal to recall in this task: the number of correctly predicted samples divided by the number of all samples). Following previous work on action recognition [1,2,3,4], we use acceleration features, mean features, standard deviation, energy, and correlation (covariance between different axes) features. Features are extracted from the iPod accelerometer data using a window size of 256. Each window is about 13 seconds long. Two consecutive windows (each containing 256 samples) overlap by 128 samples. Feature extraction on windows with 50% window overlap was shown to be effective in previous work [1]. The features are listed in Table 1. All features are used without pruning. We use exactly the same feature set for all systems.
¹ In the example, 'g' is the acceleration of gravity.
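The windowing scheme described above can be sketched as follows; the exact feature set of Table 1 is richer, so this is only an illustrative extraction of mean, standard deviation, energy and pairwise axis correlation over 256-sample windows with 50% overlap (the column layout of the input array is our assumption):

```python
import numpy as np

def extract_features(samples, window=256, overlap=128):
    """samples: array of shape (T, 4) with columns [time, ax, ay, az].
    Returns one feature vector per window of 256 samples, shifted by 128."""
    feats = []
    for start in range(0, len(samples) - window + 1, window - overlap):
        w = samples[start:start + window, 1:4]          # acceleration columns
        mean = w.mean(axis=0)
        std = w.std(axis=0)
        energy = (w ** 2).sum(axis=0) / window
        corr = np.corrcoef(w.T)                         # 3x3 correlation matrix
        pairwise = [corr[0, 1], corr[0, 2], corr[1, 2]] # xy, xz, yz
        feats.append(np.concatenate([mean, std, energy, pairwise]))
    return np.array(feats)
```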
One of the tough problems in this action recognition task is the rotation of the x-axis, y-axis, and z-axis in the collected data. Since different people attached the iPod accelerometer at different orientations, the x-axis, y-axis, and z-axis face the risk of inconsistency in the collected data. Take an extreme case for example: while the x-axis may represent a horizontal direction for one instance, the same x-axis may represent a vertical direction for another instance. As a result, the acceleration signals of the same axis may be inconsistent. We suppose this is an important reason that prevented the
experimental results from reaching a higher level of accuracy. A candidate solution to keep the consistency is to tell the people to adopt a standard rotation when collecting the data. However, this method would make the collected data less "natural" or "representative", because usually people put the accelerometer sensor (e.g., in an iPod or iPhone) randomly in their pocket in daily life.

³ Here, the empirical convergence state means an empirical evaluation of the convergence.

[Figure 3 plot: accuracy (%) of ASF, Averaged SGD and SGD against the number of iterations.]
Fig. 3. Curves of accuracies of the different stochastic training methods by varying the number of iterations
Acknowledgments
X.S., H.K., and N.U. were supported by the FIRST Program of JSPS. We thank
Hirotaka Hachiya for helpful discussion.
References
1. Bao, L., Intille, S.S.: Activity recognition from user-annotated acceleration data.
In: Ferscha, A., Mattern, F. (eds.) PERVASIVE 2004. LNCS, vol. 3001, pp. 1–17.
Springer, Heidelberg (2004)
2. Pärkkä, J., Ermes, M., Korpipää, P., Mäntyjärvi, J., Peltola, J., Korhonen, I.: Ac-
tivity classification using realistic data from wearable sensors. IEEE Transactions
on Information Technology in Biomedicine 10(1), 119–128 (2006)
3. Ravi, N., Dandekar, N., Mysore, P., Littman, M.L.: Activity recognition from ac-
celerometer data. In: AAAI, pp. 1541–1546 (2005)
4. Huynh, T., Fritz, M., Schiele, B.: Discovery of activity patterns using topic models.
In: Proceedings of the 10th International Conference on Ubiquitous Computing,
pp. 10–19. ACM, New York (2008)
5. Nocedal, J., Wright, S.J.: Numerical optimization. Springer, Heidelberg (1999)
6. Bottou, L.: Online algorithms and stochastic approximations. In: Saad, D. (ed.)
Online Learning and Neural Networks. Cambridge University Press, Cambridge
(1998)
7. Spall, J.C.: Introduction to stochastic search and optimization. Wiley-IEEE (2005)
8. Sun, X., Kashima, H., Matsuzaki, T., Ueda, N.: Averaged stochastic gradient de-
scent with feedback: An accurate, robust and fast training method. In: Proceedings
of the 10th International Conference on Data Mining (ICDM 2010), pp. 1067–1072
(2010)
9. Bottou, L.: Une Approche théorique de l’Apprentissage Connexionniste: Applica-
tions à la Reconnaissance de la Parole. PhD thesis, Université de Paris XI, Orsay,
France (1991)
10. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. In: Proceedings of the 18th
International Conference on Machine Learning (ICML 2001), pp. 282–289 (2001)
11. Daumé III, H.: Practical Structured Learning Techniques for Natural Language
Processing. PhD thesis, University of Southern California (2006)
12. Sun, X.: Efficient Inference and Training for Conditional Latent Variable Models.
PhD thesis, The University of Tokyo (2010)
13. Collins, M., Globerson, A., Koo, T., Carreras, X., Bartlett, P.L.: Exponentiated
gradient algorithms for conditional random fields and max-margin markov net-
works. J. Mach. Learn. Res. (JMLR) 9, 1775–1822 (2008)
14. Collins, M.: Discriminative training methods for hidden markov models: Theory
and experiments with perceptron algorithms. In: Proceedings of EMNLP 2002, pp.
1–8 (2002)
15. Hattori, Y., Takemori, M., Inoue, S., Hirakawa, G., Sudo, O.: Operation and base-
line assessment of large scale activity gathering system by mobile device. In: Pro-
ceedings of DICOMO 2010 (2010)
16. Andrew, G., Gao, J.: Scalable training of L1 -regularized log-linear models. In:
Proceedings of ICML 2007, pp. 33–40 (2007)
Packing Alignment: Alignment for Sequences
of Various Length Events
1 Introduction
Sequence alignment is now one of the most popular tools for comparing sequences. In molecular biology, various types of alignments are used in various kinds of problems: global alignments of pairs of proteins related by common ancestry throughout their length, local alignments involving related segments of proteins, multiple alignments of members of protein families, and alignments made during database searches to detect homologies [4]. Dynamic time warping (DTW), a kind of alignment between two time series, is often used in speech recognition [7] and in aligning audio recordings [1].
Most previous work on alignment has dealt with strings in which each component (letter) is assumed to have the same length. In the comparison of musical sequences, there is research on an alignment that considers the length of each note [5]. In that research, the general alignment framework is adapted to deal with note sequences by using a score (distance) function between notes that depends on note length. Their method is very flexible, but it heuristically defines its score function so as to reflect note length.
In this paper, we study packing alignment that explicitly treats the length
of each component (event) together with a constraint on length. One event in
a packing alignment can have a number of consecutive opposing events unless
the total length of them exceeds the length of that one event. Compared to the
method using a length-dependent score function, our setting reduces flexibility
but makes the problem clearer as an optimization problem. We can show that
an optimal solution of this extended alignment problem for two event sequences
s and t can be obtained in O(p(s, t)n(s)n(t)) time and O(p(s, t)(n(s) + n(t)))
space using dynamic programming, where n(s) and n(t) are the numbers of events
in sequences s and t, respectively, and p(s, t) is the maximum packable number
that is defined as the maximum number of events in s or t which can be opposed
to any one event in the other sequence in packing alignment.
Alignment distance can be shown to be equivalent to edit distance even in
packing alignment if two additional 0-cost edit operations, partition and con-
catenation, are introduced. Alignment of various length events is also possible
indirectly by general string alignment or DTW if all events are partitioned uni-
formly in preprocessing. There are two significant differences between packing
alignment and these conventional alignments. First, one event must not be di-
vided in packing alignment while gaps can be inserted in the middle of one event
divided by uniform partitioning in preprocessed conventional alignment. Second,
an optimal solution in packing alignment can be calculated faster than that in
preprocessed conventional alignment when the number of events increases sig-
nificantly by uniform partitioning. DTW also allows one event to be opposed to
more than one event, but packing alignment is stricter on the length of opposing events and more flexible in that it allows gap insertions. Though
alignment becomes flexible by virtue of gap insertion, alignments with long consecutive gaps are not desirable for many applications. So, we also developed a gap-constrained version of the packing alignment algorithm.
In our experiments, we applied packing alignment to frequent approximate
pattern extraction from a note sequence of a musical piece. We used mining
algorithm EnumSubstrFLOO [6], which heavily uses an alignment algorithm as
a subprocedure. For two MIDI files of Bach’s musical pieces, EnumSubstrFLOO
using packing alignment, which is directly applied to the original sequence, was
more than four times faster than that using DTW and general alignment, which
are applied to the sequence made by uniform partitioning. We also applied Enum-
SubstrFLOO to the melody tracks in MIDI files of three musical variations in
order to check whether themes and variations can be extracted as patterns and
occurrences. 80% of the patterns and occurrences extracted by EnumSubstrFLOO with packing alignment were nearly whole themes, nearly whole variations or whole pairs of consecutive variations, while the algorithms using DTW and general alignment, which were directly applied without uniform partitioning by ignoring note length, could extract almost no such appropriate ranges.
Let Σ denote a finite set of event types. The gap ‘-’ is a special event type
that does not belong to Σ. Assume existence of a real-valued score function w
on (Σ ∪ {-}) × (Σ ∪ {-}), which measures similarity between two event types.
[Figure 1 shows score excerpts of measures 2-5 (Theme), measures 50-53 (Variation 1), and measures 245-248 (Variation 5).]
Fig. 1. Parts of the score of 12 Variations on “Ah Vous Dirai-je Maman (Twinkle
Twinkle Little Star)” K.265
(C5, 1/4)(R, 1/8)(C5, 1/8)(G5, 1/4)(R, 1/8)(G5, 1/8)(A5, 1/4)(R, 1/8)(A5, 1/8)(G5, 1/4)(R, 1/8)(G5, 1/8)
using the event sequence representation, where event type 'R' denotes a rest.
A gap insertion into an event sequence s is an operation that inserts (-, l) right
before or after s[i] for some i ∈ {1, 2, ..., n(s)} and l ∈ R+ . We define a packing
alignment of two event sequences as follows.
Definition 1. A packing alignment of two event sequences s and t is a pair (s′, t′) that satisfies the following conditions.
1. s′ and t′ are gapped event sequences with the same length that are made from s and t, respectively, by repeated gap insertions.
2. For all (j, k) ∈ {1, 2, ..., n(s′)} × {1, 2, ..., n(t′)}, r(s′, j) ⊆ r(t′, k) or r(s′, j) ⊇ r(t′, k) holds if r(s′, j) ∩ r(t′, k) ≠ ∅.
3. For all (j, k) ∈ {1, 2, ..., n(s′)} × {1, 2, ..., n(t′)}, r(s′, j) ∩ r(t′, k) = ∅ if s′[j] = t′[k] = -.
[Figure 2 diagrams the sequence s = C5 C5 G5 G5 A5 A5 G5 G5 and a gapped version s′ = - C5 - C5 G5 G5 A5 A5 G5 G5, with event lengths drawn as bars.]
Fig. 2. Examples of packing alignments: (s, t) and (s′, t′) are packing alignments of s and t but (s″, t″) is NOT
Example 2 (Continued from Example 1). Let s and t denote the event sequence representations of the melodies in measures 2-5 and measures 50-53, respectively. By representing the length of each event as the length of a bar, s and t can be illustrated using the diagram shown in Figure 2. In the figure, the pair (s″, t″) is NOT a packing alignment of s and t because r(s″, 2) ∩ r(t″, 1) ≠ ∅ but neither of them is contained in the other, which violates condition 2 of Definition 1. Pairs (s, t) and (s′, t′) are packing alignments of s and t.
For event sequences s and t, let A(s, t) denote the set of packing alignments of
s and t.
The score S(s′, t′) between s and t for a packing alignment (s′, t′) is defined as

    S(s′, t′) = Σ_{(j,k): r(s′,j) ⊆ r(t′,k)} |s′[j]| w(s′[j], t′[k]) + Σ_{(j,k): r(s′,j) ⊃ r(t′,k)} |t′[k]| w(s′[j], t′[k]).

Problem 1. For given event sequences s and t, calculate the packing alignment score between s and t (and the alignment (s′, t′) that achieves the score).
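To make the score concrete, the following sketch evaluates S(s′, t′) for a given gapped pair, representing each event as a (type, length) pair with '-' as the gap symbol; the score function w and the alignment itself are inputs, and no attempt is made here to verify Definition 1 or to find the optimal alignment.

```python
def ranges(seq):
    """Half-open time ranges r(seq, j) of the events in a gapped sequence."""
    out, start = [], 0.0
    for _, length in seq:
        out.append((start, start + length))
        start += length
    return out

def packing_score(s_gapped, t_gapped, w):
    rs, rt = ranges(s_gapped), ranges(t_gapped)
    score = 0.0
    for j, (a, la) in enumerate(s_gapped):
        for k, (b, lb) in enumerate(t_gapped):
            if rs[j][0] >= rt[k][0] and rs[j][1] <= rt[k][1]:
                score += la * w(a, b)          # r(s',j) ⊆ r(t',k)
            elif rt[k][0] >= rs[j][0] and rt[k][1] <= rs[j][1] and rs[j] != rt[k]:
                score += lb * w(a, b)          # r(s',j) ⊃ r(t',k), strictly
    return score
```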
Let s[i..j] denote s[i]s[i + 1] · · · s[j]. The following proposition holds. The proof
is omitted due to space limitations.
Remark 1. Mongeau and Sankoff [5] have already proposed a method using the
recurrence equation with the same search space constrained by the lengths of
s[m] and t[n]. They heuristically introduced the constraint for efficiency while
the constraint is necessary to solve the packing alignment problem.
Let s and t be event sequences with l_s = max_{1≤i≤n(s)} |s[i]| and l_t = max_{1≤i≤n(t)} |t[i]|. The maximum packable number p(s, t) is the maximum of the following two numbers: (1) the maximum number of events s[i], s[i+1], ..., s[j] with Σ_{k=i}^{j} |s[k]| ≤ l_t, and (2) the maximum number of events t[i], t[i+1], ..., t[j] with Σ_{k=i}^{j} |t[k]| ≤ l_s.
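A direct computation of the maximum packable number, a small sketch assuming events are (type, length) pairs with positive lengths:

```python
def max_packable(s, t):
    """p(s, t): the largest number of consecutive events of one sequence whose
    total length fits within the longest single event of the other sequence."""
    def longest_run(events, budget):
        best, i, total = 0, 0, 0.0
        for j in range(len(events)):            # sliding window over events
            total += events[j][1]
            while total > budget and i <= j:
                total -= events[i][1]
                i += 1
            if total <= budget:
                best = max(best, j - i + 1)
        return best
    ls = max(length for _, length in s)
    lt = max(length for _, length in t)
    return max(longest_run(s, lt), longest_run(t, ls))
```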
Proposition 2. The optimal packing alignment problem for event sequences s
and t can be solved in O(p(s, t)n(s)n(t)) time and O(p(s, t)(n(s) + n(t))) space.
Proof. Dynamic programming using an n(s)×n(t) table can achieve the bounds.
Entry (i, j) of the table is filled with S∗(s[1..i], t[1..j]) in the dynamic programming. By Proposition 1, this is done using at most p(s, t) + 2 entry values that have already been calculated. Thus, in total, O(p(s, t)n(s)n(t)) time and O(n(s)n(t))
space are enough. The space complexity can be reduced to O(p(s, t)(n(s)+n(t)))
using a technique of a linear space algorithm proposed in [3].
Remark 2. Packing alignment is stricter on length than the edit distance defined
by Mongeau and Sankoff [5]. Operations called fragmentation and consolidation
introduced by them correspond to the partition and concatenation, respectively.
One event can be replaced with any consecutive events in fragmentation, and vice versa in consolidation, regardless of event type and length, while the total length and the event types are kept in partition and concatenation. Besides, their substitution is allowed with any event of any length. Partition, concatenation and our substitution can be seen as special cases of the fragmentation, consolidation and substitution they defined, and each of their operations can be realized by a series of our operations. Thus, our operations are more basic, and their cost can be determined more easily and naturally using the score function w on Σ ∪ {-} than the cost of their operations.
4 Gap Constraint
When we use the alignment score as a similarity measure, one problem is that the score can be high for alignments with long contiguous gaps. However, in many real
applications, two sequences with the best alignment with long contiguous gaps
should not be considered to be similar. So, we consider a gap-constrained version of the packing alignment score defined as follows. For a non-negative real number g ≥ 0, let A_g(s, t) denote the set of packing alignments of s and t in which the length of every contiguous gap subsequence (defined below) is at most g. Then, the gap-constrained version of the packing alignment score S∗_g is defined as

    S∗_g(s, t) = max_{(s′,t′) ∈ A_g(s,t)} S(s′, t′)  if A_g(s, t) ≠ ∅,  and  S∗_g(s, t) = −∞  otherwise.
We call the parameter g the maximum contiguous gap length. For a packing alignment (s′, t′) of s and t, a contiguous subsequence s′[i..j] is called a contiguous gap subsequence of s′ if s′[i] = · · · = s′[j] = - and no non-gap event in s′ is opposed to the t′[h] and t′[k] that are opposed to s′[i] and s′[j], respectively. A contiguous gap subsequence of t′ can be defined similarly. For example, when s = (C, 1)(E, 1) and t = (C, 2)(D, 1)(E, 2), the pair of s′ = (C, 1)(-, 1)(-, 1)(-, 1)(E, 1) and t is a packing alignment, but none of s′[2..4], s′[2..3] and s′[3..4] is a contiguous gap subsequence of s′ because t[1] and t[3], which are opposed to s′[2] and s′[4], are also opposed to s′[1] and s′[5], respectively. Let

    p_g(s, t) = max{ l : Σ_{k=i}^{i+l−1} |s[k]| ≤ g or Σ_{k=i}^{i+l−1} |t[k]| ≤ g for some i }.
5 Experiments
5.1 Frequent Approximate Pattern Extraction
By local alignment using packing alignment, we can define similar parts in event
sequences, so we can extract frequent approximate patterns in event sequences.
Here, we consider the task of extracting approximate patterns that appear frequently in one event sequence. In a note sequence of a musical sheet, such a pattern can be regarded as a most typical and impressive part. We conducted an experiment
on this task using MIDI files of famous classical music pieces.
As a frequent mining algorithm based on local alignment, we used EnumSub-
strFLOO in [6]. For a given event sequence and a minimum support σ, Enum-
SubstrFLOO extracts contiguous event sequences as approximate patterns that
have minimal locally optimal occurrences with frequency of at least σ. Local
optimality was first introduced to local alignment by Erickson and Sellers [2], and
locally optimal occurrences of approximate patterns are expected to have appro-
priate boundaries. Unfortunately, EnumSubstrFLOO with packing alignment is not so fast; it is an O(kn³)-time and O(n³)-space algorithm, where n is the number of events in a given sequence s and k is the maximum packable number p(s, s). Since EnumSubstrFLOO keeps all occurrence candidates in memory for efficiency and there are a lot of frequent patterns with short length, we prevented memory shortage by setting a parameter called the minimum length θ; only the occurrences of patterns with length of at least θ quarter notes were extracted.
In our experiment, we used the following score function:

    condition    a = b    a is close to b    a = - or b = -    otherwise
    w(a, b)        3            0                 −1              −2
Here, we say that a is close to b if one of the following conditions is satisfied: (1) just one of them is the rest 'R', (2) the pitch difference between them is at most two semitones or an integral multiple of an octave.
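The score function above can be written down directly; in this sketch, pitches are assumed to be given as MIDI note numbers with 'R' for a rest and '-' for a gap (the representation is our assumption, not the paper's).

```python
def close(a, b):
    """a is close to b: exactly one is a rest, or the pitch difference is at
    most two semitones or an integral multiple of an octave (12 semitones)."""
    if (a == "R") != (b == "R"):
        return True
    if a == "R" and b == "R":
        return False          # two equal rests are covered by the a == b case
    diff = abs(a - b)
    return diff <= 2 or diff % 12 == 0

def w(a, b):
    if a == "-" or b == "-":
        return -1
    if a == b:
        return 3
    if close(a, b):
        return 0
    return -2
```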
We scored each frequent pattern by summing the alignment scores between the pattern and its selected high-scored occurrences, which are greedily selected so that the ranges of the selected occurrences do not overlap¹.
The maximum contiguous gap length was set to the length of one quarter note
throughout our experiments. The continuity of one rest does not seem important,
so we cut each rest on each beat.
For both MIDI files, the number of notes (#note) becomes 9-12 times larger after uniform partitioning. As a result, EnumSubstrFLOO using DTW or general alignment is slower than that using packing alignment. The reason why DTW is faster than general alignment is that pruning of the pattern search space works well for DTW, which means that the DTW alignment score easily becomes negative. Note that the best alignment score can become larger using gaps for our score function, but DTW does not use gaps.
The following are the highest-scored pattern for packing alignment and the longest patterns for the other methods extracted from Bach-Menuet.mid.
PA: [score excerpt]
DTW: [score excerpt]
GA: [score excerpt]
The pattern extracted by packing alignment looks most appropriate for a
typical melody sequence of the Menuet.
⁵ The highest-pitch note is selected if the track contains overlapping notes.
⁶ It was set to 10 for mozart-k331-mov1 in the case of DTW and general alignment because nothing was frequent for 40.
Data sets used (file; title; track no. and composer; URL; length; form; #events; musical time):
- mozart-k265: 12 Variations on "Ah vous dirais-je, Maman" K.265; track (4), [Mozart]; tirolmusic.blogspot.com/2007/11/12-variations-on-ah-vous-dirais-je.html; 12m16s; form AABABA; 3423 events; 2/4[1,589), 3/4[589,648)
- mozart-k331-mov1: Piano Sonata No. 11 in A major, K 331, Andante grazioso; track (1), [Mozart]; www2s.sni.ne.jp/watanabe/classical.html; 12m20s; form AA'AA'BA"BA"; 3518 events; 6/8[1,218), 4/4[218,262)
- be-pv-19: 6 Variations in D on an Original Theme (Op.76); track (1), [Beethoven]; www.classicalmidiconnection.com/cmc/beethoven.html; 3m34s; form AA'BA; 1458 events; 2/4[1,50), 6/4[50,66), 2/4[66,98), 3/4[98,156), 2/4[156,181)
[Figure 3 panels: (a) mozart-k265, (b) mozart-k331-mov1, (c) be-pv-19. In each panel the rows PA, Ex, DTW and GA show extracted patterns and occurrences against the measure numbers of the theme and variations.]
Fig. 3. Result of musical variation extraction. The horizontal axes refer to measure
number in each musical piece. The vertical broken lines show the starting and ending
positions of themes and variations. The extracted patterns are shown by thick lines and
the other thin lines are their occurrences. The thick lines also represent its occurrence
except those in DTW-rows of (a) and (b).
Let us compare the result with those by other methods. In each Ex-row, a
longest frequent exact pattern and its occurrences are shown. For all the three
MIDI files, the longest frequent exact patterns are very short (2-6 measures)
and their occurrences are clustered in narrow ranges. This result indicates the
importance of using approximate patterns in this extraction task.
In the DTW-rows and GA-rows, a longest approximate pattern extracted by
EnumSubstrFLOO using DTW and general alignment, respectively, and their
occurrences are shown. Note that we applied DTW and general alignment by
ignoring note length. In the case with DTW, none of the extracted patterns and occurrences are nearly whole themes or nearly whole variations; the same holds in the case with general alignment, except for three nearly whole variations in be-pv-19. These results indicate the importance of taking event length into consideration.
6 Concluding Remarks
By explicitly treating event length, we defined the problem of packing alignment
for sequences of various length events as an optimization problem, which can be
solved efficiently. Direct applicability to such sequences has not only the merit
of time and space efficiency but also the merit of non-decomposability. By virtue
of these merits, we could extract appropriate frequent approximate patterns and
their occurrences in our experiments. We would like to apply packing alignment
to other applications in the future.
Acknowledgements
This work was partially supported by JSPS KAKENHI 21500128.
References
1. Dixon, S., Widmer, G.: MATCH: A Music Alignment Tool Chest. In: Proceedings of ISMIR 2005, pp. 11–15 (2005)
2. Erickson, B.W., Sellers, P.H.: Recognition of patterns in genetic sequences. In:
Sankoff, D., Kruskal, J.B. (eds.) Time Warps, String Edits and Macromolecules:
The Theory and Practice of Sequence Comparison, ch. 2, pp. 55–91. Addison-Wesley,
Reading (1983)
3. Hirschberg, D.S.: A linear space algorithm for computing maximal common subse-
quences. Communications of the ACM 18(6), 341–343 (1975)
4. Henikoff, S., Henikoff, J.: Amino acid substitution matrices from protein blocks.
Proc. Natl. Acad. Sci. USA 89, 10915–10919 (1992)
5. Mongeau, M., Sankoff, D.: Comparison of Musical Sequences. Computers and the
Humanities 24, 161–175 (1990)
6. Nakamura, A., Tosaka, H., Kudo, M.: Mining Approximate Patterns with Frequent
Locally Optimal Occurrences. Division of Computer Science Report Series A, TCS-
TR-A-10-41, Hokkaido University (2010),
http://www-alg.ist.hokudai.ac.jp/tra.html
7. Sakoe, H., Chiba, S.: Dynamic Programming Algorithm Optimization for Spoken
Word Recognition. IEEE Transactions on Acoustics, Speech, and Signal Process-
ing ASSP-26(1), 43–49 (1978)
Multiple Distribution Data Description Learning
Algorithm for Novelty Detection
1 Introduction
hence it is hard to decide how strict the decision boundary should be. ND is widely applied to many application domains such as network intrusion, currency validation, user verification in computer systems, medical diagnosis [3], and machine fault detection [16].
There are two main approaches to solving the data description problem: the density estimation approach [1][2][12] and the kernel-based approach [13][14][20].
In the density estimation approach, the task of data description is solved by estimating a probability density of a data set [11]. This approach requires a large number of training samples for estimation; in practice the training data is often insufficient and hence does not represent the complete density distribution. The estimation will mainly focus on modeling the high density areas and can result in a bad data description [14]. The kernel-based approach aims at determining the
boundaries of the training set rather than at estimating the probability density.
The training data is mapped from the input space into a higher dimensional
feature space via a kernel function. Support Vector Machine (SVM) is one of
the well-known kernel-based methods which constructs an optimal hyperplane
between two classes by focusing on the training samples close to the edge of
the class descriptors [17]. These training samples are called support vectors. In
One-Class Support Vector Machine (OCSVM), a hyperplane is determined to
separate the normal data such that the margin between the hyperplane and
outliers is maximized [13]. Support Vector Data Description (SVDD) is a new
SVM learning method for one-class classification [14]. A hyperspherically shaped
boundary around the normal data set is constructed to separate this set from
abnormal data. The volume of this data description is minimized to reduce the
chance of accepting abnormal data. SVDD has been proven as one of the best
methods for one-class classification problems [19].
Some extensions to SVDD have been proposed to improve the margins of the
hyperspherically shaped boundary. The first extension is Small Sphere and Large
Margin (SSLM) [20], which proposes to surround the normal data in an optimal hypersphere such that the margin, i.e. the distance from outliers to the hypersphere, is maximized. This SSLM approach is helpful for parameter selection and provides
very good detection results on a number of real data sets. We have recently
proposed a further extension to SSLM which is called Small Sphere and Two
Large Margins (SS2LM) [7]. This SS2LM aims at maximising the margin between
the surface of the hypersphere and abnormal data and the margin between that
surface and the normal data while the volume of this data description is being
minimised.
Other extensions to SVDD regarding data distribution have also been pro-
posed. The first extension is to apply SVDD to multi-class classification problems [5]: several class-specific hyperspheres are constructed, each of which encloses all data samples from one class but excludes all data samples from the other classes. The second extension is for one-class classification and proposes to use a number of hyperspheres to describe the normal data set [19]. Normal data samples may have some distinctive distributions, so they will be located in different regions in the feature space, and hence if the single hypersphere in SVDD is used to enclose all normal data, it will also enclose abnormal data samples, resulting in a high false positive error rate. However, this work was not presented in detail; the proposed method is heuristic and no proof is provided to show that the multi-sphere approach can provide a better data description.
We propose in this paper a new and more detailed multi-hypersphere ap-
proach to SVDD. A set of hyperspheres is proposed to describe the normal
data set assuming that normal data samples have distinctive data distributions.
We formulate the optimisation problem for multi-sphere SVDD and prove how
SVDD parameters are obtained through solving this problem. An iterative al-
gorithm is also proposed for building data descriptors, and we also prove that
the classification error will be reduced after each iteration. Experimental re-
sults on 28 well-known data sets show that the proposed method provides lower
classification error rates compared with the standard single-sphere SVDD.
subject to

    ‖φ(x_i) − c‖² ≤ R² + ξ_i,   i = 1, . . . , p
    ‖φ(x_i) − c‖² ≥ R² − ξ_i,   i = p + 1, . . . , n
    ξ_i ≥ 0,   i = 1, . . . , n                                                        (2)

subject to

    Σ_{j=1}^{m} u_{ij} ‖φ(x_i) − c_j‖² ≤ Σ_{j=1}^{m} u_{ij} R_j² + ξ_i,   i = 1, . . . , p
    ‖φ(x_i) − c_j‖² ≥ R_j² − ξ_{ij},   i = p + 1, . . . , n,  j = 1, . . . , m
    ξ_i ≥ 0,   i = 1, . . . , p
    ξ_{ij} ≥ 0,   i = p + 1, . . . , n,  j = 1, . . . , m                              (4)
where R = [R_j]_{j=1,...,m} is the vector of radii, C1 and C2 are constants, ξ_i and ξ_{ij} are slack variables, φ(·) is the feature mapping associated with a kernel function, and c = [c_j]_{j=1,...,m} is the vector of centres. The mapping φ(x_{i0}) of a normal data point x_{i0}, i0 ∈ {1, 2, . . . , p}, has to be in one of those hyperspheres, i.e. there exists a hypersphere S_{j0}, j0 ∈ {1, 2, . . . , m}, such that u_{i0 j0} = 1 and u_{i0 j} = 0 for j ≠ j0.
Minimising the function in (3) over variables R, c and ξ subject to (4) will
determine radii and centres of hyperspheres and slack variables if the matrix U
is given. On the other hand, the matrix U will be determined if radii and centres
of hyperspheres are given. Therefore an iterative algorithm will be applied to
find the complete solution. The algorithm consists of two alternating steps: 1) calculate the radii and centres of the hyperspheres and the slack variables, and 2) calculate the membership matrix U.
We present in the next sections the iterative algorithm and prove that the clas-
sification error in the current iteration will be smaller than that in the previous
iteration.
For classifying a data point x, the following decision function is used:

    f(x) = sign( max_{1≤j≤m} ( R_j² − ‖φ(x) − c_j‖² ) )                                (5)
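The decision rule in (5) can be sketched as follows; for readability the feature map is taken to be the identity (i.e. a linear kernel) with explicit centres and radii, whereas a kernelised version would expand ‖φ(x) − c_j‖² through the support-vector coefficients of (9).

```python
import numpy as np

def ms_svdd_decision(x, centres, radii):
    """f(x) = sign( max_j ( R_j^2 - ||x - c_j||^2 ) ):
    +1 (normal) if x falls inside at least one hypersphere, -1 otherwise."""
    scores = [r ** 2 - np.sum((x - c) ** 2) for c, r in zip(centres, radii)]
    return 1 if max(scores) >= 0 else -1

# Example with two spheres in the plane
centres = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
radii = [1.0, 0.5]
print(ms_svdd_decision(np.array([0.5, 0.5]), centres, radii))   # inside sphere 1 -> 1
print(ms_svdd_decision(np.array([3.0, 3.0]), centres, radii))   # outside both   -> -1
```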
    ∂L/∂c_j = 0  ⇒  c_j = Σ_{i∈s⁻¹(j)} α_i y_i φ(x_i) + Σ_{i=p+1}^{n} α_{ij} y_i φ(x_i)        (9)

    ∂L/∂ξ_i = 0  ⇒  α_i + β_i = C1,   i = 1, . . . , p                                        (10)

    ∂L/∂ξ_{ij} = 0  ⇒  α_{ij} + β_{ij} = C2,   i = p + 1, . . . , n,  j = 1, . . . , m          (11)
βi ≥ 0, ξi ≥ 0, βi ξi = 0, i = 1, . . . , p (14)
    L = Σ_{i=1}^{p} α_i ‖φ(x_i) − c_{s(i)}‖² − Σ_{i=p+1}^{n} Σ_{j=1}^{m} α_{ij} ‖φ(x_i) − c_j‖²

      = Σ_{i=1}^{p} α_i K(x_i, x_i) − 2 Σ_{i=1}^{p} α_i φ(x_i)·c_{s(i)} + Σ_{i=1}^{p} α_i ‖c_{s(i)}‖²
        − Σ_{i=p+1}^{n} Σ_{j=1}^{m} α_{ij} K(x_i, x_i) + 2 Σ_{i=p+1}^{n} Σ_{j=1}^{m} α_{ij} φ(x_i)·c_j − Σ_{i=p+1}^{n} Σ_{j=1}^{m} α_{ij} ‖c_j‖²

      = Σ_{j=1}^{m} Σ_{i∈s⁻¹(j)} α_i K(x_i, x_i) − 2 Σ_{j=1}^{m} Σ_{i∈s⁻¹(j)} α_i φ(x_i)·c_j + Σ_{j=1}^{m} Σ_{i∈s⁻¹(j)} α_i ‖c_j‖²
        − Σ_{i=p+1}^{n} Σ_{j=1}^{m} α_{ij} K(x_i, x_i) + 2 Σ_{j=1}^{m} Σ_{i=p+1}^{n} α_{ij} φ(x_i)·c_j − Σ_{j=1}^{m} Σ_{i=p+1}^{n} α_{ij} ‖c_j‖²

      = Σ_{i=1}^{p} α_i K(x_i, x_i) − Σ_{i=p+1}^{n} Σ_{j=1}^{m} α_{ij} K(x_i, x_i) − Σ_{j=1}^{m} ‖c_j‖²

      = Σ_{j=1}^{m} [ Σ_{i∈s⁻¹(j)} α_i y_i K(x_i, x_i) + Σ_{i=p+1}^{n} α_{ij} y_i K(x_i, x_i)
        − ‖ Σ_{i∈s⁻¹(j)} α_i y_i φ(x_i) + Σ_{i=p+1}^{n} α_{ij} y_i φ(x_i) ‖² ]                    (16)
The result in (16) shows that the optimisation problem in (3) is equivalent to
m individual optimisation problems as follows
    min  ‖ Σ_{i∈s⁻¹(j)} α_i y_i φ(x_i) + Σ_{i=p+1}^{n} α_{ij} y_i φ(x_i) ‖²
         − Σ_{i∈s⁻¹(j)} α_i y_i K(x_i, x_i) − Σ_{i=p+1}^{n} α_{ij} y_i K(x_i, x_i)               (17)

subject to

    Σ_{i∈s⁻¹(j)} α_i y_i + Σ_{i=p+1}^{n} α_{ij} y_i = 1,
    0 ≤ α_i ≤ C1,  i ∈ s⁻¹(j),    0 ≤ α_{ij} ≤ C2,  i = p + 1, . . . , n,   j = 1, . . . , m      (18)
We can prove that the classification error in the current iteration will be
smaller than that in the previous iteration through the following key theorem.
Hence

    Σ_{j=1}^{m} u_{ij} ( ‖φ(x_i) − c_j‖² − R_j² ) ≤ ξ_i                                          (21)

    Σ_{j=1}^{m} u_{ij} ( ‖φ(x_i) − c_j‖² − R_j² ) = ‖φ(x_i) − c_{j0}‖² − R_{j0}² ≤ 0 ≤ ξ_i        (22)
Case 3: x_i is abnormal. It is seen that
From (21)-(23), we can conclude that (R, c, ξ, U) is a feasible solution at the current iteration. In addition, the solution computed at the current iteration is optimal, so its objective value is no larger than that of this feasible solution. That results in our conclusion.
4 Experimental Results
We performed our experiments on 28 well-known data sets related to machine
fault detection and bioinformatics. These data sets were originally balanced data
sets and some of them contain several classes. For each data set, we picked up
a class at a time and divided the data set of this class into two equal subsets. One subset was used as the training set and the other one, together with the data sets of the other classes, was used for testing.

Table 1. Number of data points in the 28 data sets. #normal: number of normal data points, #abnormal: number of abnormal data points, and d: dimension.

We repeated this division ten times and calculated the average classification rates. We also compared our multi-sphere SVDD method with SVDD and OCSVM. The classification rate acc is measured as [6]

    acc = √(acc⁺ · acc⁻)                                                              (24)

where acc⁺ and acc⁻ are the classification accuracies on normal and abnormal data, respectively.
The popular RBF kernel function K(x, x′) = exp(−γ‖x − x′‖²) was used in our experiments. The parameter γ was searched in {2^k : k = 2l + 1, l = −8, −7, . . . , 2}.
For SVDD and multi-sphere SVDD, the trade-off parameter C1 was searched
Table 2. Classification results (in %) on 28 data sets for OCSVM, SVDD and Multi-
sphere SVDD (MS-SVDD).
over the grid {2^k : k = 2l + 1, l = −8, −7, . . . , 2}, and C2 was searched such that the ratio C2/C1 belonged to

    { (1/4)·p/(n − p),  (1/2)·p/(n − p),  p/(n − p),  2·p/(n − p),  4·p/(n − p) }      (25)
For OCSVM, the parameter ν was searched in {0.1k : k = 1, . . . , 9}. For multi-
sphere SVDD, the number of hyperspheres was changed from 1 to 10 and 50
iterations were applied to each training.
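The search grids described above can be generated as follows (a small illustrative helper, with n and p standing for the total and normal training counts as in the text):

```python
def parameter_grids(n, p):
    exponents = [2 * l + 1 for l in range(-8, 3)]        # l = -8, ..., 2
    gamma_grid = [2.0 ** k for k in exponents]
    c1_grid = [2.0 ** k for k in exponents]
    ratio = p / (n - p)
    c2_over_c1 = [0.25 * ratio, 0.5 * ratio, ratio, 2 * ratio, 4 * ratio]
    nu_grid = [0.1 * k for k in range(1, 10)]            # for OCSVM
    return gamma_grid, c1_grid, c2_over_c1, nu_grid
```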
Table 2 presents the classification results for OCSVM, SVDD, and multi-sphere SVDD (MS-SVDD). The results over the 28 data sets show that MS-SVDD always performs at least as well as SVDD. The reason is that SVDD can be regarded as a special case of MS-SVDD in which the number of hyperspheres is 1. MS-SVDD provides the highest accuracies for all data sets except the Colon cancer and Biomed data sets. For some data sets, MS-SVDD obtains the same result as SVDD; this could be explained by those data sets having only one distribution. Our new model seems to attain the largest improvement on the larger data sets. This is quite natural, since large data sets are more likely to contain different distributions that can be described by different hyperspheres.
5 Conclusion
We have proposed a new multiple-hypersphere approach to solving the one-class classification problem using support vector data description. A data set is described by a set of hyperspheres. This is an incremental learning process, and we can prove theoretically that the error rate obtained in the current iteration is less than that in the previous iteration. We have compared our proposed method with support vector data description and one-class support vector machines. Experimental results have shown that our proposed method provides better performance than those two methods over 28 well-known data sets.
References
1. Bishop, C.M.: Novelty detection and neural network validation. In: IEEE Proceed-
ings of Vision, Image and Signal Processing, pp. 217–222 (1994)
2. Barnett, V., Lewis, T.: Outliers in statistical data, 3rd edn. Wiley, Chichester
(1978)
3. Campbell, C., Bennet, K.P.: A linear programming approach to novelty detection.
Advances in Neural Information Processing Systems 14 (2001)
4. Chang, C.-C., Lin, C.-J.: LIBSVM: A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm
5. Hao, P.Y., Liu, Y.H.: A New Multi-class Support Vector Machine with Multi-
sphere in the Feature Space. In: Okuno, H.G., Ali, M. (eds.) IEA/AIE 2007. LNCS
(LNAI), vol. 4570, pp. 756–765. Springer, Heidelberg (2007)
6. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training set: One-sided
selection. In: Proc. 14th International Conference on Machine Learning, pp. 179–
186 (1997)
7. Le, T., Tran, D., Ma, W., Sharma, D.: An Optimal Sphere and Two Large Margins
Approach for Novelty Detection. In: Proc. IEEE World Congress on Computational
Intelligence, WCCI (accepted 2010)
8. Lin, Y., Lee, Y., Wahba, G.: Support vector machine for classification in nonstan-
dard situations. Machine Learning 15, 1115–1148 (2002)
9. Moya, M.M., Koch, M.W., Hostetler, L.D.: One-class classifier networks for target
recognition applications. In: Proceedings of World Congress on Neural Networks,
pp. 797–801 (1991)
10. Mu, T., Nandi, A.K.: Multiclass Classification Based on Extended Support Vector
Data Description. IEEE Transactions on Systems, Man and Cybernetics Part B:
Cybernetics 39(5), 1206–1217 (2009)
11. Parra, L., Deco, G., Miesbach, S.: Statistical independence and novelty detec-
tion with information preserving nonlinear maps. Neural Computation 8, 260–269
(1996)
12. Roberts, S., Tarassenko, L.: A Probabilistic Resource Allocation Network for Nov-
elty Detection. Neural Computation 6, 270–284 (1994)
13. Schölkopf, B., Smola, A.J.: Learning with kernels. The MIT Press, Cambridge (2002)
14. Tax, D.M.J., Duin, R.P.W.: Support vector data description. Machine Learning 54,
45–56 (2004)
15. Tax, D.M.J.: Datasets (2009), http://ict.ewi.tudelft.nl/~davidt/occ/index.html
16. Towel, G.G.: Local expert autoassociator for anomaly detection. In: Proc. 17th
International Conference on Machine Learning, pp. 1023–1030. Morgan Kaufmann
Publishers Inc., San Francisco (2000)
17. Vapnik, V.: The nature of statistical learning theory. Springer, Heidelberg (1995)
18. Vert, J., Vert, J.P.: Consistency and convergence rates of one class svm and related
algorithm. Journal of Machine Learning Research 7, 817–854 (2006)
19. Xiao, Y., Liu, B., Cao, L., Wu, X., Zhang, C., Hao, Z., Yang, F., Cao, J.:
Multi-sphere Support Vector Data Description for Outliers Detection on Multi-
Distribution Data. In: Proc. IEEE International Conference on Data Mining Work-
shops, pp. 82–88 (2009)
20. Yu, M., Ye, J.: A Small Sphere and Large Margin Approach for Novelty Detection
Using Training Data with Outliers. IEEE Transaction on Pattern Analysis and
Machine Intelligence 31, 2088–2092 (2009)
RADAR: Rare Category Detection via
Computation of Boundary Degree
1 Introduction
Rare category detection is an interesting task which is derived from anomaly detection. This task was first proposed by Pelleg et al. [2] to help the user select useful and interesting anomalies. Compared with traditional anomaly detection, it aims to find representative data points of the compact rare classes, which differ from the individual and isolated instances in low-density regions. Furthermore, a human expert is required to label each selected data point as belonging to a known class or a previously undiscovered class. A good rare category detection algorithm should discover at least one example from each class with the fewest label requests.
Rare category detection has many applications in the real world. In the Sloan Digital Sky Survey [2], it helps astronomers find useful anomalies in massive sky survey images, which may lead to new astronomical discoveries. In financial fraud detection [3], although most financial transactions are legitimate, there are a few fraudulent ones. Compared with checking them one by one, using rare category detection is much more efficient for detecting instances of the fraud patterns. In
intrusion detection [4], the authors adopted an active learning framework to select "interesting traffic" from huge-volume traffic data sets, so that engineers could find the meaningful malicious network activities. In visual analytics [5], by locating the attractive changes in massive remote sensing imagery, geographers can determine which changes in a particular geographic area are significant.
Up until now, several approaches have been proposed for rare category detec-
tion. The main techniques can be categorized into model-based [2], density-based
[7] [8] [9], and clustering-based [10] methods. The model-based methods assume
a mixture model to fit the data, and select the strangest records in the mix-
ture components for class labeling. However, this assumption has limited their
applicable scope. For example, they require that the majority classes and the
rare classes be separable, or work best in the separable case [7]. The density-based methods essentially employ a local-density-differential sampling strategy, which selects the points from the regions where the local densities fall the most. This kind of approach can discover examples of the rare classes rapidly, despite non-separability from the majority classes. But when the local densities of some rare classes are not dramatically higher than those of the majority classes, their performance is not as good as in the high density-differential case. The clustering-based methods first perform a hierarchical mean shift clustering. Then they select the clusters which are compact and isolated and query the cluster modes. Intuitively, if each rare class has a high local density and is isolated, its
points will easily converge at the mode of density by using mean shift. But in
real-world data sets, it is actually not the case. First, the rare classes are often
hidden in the majority classes. Second, if the local densities of the rare classes
are not high enough, their points may converge to other clusters. In short, although the density-based and clustering-based methods work reasonably well compared with model-based methods, their performance is still affected by the local densities of the rare classes.
In order to avoid the effect of the local densities of the rare classes, we propose
a density insensitive approach called RADAR. To the best of our knowledge,
RADAR is the first sophisticated density insensitive method for rare category
detection. In our approach, we use the change in the number of RkNN to estimate
the boundary degree for each data point. The point with a higher boundary
degree has a higher probability to be the boundary point of the rare class. Then
we sort the data points by their boundary degrees and query their class labels
with human experts.
The key contribution of our work is twofold:
(1) We propose a density insensitive method for rare category detection.
(2) Our approach is more efficient at finding new classes and effectively reduces the number of queries to human experts.
The rest of the paper is organized as follows. Section 2 formalizes the problem
and defines its scope. Section 3 explains the working principle and working steps
of our approach. In Section 4, we compare RADAR with existing approaches on
both synthetic data sets and real data sets. Section 5 is the conclusion of this
paper.
2 Problem Formalization
Following the definition of He et al. [7], we are given a set of unlabeled examples S = {x1, x2, ..., xn}, xi ∈ R^d, which come from m distinct categories, i.e. yi ∈ {1, 2, ..., m}. Our goal is to find at least one example from each category with as few label requests as possible. For convenience, assume that there is only one majority class, which corresponds to yi = 1, and all the other categories are minority classes with priors pc, c = 2, ..., m. Let p1 denote the prior of the majority category. Notice that pi, i ≠ 1, is much smaller than p1.
Our rare category detection strategy is selecting the points which get the
highest boundary degree for labeling. To understand our approach clearly, we
introduce the following definitions to be used for the rest of the paper.
Definition 1. (Reverse k-nearest neighbor) The reverse k-nearest neighbors
(RkNN) of a point are defined as follows [6]: Given a data set DB, a point p, a posi-
tive integer k and a distance metric M, the reverse k-nearest neighbors of p, i.e.
RkNN_p(k), is the set of points pi such that pi ∈ DB and ∀pi, p ∈ kNN_{pi}(k), where
kNN_{pi}(k) denotes the k-nearest neighbors of point pi.
Definition 2. (Significant point) A point is a significant point if the number of
its RkNN is above a certain threshold τ.
3 RADAR Algorithm
3.1 Working Principle
In this subsection, we explain why we have adopted an RkNN-based measurement
for the boundary degree, and illustrate the reason for adopting the concept of a
significant point.
Significant point. Before discussing the significant points which have more
than τ RkNN, we begin with an example illustrated in Fig. 1 which comes from
literature [6]. When k = 2, Table 1 shows the kNN and the RkNN of each
point in Fig. 1. The cardinality of each point's RkNN is as follows: p2, p3, p5 and
p7 have 3 RkNN; p6 has 2 RkNN; p1 and p4 have 1 RkNN; p8 has none. Notice that
p8's nearest neighbors lie in a relatively compact cluster consisting of p5, p6, p7.
However, the points in this cluster are each other's kNN. Since the capacity of each
Table 1. The kNN and RkNN of each point in Fig. 1 (k = 2)

Point   kNN       RkNN
p1      p2, p3    p2
p2      p1, p3    p1, p3, p4
p3      p2, p4    p1, p2, p4
p4      p2, p3    p3
p5      p6, p7    p6, p7, p8
p6      p5, p7    p5, p7
p7      p5, p6    p5, p6, p8
p8      p5, p7    (none)
point's kNN list is limited, p8 is not in the kNN lists of its nearest neighbors and
thus has no RkNN. According to Fig. 1, it is hard to regard p8 as a candidate
minority-class point. In other words, if a point has extremely few RkNN,
this point is relatively far from the other points. It is not worthwhile to
query its class label because of the low probability of this point belonging to a
compact cluster. Therefore, in our approach, we will query a point only if it is a
significant point.
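As a rough illustration of these notions, the following sketch computes each point's RkNN count from its kNN lists and marks the significant points; the values of k and τ used below are illustrative, not values prescribed by the paper.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbours of every point (excluding the point itself)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                 # a point is not its own neighbour
    return np.argsort(D, axis=1)[:, :k]

def rknn_counts(X, k):
    """RkNN count of each point: how many other points include it in their kNN lists."""
    counts = np.zeros(len(X), dtype=int)
    for neighbours in knn_indices(X, k):
        counts[neighbours] += 1
    return counts

def significant_points(X, k, tau):
    """Indices of significant points, i.e. points whose RkNN count exceeds tau (Definition 2)."""
    return np.flatnonzero(rknn_counts(X, k) > tau)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    print(significant_points(X, k=5, tau=4))
```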
3.2 Algorithm
class-label querying and thus save the querying budget. In addition, setting the
parameter w to be 1 is a suitable experimental choice. In the inner loop of Step
12, we query the point which has the maximum boundary degree with human
experts. When we find an example from a previously undiscovered class, we quit
the inner loop. In order to reduce the number of queries caused by repeatedly
selecting examples from the same discovered category, we employ a discreet
querying-duty-exemption strategy: (1) in Step 8, we build an empty point list EL
to record the points which do not need to be queried; (2) in Step 13, if a point
xj from class yj is labeled, the points falling inside a hyper-ball B of radius r_{y_j}
centered at xj will be added into EL.
A good exemption strategy can help us to reduce the querying cost. But if the
exemption strategy is greedy, more points near the labeled points will be added
into EL. Then, the risk of preventing some minority classes from being queried will
be higher, especially when the minority classes are near each other. In order to
avoid such a case, we should ensure that the number of querying-duty-exemption
points is not too large. In our discreet exemption strategy, when we label
a point under a minority class i, the number of points in the hyper-ball B will
not be more than ki, i.e. |B| ≤ ki. The reason is that the radius ri is the global
minimum distance between each point and its ki-th nearest neighbor. When we
label a point under the majority class, we do the querying-duty exemption more
carefully because this point is usually close to a rare category's boundary. We do
not set the corresponding radius of B to be min_{x∈S} k-dist_x(k_1). Instead, for the
sake of discreetness, we set r_1 = min_{i=2,...,m} r_i so that the nearby rare-category
points can keep their querying duties completely or partially.
4 Performance Evaluation
In this section, we compare RADAR with NNDM (density-based method pro-
posed in [7]), HMS (clustering-based method proposed in [10]) and random sam-
pling (RS) on both synthetic and real data sets. For RS, we run the experiments
50 times and take the average numbers of queries as the results.
[Figure: the first two synthetic data sets and the number of classes discovered versus
the number of selected examples for RS, HMS, NNDM and RADAR. (a) Results of the
high-density case. (b) Results of the low-density case.]
To find all the classes in these two cases, RS needs
101 and 100 queries respectively; HMS needs 62 and 89 queries respectively;
NNDM needs 10 and 31 queries respectively; RADAR needs 8 and 10 queries
respectively. From these results we can see that the performance of NNDM and
HMS is dramatically affected by the local densities of the rare classes. By con-
trast, RADAR and RS are largely insensitive to these local densities. Furthermore,
our approach is much more sophisticated than the straightforward RS method
and is highly efficient at finding new classes.
The third synthetic data set in Fig. 4(a) is a multi-density data set. The
majority class has 1000 examples (green points) with Gaussian distribution.
Each minority class (red points) has 20 examples and a density different from
the others. The comparison results are shown in Fig. 4(b). From this figure, we
can learn that the performance of RADAR is better than NNDM, HMS and RS
in this multi-density data set. To find all the classes, RS needs 103 queries; HMS
needs 343 queries; NNDM needs 55 queries; RADAR needs 17 queries.
[Figure 4: (a) the multi-density synthetic data set; (b) the number of classes discovered
versus the number of selected examples for RS, HMS, NNDM and RADAR.]
In this section, we compare RADAR with NNDM, HMS, and RS on 4 real data
sets from the UCI data repository [1]: the Abalone, Statlog, Wine Quality and
Yeast data sets. The properties of these data sets are summarized in Table 2.
In addition, the Statlog is sub-sampled because the original Image Segmenta-
tion (Statlog) data set contains almost the same number of examples for each
class. The sub-sampling can create an imbalanced data set which suits the rare
category detection scenario. With the sub-sampling, the largest class in Statlog
contains 256 examples; the examples of the next class are half as many as those
of the former one; the smallest classes all have 8 examples.

[Figure: the estimated local density of each minority class in each of the four real
data sets (local density versus minority class index).]

[Figure: the number of classes discovered versus the number of selected examples for
RS, HMS, NNDM and RADAR on the four real data sets.]

Table 3. Number of queries needed to find all classes for each algorithm

The results are sum-
marized in Table 3. The mark '-' indicates that the algorithm cannot find
all the classes in the data set.
These real data sets are multi-density data sets. To estimate the local den-
sity of each minority class, we adopt a measurement for the local density of a
data point. We first calculate the average distance between a data point and
its k-nearest neighbors. Next, we multiply the reciprocal of this average distance
by the global maximum distance between the points. The product is roughly
proportional to the local density of the data point. Finally, we calculate the
average value of the products over each minority class and take this value as the
local density estimate of that class.
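The measurement just described can be sketched as follows; this is a rough Python illustration assuming Euclidean distance, and the value of k is illustrative.

```python
import numpy as np

def local_densities(X, k):
    """For each point: (global maximum pairwise distance) divided by the average
    distance to the point's k nearest neighbours -- roughly proportional to the
    local density of the point, as described in the text."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_max = D.max()
    np.fill_diagonal(D, np.inf)
    avg_knn_dist = np.sort(D, axis=1)[:, :k].mean(axis=1)
    return d_max / avg_knn_dist

def minority_class_density(X, labels, cls, k=5):
    """Average the per-point estimates over one minority class to obtain its
    estimated local density."""
    return local_densities(X, k)[labels == cls].mean()
```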
5 Conclusion
We have proposed a novel approach (RADAR) for rare category detection. Com-
pared with existing algorithms, RADAR is a density insensitive method, which is
based on reverse k-nearest neighbors (RkNN). In this paper, the boundary degree
of each point is measured by variation of RkNN. Data points with high boundary
degrees are selected for the class-label querying. Experimental results on both
synthetic and real-world data sets demonstrate that the number of queries has
dramatically decreased by using our approach. Moreover, RADAR has a more
attractive property. It is more suitable to handle the multi-density data sets.
Future works involve adopting a technique of parameter automatization to set
the parameter w and adapting our approach to the prior-free case.
References
7. He, J., Carbonell, J.: Nearest-neighbor-based active learning for rare category de-
tection. In: Proc. NIPS 2007, pp. 633–640. MIT Press, Boston (2007)
8. He, J., Liu, Y., Lawrence, R.: Graph-based rare category detection. In: Proc. ICDM
2008, pp. 833–838 (2008)
9. He, J., Carbonell, J.: Prior-free rare category detection. In: Proc. SDM 2009, pp.
155–163 (2009)
10. Vatturi, P., Wong, W.: Category detection using hierarchical mean shift. In: Proc.
KDD 2009, pp. 847–856 (2009)
RKOF: Robust Kernel-Based Local Outlier
Detection
1 Introduction
Compared with the other knowledge discovery problems, outlier detection is ar-
guably more valuable and effective in finding rare events and exceptional cases
from the data in many applications such as stock market analysis, intrusion
detection, and medical diagnostics. (This work is supported in part by the NSFC
(Grant Nos. 60825204, 60935002 and 60903147) and the US NSF (Grant Nos.
IIS-0812114 and CCF-1017828).) In general, there are two definitions of an
outlier: the Regression outlier and the Hawkins outlier. A Regression outlier
is an observation that does not match the predefined metric
model of the data of interest [1]. A Hawkins outlier is an
observation that deviates so much from other observations as to arouse suspicion
that it was generated by a different mechanism [2]. Compared with
Regression outlier detection, Hawkins outlier detection is more challenging
because the generative mechanism of the normal data is unknown. In this paper,
we focus on the unsupervised methods for Hawkins outlier detection. In the rest
of this paper, outlier detection refers particularly to Hawkins outlier detection.
Over the past several decades, research on outlier detection has shifted from
global computation to local analysis, and the descriptions of outliers have evolved
from binary interpretations to probabilistic representations. Breunig et al.
proposed the density-estimation-based Local Outlier Factor (LOF) [4]. This work is
so influential that there is a rich body of literature on local density-based
outlier detection. On the one hand, plenty of local density-based methods have been
proposed to compute outlier factors, such as the local correlation integral [5],
the connectivity-based outlier factor [8], the spatial local outlier measure [9], and
the local peculiarity factor [7]. On the other hand, many efforts have been devoted
to combining machine learning methods with LOF to accommodate large
and high-dimensional data [10,14].
Although LOF is widely used in the literature, two major disadvantages
restrict its applications. First, since LOF is based on local
density estimation, it is obvious that the more accurate the density esti-
mate, the better the detection performance. The local reachability density used
in LOF is the reciprocal of the average of the reach-distances between the given
object and its neighbors. This density estimate is an extension of the nearest
neighbor density estimate, which is defined as
f(p) = \frac{k}{2n} \cdot \frac{1}{d_k(p)}                (1)
where n is the total number of objects, and d_k(p) is the distance between
object p and its k-th nearest neighbor.

Fig. 1. (a) Eruption lengths of 107 eruptions of Old Faithful geyser. (b) The density
of Old Faithful data based on the nearest neighbor density estimate, redrawn from [3].

As shown in Fig. 1, the heavy tails of the
density function and the discontinuities in the derivative reduce the accuracy
of the density estimate. This dilemma indicates that with the LOF method, the
outlier score of an outlier may fail to deviate substantially from those of normal
objects in complex and large databases. Second, like all other local density-based outlier detection
methods, the performance of LOF depends on the parameter k which is defined
as the least number of the nearest neighbors in the neighborhood of an object
[4]. However, in LOF, the value of k is determined based on the average density
estimate of the neighborhood, which is statistically vulnerable to the presence of
an outlier. Hence, it is hard to determine an appropriate value of this parameter
to ensure acceptable performance on complex and large databases.
In order to address these two disadvantages of LOF, we propose a Robust
Kernel-based Outlier Factor (RKOF) in this paper. Specifically, the main con-
tributions of our work are as follows:
– We propose a kernel-based outlier detection method which brings the vari-
able kernel density estimate method into the computation of outlier factors,
in order to achieve a more accurate density estimate. Besides, we propose
a new kernel function named the Volcano kernel which requires a smaller
value of the parameter k for outlier detection than other kernels, resulting
in less detection time.
– We propose the weighted density estimate of the neighborhood of a given
object to improve the robustness of determining the value of the parameter
k. Furthermore, we demonstrate that this weighted density estimate method
is superior to the average density estimate method used in LOF in robust
outlier detection.
– We keep the same local density-based outlier detection framework as LOF.
This means that RKOF can be directly used in the extensions of LOF,
such as Feature Bagging [10], Top-n outlier detection [14], and Local Kernel Re-
gression [15], and can improve the detection performance of these extensions.
The remainder of this paper is organized as follows. Section 2 introduces our
RKOF method with a novel kernel function, named the Volcano kernel, and an-
alyzes the special property of the Volcano kernel. Section 3 shows the robustness
and computational complexity of RKOF. Section 4 reports the experimental
results. Finally, Section 5 concludes the paper.
2 Main Framework
A density-based outlier is detected by comparing its density estimate with its
neighborhood density estimate [4]. Hence, we first introduce the notions of the
local kernel density estimate of an object p and the weighted density estimate of p's
neighborhood. Then, we introduce the notion of the robust kernel-based outlier
factor of p, which is used to detect outliers. In addition, we analyze the influence of
different kernels on the performance of our method, and propose a novel kernel
function named the Volcano kernel, with a special property for outlier detection.
Definition 1. Given a data set D, an object p, and any positive integer k, the
k-distance(p) is defined as the distance d(p, o) between p and an object o ∈ D
such that (i) for at least k objects o' ∈ D \ {p}, d(p, o') ≤ d(p, o), and (ii) for at
most k − 1 objects o' ∈ D \ {p}, d(p, o') < d(p, o).
Definition 2. Given a data set D, an object p, and any positive integer k, the
k-distance neighborhood of p, named Nk (p), contains every object whose distance
from p is not greater than the k-distance(p), i.e., Nk (p) = {q ∈ D\{p}|d(p, q) ≤
k-distance(p)}, where any such object q is called a k-distance neighbor of p.
|Nk (p)| is the number of the k-distance neighbors of p.
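A brief sketch of Definitions 1 and 2, assuming a Euclidean distance and a NumPy data matrix; the helper names are ours, not the paper's.

```python
import numpy as np

def k_distance(X, p, k):
    """k-distance(p): distance from object X[p] to its k-th nearest neighbour in D \\ {p}."""
    d = np.linalg.norm(X - X[p], axis=1)
    d[p] = np.inf                       # exclude p itself
    return np.sort(d)[k - 1]

def k_distance_neighborhood(X, p, k):
    """N_k(p): all objects (other than p) whose distance to p does not exceed
    k-distance(p); because of ties, |N_k(p)| can be larger than k."""
    d = np.linalg.norm(X - X[p], axis=1)
    idx = np.flatnonzero(d <= k_distance(X, p, k))
    return idx[idx != p]
```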
kde(p) is an extension of the variable kernel density estimate [3]. kde(p) not
only retains the adaptive kernel window width that is allowed to vary from one
object to another, but also is computed locally in the k-distance neighborhood of
object p. The parameter γ equals the dimension number d in the original variable
kernel density estimate [3]. For the local kernel density estimate, the larger γ,
the more sensitive kde(p). However, the high sensitivity of kde(p) is not always
a merit for local outlier detection in high-dimensional data. For example,
if λo is very small for all the objects in a sparse and high-dimensional
data set, (λo)^{−γ} becomes numerically infinite for every object. This deprives kde(p) of the capacity
to discriminate between outliers and normal data. We give γ a default value of 2
to obtain a balance between sensitivity and robustness.
In this paper, we compute the pilot density function f(x) by the approximate
nearest neighbor density estimate according to Equation (1):

f(o) = \frac{1}{k\text{-distance}(o)}                (2)

where ‖x‖ denotes the norm of a vector x, which is used to compute the
distances between objects.
Our RKOF method with the Gaussian kernel cannot ensure that the outlier fac-
tors of the normal objects in a cluster are approximately equal to 1; we then
additionally need to determine a threshold value for the outlier factors. The Epanech-
nikov kernel function equals zero when ‖x‖ is larger than 1. Hence, for most
outliers and normal objects lying on the border of clusters, the outlier factors
become infinite.
In order to achieve the same property with LOF, we define a novel kernel
function called the Volcano kernel as follows:
where β assures that K(x) integrates to one, and g(x) is a monotonically de-
creasing function, lying in the closed interval [0, 1] and equal to zero at infinity.
Unless otherwise specified, we use g(x) = e^{−|x|+1} as the default function in our
experiments.
Fig. 2 shows the curve of the Volcano kernel for univariate data. When ‖x‖
is not larger than 1, the kernel value equals a constant value β. This ensures
that the outlier factors of objects deep inside a cluster are approximately equal
to 1. When ‖x‖ is larger than 1, the kernel value is a monotonically decreasing
function of ‖x‖ and less than 1. This not only makes the outlier factors continuous
and finite, but also makes the outlier factors of outliers much larger than 1. Hence,
the RKOF method with the Volcano kernel can capture outliers more easily, and
can sort all the objects according to their RKOF values.

Fig. 2. The curve of the Volcano kernel for univariate data
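The closed-form expression of the Volcano kernel is not reproduced in this excerpt, so the sketch below only encodes the behaviour described above: a constant plateau β inside the unit ball and the decay g(x) = e^{−|x|+1} outside it, which is continuous at ‖x‖ = 1. The choice of β (which should make K integrate to one) is left as a free parameter here; this is an assumption, not the paper's exact formula.

```python
import numpy as np

def volcano_kernel(x, beta=1.0):
    """Illustrative Volcano-style kernel profile: K(x) = beta for ||x|| <= 1 and
    K(x) = beta * exp(-||x|| + 1) otherwise, i.e. constant on the unit ball and
    decaying monotonically to zero outside it (consistent with Fig. 2)."""
    r = np.linalg.norm(np.atleast_1d(np.asarray(x, dtype=float)))
    return beta if r <= 1.0 else beta * float(np.exp(1.0 - r))
```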
Fig. 3. The curves of OF (p) for the average and the weighted density estimates
\frac{r\omega + (1 - r)\rho}{r\omega + (1 - r)} = \frac{(\rho - \omega)r - \rho}{(1 - \omega)r - 1}
4 Experiments
In this section, we evaluate the outlier detection capability of RKOF based on
different kernel functions and compare RKOF with the state-of-the-art outlier
detection methods on several synthetic and real data sets.
Fig. 5. The best performances of RKOF and LOF on the Synthetic-2 data (Top-20)
detection rate is 100% and the false alarm rate is zero. Coverage is the ratio of
the number of detected outliers to the 16 total outliers. RKOF(σ = 0.1)
can identify all the outliers when k ≥ 27. RKOF(σ = 1) can detect all the out-
liers when k ≥ 31. Clearly, the parameter σ directly relates to the sensitivity of
the outlier detection for RKOF. LOF is unable to identify all the outliers until
k = 60. Table 1 indicates that the available k interval of RKOF is larger than
that of LOF, which means that RKOF is less sensitive to the parameter k.
As shown in Fig. 5, RKOF with k = 14 captures all the outliers in Top-20
objects. LOF obtains its best performance with k = 20, whose detection rate is
85%. Compared with RKOF, LOF cannot detect all the outliers regardless of the
value of k. It is obvious that the annular cluster and the Gaussian cluster
pose an obstacle to the choice of k. This result indicates that RKOF is better
adapted to complex data sets than LOF.
Table 2. The AUC values and the running time (in parentheses) for RKOF and the
comparing methods on the real data sets, using the k-d tree method [17]. Since LPF has
higher complexity and is unable to process the data sets in a reasonable time, the
exact running time for LPF is not given in this table.
Methods \ Data    KDD    Mammography    Ann-thyroid    Shuttle (average)
RKOFa 0.962 (1918.1s) 0.871 (15.8s) 0.970 (4.9s) 0.990 (36.4s)
RKOFb 0.961 (2095.2s) 0.870 (19.8s) 0.970 (5.2s) 0.990 (36.9s)
RKOFc 0.944 (2363.7s) 0.855 (48.2s) 0.965 (13.2s) 0.993 (36.7s)
LOF 0.610 (2160.1s) 0.640 (28.8s) 0.869 (5.9s) 0.852 (42.0s)
LDF 0.941 (2214.9s) 0.824 (36.4s) 0.943 (7.2s) 0.962 (37.1s)
LPF 0.98 0.87 0.97 0.992
Bagging 0.61(±0.25) 0.74(±0.07) 0.98(±0.01) 0.985(±0.031)
Boosting 0.51(±0.004) 0.56(±0.02) 0.64 0.784(±0.13)
Feature Bagging 0.74(±0.1) 0.80(±0.1) 0.869 0.839
Active Learning 0.94(±0.04) 0.81(±0.03) 0.97(±0.01) 0.999(±0.0006)
a. Using Volcano kernel b. Using Gaussian kernel c. Using Epanechnikov kernel
as normal data. There are 21 attributes where 15 attributes are binary and 6
attributes are continuous. The Shuttle data set consists of 11478 records with
label 1, 13 records with label 2, 39 records with label 3, 809 records with label
5, 4 records with label 6, and 2 records with label 7. We divide this data set into
5 subsets: label 2, 3, 5, 6, 7 records vs label 1 records, where the label 1 records
are normal, and others are outliers.
All the comparing outlier detection methods are evaluated using the ROC
curves and the AUC values. The ROC curve represents the trade-off between
the detection rate (y-axis) and the false alarm rate (x-axis). The AUC value
is the area under the ROC curve. Clearly, the larger the AUC value, the
better the outlier detection method.
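A compact sketch of this evaluation, assuming each object has an outlier score and a binary ground-truth label (1 = outlier): rank by score, accumulate detections and false alarms, and integrate the resulting curve. Variable names are ours.

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC curve (false alarm rate vs. detection rate) and its AUC from outlier scores.
    Ties in the scores are not handled specially in this sketch."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # most outlying first
    y = np.asarray(labels)[order]
    tpr = np.concatenate(([0.0], np.cumsum(y) / max(y.sum(), 1)))
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / max((1 - y).sum(), 1)))
    auc = np.trapz(tpr, fpr)                                # area under the ROC curve
    return fpr, tpr, auc
```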
The AUC values for RKOF with different kernels and all other comparing
methods are given in Table 2. Also shown in Table 2 are the running time data
for RKOF with different kernels as well as those of the other three local density-
based methods; since the AUC values for other comparing methods are directly
obtained from their publications in the literature, the running time data for
these methods are not available and thus are not included in this table.
From Table 2, we see that the RKOF methods using different kernels
achieve similar AUC values on all the data sets, especially the Volcano and Gaus-
sian kernels. The k values with the best detection performance for all three
kernels on all the data sets are shown in Fig. 6(a). Clearly, the k values for the
Volcano kernel are always smaller than those for the other kernels, and the k
values for the Epanechnikov kernel are the largest among the three kernels. This
experiment supports one of the contributions of this work, namely that the proposed
Volcano kernel requires the least computation time among the kernels considered.
Fig. 6. (a) The k values with the best performance for different kernels in RKOF. (b)
ROC curves for RKOF based on the Volcano kernel on the KDD and the Mammography
data sets.
Fig. 7. AUC values of RKOF based on the Volcano kernel with different k values for
the KDD and Mammography data sets
achieves the acceptable performance that is very close to the best performance.
The AUC value of the Shuttle data set is the average AUC of all the five subsets,
where the AUC values of the subsets with the label 5, label 6, and label 7 are
all approximately equal to 1. RKOF also obtains the acceptable performance
that is very close to the best for the Shuttle data set. Overall, while
there is no single winner across all cases, RKOF always achieves, or comes close to,
the best performance on all the data sets with the least running
time. In particular, RKOF achieves or approaches the best
performance on the KDD and Mammography data sets, the two largest of the
four data sets, with much less running time. This
demonstrates the high scalability of the RKOF method in outlier detection.
Specifically, in all cases RKOF has less running time than LOF, LDF
and LPF. Though the running time data for the other comparing methods are
not available, from the theoretical complexity analysis it is clear that they would
all take longer than RKOF.
5 Conclusions
We have studied the local outlier detection problem in this paper. We have
proposed the RKOF method based on the variable kernel density estimate and
the weighted density estimate of the neighborhood of an object, which have ad-
dressed the existing disadvantages of LOF and other density-based methods. We
have proposed a novel kernel function named the Volcano kernel, which is more
suitable for outlier detection. Theoretical analysis and empirical evaluations on
the synthetic and real data sets demonstrate that RKOF is more robust and
effective for outlier detection while requiring less computation time.
References
1. Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection. John Wiley
and Sons, New York (1987)
2. Hawkins, D.: Identification of Outliers. Chapman and Hall, London (1980)
3. Silverman, B.: Density Estimation for Statistics and Data Analysis. Chapman and
Hall, London (1986)
4. Breunig, M.M., Kriegel, H.-P., Ng, R.T., Sander, J.: LOF: Identifying density-based
local outliers. In: SIGMOD, pp. 93–104 (2000)
5. Papadimitriou, S., Kitagawa, H., Gibbons, P.: LOCI: Fast outlier detection using
the local correlation integral. In: ICDE, pp. 315–326 (2003)
6. Latecki, L.J., Lazarevic, A., Pokrajac, D.: Outlier Detection with Kernel Density
Functions. In: Perner, P. (ed.) MLDM 2007. LNCS (LNAI), vol. 4571, pp. 61–75.
Springer, Heidelberg (2007)
7. Yang, J., Zhong, N., Yao, Y., Wang, J.: Local peculiarity factor and its application
in outlier detection. In: KDD, pp. 776–784 (2008)
8. Tang, J., Chen, Z., Fu, A.W.-c., Cheung, D.W.: Enhancing effectiveness of outlier
detections for low density patterns. In: Chen, M.-S., Yu, P.S., Liu, B. (eds.) PAKDD
2002. LNCS (LNAI), vol. 2336, pp. 535–548. Springer, Heidelberg (2002)
RKOF: Robust Kernel-Based Local Outlier Detection 283
9. Sun, P., Chawla, S.: On local spatial outliers. In: KDD, pp. 209–216 (2004)
10. Lazarevic, A., Kumar, V.: Feature bagging for outlier detection. In: KDD, pp.
157–166 (2005)
11. Abe, N., Zadrozny, B., Langford, J.: Outlier detection by active learning. In: KDD,
pp. 504–509 (2006)
12. Breiman, L.: Bagging predictors. J. Machine Learning 24(2), 123–140 (1996)
13. Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and
an application to boosting. J. Comput. Syst. Sci. 55(1), 113–139 (1997)
14. Jin, W., Tung, A., Ha, J.: Mining top-n local outliers in large databases. In: KDD,
pp. 293–298 (2001)
15. Gao, J., Hu, W., Li, W., Zhang, Z.M., Wu, O.: Local Outlier Detection Based on
Kernel Regression. In: ICPR, pp. 585–588 (2010)
16. Barnett, V., Lewis, T.: Outliers in Statistical Data. John Wiley, New York (1994)
17. Bentley, J.L.: Multidimensional binary search trees used for associative searching.
J. Communications of the ACM 18(9), 509–517 (1975)
Chinese Categorization and Novelty Mining
1 Introduction
The overabundance of information leads to the proliferation of useless and redun-
dant content. Novelty mining (NM) is able to detect useful and novel information
from a chronologically ordered list of relevant documents or sentences. Although
techniques such as dimensionality reduction [15,18], probabilistic models [16,17],
and classification [12] can be used to reduce the data size, novelty mining tech-
niques are preferred since they allow users to quickly get useful information by
filtering away the redundant content.
The process of novelty mining consists of three main steps, (i) preprocessing,
(ii) categorization, and (iii) novelty detection. This paper covers all three
steps of novelty mining, a combination which has rarely been explored. In the first step, text
sentences are preprocessed by removing stop words, stemming words to their
root form, and tagging the Parts-of-Speech (POS). In the second step, each in-
coming sentence is classified into its relevant topic bin. In the final step, novelty
detection searches through the time sequence of sentences and retrieves only
those with “novel” information. This paper examines the link between catego-
rization and novelty mining. In this task, we need to identify all novel Chinese
text given groups of relevant sentences. Moreover, we also discuss the sentence
categorization and novelty mining performance based on the retrieval results.
The main contributions of this work are the investigation of preprocessing
techniques for detecting novel Chinese text, the discussion of the POS filtering
rule for selecting words to represent a sentence, several experiments comparing
the novelty mining performance between Chinese and English, the finding that
the novelty mining performance on Chinese can be as good as that on English
if the preprocessing precision on Chinese text is increased, the application
of a mixed novelty metric that effectively improves Chinese novelty min-
ing performance, and a set of new novelty mining evaluation measures which
help users to objectively evaluate the novelty mining results: Novelty-
Precision, Novelty-Recall, Novelty-F Score, and Sensitivity.
The rest of this paper is organized as follows. The first section gives a brief
overview of related work on detecting novel documents and sentences. The next
section introduces the details of preprocessing steps for English and Chinese.
Next, we describe the categorization algorithm and the mixed metric technique,
which is applied in Chinese novelty mining. Traditional evaluation measures are
described and new novelty evaluation measures for novelty mining are then pro-
posed. Next, the experimental results are reported on the effect of preprocessing
rules on Chinese novelty mining, Chinese novelty mining using mixed metric,
categorization in English and Chinese, and novelty mining based on categoriza-
tion using the old and newly proposed evaluation measures. The final section
summarizes the research contributions and findings.
2 Related Work
Traditional sentence categorization methods use queries from topic information
to evaluate similarity between an incoming sentence and the topic [1]. Then,
each sentence is placed into its category according to the similarity. However,
using queries from the topic information cannot guarantee satisfactory results
since these queries can only provide limited information. Later works have
emphasized how to expand the query so as to optimize the retrieval results
[2]. The initial query, which is usually short, can be expanded based on
explicit user feedback or implicit pseudo-feedback in the target collections and
external resources, such as Wikipedia, search engines, etc. [2]. Moreover, ma-
chine learning algorithms have been applied to sentence categorization; these first
transform sentences, which typically are strings of characters, into a representa-
tion suitable for the learning algorithms. Then, different classifiers are chosen to
categorize the sentences to their relevant topics.
Initial studies of novelty mining focused on the detection of novel documents.
A document which is very similar to any of its history documents is regarded as
a redundant document. To serve users better, novel information at the sentence
level can be further highlighted. Therefore, later studies focused on detecting
286 F.S. Tsai and Y. Zhang
novel sentences, such as those reported in TREC Novelty Tracks [11], which
compared various novelty metrics [19,21], and integrated different natural lan-
guage techniques [7,14,20,22].
Studies for novelty mining have been conducted on the English and Malay
languages [4,6,8,24]. Novelty mining studies on the Chinese language have been
performed on topic detection and tracking, which identifies and collects relevant
stories on certain topics from a stream of information [25]. However, to the best
of our knowledge, few studies have been reported on the entire process of Chinese
novelty mining, from preprocessing and categorization to the actual detection of
novel information, which is the focus of this paper.
3.1 English
English preprocessing first removes all stop words, such as conjunctions, prepo-
sitions, and articles. After removing stop words, word stemming is performed,
which reduces the inflected words to their primitive root forms.
3.2 Chinese
Chinese preprocessing first needs to perform lexical analysis since there is no
obvious boundary between Chinese words. Chinese word segmentation is a very
challenging problem because of the difficulties in defining what constitutes a
Chinese word [3]. Furthermore, there are no white spaces between Chinese words
or expressions and there are many ambiguities in the Chinese language. For
example, '主板和服务器' ('mainboard and server' in English) might be segmented
as '主板/和/服务器' ('mainboard/and/server') or as '主板/和服/务/器'
('mainboard/kimono/task/utensil'). This ambiguity is a great
challenge for Chinese word segmentation. Moreover, since there are no obvious
derived words in Chinese, word stemming cannot be performed.
To reduce the noise from Chinese word segmentation and obtain a better word
list for a sentence, we first apply word segmentation on the Chinese text and
then utilize Part-of-Speech (POS) tagging to select the meaningful candidate
words. We used ICTCLAS for word segmentation and POS tagging because it
achieves a higher precision than other Chinese POS tagging softwares [23].
Two different rules were used to select the candidate words of a sentence.
Consider the example sentence "There is a picture on the wall". After POS
filtering using Rule1, the words with the tags 'n' (noun), 'v' (verb), 'v' (verb),
'm' (measure word), 'q' (quantifier) and 'n' (noun) are retained. After POS
filtering using Rule2, only the words tagged 'n', 'v', 'v' and 'n' remain. By
using Rule2, we can remove more unimportant words.
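The segmentation and tagging interface of ICTCLAS is not shown in the paper, so the sketch below simply assumes the tagger has already produced (word, POS-tag) pairs; the tag prefixes used for Rule2 (nouns, verbs, adjectives, adverbs) and the example tagged sentence are our assumptions, not values from the paper.

```python
# Rule2-style candidate-word selection over an already-segmented and POS-tagged
# sentence. Tag prefixes follow the PKU convention ('n' noun, 'v' verb,
# 'a' adjective, 'd' adverb) -- an assumption, not a specification from the paper.
RULE2_PREFIXES = ("n", "v", "a", "d")

def filter_rule2(tagged_sentence):
    """Keep only words whose POS tag starts with one of the Rule2 prefixes."""
    return [word for word, tag in tagged_sentence if tag.startswith(RULE2_PREFIXES)]

# Hypothetical tagged sentence for "There is a picture on the wall":
tagged = [("墙", "n"), ("上", "f"), ("有", "v"), ("一", "m"), ("幅", "q"), ("画", "n")]
print(filter_rule2(tagged))   # -> ['墙', '有', '画']
```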
4 Categorization
From the output of the preprocessing steps on English and Chinese languages,
we obtain bags of English and Chinese words. The corresponding term sen-
tence matrix (TSM) can be constructed by counting the term frequency (TF) of
each word. Therefore, each sentence can be conveniently represented by a vector
where the TF value of each word is considered as one feature. Retrieving rel-
evant sentences is traditionally based on computing the similarity between the
representations of the topic and the sentences. The famous Rocchio algorithm
[10] is adopted to categorize the sentences to their topics.
The Rocchio algorithm is popular for two reasons. First, it is computationally
efficient for online learning. Second, compared to many other algorithms, it
works well empirically, especially at the beginning stage of adaptive filtering
where the number of training examples is very small.
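A minimal sketch of this categorization step under our own naming: TF term-sentence vectors over a fixed vocabulary, a Rocchio-style prototype per topic, and cosine similarity for assigning a sentence to its most similar topic. The Rocchio weights shown are common defaults, not values taken from the paper.

```python
import numpy as np
from collections import Counter

def tf_vector(words, vocab):
    """Term-frequency vector of a preprocessed bag of words over a fixed vocabulary."""
    counts = Counter(words)
    return np.array([counts[t] for t in vocab], dtype=float)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return 0.0 if denom == 0.0 else float(u @ v) / denom

def rocchio_prototype(query_vec, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio prototype: initial topic query plus the weighted centroid of relevant
    vectors minus the centroid of non-relevant ones."""
    pos = np.mean(relevant, axis=0) if len(relevant) else np.zeros_like(query_vec)
    neg = np.mean(nonrelevant, axis=0) if len(nonrelevant) else np.zeros_like(query_vec)
    return alpha * query_vec + beta * pos - gamma * neg

def categorize(sentence_vec, topic_prototypes):
    """Assign the sentence to the topic whose prototype it is most similar to."""
    return max(topic_prototypes, key=lambda t: cosine(sentence_vec, topic_prototypes[t]))
```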
5 Novelty Mining
From the output of preprocessing, a bag of words is obtained, from which the cor-
responding term-sentence matrix (TSM) can be constructed by counting the term
frequency (TF) of each word. The novelty mining system compares the incoming
sentence to its history sentences in this vector space. Since the novelty mining
process is the same for English and Chinese, a novelty mining system designed for
English can also be applied to Chinese.
The novelty of a sentence can be quantitatively measured by a novelty metric
and represented by a novelty score N . The final decision on whether a sentence is
novel depends on whether the novelty score falls above a threshold. The sentence
that is predicted as “novel” will be placed into the history list of sentences.
After normalizing the metrics, the novelty scores from all novelty metrics range from 0
(i.e. redundant) to 1 (i.e. totally novel). Therefore, the metrics are both com-
parable and consistent because they have the same range of values. For the
combining strategy, we adopt a new technique for measuring the novelty score
N(st) of the current sentence st by combining the two types of metrics, as shown in
Equation (1):

N(s_t) = \alpha N_{sym}(s_t) + (1 - \alpha) N_{asym}(s_t)                (1)
where Nsym is the novelty score using the symmetric metric, Nasym is the novelty
score using the asymmetric metric, and α is the combining parameter ranging
from 0 to 1. The larger the value of α, the heavier the weight for the symmetric
metrics.
The new word count novelty metric is a popular asymmetric metric, which
was proposed for sentence-level novelty mining [1]. The idea of the new word
count novelty metric is to assign the incoming sentence the count of the new
words that have not appeared in its history sentences, as defined in Equation
(2).
newWord(s_t) = |W(s_t)| - \left| W(s_t) \cap \bigcup_{i=1}^{t-1} W(s_i) \right|                (2)
where W (si ) is the set of words in the sentence si . The values of the new word
count novelty metric for an incoming sentence are non-negative integers such as
0, 1, 2, etc. To normalize the values of the novelty scores into the range of 0 to
1, the new word count novelty metric can be normalized by the total number of
words in the incoming sentence st as below.
N_{newWord}(d_t) = 1 - \frac{\left| W(d_t) \cap \bigcup_{i=1}^{t-1} W(d_i) \right|}{|W(d_t)|}                (3)
where the denominator |W (dt )| is the word count of dt . This normalized metric,
NnewW ord , has the range of values from 0 (i.e. no new word) to 1 (i.e. 100% new
words).
In the following experiments using the mixed metric, α is set to 0.75. We chose the
cosine metric as the symmetric metric, the new word count defined in Equation (2) as
the asymmetric metric, and TF as the term weighting function.
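Putting the pieces above together, a rough sketch of the sentence-level scoring (helper names are ours; the cosine-based symmetric score and the convex blend follow the description in the text):

```python
import numpy as np
from collections import Counter

def new_word_score(sentence_words, history):
    """Normalised new word count of Equation (3): fraction of the sentence's words
    not seen in any history sentence."""
    w = set(sentence_words)
    if not w:
        return 0.0
    seen = set().union(*history) if history else set()
    return 1.0 - len(w & seen) / len(w)

def cosine_novelty(sentence_words, history, vocab):
    """Symmetric score: 1 minus the maximum TF cosine similarity to any history sentence."""
    def tf(words):
        c = Counter(words)
        return np.array([c[t] for t in vocab], dtype=float)
    v = tf(sentence_words)
    best = 0.0
    for h in history:
        u = tf(h)
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        if denom > 0:
            best = max(best, float(u @ v) / denom)
    return 1.0 - best

def mixed_novelty(sentence_words, history, vocab, alpha=0.75):
    """Blend of the symmetric and asymmetric scores; alpha = 0.75 as in the experiments."""
    return (alpha * cosine_novelty(sentence_words, history, vocab)
            + (1 - alpha) * new_word_score(sentence_words, history))
```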
Although Precision, Recall, and F-Score can measure the novelty mining perfor-
mance well when sentences are correctly categorized, if there are errors in the
categorization, the measures cannot objectively measure the novelty mining per-
formance. In order to objectively measure the novelty mining performance, we
propose a set of new evaluation measures called Novelty Precision (N-Precision),
Novelty Recall (N-Recall) and Novelty F-Score (N-F Score). They are calculated
only on the sentences correctly categorized by our system instead
of all task-relevant sentences. We remove the incorrectly categorized sentences
before our novelty mining evaluation.
N\text{-precision} = \frac{NN^+}{NN^+ + NR^+}                (4)

N\text{-recall} = \frac{NN^+}{NN^+ + NN^-}                (5)

N\text{-F} = \frac{2 \times N\text{-precision} \times N\text{-recall}}{N\text{-precision} + N\text{-recall}}                (6)
where N R+ ,N R− ,N N + ,N N − correspond to the number of sentences that fall
into each category (see Table 2).
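A direct transcription of Equations (4)–(6); the interpretation of the counts in the comment is our reading of the standard contingency table, since Table 2 itself is not reproduced in this excerpt.

```python
def novelty_measures(nn_plus, nr_plus, nn_minus):
    """N-precision, N-recall and N-F score from the counts of Table 2, computed over
    correctly categorized sentences only (assumed meanings: NN+ novel and predicted
    novel, NR+ redundant but predicted novel, NN- novel but predicted redundant)."""
    n_precision = nn_plus / (nn_plus + nr_plus) if (nn_plus + nr_plus) else 0.0
    n_recall = nn_plus / (nn_plus + nn_minus) if (nn_plus + nn_minus) else 0.0
    denom = n_precision + n_recall
    n_f = 2 * n_precision * n_recall / denom if denom else 0.0
    return n_precision, n_recall, n_f
```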
Our N-precision, N-recall and N-F Score do not consider the novelty mining
performance on the sentences that are wrongly categorized into a topic. There-
fore, in order to better measure the novelty mining result for this part, we intro-
duce a new measure called Sensitivity (defined in Equation 7), which indi-
cates whether the novelty mining system is sensitive to the irrelevant sentences.
Based on our experiments, we learn that the Chinese novelty mining per-
formance is better when choosing the stricter rule (Rule2). Thus, POS filter-
ing is necessary for Chinese because just removing some non-meaningful words
(like stop words) may not be sufficient. POS filtering removes the less mean-
ingful words so that each vector can be better represented. Rule2, which keeps
only nouns, verbs, adjectives and adverbs, produces better results for novelty
mining. Therefore, the remaining experiments used Rule2 for preprocessing the
Chinese text.
Fig. 1. PR curves for sentence-level novelty mining on Chinese using mixed metric on
TREC 2004. The grey dashed lines show contours at intervals of 0.1 points of F .
[Figure: N-Precision versus N-Recall curves at varying novelty thresholds.]
The reason why the novelty mining performance based on the
categorization results is not good is that not all the relevant sentences are
correctly categorized. The assessors judge the novelty of each sentence only on
the correctly identified relevant sentences. Therefore, if the categorization of a sentence
is incorrect, the subsequent novelty mining performance will be adversely affected.
7 Conclusion
This paper studied the entire process of preprocessing, categorization and novelty
mining for detecting novel Chinese text, which were insufficiently addressed in
previous studies. We described the Chinese preprocessing steps when choosing
different Part-of-Speech (POS) filtering rules. We compared the novelty mining
performance between Chinese and English and found that the novelty mining
performance on Chinese can be as good as that on English by increasing the
preprocessing precision on Chinese text.
294 F.S. Tsai and Y. Zhang
Then we applied a mixed novelty metric that effectively improved the Chinese
novelty mining performance at the sentence level. Next, we compared the per-
formance of categorization in English and Chinese, and found that Chinese cat-
egorization was influenced by the noise in preprocessing. Finally, we discussed the
categorization and novelty mining performance based on the retrieval results. In
order to objectively evaluate the novelty mining performance, we proposed a set
of new novelty mining evaluation measures, Novelty-Precision, Novelty-Recall,
Novelty-F Score, and Sensitivity. The new evaluation measures can more fairly
assess how the performance of novelty mining is influenced by the categorization
results.
References
1. Allan, J., Wade, C., Bolivar, A.: Retrieval and novelty detection at the sentence
level. In: SIGIR 2003: Proceedings of the 26th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, pp. 314–321
(2003)
2. Diaz, F., Metzler, D.: Improving the estimation of relevance models using large
external corpora. In: SIGIR 2006, Seattle, USA, pp. 154–161 (2006)
3. Gao, J., Li, M., Wu, A., Huang, C.-N.: Chinese word segmentation and named
entity recognition: A pragmatic approach. Computational Linguistics 31(4), 531–
574 (2005)
4. Kwee, A.T., Tsai, F.S., Tang, W.: Sentence-level novelty detection in english and
malay. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD
2009. LNCS, vol. 5476, pp. 40–51. Springer, Heidelberg (2009)
5. Li, Y., Taylor, J.S.: The SVM with uneven margins and Chinese document cate-
gorisation. In: Proceedings of the 17th Pacific Asia Conference on Language, In-
formation and Computation, pp. 216–227 (2003)
6. Liang, H., Tsai, F.S., Kwee, A.T.: Detecting novel business blogs. In: ICICS 2009 -
Conference Proceedings of the 7th International Conference on Information, Com-
munications and Signal Processing (2009)
7. Ng, K.W., Tsai, F.S., Chen, L., Goh, K.C.: Novelty detection for text documents
using named entity recognition. In: 2007 6th International Conference on Informa-
tion, Communications and Signal Processing, ICICS (2007)
8. Ong, C.L., Kwee, A., Tsai, F.: Database optimization for novelty detection. In:
ICICS 2009 - Conference Proceedings of the 7th International Conference on In-
formation, Communications and Signal Processing (2009)
9. PKU and CAS, Chinese POS tagging criterion (1999),
http://icl.pku.edu.cn/icl_groups/corpus/addition.htm
10. Rocchio, J.: Relevance feedback in information retrieval. In: The SMART Retrieval
System: Experiments in Automatic Document Processing, pp. 313–323 (1971)
11. Soboroff, I.: Overview of the TREC 2004 Novelty Track. In: Proceedings of TREC
2004 - the 13th Text Retrieval Conference, pp. 1–16 (2004)
12. Tan, R., Tsai, F.S.: Authorship identification for online text. In: International
Conference on Cyberworlds, pp. 155–162 (2010)
13. Tang, W., Tsai, F.S., Chen, L.: Blended metrics for novel sentence mining. Expert
Syst. Appl. 37(7), 5172–5177 (2010)
14. Tsai, F.S.: Review of techniques for intelligent novelty mining. Information Tech-
nology Journal 9(6), 1255–1261 (2010)
Chinese Categorization and Novelty Mining 295
15. Tsai, F.S.: Dimensionality reduction techniques for blog visualization. Expert Sys-
tems With Applications 38(3), 2766–2773 (2011)
16. Tsai, F.S.: A tag-topic model for blog mining. Expert Systems With Applica-
tions 38(5), 5330–5335 (2011)
17. Tsai, F.S., Chan, K.L.: Detecting cyber security threats in weblogs using proba-
bilistic models. In: Yang, C.C., Zeng, D., Chau, M., Chang, K., Yang, Q., Cheng,
X., Wang, J., Wang, F.-Y., Chen, H. (eds.) PAISI 2007. LNCS, vol. 4430, pp. 46–57.
Springer, Heidelberg (2007)
18. Tsai, F.S., Chan, K.L.: Dimensionality reduction techniques for data exploration.
In: 2007 6th International Conference on Information, Communications and Signal
Processing, ICICS 2007, pp. 1568–1572 (2007)
19. Tsai, F.S., Chan, K.L.: Redundancy and novelty mining in the business blogo-
sphere. The Learning Organization 17(6), 490–499 (2010)
20. Tsai, F.S., Chan, K.L.: An intelligent system for sentence retrieval and novelty
mining. International Journal of Knowledge Engineering and Data Mining 1(3),
235–253 (2011)
21. Tsai, F.S., Tang, W., Chan, K.L.: Evaluation of metrics for sentence-level novelty
mining. Information Sciences 180(12), 2359–2374 (2010)
22. Tsai, F.S., Zhang, Y.: D2S: Document-to-sentence framework for novelty detection.
Knowledge and Information Systems (2011)
23. Zhang, H.-P., Liu, Q., Cheng, X.-Q., Zhang, H., Yu, H.-K.: Chinese lexical analysis
using hierarchical hidden Markov model. In: Second SIGHAN Workshop Affiliated
with 41st ACL, pp. 63–70 (2003)
24. Zhang, Y., Tsai, F.S.: Combining named entities and tags for novel sentence detec-
tion. In: Proceedings of the WSDM 2009 ACM Workshop on Exploiting Semantic
Annotations in Information Retrieval, ESAIR 2009, pp. 30–34 (2009)
25. Zheng, W., Zhang, Y., Zou, B., Hong, Y., Liu, T.: Research of Chinese topic track-
ing based on relevance model (2008)
Finding Rare Classes: Adapting Generative and
Discriminative Models in Active Learning
1 Introduction
Many real life problems are characterized by data distributed between vast yet
uninteresting background classes, and small rare classes of interesting instances
which should be identified. In astronomy, the vast majority of sky survey image
content is due to well understood phenomena, and only 0.001% of data is of inter-
est for astronomers to study [12]. In financial transaction monitoring, most transactions are
ordinary, but a few unusual ones indicate fraud, and regulators would like to find
future instances. Computer network intrusion detection exhibits vast amounts
of normal user traffic and very few examples of malicious attacks [16]. Finally,
in computer-vision-based security surveillance of public spaces, the observed activi-
ties are almost always people going about everyday behaviours, but very rarely
a dangerous or malicious activity of interest may occur [19].
cation problems share two interesting properties: highly unbalanced frequencies
– the vast majority of data occurs in one or more background classes, while
the instances of interest for classification are much rarer; and unbalanced prior
supervision – the majority classes are typically known a priori, while the rare
classes are not. Classifying rare event instances rather than merely detecting any
rare event is crucial because different classes may warrant different responses,
for example due to different severity levels. In order to discover and learn to
classify the interesting rare classes, exhaustive labeling of a large dataset would
be required to ensure sufficient rare class coverage. However this is prohibitively
expensive when generating each label requires significant time of a human ex-
pert. Active learning strategies might be used to discover or train a classifier
with minimal label cost, but this is complicated by the dependence of classifier
learning on discovery: one needs examples of each class to train a classifier.
The problem of joint discovery and classification has received little atten-
tion despite its importance and broad relevance. The only existing attempt to
address this is based on simply applying schemes for discovery and classifier
learning sequentially or in fixed iteration [16]. Methods which treat discovery
and classification independently perform poorly due to making inefficient use
of data, (e.g., spending time on classifier learning is useless if the right classes
have not been discovered and vice-versa). Achieving the optimal balance is crit-
ical, but non-trivial given the conflict between discovery and learning criteria.
To address this, we build a generative-discriminative model pair [11,4] for com-
puting discovery and learning query criteria, and adaptively balance their use
based on joint discovery and classification performance. Depending on the ac-
tual supervision cost and sparsity of rare class examples, the quantity of labeled
data varies. Given the nature of data dependence in generative and discrimi-
native models [11], the ideal classifier also varies. As a second contribution, we
therefore address robustness to label quantity and introduce a classifier switch-
ing algorithm to optimize performance as data is accumulated. The result is a
framework which significantly and consistently outperforms existing methods at
the important task of discovery and classification of rare classes.
symbol indicates the estimated class based on two initial labeled points (large
symbols). The black line indicates the initial decision boundary. In Figure 1(a) all
classes are known but the decision boundary needs refining. Likelihood sampling
(most unlikely point under the learned model) inefficiently builds a model of the
whole space (choosing first the points labeled L), while uncertainty sampling
selects points closest to the boundary (U symbols), leading to efficient refine-
ment. In Figure 1(b) only two classes are known. Uncertainty inefficiently queries
around the known decision boundary (choosing first the points U) without dis-
covering the new classes above. In contrast, these are the first places queried
by likelihood sampling (L symbols). Evidently, single-criterion approaches are
insufficient. Moreover, multiple criteria may be necessary for a single dataset at
different stages of learning, e.g., likelihood to detect new classes and uncertainty
to learn to classify them. A simple but inefficient approach [16] is to simply iterate
over criteria in fixed proportion. In contrast, our innovation is to adapt criteria
online so as to select the right strategy at each stage of learning, which can dra-
matically increase efficiency. Typically, "exploration" is automatically preferred
while there are easily discoverable classes, and "exploitation" is preferred to refine
decision boundaries once most classes have been discovered. This ultimately results in
better rare class detection performance than single objective, or non-adaptive
methods [16].
[Fig. 1: two illustrative datasets, panels (a) and (b); L and U mark the points queried
first by likelihood and uncertainty sampling, respectively.]
Finally, there is the issue of what base classifier to use in the active learning
algorithm of choice. One can categorize classifiers into two broad categories: gen-
erative and discriminative. Discriminative models directly learn p(y|x) for class
y and data x. Generative models learn p(x, y) and compute p(y|x) via Bayes
rule. The importance of this for active learning is that for a given generative-
discriminative pair (in the sense of equivalent parametric form – such as naive
Bayes & logistic regression), generative classifiers typically perform better with
few training examples, while discriminative models are better asymptotically
[11]. The ideal classifier is therefore likely to be completely different early and
late in the active learning process. An automatic way to select the right classi-
fier online as more labels are obtained is therefore key. Existing active learning
work focuses on single generative [13] or discriminative [17] classifiers. We intro-
duce a novel algorithm to switch classifiers online as the active learning process
progresses in order to get the best of both worlds.
Query Criteria. Perhaps the most commonly applied query criteria are uncer-
tainty sampling and variants [14]. The intuition is that if the current classification
of a point is highly uncertain, it should be informative to label. Uncertainty is
typically quantified by posterior entropy, which for binary classification reduces
to selecting the point whose posterior is closest to p(y|x) = 0.5. The posterior
p(y|x) of every point in U is evaluated and the uncertain points are queried,

p_u(i) \propto \exp\left( -\beta \sum_{y_i} p(y_i|x_i) \log p(y_i|x_i) \right).                (1)
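A sketch of this criterion: given posteriors p(y|x) for each unlabeled point, weight each point by the exponentiated entropy of its posterior and normalise into a sampling distribution. β plays the role of an inverse temperature, and the sign convention follows the reconstruction of Eq. (1) above.

```python
import numpy as np

def uncertainty_query_distribution(posteriors, beta=1.0):
    """posteriors: array of shape (n_points, n_classes) with p(y|x) for the unlabeled pool.
    Returns p_u(i) proportional to exp(beta * posterior entropy of point i)."""
    p = np.clip(np.asarray(posteriors, dtype=float), 1e-12, 1.0)
    entropy = -np.sum(p * np.log(p), axis=1)
    weights = np.exp(beta * entropy)
    return weights / weights.sum()

def sample_query(posteriors, rng=None):
    """Draw the index of the next point to query under the uncertainty criterion."""
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(posteriors), p=uncertainty_query_distribution(posteriors)))
```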
Incremental GMM Estimation. For online GMM learning, we use the incremen-
tal agglomerative algorithm from [15]. To summarize the procedure, for the first
n = 1..N training points observed with the same label y, {x_n, y}_{n=1}^{N}, we incremen-
tally build a model p(x|y) for y using kernel density estimation with Gaussian
kernels N(x_n, Σ) and weights ω_n = 1/n; d is the dimension of the data x:

p(x|y) = \sum_{n=1}^{N} \frac{\omega_n}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - x_n)^T \Sigma^{-1} (x - x_n) \right).                (3)
\omega_{(i+j)} = \omega_i + \omega_j, \qquad
\mu_{(i+j)} = \frac{\omega_i}{\omega_{(i+j)}} \mu_i + \frac{\omega_j}{\omega_{(i+j)}} \mu_j,                (4)

\Sigma_{(i+j)} = \frac{\omega_i}{\omega_{(i+j)}} \left( \Sigma_i + (\mu_i - \mu_{(i+j)})(\mu_i - \mu_{(i+j)})^T \right)
 + \frac{\omega_j}{\omega_{(i+j)}} \left( \Sigma_j + (\mu_j - \mu_{(i+j)})(\mu_j - \mu_{(i+j)})^T \right).                (5)
The components to merge are chosen by the selecting the pair of Gaussian kernels
(Gi , Gj ) whose replacement G(i+j) is most similar, in terms of the Kullback-
Leibler divergence. Specifically, we minimize the cost Cij ,
C_{ij} = \omega_i \, KL(G_i \| G_{(i+j)}) + \omega_j \, KL(G_j \| G_{(i+j)}).                (6)
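A sketch of one agglomerative step: the moment-matching merge of Equations (4)–(5) and the weighted cost of Equation (6), using the closed-form KL divergence between two Gaussians. Function names are ours.

```python
import numpy as np

def merge_gaussians(w_i, mu_i, S_i, w_j, mu_j, S_j):
    """Moment-matching merge of two weighted Gaussian components (Eqs. (4)-(5))."""
    w = w_i + w_j
    mu = (w_i * mu_i + w_j * mu_j) / w
    S = (w_i * (S_i + np.outer(mu_i - mu, mu_i - mu))
         + w_j * (S_j + np.outer(mu_j - mu, mu_j - mu))) / w
    return w, mu, S

def gaussian_kl(mu0, S0, mu1, S1):
    """Closed-form KL(N(mu0, S0) || N(mu1, S1))."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def merge_cost(w_i, mu_i, S_i, w_j, mu_j, S_j):
    """Weighted cost C_ij of replacing components i and j by their merge (Eq. (6))."""
    _, mu, S = merge_gaussians(w_i, mu_i, S_i, w_j, mu_j, S_j)
    return w_i * gaussian_kl(mu_i, S_i, mu, S) + w_j * gaussian_kl(mu_j, S_j, mu, S)
```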
Importantly for iterative active learning online, merging Gaussians and updating
the cost matrix requires constant O(Nmax ) computation every iteration once the
initial cost matrix is built. In contrast, learning a GMM with latent variables
requires multiple expensive O(n) expectation-maximization iterations [12]. The
initial covariance Σ is assumed uniform diagonal, Σ = σ²I, and is estimated a
priori by leave-one-out cross validation on the (large) unlabeled set U :
\hat{\sigma} = \arg\max_{\sigma} \left( \sum_{n \in U} \sum_{x \in U,\, x \neq x_n} \sigma^{-d/2} \exp\left( -\frac{1}{2\sigma^2} (x - x_n)^2 \right) \right).                (7)
Given the learned models p(x|y), we can classify ŷ ← fgmm (x), where
f_{gmm}(x) = \arg\max_y p(y|x), \qquad p(y|x) \propto \sum_i w_i \, N(x; \mu_{i,y}, \Sigma_{i,y}) \, p(y).                (8)
SVM. We use a standard SVM approach with RBF kernels, treating multi-class
classification as a set of 1-vs-1 decisions, for which the decision rule [4] is given
(by an equivalent form to (8)) as
f_{svm}(x) = \arg\max_y \left( \sum_{v_i \in SV_y} \alpha_{ki} \, N(x; v_i) + \alpha_{k0} \right),                (9)
and p(y|x) can be computed based on the binary posterior estimates [18].
Given the generative GMM and discriminative SVM models defined in Sec-
tion 2.2, and their respective likelihood and uncertainty query criteria defined
in Section 2.1, our first concern is how to adaptively combine the query criteria
online for discovery and classification. Our algorithm involves probabilistically
selecting a query criterion Qk according to the weights w (k ∼ Multi(w)) and
then sampling the query point from the distribution i* ∼ pk(i) ((1) or (2))¹.
The weights w will be adapted based on the discovery and classification perfor-
mance φ of our active learner at each iteration. In an active learning context,
[2] shows that because labels are few and biased, cross-validation is a poor way
to assess classification performance, and suggest the unsupervised measure of
binary classification entropy (CE) on the unlabeled set U instead. This is espe-
cially the case in the rare class context where there is often only one example of
a given class, so cross-validation is not well defined. To overcome this problem,
we generalize CE to multi-class entropy (MCE) of the classifier f (x) and take it
as our indication of classification performance,
H = - \sum_{y=1}^{n_y} \frac{\sum_i I(f(x_i) = y)}{|U|} \log_{n_y} \frac{\sum_i I(f(x_i) = y)}{|U|}.                (10)
Here I is the indicator function that returns 1 if its argument is true, and ny
is the number of classes observed so far. Importantly, we explicitly reward the
discovery of new classes to jointly optimize classification and discovery. We define
overall active learning performance φt (i) upon querying point i at time t as,
\phi_t(i) = \alpha I(y_i \notin L) + (1 - \alpha) \left( (e^{H_t} - e^{H_{t-1}}) - (1 - e) \right) / (2e - 2).                (11)
¹ We choose this method because each criterion has very different “reasons” for its
preference. An alternative is querying a product or mean [2,5,3] of the criteria. That
risks querying a merely moderately unlikely and uncertain point – neither outlying
nor on a decision boundary – which is useless for either classification or discovery.
The first term on the right-hand side above rewards the discovery of a new class, and the second term rewards an increase in MCE (as an estimate of classification accuracy) after labeling point i at time t. The constants (1 − e) and (2e − 2) ensure the second term lies between 0 and 1. The parameter α is the mixing prior for discovery vs. classification. Given this performance measure, we define an update for the future weight w_{t+1} of each active criterion k,
w_{t+1,k} \propto \lambda\, w_{t,k} + (1-\lambda)\,\phi_t(i)\,\frac{p_k(i)}{p(i)}. \qquad (12)
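As a rough illustration of (10)–(12), the following sketch (our own code and naming, not the authors') computes the multi-class entropy on U, the joint reward φ_t, and the smoothed criterion-weight update; the exact form of the importance term in (12) follows the reconstruction given above and should be treated as an assumption.

```python
import numpy as np

def multiclass_entropy(preds, n_y):
    """MCE of Eq. (10): base-n_y entropy of predicted-class proportions on U.
    `preds` holds predicted labels in {1, ..., n_y}."""
    if n_y < 2:
        return 0.0
    preds = np.asarray(preds)
    H = 0.0
    for y in range(1, n_y + 1):
        p = np.mean(preds == y)
        if p > 0:
            H -= p * np.log(p) / np.log(n_y)
    return H

def reward(new_class_found, H_t, H_prev, alpha=0.5):
    """Joint reward of Eq. (11): discovery indicator plus normalised MCE change."""
    mce_term = ((np.exp(H_t) - np.exp(H_prev)) - (1 - np.e)) / (2 * np.e - 2)
    return alpha * float(new_class_found) + (1 - alpha) * mce_term

def update_weight(w_k, phi, p_k_i, p_i, lam=0.9):
    """Smoothed, importance-weighted criterion update in the spirit of Eq. (12)."""
    return lam * w_k + (1 - lam) * phi * (p_k_i / p_i)
```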
[Algorithm box, testing phase. Input: testing samples U*, selected classifier c.]
3 Experiments
Evaluation Procedure. We tested our method on 7 rare class datasets from the
UCI repository [1] and on the CASIA gait dataset [20], for which we addressed
the image viewpoint recognition problem. We unbalanced the CASIA dataset by
sampling training classes in geometric proportion. In each case we labeled one
point from the largest class and the goal was to discover and learn to classify
the remaining classes. Table 1 summarizes the properties of each dataset. Per-
formance was evaluated at each iteration by: i) the number of distinct classes
discovered and ii) the average classification accuracy over all classes. This accu-
racy measure weights the ability to classify rare classes equally with the majority
class despite the fewer rare class points. Moreover, it means that undiscovered
rare classes automatically penalize accuracy. Accuracy was evaluated by 2-fold
cross-validation, averaged over 25 runs from random initial conditions.
Thyroid. (Figure 2(b)). Our GSsw/GSadapt model (red) is the best overall
classifier: it matches the initially superior performance of the G/G likelihood-
based model (green), but later achieves the asymptotic performance of the SVM
classifier based models. This is because of our classifier switching innovation (Sec-
tion 2.4). Figure 2(d) illustrates switching via the average (training) classifica-
tion entropy and (testing) accuracy of the classifiers composing GSsw/GSadapt.
The GMM classifier entropy (black dots) is higher than the SVM entropy (blue
dots) for the first 25 iterations. This is approximately the period over which the
GMM classifier (black line) has better performance than the SVM classifier (blue
line), so switching classifier on entropy allows the pair (green dashes) to always
perform as well as the best individual classifier for each iteration.
[Figure 2 panels: classes discovered and average accuracy vs. labeled points for S/R, S/S, G/G, S/GSmix, S/GSonline, S/GSadapt and GSsw/GSadapt; panel (c) shows the adaptive active learning criteria weights, panel (d) the entropy-based classifier switching (GMM/SVM entropy and accuracy), and panels (e)-(g) the Glass, Pageblocks and Gait results.]
Fig. 2. (a) Shuttle and (b) Thyroid dataset performance. (c) Shuttle criteria adapta-
tion, (d) Thyroid entropy based classifier switching. (e) Glass, (f) Pageblocks and (g)
Gait view dataset performance.
throughout. Gait view (Figure 2(g)). The majority class contains outliers, so the likelihood criterion is unusually weak at discovery. Additionally, for this dataset SVM performance is generally poor, especially in early iterations. GSsw/GSadapt adapts impressively to this dataset in two ways enabled by our contributions: exploiting the uncertainty sampling criterion extensively and switching to predicting with the GMM classifier.
In summary, the G/G method (likelihood criterion) was usually the most efficient at discovering classes, as expected. However, it was usually asymptotically weaker at classifying new instances, because generative model mis-specification tends to cost more as the amount of data increases [11]. S/S (uncertainty criterion) was generally poor at discovery (and hence classification). Alternating between likelihood and uncertainty sampling, S/GSmix (corresponding to [16]) did a fair job of both discovery and classification, but under-performed our adaptive models due to its inflexibility. S/GSonline (corresponding to [2]) was better than random or S/S, but was not the quickest learner. Our first model, S/GSadapt, which solely adapted the multiple active query criteria, was competitive at discovery but sometimes not the best at classification in the early phases with very little data, due to exclusively using the discriminative SVM classifier. Finally, by exploiting generative-discriminative classifier switching, our complete GSsw/GSadapt model was generally the best classifier over all stages of learning. Table 2 quantitatively summarizes the performance of the most competitive models on all datasets in terms of area under the classification curve.
4 Conclusion
Summary. We highlighted active classifier learning with a priori unknown rare
classes as an under-studied but broadly relevant and important problem. To
solve joint rare class discovery and classification, we proposed a new framework
to adapt both active query criteria and classifier. To adaptively switch gen-
erative and discriminative classifiers online we introduced MCE; and to adapt
query criteria we exploited a joint reward signal of new class discovery and MCE.
In adapting to each dataset and online as data is obtained, our model signifi-
cantly outperformed contemporary alternatives on eight standard datasets. Our
approach will be of great practical value for many problems.
The only other work of which we are aware that addresses both discovery and classification is [16]. It uses a fixed classifier and non-adaptively iterates between discovery and uncertainty criteria (corresponding to our S/GSmix condition). In contrast, our results have shown that our switching classifier and adaptive query criteria provide a compelling benefit for discovery and classification.
Future Work. There are various interesting questions for future research, including how to create tighter coupling between the generative and discriminative components [4], and how to generalize our ideas to stream-based active learning, which is a more natural setting for some practical problems.
References
1. Asuncion, A., Newman, D.: UCI machine learning repository (2007),
http://www.ics.uci.edu/ml/
2. Baram, Y., El-Yaniv, R., Luz, K.: Online choice of active learning algorithms.
Journal of Machine Learning Research 5, 255–291 (2004)
3. Cebron, N., Berthold, M.R.: Active learning for object classification: from explo-
ration to exploitation. Data Min. Knowl. Discov. 18(2), 283–299 (2009)
4. Deselaers, T., Heigold, G., Ney, H.: SVMs, gaussian mixtures, and their genera-
tive/discriminative fusion. In: ICPR (2008)
5. Donmez, P., Carbonell, J.G., Bennett, P.N.: Dual strategy active learning. In: Kok,
J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron,
A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 116–127. Springer, Heidelberg
(2007)
6. Ertekin, S., Huang, J., Bottou, L., Giles, L.: Learning on the border: active learning
in imbalanced data classification. In: CIKM (2007)
7. Goldberger, J., Roweis, S.: Hierarchical clustering of a mixture model. In: NIPS
(2004)
8. He, H., Garcia, E.: Learning from imbalanced data. IEEE Transactions on Knowl-
edge and Data Engineering 21(9), 1263–1284 (2009)
9. He, J., Carbonell, J.: Nearest-neighbor-based active learning for rare category de-
tection. In: NIPS (2007)
10. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On combining classifiers. IEEE
Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998)
11. Ng, A., Jordan, M.: On discriminative vs. generative classifiers: A comparison of
logistic regression and naive bayes. In: NIPS (2001)
12. Pelleg, D., Moore, A.: Active learning for anomaly and rare-category detection. In:
NIPS (2004)
13. Roy, N., McCallum, A.: Toward optimal active learning through sampling estima-
tion of error reduction. In: ICML, pp. 441–448 (2001)
14. Settles, B.: Active learning literature survey. Tech. Rep. 1648, University of Wisconsin–Madison (2009)
15. Sillito, R., Fisher, R.: Incremental one-class learning with bounded computational
complexity. In: ICANN (2007)
16. Stokes, J.W., Platt, J.C., Kravis, J., Shilman, M.: Aladin: Active learning of anoma-
lies to detect intrusions. Tech. Rep. 2008-24, MSR (2008)
17. Tong, S., Koller, D.: Support vector machine active learning with applications to
text classification. In: ICML (2000)
18. Wu, T.F., Lin, C.J., Weng, R.C.: Probability estimates for multi-class classification
by pairwise coupling. Journal of Machine Learning Research 5, 975–1005 (2004)
19. Xiang, T., Gong, S.: Video behavior profiling for anomaly detection. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 30(5), 893–908 (2008)
20. Yu, S., Tan, D., Tan, T.: A framework for evaluating the effect of view angle,
clothing and carrying condition on gait recognition. In: ICPR (2006)
Margin-Based Over-Sampling Method for
Learning from Imbalanced Datasets
1 Introduction
Learning from imbalanced datasets has received increasing attention in recent years. A dataset is imbalanced if its class distribution is skewed. The class imbalance problem is of crucial importance since it is encountered in a large number of real-world applications, such as fraud detection [1], the detection of oil spills in satellite radar images [2], and text classification [3]. In these scenarios, we are usually more interested in the minority class than in the majority class. Traditional data mining algorithms perform poorly because they give equal attention to the minority class and the majority class.
One way to solve the imbalanced learning problem is to develop "imbalanced data oriented algorithms" that perform well on imbalanced datasets. For example, Wu et al. proposed a class boundary alignment algorithm which modifies the class boundary by changing the kernel function of SVMs [4]. Ensemble methods have been used to improve performance on imbalanced datasets [5]. In 2010, Liu et al. proposed the Class Confidence Proportion Decision Tree (CCPDT) [6]. Furthermore, there are other effective methods such as cost-based learning [7] and one-class learning [8].
2 Related Works
We use A to denote a dataset of n instances A = {a_1, ..., a_n}, where a_i is a real-valued vector of dimension m. Let A_P ⊂ A denote the minority class instances and A_N ⊂ A denote the majority class instances.
Over-sampling techniques augment the minority class to balance the numbers of majority and minority class instances. The simplest over-sampling method is random over-sampling (ROS). However, it may make the decision regions of the majority smaller and more specific, and can thus cause the learner to over-fit [16].
Chawla et al. over-sampled the minority class with their SMOTE method, which generates new synthetic instances along the line between minority instances and their selected nearest neighbors [11]. Specifically, for the subset A_P, they consider the k-nearest neighbors of each instance a_i ∈ A_P. For some specified integer k, the k-nearest neighbors are defined as the k elements of A_P whose Euclidean distance to the element a_i under consideration is smallest. To create a synthetic instance, one of the k-nearest neighbors, a_nn, is randomly
selected; the feature vector difference (a_nn − a_i) is multiplied by a random number δ ∈ [0, 1] and added to a_i. Taking a two-dimensional problem as an example:

a_new = a_i + (a_nn − a_i) × δ
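A minimal sketch of this interpolation step (our own illustration, not the original SMOTE code; the choice of k and the random generator are ours):

```python
import numpy as np

def smote_sample(a_i, minority, k=5, rng=None):
    """Create one synthetic minority instance: a_new = a_i + (a_nn - a_i) * delta.
    `minority` is the array of minority instances and is assumed to contain a_i."""
    rng = rng or np.random.default_rng()
    dists = np.linalg.norm(minority - a_i, axis=1)
    nn_idx = np.argsort(dists)[1:k + 1]        # skip index 0, which is a_i itself
    a_nn = minority[rng.choice(nn_idx)]        # pick one of the k neighbours at random
    delta = rng.random()                       # random number in [0, 1)
    return a_i + (a_nn - a_i) * delta
```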
Fig. 1. Two types of margins in terms of the Nearest Neighbor Rule. The toy problem
involves class A and class B. Margins of a new instance (the blue circle), which belongs
to class A, are shown. The sample margin 1(left) is the distance between the new
instance and the decision boundary (the Voronoi tessellation). The hypothesis margin
1(right) is the largest distance the sample points can travel without altering the label
of the new instance. In this case it is half the difference between the distance to the
nearest miss and the distance to the nearest hit.
If we draw a sphere with radius R around each prototype, any change of the location of the prototypes inside their spheres will not change the assigned labels. Therefore, the hypothesis margin measures the stability to small changes in the prototype locations. See Figure 1 for an illustration.
Throughout this paper we focus on the margins for the Nearest Neighbor rule (NN). For this special case, the following results have been proved [14]:
1. The hypothesis margin lower bounds the sample margin.
2. The hypothesis margin of an instance x with respect to a set of instances A is easy to compute by the following formula:

\theta_A(x) = \frac{1}{2}\left(\|x - \mathrm{nearestmiss}_A(x)\| - \|x - \mathrm{nearesthit}_A(x)\|\right) \qquad (1)
where nearesthitA (x) and nearestmissA (x) denote the nearest instance to x in
dataset A with the same and different label, respectively.
In the case of NN, the hypothesis margin is thus easy to calculate, and a set of prototypes with a large hypothesis margin also has a large sample margin [14].
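A direct translation of Eq. (1) into code (a sketch under our own naming, not the paper's implementation):

```python
import numpy as np

def hypothesis_margin(x, y, X, Y):
    """theta_A(x) of Eq. (1): half of (nearest-miss distance - nearest-hit distance).
    X, Y are the instances and labels of the reference set A; y is the label of x."""
    dists = np.linalg.norm(X - x, axis=1)
    same = (Y == y) & (dists > 0)      # same label, excluding x itself if present
    diff = (Y != y)
    nearest_hit = dists[same].min()
    nearest_miss = dists[diff].min()
    return 0.5 * (nearest_miss - nearest_hit)
```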
Now we consider the over-sampling problem using the large margin principle. When adding a new minority class instance x, we consider the change in the overall margins of the minority class:

\Delta_P(x) = \sum_{a \in A_P}\left(\theta_{A\setminus a \cup \{x\}}(a) - \theta_{A\setminus a}(a)\right) \qquad (2)

where A\a denotes the dataset A excluding a, and A\a ∪ {x} denotes the union of A\a and {x}.
Alternatively, one may also minimize the margin loss for the majority class, which is

f_2(x) = -\Delta_N(x) \qquad (5)

One intuitive method is to seek a good balance between maximizing the margin gain for the minority class and minimizing the margin loss for the majority class. This can be done by minimizing Eq. (6):

f_3(x) = \frac{-\Delta_N(x)}{\Delta_P(x) + \varepsilon}, \quad \varepsilon > 0 \qquad (6)
Algorithm 1. MSYN
Input: Training set X with n instances (a_i, y_i), i = 1,...,n, where a_i is an instance in the m-dimensional feature space and y_i ∈ Y = {1, −1} is the class label associated with a_i. Define m_P and m_N as the numbers of minority and majority class instances, respectively (so m_P < m_N). BIN is the set of synthetic instances, initialized as empty.
Parameter: Pressure.
1  Calculate the number of synthetic instances to be generated for the minority class: G = (m_N − m_P) × Pressure;
2  Calculate the number of synthetic instances to be generated for each minority example a_i: g_i = G / m_P;
3  for each minority class instance a_i do
4      for j ← 1 to g_i do
5          Randomly choose one minority instance a_zi from the k nearest neighbors of a_i;
6          Generate the synthetic instance a_s using the SMOTE technique;
7          Add a_s to BIN;
8  Sort the synthetic instances in BIN according to their values of Eq. (6);
9  return the (m_N − m_P) instances that have the minimum (m_N − m_P) values of Eq. (6).
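Putting Eqs. (2), (5), (6) and Algorithm 1 together, the selection step can be sketched as follows. This is our own simplified illustration, not the authors' implementation: it reuses the hypothesis_margin function sketched after Eq. (1), takes pre-generated SMOTE candidates as input, and assumes minority/majority labels +1/−1 as in Algorithm 1.

```python
import numpy as np

def margin_change(x, X, Y, target_label):
    """Sum, over instances of the target class, of the margin change when the
    candidate minority point x is added to the dataset (Eq. (2) for target_label=+1)."""
    delta = 0.0
    for idx in np.where(Y == target_label)[0]:
        a, ya = X[idx], Y[idx]
        rest_X = np.delete(X, idx, axis=0)
        rest_Y = np.delete(Y, idx)
        before = hypothesis_margin(a, ya, rest_X, rest_Y)
        after = hypothesis_margin(a, ya, np.vstack([rest_X, x]),
                                  np.append(rest_Y, +1))   # x carries the minority label
        delta += after - before
    return delta

def msyn_select(X, Y, candidates, eps=1e-6):
    """Rank SMOTE candidates by Eq. (6) and keep the (m_N - m_P) best ones."""
    m_P = np.sum(Y == +1)
    m_N = np.sum(Y == -1)
    scores = []
    for x in candidates:
        dP = margin_change(x, X, Y, target_label=+1)   # minority margin gain, Eq. (2)
        dN = margin_change(x, X, Y, target_label=-1)   # majority margin change
        scores.append(-dN / (dP + eps))                # Eq. (6)
    order = np.argsort(scores)
    return [candidates[i] for i in order[:m_N - m_P]]
```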
5 Experiment Study
[Figure: a two-dimensional toy dataset (axes Feature 1 and Feature 2) showing the minority class, the majority class and the true boundary, followed by the four panels referred to in Fig. 3 (SMOTE, MSYN, Borderline-SMOTE, ADASYN).]
Fig. 3. The synthetic instances and the corresponding C4.5 decision boundary after
processing by SMOTE, MSYN, Borderline-SMOTE, ADASYN, respectively.
We test the algorithms on ten datasets from the UCI Machine Learning Repository [21]. Information about these datasets is summarized in Table 1, where num is the size of the dataset, attr is the number of features, and min% is the ratio of the number of minority class instances to num.
Instead of using the overall classification accuracy, we adopt metrics related to the Receiver Operating Characteristic (ROC) curve [22] to evaluate the compared algorithms, because the overall classification accuracy may not provide a comprehensive assessment of the learning algorithms on class-imbalanced datasets [3]. Specifically, we use the AUC [22] and the F-Measure [23] to evaluate performance. We apply the Wilcoxon signed rank test with a 95% confidence level on each dataset to check whether the difference between the compared algorithms is statistically significant.
Table 2 and Table 3 show the AUC and F-Measure for the datasets, respec-
tively. The results of Table 2 reveal that MSYN wins against SMOTE on nine
out of ten datasets, beats ADASYN on seven out of ten datasets, outperforms
Table 2. Results in terms of AUC for the experiments performed on the real datasets. For SMOTE, ADASYN, ROS and Borderline-SMOTE, if a value is underlined, MSYN has better performance than that method; if a value is starred, MSYN exhibits lower performance than that method; if a value is in normal style, the corresponding method does not perform significantly differently from MSYN according to the Wilcoxon signed rank test. The row W/D/L Sig. shows the number of wins, draws and losses of MSYN from the statistical point of view.
ROS on nine out of ten datasets, and wins against Borderline-SMOTE on six out
of ten datasets. The results of Table 3 show that MSYN wins against SMOTE on
seven out of ten datasets, beats ADASYN on six out of ten datasets, beats ROS
on six out of ten datasets, and wins against Borderline-SMOTE on six out of ten
datasets. The comparisons reveal that MSYN outperforms the other methods in
terms of both AUC and F-measure.
References
1. Chan, P.K., Stolfo, S.J.: Toward scalable learning with non-uniform class and cost
distributions: a case study in credit card fraud detection. In: Proceedings of the
Fourth International Conference on Knowledge Discovery and Data Mining, pp.
164–168 (2001)
2. Kubat, M., Holte, R.C., Matwin, S.: Machine Learning for the Detection of Oil
Spills in Satellite Radar Images. Machine Learning 30(2), 195–215 (1998)
3. Weiss, G.M.: Mining with Rarity: A Unifying Framework. SIGKDD Explorations 6(1), 7–19 (2004)
4. Wu, G., Chang, E.Y.: Class-Boundary Alignment for Imbalanced Dataset Learning.
In: Workshop on Learning from Imbalanced Datasets II, ICML, Washington DC
(2003)
5. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.W.: Smoteboost: Improving
Prediction of the Minority Class in Boosting. In: Lavrač, N., Gamberger, D., Todor-
ovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 107–119.
Springer, Heidelberg (2003)
6. Liu, W., Chawla, S., Cieslak, D.A., Chawla, N.V.: A Robust Decision Tree Algo-
rithm for Imbalanced Data Sets. In: SIAM International Conf. on Data Mining
(2010)
7. Zhou, Z.H., Liu, X.Y.: Training cost-sensitive neural networks with methods ad-
dressing the class imbalance problem. IEEE Transactions on Knowledge and Data
Engineering, 63–77 (2006)
8. Raskutti, B., Kowalczyk, A.: Extreme re-balancing for SVMs: a case study.
SIGKDD Explorations 6(1), 60–69 (2004)
9. Japkowicz, N.: The Class Imbalance Problem: Significance and Strategies. In: Pro-
ceeding of the 2000 International Conf. on Artificial Intelligence (ICAI 2000): Spe-
cial Track on Inductive Learning, Las Vegas, Nevada (2000)
10. Ling, C., Li, C.: Data Mining for Direct Marketing Problems and Solutions. In:
Proceeding of the Fourth International Conf. on Knowledge Discovery and Data
Mining, KDD 1998, New York, NY (1998)
11. Chawla, N.V., Hall, L.O., Bowyer, K.W., Kegelmeyer, W.P.: SMOTE: Synthetic
Minority Oversampling Technique. Journal of Artificial Intelligence Research 16,
321–357 (2002)
12. Han, H., Wang, W.Y., Mao, B.H.: Borderline-SMOTE: A New Over-Sampling
Method in Imbalanced Data Sets Learning. Advances in Intelligent Computing,
878–887 (2005)
13. He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: Adaptive Synthetic Sampling
Approach for Imbalanced Learning. In: Proceeding of International Conf. Neural
Networks, pp. 1322–1328 (2008)
14. Crammer, K., Gilad-Bachrach, R., Navot, A., Tishby, N.: Margin analysis of the
LVQ algorithm. Advances in Neural Information Processing Systems, 479–486
(2003)
15. Gilad-Bachrach, R., Navot, A., Tishby, N.: Margin based feature selection-theory
and algorithms. In: Proceeding of the Twenty-First International Conference on
Machine Learning (2004)
16. He, H., Garcia, E.A.: Learning from Imbalance Data. IEEE Transaction on Knowl-
edge and Data Engineering 21(9), 1263–1284 (2009)
17. Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1),
119–139 (1997)
18. Bowyer, A.: Computing dirichlet tessellations. The Computer Journal 24(2) (1981)
19. Witten, I.H., Frank, E.: Data mining: practical machine learning tools and tech-
niques with Java implementations. ACM SIGMOD Record 31(1), 76–77 (2002)
20. UCL machine learning group, http://www.dice.ucl.ac.be/mlg/?page=Elena
21. Asuncion, A., Newman, D.: UCI machine learning repository (2007)
22. Bradley, A.: The use of the area under the ROC curve in the evaluation of machine
learning algorithms. Pattern Recognition 30(7), 1145–1159 (1997)
23. Van Rijsbergen, C.J.: Information Retrieval. Butterworths, London (1979)
24. Wang, B.X., Japkowicz, N.: Imbalanced Data Set Learning with Synthetic Samples.
In: Proc. IRIS Machine Learning Workshop (2004)
25. Dietterich, T.G.: Ensemble methods in machine learning. In: Kittler, J., Roli, F.
(eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000)
26. Guo, H., Viktor, H.L.: Learning from Imbalanced Data Sets with Boosting and
Data Generation: the DataBoost-IM Approach. SIGKDD Explorations: Special
issue on Learning from Imbalanced Datasets 6(1), 30–39 (2004)
27. Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory undersampling for class-imbalance
learning. IEEE Transactions on Systems, Man and Cybernetics - Part B: Cyber-
netics 39(2), 539–550 (2009)
28. Cohen, W.: Fast Effective Rule Induction. In: Proceeding of 12th International
Conf. on Machine Learning, Lake Tahoe, CA, pp. 115–123. Morgan Kaufmann,
San Francisco (1995)
Improving k Nearest Neighbor with Exemplar
Generalization for Imbalanced Classification
1 Introduction
induction may stop at a node where the class of the node is decided by the majority of instances under it, and instances of the minority class are ignored.
In contrast to most concept learning systems, k nearest neighbor (kNN) classification [6,1,2], or instance-based learning, does not formulate a generalized conceptual model from the training instances at the training stage. Rather, at the classification stage, a simple and intuitive rule is used to make decisions: instances close in the input space are likely to belong to the same class. Typically a kNN classifier classifies a query instance to the class that appears most frequently among its k nearest neighbors. k is a parameter for tuning the classification performance and is typically set to between three and seven.
Although instance-based learning has been advocated for imbalanced learning [10,19,3], to the best of our knowledge, a large-scale study of applying kNN classification to imbalanced learning has not been reported in the literature. Most research efforts in this area have focused on improving its classification efficiency [1,2,21]. Various strategies have been proposed to avoid an exhaustive search over all training instances and to achieve accurate classification.
In the presence of class imbalance, kNN classification also faces challenges in correctly detecting the positive instances. For a query instance whose neighborhood is overwhelmed by negative instances, positive instances are likely to be ignored in the decision process. Our main idea to mitigate these decision errors is to introduce a training stage that generalizes positive instances from a point to a Gaussian ball in the instance space. Rather than generalizing every positive instance, which may introduce false positives, we propose an algorithm to identify exemplar positive instances, called pivot positive instances (c.f. Section 3), and use them to reliably derive the positive class boundary.
Experiments on 12 real-world imbalanced datasets show that our classifier,
k Exemplar-based Nearest Neighbor (kENN), is effective and significantly im-
proves the performance of kNN for imbalanced learning. kENN also outperforms
the current re-sampling and cost-sensitive learning strategies, namely SMOTE [5]
and MetaCost [7], for imbalanced classification.
[Fig. 1: an artificial two-class problem with positive (+) and negative (−) instances, the three positive subconcepts P1, P2 and P3, a negative region N1, four positive query instances (*), true class boundaries (solid lines) and learned decision boundaries (dashed lines).]
under-sample the majority class, while Ling and Li [13] combined over-sampling of the minority class with under-sampling of the majority class. In particular, Chawla and Bowyer [5] proposed the Synthetic Minority Over-sampling TEchnique (SMOTE) to over-sample the minority class by creating synthetic samples. It was shown that SMOTE over-sampling of the minority class, in combination with under-sampling of the majority class, can often achieve effective imbalanced learning.
Another popular strategy for tackling the imbalanced distribution problem is cost-sensitive learning [8]. Domingos [7] proposed a re-costing method called MetaCost, which can be applied to general classifiers and makes error-based classifiers cost-sensitive. His experimental results showed that MetaCost reduced costs compared to the cost-blind classifier, using C4.5Rules as the baseline.
Our experiments (c.f. Section 5) show that SMOTE in combination with under-sampling of the majority class, as well as MetaCost, significantly improves the performance of C4.5 for imbalanced learning. However, these strategies do not statistically significantly improve the performance of kNN under class imbalance. This may be partly explained by the fact that kNN makes classification decisions by examining the local neighborhood of query instances, where global re-sampling and cost-adjustment strategies may not have a pronounced effect.
2 Main Ideas
Fig. 1 shows an artificial two-class imbalance problem, where positive instances
are denoted as “+” and negative instances are denoted as “-”. True class bound-
aries are represented as solid lines while the decision boundaries by some classi-
fication model are represented as dashed lines. Four query instances that indeed
belong to the positive class are represented as stars (*). Three subconcepts as-
sociated with the positive class are the three regions formed by the solid lines,
denoted as P1, P2 and P3 respectively. Subconcept P1 covers a large portion of
instances in the positive instance space whereas P2 and P3 correspond to small
[Fig. 2: (a) Standard 1NN and (b) Exemplar 1NN decision boundaries (Voronoi diagrams) in the subspace around subconcept P3, with positive (+), negative (−) and query (*) instances.]
disjuncts of positive instances. Note that the lack of data for subconcepts P2 and P3 causes the classification model to learn inappropriate decision boundaries for them. As a result, two query instances (denoted by *) that are indeed positive as defined by P2 fall outside the positive decision boundary of the classifier, and similarly for another query instance defined as positive by P3.
Given the problem in Fig. 1, we illustrate the challenge faced by a standard kNN classifier using the subspace of instances at the lower right corner. Figure 2(a) shows the Voronoi diagram for subconcept P3 in this subspace, where the positive class boundary decided by standard 1NN is represented as the polygon in bold. The 1NN induction strategy, where the class of an instance is decided by the class of its nearest neighbor, results in a class boundary much smaller than the true class boundary (circle). As a result, the query instance (denoted by *), which is in fact a positive instance inside the true positive boundary, is predicted as negative by standard 1NN. Obviously, to achieve more accurate prediction, the decision boundary for the positive class should be expanded so that it is closer to the true class boundary.
A naive approach to expanding the decision boundary for the positive class is to generalize every positive instance in the training instance space from a point to a Gaussian ball. However, this aggressive approach to expanding the positive boundary will almost certainly introduce false positives. We need a strategy that selectively expands some positive points in the training instance space so that the decision boundary closely approximates the real class boundary while not introducing too many false positives.
Our main idea of expanding the decision boundary for the positive class while
minimizing false positives is based on exemplar positive instances. Exemplar
positive instances should be the positive instances that can be generalized to
reliably classify more positive instances in independent tests. Intuitively these
instances should include the strong positive instances at or close to the center
of a disjunct of positive instances in the training instance space. Weak positive
instances close to the class boundaries should be excluded.
Fig. 2(b) shows the Voronoi diagram after the three positive instances at the
center of the disjunct of positive instances have been used to expand the bound-
ary for the positive class. Obviously the decision boundary after adjustment is
much closer to the real class boundary. As a result, the query instance (repre-
sented by *) is now enclosed by the boundary decided by the classifier and is
correctly predicted as positive.
The Gaussian ball for a positive instance always contains two positive instances: the reference positive instance and its nearest positive neighbor. The confidence level is a parameter tuning the performance of PPIs. A high confidence level means the estimated false positive error rate is close to the observed false negative ratio, and thus few false positives are tolerated in identifying PPIs. On very imbalanced data we need to tolerate a larger number of false positives, so as to aggressively identify PPIs and achieve high sensitivity for the positives. Our experiments (Section 5.3) confirm this hypothesis. The default confidence level is set to 10%.
We set the FP rate threshold for identifying PPIs based on the imbalance level of the training data. The threshold for PPIs is dynamically determined by the prior negative class frequency. If the false positive rate for a positive instance estimated using Equation (1) is not greater than the threshold estimated from the prior negative class frequency, the positive instance is a PPI. Under this setting, a relatively larger number of FP errors are allowed in Gaussian balls for imbalanced data, while fewer errors are allowed for balanced data. In particular, on very balanced data the PPI mechanism is turned off and kENN reverts to standard kNN. For example, on a balanced dataset of 50 positive instances and 50 negative instances, at a confidence level of 10%, the FP rate threshold for PPIs is 56.8% (estimated from the 50% negative class frequency using Equation (1)). A Gaussian ball without any observed FP errors (and containing 2 positive instances only) has an estimated FP rate of 68.4%². As a result no PPIs are identified at the training stage and standard kNN classification will be applied.
² Following standard statistics, when there are no observed errors, for N instances at confidence level c the estimated error rate is 1 − c^{1/N} (i.e., 1 − \sqrt[N]{c}).
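The footnoted estimate can be checked numerically; a small sketch (our own code) reproduces the 68.4% figure quoted above for N = 2 instances at a 10% confidence level.

```python
def pessimistic_error_rate(N, confidence=0.10):
    """Estimated error rate 1 - c**(1/N) when no errors are observed among N instances."""
    return 1.0 - confidence ** (1.0 / N)

print(round(pessimistic_error_rate(2, 0.10), 3))   # prints 0.684, i.e. 68.4%
```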
where distance(t, x) is the distance between t and x using some metric of stan-
dard kNN. With the above equation, the distance between a query instance and
a PPI in the training instance space is reduced by the radius of the PPI. As a re-
sult the adjusted distance is conceptually equivalent to the distance of the query
instance to the edge of the Gaussian ball centered at the PPI. The adjusted
distance function as defined in Equation (2) can be used in kNN classification
in the presence of class imbalance, and we call the classifier k Exemplar-based
Nearest Neighbor (kENN).
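From this description, the adjusted distance can be sketched as follows. This is our reconstruction based on the prose around Eq. (2), not the authors' code; r_x is the radius of the Gaussian ball of a PPI and is zero for ordinary training instances.

```python
import numpy as np

def adjusted_distance(t, x, r_x=0.0):
    """Distance from query t to training instance x, reduced by the radius r_x of
    x's Gaussian ball (r_x = 0 for non-PPI training instances)."""
    return np.linalg.norm(t - x) - r_x
```

Conceptually, this equals the distance from the query instance to the edge of the Gaussian ball centered at the PPI, which is exactly how the text characterizes Eq. (2); plugging this distance into the usual k-nearest-neighbor search yields the kENN classifier.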
5 Experiments
Dataset size #attr (num, symb) classes (pos, neg) minority (%)
Oil 937 47(47, 0) (true, false) 4.38%
Hypo-thyroid 3163 25 (7, 18) (true, false) 4.77%
PC1 1109 21 (21,0) (true, false) 6.94%
Glass 214 9 (9,0) (3, other) 7.94%
Satimage 6435 36 (36,0) (4, other) 9.73%
CM1 498 21 (21,0) (true, false) 9.84%
New-thyroid 215 5 (5,0) (3, other) 13.95%
KC1 2109 21 (21,0) (true, false) 15.46%
SPECT F 267 44 (44,0) (0, 1) 20.60%
Hepatitis 155 19 (6,13) (1, 2) 20.65%
Vehicle 846 18 (18,0) (van, other) 23.52%
German 1000 20 (7,13) (2, 1) 30.00%
learning, using kNN (IBk in WEKA) and C4.5 (J48 in WEKA) as the base classifiers. All classifiers were developed based on the WEKA data mining toolkit [22], and are available at http://www.cs.rmit.edu.au/∼zhang/ENN. For both kNN and kENN, k was set to 3 by default, and the confidence level of kENN was set to 10%. To increase the sensitivity of C4.5 to the minority class, C4.5 was run with the -M1 option, so that a minimum of one instance was allowed for a leaf node, and without pruning. SMOTE oversampling combined with undersampling was applied to 3NN and C4.5, denoted as 3NNSmt+ and C4.5Smt+ respectively. SpreadSubsample was used to undersample the majority class to a uniform distribution (M=1.0), and SMOTE was then applied to generate three times more instances for the minority class. MetaCost was used for cost-sensitive learning with 3NN and C4.5 (denoted as 3NNMeta and C4.5Meta), and the cost of each class was set to the inverse of the class ratio.
Table 1 summarizes the 12 real-world imbalanced datasets from various do-
mains used in our experiments, from highly imbalanced (the minority 4.35%) to
moderately imbalanced (the minority 30.00%). The Oil dataset was provided by
Robert Holte [11], and the task is to detect the oil spill (4.3%) from satellite
images. The CM1, KC1 and PC1 datasets were obtained from the NASA IV&V
Facility Metrics Data Program (MDP) repository (http://mdp.ivv.nasa.gov/
index.html). The task is to predict software defects (around 10% on average)
in software modules. The remaining datasets were compiled from the UCI Ma-
chine Learning Repository (http://archive.ics.uci.edu/ml). In addition to
the natural 2-class domains, like thyroid diseases diagnoses and Hepatitis, we
also constructed four imbalanced datasets by choosing one class as the positive
and the remaining classes combined as the negative.
The Receiver Operating Characteristic (ROC) curve [18] is becoming widely used to evaluate imbalanced classification. Given a confusion matrix of four types of decisions, True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN), ROC curves depict the tradeoff between TP rate = TP/(TP + FN) and FP rate = FP/(FP + TN). Good classifiers can achieve a high TP rate at a low
Table 2. The AUC for kENN, in comparison with other systems. The best result for
each dataset is in bold. AUCs with difference <0.005 are considered equivalent.
FP rate. Area Under the ROC Curve (AUC) measures the overall classification
performance [4], and a perfect classifier has an AUC of 1.0. All results reported
next were obtained from 10-fold cross validation and two-tailed paired t-tests at
95% confidence level were used to test statistical significance.
The ROC convex hull method provides visual performance analysis of classification algorithms at different levels of sensitivity [15,16]. In ROC space, each point of the ROC curve of a classification algorithm corresponds to a classifier. If a point falls on the convex hull of all ROC curves, the corresponding classifier is potentially an optimal classifier; otherwise the classifier is not optimal. Given a classification algorithm, the higher the fraction of its ROC curve points that lie on the convex hull, the greater the chance that the algorithm produces optimal classifiers.
For all results reported next, data points for the ROC curves were generated using the ThresholdCurve module of WEKA; they correspond to the numbers of TPs and FPs that result from setting various thresholds on the probability of the positive class. The AUC values for the ROC curves were obtained using the Mann-Whitney statistic in WEKA. The convex hull of the ROC curves was computed using the ROCCH package³.
[Fig. 3 panels: (a) New-thyroid, (b) German; ROC curves for 3ENN, C4.5Smt+, 3NNSmt+ and 3NNMeta, together with their convex hull.]
Fig. 3. ROC curves with convex hull on two datasets. The x-axis is the FP rate and
the y-axis is the TP rate. Points on the convex hull are highlighted with a large circle.
[Figure: AUC as a function of the PPI confidence level (%) for the Oil, Glass, KC1 and German datasets.]
ROC curves of the four models demonstrate similar trends on German, as shown
in Fig. 3(b). Still at low FP rates, more points from 3ENN lie on the ROC convex
hull, which again shows that 3ENN is a strong model.
6 Conclusions
With kNN classification, the class of a query instance is decided by the majority
class of its k nearest neighbors. In the presence of class imbalance, a query in-
stance is often classified as belonging to the majority class and as a result many
positive (minority class) instances are misclassified. In this paper, we have pro-
posed a training stage where exemplar positive training instances are identified
and generalized into Gaussian balls as concepts for the minority class. When
classifying a query instance using its k nearest neighbors, the positive concepts
formulated at the training stage ensure that classification is more sensitive to the
minority class. Extensive experiments have shown that our strategy significantly
improves the performance of kNN and also outperforms popular re-sampling and
cost-sensitive learning strategies for imbalanced learning.
References
1. Aha, D.W. (ed.): Lazy learning. Kluwer Academic Publishers, Dordrecht (1997)
2. Aha, D.W., et al.: Instance-based learning algorithms. Machine Learning 6 (1991)
3. Bosch, A., et al.: When small disjuncts abound, try lazy learning: A case study.
In: BDCML (1997)
4. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of
machine learning algorithms. Pattern Recognition 30 (1997)
5. Chawla, N.V., et al.: SMOTE: Synthetic minority over-sampling technique. Journal
of Artificial Intelligence Research 16 (2002)
6. Cover, T., Hart, P.: Nearest neighbor pattern classification. Institute of Electrical
and Electronics Engineers Transactions on Information Theory 13 (1967)
7. Domingos, P.: Metacost: A general method for making classifiers cost-sensitive. In:
KDD 1999 (1999)
8. Elkan, C.: The foundations of cost-sensitive learning. In: IJCAI (2001)
9. Fawcett, T., Provost, F.J.: Adaptive fraud detection. Data Mining and Knowledge
Discovery 1(3) (1997)
10. Holte, R.C., et al.: Concept learning and the problem of small disjuncts. In: IJCAI
1989 (1989)
11. Kubat, M., et al.: Machine learning for the detection of oil spills in satellite radar
images. Machine Learning 30(2-3) (1998)
12. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: One-sided
selection. In: ICML 1997 (1997)
13. Ling, C., et al.: Data mining for direct marketing: Problems and solutions. In: KDD
1998 (1998)
14. Menzies, T., et al.: Data mining static code attributes to learn defect predictors.
IEEE Transactions on Software Engineering 33 (2007)
15. Provost, F., et al.: The case against accuracy estimation for comparing induction
algorithms. In: ICML 1998 (1998)
16. Provost, F.J., Fawcett, T.: Robust classification for imprecise environments. Ma-
chine Learning 42(3) (2001)
17. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Fran-
cisco (1993)
18. Swets, J.: Measuring the accuracy of diagnostic systems. Science 240(4857) (1988)
19. Ting, K.: The problem of small disjuncts: its remedy in decision trees. In: Canadian
Conference on Artificial Intelligence (1994)
20. Weiss, G.M.: Mining with rarity: a unifying framework. SIGKDD Explorations 6(1)
(2004)
21. Wilson, D.R., Martinez, T.R.: Reduction techniques for instance-based learning
algorithms. Machine Learning (2000)
22. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Tech-
niques. Morgan Kaufmann, San Francisco (2005)
Sample Subset Optimization for Classifying
Imbalanced Biological Data
1 Introduction
Modern molecular biology is rapidly advanced by the increasing use of computa-
tional techniques. For tasks such as RNA gene prediction [1], promoter recogni-
tion [2], splice site identification [3], and the classification of protein localization
sites [4], it is often necessary to address the problem of imbalanced class distri-
bution because the datasets extracted from those biological systems are likely to
contain a large number of negative examples (referred to as majority class) and
a small number of positive examples (referred to as minority class). Many pop-
ular classification algorithms such as support vector machine (SVM) have been
applied to a large variety of bioinformatics problems including those mentioned
above (e.g. refs. [1,3,4]). However, most of these algorithms are sensitive to the
imbalanced class distribution and may not perform well if directly applied to the imbalanced data [5,6].
Sampling is a popular approach to addressing the imbalanced class distribution [7]. Simple methods such as random under-sampling and random over-sampling are routinely applied in many bioinformatics studies [8]. With random under-sampling, the size of the majority class is reduced to compensate for the imbalance, whereas with random over-sampling, the size of the minority class is increased. Although straightforward and computationally efficient, these two methods are prone to either increased noise and duplicated samples or the removal of informative samples [9]. A more sophisticated approach, known as SMOTE, synthesizes "new" samples from the original samples in the dataset [10]. However, many bioinformatics problems present several thousands of samples with a highly imbalanced class distribution, and applying SMOTE introduces a large number of synthetic samples, which may increase the data noise substantially. Alternatively, a cost-metric can be specified to force the classifier to pay more attention to the minority class [11]. This requires choosing a correct cost-metric, which is often unknown a priori.
Several recent studies found that ensemble learning could improve the per-
formance of a single classifier in imbalanced data classification [6,12]. In this
study, we explore along this direction. In particular, we introduce a sample sub-
set optimization technique for ‘intelligent under-sampling’ in imbalanced data
classification. Using this technique, we designed an ensemble of SVMs specifi-
cally for learning from imbalanced biological datasets. This system has several
advantages over the conventional ones:
– It creates each base classifier using a roughly balanced training subset with
a built-in intelligent under-sampling. This is important in learning from im-
balanced data because it reduces the risk of bias towards one class while
neglecting the other one.
– The system embraces an ensemble framework in which multiple roughly bal-
anced training subsets are created to train an ensemble of classifiers. Thus,
it reduces the risk of removing informative samples from the majority class,
which may occur when a simple under-sampling technique is applied.
– As opposed to random sampling, the sample subset optimization technique
is applied to identify optimal sample subsets. This may improve the quality
of the base classifiers and result in a more accurate ensemble.
– The aforementioned biological problems often present several thousands of
training samples. The proposed technique is essentially an under-sampling
approach. It can avoid the introduction of data noise and the generated data
subsets may be more efficient for classifier training.
The rest of the paper discusses the details of the proposed sample subset op-
timization technique and the associated ensemble learning system. Section 2
presents the ensemble learning system. Section 3 describes the main idea of
sample subset optimization. The base classifier and fitness function of the en-
semble system are described in Section 4. Comparisons with typical sampling
and ensemble methods are given in Section 5. Section 6 concludes the paper.
2 Ensemble System
Ensemble learning is an effective approach for improving the prediction accuracy
of a single classification algorithm. Such an improvement is commonly achieved
by using multiple classifiers (known as the base classifiers) each trained on a
subset of samples created by random sampling such as those used in bagging
[13], or cost-sensitive sampling such as those used in boosting [14]. The base
classifiers are typically combined using an integration function such as averaging
[15] or majority voting [16].
We propose an ensemble learning system specifically designed for imbalanced
biological data classification. The schematic representation of the proposed sys-
tem is shown in Figure 1. It has three main components: sample subset optimization, the base classifier, and the fitness function. The key to this ensemble system is the application of the sample subset optimization technique (to be described in Section 3).
Suppose that a highly imbalanced dataset contains n samples from the majority class and m samples from the minority class, where n ≫ m. The system creates each sample subset by including all m minority samples and selecting a subset of samples from the n majority samples according to an internal optimization procedure. This procedure is conducted to generate multiple optimized sample subsets, each being a roughly balanced subset containing the m minority samples and n_i carefully selected majority samples, where n_i ≪ n (i = 1,...,L) and L is the total number of optimized sample subsets. Using these optimized sample subsets, we obtain a group of base classifiers c_i (i = 1,...,L), each trained on its corresponding sample subset {m + n_i}. The base classifiers are then combined using majority voting to form an ensemble of classifiers.
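As a minimal illustration of this combination step (our own sketch, not the paper's system; scikit-learn's SVC stands in for the base SVM, and the roughly balanced subsets are assumed to be given):

```python
import numpy as np
from sklearn.svm import SVC

def train_ensemble(subsets):
    """Train one SVM per roughly balanced subset (X_i, y_i)."""
    models = []
    for X_i, y_i in subsets:
        clf = SVC(kernel='linear')   # base classifier; kernel choice is ours
        clf.fit(X_i, y_i)
        models.append(clf)
    return models

def majority_vote(models, X):
    """Combine the base classifiers by majority voting over their predictions."""
    votes = np.stack([m.predict(X) for m in models])   # shape (L, n_samples)
    preds = []
    for col in votes.T:
        labels, counts = np.unique(col, return_counts=True)
        preds.append(labels[counts.argmax()])
    return np.array(preds)
```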
Algorithm 1 summarizes the procedure. A line starting with “//” in the al-
gorithm is a comment for its adjacent next line.
m m m m m’
…
Optimize samples from
n2 nL
…
n1
majority class
Base classifiers
n n’
c1 c2 … cL
Majority voting
Prediction
AUC value
Algorithm 1. sampleSubsetOptimization
Input: Imbalanced dataset D_I
Output: Roughly balanced dataset D_B
1:  cvSize = 2;
2:  cvSets = crossValidate(D_I, cvSize);
3:  for i = 1 to cvSize do
4:      // obtain the internal training samples
5:      D_i^T = getTrain(cvSets, i);
6:      // obtain the internal test samples
7:      D_i^t = getTest(cvSets, i);
8:      // obtain samples of the minority class
9:      D_i^minor = getMinoritySample(D_i^T);
10:     // obtain samples of the majority class
11:     D_i^major = getMajoritySample(D_i^T);
12:     // select a subset of samples from the majority class
13:     D_i^major = optimizeMajoritySample(D_i^major, D_i^minor, D_i^t);
14:     D_B = D_B ∪ (D_i^minor ∪ D_i^major);
15: end for
16: return D_B;
s_{i,j}(t+1) = \begin{cases} 0 & \text{if } random() \ge S(v_{i,j}(t+1)) \\ 1 & \text{if } random() < S(v_{i,j}(t+1)) \end{cases} \qquad (2)

S(v_{i,j}(t+1)) = \frac{1}{1 + e^{-v_{i,j}(t+1)}} \qquad (3)

where pbest_{i,j} and gbest_{i,j} are the previous best position and the best position found by the informants, respectively; c_1, r_1, c_2 and r_2 are the learning rates and social coefficients; and random() is a random number generator with a uniform distribution on [0, 1].
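Equations (2)–(3) gate each bit of a particle's position through a sigmoid of its velocity. A minimal sketch of one update step (our own code; the velocity update itself is the standard PSO rule, which Eq. (1) of the paper is assumed to define) is:

```python
import numpy as np

def bpso_step(v, s, pbest, gbest, c1=2.0, c2=2.0, w=1.0, rng=None):
    """One binary-PSO update: velocity, then sigmoid gating of each position bit (Eqs. 2-3)."""
    rng = rng or np.random.default_rng()
    r1, r2 = rng.random(v.shape), rng.random(v.shape)
    v = w * v + c1 * r1 * (pbest - s) + c2 * r2 * (gbest - s)   # standard velocity rule (assumed)
    S = 1.0 / (1.0 + np.exp(-v))                                 # Eq. (3)
    s = (rng.random(v.shape) < S).astype(int)                    # Eq. (2)
    return v, s
```

Each bit s_{i,j} acts as the indicator of whether the j-th majority sample is included in the i-th optimized subset.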
Representing this optimization procedure in pseudocode, we obtain Algorithm
2. Note that the PSO algorithm produces multiple optimized sample subsets in
parallel. Therefore, by specifying the popSize parameter, we can obtain any
number of optimized sample subsets with a single execution of the algorithm.
Algorithm 2. optimizeMajoritySamples
Input: Majority samples D_major, minority samples D_minor, internal test samples D_t
Output: Optimized sample subsets D_major^{p_i} (i = 1...L)
1: popSize = L;
2: initiateParticles(D_major, popSize);
3: for t = 1 to termination do
4:     // go through each particle in the population
5:     for i = 1 to popSize do
6:         // extract the samples according to the indicator function set
7:         D_major^{p_i} = extractSelectedSamples(p_i, D_major);
Fig. 2. The green lines are the classification boundary created using a linear SVM with
(a) the original dataset and (b) the dataset after optimization
Figure 2(a) shows the original dataset and the resulting classification boundary of a linear SVM, and Figure 2(b) shows a dataset after applying sample subset optimization and the resulting classification boundary of a linear SVM. Note that this is one of the optimized datasets, used to train one base classifier; our ensemble is the aggregation of multiple base classifiers trained on multiple optimized datasets. It is evident that the class ratio is more balanced after optimization (from 25:10 to 15:10). In addition, 3 of the 5 outlier samples are removed, and 7 redundant majority samples which have limited effect on the decision boundary of the linear SVM classifier are removed to correct the imbalanced class distribution.
\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{M}\xi_i
For building a classifier, a subset of samples from the majority class is selected according to an indicator function set p_i (see Section 3.1), and combined with the samples from the minority class to form a training set D_train^{p_i}. The goodness of an indicator function set can be assessed by the performance of the classifier trained with the samples specified by it. For imbalanced data, one effective way to evaluate the performance of the classifier is to use the area under the ROC curve metric [20]. Hence, we devise AUC(h_i(D_train^{p_i}, D_test)) as a component of the fitness function, where D_train^{p_i} denotes the training set generated using p_i and D_test denotes the test data. The function AUC() calculates the AUC value of a classification model h_i(D_a, D_b) which is trained on D_a and evaluated on D_b.
Moreover, the size of the subset is also important because a small training set
is likely to result in a poorly trained model with poor generalization. Therefore,
the fitness function can be constructed by combining the two components:
5 Experimental Results
In this section, we first describe four imbalanced biological datasets used in our
experiment. They are generated from several important and diverse biological
problems and represent different degrees of imbalanced class distribution. Next
we present the performance results of our ensemble algorithm compared with six
other algorithms using those datasets.
5.1 Datasets
We evaluated different algorithms using datasets generated for identification of
miRNA, classification of protein localization sites, and prediction of promoter
(drosophila and human). Specifically, the miRNA identification dataset contains
691 positive samples and 9248 negative samples, which is described by 21 fea-
tures [21]. The protein localization dataset is generated from the study discussed
in [22]. We attempted to differentiate membrane proteins (258) from the rests
(1226). The human promoter dataset contains 471 promoter sequences and 5131
coding sequences (CDS) and intron sequences. Compared to the human pro-
moter dataset, the drosophila promoter dataset has a relatively balanced class
distribution with 1936 promoter sequences and 2722 CDS and intron sequences.
We calculated the 16 dinucleotide features according to [23].
The datasets are summarized and organized according to class ratio in
Table 1.
[Fig. 3 plots: area under the ROC curve vs. number of base classifiers for SSO-SVMs, Bag-SVMs, Boost-SVMs, Single-SVM, ROS-SVM, RUS-SVM and SMOTE-SVM on the evaluation datasets.]
Fig. 3. The comparison of different algorithms for data classification. The x-axis de-
notes the ensemble sizes and the y-axis denotes the AUC value. For those algorithms
that use a single classifier, the same AUC value is plotted on different ensemble sizes
for the purpose of comparison.
Table 3. P-values from the one-tailed Student's t-test comparing the performance differences
significantly better than the other six methods, with a p-value smaller than 0.05.
Therefore, we confirmed the effectiveness of the proposed ensemble approach.
6 Conclusion
In this paper we introduced a sample subset optimization technique for sampling
optimal sample subsets from training data. We integrated this technique in an
ensemble learning framework and created an ensemble of SVMs specifically for
imbalanced biological data classification. The proposed algorithm was applied to
several bioinformatics tasks with moderate and highly imbalanced class distribu-
tions. According to our experimental results, (1) the approaches based on data
sampling for a single SVM are generally less effective compared to the ensemble
approaches; (2) the proposed sample subset optimization technique appears to
be very effective and the ensemble optimized by this technique produced the
best classification results in terms of AUC value for all evaluation datasets.
References
1. Meyer, I.M.: A practical guide to the art of RNA gene prediction. Briefings in Bioinformatics 8(6), 396–414 (2007)
2. Zeng, J., Zhu, S., Yan, H.: Towards accurate human promoter recognition: a re-
view of currently used sequence features and classification methods. Briefings in
Bioinformatics 10(5), 498–508 (2009)
3. Sonnenburg, S., Schweikert, G., Philips, P., Behr, J., Rätsch, G.: Accurate splice
site prediction using support vector machines. BMC Bioinformatics 8(suppl. 10),
7 (2007)
4. Hua, S., Sun, Z.: Support vector machine approach for protein subcellular local-
ization prediction. Bioinformatics 17(8), 721–728 (2001)
5. Akbani, R., Kwek, S., Japkowicz, N.: Applying support vector machines to imbal-
anced datasets. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.)
ECML 2004. LNCS (LNAI), vol. 3201, pp. 39–50. Springer, Heidelberg (2004)
6. Liu, Y., An, A., Huang, X.: Boosting prediction accuracy on imbalanced datasets
with SVM ensembles. In: Ng, W.-K., Kitsuregawa, M., Li, J., Chang, K. (eds.)
PAKDD 2006. LNCS (LNAI), vol. 3918, pp. 107–118. Springer, Heidelberg (2006)
7. Japkowicz, N., Stephen, S.: The class imbalance problem: A systematic study.
Intelligent Data Analysis 6(5), 429–449 (2002)
8. Batuwita, R., Palade, V.: A New Performance Measure for Class Imbalance Learn-
ing. Application to Bioinformatics Problems. In: 2009 International Conference on
Machine Learning and Applications, pp. 545–550. IEEE, Los Alamitos (2009)
9. Chawla, N., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from
imbalanced data sets. ACM SIGKDD Explorations Newsletter 6, 1–6 (2004)
10. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: synthetic minority
over-sampling technique. Journal of Artificial Intelligence Research 16(1), 321–357
(2002)
11. Weiss, G.M.: Mining with rarity: a unifying framework. ACM SIGKDD Explo-
rations Newsletter 6(1), 7–19 (2004)
12. Hido, S., Kashima, H., Takahashi, Y.: Roughly balanced bagging for imbalanced
data. Statistical Analysis and Data Mining 2(5-6), 412–426 (2009)
13. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
14. Schapire, R.E., Freund, Y., Bartlett, P., Lee, W.S.: Boosting the margin: A new
explanation for the effectiveness of voting methods. The Annals of Statistics 26(5),
1651–1686 (1998)
15. Tax, D., Van Breukelen, M., Duin, R.: Combining multiple classifiers by averaging
or by multiplying? Pattern Recognition 33(9), 1475–1485 (2000)
16. Lam, L., Suen, S.Y.: Application of majority voting to pattern recognition: an
analysis of its behavior and performance. IEEE Transactions on Systems, Man,
and Cybernetics, Part A: Systems and Humans 27(5), 553–568 (1997)
17. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization. Swarm Intelli-
gence 1(1), 33–57 (2007)
18. Ben-Hur, A., Ong, C.S., Sonnenburg, S., Schölkopf, B., Rätsch, G.: Support vec-
tor machines and kernels for computational biology. PLoS Computational Biol-
ogy 4(10) (2008)
19. Hsieh, C., Chang, K., Lin, C., Keerthi, S., Sundararajan, S.: A dual coordinate de-
scent method for large-scale linear SVM. In: Proceedings of the 25th International
Conference on Machine Learning, pp. 408–415. ACM, New York (2008)
20. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8),
861–874 (2006)
21. Batuwita, R., Palade, V.: microPred: effective classification of pre-miRNAs for
human miRNA gene prediction. Bioinformatics 25(8), 989–995 (2009)
22. Horton, P., Nakai, K.: A probabilistic classification system for predicting the cellu-
lar localization sites of proteins. In: Proceedings of the Fourth International Con-
ference on Intelligent Systems for Molecular Biology, pp. 109–115. AAAI Press,
Menlo Park (1996)
23. Rani, T.S., Bhavani, S.D., Bapi, R.S.: Analysis of E. coli promoter recognition
problem in dinucleotide feature space. Bioinformatics 23(5), 582–588 (2007)
Class Confidence Weighted kNN Algorithms for Imbalanced Data Sets
1 Introduction
A data set is “imbalanced” if its dependent variable is categorical and the number
of instances in one class is substantially different from the number in the other class. Learning
from imbalanced data sets has been identified as one of the 10 most challenging
problems in data mining research [1].
In the literature on class imbalance problems, data-oriented methods use sampling techniques to over-sample instances in the minority class or under-sample those in the majority class, so that the resulting data is balanced.
A typical example is the SMOTE method [2] which increases the number of
minor class instances by creating synthetic samples. It has been recently pro-
posed that using different weight degrees on the synthetic samples (so-called
safe-level-SMOTE [3]) produces better accuracy than SMOTE. The focus of
algorithm-oriented methods has been on extensions and modifications of ex-
isting classification algorithms so that they can be more effective in dealing
with imbalanced data. For example, modifications of decision tree algorithms
have been proposed to improve the standard C4.5, such as HDDT [4] and
CCPDT [5].
kNN algorithms have been identified as one of the top ten most influential data mining algorithms [6] for their ability to produce simple but powerful classifiers. The k neighbors that are closest to a test instance are conventionally called prototypes; in this paper we use the concepts of “prototypes” and “instances” interchangeably.
(The first author of this paper acknowledges the financial support of the Capital Markets CRC.)
There are several advanced kNN methods proposed in the recent literature. Weinberger et al. [7] learned Mahalanobis distance matrices for kNN classification by using semidefinite programming, a method which they call large margin nearest neighbor (LMNN) classification. Experimental results of LMNN show large improvements over conventional kNN and SVM. Min et al. [8] have proposed DNet, which uses a non-linear feature mapping method pre-trained with Restricted Boltzmann Machines to achieve large-margin kNN classification. Recently, a new method, WDkNN, was introduced in [9]; it discovers optimal weights for each instance in the training phase, which are then taken into account during the test phase. This method has been demonstrated to be superior to other kNN algorithms including LPD [10], PW [11], A-NN [12] and WDNN [13].
In this paper, the model we propose is an algorithm-oriented method, and we preserve all of the original information and distribution of the training data sets. More specifically, the contributions of this paper are as follows:
1. We show that the mechanism of traditional kNN algorithms is equivalent to using only local prior probabilities to predict instances' labels, and from this perspective we illustrate why many existing kNN algorithms perform poorly on imbalanced data sets;
2. We propose CCW (class confidence weights), the confidence (likelihood) of a prototype's attribute values given its class label, which transforms these prior probabilities into posterior probabilities. We demonstrate that this transformation makes the kNN classification rule analogous to using a likelihood ratio test in the neighborhood;
3. We propose two methods, mixture modeling and Bayesian networks, to efficiently estimate the value of CCW.
The rest of the paper is structured as follows. In Section 2 we review existing
k NN algorithms and explain why they are flawed in learning from imbalanced
data. We define the CCW weighting strategy and justify its effectiveness in Section 3. Section 4 describes how CCW is estimated. Section 5 reports experiments and Section 6
concludes the paper.
2 Existing k NN Classifiers
Given labeled training data (xi , yi ) (i = 1,...,n), where xi ∈ Rd are feature
vectors, d is the number of features and yi ∈ {c1 , c2 } are binary class labels,
k NN algorithm finds a group of k prototypes from the training set that are
the closest to a test instance xt by a certain distance measure (e.g. Euclidean
distances), and estimates the test instance’s label according to the predominance
of a class in this neighborhood. When there is no weighting (NW) strategy, this
majority voting mechanism can be expressed as:
NW: yt = arg max I(yi = c) (1)
c∈{c1 ,c2 }
xi ∈φ(xt )
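As an illustration of the unweighted voting rule in Eq. (1), a minimal sketch follows; function and variable names are ours:

```python
import numpy as np

def knn_nw(X_train, y_train, x_test, k=5):
    """Unweighted (NW) kNN: majority vote among the k nearest prototypes."""
    dists = np.linalg.norm(X_train - x_test, axis=1)       # Euclidean distances
    neighbors = np.argsort(dists)[:k]                       # indices forming phi(x_t)
    votes = {}
    for i in neighbors:
        votes[y_train[i]] = votes.get(y_train[i], 0) + 1    # I(y_i = c) summed per class
    return max(votes, key=votes.get)
```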
AI: $y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \cdot \left(1 - \frac{dist(x_t, x_i)}{dist_{max}}\right)$   (3)

where $dist(x_t, x_i)$ represents the distance between the test point $x_t$ and a prototype $x_i$, and $dist_{max}$ is the maximum possible distance between two training instances in the feature space, which normalizes $\frac{dist(x_t, x_i)}{dist_{max}}$ to the range of [0,1].
While MI and AI solve the problem of large distance variance among the k neighbors, their effect becomes insignificant if the neighborhood of a test point is considerably dense and one of the classes (or both) is over-represented by its samples – since in this scenario all of the k neighbors are close to the test point and the differences among their distances are not discriminative [9].
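A corresponding sketch of the AI rule in Eq. (3); computing dist_max as the largest pairwise distance in the training set is an assumption about how it would be obtained in practice:

```python
import numpy as np

def knn_ai(X_train, y_train, x_test, k=5):
    """AI-weighted kNN: votes discounted by normalised distance (Eq. 3)."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    dist_max = np.max(                                   # largest pairwise training distance
        np.linalg.norm(X_train[:, None, :] - X_train[None, :, :], axis=2))
    neighbors = np.argsort(dists)[:k]
    scores = {}
    for i in neighbors:
        w = 1.0 - dists[i] / dist_max                    # (1 - dist/dist_max) lies in [0, 1]
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + w
    return max(scores, key=scores.get)
```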
where $p_t(c_1)$ and $p_t(c_2)$ represent the proportions of classes $c_1$ and $c_2$ appearing in $\phi(x_t)$ – the k-neighborhood of $x_t$. If we integrate this kNN classification rule into Bayes' theorem, treat $\phi(x_t)$ as the sample space and treat $p_t(c_1)$ and $p_t(c_2)$ as priors of the two classes in this sample space (we note that $p_t(c_1)$ and $p_t(c_2)$ are conditioned on $x_t$ in the sample space of the overall training data, but unconditioned in the sample space of $\phi(x_t)$), Eq. 4 intuitively illustrates that the classification mechanism of kNN is based on finding the class label that has the higher prior value.
This suggests that traditional kNN uses only the prior information to estimate class labels, which leads to suboptimal classification performance on the minority class when the data set is highly imbalanced. Suppose $c_1$ is the dominating class label; then the inequality $p_t(c_1) \gg p_t(c_2)$ is expected to hold true in most regions of the feature space.
[Figure 1: scatter plots of kNN classification (k = 5) on balanced and imbalanced two-class Gaussian data; panels (a) and (c) show global views, (b) and (d) zoomed-in views around a misclassified positive sample.]
Especially in the overlap regions of the two class labels, kNN always tends to be biased towards $c_1$. Moreover, because the dominating class is likely to be over-represented in the overlap regions, “distance weighting” strategies such as MI and AI are ineffective in correcting this bias.
Figure 1 shows an example where kNN is performed using the Euclidean distance measure with k = 5. Samples of the positive and negative classes are generated from Gaussian distributions with means $[\mu_1^{pos}, \mu_2^{pos}] = [6, 3]$ and $[\mu_1^{neg}, \mu_2^{neg}] = [3, 6]$ respectively and a common standard deviation I (the identity matrix).
The (blue) triangles are samples of the negative/majority class, the (red) un-
filled circles are those of the positive/minority class, and the (green) filled circles
indicate the positive samples incorrectly classified by the conventional k NN al-
gorithm. The straight line in the middle of two clusters suggests a classification
boundary built by an ideal linear classifier. Figure 1(a) and 1(c) give global
overall views of k NN classifications, while Figure 1(b) and 1(d) are their corre-
sponding “zoom-in” subspaces that focus on a particular misclassified positive
sample. Imbalanced data is sampled under the class ratio of Pos:Neg = 1:10.
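The synthetic setting above is straightforward to reproduce; a sketch using the stated means and identity covariance (the absolute sample counts are illustrative, only the 1:10 ratio comes from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Positive (minority) and negative (majority) classes, ratio Pos:Neg = 1:10
n_pos, n_neg = 50, 500
X_pos = rng.normal(loc=[6, 3], scale=1.0, size=(n_pos, 2))   # mean [6, 3], unit variance
X_neg = rng.normal(loc=[3, 6], scale=1.0, size=(n_neg, 2))   # mean [3, 6], unit variance

X = np.vstack([X_pos, X_neg])
y = np.array([1] * n_pos + [0] * n_neg)                       # 1 = positive, 0 = negative
```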
As we can see from Figure 1(a) and 1(b), when data is balanced all of the
misclassified positive samples are on the upper left side of the classification
boundary, and are always surrounded by only negative samples. But when data is
imbalanced (Figure 1(c) and 1(d)), misclassifications of positives appear on both
sides of the boundary. This is because the negative class is over-represented and
dominates much larger regions than the positive class. The incorrectly classified
positive point in Figure 1(d) is surrounded by 4 negative and 1 positive neighbors, with a negative neighbor being the closest prototype to the test point. In this scenario, distance weighting strategies (e.g., MI and AI) cannot help correct the bias towards the negative class. In the next section, we introduce CCW and explain how it can solve such problems and correct the bias.
3 CCW Weighted kNN
To improve the existing kNN rule, we introduce CCW to capture the probability (confidence) of the attribute values given a class label. We define CCW on a training instance i as follows:

$w_i^{CCW} = p(x_i \mid y_i),$   (5)

where $x_i$ and $y_i$ represent the attribute vector and the class label of instance i.
Then the resulting classification rule integrated with CCW is:

CCW: $y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \cdot w_i^{CCW},$   (6)
$L_0 = \sum_{i=1}^{j} p(x_i \mid y_i = c_1), \qquad L_1 = \sum_{i=j+1}^{k} p(x_i \mid y_i = c_2), \qquad x_i \in \phi(x_t)$
Note that the numerator and the denominator in the fraction of Eq. 10 corre-
spond to the two terms of the maximization problem in Eq. 9. It is essential
to ensure the majority class does not have higher priority than the minority in
imbalanced data, so we choose “Λ = 1” as the rejection threshold. Then the
mechanism of using Eq. 9 as the k NN classification rule is equivalent to “predict
xt to be c2 when Λ ≤ 1” (reject H0 ), and “predict xt to be c1 when Λ > 1” (do
not reject H0 ).
Example 1. We reuse the example in Figure 1. The size of the triangles/circles is proportional to their CCW weights: the larger the size of a triangle/circle, the greater the weight of that instance, and the smaller the size, the lower the weight. In Figure 1(d), the misclassified positive instance has four negative-class neighbors with CCW weights 0.0245, 0.0173, 0.0171 and 0.0139, and one positive-class neighbor with weight 0.1691. Then the total negative-class weight is 0.0728, the total positive-class weight is 0.1691, and the CCW ratio is 0.0728/0.1691 < 1, which gives a label prediction of the positive (minority) class. So even though the closest prototype to the test instance comes from the wrong class, which also dominates the test instance's neighborhood, a CCW weighted kNN can still correctly classify this actual positive test instance.
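The decision in Example 1 amounts to summing the CCW weights per class in the neighborhood and picking the larger total (equivalently, comparing the ratio Λ with 1). A minimal sketch using the quoted weights; how p(x|y) itself is estimated (mixture models or Bayesian networks) is deferred to Section 4:

```python
def ccw_vote(neighbor_labels, ccw_weights):
    """CCW-weighted kNN vote: sum p(x_i | y_i) per class and take the maximum."""
    totals = {}
    for label, w in zip(neighbor_labels, ccw_weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get), totals

# Neighborhood from Example 1 / Figure 1(d): four negative neighbors, one positive neighbor
labels  = ["neg", "neg", "neg", "neg", "pos"]
weights = [0.0245, 0.0173, 0.0171, 0.0139, 0.1691]
pred, totals = ccw_vote(labels, weights)
print(pred, totals)   # -> 'pos', {'neg': 0.0728, 'pos': 0.1691}
```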
approaches (i.e. WDkNN2, LMNN3, DNet4, CCPDT5 and HDDT6) and data-oriented methods (i.e. safe-level-SMOTE). We note that since WDkNN has been demonstrated (in [9]) to be better than LPD, PW, A-NN and WDNN, in our experiments we include only WDkNN among them. CCPDT and HDDT are pruned by Fisher's exact test (as recommended in [5]). All experiments are carried out using 5×2-fold cross-validation, and the final results are the average of the repeated runs.
2 We implement CCW-based kNNs and WDkNN inside the Weka environment [17].
3 The code is obtained from www.cse.wustl.edu/~kilian/Downloads/LMNN.html
4 The code is obtained from www.cs.toronto.edu/~cuty/DNetkNN_code.zip
5 The code is obtained from www.cs.usyd.edu.au/~weiliu/CCPDT_src.zip
6 The code is obtained from www.nd.edu/~dial/software/hddt.tar.gz
7 http://www.kddcup-orange.com/data.php
8 http://www.agnostic.inf.ethz.ch
9 http://lib.stat.cmu.edu/
Fig. 2. Classification improvements from CCW with Manhattan distance (ℓ1 norm), Euclidean distance (ℓ2 norm) and Chebyshev distance (ℓ∞ norm)
References
1. Yang, Q., Wu, X.: 10 challenging problems in data mining research. International
Journal of Information Technology and Decision Making 5(4), 597–604 (2006)
2. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE. Journal of Artificial
Intelligence Research 16(1), 321–357 (2002)
3. Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C.: Safe-Level-SMOTE. In:
Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009.
LNCS, vol. 5476, pp. 475–482. Springer, Heidelberg (2009)
4. Cieslak, D., Chawla, N.: Learning Decision Trees for Unbalanced Data. In: Daele-
mans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part I. LNCS
(LNAI), vol. 5211, pp. 241–256. Springer, Heidelberg (2008)
5. Liu, W., Chawla, S., Cieslak, D., Chawla, N.: A Robust Decision Tree Algorithms
for Imbalanced Data Sets. In: Proceedings of the Tenth SIAM International Con-
ference on Data Mining, pp. 766–777 (2010)
6. Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan,
G., Ng, A., Liu, B., Yu, P., et al.: Top 10 algorithms in data mining. Knowledge
and Information Systems 14(1), 1–37 (2008)
7. Weinberger, K., Saul, L.: Distance metric learning for large margin nearest neigh-
bour classification. The Journal of Machine Learning Research 10, 207–244 (2009)
8. Min, R., Stanley, D.A., Yuan, Z., Bonner, A., Zhang, Z.: A deep non-linear feature
mapping for large-margin knn classification. In: Proceedings of the 2009 Ninth
IEEE International Conference on Data Mining, pp. 357–366 (2009)
9. Yang, T., Cao, L., Zhang, C.: A Novel Prototype Reduction Method for the K-Nearest Neighbor Algorithms with K ≥ 1. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) PAKDD 2010. LNCS, vol. 6119, pp. 89–100. Springer, Heidelberg (2010)
10. Paredes, R., Vidal, E.: Learning prototypes and distances. Pattern Recogni-
tion 39(2), 180–188 (2006)
11. Paredes, R., Vidal, E.: Learning weighted metrics to minimize nearest-neighbor
classification error. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 1100–1110 (2006)
12. Wang, J., Neskovic, P., Cooper, L.: Improving nearest neighbor rule with a simple
adaptive distance measure. Pattern Recognition Letters 28(2), 207–213 (2007)
13. Jahromi, M.Z., Parvinnia, E., John, R.: A method of learning weighted similar-
ity function to improve the performance of nearest neighbor. Information Sci-
ences 179(17), 2964–2973 (2009)
14. Cooper, G., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9(4), 309–347 (1992)
15. Han, E., Karypis, G.: Centroid-based document classification. In: Zighed, D.A.,
Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp.
116–123. Springer, Heidelberg (2000)
16. Asuncion, A., Newman, D.: UCI Machine Learning Repository (2007)
17. Witten, I., Frank, E.: Data mining: practical machine learning tools and techniques
with Java implementations. ACM SIGMOD Record 31(1), 76–77 (2002)
18. Hendricks, W., Robey, K.: The sampling distribution of the coefficient of variation.
The Annals of Mathematical Statistics 7(3), 129–132 (1936)
19. Davis, J., Goadrich, M.: The relationship between precision-recall and roc curves.
In: Proceedings of the 23rd International Conference on Machine Learning, pp.
233–240 (2006)
20. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. The Jour-
nal of Machine Learning Research 7, 1–30 (2006)
Multi-agent Based Classification Using
Argumentation from Experience
1 Introduction
Argumentation is concerned with the dialogical reasoning processes required to
arrive at a conclusion given two or more alternative viewpoints. The process of
multi-agent argumentation is conceptualised as a discussion, about some issue
that requires a solution, between a set of software agents with different points of
view; where each agent attempts to persuade the others that its point of view,
and the consequent solution, is the correct one. In this paper we propose apply-
ing argumentation to facilitate classification. In particular, it is argued that one
model of argumentation, Arguing from Experience ([24,23]), is well suited to the
classification tasks. Arguing from Experience provides a computational model of
argument based on inductive reasoning from past experience. The arguments are
constructed dynamically using Classification Association Rule Mining (CARM)
techniques. The setting is a “debate” about how to classify examples; the gen-
erated Classification Association Rules (CARs) provide reasons for and against
particular classifications.
The proposed model allows a number of agents to draw directly from past
examples to find reasons for coming to a decision about the classification of an
unseen instance. Agents formulate their arguments in the form of CARs gen-
erated from datasets of past examples. Each agent’s dataset is considered to
Each Participant Agent has its own distinct (tabular) local dataset relating to a
classification problem (domain). These agents produce reasons for and against
classifications by mining CARs from their datasets using a number of CARM
algorithms (Section 3). The antecedent of every CAR represents a set of reasons
for believing the consequent. In other words given a CAR, P → c, this should
be read as: P are reasons to believe that the case should classify as c. CARs
are mined dynamically as required. The dynamic mining provides for a number of different types of move, each encapsulated by a distinct category of CAR. Each Participant Agent can employ any one of the following types of move to generate arguments: (i) Proposing moves, (ii) Attacking moves, and (iii) Refining moves. The different moves available are discussed further below. Note that each of these moves has a set of legal next moves (see Table 1).
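To make the rule-based moves concrete, the sketch below represents a CAR P → c with its support and confidence and shows a proposing move that returns a sufficiently confident rule for the agent's own class; the data structure, threshold and labels are illustrative assumptions, not taken from PISA itself:

```python
from dataclasses import dataclass

@dataclass
class CAR:
    antecedent: frozenset   # attribute set P (reasons)
    consequent: str         # class label c
    support: float
    confidence: float

def propose(rules, own_class, case_attrs, min_conf=0.7):
    """Proposing move: pick the most confident CAR for the agent's class
    whose antecedent is satisfied by the case under discussion."""
    candidates = [r for r in rules
                  if r.consequent == own_class
                  and r.antecedent <= case_attrs
                  and r.confidence >= min_conf]
    return max(candidates, key=lambda r: r.confidence, default=None)

# Hypothetical usage
rules = [CAR(frozenset({"a1", "a3"}), "classA", 0.20, 0.85)]
print(propose(rules, "classA", frozenset({"a1", "a2", "a3"})))
```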
In order to realise the above, each Participant Agent utilises a T-tree [6] to
summarise its local dataset. A T-tree is a reverse set enumeration tree structure
where nodes are organised using reverse lexicographic ordering, which in turn
enables direct indexing according to attribute number; therefore computational
efficiency gains are achieved. A further advantage, with respect to PISA, is that
the reverse ordering dictates that each sub-tree is rooted at a particular class
attribute, and so all the attribute sets pertaining to a given class are contained
in a single T-tree branch. This means that any one of the identified dynamic
CARM requests need be directed at only one branch of the tree. This reduces
the overall processing cost compared to other prefix tree structures (such as
FP-Trees [16]). To further enhance the dynamic generation of CARs a set of
algorithms that work directly on T-trees were developed. These algorithms were
able to mine CARs satisfying different values of support threshold. At the start
of the dialogue each player has an empty T-tree and slowly builds a partial
T-tree from their data set, as required, containing only the nodes representing
attributes from the case under discussion plus the class attribute. Note that no
node pruning, according to some user specified threshold, takes place; except
for nodes that have zero support. Two dynamic CAR retrieval algorithms were
developed: (i) Algorithm A which finds a rule that conforms to a given set of
constraints, and (ii) Algorithm B which distinguishes a given rule by adding
additional attributes. Further details of these algorithms can be found in [25].
4 Applications of PISA
Arguing from Experience enables PISA agents to undertake a number of different
tasks, mainly:
1. Multi-agent Classification: Follows the hypothesis that the described oper-
ation of PISA produces at least comparative results to that obtained using
traditional classification paradigms.
2. Ordinal Classification: Follows the hypothesis that PISA can be successfully
applied to datasets with ordered-classes, using a simple agreement strategy.
3. Classifying imbalanced data using dynamic coalitions: Follows the hypothesis
that dynamic coalitions between a number of participant agents, representing
rare classes, improves the performance of PISA with imbalanced multi-class
datasets.
In this section the above applications of PISA are empirically evaluated. For the
evaluation we used a number of real-world datasets drawn from the UCI reposi-
tory [4]. Where appropriate continuous values were discretised into ranges. The
chosen datasets (Table 2) display a variety of characteristics with respect to
number of records (R), number of classes (C) and number of attributes (A). Im-
portantly, they include a diverse number of class labels, distributed in a different
manner in each dataset (balanced and unbalanced), thus providing the desired
variation in the experience assigned to individual PISA participants.
Table 2. Summary of data sets. Columns indicate: domain name, number of records, number of classes, number of attributes and class distribution (approximately balanced or not).
Name R C A Bal Name R C A Bal
Hepatitis 155 2 19(56) no Ionosphere 351 2 34(157) no
HorseColic 368 2 27(85) no Congressional Vot- 435 2 17(34) yes
ing
Cylinder Bands 540 2 39(124) yes Breast 699 2 11(20) yes
Pima (Diabetes) 768 2 9(38) yes Tic-Tac-Toe 958 2 9(29) no
Mushrooms 8124 2 23(90) yes Adult 48842 2 14(97) no
Iris 150 3 4(19) yes Waveform 5000 3 22(101) yes
Wine 178 3 13(68) yes Connect4 67557 3 42(120) no
Lymphography 148 4 18(59) no Car Evaluation 1728 4 7(25) no
Heart 303 5 22(52) no Nursery 12960 5 9(32) no
Dermatology 366 6 49(49) no Annealing 898 6 38(73) no
Zoo 101 7 17(42) no Automobile (Auto) 205 7 26(137) no
Glass 214 7 10(48) no Page Blocks 5473 7 11(46) no
Ecoli 336 8 8(34) no Solar Flare 1389 9 10(39) no
Led7 3200 10 8(24) yes Pen Digits 10992 10 17(89) yes
Chess 28056 18 6(58) no
Table 3. Summary of the Ensemble Methods used. The implementation of these meth-
ods was obtained from [15]. (S=Support, RDT=Random Decision Trees)
1. Decision trees: Both C4.5, as implemented in [15], and the Random Decision
Tree (RDT)[8], were used.
2. CARM : The TFPC (Total From Partial Classification) algorithm [7] was
adopted because this algorithm utilises similar data structures [6] as PISA.
3. Ensemble classifiers: Table 3 summarises the techniques used. We chose to
apply Boosting and Bagging, combined with decision trees, because previous
work demonstrated that such combination is very effective (e.g. [2,20]).
For each of the included methods (and PISA) three values were calculated
for each dataset: (i) classification error rate, (ii) Balanced Error Rate (BER)
using a confusion matrix obtained from each TCV2 ; and (iii) execution time.
2 Balanced Error Rates (BER) were calculated, for each dataset, as follows:

$BER = \frac{1}{C} \sum_{i=1}^{C} \frac{F_{c_i}}{F_{c_i} + T_{c_i}}$

where C = the number of classes in the dataset, $T_{c_i}$ = the number of cases which are correctly classified as class $c_i$, and $F_{c_i}$ = the number of cases which should have been classified as $c_i$ but were classified under a different class label.
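A small sketch of this BER computation from a C × C confusion matrix (rows = true classes, columns = predictions); purely illustrative:

```python
import numpy as np

def balanced_error_rate(confusion):
    """BER = (1/C) * sum_i F_ci / (F_ci + T_ci), from a C x C confusion matrix."""
    confusion = np.asarray(confusion, dtype=float)
    per_class_errors = []
    for i in range(confusion.shape[0]):
        t_ci = confusion[i, i]                   # cases correctly classified as class i
        f_ci = confusion[i, :].sum() - t_ci      # class-i cases classified as something else
        per_class_errors.append(f_ci / (f_ci + t_ci))
    return np.mean(per_class_errors)

print(balanced_error_rate([[90, 10], [30, 70]]))   # -> 0.2
```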
Table 4. Test set error rate (%). Values in bold are the lowest in a given dataset.
Ensembles Decision Trees
Dataset PISA Bagging ADABoost.M1 MultiBoost TFPC
Decorate
C4.5 RDT C4.5 RDT C4.5 RDT C4.5 RDT
Hepatitis 13.33 18.06 14.84 15.48 21.29 13.55 18.71 16.13 16.13 23.23 18.00
Ionosphere 3.33 7.69 6.84 7.12 10.83 6.27 10.83 7.41 8.55 2.57 14.29
HorseColic 2.78 3.89 22.78
Congress 1.78 3.01 2.31 2.08 3.01 2.08 3.01 2.77 4.16 0.00 9.30
CylBands 15.00 42.22 27.04 42.22 34.81 42.22 34.81 39.81 42.22 36.48 30.37
Breast 3.91 5.01 4.86 4.86 4.86 4.86 4.86 5.43 4.86 5.07 10.00
Pima 14.47 27.21 25.26 25.26 23.83 25.13 24.87 25.66 26.69 16.18 25.92
TicTacToe 2.84 7.20 5.43 2.19 20.35 2.19 20.35 5.85 15.45 20.77 33.68
Mushrooms 0.41 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.06 1.05
Adult 14.49 13.09 19.19
Iris 2.67 4.67 5.33 6.00 7.33 6.00 7.33 4.67 4.00 8.00 6.00
Waveform 2.16 17.97 11.98 21.48 21.48 13.62 11.96 21.48 21.48 2.42 33.32
Wine 1.18 0.00 25.29
Connect4 5.01 4.31 34.17
Lympho 6.23 18.92 19.59 14.86 29.73 15.54 29.73 19.59 22.97 25.00 24.29
Car Eval 4.11 4.51 1.24 2.43 6.25 2.60 6.25 4.28 5.09 5.90 30.00
Heart 5.05 20.07 19.73 22.79 21.09 19.05 19.73 20.41 19.05 4.67 46.67
Page Bloc 2.24 6.93 6.93 7.02 6.93 7.02 6.93 6.93 7.02 6.94 9.95
Nursery 6.37 2.08 3.09 0.38 3.09 0.35 3.09 1.91 2.62 3.72 22.25
Dematology 4.96 4.10 3.55 3.83 15.30 3.28 15.30 1.64 6.01 5.28 25.00
Annealing 9.55 1.22 0.67 0.45 1.78 0.56 1.78 1.34 1.56 1.67 11.80
Zoo 9.90 7.92 4.95 3.96 19.80 3.96 19.80 6.93 7.92 0.00 8.00
Auto 12.00 15.12 15.61 14.15 21.46 15.61 21.46 16.10 18.05 17.00 29.00
Glass 14.69 27.10 21.50 22.43 29.91 25.23 29.91 29.91 33.18 29.91 33.81
Ecoli 5.17 13.99 15.18 16.37 24.70 14.88 24.70 13.10 15.77 8.79 37.27
Flare 6.09 2.48 3.41 3.41 3.41 3.41 3.10 3.10 2.48 8.03 14.74
Led7 12.00 24.81 24.16 24.84 24.28 24.91 24.34 24.75 24.84 24.25 31.03
Pen Digit 2.75 4.47 1.35 1.58 2.51 5.07 1.87 2.51 5.65 1.08 18.24
Chess 9.13 18.58 15.73
These three values then provided the criteria for assessing and comparing the
classification paradigms.
The results are presented in Table 4. From the table it can be seen that PISA performs consistently well, outperforming the other association rule classifier and giving comparable results to the decision tree methods. Additionally, PISA produced results comparable to those produced by the ensemble methods. Moreover, PISA scored an average overall accuracy of 93.60%, higher than that obtained from any of the other methods tested (e.g. Bagging-RDT (89.48%) and RDT (90.24%))3.
Table 5 shows the BER for each of the given datasets. From the table it can
be seen that PISA produced reasonably good results overall, producing the best
result in 14 out of the 39 datasets tested.
Table 6 gives the execution times (in milliseconds) for each of the methods.
Note that PISA is not the fastest method. However, the recorded performance
is by no means the worst (for instance Decorate runs slower than PISA with
respect to the majority of the datasets). Additionally, PISA seems to run faster
than Bagging and ADABoost with some datasets.
Table 5. Test set BER (%). Values in bold are the lowest in a given dataset.
Ensembles Decision Trees
Dataset PISA Bagging ADABoost.M1 MultiBoost TFPC
Decorate
C4.5 RDT C4.5 RDT C4.5 RDT C4.5 RDT
Hepatitis 12.00 27.41 20.63 23.37 33.69 19.89 25.05 24.60 23.38 38.19 36.44
Ionosphere 4.58 7.08 6.63 6.42 11.43 5.31 11.43 7.08 8.17 2.19 13.41
HorseColic 2.80 3.71 28.63
Congress 2.35 3.43 2.66 2.27 3.19 2.27 2.27 3.05 4.69 0.00 9.71
CylBands 14.50 46.10 24.48 46.10 35.63 46.10 35.63 40.14 46.10 34.56 32.78
Breast 4.75 6.03 6.20 6.20 6.20 6.07 6.20 6.71 6.20 4.71 12.89
Pima 13.94 28.88 26.86 26.86 25.18 26.72 26.12 27.16 28.34 24.47 33.67
TicTacToe 2.14 6.71 5.35 2.25 22.46 2.25 22.46 5.25 16.98 22.94 47.44
Mushrooms 0.59 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.06 1.04
Adult 8.80 17.75 39.89
Iris 2.90 4.61 5.29 5.93 7.32 5.93 7.32 4.69 3.96 7.96 6.07
Waveform 3.93 18.00 11.99 21.51 21.51 13.64 11.97 21.51 21.51 2.39 33.35
Wine 1.42 0.00 24.05
Connect4 11.90 5.33 66.67
Lympho 15.95 30.12 9.74 25.90 43.43 38.66 43.43 39.35 35.97 47.11 16.09
Car Eval 8.21 11.25 6.77 4.79 10.43 5.29 10.43 10.24 16.55 10.67 75.00
Heart 8.25 9.16 8.97 9.93 7.98 7.85 8.97 9.43 7.97 9.82 48.02
Page Bloc 9.45 21.46 22.85 27.89 21.46 27.87 21.46 22.85 27.89 21.48 19.89
Nursery 5.47 4.14 2.28 1.05 5.82 0.76 5.78 4.55 5.98 5.74 40.10
Dematology 8.49 4.68 3.89 3.93 19.41 3.31 19.41 1.84 7.00 3.25 61.67
Annealing 16.13 6.76 3.92 2.57 4.31 3.25 4.31 7.16 6.83 4.43 33.51
Zoo 13.23 12.78 10.71 10.71 36.51 10.71 36.51 15.71 17.50 0.00 17.14
Auto 12.26 11.43 15.98 10.55 18.92 15.98 18.92 12.84 17.04 13.57 19.60
Glass 16.09 24.57 19.42 24.33 29.56 23.18 29.56 29.56 37.98 29.56 48.55
Ecoli 16.18 36.66 40.18 41.18 51.89 37.92 51.89 24.16 43.35 9.42 23.23
Flare 17.18 12.77 12.66 10.91 12.66 12.66 12.62 12.62 12.54 7.59 14.74
Led7 11.84 24.56 24.07 24.87 24.31 25.09 24.23 24.92 24.72 24.36 31.39
Pen Digit 3.47 4.48 1.51 1.59 2.23 4.92 1.89 2.23 5.57 3.73 18.38
Chess 9.63 16.38 24.53
warm and hot). Given ordered classes, one is not only concerned to maximise the
classification accuracy, but also to minimise the distances between the actual and
the predicted classes. The problem of ordinal classification is often solved by either
multi-class classification or regression methods. However, some new approaches, tailored specifically for ordinal classification, have been introduced in the literature (e.g. [13,22]). PISA can be utilised for ordinal classification by means of
biased agreement. Agents in PISA have the option to agree with CARs suggested
by other agents, by not attacking these rules, even if a valid attack is possible. PISA
agents can either agree with all the opponents or with a pre-defined set of opponents
that match the class order. For instance, in the weather scenario, agents supporting the decision that the weather is hot agree with those of the opinion that the weather is warm, and vice versa, whereas agents supporting that the weather is cold or mild agree with each other. We refer to the latter form of agreement by the term biased agreement, in which the agents are equipped with a simple list of the class labels that they could agree with (the agreement list). Here, we have two forms of this mode of agreement:
1. No Attack Biased Agreement (NA-BIA): In which agents consult their agreement list before mining any rules from their local datasets and attempt only to attack/respond to CARs of the following shape: P → Q : ∀q ∈ Q, q ∉ agreement list (a small sketch of this check is given after this list).
2. Confidence Threshold Biased Agreement (CT-BIA): Here, if the agents fail to attack any CARs that contradict their agreement list, then they try to attack CARs (P → Q : ∃q ∈ Q, q ∈ agreement list) if and only if they fail to mine a matching CAR, with the same or higher confidence, from their own local dataset (P → Q′ : Q′ ⊇ Q).
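A minimal sketch of the NA-BIA check referred to in item 1 above (function name and labels are ours): an agent only attacks a CAR whose consequent contains no label from its agreement list:

```python
def may_attack_na_bia(car_consequent, agreement_list):
    """NA-BIA: attack/respond only if no label in the CAR's consequent
    is on the agent's agreement list (forall q in Q: q not in agreement list)."""
    return all(q not in agreement_list for q in car_consequent)

# Hypothetical ordinal weather labels
print(may_attack_na_bia({"hot"}, agreement_list={"warm"}))   # True  -> may attack
print(may_attack_na_bia({"warm"}, agreement_list={"warm"}))  # False -> agree (no attack)
```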
Table 6. Test set execution times (milliseconds). Values in bold are the lowest in a
given dataset.
cm Ensembles Decision Trees
Dataset PISA Bagging ADABoost.M1 MultiBoost TFPC
Decorate
C4.5 RDT C4.5 RDT C4.5 RDT C4.5 RDT
Hepatitis 115 110 40 190 70 200 60 610 40 60 213
Ionosphere 437 1130 210 1170 20 1210 20 4090 80 12 109
HorseColic 17 4.8 108
Congress 34 50 20 20 140 130 20 590 30 15 154
CylBands 83 110 130 40 20 40 20 1190 40 17 936
Breast 31 110 110 140 110 170 170 330 8.1 8 11
Pima 75 160 90 80 130 80 110 500 20 21 11
TicTacToe 71 80 70 250 30 280 10 620 20 6.1 61.4
Mushrooms 313 750 380 110 50 60 50 6400 80 117 630
Adult 3019 706 1279
Iris 42 40 50 60 50 50 10 110 10 13 2
Waveform 1243 1840 380 4400 830 1650 560 4730 200 102 862
Wine 136 106 163
Connect4 4710 3612 6054
Lympho 15 80 50 90 10 70 10 140 5 5 29
Car Eval 74 300 110 370 20 20 310 1580 80 24 17
Heart 343 250 80 480 20 430 10 620 20 5 183
Page Bloc 159 130 430 430 130 280 130 430 120 55 60
Nursery 965 1790 720 3130 60 3760 10 1449 110 139 204
Dematology 194 160 40 230 20 20 20 480 20 7 169
Annealing 750 1090 120 850 10 1170 10 3340 50 28 689
Zoo 43 40 10 20 10 30 10 110 10 5 85
Auto 210 440 70 320 10 350 10 520 20 5 43
Glass 180 260 120 340 10 430 10 1060 20 10 43
Ecoli 139 240 150 360 10 340 10 1510 10 3 4
Flare 239 30 20 60 40 20 20 140 10 27 23
Pen Digits 1345 2300 460 5810 820 2790 800 2300 290 80 1606
Led7 78 730 360 260 130 1150 480 3380 110 90 25
Chess 2412 334 226
Table 7. The application of PISA with datasets from Table 2 with ordered classes
Datasets ER BER MSE MAE
PISA CT-BIA NA-BIA PISA CT-BIA NA-BIA PISA CT-BIA NA-BIA PISA CT-BIA NA-BIA
Lympo 6.21 4.76 3.38 15.95 20.73 13.94 0.199 0.046 0.015 2.07 1.36 0.84
Car Eval 4.11 5.00 4.03 9.53 10.09 10.61 0.863 1.220 0.708 1.02 1.32 1.01
Page Bloc 2.67 3.64 3.91 13.43 10.42 10.06 1.250 5.164 4.757 0.49 0.78 0.83
Nursery 6.37 6.27 5.83 11.79 13.57 7.88 7.450 7.071 6.725 1.61 1.57 1.46
Dema 4.96 7.95 6.87 8.49 8.74 7.53 0.144 0.143 0.100 1.46 1.37 1.24
Zoo 9.90 7.92 6.86 13.23 14.67 12.17 0.223 0.230 0.232 2.26 2.26 1.96
Ecoli 6.03 5.52 4.34 16.81 6.72 6.91 0.008 0.008 0.005 8.23 7.92 4.63
To test the hypothesis that the above approach improves the performance of
PISA when applied to ordinal classification a series of TCV tests, using a number
of datasets from Table 2 which have ordered classes, were conducted. PISA was
run using the NA-BIA and CT-BIA strategies, and the results were compared
against the use of PISA without any agreement strategy. Additionally, to provide
better comparison the Mean Squared Error (MSE) and the Mean Absolute Error
(MAE) rates for the included datasets and methods were calculated. [11] notes
that little attention has been directed at the evaluation of ordinal classification
solutions, and that simple measures, such as accuracy, are not sufficient. In [11] a
number of evaluation metrics, for ordinal classification, are compared. As a result
MSE is suggested as the best metric when more (smaller) errors are preferred
to reduce the number of large errors; while MAE is a good metric if, overall,
fewer errors are preferred with more tolerance for large errors. Table 7 provides
a summary of the results of the experiments. From the table it can be seen that
the NA-BIA produces better results with datasets with ordinal classes.
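For concreteness, MSE and MAE can be computed over ordinal classes by encoding the ordered labels as integer ranks; a small sketch with an illustrative encoding:

```python
def ordinal_errors(y_true, y_pred):
    """MSE and MAE over ordinal classes encoded as integer ranks (0, 1, 2, ...)."""
    diffs = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return mse, mae

# e.g. classes cold=0, mild=1, warm=2, hot=3 (illustrative encoding)
print(ordinal_errors([0, 1, 3, 2], [0, 2, 3, 0]))   # -> (1.25, 0.75)
```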
In the following we present a refinement of the basic PISA model which en-
ables PISA to tackle the imbalance-class problem in multi-class datasets, using
Dynamic Coalitions between agents representing the rare classes. Unlike the bi-
ased agreement approach (Sub-section 4.2), coalition requires mutual agreement
among a number of participants, thus a preparation step is necessary. However,
for the purposes of this paper we assume that the agents representing the rare
classes are in coalition from the start of the dialogue, thus eliminating the need
for a preparatory step. The agents in a coalition stop attacking each other, and
only attack CARs placed by agents outside the coalition. The objective of such
coalition is to attempt to remove the agents representing dominant class(es) from
the dialogue, or at least for a pre-defined number of rounds. Once the agent in
question is removed from the dialogue, the coalition is dismantled and the agents
go on attacking each others as in a normal PISA dialogue. In the following we
provide experimental analysis of two coalition techniques:
1. Coalition (1): The coalition is dismantled if the agent supporting the dom-
inant class does not participate in the dialogue for two consecutive rounds.
2. Coalition (2): The coalition is dismantled if the agent supporting the dom-
inant class does not participate in the dialogue for two consecutive rounds,
and this agent is not allowed to take any further part in the dialogue.
Table 8. The application of PISA with imbalanced multi-class datasets from Table 2
Datasets ER BER G-Mean Time
PISA Coal(1) Coal(2) PISA Coal(1) Coal(2) PISA Coal(1) Coal(2) PISA Coal(1) Coal(2)
Connect4 5.02 4.18 3.78 11.90 9.68 8.70 87.47 89.96 91.00 4710 5376 5818
Lympo 6.21 5.02 4.03 15.95 11.90 14.64 69.31 82.60 92.81 15 65 55
Car Eval 4.11 3.73 4.22 9.53 7.24 4.47 79.42 88.40 92.52 74 163 158
Heart 5.05 4.95 4.95 8.25 2.54 3.17 84.44 87.67 89.97 343 531 612
Page Bloc 2.24 1.43 1.14 13.43 7.96 9.63 68.17 85.43 84.02 159 207 222
Derma 4.96 3.91 3.60 8.49 4.95 4.48 75.79 84.27 90.14 194 119 107
Annealing 9.55 4.24 4.01 16.13 7.72 4.24 63.57 86.20 91.52 750 980 881
Zoo 9.90 8.00 7.00 13.23 8.33 3.92 67.19 85.42 85.51 43 93 85
Auto 12.00 6.37 5.77 12.26 6.53 6.64 79.74 87.88 90.87 210 336 293
Glass 14.69 12.02 5.74 16.09 7.45 5.81 80.12 93.60 93.24 180 178 171
Ecoli 6.03 5.15 5.64 16.18 10.93 3.92 74.16 87.31 96.01 139 86 81
Flare 6.09 7.10 6.86 17.18 5.58 5.15 77.41 91.21 95.76 2393 2291 6267
Chess 9.13 8.47 6.28 9.63 5.91 5.82 76.70 91.26 92.22 2412 3305 3393
To test the hypothesis that the above approaches improve the performance of PISA when applied to imbalanced class datasets, we ran a series of TCV tests using a number of datasets from Table 2 which have imbalanced class distributions. The results were compared against the use of PISA without any coalition strategy. Four measures were used in this comparison: error rate, balanced error rate, time and geometric mean (g-mean)4. This last measure was used to quantify the classifier performance across the classes [1]. Table 8 provides the results of the above experiment. From the table it can be seen that both coalition techniques boost the performance of PISA with imbalanced-class datasets, with very little additional time cost owing to the time needed to dismantle the coalitions.
5 Conclusions
The PISA Arguing from Experience Framework has been described. PISA al-
lows a collection of agents to conduct a dialogue concerning the classification of
an example. The system progresses in a round-by-round manner. During each
round agents can elect to propose an argument advocating their own position
or attack another agent’s position. The arguments are mined and expressed in
the form of CARs, which are viewed as generalisations of the individual agent’s
experience. In the context of classification PISA provides for a “distributed”
classification mechanism that harnesses all the advantages offered by Multi-agent
Systems. The effectiveness of PISA is comparable with that of other classification
paradigms. Furthermore the PISA approach to classification can operate with
temporally evolving data. We have also demonstrated that PISA can be utilised
to produce better performance with imbalanced classes and ordinal classification
problems.
References
1. Alejo, R., Garcia, V., Sotoca, J., Mollineda, R., Sanchez, J.: Improving the Perfor-
mance of the RBF Neural Networks with Imbalanced Samples. In: Proc. 9th Intl.
Conf. on Artl. Neural Networks, pp. 162–169. Springer, Heidelberg (2007)
4 The geometric mean is defined as $g\text{-}mean = \left(\prod_{i=1}^{C} p_{ii}\right)^{1/C}$, where $p_{ii}$ is the classification accuracy on class i.
23. Wardeh, M., Bench-Capon, T., Coenen, F.: Multi-Party Argument from Experi-
ence. In: McBurney, P., Rahwan, I., Parsons, S., Maudet, N. (eds.) ArgMAS 2009.
LNCS, vol. 6057, Springer, Heidelberg (2010)
24. Wardeh, M., Bench-Capon, T., Coenen, F.: Arguments from Experience: The
PADUA Protocol. In: Proc. COMMA 2008, Toulouse, France, pp. 405–416. IOS
Press, Amsterdam (2008)
25. Wardeh, M., Bench-Capon, T., Coenen, F.: Dynamic Rule Mining for Argumenta-
tion Based Systems. In: Proc. 27th SGAI Intl. Conf. on AI (AI 2007), pp. 65–78.
Springer, London (2007)
26. Webb, G.: MultiBoosting: A Technique for Combining Boosting and Wagging. J.
Machine Learning 40(2), 159–196 (2000)
Agent-Based Subspace Clustering
Chao Luo1 , Yanchang Zhao2 , Dan Luo1 , Chengqi Zhang1 , and Wei Cao3
1 Data Sciences and Knowledge Discovery Lab, Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering & IT, University of Technology, Sydney, Australia; {chaoluo,dluo,chengqi}@it.uts.edu.au
2 Data Mining Team, Centrelink, Australia; yanchang.zhao@centrelink.gov.au
3 Hefei University of Technology, China; caowei880428@163.com
1 Introduction
As an extension of traditional full-dimensional clustering, subspace clustering
seeks to find clusters in subspaces in high-dimensional data. Subspace clustering
approaches can provide fast search in different subspaces, so as to find clusters
hidden in subspaces of a full dimensional space.
The interpretability of the results is highly desirable in data mining applications. As a basic requirement, the clustering results should be easily utilized by other methods, such as visualization techniques. In the last decade, subspace clustering has been researched widely. However, there are still some issues in this area. Some subspace clustering methods, such as CLIQUE [3], produce only overlapping clusterings, where one data point can belong to several clusters. This makes the clusters fail to provide a clear description of the data. In addition, most subspace clustering methods generate low quality clusters.
In order to obtain high quality subspace clustering, we design a model of Agent-based Clustering on Subspaces (ACS). By simulating the actions and interactions of autonomous agents with a view to assessing their effects on the system as a whole, agent-based subspace clustering can produce far more complex and interesting clusterings. The clusters obtained can provide a natural description of the data. Agent-based subspace clustering is a powerful clustering modeling technique that can be applied to real business problems.
This paper is organized as follows. Section 2 gives the background and re-
lated work of subspace clustering. Section 3 presents our model of agent-based
subspace clustering. The experimental results and evaluation are given in Sec-
tion 4. An application on market manipulation is also provided in Section 4. We
conclude the research in Section 5.
Table 1. The example data set          Table 2. The data set after discretization (ξ = 3)

     S1  S2  S3  S4  S5  S6                 S1  S2  S3  S4  S5  S6
v1    1  14  23  12   4  21            v1    1   2   3   2   1   3
v2    2  12  22  13  13   4            v2    1   2   3   2   2   1
v3    1  23  23  11  12   2            v3    1   2   3   2   2   1
v4   25   4  14  13  11   2            v4    3   1   2   2   2   1
v5   23   2  12   1  23   2            v5    3   1   2   1   3   1
The clusters C with optimized M (C) will have a large data size and a large
dimensionality at the same time.
3.2 Agent-Based Subspace Clustering
In this section, we describe the design of our agent-based subspace clustering approach and explain how to implement the tasks of subspace clustering defined above.
Firstly, we briefly present the model of agent-based subspace clustering. In the agent-based model, there is a set of agents and each agent represents a data point. The agents move from one local environment to another local environment. We name these local environments bins. The movement of agents is guided by some heuristic rules. There is a global environment that determines when agents stop moving. In this way, an optimized clustering is obtained as a whole. To sum up, the complex subspace clustering is achieved by using the simple behaviors of agents under the guidance of heuristic rules [4,6].
The key components of agent-based subspace clustering are: agents, the local
environment bins and the global environment. In order to explain the detail of
the agent-based subspace clustering model, we take the data set in Table 1 as a
simple example.
– Let A = {a1 , a2 , . . . , am } represent agents. For example, agents
A = {a1 , a2 , a3 , a4 , a5 } represent data in Table 1.
– Let B = {B1, B2, . . . , Bn} be a set of bins. The bins B are the local environment of the agents, so each bin Bi (Bi ∈ B) contains a number of agents. We refer to Bi.agents as the agents contained by Bi. For each agent aj in Bi.agents, we say agent aj belongs to bin Bi. Bin Bi has a property Bi.dimensions, which denotes the subspace under which Bi.agents are similar to each other.
In this model, we choose CLIQUE as the method to generate the bins B. The first step of CLIQUE is to discretize the numeric values into intervals. Table 2 shows the agents after discretization with ξ = 3, which is the number of levels in each dimension. The intervals on different dimensions form units. CLIQUE first finds all dense units as the basic elements of clusters. Then the connected dense units are treated as final clusters. Figure 1 is an example of the result of CLIQUE with τ = 0.8 and ξ = 3 on the data in Table 1. There are two clusters, on subspaces S4 and S6 respectively.
However, this clustering is unable to satisfy the definition of hard clustering. In our model, the groups generated by CLIQUE are treated as bins B and used as an input to our model to generate higher quality clusters. The global environment is an important component of an agent-based model. We define an objective function as the global environment based on Equation (1). In our model, the local environment bins B are being optimized with the movement of agents A. When the objective function M(B) in Equation (2) reaches its maximal value, agents stop moving. The bins B are then fully optimized and are treated as the final clusters C.

$M(B) = \sum_{i} |B_i.dimensions| \times (|B_i.objects|)^2 \quad \forall B_i \in B \qquad (2)$
Some simple rules are defined to make sure that M(B) can be optimized by the movements of the agents A.
The movement of agents is a parallel, decentralized process. Initially, each agent $a_i$ ($a_i \in A$) randomly chooses a bin $B_j$ ($B_j \in B \wedge a_i \in B_j$) it belongs to as its local environment. In the next loop, agent $a_i$ randomly chooses another bin $B_k$ ($k \neq j \wedge B_k \in B$) as the destination of the movement. $\Delta M(B_j)$ and $\Delta M(B_k)$ measure the changes in $B_j$ and $B_k$ with respect to the global objective function M(B), and move($a_i$) in Equation (5) indicates the influence of the movement on M(B). If move($a_i$) is evaluated as positive, the agent will move from its bin $B_j$ to the destination $B_k$; otherwise, agent $a_i$ stays in $B_j$.

$\Delta M(B_j) = \big((|B_j.agents| - 1)^2 \times |B_j.dimensions|\big) - \big(|B_j.agents|^2 \times |B_j.dimensions|\big) \qquad (3)$
3.3 Algorithm
The algorithm is composed of the following three steps.
4 Experiments
4.1 Data and Evaluation Criteria
In the experiments, we compare our ACS algorithm with existing subspace clustering algorithms, including CLIQUE, DOC, FIRES, P3C, Schism, Subclu, MineClus, and PROCLUS [1]. All these algorithms are implemented in a Weka subspace clustering plugin tool [7]. Table 3 shows the datasets used in our experiments, which are public data sets from the UCI repository.
F1 measure and Entropy are chosen to evaluate the algorithms.
– F1 measure considers recall and precision [7]. For each cluster $T_i$ in the clustering T, there is a set of mapped found clusters mapped($T_i$). Let $V_{T_i}$ be the objects of the cluster $T_i$ and $V_{m(T_i)}$ the union of all objects from the clusters in mapped($T_i$). Recall and precision are formalized by:

$recall(T_i) = \frac{|V_{T_i} \cap V_{m(T_i)}|}{|V_{T_i}|}$   (6)

$precision(T_i) = \frac{|V_{T_i} \cap V_{m(T_i)}|}{|V_{m(T_i)}|}$   (7)

The harmonic mean of precision and recall is the F1 measure. A high F1 measure corresponds to a good cluster quality (a small sketch of this computation is given after this list).
a Breast Cancer Wisconsin (Prognostic). b Pima Indians Diabetes.
The overall quality of the clustering is obtained as the average over all clusters $C_j \in C$, weighted by the number of objects per cluster. By normalizing with the maximal entropy log(m) for m hidden clusters and taking the inverse, the range is between 0 (low quality) and 1 (perfect):

$1 - \frac{\sum_{j=1}^{k} |C_j| \cdot E(C_j)}{\log(m) \sum_{j=1}^{k} |C_j|}$   (9)
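A small sketch of the recall/precision/F1 computation in Eqs. (6)–(7) for a single hidden cluster, assuming mapped(T_i) has already been determined:

```python
def f1_for_cluster(cluster_objects, mapped_objects):
    """F1 measure of one hidden cluster (Eqs. 6-7 and their harmonic mean)."""
    cluster_objects, mapped_objects = set(cluster_objects), set(mapped_objects)
    overlap = len(cluster_objects & mapped_objects)
    recall = overlap / len(cluster_objects)
    precision = overlap / len(mapped_objects)
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

print(f1_for_cluster({1, 2, 3, 4}, {2, 3, 4, 5, 6}))   # recall 0.75, precision 0.6 -> ~0.667
```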
For a fair evaluation, we show the best results of all algorithms in the massive experiments with various parameter settings for each algorithm.
Figures 3-8 show the performance of the algorithms on the data sets breast, diabetes, glass, pendigits, liver and shape. From the figures, we can see that ACS performs better than the other subspace methods on both the F1 measure and Entropy. For the breast, glass and shape data, ACS has the best performance on F1 measure and Entropy. In particular, ACS has a much better F1 measure than the others. For the diabetes, pendigits and liver data, the performance of ACS ranks highly on F1 measure and entropy. In fact, ACS performs similarly to the first-ranked algorithms in each figure.
Figures 9 and 10 show the time consumed with respect to the dimensionality and data size. It is obvious that the time consumed by ACS is similar to those of MineClus, CLIQUE and Schism, while STATPC, DOC, FIRES, P3C and PROCLUS consume much more time than the first group. We can conclude that ACS is fast and scalable with respect to the number of dimensions and the data size.
ACS has two parameters: ξ and τ. Figure 11 shows how the run time changes with these parameters. From the figure, we can see that the run time decreases as ξ and τ increase.
References
1. Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., Park, J.S.: Fast algorithms for
projected clustering. In: SIGMOD 1999: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 61–72. ACM, New York (1999)
2. Aggarwal, R.K., Wu, G.: Stock market manipulations. Journal of Business 79(4),
1915–1954 (2006)
3. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clus-
tering of high dimensional data for data mining applications. SIGMOD Rec. 27(2),
94–105 (1998)
4. Cao, L.: In-depth behavior understanding and use: the behavior informatics ap-
proach. Information Science 180(17), 3067–3085 (2010)
5. Cheng, C.-H., Fu, A.W., Zhang, Y.: Entropy-based subspace clustering for mining
numerical data. In: KDD 1999: Proceedings of the Fifth ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining, pp. 84–93. ACM,
New York (1999)
6. Cao, L., Gorodetsky, V., Mitkas, P.A.: Agent mining: The synergy of agents and data mining. IEEE Intelligent Systems 24(3), 64–72 (2009)
7. Müller, E., Günnemann, S., Assent, I., Seidl, T.: Evaluating clustering in subspace
projections of high dimensional data. Proc. VLDB Endow. 2(1), 1270–1281 (2009)
8. Ogston, E., Overeinder, B., van Steen, M., Brazier, F.: A method for decentralized
clustering in large multi-agent systems. In: AAMAS 2003: Proceedings of the sec-
ond international joint conference on Autonomous agents and multiagent systems,
pp. 789–796. ACM, New York (2003)
9. Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.M.: A monte carlo algorithm
for fast projective clustering. In: SIGMOD 2002: Proceedings of the 2002 ACM
SIGMOD International Conference on Management of Data, pp. 418–427. ACM,
New York (2002)
10. Xu, X., Chen, L., He, P.: A novel ant clustering algorithm based on cellular au-
tomata. Web Intelli. and Agent Sys. 5(1), 1–14 (2007)
Evaluating Pattern Set Mining Strategies in a
Constraint Programming Framework
Abstract. The pattern mining community has shifted its attention from
local pattern mining to pattern set mining. The task of pattern set min-
ing is concerned with finding a set of patterns that satisfies a set of
constraints and often also scores best w.r.t. an optimisation criteria. Fur-
thermore, while in local pattern mining the constraints are imposed at
the level of individual patterns, in pattern set mining they are also con-
cerned with the overall set of patterns. A wide variety of different pattern
set mining techniques is available in literature. The key contribution of
this paper is that it studies, compares and evaluates such search strate-
gies for pattern set mining. The investigation employs concept-learning
as a benchmark for pattern set mining and employs a constraint pro-
gramming framework in which key components of pattern set mining are
formulated and implemented. The study leads to novel insights into the
strong and weak points of different pattern set mining strategies.
1 Introduction
In the pattern mining literature, the attention has shifted from local to global
pattern mining [1,10] or from individual patterns to pattern sets [5]. Local pattern
mining is traditionally formulated as the problem of computing Th(L, ϕ, D) =
{π ∈ L | ϕ(π, D) is true}, where D is a data set, L a language of patterns, and ϕ
a constraint or predicate that has to be satisfied. Local pattern mining does not
take into account the relationships between patterns; the constraints are evalu-
ated locally, that is, on every pattern individually, and if the constraints are not
restrictive enough, too many patterns are found. On the other hand, in global
pattern mining or pattern set mining, one is interested in finding a small set of rel-
evant and non-redundant patterns. Pattern set mining can be formulated as the
problem of computing Th(L, ϕ, ψ, D) = {Π ⊆ Th(L, ϕ, D) | ψ(Π, D) is true},
where ψ expresses constraints that have to be satisfied by the overall pattern
sets. In many cases a function f is used to evaluate pattern sets and one is then
only interested in finding the best pattern set Π, i.e. arg maxΠ∈Th(L,ϕ,ψ,D) f (Π).
Within the data mining and the machine learning literature numerous ap-
proaches exist that perform pattern set mining. These approaches employ a wide
variety of search strategies. In data mining, the step-wise strategy is common,
in which first all frequent patterns are computed; they are heuristically post-
processed to find a single compressed pattern set; examples are KRIMP [16]
and CBA [12]. In machine learning, the sequential covering strategy is popular,
which repeatedly and heuristically searches for a good pattern or rule and imme-
diately adds this pattern to the current pattern- (or rule-)set; examples are FOIL
[14] and CN2 [3]. Only a small number of techniques, such as [5,7,9], search for
pattern sets exhaustively, either in a step-wise or in a sequential covering setting.
The key contribution of this paper is that we study, evaluate and compare
these common search strategies for pattern set mining. As it is infeasible to
perform a detailed comparison on all pattern set mining tasks that have been
considered in the literature, we shall focus on one prototypical task for pattern
set mining: boolean concept-learning. In this task, the aim is to most accurately describe a concept for which positive and negative examples are given. Within this paper we choose to fix the optimisation measure to accuracy; our focus is on the exploration of a wide variety of search strategies for this measure, from greedy to complete and from step-wise to one-step approaches.
To be able to obtain a fair and detailed comparison we choose to reformulate
the different strategies within the common framework of constraint program-
ming. This choice is motivated by [4,13], who have shown that constraint pro-
gramming is a very flexible and usable approach for tackling a wide variety of
local pattern mining tasks (such as closed frequent itemset mining and discrim-
inative or correlated itemset mining), and recent work [9,7] that has lifted these
techniques to finding k-pattern sets under constraints (sets containing exactly
k patterns). In [7], a global optimization approach to mining pattern sets has
been developed and has been shown to work for concept-learning, rule-learning,
redescription mining, conceptual clustering as well as tiling. In the present work,
we employ this constraint programming framework to compare different search
strategies for pattern set mining, focusing on one mining task in more detail.
This paper is organized as follows: in Section 2, we introduce the problem of
pattern set mining and its benchmark, concept-learning; in Section 3, we formu-
late these problems in the framework of constraint programming and introduce
various search strategies for pattern set mining; in Section 4, we report on ex-
periments, and finally, in Section 5, we conclude.
This framework has been shown 1) to allow for the use of a wide range of con-
straints, 2) to work for both frequent and discriminative pattern mining [13], and
3) to be extendible towards the formulation of k pattern set mining, cf. [7,9].
These other papers provide detailed descriptions of the underlying constraint
programming algorithms and technology, including an analysis of the way in
which they explore the search tree and a performance analysis. On the other
hand, in the present paper – due to space restrictions – we need to focus on
the declarative specification of the constraint programming problems; we refer
to [4,13,7] for more details on the search strategy of such systems.
Step 1: Local Pattern Mining. Using the above notation one can formulate
many local pattern mining problems, such as frequent and discriminative pattern
mining. Indeed, consider the following constraints, introduced in [4,13]:
∀t ∈ T : T_t = 1 ↔ Σ_{i∈I} I_i (1 − D_{ti}) = 0.     (Coverage)

∀i ∈ I : I_i = 1 ↔ Σ_{t∈T} T_t (1 − D_{ti}) = 0.     (Closedness)

∀i ∈ I : I_i = 1 → Σ_{t∈T} T_t D_{ti} ≥ θ.     (Min. frequency)

∀i ∈ I : I_i = 1 → accuracy(Σ_{t∈T+} T_t D_{ti}, Σ_{t∈T−} T_t D_{ti}) ≥ θ.     (Min. accuracy)
In these constraints, the coverage constraint links the items to the transactions:
it states that the transaction set T must be identical to the set of all transactions
that are covered by the itemset I. The closedness constraint removes redundancy
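For intuition, the sketch below checks these local-pattern constraints for one candidate itemset over a 0/1 transaction-item matrix. It is only an illustrative verification in Python/NumPy, not the constraint programming model of [4,13]; the function and variable names are ours.

```python
import numpy as np

def check_itemset(D, items, theta, labels=None):
    """Check the local-pattern constraints above for one candidate itemset.

    D      : 0/1 transaction-item matrix (rows = transactions, columns = items)
    items  : column indices of the candidate itemset I
    theta  : minimum frequency threshold (absolute count)
    labels : optional 0/1 class labels (for the positive/negative supports)
    """
    # Coverage: T_t = 1 iff transaction t contains every item of the itemset
    covered = D[:, items].all(axis=1)

    # Closedness: no item outside the itemset occurs in all covered transactions
    common = D[covered].all(axis=0) if covered.any() else np.zeros(D.shape[1], bool)
    closed = set(np.flatnonzero(common)) == set(items)

    result = {"support": int(covered.sum()),
              "frequent": covered.sum() >= theta,
              "closed": closed}
    if labels is not None:
        result["pos_support"] = int((covered & (labels == 1)).sum())
        result["neg_support"] = int((covered & (labels == 0)).sum())
    return result
```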
Step 2: Post-processing the Local Patterns. Once the local patterns have
been computed, the two step approach post-processes them in order to arrive at
the pattern set. We describe the two main approaches for this.
∀t ∈ T : T_t = 1 ↔ Σ_{r∈P} P_r M_{tr} ≥ 1.     (Disj. Coverage)

∀r ∈ P : P_r = 1 → accuracy(Σ_{t∈T+} L_{tr}, Σ_{t∈T−} L_{tr}) ≥ θ     (Min. Accuracy)

Σ_{r∈P} P_r = k     (Set Size)
Each pattern has to cover the transactions (Eq. 2) and be closed (Eq. 3). The
canonical form constraint in Eq. 4 enforces a fixed lexicographic ordering on the
itemsets, thereby avoiding finding equivalent but differently ordered pattern sets.
In Eq. 5, the variables Bt are auxiliary variables representing whether transaction
t is covered by at least one pattern, corresponding to a disjunctive coverage.
The one-step global optimization approaches to pattern set mining are less
common; the authors are only aware of [7,9]. One could argue that some iterative
pattern mining strategies will find pattern sets that are optimal under certain
conditions. For instance, Tree2 [2] can find a pattern set with minimal error on
supervised training data; however, it neither provides guarantees on the size of
the final pattern set nor provides guarantees under additional constraints.
4 Experiments
To measure the quality of a pattern set, we evaluate its accuracy on the dataset.
This is an appropriate means of evaluation, as in the boolean concept learning
task we consider, the goal is to find a concise description of the training data,
rather than a hypothesis that generalizes to an underlying distribution.
The experiments were performed using the Gecode-based system proposed
by [4] and performed on PCs running Ubuntu 8.04 with Intel(R) Core(TM)2
Quad CPU Q9550 processors and 4GB of RAM. The datasets were taken from
the website accompanying this system1 . The datasets were derived from the UCI
1 http://dtai.cs.kuleuven.be/CP4IM/datasets/
Table 1. Data properties and number of patterns found for different constraints and
thresholds. 25M+ denotes that more than 25 million patterns were found.
Mushroom Vote Hepatitis German-credit Austr.-credit Kr-vs-kp
Transactions 8124 435 137 1000 653 3196
Items 119 48 68 112 125 73
Class distr. 52% 61% 81% 70% 55% 52%
Total patterns 221524 227032 3788342 25M+ 25M+ 25M+
Pattern poor/rich poor poor poor rich rich rich
frequency ≥ 0.7 12 1 137 132 274 23992
frequency ≥ 0.5 44 13 3351 2031 8237 369415
frequency ≥ 0.3 293 627 93397 34883 257960 25M+
frequency ≥ 0.1 3287 35771 1827264 2080153 24208803 25M+
accuracy ≥ 0.7 197 193 361 2 11009 52573
accuracy ≥ 0.6 757 1509 3459 262 492337 2261427
accuracy ≥ 0.5 11673 9848 31581 6894 25M+ 25M+
accuracy ≥ 0.4 221036 105579 221714 228975 25M+ 25M+
The result of a two-step approach obviously depends on the quality of the pat-
terns found in the first step. We start by investigating the feasibility of this first
step, and then study the two-step methods as a whole.
Fig. 1. Quality & runtime for approx. methods, pattern poor hepatitis dataset. In the
left figure, algorithms with identical outcome are grouped together.
Fig. 2. Quality & runtime for approx. methods, pattern rich australian-credit dataset.
Table 2. Largest K (up to 6) and time to find it for the 2-step complete search method.
- indicates that step 1 was aborted because more than 25 million patterns were found, –
indicates that step 2 did not manage to finish within the timeout of 6 hours. * indicates
that no other method found a better pattern set.
Table 3. Largest K for which the optimal solution was found within 6 hours
Mushroom Vote Hepatitis German-credit Australian-credit Kr-vs-kp
K=2 K=4 K=3 K=2 K=2 K=3
Fig. 3. Quality & runtime for 1-step methods, german-credit dataset. In the left figure,
algorithms with identical outcome are grouped together.
In this section we compare the different one-step approaches, which need no local
pattern constraints or thresholds. We investigate how feasible the one-step
exact approach is, as well as how close the greedy sequential covering method
brings us to this optimal solution, and whether beam search can close the gap
between the two.
When comparing the two-step sequential covering approach with the one-step
approach, we already remarked that the latter is very efficient, though it might
not find the optimal solution. The one-step exact method is guaranteed to find
the optimal solution, but has a much higher computational cost. Table 3 below
shows up to which K the exact method was able to find the optimal solution
within the 6 hours time out. Comparing these results to the two-step exact
approach in Table 2, we see that pattern sets can be found without constraints,
where the two-step approach failed even with constraints.
With respect to Q1 we observed that only for the kr-vs-kp dataset the greedy
method, and hence all beam searches with a larger beam, found the same pattern
sets as the exact method. For the mushroom and vote dataset, starting from
beam width 5, the optimal pattern set was found. For the german-credit and
australian-credit, a beam width of size 15 was necessary. The hepatitis dataset
was the only dataset for which the complete method was able to find a better
pattern set, in this case for K=3, within the timeout of 6 hours.
Figure 3 shows a representative figure, in this case for the german-credit
dataset: while the greedy method is not capable of finding the optimal pat-
tern set, larger beams successfully find the optimum. For K=6, beam sizes of 15
or 20 lead to a better pattern set than when using a lower beam size. The exact
method stands out as being the most time consuming. For beam search methods,
larger beams clearly lead to larger runtimes. The runtime only increases slightly
for increasing sizes of K because the beam search is used in a sequential covering
loop that shrinks the dataset at each iteration.
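As an illustration of the loop just described, the following sketch shows sequential covering with a pluggable per-pattern search; find_pattern is a hypothetical placeholder standing in for the greedy or beam search over a single pattern, and the dataset shrinks by dropping the positive transactions each new pattern covers.

```python
def sequential_covering(transactions, labels, find_pattern, k):
    """Build a k-pattern set by repeatedly searching for one pattern and then
    shrinking the working dataset, i.e. dropping the positives it covers."""
    pattern_set = []
    working = list(zip(transactions, labels))
    for _ in range(k):
        if not working:
            break
        pattern, covers = find_pattern(working)  # covers: predicate over a transaction
        pattern_set.append(pattern)
        working = [(t, y) for t, y in working if not (y == 1 and covers(t))]
    return pattern_set
```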
5 Conclusions
We compared several methods for finding pattern sets within a common con-
straint programming framework, where we focused on boolean concept learning
as a benchmark. We distinguished one-step from two-step approaches, as well
as exact from approximate ones. Each method has its strong and weak points,
but the one step approximate approaches, which iteratively mine for patterns,
provided the best trade-off between runtime and accuracy and do not depend
on a threshold; additionally, they can easily be improved using a beam search.
The exact approaches, perhaps unsurprisingly, do not scale well to larger and
pattern-rich datasets. A newly introduced approach for one-step exact pattern
set mining, however, has optimality guarantees and performs better than previously
used two-step exact approaches. In future work our study can be extended
to consider other problem settings in pattern set mining, as well as other heuris-
tics and evaluation metrics; furthermore, even though we cast all settings in one
implementation framework in this paper, a more elaborate study could clarify
how this approach compares to the pattern set mining systems in the literature.
References
1. Bringmann, B., Nijssen, S., Tatti, N., Vreeken, J., Zimmermann, A.: Mining sets
of patterns. In: Tutorial at ECMLPKDD 2010 (2010)
2. Bringmann, B., Zimmermann, A.: Tree2 - decision trees for tree structured data.
In: Jorge, A., Torgo, L., Brazdil, P., Camacho, R., Gama, J. (eds.) PKDD 2005.
LNCS (LNAI), vol. 3721, pp. 46–58. Springer, Heidelberg (2005)
3. Clark, P., Niblett, T.: The CN2 induction algorithm. Machine Learning 3, 261–283
(1989)
4. De Raedt, L., Guns, T., Nijssen, S.: Constraint programming for itemset mining.
In: KDD, pp. 204–212. ACM, New York (2008)
5. De Raedt, L., Zimmermann, A.: Constraint-based pattern set mining. In: SDM.
SIAM, Philadelphia (2007)
6. Frank, A., Asuncion, A.: UCI machine learning repository (2010),
http://archive.ics.uci.edu/ml
7. Guns, T., Nijssen, S., De Raedt, L.: k-Pattern set mining under constraints. CW
Reports CW596, Department of Computer Science, K.U.Leuven (October 2010),
https://lirias.kuleuven.be/handle/123456789/278655
8. Kearns, M.J., Vazirani, U.V.: An introduction to computational learning theory.
MIT Press, Cambridge (1994)
9. Khiari, M., Boizumault, P., Crémilleux, B.: Constraint programming for mining n-
ary patterns. In: Cohen, D. (ed.) CP 2010. LNCS, vol. 6308, pp. 552–567. Springer,
Heidelberg (2010)
10. Knobbe, A., Crémilleux, B., Fürnkranz, J., Scholz, M.: From local patterns to
global models: The lego approach to data mining. In: Fürnkranz, J., Knobbe, A.
(eds.) Proceedings of LeGo 2008, an ECMLPKDD 2008 Workshop (2008)
11. Knobbe, A.J., Ho, E.K.Y.: Pattern teams. In: Fürnkranz, J., Scheffer, T.,
Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 577–584.
Springer, Heidelberg (2006)
12. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining.
In: KDD, pp. 80–86 (1998)
13. Nijssen, S., Guns, T., De Raedt, L.: Correlated itemset mining in ROC space: a
constraint programming approach. In: KDD, pp. 647–656. ACM, New York (2009)
14. Quinlan, J.R.: Learning logical definitions from relations. Machine Learning 5,
239–266 (1990)
15. Rückert, U., De Raedt, L.: An experimental evaluation of simplicity in rule learning.
Artif. Intell. 172(1), 19–28 (2008)
16. Siebes, A., Vreeken, J., van Leeuwen, M.: Item sets that compress. In: Ghosh,
J., Lambert, D., Skillicorn, D.B., Srivastava, J. (eds.) SDM, pp. 395–406. SIAM,
Philadelphia (2006)
Asking Generalized Queries with Minimum Cost
1 Introduction
Active learning, as an effective learning paradigm to reduce the labeling cost in
supervised settings, has been intensively studied in recent years. In most tradi-
tional active learning studies, the learner usually regards the specific examples
directly as queries, and requests the corresponding labels from the oracle. For
instance, given a diabetes patient dataset, the learner usually presents the en-
tire patient example, such as [ID = 7354288, name = John, age = 65, gender =
male, weight = 230, blood−type = AB, blood−pressure = 160/90, temperature
= 98, · · · ] (with all the features), to the oracle, and requests the corresponding
label whether this patient has diabetes or not. However, in this case, many fea-
tures (such as ID, name, blood-type, and so on) might be irrelevant to diabetes
diagnosis. Not only could queries like this confuse the oracle, but each answer
returned by the oracle is also applicable to only one specific example.
In many real-world active learning applications, the oracles are often human
experts, thus they are usually capable of answering more general queries. For
instance, given the same diabetes patient dataset, the learner could ask a gen-
eralized query, such as “are men over age 60, weighted between 220 and 240
pounds, likely to have diabetes?”, where only three relevant features (gender,
age and weight) are provided. Such a generalized query can often represent a set
of specific examples, thus the answer to the query is also applicable to all these
examples. For instance, the answer to the above generalized query is applicable
to all men over age 60 weighing between 220 and 240 pounds. This allows
the active learner to improve learning more effectively and efficiently.
However, although the oracles are indeed capable of answering such general-
ized queries in many applications, the cost (effort) is often higher. For instance,
it is relatively easy (i.e., with low cost) to diagnose whether one specific patient
has diabetes or not, with all necessary information provided. However, it is of-
ten more difficult (i.e., with higher cost) to provide accurate diabetes diagnoses
(accurate probabilities) for all men over age 60 weighing between 220 and
240 pounds. In real-world situations, more domain expertise is usually required
for the oracles to answer such generalized queries well, thus the cost for asking
generalized queries is often higher. Consequently, this yields a trade-off
in active learning: on one hand, asking generalized queries can speed up the
learning, but usually at high cost; on the other hand, asking specific queries is
much cheaper (with low cost), but the learning process might be slowed down.
In this paper, we apply a cost-sensitive framework to study generalized queries
in active learning. More specifically, we assume that the querying cost is known
to be non-uniform, and ask generalized queries in the following two scenarios:
2 Related Work
All of the active learning studies make assumptions. Specifically, most of the
previous works assume that the oracles can only answer specific queries, and the
costs for asking these queries are uniform. Thus, most active learning algorithms
1 In this paper, we only consider that both the querying cost and the misclassification cost are on the same scale. Extra normalization might be required otherwise.
– Step 1: Based on the current training and unlabeled datasets, the learner
constructs a generalized query according to certain objective function.
– Step 2: After obtaining the answer of the generalized query, the learner
updates the training dataset, and updates the learning model accordingly.
We can see from Equation 1 that estimating ΔAcc_t(q)/C_Q(q) is required to
evaluate the candidate query q. As we assume that the querying cost C_Q(q) is
known, we only need to estimate separately the accuracies before and after asking
q (i.e., Acc_{t−1} and Acc_t(q)).
Estimating Acc_{t−1} is rather easy. We simply apply cross-validation or leave-
one-out to the current training data, and obtain the desired average accuracy.
However, estimating Acc_t(q) is more difficult. Note that, if we knew the answer
to q, the training data could be updated using exactly the same strategy
we will describe in Section 3.2 (Updating Learning Model), and Acc_t(q) could
thus easily be estimated on the updated training data. However, the answer to
q is still unknown at the current stage, thus here we apply a simple strategy to
optimistically estimate this answer, and then evaluate q accordingly.
Specifically, we first assume that the label of q is certainly 1. Thus, we update
the training data (using the same method as in Section 3.2), and estimate Acc_t(q)
accordingly. Then, we assume that the label of q is certainly 0, and again update
the training data and estimate Acc_t(q) in the same way. We compare these two
estimates of Acc_t(q), and optimistically choose the better (higher) one as the
final estimate.
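A minimal sketch of this optimistic estimate; update_training and estimate_accuracy are hypothetical placeholders standing in for the training-data update of Section 3.2 and the cross-validation estimate described above.

```python
def optimistic_acc(train_data, query, update_training, estimate_accuracy):
    """Optimistically estimate Acc_t(q) before asking the oracle: try both
    possible answers for q and keep the better resulting accuracy."""
    acc_if_positive = estimate_accuracy(update_training(train_data, query, 1))
    acc_if_negative = estimate_accuracy(update_training(train_data, query, 0))
    return max(acc_if_positive, acc_if_negative)
```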
We assume that the rough size of such "to-be-predicted" data is known in this paper,
for the following reason. In reality, the quantity of such "to-be-predicted"
data directly affects the amount of resources (effort, cost) that should be spent
on constructing the learning model. For instance, if the model will be used
only a few times and on only limited, unimportant data, it might not be worth
spending many resources on model construction; on the other hand, if the model is
expected to be used extensively on a large amount of important data, it would
be even more beneficial to improve the model performance by spending more
resources. In many such real-world situations, in order to determine how much
resource should be spent on constructing the model, it is indeed known (or can
be estimated) how extensively the model will be used in the future (i.e.,
the rough quantity of the to-be-predicted data).
It is exactly the same in our current scenario of generalized queries. More
specifically, if the current learning model will only "play a small role" (i.e., make
predictions on only a few examples) in the future, it may not be worth paying a high
querying cost to construct a high-performance model. On the other hand, if a
large number of examples need to be predicted, it would indeed be worthwhile
to acquire more generalized queries (at the expense of high querying cost), such
that an accurate model with low misclassification cost can be constructed.
This indicates that the number of "to-be-predicted" examples is crucial in
minimizing total cost. Therefore, we formalize the total cost after t iterations
(denoted by C_T^t) in Equation 2, where C_Q^i denotes the querying cost in the i-th
iteration, and C_M^t denotes the misclassification cost after t iterations, which can
further be calculated as the product of the average misclassification cost4 after t
iterations (denoted by AvgC_M^t) and the number of future predicted examples
(denoted by n).

C_T^t = Σ_{i=1}^{t} C_Q^i + C_M^t = Σ_{i=1}^{t} C_Q^i + AvgC_M^t × n     (2)
To obtain the minimum total cost for the learning model, we greedily choose
the query that maximally reduces the total cost in each learning iteration.
More formally, Equation 3 shows the objective function for searching the query in
iteration t, where all notations remain the same as above.
In the current setting, we assume that C_Q^t and n are both known, thus we need
to estimate AvgC_M^{t−1} and AvgC_M^t(q) separately, according to Equation 3. We again
adopt a similar strategy as in the previous subsection. Specifically, AvgC_M^{t−1}

4 Average misclassification cost represents the misclassification cost averaged over each tested example.
Searching Strategy. Given the above two objective functions for two scenarios,
the learner is required to search the query space and find the optimal one in each
iteration.
In most traditional active learning studies, each unlabeled example is directly
regarded as a candidate query. Thus, in each iteration, the query space simply
contains all the current unlabeled examples, and exhaustive search is usually ap-
plied directly. However, when asking generalized queries, each unlabeled example
can generate a set of candidate generalized queries, due to the existence of the
don't-care features. For instance, given a specific example with d features, there
exist (d choose 1) generalized queries with one don't-care feature, (d choose 2)
generalized queries with two don't-care features, and so on. Thus, altogether 2^d
corresponding generalized queries can be constructed from each specific example.
Therefore, given an unlabeled dataset with l examples, the entire query space has
size 2^d · l. This query space is thus quite large (it grows exponentially with the
feature dimension), and it is unrealistic to exhaustively evaluate every candidate.
Instead, we apply greedy search to find the optimal query in each iteration.
Specifically, for each unlabeled example (with d features), we first construct
all the generalized queries with only one don't-care feature (i.e., (d choose 1) = d
queries), and choose the best as the current candidate. Then, based only on this
candidate, we continue to construct all the generalized queries with two don't-care
features (i.e., (d−1 choose 1) = d − 1 queries), and again only keep the best.
The process repeats to
greedily increase the number of don’t-care features in the query, until no better
query can be generated. The last generalized query thus is regarded as the best
for the current unlabeled example. We conduct the same procedure on all the
unlabeled examples, thus we can find the optimal generalized query based on
the whole unlabeled set.
With such a greedy search strategy, the computational complexity of the search is
thus O(d²) with respect to the feature dimension d. This indicates an exponential
improvement over the complexity Θ(2^d) of the original exhaustive search. Note
that such local greedy search cannot guarantee finding the true optimal
generalized query in the entire query space, but the empirical study (see
Section 4) will show it still works effectively in most cases.
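The following sketch illustrates the greedy construction of a generalized query from a single unlabeled example; evaluate_query is a hypothetical stand-in for the objective functions of the two scenarios, and DONT_CARE is an illustrative marker for a don't-care feature.

```python
DONT_CARE = None  # illustrative marker for a don't-care feature

def greedy_generalized_query(example, evaluate_query):
    """Greedily add don't-care features one level at a time, keeping only the
    best query at each level, until no better query can be generated."""
    best_query, best_score = list(example), evaluate_query(example)
    while True:
        # all queries with exactly one more don't-care than the current best
        candidates = [best_query[:i] + [DONT_CARE] + best_query[i + 1:]
                      for i, v in enumerate(best_query) if v is not DONT_CARE]
        if not candidates:
            break
        scores = [evaluate_query(c) for c in candidates]
        best_idx = max(range(len(candidates)), key=lambda j: scores[j])
        if scores[best_idx] > best_score:
            best_query, best_score = candidates[best_idx], scores[best_idx]
        else:
            break  # no improvement: stop and return the last best query
    return best_query, best_score
```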
4 Empirical Study
In this section, we empirically study the performance of the proposed algorithms
on 15 real-world datasets from the UCI Machine Learning Repository [1], and
compare them with the existing active learning algorithms.
All of the 15 UCI datasets have binary class and no missing values. Infor-
mation on these datasets is tabulated in Table 2. Each whole dataset is first
split randomly into three disjoint subsets: the training set, the unlabeled set,
and the test set. The test set is always 25% of the whole dataset. To make sure
that active learning can possibly show improvement when the unlabeled data
are labeled and included into the training set, we choose a small training set
for each dataset such that the “maximum reduction” of the error rate5 is large
enough (greater than 10%). The training sizes of the 15 UCI datasets range from
1/200 to 1/5 of the whole datasets, also listed in Table 2. The unlabeled set is
the whole dataset taking away the test set and the training set.
In our experiments, we set the querying cost (C_Q) for any specific query to
1, and study the following three cost settings for generalized queries with r
don't-care features, as used in the comparison tables below: C = 1 + 0.5 × r,
C = 1 + 0.05 × r, and C = 1 + 0.5 × r².
5 The "maximum reduction" of the error rate is the error rate on the initial training set R alone (without any benefit of the unlabeled examples) minus the error rate on R plus all the unlabeled data in U with correct labels. The "maximum reduction" roughly reflects the upper bound on the error reduction that active learning can achieve.
Note that these settings of querying cost are only used here for the empirical
study; any other type of querying cost could easily be applied without changing
the algorithms.
Since for all 15 UCI datasets we have neither true target functions nor human
oracles to answer the generalized queries, we simulate the target functions by
constructing learning models on the entire datasets in the experiments. The
simulated target function regards each generalized query as a specific example
with missing values, and provides the posterior class probability as the answer
to the learner. The experiment is repeated 10 times on each dataset (i.e., each
dataset is randomly split 10 times), and the experimental results are recorded.
Fig. 1. Comparison between “AGQ-QC”, “AGQ” and “Pool” on a typical UCI data
“breast-cancer”, for balancing acc./cost trade-off
AGQ-QC   C = 1 + 0.5 × r   C = 1 + 0.05 × r   C = 1 + 0.5 × r²
Pool     6/7/2             10/4/1             5/6/4
AGQ      14/0/1            6/7/2              15/0/0
Fig. 2. Comparison between “AGQ-QC”, “AGQ” and “Pool” on a typical UCI data
“breast-cancer”, for minimizing total cost
AGQ-TC   C = 1 + 0.5 × r   C = 1 + 0.05 × r   C = 1 + 0.5 × r²
Pool     6/7/2             10/4/1             6/6/3
AGQ      15/0/0            6/6/3              15/0/0
We can clearly see from these figures that, when only approximate probabilistic
answers are provided by the oracle, the performance of the proposed
algorithms is not significantly affected. Similar experimental results can
be shown with other settings and on other datasets. This indicates that the
proposed algorithms are rather robust to such more realistic approximate
probabilistic answers, and thus can be directly deployed in real-world applications.
5 Conclusion
In this paper, we assume that the oracles are capable of answering generalized
queries with non-uniform costs, and study active learning with generalized
queries in a cost-sensitive framework. In particular, we design two objective
functions to choose generalized queries in the learning process, so as to either
balance the accuracy/cost trade-off or minimize the total cost of misclassification
and querying. The empirical study verifies the superiority of the proposed methods
over existing active learning algorithms.
References
1. Asuncion, A., Newman, D.J.: UCI machine learning repository (2007)
2. Baram, Y., El-Yaniv, R., Luz, K.: Online choice of active learning algorithms.
Journal of Machine Learning Research 5, 255–291 (2004)
3. Cohn, D.A., Ghahramani, Z., Jordan, M.I.: Active learning with statistical models.
Journal of Artificial Intelligence Research 4, 129–145 (1996)
4. Du, J., Ling, C.X.: Active learning with generalized queries. In: Proceedings of the
9th IEEE International Conference on Data Mining, pp. 120–128 (2009)
5. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The
weka data mining software: an update. SIGKDD Explorations 11(1), 10–18 (2009)
6. Kapoor, A., Horvitz, E., Basu, S.: Selective supervision: Guiding supervised learn-
ing with decision-theoretic active learning. In: Proceedings of International Joint
Conference on Artificial Intelligence (IJCAI), pp. 877–882 (2007)
7. Lewis, D.D., Catlett, J.: Heterogeneous uncertainty sampling for supervised learn-
ing. In: Proceedings of ICML 1994, 11th International Conference on Machine
Learning, pp. 148–156 (1994)
8. Margineantu, D.D.: Active cost-sensitive learning. In: Nineteenth International
Joint Conference on Artificial Intelligence (2005)
9. Roy, N., Mccallum, A.: Toward optimal active learning through sampling estima-
tion of error reduction. In: Proc. 18th International Conf. on Machine Learning,
pp. 441–448 (2001)
10. Settles, B., Craven, M., Friedland, L.: Active learning with real annotation costs.
In: Proceedings of the NIPS Workshop on Cost-Sensitive Learning (2008)
11. Seung, H.S., Opper, M., Sompolinsky, H.: Query by committee. In: Proceedings
of the Fifth Annual Workshop on Computational Learning Theory, pp. 287–294
(1992)
12. Tong, S., Koller, D.: Support vector machine active learning with applications to
text classification. Journal of Machine Learning Research 2, 45–66 (2002)
Ranking Individuals and Groups by Influence
Propagation
Pei Li1, Jeffrey Xu Yu2, Hongyan Liu3, Jun He1, and Xiaoyong Du1
1 Renmin University of China, Beijing, China
{lp,hejun,duyong}@ruc.edu.cn
2 The Chinese University of Hong Kong, Hong Kong, China
yu@se.cuhk.edu.hk
3 Tsinghua University, Beijing, China
hyliu@tsinghua.edu.cn
1 Introduction
Ranking the centrality (or importance) of nodes within a graph is a fundamental
problem in network analysis. Recently, online social networking sites, such as
Facebook and MySpace, provide users with a platform to connect with each other.
Learning and mining on these large-scale social networks attracts the attention of
many researchers [1]. In retrospect, Freeman [2] reviewed and
evaluated the methods about centrality measures, and categorized them into
three conceptual foundations: degree, betweenness, and closeness. Accompanied
with eigenvector centrality (EVC) proposed by Bonacich [3], these four measures
dominate the empirical usage. The first three methods measure the centrality by
simply calculating the edge degree or the mean or fraction of geodesic paths [4],
and treat every node equally. In this paper, we focus on EVC, which ranks the
centrality of a node v by considering the centrality of nodes that surround v.
In the literature, most link analysis approaches focus on link structures
and ignore the intrinsic characteristics of nodes in a graph. However, in many
networks, nodes also contain important information, such as the page content in
a web graph. Simply overlooking this predefined importance may facilitate the
usage of link spam. We believe the intrinsic characteristics of nodes also affect
link-based ranking significantly.

Table 1. Notations
The main contributions of this work are summarized below. First, we discuss
the problems with the current EVC approaches, for example, PageRank, which
ignores the intrinsic impacts of nodes on the ranking. Second, we propose a new
Influence Propagation model, called IP model, which propagates the user-defined
importance over nodes in a graph by random walking. We allow users to specify
decay functions to control how the influence propagates over nodes. It is worth
noting that most EVC approaches only use an exponential function, which is
not appropriate in many cases, as we will address later. Third, we give algorithms
to rank an individual node and all nodes in a graph efficiently. Fourth, we discuss
how to rank a group (a set of nodes) regarding the centrality using both inner
and outer structural information.
The remainder of the paper is organized as follows. Section 2 gives the motivation
of our work. Section 3 discusses our new influence propagation model
and the ranking algorithms for individual nodes and groups. We conducted exten-
sive performance studies and report our findings in Section 4. The related work
is given in Section 5 and we conclude in Section 6. The notations used in this
paper are summarized in Table 1.
2 The Motivation
In this section, first, we discuss our motivation to propose a new influence model,
and explain why PageRank is not applicable in some cases. Second, we give our
intuitions on how to rank the centrality for a set of nodes.
Why Not PageRank: As a typical variant of EVC [3], PageRank [5] models the
behavior of a random surfer, who clicks some hyperlink in the current page with
probability c, and periodically jumps to a random page (because he or she "gets
bored") with probability (1 − c). Let T be the transition matrix of a directed graph.
For the p-th row and q-th column element of T, T_{p,q} = 0 if (p, q) ∉ E, and
T_{p,q} = w(p, q) / Σ_{i∈O(p)} w(p, i) otherwise, where w(p, q) is the weight of edge
(p, q). The matrix form of PageRank can be written as

R = cRT + (1 − c)U     (1)

Fig. 1. (a) A simple directed network in which every node has a predefined importance. (b) PageRank scores and normalized IPRank scores corresponding to different predefined importance Z. The decay function is set to f(k) = 0.8^k.
Here, U corresponds to the distribution vector of web pages that a random
surfer periodically jumps to; the entries of U sum to 1. Based on Eq. (1), PageRank
scores can be computed iteratively by R_k = cR_{k−1}T + (1 − c)U. The solution R
is a steady probability distribution (its entries sum to 1), determined by T and U only.
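For concreteness, a minimal power-iteration sketch of Eq. (1), assuming T is row-stochastic, U is a probability row vector, and the damping factor and tolerance are illustrative choices.

```python
import numpy as np

def pagerank(T, U, c=0.85, tol=1e-10, max_iter=1000):
    """Iterate R_k = c * R_{k-1} T + (1 - c) U until R stabilizes (Eq. 1).

    T : row-stochastic transition matrix (n x n)
    U : jump distribution, a probability row vector of length n
    """
    R = U.copy()                                  # any probability vector works as a start
    for _ in range(max_iter):
        R_next = c * (R @ T) + (1 - c) * U
        if np.abs(R_next - R).sum() < tol:
            break
        R = R_next
    return R_next

# Tiny usage example on a 3-node cycle a -> b -> c -> a
T = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
U = np.full(3, 1 / 3)
print(pagerank(T, U))
```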
It is important to note that the initial importance R_0 of all nodes is ignored in
PageRank (refer to Eq. (1)). In other words, R_0 is not propagated in PageRank.
As shown in Fig. 1, for the graph in Fig. 1(a), the PageRank scores for
a, b, c, d, and e are 0.149, 0.1, 0.223, 0.243, and 0.174, respectively, regardless
of any given initial importance R_0. However, in many real applications, the initial
importance R_0 plays a significant role and greatly influences the resulting R.
In addition, simply applying the PageRank to measure centrality in general
may result in unexpected results, because PageRank is originally designed to
bring order to the web graph. For example, to model the “word-of-mouth” effect
in social networks [6] where people are likely to be influenced by their friends,
the behavior of “random jumping” used in PageRank is not reasonable, since
the influence only occurs between two directly connected persons.
Motivated by the need to propagate the initial predefined importance of nodes and
by the limitations of random jumping, we claim that PageRank is not applicable for
link-based ranking in all possible cases. In this paper, we propose a more general and customizable
model for link-based ranking. We propose a new Influence Propagation (IP)
model and IPRank to rank nodes and groups, based on their structural contexts
in the graph and predefined importance.
Fig. 2. Three groups with the same degrees connected to outside nodes. ((a) and (b)
are altered from Fig. 4.2.1 in [7].)
members of this group. More explicitly, let C be a group and N(C) be the set of all
nodes that are not in C but are neighbors of a member of C. [7] normalizes and
computes group degree centrality as |N(C)| / (|V| − |C|), where |V| is the number
of nodes in the graph. Clearly, this method measures group centrality from the view
of the nodes outside the group. However, given two large groups A and B where
|A| > |B|, |N(A)| > |N(B)| is more likely to hold and |V| − |A| < |V| − |B| holds,
making it easier for larger groups to obtain a higher degree centrality. Moreover, this
method ignores the centrality scores of nodes in groups.
In this work, we investigate how to combine the inner and outer structural
context of a specific group. Some intuitions are given below. Consider Fig. 2.
First, regarding the outer structural context, Group 2 should have a higher
score than Group 1, because Group 2 has a larger span of neighbors. This
intuition is drawn from real-world networks such as friendship networks, where
a group with more contacts outside the group has a higher ranking. Second,
regarding the inner structure of a group, both Group 2 and Group 3 have the
same outside neighbors, but the inner structure of Group 3 is more compact and
cohesive, so Group 3 has a higher score than Group 2.
Proposition 3.1: For a random path p = ⟨v_0, v_1, ..., v_k⟩ that starts at time 0,
the influence Z(0) propagating from v_0 to v_k is

Z(k) = Z(0) · f(k) · ∏_{i=0}^{k−1} T_{i,i+1}     (3)

Proof Sketch: Let us analyze the case of one-step propagation. For an edge
⟨v_i, v_j⟩, the influence Z(j) propagating from v_i is (f(j)/f(i)) · Z(i) · T_{i,j}. Since
path p can be viewed as a sequence of one-step propagations, Eq. (3) holds. □
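A direct transcription of Proposition 3.1 as a sketch: the influence carried along a single random-walk path is the initial influence times the decay at the path length times the product of the transition probabilities along the path. The toy transition structure and the decay f(k) = 0.8^k are illustrative.

```python
def path_influence(z0, path, T, f):
    """Z(k) = Z(0) * f(k) * prod_{i=0}^{k-1} T[v_i][v_{i+1}]   (Eq. 3)."""
    prod = 1.0
    for i in range(len(path) - 1):
        prod *= T[path[i]][path[i + 1]]
    return z0 * f(len(path) - 1) * prod

# Example: path <0, 1, 2> with decay f(k) = 0.8^k
T = {0: {1: 0.5}, 1: {2: 1.0}}
print(path_influence(1.0, [0, 1, 2], T, lambda k: 0.8 ** k))  # 1.0 * 0.8**2 * 0.5 = 0.32
```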
Algorithm 1. IPRank-One(G, v, Z, T , K)
Input: graph G(V, E), node v, predefined importance Z, transition matrix T ,
and maximum step K
Output: IPRank score R(v)
4: Procedure PathRecursion(v, n, x, y)
5: y = y + 1;
6: for every node u in in-neighbor set of the node n in G do
7: R(v) = R(v) + Z(u) · x · Tu,n · f (y);
8: if y < K then
9: PathRecursion(v, u, x · Tu,n , y);
10: end if
11: end for
in graph G is d, Algorithm 1 needs to traverse Σ_{i=1}^{k} d^i nodes and thus collects
the same number of random walk paths. The time complexity of IPRank-One
is O(d^k), which is acceptable for querying IPRank scores for one or a few nodes.
But it is obviously inefficient when we need to compute the IPRank scores of all
nodes in a graph. Based on our observations, the random walk paths generated by
IPRank queries of different nodes contain shared segments, which can be
reused to save computational cost. For example, influence propagation along
the paths ⟨a, b, a, c⟩ and ⟨a, b, a⟩ is computed for the IPRank queries of nodes c
and a respectively, but the two paths contain the same segment ⟨a, b, a⟩.
We develop an algorithm, called IPRank-All, to compute IPRank for all nodes in
matrix form. It is motivated by our IP model, where different nodes propagate
their influence over different numbers of steps. The initial influence of all nodes is
stored in a row vector Z. In the first step, every node propagates influence to its
out-neighbors with decay factor f(1). Let us consider the influence received by a
node. Suppose the in-neighbor set of a node v is I(v); the influence received by v is
Z_1(v) = f(1) · Σ_{i∈I(v)} Z(i) · T_{i,v}. Considering all nodes such as v, we get
Z_1 = f(1) · ZT in matrix form. In the second step, according to our IP model, all
elements of Z_1 propagate to their out-neighbors, and the influence vector received
in the second step is Z_2 = f(2) · ZT². Analogously, the influence vector received
in the k-th step can be computed iteratively by

Z_k = f(k) · ZT^k = (f(k) / f(k−1)) · Z_{k−1} T     (4)
Eq. (4) and Eq. (5) form the main computation of the IPRank-All algorithm. Let
X_k = ZT^k; then Z_k can be computed iteratively by applying Z_k = f(k) · X_k.
Algorithm 2. IPRank-All(G, Z, T , h)
Input: graph G(V, E), initial influence vector Z, transition matrix T ,
and threshold h
Output: IPRank scores R
1: initialize R = Z;
2: for every node v ∈ V do
3: obtain K according to Eq. (2);
4: RefineRecursion(v, Z(v), 0, K);
5: end for
6: return R;
7: Procedure RefineRecursion(v, x, y, K)
8: y = y + 1;
9: for every node u in out-neighbor set of node v do
10: R(u) = R(u) + x · Tv,u · f (y);
11: if y < K then
12: RefineRecursion(u, x · Tv,u , y, K);
13: end if
14: end for
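For illustration, a minimal matrix-form sketch of the iteration in Eq. (4). It assumes f(0) = 1, a fixed number of steps for all nodes (rather than the per-node K of Algorithm 2), and that the final score accumulates the initial influence plus the influence received at every step, mirroring the initialization R = Z.

```python
import numpy as np

def iprank_all(Z, T, f, steps):
    """Accumulate R = Z + sum_k Z_k with Z_k = f(k) * Z T^k, computed
    iteratively as Z_k = (f(k) / f(k-1)) * Z_{k-1} T (Eq. 4)."""
    R = Z.astype(float).copy()
    Zk = Z.astype(float).copy()
    prev = 1.0                      # plays the role of f(0)
    for k in range(1, steps + 1):
        Zk = (f(k) / prev) * (Zk @ T)
        prev = f(k)
        R += Zk
    return R

# Tiny example: 3-node cycle, decay f(k) = 0.7^k, 10 propagation steps
T = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
Z = np.array([0.2, 0.4, 0.4])
print(iprank_all(Z, T, lambda k: 0.7 ** k, 10))
```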
MA (v) is the membership degree. On the other hand, if nodes in the group
are more connected to each other, this group should have a higher centrality.
We do not use the simple approaches such as summing and averaging, because
they ignore the link information between individual nodes in a group. To reduce
the effect of the group size, individual nodes with a high centrality should play
a more important role, especially when they are highly connected. The IP model is
also effective in helping rank groups from the viewpoint of the inner structure, by
propagating the influence of these high-score individuals via links. That is,

GR_in = Σ_{v∈A} [ M_A(v) · Z(v) + Σ_{u∈A} Z(u, v) ]     (8)

Finally, we combine the rankings from the outer and inner structural context together
to rank groups in graph G(V, E), as shown below.

GR = Σ_{v∈A} [ M_A(v) · Z(v) + Σ_{u∈V} Z(u, v) ]     (9)
4 Experimental Study
We report our experimental results to confirm the effectiveness of our IPRank on
both individual and group levels. We compare IPRank with four other centrality
measures on accuracy, and we use various synthetic datasets and a large real
co-authorship network from DBLP. All algorithms were implemented in Java,
and all experiments were run on a machine with a 2.8 GHz CPU.
Table 2. (a) Normalized centrality scores of different measures. (b) Normalized IPRank scores while the predefined importance of node b increases step by step.

(a)
CD        [0.20, 0.10, 0.30, 0.20, 0.20]
CB        [0.21, 0.21, 0.19, 0.19, 0.21]
CC        [0.29, 0.00, 0.33, 0.04, 0.33]
PageRank  [0.16, 0.12, 0.32, 0.23, 0.17]
IPRank    [0.16, 0.21, 0.30, 0.19, 0.14]

(b)
Z                            Normalized IPRank Scores (%)
[0.2, 0.0, 0.2, 0.2, 0.2]    [16.2, 5.66, 33.2, 25.8, 19.1]
[0.2, 0.2, 0.2, 0.2, 0.2]    [16.1, 11.6, 31.9, 23.2, 17.2]
[0.2, 0.4, 0.2, 0.2, 0.2]    [16.0, 15.6, 31.1, 21.4, 15.9]
[0.2, 0.6, 0.2, 0.2, 0.2]    [16.0, 18.4, 30.5, 20.2, 14.9]
Table 3. (a) Ranking without predefined importance. (b) IPRank on KDD area. (c)
IPRank on WWW area.
Efficiency and Convergence Rate: PageRank does not provide a way to compute
the score of only one node. In contrast, IPRank-One can do this without
accuracy loss, and an advantage is that if we only need the IPRank scores
of a few nodes, IPRank-One is more efficient than IPRank-All. We run
experiments on a random graph with 1M nodes and 3M edges. IPRank-All takes
3.65s to perform all iterations, whereas IPRank-One only needs 0.01s to answer
an IPRank query for one node. IPRank-All+ provides a more accurate measure
Fig. 3. (a) Time cost of the traversal, increasing with the number of steps K. (b) Performance of IPRank-All as the node size increases. (c) Convergence rate of IPRank-All on the DBLP dataset.
than IPRank-All when the decay of some large predefined importance needs
more iterations. The algorithms show that IPRank-One and IPRank-All+ are both
based on a traversal of the nodes that reach the target node within K steps. Fig. 3(a)
shows that the time cost of such a traversal increases rapidly as K increases. We
recommend IPRank-One for IPRank queries of a few nodes and IPRank-All+ for
more accurate IPRanking.
IPRank-All is suitable for most cases. We set |E|/|V| = 5 and let the
graph size |V| increase. The time cost of each IPRank-All iteration increases nearly
linearly and looks acceptable, as shown in Fig. 3(b). We test the convergence
rate of IPRank-All on the DBLP co-authorship network with decay f(k) = 0.7^k.
The precision at iteration k is defined by averaging R_k(a)/R(a) over every node
a. Fig. 3(c) shows that after 10 iterations, the error of precision is below 0.01.
5 Related Work
Historically, measuring the centrality of nodes (or individuals) in a network has
been widely studied. Freeman [2] reviewed and categorized these methods into
three conceptual foundations: degree, betweenness, and closeness. Accompanied
with eigenvector centrality (EVC) proposed by Bonacich [3], these four measures
dominate the empirical usage of centrality. A recent summary can be found in
[4]. Besides, Tong et al. [14] proposed cTrack to find central objects in a skewed
time-evolving bipartite graph, based on random walk with restart.
In recent years, the trend of exploiting structural context becomes prevalent
in network analysis. The crucial intuition behind this trend is that “individuals
relatively closer in a network are more likely to have the similar characters”. A
typical example is PageRank [5], where the page importance flows and is mutually
reinforced along hyperlinks. Other examples and applications were explored
in recent works such as [10,11,15]. [15] analyzed the propagation of trust and
distrust on large networks consisting of people. [11] used a few labeled exam-
ples to discriminate irrelevant results by computing proximity from the relevant
nodes. Gyongyi et al. discovered other good pages by propagating the trust of a
small set of good pages [10].
6 Conclusion
In this paper, we proposed a new influence propagation model that propagates
user-defined importance on nodes to others along random walk paths, with user
control provided by allowing users to define decay functions. We proposed new
algorithms to measure the centrality of individuals and groups according to the
user's view. We tested our approaches using a large real dataset from DBLP, and
confirmed the effectiveness and efficiency of our approaches.
References
1. Zhang, H., Smith, M., Giles, C.L., Yen, J., Foley, H.C.: Snakdd 2008 social network
mining and analysis report. SIGKDD Explorations 10(2), 74–77 (2008)
2. Freeman, L.C.: Centrality in social networks: conceptual clarification. Social Net-
works 1, 215–239 (1978)
3. Bonacich, P.: Factoring and weighting approaches to status scores and clique iden-
tification. Journal of Mathematical Sociology 2(1), 113–120 (1972)
4. Newman, M.: The mathematics of networks. In: Blume, L., Durlauf, S. (eds.) The
New Palgrave Encyclopedia of Economics, 2nd edn. Palgrave MacMillan, Bas-
ingstoke (2008), http://www-personal.umich.edu/~mejn/papers/palgrave.pdf
5. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking:
Bringing order to the web. Technical Report 1999-66, Stanford InfoLab (1999)
6. Kempe, D., Kleinberg, J.M., Tardos, É.: Maximizing the spread of influence
through a social network. In: KDD, pp. 137–146 (2003)
7. Everett, M.G., Borgatti, S.P.: Extending centrality. In: Wasserman, S., Faust, K.
(eds.) Social network analysis: methods and applications, pp. 58–63. Cambridge
University Press, Cambridge (1994)
8. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press,
Cambridge (1995)
9. Valente, T.: Network Models of the Diffusion of Innovations. Hampton Press, New
Jersey (1995)
10. Gyöngyi, Z., Garcia-Molina, H., Pedersen, J.O.: Combating web spam with
trustrank. In: VLDB, pp. 576–587 (2004)
11. Sarkar, P., Moore, A.W.: Fast dynamic reranking in large graphs. In: WWW, pp.
31–40 (2009)
12. Centrality in Wikipedia, http://en.wikipedia.org/wiki/Centrality
13. Dangalchev, C.: Residual closeness in networks. Physica A: Statistical
Mechanics and its Applications 365(2), 556–564 (2006)
14. Tong, H., Papadimitriou, S., Yu, P.S., Faloutsos, C.: Proximity tracking on time-
evolving bipartite graphs. In: SDM, pp. 704–715 (2008)
15. Guha, R.V., Kumar, R., Raghavan, P., Tomkins, A.: Propagation of trust and
distrust. In: WWW, pp. 403–412 (2004)
16. Haveliwala, T.H.: Topic-sensitive pagerank. In: WWW, pp. 517–526 (2002)
17. Lin, Z., Lyu, M.R., King, I.: Pagesim: a novel link-based measure of web page
similarity. In: WWW, pp. 1019–1020 (2006)
18. Baeza-Yates, R.A., Boldi, P., Castillo, C.: Generalizing pagerank: damping func-
tions for link-based ranking algorithms. In: SIGIR, pp. 308–315 (2006)
19. Jiang, D., Pei, J.: Mining frequent cross-graph quasi-cliques. TKDD 2(4) (2009)
Dynamic Ordering-Based Search Algorithm for Markov
Blanket Discovery
1 Introduction
Bayesian network (BN) [1] is a type of statistical model that efficiently represents the
joint probability distribution of a domain. It is a directed acyclic graph where nodes
represent the domain variables of a subject matter, and arcs between the nodes describe
the probabilistic relationships among variables. One problem that naturally arises is the
learning of such a model from data. Most of the existing algorithms fail to construct a
network of hundreds of variables in size. A reasonable strategy for learning a large BN
is to first discover the Markov blanket of variables, and then use it to guide the
construction of the full BN [2,3,4,5].
The Markov blanket is indeed an important concept and has potential uses in
numerous applications. For every variable of interest T, the Markov blanket contains
the set of parents, children, and spouses (i.e., parents of common children) of T in a
BN [1]. The parents and children reflect the direct causes and direct effects of T
respectively, while the spouses represent the direct causes of T's direct effects. Such
causal knowledge is essential if domain experts desire to manipulate the data process,
e.g. to perform troubleshooting on a faulty device, to test the body's reaction to a
medicine, or to study the symptoms of a disease. Furthermore, conditioned on its Markov blanket
variables, the variable T is probabilistically independent of all other variables in the
domain. Given this important property, the Markov blanket is inextricably connected to
the feature selection problems. Koller and Sahami [6] showed the Markov blanket of T
is the theoretically optimal set of features to predict T's values. We show an instance
of a Markov blanket within a small BN in Fig. 1. The goal of this paper is to identify the
Markov blanket of a target variable from data in an efficient and reliable manner.

Fig. 1. Markov blanket of the target node T in the BN. It includes the parents and children of T, PC(T) = {C, D, I}, and the spouses, SP(T) = {R, H}.
Research on Markov blanket discovery is traced back to the Grow-Shrink algo-
rithm (GS) in Margaritis and Thrun’s work [7]. The Grow-Shrink algorithm is the first
Markov blanket discovery algorithm proved to be correct. Tsamardinos et al. [8,9] pro-
posed several variants of GS, like the incremental association Markov blanket (IAMB)
and Interleaved IAMB, which aim at improved speed and reliability. However, these
algorithms are still limited in achieving data efficiency. To overcome this limitation, at-
tempts have been made including the Max-Min Parents and Children (MMPC) [10] and
HITON-PC [11] algorithms for Markov blanket discovery. Neither of them is shown to
be correct. This motivates a new generation of algorithms like the Parent-Child based
search of Markov blanket (PCMB) [12] and the improved one - Iterative PCMB (IPC-
MB) [13]. Besides the proved soundness, IPC-MB inherits the searching strategy from
the MMPC and HITON-PC algorithms: it starts to learn both parents and children of the
target variable and then proceeds to identify spouses of the target variable. It results in
the Markov blanket from which we are able to differentiate direct causes (effects) from
indirect relations to the target variable. The differentiation among Markov blanket variables
is rather useful when the Markov blanket is further analyzed to recover the causal
structure, e.g., providing a partial order to speed up the learning of the full BN. In a
similar vein, we will base the new algorithm on IPC-MB and provide improvements
in both time and data efficiency.
In this paper, we propose a novel Markov blanket discovery algorithm, called Dy-
namic Ordering-based Search (DOS) algorithm. Akin to the existing algorithms,
DOS takes an independence-based search to find a Markov blanket, assuming that the
data were generated from a faithful BN modeling the domain. It conducts a series
of statistical conditional independence tests toward the goal of identifying a number of
Markov blanket variables (parents and children as well as spouses). Our main contribution
in developing DOS is on two aspects. Firstly, we arrange the sequence of
independence tests by ordering variables not only in the candidate set, but also in the
conditioning sets. We order the candidates using the independence measurement like
the mutual information [14], the p-value returned by the G² test [15], etc. Meanwhile,
we order the conditioning variables in terms of the frequency with which the variables enter
into the conditioning set in the known independence tests. We re-order the variables im-
mediately when an independence test is completed. By ordering both types of variables,
we are able to detect true negatives effectively within a small number of conditional
independence tests.
Secondly, we exploit the known conditional independence tests to remove true
negatives from the candidate set at the earliest time. By doing so, we need to test only
a small number of the conditioning sets (generated from the candidate set), thereby
improving time efficiency. In addition, we can limit the conditioning set to a small size
in the new independence tests, which achieves data efficiency. We further provide a
proof of the correctness of the new DOS algorithm. Experimental results show the
benefit of dynamically ordering independence tests and demonstrate superior
performance over the IPC-MB algorithm.
Bayesian network (BN) [1] is a directed acyclic graph G where each node is annotated
with a conditional probability distribution (CPD) given any instantiation of its parents.
The multiplication of all CPDs constitutes a joint probability distribution P modeling
the domain. In a BN, a node is independent of its non-descendants conditioned on its
parents.
Due to the faithfulness assumption and d-separation criterion, we are able to learn a BN
from the data generated from the domain. We may utilize statistical tests to establish
conditional independence between variables that is structured in the BN. This moti-
vates the main idea of an independence-based (or constraint-based) search for learning
BN [15]. Most current BN or Markov blanket learning algorithms are based on the
following theorem [15].
We observe that the first part of Theorem 1 allows one to find parents and children of the
target node T , denoted by P C(T ), since there shall be an edge between P C(T ) and T ;
the second part provides the possibility of identifying a spouse of T, denoted by SP(T).
Hence Theorem 1 together with Theorem 2 provides a foundation for Markov blanket
discovery.
3 DOS Algorithm
The Dynamic Ordering-based Search algorithm (DOS) discovers the Markov blanket
M B(T ) through two procedures. In the first procedure, the algorithm finds a candidate
set of parents and children of the target node T , called CP C(T ). It starts with the
whole set of domain variables and gradually excludes those that are independent of T
conditioned on a subset of the remained set. In the second procedure, the algorithm
identifies spouses of the target node, called SP (T ), and removes the false positives
from CPC(T). The resulting CPC(T) is the output MB(T).
Prior to presenting the DOS algorithm, we introduce three functions. The first
function, called Indep(X, T|S), measures the independence between a variable X and
the target variable T conditioned on a set of variables S. In our algorithm, we use G²
tests to compute conditional independence and take the p-value (returned by the G²
test) as the independence measurement [15]. The smaller the p-value, the higher the
dependence. In practice, we compare the p-value to a confidence threshold 1 − α. More
precisely, we let Indep(X, T|S) be equal to the p-value so that we are able to connect
the independence measurement to conditional independence, i.e., I(X, T|S) = true
iff Indep(X, T|S) ≥ 1 − α. Notice that we assume the independence tests are correct.
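A simplified sketch of such a conditional G² test: it stratifies on the conditioning set, sums the per-stratum statistics, uses a per-stratum degrees-of-freedom count based on the observed levels, and converts the statistic to a p-value with the chi-square survival function (scipy). The simplifications relative to a production-quality test are ours.

```python
import numpy as np
from itertools import product
from scipy.stats import chi2

def g2_pvalue(data, x, t, cond):
    """Conditional independence test I(X, T | S) via the G^2 statistic.

    data : 2-D integer array of discrete samples (rows = instances)
    x, t : column indices of the tested variables
    cond : list of column indices forming the conditioning set S
    Returns a p-value; larger values indicate independence.
    """
    g2, dof = 0.0, 0
    levels = [np.unique(data[:, c]) for c in cond]
    for s_vals in product(*levels):            # one stratum per configuration of S
        mask = np.ones(len(data), dtype=bool)
        for c, v in zip(cond, s_vals):
            mask &= data[:, c] == v
        sub = data[mask]
        if len(sub) == 0:
            continue
        xs, ts = np.unique(sub[:, x]), np.unique(sub[:, t])
        observed = np.array([[np.sum((sub[:, x] == a) & (sub[:, t] == b)) for b in ts]
                             for a in xs], dtype=float)
        expected = (observed.sum(axis=1, keepdims=True) *
                    observed.sum(axis=0, keepdims=True) / observed.sum())
        nz = observed > 0
        g2 += 2.0 * np.sum(observed[nz] * np.log(observed[nz] / expected[nz]))
        dof += (len(xs) - 1) * (len(ts) - 1)
    return chi2.sf(g2, dof) if dof > 0 else 1.0
```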
The second function, called Freq(Y), is a counter that measures how frequently a
variable Y enters into the conditioning set S in the previous conditional independence
tests Indep(X, T|S). A large Freq(Y) value implies a large probability of d-separating
Main Procedure
1: Initialize the adjacent set of T : ADJ(T ) = U − {T }
2: Find the CP C(T ) through GenCP C:CP C(T ) = GenCP C(D, T, ADJ(T ))
3: Find the SP (T ) and remove the false positives through Ref CP C:M B(T ) =
Ref CP C(D, T, CP C(T ))
Sub-Procedure: Generate the CP C(T )
GenCP C(D, T, ADJ (T ))
1: Initialize the size of conditioning set S: cutsize=0
2: WHILE (|ADJ(T )| > cutsize) DO
3: Initialize the Non-PC set: N P C(T )=∅
4: FOR each X ∈ ADJ(T ) and
choose X = argmax Indep(X, T |S) DO
5: Generate the conditioning sets:
SS = GenSubset(ADJ(T ) − {X}, cutsize)
6: FOR each S ∈ SS DO
7: IF (Indep(X, T |S) ≥ 1 − α) THEN
8: N P C(T )=N P C(T ) ∪ X
9: ADJ(T ) = ADJ(T ) − N P C(T )
10: Keep the d-separate sets: Sepset(X, T )=S
11: FOR each Y ∈ S DO
12: Update F req(Y )
13: Order ADJ(T ) using F req(Y ) in the descending order
14: Break
15: cutsize = cutsize + 1
16: Return CP C(T ) = ADJ(T )
Sub-Procedure: Refine the CP C(T )
Ref CP C(D, T, CP C(T ))
1: FOR each X ∈ CP C(T ) DO
2: Find the CP C for X:
CP C(X) = GenCP C(D, X, U − {X})
3: IF T ∉ CP C(X) THEN
4: Remove the false positives: CP C(T ) = CP C(T ) − {X}
5: Continue
6: FOR each Y ∈ {CP C(X) − CP C(T )} DO
7: IF (Indep(Y, T |X ∪ Sepset(X, T )) < 1 − α) THEN
8: Add the spouse Y : SP (T ) = SP (T ) ∪ {Y }
9: CP C(T ) = CP C(T ) ∪ SP (T )
10: Return M B(T ) = CP C(T )
Fig. 2. The DOS algorithm contains two sub-procedures. The GenCP C procedure finds a can-
didate set of parents and children of T by efficiently removing Non-PC from the set of domain
variables while the Ref CP C procedure mainly adds spouses of T and removes false positives.
ADJ(T) using the GenSubset function (line 5). Since we order the ADJ(T) variables
and generate the subsets in the Banker's sequence, the conditioning set S (∈ SS) selected first
has a large probability of being PC(T) or a subset of it. Consequently, we
may detect a Non-PC variable within a few tests. Once we identify a Non-PC variable,
we immediately remove it from ADJ(T) (lines 8-9). The reduced ADJ(T) avoids generating
a large number of conditioning sets, as well as large conditioning sets, in the subsequent tests.
The GenCPC procedure returns a candidate set of T's parents and children that
excludes false negatives. However, it may still include false positives. For instance,
in Fig. 1, the variable M remains in the output CPC(T) because M is d-separated
from T only conditioned on the set {R, I}. However, the variable R is removed early,
since it is independent of T given the empty set, so the tests never condition
on both R and I simultaneously. The problem is fixed by checking the symmetry relation
between T and T's PC, i.e., T shall be in the PC set of each of T's PC variables and
vice versa [2,12]. For example, we may find the candidate set of M's parents and children,
CPC(M). If T does not belong to CPC(M), we can safely remove M from
CPC(T). We present this solution in the procedure RefCPC.
In the procedure RefCPC, we search the parent-and-children set of each
variable in CPC(T) (line 2). If a candidate PC variable violates the symmetry (i.e.,
T ∉ CPC(X)), it is removed from CPC(T) (line 4). If T ∈ CPC(X), we
know that X is a true PC of T and CPC(X) may contain T's spouse candidates. A
spouse is not within CPC(T), but shares common children with T. We again use G2
tests to detect the dependence between the spouse and T, and identify the true spouse
set SP(T) (lines 7-9). We refine CPC(T) by removing the false positives and
adding the spouses, and finally return the true MB(T).
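The following is a minimal, hedged Python sketch of the GenCPC idea in Fig. 2, assuming the indep() helper from the previous sketch; for readability it simplifies the immediate removal and reordering of the original pseudocode into one pass per conditioning-set size, so it illustrates the dynamic-ordering strategy rather than reproducing the exact procedure.

```python
# Simplified sketch of GenCPC (Fig. 2); assumes indep() from the previous sketch.
from itertools import combinations

def gen_cpc(data, target, adj, alpha=0.05):
    adj = list(adj)
    sepset, freq = {}, {}
    cutsize = 0
    while len(adj) > cutsize:
        non_pc = []
        for x in list(adj):
            # Conditioning sets of size `cutsize` drawn from the current adj \ {x};
            # keeping adj ordered by freq means promising subsets are tried first.
            candidates = [v for v in adj if v != x and v not in non_pc]
            for s in combinations(candidates, cutsize):
                if indep(data, x, target, list(s)) >= 1 - alpha:
                    non_pc.append(x)            # X is judged Non-PC of T
                    sepset[x] = set(s)          # remember its separating set
                    for y in s:
                        freq[y] = freq.get(y, 0) + 1
                    break
        adj = [v for v in adj if v not in non_pc]
        adj.sort(key=lambda v: freq.get(v, 0), reverse=True)
        cutsize += 1
    return adj, sepset
```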
Theorem 3 (Correctness). The Markov blanket MB(T) returned by the DOS algorithm
is correct and complete under two assumptions: 1) the data D are faithful to a
BN; and 2) the independence tests are correct.
The primary complexity of the DOS algorithm is due to the procedure GenCPC in
Fig. 2. As in the performance evaluation of BN learning algorithms, the complexity
is measured in the number of conditional independence tests executed [15]. The
procedure needs to calculate the independence function Indep(X, T|S) for each domain
variable given all subsets of ADJ(T) in the worst case. Hence the number of
tests is bounded by O(|U| · 2^|ADJ(T)|). Our strategy of selecting both the candidate
variable X and the conditioning set S quickly reduces ADJ(T) by removing
Non-PC variables and, in most cases, tests only subsets of PC(T). Ideally, we may
expect the complexity to be in the order of O(|U| · 2^|PC(T)|). This is a significant reduction
in complexity since |PC(T)| ≪ |ADJ(T)| in most cases.
4 Experimental Results
We evaluate the DOS algorithm performance over triple benchmark networks and com-
pare it with the state-of-the-art algorithm IPC-MB. To be best of our knowledge, the
IPC-MB is the best algorithm for Markov blanket discovery in the current study. Both
algorithms are implemented in Java and the experiments are run on a WindowsXP plat-
form with Pentium(R) Dual-core (2.60 GHz) with 2G memory.
We describe the networks in Table 1. The networks range from 20+ to 50+
variables and differ in connectivity, measured by both in/out-degree
and PC numbers. They are used in a wide range of practical applications and
have been proposed as benchmarks for evaluating both BN and Markov blanket learning
algorithms [2]. For each network we randomly sample data from its probability
distribution. We use both the DOS and IPC-MB algorithms to reconstruct
the Markov blanket of every variable from the data.
We compare the algorithms in terms of speed, measured by both running time and the number
of conditional independence (CI) tests executed, and accuracy, measured by both
precision and recall. Precision is the ratio of true positives in the output (returned by
the algorithms), while recall is the ratio of returned true positives in the true MB(T).
In addition, we use a combined measure, the proximity of the algorithm's precision and recall
to perfect precision and recall, expressed as the Euclidean distance:
$$\mathrm{Distance} = \sqrt{(1 - \mathrm{precision})^2 + (1 - \mathrm{recall})^2}.$$
The smaller the distance, the closer the algorithm output is to the true Markov blanket.
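For illustration, these three accuracy measures can be computed as follows (a small sketch under the assumption that the discovered and true Markov blankets are available as sets; not code from the paper).

```python
# Small illustration (not code from the paper): computing precision, recall and
# the Distance measure from a discovered and a true Markov blanket given as sets.
from math import sqrt

def mb_accuracy(found: set, true_mb: set):
    tp = len(found & true_mb)
    precision = tp / len(found) if found else 0.0
    recall = tp / len(true_mb) if true_mb else 0.0
    distance = sqrt((1 - precision) ** 2 + (1 - recall) ** 2)
    return precision, recall, distance

# A perfect output yields distance 0; an empty output yields sqrt(2).
```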
For a single experiment on a particular dataset, we ran the algorithms using every variable
in the network as the target and computed the average values for each measurement. For
a particular dataset size, we randomly generated 10 sets and measured the average
performance of each algorithm. We set α = 0.05. Table 2 reports the experimental results
for datasets of different sizes. Each entry in the table shows the average and standard
deviation over the 10 datasets of a particular size. In the table, “Insts.” refers to data
instances and “Algs.” to the two algorithms. For the speed comparison, “# CI
tests” denotes the total number of conditional independence tests, and “Reduction” shows the
percentage by which the DOS algorithm reduces the running time and the number of CI tests relative to
the IPC-MB algorithm. For the accuracy comparison, “Improvement” refers to
the improvement of the DOS algorithm over the IPC-MB algorithm on the accuracy
measurements precision, recall, and distance.

Table 2. Speed and accuracy comparison between the DOS and IPC-MB algorithms
In the middle part of Table 2, we show the speed comparison between the DOS and
IPC-MB algorithms over four different datasets on three networks. The DOS algorithm
discovers the Markov blanket much faster than IPC-MB. This results
from a significant reduction in the number of CI tests required by the DOS algorithm. As
Table 2 shows, DOS requires on average 40% fewer CI tests than
IPC-MB. In some cases (such as the ALARM network with 5000 data instances) the reduction is
up to 49.94%. The improved time efficiency is mainly due to our ordering strategy, which
enables the DOS algorithm to quickly spot true negatives and reduce T's adjacent set,
thereby avoiding unnecessary CI tests.
In the right part of Table 2, we show the accuracy of both algorithms in discovering
the Markov blanket. As expected, both algorithms perform better (smaller distance)
with a larger number of data instances. In most cases, the DOS algorithm performs better
than the IPC-MB algorithm, with around 8% improvement on the distance measurement.
The improvement is mainly due to more true positives being found by the DOS algorithm
(shown by the larger improvement on the recall measurement).
More importantly, the DOS algorithm shows a larger improvement on the distance for
smaller numbers of data instances. For example, on the Insurance network, the distance
improvement is 13.95% with 300 data instances but 7.41% with 2000 data instances.
This implies more reliable CI tests in the DOS algorithm. The significant reduction
of CI tests (shown in Table 2) also indicates improved test reliability for the
DOS algorithm. The reliability advantage arises because the DOS algorithm always
conditions on small conditioning sets by removing true negatives as early as possible.
5 Related Work
Margaritis and Thrun [7] proposed the first provably correct Markov blanket discovery
algorithm, the Grow-Shrink (GS) algorithm. As implied by its name, the GS algorithm contains
two phases: a growing phase and a shrinking phase. It first adds potential
variables into the Markov blanket and then removes false positives in the subsequent
phase. As GS conducts statistical independence tests conditioned on a superset of the
Markov blanket, and many false positives may be included in the growing phase, it turns
out to be inefficient and does not scale to large applications. However, its soundness
made it a proven starting point for subsequent research.
The IAMB [8] was proposed to improve the time and data efficiency of GS. It
reorders the set of variables each time a new variable is included into the Markov
blanket in the growing phase. By doing so, IAMB adds fewer false positives in
the first phase. However, the independence tests are still conditioned on the whole (possibly
large) current Markov blanket set, which does not really improve the data efficiency. Moreover,
the computation of conditional information values for sorting the variables in each
iteration is rather expensive in IAMB. Yaramakala and Margaritis [17] proposed
a new heuristic function to determine the independence tests and order the variables;
however, as reported, it is not fundamentally different from IAMB.
Later, several variants of IAMB appeared to address its limited data efficiency,
such as the Max-Min Parents and Children (MMPC) algorithm [10], HITON-PC [11], and
so on. Unfortunately, both MMPC and HITON-PC were proved incorrect [12],
but they did introduce a new approach to identifying the Markov blanket: find the
Markov blanket by searching T's parents and children first, and then
discover T's spouses. This strategy allows independence tests to be conditioned
on a subset of T's neighboring (adjacent) nodes instead of the whole Markov blanket.
Following the same idea as MMPC and HITON-PC, Pena et al. [12] proposed
PCMB to overcome the data efficiency problem of IAMB. More importantly,
PCMB was proved correct theoretically. Recently, Fu and Desmarais [13] proposed
IPC-MB, which always conducts statistical independence tests conditioned on
a minimal set of T's neighbors and thus improves on PCMB in both time and
data efficiency. However, both algorithms need to iterate over a large number of subsets of
T's neighboring nodes in most cases and do not update the set of neighboring nodes
immediately after a true negative is detected. This leaves room for the improvement presented
in this paper.
6 Conclusion
We presented a new algorithm for Markov blanket discovery, called Dynamic Ordering-based
Search (DOS). The DOS algorithm orders conditional independence tests through
a strategic selection of both the candidate variable and the conditioning set. The selection
is achieved by exploiting the already-executed independence tests to order the variables. By
doing so, the new algorithm can efficiently remove true negatives, so that it avoids unnecessary
conditional independence tests and the remaining tests condition on small sets.
We analyzed the correctness of the DOS algorithm as well as its complexity in terms
of the number of conditional independence tests. Our empirical results show that the
DOS algorithm performs much faster and more reliably than the state-of-the-art algorithm
IPC-MB. The reliability advantage is more evident with a small number of
data instances. A potential research direction is to investigate the utility of our ordering
scheme in independence-based algorithms for BN learning.
Acknowledgment
The first author acknowledges partial support from National Natural Science Founda-
tion of China (No. 60974089 and No. 60975052). Yanping Xiang thanks the support
from National Natural Science Foundation of China (No. 60974089).
References
1. Pearl, J.: Probabilistic reasoning in intelligent systems: networks of plausible inference. Mor-
gan Kaufmann Publishers Inc., San Francisco (1988)
2. Tsamardinos, I., Brown, L.E., Aliferis, C.F.: The max-min hill-climbing bayesian network
structure learning algorithm. Machine Learning 65(1), 31–78 (2006)
3. Zeng, Y., Poh, K.L.: Block learning bayesian network structure from data. In: Proceedings
of the Fourth International Conference on Hybrid Intelligent Systems (HIS 2004), pp. 14–19
(2004)
4. Zeng, Y., Hernandez, J.C.: A decomposition algorithm for learning bayesian network struc-
tures from data. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD 2008.
LNCS (LNAI), vol. 5012, pp. 441–453. Springer, Heidelberg (2008)
5. Zeng, Y., Xiang, Y., Hernandez, J.C., Lin, Y.: Learning local components to understand large
bayesian networks. In: Proceedings of The Ninth IEEE International Conference on Data
Mining (ICDM), pp. 1076–1081 (2009)
6. Koller, D., Sahami, M.: Toward optimal feature selection. In: Proceedings of the Thirteenth
International Conference on Machine Learning, pp. 284–292 (1996)
7. Margaritis, D., Thrun, S.: Bayesian network induction via local neighborhoods. Advances in
Neural Information Processing Systems 12, 505–511 (1999)
8. Tsamardinos, I., Aliferis, C.F., Statnikov, A.R.: Algorithms for large scale markov blanket
discovery. In: Proceedings of the Sixteenth International Florida Artificial Intelligence Re-
search Society Conference, pp. 376–381 (2003)
9. Tsamardinos, I., Aliferis, C.: Towards principled feature selection: Relevancy, filters and
wrappers. In: Proceedings of the Ninth International Workshop on Artificial Intelligence and
Statistics (2003)
10. Tsamardinos, I., Aliferis, C., Statnikov, A.: Time and sample efficient discovery of markov
blankets and direct causal relations. In: KDD, pp. 673–678 (2003)
11. Aliferis, C., Tsamardinos, I., Statnikov, A.: Hiton: A novel markov blanket algorithm for
optimal variable selection. In: Proceedings of American Medical Informatics Association
Annual Symposium (2003)
12. Pena, J.M., Nilsson, R., Bjorkegren, J., Tegner, J.: Towards scalable and data efficient learn-
ing of markov boundaries. International Journal of Approximate Reasoning 45(2), 211–232
(2007)
13. Fu, S., Desmarais, M.C.: Fast markov blanket discovery algorithm via local learning within
single pass. In: Proceedings of the Twenty-First Canadian Conference on Artificial Intelli-
gence, pp. 96–107 (2008)
14. Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley-Interscience,
New York (2006)
15. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction, and Search. MIT Press, Cam-
bridge (2000)
16. Loughry, J., van Hemert, J., Schoofs, L.: Efficiently enumerating the subsets of a set. Depart-
ment of Mathematics and Computer Science, University of Antwerp, RUCA, Belgium, pp.
1–10 (2000)
17. Yaramakala, S., Margaritis, D.: Speculative markov blanket discovery for optimal feature
selection. In: Proceedings of the Fifth IEEE International Conference on Data Mining, pp.
809–812 (2005)
Mining Association Rules for Label Ranking
1 Introduction
such as Mallows [17]. The other group of methods is based on measures of
similarity or correlation between rankings (e.g., [24,2]).
In this paper, we propose an adaptation of association rule mining for label
ranking based on similarity measures. Association rule mining is a very important
and successful task in data mining. Although its original purpose was purely
descriptive, several adaptations have been proposed for predictive problems.
The paper is organized as follows: sections 2 and 3 introduce the label ranking
problem and the task of association rule mining, respectively; section 4 describes
the measures proposed here; section 5 presents the experimental setup and dis-
cusses the results; finally, section 6 concludes this paper.
2 Label Ranking
The formalization of the label ranking problem given here follows the one pro-
vided in [7].1 In classification, given an instance x from the instance space X, the
goal is to predict the label (or class) λ to which x belongs, from a pre-defined
set L = {λ1 , . . . , λk }. In label ranking the goal is to predict the ranking of the
labels in L that are associated with x. We assume that the ranking is a total
order over L defined on the permutation space Ω. A total order can be seen as a
permutation π of the set {1, . . . , k}, such that π(a) is the position of λa in π. Let
us also denote π −1 as the result of inverting the order in π. As in classification,
we do not assume the existence of a deterministic X → Ω mapping. Instead,
every instance is associated with a probability distribution over Ω. This means
that, for each x ∈ X, there exists a probability distribution P (·|x) such that, for
every π ∈ Ω, P (π|x) is the probability that π is the ranking associated with x.
The goal in label ranking is to learn the mapping X → Ω. The training data is
a set of instances T = {⟨xi, πi⟩}, i = 1, . . . , n, where xi are the independent
variables describing instance i and πi is the corresponding target ranking.
As an example, given a scenario where we have financial analysts making
predictions about the evolution of volatile markets, it would be advantageous
to be able to predict which analysts are more profitable in a certain market
context [2]. Moreover, if we could have beforehand the full ordered list of the best
analysts, this would certainly increase the chances of making good investments.
Given the ranking π̂ predicted by a label ranking model for an instance x,
which is, in fact, associated with the true label ranking π, we need to evaluate
the accuracy of the prediction. For that, we need a loss function on Ω. One such
function is the number of discordant label pairs,
which, if normalized to the interval [−1, 1], is equivalent to Kendall's τ coefficient.
The latter is a correlation measure, where D(π, π) = 1 and D(π, π⁻¹) = −1.
We obtain a loss function by averaging this function over a set of examples.
We will use it as the evaluation measure in this paper, as it has been used in
¹ An alternative formalization can be found in [25].
recent studies [7]. However, other distance measures could have been used, like
Spearman’s rank correlation coefficient [22].
3.1 Pruning
AR algorithms typically generate a large number of rules (possibly tens of thousands),
some of which represent only small variations of others. This is known
as the rule explosion problem [4]. It is due to the fact that the algorithm may
find rules whose confidence can be marginally improved by adding further
conditions to the antecedent.
Pruning methods are usually employed to reduce the number of rules without
reducing the quality of the model. A common pruning method is based on the
improvement that a refined rule yields in comparison with the original one [4]. The
improvement of a rule is defined as the smallest difference between the confidence
of the rule and the confidence of all sub-rules sharing the same consequent. More
formally, for a rule A → C
A → π
where A ⊆ desc(X) and π ∈ Ω. The only difference is that the label λ ∈ L is
replaced by a ranking of the labels, π ∈ Ω. Similar to the prediction
made in CBA, when an example matches the rule A → π, the predicted ranking
is π. In this regard, we can use the same basic principle of the ruleitem for CARs
in LRARs, which is ⟨A, π⟩, where A is a set of items and π ∈ Ω.
This approach has two important problems. First, the number of classes can
be extremely large, up to a maximum of k!, where k is the size of the label set L.
This means that the amount of data required to learn a reasonable
mapping X → Ω is too large.
The second disadvantage is that this approach does not take into account
the differences in nature between label rankings and classes. In classification,
two examples either have the same class or not. In this regard, label ranking is
more similar to regression than to classification. This property can be used in
the induction of prediction models. In regression, a large number of observations
with a given target value, say 5.3, increases the probability of observing similar
values, say 5.4 or 5.2, but not so much for very different values, say -3.1 or
100.2. A similar reasoning can be made in label ranking. Let us consider the
case of a data set in which ranking πa = {A, B, C, D, E} occurs in 1% of the
examples. Treating rankings as classes would mean that P (πa ) = 0.01. Let us
further consider that the rankings πb = {A, B, C, E, D}, πc = {B, A, C, D, E}
and πd = {A, C, B, D, E} occur in 50% of the examples. Taking into account the
stochastic nature of these rankings [7], P (πa ) = 0.01 seems to underestimate the
probability of observing πa . In other words it is expected that the observation of
πb , πc and πd increases the probability of observing πa and vice-versa, because
they are similar to each other.
This affects even rankings which are not observed in the available data. For
example, even though πe = {A, B, D, C, E} is not present in the data set it
would not be entirely unexpected to see it in future data.
To take this characteristic into account, we can argue that the support of a rank-
ing π increases with the observation of similar rankings and that the variation
is proportional to the similarity. Given a measure of similarity between rankings
s(πa , πb ), we can adapt the concept of support of the rule A → π as follows:
$$\mathrm{sup}_{lr}(A \rightarrow \pi) = \frac{\sum_{i:\, A \subseteq \mathrm{desc}(x_i)} s(\pi_i, \pi)}{n}$$
Essentially, what we are doing is assigning a weight to each target ranking πi in
the training data that represents its contribution to the probability that π
may be observed. Some instances xi ∈ X give a full contribution to the support
count (i.e., 1), while others may give a partial or even a null contribution.
Any function that measures the similarity between two rankings or permutations
can be used, such as Kendall's τ [16] or Spearman's ρ [22]. The function
used here is of the form:
$$s(\pi_a, \pi_b) = \begin{cases} s'(\pi_a, \pi_b) & \text{if } s'(\pi_a, \pi_b) \geq \theta_{sup} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
where s′ is a similarity function. This general form assumes that, below a given
threshold θsup, it is not useful to discriminate between different similarity values,
as the rankings are too different from πa. This means that the support of ⟨A, πa⟩ will
have contributions from all ruleitems of the form ⟨A, πb⟩, for all πb with
s′(πa, πb) ≥ θsup. Again, many functions can be used as s′.
The confidence of a rule A → π is obtained simply by replacing the support measure
with the new one:
$$\mathrm{conf}_{lr}(A \rightarrow \pi) = \frac{\mathrm{sup}_{lr}(A \rightarrow \pi)}{\mathrm{sup}(A)}$$
Given that the loss function that we aim to minimize is known beforehand, it
makes sense to use it to measure the similarity between rankings. Therefore, we
use Kendall’s τ . In this case, we think that θsup = 0 would be a reasonable value,
given that it separates the negative from the positive contributions. Table 1
shows an example of a label ranking dataset represented following this approach.
TID  A1    A2   A3     π1 (1, 3, 2)   π2 (2, 1, 3)   π3 (2, 3, 1)
 1   L     XL   S          0.33           0.00           1.00
 2   XXL   XS   S          0.00           1.00           0.00
 3   L     XL   XS         1.00           0.00           0.33
To give a clearer interpretation of the example in Table 1: the instance
{A1 = L, A2 = XL, A3 = S} (TID = 1) contributes 1 to the support count
of the ruleitem ⟨{A1 = L, A2 = XL, A3 = S}, π3⟩. The same instance
also gives a smaller contribution of 0.33 to the support count of the ruleitem
⟨{A1 = L, A2 = XL, A3 = S}, π1⟩, given the similarity of the rankings. On the other hand,
it gives no contribution to the support count of the ruleitem ⟨{A1 = L, A2 = XL, A3 = S}, π2⟩,
since the two rankings are clearly different.
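The following Python sketch illustrates the similarity-weighted support sup_lr described above, using Kendall's τ as s′ and θsup = 0; the data layout (a list of (itemset, ranking) pairs) and all names are our assumptions for illustration only.

```python
# Hedged sketch of the similarity-weighted support sup_lr of a ruleitem <A, pi>,
# using Kendall's tau as s' and theta_sup = 0 as suggested above. The data
# layout (a list of (itemset, ranking) pairs) and all names are our assumptions.
from scipy.stats import kendalltau

def sup_lr(antecedent: set, pi, examples, theta_sup=0.0):
    total = 0.0
    for items, pi_i in examples:
        if antecedent <= items:                  # A is contained in desc(x_i)
            s, _ = kendalltau(pi_i, pi)
            if s >= theta_sup:                   # contributions below the threshold are dropped
                total += s
    return total / len(examples)

# Example mirroring Table 1 (rankings over three labels):
examples = [({"A1=L", "A2=XL", "A3=S"}, (2, 3, 1)),
            ({"A1=XXL", "A2=XS", "A3=S"}, (2, 1, 3)),
            ({"A1=L", "A2=XL", "A3=XS"}, (1, 3, 2))]
print(sup_lr({"A1=L", "A2=XL"}, (1, 3, 2), examples))   # ~0.444
```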
However, if these are insufficient to rank the given examples, a default ranking
is used. The default ranking can be the average ranking [5], which is often used
for this purpose.
This approach has two problems. The first is that it can only predict rankings
which were present in the training set (except when no rules apply and the
predicted ranking is the default ranking). The second problem is that it solves
conflicts between rankings without taking into account the “continuous” nature
of rankings, which was illustrated earlier. The problem of generating a single
permutation from a set of conflicting rankings has been studied in the context
of consensus rankings.
It has been shown in [15] that a ranking obtained by ordering the average ranks
of the labels across all rankings minimizes the Euclidean distance to all those
rankings. In other words, it maximizes the similarity according to Spearman's ρ
[22]. Given m rankings πi (i = 1, . . . , m), we aggregate them by computing, for
each item j (j = 1, . . . , k):
$$r_j = \frac{\sum_{i=1}^{m} \pi_{i,j}}{m}$$
The predicted ranking π̂ is obtained by ranking the items according to the value
of rj .
We can take advantage of this in the ranker builder in the following way: the
final predicted label ranking is the consensus of all the label rankings in the
consequent of the rules rπ triggered by the test example.
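A minimal sketch of this consensus step (averaging the ranks of each label over the aggregated rankings and ranking the labels by that average) could look as follows; numpy usage and all names are our assumptions.

```python
# Minimal sketch of the consensus step: average the ranks of each label over the
# aggregated rankings and rank the labels by that average.
import numpy as np

def consensus(rankings):
    r = np.mean(np.asarray(rankings, dtype=float), axis=0)   # r_j = average rank of label j
    order = np.argsort(r, kind="stable")                      # smaller average rank comes first
    pi_hat = np.empty(len(r), dtype=int)
    pi_hat[order] = np.arange(1, len(r) + 1)                  # pi_hat(j) = predicted position of label j
    return pi_hat

print(consensus([(1, 3, 2), (2, 3, 1), (1, 2, 3)]))           # [1 3 2]
```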
To implement improvement-based pruning for LR, some adaptation is required
as well. Given that the relation between target values is different from
classification, as discussed in Section 4.1, we only compare rules with different
consequents π and π′ when the similarity function satisfies s′(π, π′) ≥ θimp.
generation can be a very time consuming task. In this case, minsup must be set
to a value larger than 1%. In this work, one such example is authorship, which
has 70 attributes.
This procedure has the important advantage that it does not take into account
the accuracy of the rule sets generated, thus reducing the risk of over-fitting.
5 Experimental Results
The data sets used in this work were taken from the KEBI Data Repository at the Philipps
University of Marburg [7] (Table 2). Continuous variables were discretized with
two distinct methods: (1) the recursive minimum entropy partitioning criterion [11]
with the minimum description length (MDL) stopping rule, motivated by [10],
and (2) equal-width bins.
The evaluation measure is Kendall's τ, and the performance of the method was
estimated using ten-fold cross-validation. The performance of APRIORI-LR is
compared with a baseline method, the default ranking (explained earlier), and with
RPC [14]. For the generation of frequent ruleitems we used CAREN [3]. The
base learner used in RPC is logistic regression, with the default
configuration of the Logit function from the Stats package of the R programming
language [21].
Additionally, we compare the performance of our algorithm with the results
obtained with constraint classification (CC), instance-based label ranking
(IBLR) and ranking trees (LRT), which were presented in [7]. We note that we did
not run experiments with these methods and simply compared our results with
the published results of the other methods. Thus, they were probably obtained
with different partitions of the data and cannot be compared directly. However,
they provide some indication of the quality of our method when compared to
the state-of-the-art.
The value of θimp was set to 0 in all experiments. This choice may not be as
intuitive as the one for θsup; however, since the focus of this work is the reduction
of the number of generated rules, this value is suitable.
5.1 Results
Table 3 shows that the method obtains results with both discretization methods
that are clearly better than the ones obtained by the baseline method. This
means that the APRIORI-LR is identifying valid patterns that can predict label
rankings.
Table 4 presents the results obtained with pruned rules using the same minsup
and minconf values as in the previous experiments and compares it to RPC
using as a base learner Logistic Regression. Rd represents the percentage of the
number of rules reduced by pruning. The results presented clearly show that the
minImp constraint, set to 0.00 and 0.01, succeeded to reduce the number of
rules. However, there was no improvement in accuracy, although it also did not
decrease. Further tests are required to understand how this parameter affects
the accuracy of the models.
Finally, Table 5 compares APRIORI-LR with state-of-the-art methods based
on published results [7]. Given that the methods were not compared under
the same conditions, this only gives a rough idea of the quality of the
method proposed here. It indicates that, despite the simplicity of the adaptation,
APRIORI-LR is a competitive method. We expect that the results can
be significantly improved, for instance, by implementing more complex pruning
methods.
Table 3. Results obtained with minimum entropy discretization and with equal width
discretization with 3 bins for each attribute
Dataset       APRIORI-LR (EW)   APRIORI-LR (ME)    CC      IBLR    LRT
authorship         NA               0.608          0.920   0.936   0.882
bodyfat           0.161             0.059          0.281   0.248   0.117
calhousing        0.139             0.291          0.250   0.351   0.324
cpu-small         0.279             0.439          0.475   0.506   0.447
elevators         0.623             0.643          0.768   0.733   0.760
fried             0.676             0.774          0.999   0.935   0.890
glass             0.794             0.871          0.846   0.865   0.883
housing           0.577             0.758          0.660   0.745   0.797
iris              0.883             0.960          0.836   0.966   0.947
pendigits         0.684             NA             0.903   0.944   0.935
segment           0.496             0.829          0.914   0.959   0.949
stock             0.836             0.890          0.737   0.927   0.895
vehicle           0.675             0.774          0.855   0.862   0.827
vowel             0.709             0.680          0.623   0.900   0.794
wine              0.910             0.844          0.933   0.949   0.882
wisconsin         0.280             0.031          0.629   0.506   0.343
6 Conclusions
Acknowledgments
References
1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large
databases. In: VLDB, pp. 487–499 (1994)
2. Aiguzhinov, A., Soares, C., Serra, A.P.: A similarity-based adaptation of naive
bayes for label ranking: Application to the metalearning problem of algorithm
recommendation. In: Pfahringer, B., Holmes, G., Hoffmann, A. (eds.) DS 2010.
LNCS, vol. 6332, pp. 16–26. Springer, Heidelberg (2010)
3. Azevedo, P.J., Jorge, A.M.: Ensembles of jittered association rule classifiers. Data
Min. Knowl. Discov. 21(1), 91–129 (2010)
4. Bayardo, R., Agrawal, R., Gunopulos, D.: Constraint-based rule mining in large,
dense databases. Data Mining and Knowledge Discovery 4(2), 217–240 (2000)
5. Brazdil, P., Soares, C., Costa, J.: Ranking Learning Algorithms: Using IBL and
Meta-Learning on Accuracy and Time Results. Machine Learning 50(3), 251–277
(2003)
6. Brin, S., Motwani, R., Ullman, J.D., Tsur, S.: Dynamic itemset counting and im-
plication rules for market basket data. In: Proceedings of the 1997 ACM SIGMOD
international conference on Management of data - SIGMOD 1997, pp. 255–264
(1997)
7. Cheng, W., Hühn, J., Hüllermeier, E.: Decision tree and instance-based learning
for label ranking. In: ICML 2009: Proceedings of the 26th Annual International
Conference on Machine Learning, pp. 161–168. ACM, New York (2009)
8. Pinto da Costa, J., Soares, C.: A weighted rank measure of correlation. Australian
& New Zealand Journal of Statistics 47(4), 515–529 (2005)
9. Dekel, O., Manning, C.D., Singer, Y.: Log-linear models for label ranking. Advances
in Neural Information Processing Systems (2003)
10. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretiza-
tion of continuous features. In: Machine Learning - International Workshop Then
Conference, pp. 194–202 (1995)
11. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued at-
tributes for classification learning. In: IJCAI, pp. 1022–1029 (1993)
12. Fürnkranz, J., Hüllermeier, E.: Preference learning. KI 19(1), 60 (2005)
13. Har-Peled, S., Roth, D., Zimak, D.: Constraint classification: A new approach to
multiclass classification. In: Cesa-Bianchi, N., Numao, M., Reischuk, R. (eds.) ALT
2002. LNCS (LNAI), vol. 2533, pp. 365–379. Springer, Heidelberg (2002)
14. Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by learning
pairwise preferences. Artif. Intell. 172(16-17), 1897–1916 (2008)
15. Kemeny, J., Snell, J.: Mathematical Models in the Social Sciences. MIT Press,
Cambridge (1972)
16. Kendall, M., Gibbons, J.: Rank correlation methods. Griffin, London (1970)
17. Lebanon, G., Lafferty, J.D.: Conditional Models on the Ranking Poset. In: NIPS,
pp. 415–422 (2002)
18. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining.
In: Knowledge Discovery and Data Mining, pp. 80–86 (1998)
19. Park, J.S., Chen, M.S., Yu, P.S.: An effective hash-based algorithm for mining
association rules. ACM SIGMOD Record 24(2), 175–186 (1995)
20. Park, J.S., Chen, M.S., Yu, P.S.: Efficient parallel and data mining for association
rules. In: CIKM, pp. 31–36 (1995)
21. R Development Core Team: R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna, Austria (2010),
http://www.R-project.org ISBN 3-900051-07-0
22. Spearman, C.: The proof and measurement of association between two things.
American Journal of Psychology 15, 72–101 (1904)
23. Thomas, S., Sarawagi, S.: Mining generalized association rules and sequential pat-
terns using sql queries. In: KDD, pp. 344–348 (1998)
24. Todorovski, L., Blockeel, H., Džeroski, S.: Ranking with Predictive Clustering
Trees. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) ECML 2002. LNCS (LNAI),
vol. 2430, pp. 444–455. Springer, Heidelberg (2002)
25. Vembu, S., Gärtner, T.: Label Ranking Algorithms: A Survey. In: Fürnkranz, J.,
Hüllermeier, E. (eds.) Preference Learning. Springer, Heidelberg (2010)
Tracing Evolving Clusters
by Subspace and Value Similarity
1 Introduction
Temporal properties of patterns and their analysis are under active research [5].
A well-known type of pattern is the cluster, corresponding to a similarity-based
grouping of data objects. A good example of clusters is customer groups.
Clusters can change in the course of time and understanding this evolution can
be used to guide future decisions [5], e.g. predicting whether a specific customer
behavior will occur. The evolution can be mined by cluster tracing algorithms
that find mappings between clusters of consecutive time steps [8,13,14].
The existing algorithms have a severe limitation: Clusters are mapped if the
corresponding object sets are similar, i.e. the algorithms check whether the pos-
sibly matching clusters have a certain fraction of objects in common; they are
unable to map clusters with different objects, even if the objects have similar
attribute values. Our novel method, however, maps clusters only if their corre-
sponding object values are similar, independently of object identities. That is,
we trace similar behavior types, which is a fundamentally different concept. This
is a relevant scenario, as the following two examples illustrate.
Consider scientific data of the earth’s surface with the attributes temperature
and smoke degree. The latter correlates with forest fire probability. The attribute
values are recorded over several months. In this dataset, at some point in time
a high smoke degree and high temperatures occur in the northern hemisphere;
six months later the same phenomenon occurs in the southern hemisphere, as
the seasons on the two hemispheres are shifted by half a year relative to each other. Another
example is the customer behavior of people in different countries. Often it is
similar, but shifted in time. For example, the customer behavior in Europe is
similar to the behavior in North America, but only some months later. Obviously,
a cluster tracing algorithm should detect these phenomena; however, existing
methods do not, since the observed populations, i.e. the environment and the
people respectively, stay at the same place, and thus there are no shared objects
between clusters — only the behavior migrates.
With today’s complex data, patterns are often hidden in different subsets
of the dimensions; for detecting these clusters with locally relevant dimensions,
subspace clustering was introduced. However, even though many temporal data
sets are of this kind, e.g. gridded scientific data, subspace clustering has never
been used in a cluster tracing scenario. The existing cluster tracing methods
can only cope with fullspace clusters, and thus cannot exploit the information
mined by subspace clustering algorithms. Our novel tracing method measures
the subspace similarity of clusters and thus handles subspace clusters by design.
In summary, we introduce a method for tracing behavior types in tempo-
ral data; the types are represented by clusters. The decision, which clusters of
consecutive time steps are mapped is based on a novel distance function that
tackles the challenges of object value similarity and subspace similarity. Our
approach can handle the following developments: emerging or disappearing be-
havior as well as distinct behaviors that converge into uniform behavior and
uniform behavior that diverges into distinct behaviors. By using subspaces, we
enable the following evolutions: Behavior can gain or lose characteristics; i.e.,
the representing subspace clusters can gain or lose dimensions over time, and
clusters that have different relevant dimensions can be similar. Varying behavior
can be detected; that is, to some extent the values of the representing clusters
can change.
Clusterings of three time steps are illustrated in Fig. 1. The upper part shows
the objects; the lower part abstracts from the objects and illustrates possible
clusterings of the datasets and tracings between the corresponding clusters. Note
that the three time steps do not share objects, i.e. each time step corresponds to
a different database from the same attribute domain {d1 , d2 }; to illustrate the
different objects, we used varying object symbols. An example for behavior that
gains characteristics is the mapping of Cluster C1,1 to C2,1 , i.e. the cluster gains
[Fig. 1 (illustration): clusterings at three consecutive time steps over the attributes d1 and d2, with clusters C1,1, C1,2, C2,1, C2,2, C3,1, C3,2, C3,3, C4,1, C4,2 and mappings w1–w7.]
one dimension. Varying behavior is illustrated by the mapping from C1,2 to C2,2 ;
the values of the cluster have changed. If the databases were spatial, this could
be interpreted as a movement. A behavior divergence can be seen from time step
t + 1 to t + 2: the single cluster C2,1 is mapped to the two clusters C3,1 and C3,2 .
2 Related Work
Several temporal aspects of data are regarded in the literature [5]. In stream clus-
tering scenarios, clusters are adapted to reflect changes in the observed data, i.e.
the distribution of incoming objects changes [2]. A special case of stream cluster-
ing is for moving objects [10], focusing on spatial attributes. Stream clustering
in general, however, gives no information about the actual cluster evolution over
time [5]. For this, cluster tracing algorithms were introduced [8,13,14]; they rely
on mapping clusters of consecutive time steps. These tracing methods map clus-
ters if the corresponding object sets are similar, i.e. they are based on shared
objects. We, in contrast, map clusters only if their corresponding object values
are similar, independently of shared objects. That is, we trace similar types of
behavior, which is a fundamentally different concept.
Clustering of trajectories [7,15] can be seen as an even more limited variant
of cluster tracing with similar object sets, as trajectory clusters have constant
object sets that do not change over time.
The work in [1] analyzes multidimensional temporal data based on dense
regions that can be interpreted as clusters. The approach is designed to detect
substantial changes of dense regions; however, tracing of evolving clusters that
slightly change their position or subspace is not possible, especially when several
time steps are observed.
A further limitation of existing cluster tracing algorithms is that they can
only cope with fullspace clusters. Fullspace clustering models use all dimensions
in the data space [6]. For finding clusters hidden in individual dimensions, sub-
space clustering was introduced [4]. An overview of different subspace clustering
approaches can be found in [9], and the differences between subspace clustering
approaches are evaluated in [11]. Until now, subspace clusters were only ap-
plied in streaming scenarios [3], but never in cluster tracing scenarios; deciding
whether subspace clusters of varying dimensionalities are similar is a challenging
issue. Our algorithm is designed for this purpose.
These categories show whether a behavior appears in similar ways in the sub-
sequent time step. Since the characteristics of a behavior can naturally change
over time, we also trace single behaviors over several time steps, denoted as an
evolving cluster and described by a path in the mapping graph.
Evolving clusters are correctly identified only if specific evolution criteria are accounted
for in our distance function. These are presented in the following section.
are considered to be similar even if they lose some relevant dimensions. That is,
the smaller the term $1 - \frac{|S_{t,i} \cap S_{t+1,j}|}{|S_{t,i}|}$, the more similar the clusters are.
This formula alone, however, would prevent an information gain: If a cluster
Ct,i evolves to Ct+1,j by spanning more relevant dimensions, this would not
be assessed positively. We would get the same distance for a cluster with the
same shared dimensions as Ct,i and without additional relevant dimensions as
Ct+1,j . Since more dimensions mean more information, we do consider this.
Consequently, the smaller the term $1 - \frac{|S_{t+1,j} \setminus S_{t,i}|}{|S_{t+1,j}|}$, the more similar the clusters.
Usually it is more important for tracing that we retain relevant dimensions.
Few shared dimensions and many new ones normally do not indicate similar be-
havior. Thus, we need a trade-off between retained dimensions and new (gained)
dimensions. This is achieved by a linear combination of the two introduced terms:
$$KL(Y_d \,\|\, X_d) = \ln\!\left(\frac{\sigma_x}{\sigma_y}\right) + \frac{\sigma_y^2 + (\mu_y - \mu_x)^2}{2\sigma_x^2} - \frac{1}{2} =: KL(C_{t,i}, C_{t+1,j}, d)$$
By using the KL divergence, we do not just account for the absolute deviation of the
means, but we also have the advantage of including the variances. A behavior
with a high variance in a single dimension allows a larger evolution of the means
for successive similar behaviors; a small variance of the values, however, only
permits a smaller deviation of the means.
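For illustration, the per-dimension KL term above can be computed as follows (a sketch with our own names; the two example calls show that a larger variance tolerates a larger shift of the mean).

```python
# Sketch (our own names) of the per-dimension value term above: the KL divergence
# between the Gaussian value distributions of two clusters in one shared dimension.
import math

def kl_dimension(mu_x, sigma_x, mu_y, sigma_y):
    # KL(Y_d || X_d) for Y_d ~ N(mu_y, sigma_y^2) and X_d ~ N(mu_x, sigma_x^2)
    return (math.log(sigma_x / sigma_y)
            + (sigma_y ** 2 + (mu_y - mu_x) ** 2) / (2 * sigma_x ** 2)
            - 0.5)

# A large variance in the first cluster tolerates a larger shift of the mean:
print(kl_dimension(0.0, 2.0, 1.0, 2.0))   # ~0.125
print(kl_dimension(0.0, 0.5, 1.0, 0.5))   # ~2.0
```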
[Figure (illustration of the core dimension concept): clusters annotated with their relevant dimensions and core dimensions.]
We use the KL divergence for the per-dimension similarity, and the overall similarity is
obtained by accumulating over several dimensions. Naturally, we only use
dimensions that are in the intersection of both clusters' relevant subspaces; the remaining
dimensions are non-relevant for at least one cluster and hence are already penalized by
our subspace distance function. Our first approach for computing the similarity
based on statistical characteristics is
$$V(C_{t,i}, C_{t+1,j}, I) = \Big(\sum_{d \in I} KL(C_{t,i}, C_{t+1,j}, d)\Big) / |I| \qquad (1)$$
with the penalty factor β ∈ [0, 1] for the dimensions NonCore = (St,i ∩ St+1,j) \ Core.
By selecting a smaller core, the first part of the distance formula grows; the
second part, however, may attain a smaller value. The
core must comprise at least one dimension; otherwise, we could map two clusters
even if they have no dimensions with similar characteristics.
Overall distance function. To correctly identify the evolving clusters in
our temporal data we have to consider evolutions in the relevant dimensions as
well as in the value distributions. Thus, we have to use both distance measures
simultaneously. Again, we require that two potentially mapped clusters share at
least one dimension; otherwise, these clusters cannot represent similar behaviors.
Definition 6. The Overall distance function for clusters Ct,i = (Ot,i , St,i )
and Ct+1,j = (Ot+1,j , St+1,j ) with |St,i ∩ St+1,j | > 0 is defined by
dist(Ct,i , Ct+1,j ) = γ · V (Ct,i , Ct+1,j ) + (1 − γ) · S(Ct,i , Ct+1,j )
with γ ∈ [0, 1]. In the case of |St,i ∩ St+1,j | = 0, the distance is set to ∞.
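A minimal sketch of Definition 6 is given below; the value distance follows Eq. (1) without the core-dimension refinement, the subspace distance is taken as an already-computed number because its full formula is not reproduced here, γ = 0.3 is the default reported in the experiments, and all names are our assumptions.

```python
# Minimal sketch of Definition 6 under our assumptions: each cluster provides its
# relevant dimensions and per-dimension (mean, std); the value distance follows
# Eq. (1) without the core refinement; the subspace distance s_subspace is taken
# as a precomputed number; gamma = 0.3 is the default reported in the experiments.
import math

def kl_d(mu_x, sd_x, mu_y, sd_y):            # per-dimension KL term (see above)
    return math.log(sd_x / sd_y) + (sd_y ** 2 + (mu_y - mu_x) ** 2) / (2 * sd_x ** 2) - 0.5

def overall_distance(stats_a, dims_a, stats_b, dims_b, s_subspace, gamma=0.3):
    """stats_* map a dimension to the (mean, std) of the cluster's values there."""
    shared = dims_a & dims_b
    if not shared:                            # no shared relevant dimension: distance is infinite
        return math.inf
    v = sum(kl_d(*stats_a[d], *stats_b[d]) for d in shared) / len(shared)   # Eq. (1)
    return gamma * v + (1 - gamma) * s_subspace
```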
We now introduce how temporal relations between time steps can be exploited.
Predecessor information. We assume an initial clustering at time step
t = 1. (We discuss this later.) Caused by the temporal aspect of the data,
clusters at a time step t occur with high probability in t + 1 — not identical,
but similar. Given a cluster and the corresponding hypercube HS at time step
t, we try to find a cluster at the next time step in a similar region. We use a
Monte Carlo approach, i.e. we draw a random point mt+1 ∈ RD that represents
the initiator of a new hypercube and that is nearby the mean mHS of HS . After
inducing an hypercube by an initiator, the corresponding cluster’s validity is
checked. The quantity of initiators is calculated by a formula introduced in [16].
Definition 8. Initiator of a hypercube. A point p ∈ R^D, called initiator,
together with a width w and a subspace S induces a hypercube H_S^w(p) defined by
∀d ∈ S: low_d = p[d] − w/2, up_d = p[d] + w/2, and ∀i ∉ S: low_i = −∞, up_i = ∞.
Formally, the initiator mt+1 is drawn from the region H_S^{2w}(mHS), permitting a
change of the cluster. The new hypercube is then H_S^w(mt+1). With this method
we detect changes in the values; however, also the relevant dimensions can
change: The initiator mt+1 can induce different hypercubes for different rele-
vant dimensions S. Accordingly, beside the initiator, we have to determine the
relevant subspace of the new cluster. The next section discusses both issues.
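The following small sketch illustrates the initiator idea of Definition 8 under our own assumptions: the candidate initiator is drawn uniformly within ±w of the previous hypercube's mean (only the dimensions of S matter for the induced cube), and the induced hypercube is unbounded outside S.

```python
# Sketch (our own simplification) of the predecessor-based initiator: draw a
# candidate initiator near the previous hypercube's mean and induce a hypercube
# of width w in subspace S, unbounded outside S (Definition 8).
import numpy as np

def draw_initiator(mean_prev, w, rng):
    # Uniform draw within +/- w of the previous mean; only dimensions of S matter
    # for the induced hypercube, so drawing in all dimensions is a simplification.
    return mean_prev + rng.uniform(-w, w, size=mean_prev.shape)

def induce_hypercube(p, w, subspace, n_dims):
    low = np.full(n_dims, -np.inf)
    up = np.full(n_dims, np.inf)
    idx = list(subspace)
    low[idx] = p[idx] - w / 2
    up[idx] = p[idx] + w / 2
    return low, up

rng = np.random.default_rng(0)
m_prev = np.array([0.5, 0.2, 0.8])
p = draw_initiator(m_prev, w=0.1, rng=rng)
print(induce_hypercube(p, w=0.1, subspace={0, 2}, n_dims=3))
```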
Determining the best cluster. A first approach is to use a quality function
[12,16]: μ(H_S) = Obj(H_S) · k^|S|. The more objects or the more relevant dimensions
are covered by the cluster, the higher its quality. These objectives are
contrary; a trade-off is realized with the parameter k. In time step t + 1 we could
choose the subspace S that maximizes μ(H_S^w(mt+1)).
This method, however, optimizes the quality of each single cluster; it is not
intended to find good tracings. Possibly, the distance between each cluster from
the previous clustering Clust and our new cluster is large, and we would find no
similar behaviors. Our solution is to directly integrate the distance function dist
into the quality function. Consequently, we choose the subspace S such that the
hypercube HSw (mt+1 ) maximizes our novel distance based quality function.
Definition 9. Distance based quality function. Given the hypercube HS in
subspace S and a clustering Clust , the distance based quality function is
4 Experiments
Setup. We use real world and synthetic data for evaluation. Real world data
are scientific grid data reflecting oceanographic characteristics such as temperature
and salinity of the oceans¹. It contains 20 time steps, 8 dimensions, and 71,430
objects. The synthetic data cover 24 time steps and 20 dimensions. On average,
each time step contains 10 clusters with 5-15 relevant dimensions. We hide de-
velopments (emerge, converge, diverge, or disappear) and evolutions (subspace
and value changes) within the data. In our experiments we concentrate on the
quality of our approach. For synthetic data, the correct mappings between the
clusters are given. Based on the detected mappings we calculate the precision
and recall values: we check whether all but only the true mappings between
clusters are detected. For tracing quality we use the F1 value corresponding to
the harmonic mean of recall and precision. Our approach tackles the problem of
tracing clusters with varying subspaces and is based on object-value-similarity.
Even if we constrained our approach to handle only full-space clusters, as existing
solutions do, such a comparison would only be possible if we artificially added
object ids to the data (to be used by these solutions). Tracing clusters based on
such artificial object ids, however, cannot reflect the ground truth in the data.
In summary, comparisons to other approaches are not performed since they would
be unfair. We use Opteron 2.3 GHz CPUs and Java 6 (64-bit).
Tracing quality. First, we analyze how the parameters affect the tracing
effectiveness. For lack of space, we only present a selection of the experiments.
For α, a default value of 0.1 was empirically determined. γ is evaluated in Fig. 4
for three different τ values using synthetic data. By γ we determine the trade-off
between subspace similarity and value similarity in our overall distance function.
Obviously we want to prevent extreme cases for effective tracing, i.e. subspace
similarity with no attribute similarity at all (γ → 0), or vice versa. This is
confirmed by the figure, as the tracing quality highly degrades, when γ reaches
0 or 1 for all τ values. As γ = 0.3 enables a good tracing quality for all three τ ,
¹ Provided by the Alfred Wegener Institute for Polar and Marine Research, Germany.
[Fig. 4 (plot). Tracing quality for different γ and τ; x-axis: γ (trade-off between values and subspaces), y-axis: tracing quality.]
[Fig. 5 (plot). Evaluation of the core dimension concept; x-axis: β (penalty for non-core dimensions), left y-axis: tracing quality, right y-axis: number of non-core dimensions.]
we use this as the default. Note that with the threshold τ we can directly influence
how many cluster mappings are created. τ = 0.1 is a good trade-off and is used
as the default. With a larger τ the tracing quality worsens: too many mappings are
created and we cannot distinguish between meaningful and meaningless mappings.
The opposite also holds for τ → 0: no clusters are mapped and thus the tracing
quality drops to zero; we therefore excluded plots for τ → 0.
The core dimension concept is evaluated in Fig. 5. We analyze the influence on
the tracing quality (left axis) with a varying β on the x-axis; i.e., we change the
penalty for non-core dimensions. Note that non-core dimensions are a different
concept from non-relevant ones; non-core dimensions are shared relevant dimensions
with differing values. The higher the penalty, the more dimensions are included
in the dimension core; i.e., more shared dimensions are used for the value-based
similarity. In a second curve, we show the absolute number of non-core dimen-
sions (right axis) for the different penalties: the number decreases with higher
penalties. In this experiment the exact number of non-core dimensions in the
synthetic data is 10. We can draw the following conclusions regarding tracing
quality: A forced usage of a full core (β → 1) is a bad choice, as there can be
some shared dimensions with different values. By lowering the penalty we allow
some dimensions to be excluded from the core and thus we can increase the
tracing quality. With β = 0.1 the highest tracing quality is obtained; this is
plausible as the number of non-core dimensions then corresponds to the number
present in the data. Too low a penalty, however, results in excluding nearly
all dimensions from the core (many non-core dimensions, β → 0) and dropping
quality. In the experiments, we use β = 0.1 as default.
Detection of behavior developments. Next we analyze whether our model
is able to detect the different behavior developments. Up to now, we used our
enhanced clustering method that utilizes the predecessor information and the
distance based quality function. Now, we additionally compare this method with
a variant that performs clustering of each step independently. In Fig. 6 we use
the oceanographic dataset and we determine for each time step the number of
disappeared behaviors for each clustering method. The experiment indicates that
the number of unmapped clusters for the approach without any predecessor or
distance information is larger than for our enhanced approach. By transferring
the clustering information between the time steps, a larger number of clusters
from one time step to the next can be mapped. We map clusters over a longer
time period, yielding a more effective tracing of evolving clusters.
Tracing Evolving Clusters by Subspace and Value Similarity 455
numberofoccurences
#ofdissapearedclusters 100
12
8
10
4
emerge
converge
diverge
dimension
dimension
disappear
1
gain
loss
0 5 10 15 20
timestep
numberofoccurences
1000
1000
100
100
10
1 10
diverge
emerge
disappear
converge
dimension
dimension
1
gain
loss
0 5 10 15 20
timestep
Fig. 8. Number of evolutions & developments on real world data; left: cumulated over
20 time steps, right: for each time step
The aim of tracing is not just to map similar clusters but also to identify
different kinds of evolution and development. In Fig. 7 we plot the number of
clusters that gain or lose dimensions and the four kinds of development cumu-
lated over all time steps. Beside the numbers our approach detects, we show the
intended number based on this synthetic data. The first four bars indicate that
our approach is able to handle dimension gains or losses; i.e., we enable sub-
space cluster tracing, which is not considered by other models. The remaining
bars show that also the developments can be accurately detected. Overall, the
intended transitions are found by our tracing. In Fig. 8 we perform a similar
experiment on real world data. We report only the detected number of patterns
because exact values are not given. On the left we cumulate over all time steps.
Again, our approach traces clusters with varying dimensions. Accordingly, on
real world data it is a relevant scenario that subspace clusters lose some of their
characteristics, and it is mandatory to use a tracing model that handles these
cases. The developments are also identified in this real world data. To show
that the effectiveness is not restricted to single time steps, we analyze the de-
tected patterns for each time step individually on the right. Based on the almost
constant slopes of all curves, we can see that our approach performs effectively.
5 Conclusion
In this paper, we proposed a model for tracing evolving subspace clusters in high
dimensional temporal data. In contrast to existing methods, we trace clusters
456 S. Günnemann et al.
based on their behavior; that is, clusters are not mapped based on the fraction of
objects they have in common, but on the similarity of their corresponding object
values. We enable effective tracing by introducing a novel distance measure that
determines the similarity between clusters; this measure comprises subspace and
value similarity, reflecting how much a cluster has evolved. In the experimental
evaluation we showed that high quality tracings are generated.
References
1. Aggarwal, C.C.: On change diagnosis in evolving data streams. TKDE 17(5), 587–
600 (2005)
2. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving
data streams. In: VLDB, pp. 81–92 (2003)
3. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for projected clustering
of high dimensional data streams. In: VLDB, pp. 852–863 (2004)
4. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clus-
tering of high dimensional data for data mining applications. In: SIGMOD, pp.
94–105 (1998)
5. Böttcher, M., Höppner, F., Spiliopoulou, M.: On exploiting the power of time in
data mining. SIGKDD Explorations 10(2), 3–11 (2008)
6. Ester, M., Kriegel, H.P., Jörg, S., Xu, X.: A density-based algorithm for discovering
clusters in large spatial databases with noise. In: KDD, pp. 226–231 (1996)
7. Gaffney, S., Smyth, P.: Trajectory clustering with mixtures of regression models.
In: KDD, pp. 63–72 (1999)
8. Kalnis, P., Mamoulis, N., Bakiras, S.: On discovering moving clusters in spatio-
temporal data. In: Anshelevich, E., Egenhofer, M.J., Hwang, J. (eds.) SSTD 2005.
LNCS, vol. 3633, pp. 364–381. Springer, Heidelberg (2005)
9. Kriegel, H.P., Kröger, P., Zimek, A.: Clustering high-dimensional data: A sur-
vey on subspace clustering, pattern-based clustering, and correlation clustering.
TKDD 3(1), 1–58 (2009)
10. Li, Y., Han, J., Yang, J.: Clustering moving objects. In: KDD, pp. 617–622 (2004)
11. Müller, E., Günnemann, S., Assent, I., Seidl, T.: Evaluating clustering in subspace
projections of high dimensional data. In: VLDB, pp. 1270–1281 (2009)
12. Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.M.: A monte carlo algorithm
for fast projective clustering. In: SIGMOD, pp. 418–427 (2002)
13. Rosswog, J., Ghose, K.: Detecting and tracking spatio-temporal clusters with adap-
tive history filtering. In: ICDM Workshops, pp. 448–457 (2008)
14. Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., Schult, R.: MONIC - modeling and
monitoring cluster transitions. In: KDD, pp. 706–711 (2006)
15. Vlachos, M., Gunopulos, D., Kollios, G.: Discovering similar multidimensional tra-
jectories. In: ICDE, pp. 673–684 (2002)
16. Yiu, M.L., Mamoulis, N.: Frequent-pattern based iterative projected clustering. In:
ICDM, pp. 689–692 (2003)
An IFS-Based Similarity Measure to Index
Electroencephalograms
1 Introduction
the stationarity assumption does not hold during episodes of physical or men-
tal activity, such as changes in alertness and wakefulness, during eye blinking
and during transitions between various ictal states. Therefore, EEG signals are
quasi-stationary. In view of that, we propose a similarity measure based on IFS
interpolation to index EEGs in this paper, as fractal interpolation does not as-
sume stationarity of the data and can adequately model complex structures.
Moreover, using fractal interpolation makes computing features such as the frac-
tal dimension simple (see theorem 21 for the link between fractal interpolation
parameters and fractal dimension) and the fractal dimension of EEGs is known
to be a relevant marker for some pathologies such as dementia (see [7]).
2 Background
2.1 Fractal Interpolation
Fractal dimension. Any given time series can be viewed as the observed data
generated by an unknown manifold or attractor. One important property of this
attractor is its fractal dimension. The fractal dimension of an attractor counts
the effective number of degrees of freedom in the dynamical system and therefore
quantifies its complexity. It can also be seen as the statistical quantity that gives
an indication of how completely a fractal object appears to fill space, as one
zooms down to finer and finer scales. Another dimension, called the topological
dimension or Lebesgue Covering dimension, is also defined for any object and a
fortiori for the attractor. A space has Lebesgue Covering dimension n if for every
open cover4 of that space, there is an open cover that refines it such that the
refinement5 has order at most n + 1. For example, the topological dimension of
the Euclidean space Rn is n. The attractor of a time series can be fractal (ie its
fractal dimension is higher than its topological dimension) and is then called a
strange attractor. The fractal dimension is generally a non-integer or fractional
number. Typically, for a time series, the fractal dimension is comprised between
1 and 2 since the (topological) dimension of a plane is 2 and that of a line is 1.
The fractal dimension has been used to:
– uncover patterns in datasets and cluster data ([10,2,15])
– analyse medical time series ([14,6]) such as EEGs ([1,7])
– determine the number of features to be selected from a dataset for a similarity
search while obviating the "dimensionality curse" ([12])
We also define an operator T on C(K) as (Tf)(x) = Σ_{i=1}^{n} p_i (f ∘ w_i)(x). If T
maps C(K) into itself, then the pair (w_i, p_i) is called an iterated function system
on (K, d). The condition on T is satisfied for any set of probabilities p_i if the
transformations w_i are contracting, in other words, if, for any i, there exists a
δ_i < 1 such that d(w_i(x), w_i(y)) ≤ δ_i d(x, y) for all x, y ∈ K. The IFS is also
denoted as hyperbolic in this case.
After determining the contraction parameter d_i, we can estimate the four re-
maining parameters (namely a_i, c_i, e_i, f_i):

a_i = (x_i − x_{i−1}) / (x_n − x_0)    (1)

c_i = (x_n x_{i−1} − x_0 x_i) / (x_n − x_0)    (2)

e_i = (y_i − y_{i−1}) / (x_n − x_0) − d_i (y_n − y_0) / (x_n − x_0)    (3)

f_i = (x_n y_{i−1} − x_0 y_i) / (x_n − x_0) − d_i (x_n y_0 − x_0 y_n) / (x_n − x_0)    (4)
The vertical scaling factors d_i satisfy 0 ≤ d_i < 1 and the constants a_i, c_i, e_i and
f_i are defined as in section 2.1 (in equations 1, 2, 3 and 4) for i = 1, 2, ..., n.
We denote by G the attractor of the IFS, such that G is the graph of a fractal
interpolation function associated with the set of points.
If Σ_{i=1}^{n} |d_i| > 1 and the interpolation points do not lie on a straight line, then
the fractal dimension of G is the unique real solution D of Σ_{i=1}^{n} |d_i| a_i^{D−1} = 1.
Given the computed similarity matrix S (defined by equation 5), we can use the
k-medoids algorithm to cluster the EEGs. This algorithm requires the number
of clusters k to be known. We describe our choice of the number of clusters
below, in section 2.3. The k-medoids algorithm is similar to k-means and can be
applied through the use of the EM algorithm. Initially, k random elements are
chosen as representatives of the k clusters. At each iteration, a representative
element of a cluster is replaced by a randomly chosen non-representative element
of the cluster if the selected criterion (e.g. mean-squared error) is improved by
this choice. The data points are then reassigned to their closest cluster, given
the new cluster representative elements. The iterations stop when no
reassignment is possible. We use the PyCluster function kmedoids described in
[5] to perform our k-medoids clustering.
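The paper relies on the Pycluster kmedoids routine for this step; as a self-contained illustration, the following sketch mirrors the swap-based procedure just described, assuming a precomputed pairwise distance matrix D (derived, for instance, from the similarity matrix S). Function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def kmedoids_sketch(D, k, max_iter=100, seed=0):
    """k-medoids clustering of n objects given an n x n distance matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)        # k random representatives
    labels = np.argmin(D[:, medoids], axis=1)             # assign points to closest medoid
    best_cost = D[np.arange(n), medoids[labels]].sum()    # total within-cluster distance
    for _ in range(max_iter):
        improved = False
        for ci in range(k):
            members = np.flatnonzero(labels == ci)
            if members.size == 0:
                continue
            candidate = rng.choice(members)                # random non-representative element
            if candidate in medoids:
                continue
            trial = medoids.copy()
            trial[ci] = candidate                          # tentative swap of the representative
            trial_labels = np.argmin(D[:, trial], axis=1)  # reassign points to closest cluster
            cost = D[np.arange(n), trial[trial_labels]].sum()
            if cost < best_cost:                           # keep the swap only if it improves
                medoids, labels, best_cost = trial, trial_labels, cost
                improved = True
        if not improved:                                   # stop when no swap/reassignment helps
            break
    return medoids, labels
```

A call such as kmedoids_sketch(D, k) returns the medoid indices and the cluster assignment of every EEG.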
We interpolate each channel of each EEG (except the annotations channel) using
piecewise fractal interpolation. For this purpose, we split each EEG channel
into windows and then estimate the IFS for each window. The previous description
implies that a few parameters, namely the window size and therefore the
embedding dimension, have to be determined before estimating the piecewise
fractal interpolation function for each channel. The embedding dimension is
determined thanks to Takens' theorem, which states that, for the attractor of a
time series to be reconstructed correctly (i.e. the same information content is
found in the state (latent) and observation spaces), the embedding dimension m
must satisfy m > 2D + 1, where D is the dimension of the attractor, in
other words its fractal dimension. Since the fractal dimension of a time series
is between 1 and 2, we can get a satisfactory embedding dimension as long as
m > 2 * 2 + 1, i.e. m > 5. We therefore choose an embedding dimension equal to
6. We choose the lag τ between different elements of the delay vector to be
equal to the average duration of an EEG data record, i.e. 1 s. Therefore, we split
our EEGs into (non-overlapping) windows of 6 seconds. A standard 20-minute
EEG (which therefore contains about 1200 data records of 1 second) would then
be split into about 200 windows of 6 seconds. Each window is subdivided into
intervals of one second each and the end-points of these intervals are taken as
interpolation points. This means there are 7 interpolation points per window:
the starting point p0 of the window, the point one second away from p0, the
point two seconds away from p0, the point three seconds away from p0, the point four
seconds away from p0, the point five seconds away from p0 and the last point of
the window. The algorithm to compute the fractal interpolation function per
window (inspired from [11]) is as follows:
1. Choose, as an initial point, the starting point of the interval considered (the
first interval considered is the interval corresponding to the first second of
the window).
2. Choose, as the end point of the interval considered, the next interpolation
point.
3. Compute the contraction factor d for the interval considered.
4. If |d| > 1 go to 2, otherwise go to 5.
5. Form the map wi associated with the interval considered. In other words,
compute the a, c, e and f parameters associated with the interval (see equations
(1)-(4)). Apply the map to the entire window (i.e. the six-second window) to
yield wi(x, y) for every point (x, y) of the window.
6. Compute and store the distance between the original values of the time series
on the interval considered (i.e. the interval constructed in steps 2 and 3) and the
values given by wi on that interval. A possible distance is the Euclidean distance.
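As an illustration of steps 1-6, the sketch below estimates the affine map coefficients of equations (1)-(4) for one window. It assumes the standard Barnsley form of the maps, wi(x, y) = (a_i x + c_i, e_i x + d_i y + f_i), which is consistent with the parameter definitions above, and it takes the contraction factors d_i as given (their estimation is described earlier in the paper and omitted here). All names are illustrative.

```python
def ifs_window_maps(xs, ys, ds):
    """Affine map coefficients (a, c, d, e, f), per equations (1)-(4), for each
    interpolation interval of one window.  xs, ys: the interpolation points of
    the window (7 of them here); ds: contraction factors d_i, one per interval,
    assumed already estimated and satisfying |d_i| < 1."""
    x0, xn, y0, yn = xs[0], xs[-1], ys[0], ys[-1]
    maps = []
    for i in range(1, len(xs)):
        d = ds[i - 1]
        a = (xs[i] - xs[i - 1]) / (xn - x0)                              # eq. (1)
        c = (xn * xs[i - 1] - x0 * xs[i]) / (xn - x0)                    # eq. (2)
        e = (ys[i] - ys[i - 1]) / (xn - x0) - d * (yn - y0) / (xn - x0)  # eq. (3)
        f = (xn * ys[i - 1] - x0 * ys[i]) / (xn - x0) \
            - d * (xn * y0 - x0 * yn) / (xn - x0)                        # eq. (4)
        maps.append((a, c, d, e, f))
    return maps

def apply_map(coeffs, x, y):
    """Apply one map w_i to a point (x, y) of the window (step 5); step 6 would
    compare the mapped values with the original series on the interval."""
    a, c, d, e, f = coeffs
    return a * x + c, e * x + d * y + f
```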
After this fractal interpolation step, each window of each signal is represented by
5 parameters instead of signal frequency × window duration points. The
dimension of the analysed time series is therefore reduced in this step. For a
standard 20-minute EEG containing 23 signals with a frequency of 250 Hz, this
amounts to representing each signal with 1000 values instead of 50000 and the
whole EEG with 23000 values instead of 1150000, thus reducing the number of
signal values by almost 98%. This dimension reduction may be exploited in future
work to compress EEGs and store compressed representations of EEGs in the
database instead of raw EEGs, as the whole EEGs can be reconstructed from their
fractal interpolations. Further work needs to be done on the compression of EEG
data using fractal interpolation and the loss of information that may result from
this compression. Then, for each EEG channel and for each window, we compute
the fractal dimension thanks to theorem 21. The equation of theorem 21 is solved
numerically for each 6-second window of each EEG signal using a bisection
algorithm. As we know that the fractal dimension of a time series is between 1
and 2, we search for a root of the equation of theorem 21 in the interval [1,2] and
split the search interval in half at each iteration until the value of the root is
approached within an ε-margin (ε being the admissible error on the desired root;
we choose ε = 0.0001 in our experiments). Therefore, for each EEG channel, we
have the same number of computed fractal dimensions as the number of windows.
This feature extraction step (fractal dimension computation) further reduces the
dimensionality of the analysed time series. In fact, the number of values
representing the time series is divided by 5 in this step. This leads to representing
a standard 20-minute EEG containing 23 signals with a frequency of 250 Hz by
4600 values instead of the initial 1150000 points.
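A minimal sketch of this bisection step, assuming the per-interval parameters a_i and d_i of equations (1)-(4) are available for the window; it solves the equation of theorem 21 on [1, 2] up to the admissible error ε. The function name is illustrative.

```python
def fractal_dimension(a, d, eps=1e-4):
    """Solve sum_i |d_i| * a_i**(D - 1) = 1 for D in [1, 2] by bisection
    (theorem 21); a and d are the per-interval parameters of one window and
    eps is the admissible error on the root."""
    def g(D):
        return sum(abs(di) * ai ** (D - 1) for ai, di in zip(a, d)) - 1.0
    lo, hi = 1.0, 2.0
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        # g is decreasing in D (the a_i are < 1), so keep the half with the sign change
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```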
We only compare EEGs that have at least a subset of identical channels (i.e.
channels having the same labels). When two EEGs don't have any channels (except
the annotations channel) in common, the similarity measure between them is set to
1 (the larger (resp. smaller) the distance between two EEGs, the higher (resp.
lower), i.e. the closer to 1 (resp. to 0), the similarity measure). If, for the
two EEGs compared, the matching pairs of feature vectors (i.e. vectors made of
the fractal dimensions computed for each signal) do not have the same dimension,
then the vector of higher dimension is approximated by a histogram: the
M most frequent values according to the histogram (M being the dimension of
the shorter vector) are taken as representatives of that vector, and the distance
between the two feature vectors is approximated by the distance between the
shorter feature vector and the vector formed by the M most frequent values
of the longer vector. The similarity measure between two EEGs is given by:

(1/N) Σ_{i=1}^{N} (d(ch_i^{EEG_1}, ch_i^{EEG_2}) − d_min) / (d_max − d_min)
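The following sketch illustrates this comparison for one pair of EEGs, assuming each EEG is given as a mapping from channel labels to its vector of per-window fractal dimensions. The Euclidean channel distance, the histogram binning used to shorten the longer vector, and the names d_min and d_max (the smallest and largest channel distances observed over the collection) are illustrative assumptions, not details fixed by the paper.

```python
import numpy as np

def channel_distance(a, b):
    """Distance between two per-channel feature vectors of fractal dimensions.
    If they differ in length, the longer one is approximated by the M most
    frequent histogram values (M = length of the shorter one)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    if a.size != b.size:
        short, long_ = (a, b) if a.size < b.size else (b, a)
        counts, edges = np.histogram(long_, bins=max(long_.size // 5, short.size))
        centers = 0.5 * (edges[:-1] + edges[1:])
        top = np.sort(centers[np.argsort(counts)[::-1][:short.size]])
        a, b = short, top
    return float(np.linalg.norm(a - b))           # Euclidean distance (an assumption)

def eeg_similarity(fd1, fd2, d_min, d_max):
    """Similarity between two EEGs given dicts mapping channel labels to their
    fractal-dimension vectors; d_min/d_max normalise the channel distances."""
    common = [ch for ch in fd1 if ch in fd2 and ch != "annotations"]
    if not common:
        return 1.0                                # no shared channels: maximally dissimilar
    dists = [(channel_distance(fd1[ch], fd2[ch]) - d_min) / (d_max - d_min)
             for ch in common]
    return sum(dists) / len(common)
```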
5 Results
Figure 3 illustrates the relation between the duration of the EEG and the time it
takes to interpolate EEGs. It shows that the increase of the fractal interpolation
time with respect to the interpolated EEG’s duration is less than linear.
Fig. 3. Execution times of the fractal interpolation as a function of the EEG duration,
compared to AR modelling of the EEGs. The red triangles represent the fractal
interpolation execution times, the blue crosses the AR modelling execution times,
and the black stars the fit of the measured fractal interpolation execution times with
the function 1.14145161064 · (1 − exp(−(0.5 · x)^2.0)) + 275.735500586 ·
(1 − exp(−(0.000274218988011 · x)^2.12063087537)), obtained using the
Levenberg-Marquardt algorithm
– most of the misclassified abnormal EEGs represent mild forms of the pathology,
and therefore their deviation from a normal EEG is minimal;
– most of the misclassified abnormal EEGs (in particular for epilepsy and brain
damage) exhibit abnormalities on only a restricted number of channels (localised
versions of the pathologies considered). The similarity measures, giving equal
weights to all channels, are not sensitive enough to abnormalities affecting one
channel. In future work, we will explore the influence of channel weights on the
clustering performance.
About 76% of the normal EEGs are well classified. The remaining misclassified
EEGs are misclassified because they exhibit artifacts, age-specific patterns and/or
sleep-specific patterns that distort the EEGs significantly enough to make them
seem abnormal. Filtering artifacts before computing the similarity measures and
incorporating metadata knowledge in the similarity measure would improve the
clustering results.
6 Conclusion
References
1. Accardo, A., Affinito, M., Carrozzi, M., Bouquet, F.: Use of the fractal dimen-
sion for the analysis of electroencephalographic time series. Biological Cybernet-
ics 77(5), 339–350 (1997)
2. Barbará, D., Chen, P.: Using the fractal dimension to cluster datasets. In: KDD,
pp. 260–264 (2000)
3. Barnsley, M.: Fractals everywhere. Academic Press Professional, Inc., San Diego
(1988)
4. Climescu-Haulica, A.: How to Choose the Number of Clusters: The Cramer Mul-
tiplicity Solution. In: Decker, R., Lenz, H.J. (eds.) Advances in Data Analysis,
Proceedings of the 30th Annual Conference of the Gesellschaft für Klassifikation
e.V., Freie Universität Berlin, March 8-10. Studies in Classification, Data Analysis,
and Knowledge Organization, pp. 15–22. Springer, Heidelberg (2006)
5. De Hoon, M., Imoto, S., Nolan, J., Miyano, S.: Open source clustering software.
Bioinformatics 20, 1453–1454 (2004),
http://portal.acm.org/citation.cfm?id=1092875.1092876
6. Eke, A., Herman, P., Kocsis, L., Kozak, L.: Fractal characterization of complexity
in temporal physiological signals. Physiological Measurement 23(1), R1–R38 (2002)
7. Goh, C., Hamadicharef, B., Henderson, G.T., Ifeachor, E.C.: Comparison of Fractal
Dimension Algorithms for the Computation of EEG Biomarkers for Dementia. In:
Proceedings of the 2nd International Conference on Computational Intelligence in
Medicine and Healthcare (CIMED 2005), Costa da Caparica, Lisbon, Portugal,
June 29-July 1 (2005)
8. Hao, L., Ghodadra, R., Thakor, N.V.: Quantification of Brain Injury by EEG
Cepstral Distance during Transient Global Ischemia. In: Proceedings - 19th Inter-
national Conference - IEEE/EMBS, Chicago, IL., USA, October 30-November 2
(1997)
9. Kalpakis, K., Gada, D., Puttagunta, V.: Distance Measures for Effective Clus-
tering of ARIMA Time-Series. In: ICDM 2001: Proceedings of the 2001 IEEE
International Conference on Data Mining, pp. 273–280. IEEE Computer Society,
Washington, DC (2001)
10. Lin, G., Chen, L.: A Grid and Fractal Dimension-Based Data Stream Clustering
Algorithm. In: International Symposium on Information Science and Engieering,
vol. 1, pp. 66–70 (2008)
11. Mazel, D.S., Hayes, M.H.: Fractal modeling of time-series data. In: Conference
Record of the Twenty-Third Asilomar Conference of Signals, Systems and Com-
puters, pp. 182–186 (1989)
12. Malcok, M., Aslandogan, Y.A., Yesildirek, A.: Fractal dimension and similarity
search in high-dimensional spatial databases. In: IRI, pp. 380–384 (2006)
13. Pachori, R.B.: Discrimination between ictal and seizure-free EEG signals using
empirical mode decomposition. Res. Let. Signal Proc. 2008, 1–5 (2008)
14. Sarkar, M., Leong, T.Y.: Characterization of medical time series using fuzzy
similarity-based fractal dimensions. Artificial Intelligence in Medicine 27(2), 201–
222 (2003)
15. Yan, G., Li, Z.: Using cluster similarity to detect natural cluster hierarchies. In:
FSKD (2), pp. 291–295 (2007)
DISC: Data-Intensive Similarity Measure for
Categorical Data
1 Introduction
The concept of similarity is fundamentally important in almost every scientific
field. Clustering, distance-based outlier detection, classification and regression
are major data mining techniques which compute the similarities between in-
stances and hence choice of a particular similarity measure can turn out to be a
major cause of success or failure of the algorithm. For these tasks, the choice of
a similarity measure can be as important as the choice of data representation or
feature selection. Most algorithms typically treat the similarity computation as
an orthogonal step and can make use of any measure. Similarity measures can
be broadly divided into two categories: similarity measures for continuous data
and for categorical data.
2 Related Work
Determining similarity measures for categorical data is a much studied field as
there is no explicit notion of ordering among categorical values. Sneath and
Sokal were among the first to put together many of the categorical similarity
measures and discuss them in detail in their book [2] on numerical taxonomy.
The specific problem of clustering categorical data has been actively stud-
ied. There are several books [3,4,5] on cluster analysis that discuss the problem
of determining similarity between categorical attributes. The problem has also
been studied recently in [17,18]. However, most of these approaches do not offer
solutions to the problem discussed in this paper, and the usual recommenda-
tion is to “binarize” the data and then use similarity measures designed for
binary attributes. Most work has been carried out on development of clustering
algorithms and not similarity functions. Hence these works are only marginally
or peripherally related to our work. Wilson and Martinez [6] performed a de-
tailed study of heterogeneous distance functions (for categorical and continuous
attributes) for instance based learning. The measures in their study are based
upon a supervised approach where each data instance has class information in
addition to a set of categorical/continuous attributes.
There have been a number of new data mining techniques for categorical data
that have been proposed recently. Some of them use notions of similarity which
are neighborhood-based [7,8,9], or incorporate the similarity computation into
the learning algorithm [10,11]. These measures are useful for computing the
neighborhood of a point and for neighborhood-based methods, but not for
calculating the similarity between a pair of data instances. In the area of
information retrieval, Jones et al. [12] and Noreault et al. [13] have studied
several similarity measures.
3 Problem Formulation
In this section we discuss the necessary conditions for a valid similarity measure.
Later, in Section 4.5 we describe how DISC satisfies these requirements and
prove the validity of our algorithm. The following conditions need to hold for a
distance metric “d” to be valid where d(x, y) is the distance between x and y.
1. d(x, y) ≥ 0
2. d(x, y) = 0 if and only if x=y
3. d(x, y) = d(y, x)
4. d(x, z) ≤ d(x, y) + d(y, z)
In order to come up with conditions for a valid similarity measure we use
sim = 1/(1 + dist), a distance-similarity mapping used in [1]. Based on this
mapping we come up with the following definitions for valid similarity measures:
1. 0 ≤ Sim(x, y) ≤ 1
2. Sim(x, y) = 1 if and only if x = y
3. Sim(x, y) = Sim(y, x)
4. 1/Sim(x, y) + 1/Sim(y, z) ≥ 1 + 1/Sim(x, z)
4 DISC Algorithm
In this section we present the DISC algorithm. First in Section 4.1 we present
the motivation for our algorithm followed by data-structure description in Sec-
tion 4.2 and a brief overview of the algorithm in Section 4.3. We then describe the
algorithm for similarity matrix computation in Section 4.4. Finally in Section 4.5
we validate our similarity measure.
Table 1. Illustration
real life where a, c, b may represent low, medium and high end cars and hence
the similarity between a low-end and a medium-end car will be more than the
similarity between a low-end and a high-end car. Now the other independent
variable is Color. The average prices corresponding to the three colors namely
red, green and blue are 43, 41, 43.33. As can be seen, there is a small difference
in their prices which is in line with the fact that the cost of the car is very loosely
related to its color.
It is important to note that a notion of similarity for categorical variables has a
cognitive component to it and as such each one is debatable. However, the above
explained notion of similarity is the one that best exploits the latent information
for assigning similarity and will hence give predictors of high accuracy. This claim
is validated by the experimental results. Extracting these underlying semantics
by studying co-occurrence data forms the motivation for the algorithm presented
in this section.
where domain(A_i) = {v_i1, ..., v_in}. It can thus be seen that the mean itself is a
point in an n-dimensional space having the dimensions v_i1, ..., v_in, with magnitudes
< CI[A_k : v][A_i : v_i1], ..., CI[A_k : v][A_i : v_in] >.
Initially, all distinct values belonging to the same attribute are conceptually
vectors perpendicular to each other and hence the similarity between them is 0.
For the given example, the mean for dimension Color when Brand : a is
denoted as μ(Brand : a, Color). As defined above, the mean in a categorical
dimension is itself a point in an n-dimensional space; hence the dimensions
of the mean for the attribute Color are red, blue and green, and
μ(Brand : a, Color) = {CI[Brand : a][Color : red], CI[Brand : a][Color : blue],
CI[Brand : a][Color : green]}.
Similarly, μ(Brand : a, Price) = {CI[Brand : a][Price]}.
Thus the representative point for the value a of attribute Brand is given by
τ(Brand : a) = < μ(Brand : a, Color), μ(Brand : a, Price) >.
Initially we calculate the representative points for all values of all attributes. We
then initialize similarity in a manner similar to the overlap similarity measure
where matches are assigned similarity value 1 and the mismatches are assigned
similarity value 0. Using the representative points calculated above, we assign a
new similarity between each pair of values v, v' belonging to the same attribute
A_k as equal to the average of the cosine similarities between their means for each
dimension. Now the cosine similarity between v and v' in dimension A_i is denoted
by CS(v : v', A_i) and is equal to the cosine similarity between the vectors
μ(A_k : v, A_i) and μ(A_k : v', A_i). Thus, the similarity between A_k : v and A_k : v' is:

( Σ_{l=1, l≠k}^{d} CS(v : v', A_l) ) / (d − 1)

Thus, for the above example, the similarity between Brand:a and Brand:b is the
average of the cosine similarities between their respective means in the dimensions
Color and Price. Thus Sim(a, b) is given as: (CS(a:b, Color) + CS(a:b, Price)) / 2.
An iteration is said to be complete when the similarities between all pairs
of values belonging to the same attribute (for all attributes) have been computed
using the above methodology. These new values are used for the cosine similarity
computation in the next iteration.
In this section, we describe the similarity matrix construction used by the DISC
algorithm. It proceeds as follows:
Similarity_m = 1 − |CI[A_i : v_ij][A_m] − CI[A_i : v_ik][A_m]| / (Max[A_m] − Min[A_m]),   if A_m is Numeric
Similarity_m = CosineProduct(CI[A_i : v_ij][A_m], CI[A_i : v_ik][A_m]),   if A_m is Categorical

where CosineProduct(CI[A_i : v_ij][A_m], CI[A_i : v_ik][A_m]) is defined as:

( Σ_{v_ml, v_ml' ∈ A_m} CI[A_i : v_ij][A_m : v_ml] ∗ CI[A_i : v_ik][A_m : v_ml'] ∗ Sim(v_ml', v_ml) ) / (NormalVector1 ∗ NormalVector2)

NormalVector1 = ( Σ_{v_ml, v_ml' ∈ A_m} CI[A_i : v_ij][A_m : v_ml] ∗ CI[A_i : v_ij][A_m : v_ml'] ∗ Sim(v_ml, v_ml') )^{1/2}

NormalVector2 = ( Σ_{v_ml, v_ml' ∈ A_m} CI[A_i : v_ik][A_m : v_ml] ∗ CI[A_i : v_ik][A_m : v_ml'] ∗ Sim(v_ml, v_ml') )^{1/2}

Sim(v_ij, v_ik) = (1/(d − 1)) Σ_{m=1, m≠i}^{d} Similarity_m
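A minimal sketch of the categorical branch of this computation: the generalised cosine product between the co-occurrence vectors of two values of A_i over a categorical attribute A_m, weighted by the current similarity matrix of A_m's values. Array shapes and names are illustrative assumptions.

```python
import numpy as np

def cosine_product(ci_j, ci_k, sim):
    """Generalised cosine similarity between the co-occurrence vectors of two
    values v_ij, v_ik of attribute A_i over a categorical attribute A_m.
    ci_j[l] = CI[A_i:v_ij][A_m:v_ml], ci_k[l] = CI[A_i:v_ik][A_m:v_ml], and
    sim[l, l'] is the current similarity Sim(v_ml, v_ml') within A_m."""
    ci_j = np.asarray(ci_j, float)
    ci_k = np.asarray(ci_k, float)
    sim = np.asarray(sim, float)
    numerator = ci_j @ sim @ ci_k                 # sum over all value pairs of A_m
    normal1 = np.sqrt(ci_j @ sim @ ci_j)          # NormalVector1
    normal2 = np.sqrt(ci_k @ sim @ ci_k)          # NormalVector2
    return numerator / (normal1 * normal2)

def updated_similarity(per_dimension):
    """Sim(v_ij, v_ik): average of the d-1 per-dimension similarities
    Similarity_m computed over all attributes A_m other than A_i."""
    return sum(per_dimension) / len(per_dimension)
```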
5 Experimental Study
In this section, we describe the pre-processing steps and the datasets used in
Section 5.1 followed by experimental results in Section 5.2. Finally in Section 5.3
we provide a discussion on the experimental results.
The experimental results for classification and regression are presented in Ta-
ble 3, 4 and Table 5, 6 respectively. In these tables each row represents a
competing similarity measure and each column represents a different dataset. In Table 3
and 4, each cell represents the accuracy for the corresponding dataset and simi-
larity measure respectively. In Table 5 and 6, each cell represents the root mean
square error (RMSE) for the corresponding dataset and similarity measure re-
spectively.
As can be seen from the experimental results, DISC is the best similarity measure
for classification for all datasets except Lymphography, Primary Tumor and
Hayes Roth Test where it is the third best for the first two and the second
best for the last one. On the basis of overall mean accuracy, DISC outperforms
the nearest competitor by about 2.87%, where we define overall mean accuracy
as the mean of accuracies over all classification datasets considered for our
experiments. For regression, DISC is the best performing similarity measure on
the basis of Root Mean Square Error (RMSE) for all datasets.
For classification datasets like Iris, Primary Tumor and Zoo the algorithm
halted after the 1st iteration while for datasets like Balance, Lymphography,
Tic-Tac-Toe, Breast Cancer the algorithm halted after the 2nd iteration. Also,
for Car-Evaluation, Hayes Roth, Teaching Assistant and Nursery the algorithm
halted after the 3rd iteration while it halted after the 4th iteration for Hayes Roth
Test. For regression, the number of iterations was less than 5 for all datasets ex-
cept Compressive Strength for which it was 9. Thus, it can be seen that the
number of iterations for all datasets is small. Also, the authors observed that
the major bulk of the accuracy improvement is achieved with the first iteration
and hence for domains with time constraints in training the algorithm can be
halted after the first iteration. The reason for the consistently good performance
can be attributed to the fact that a similarity computation is a major component
in nearest neighbour classification and regression techniques, and DISC captures
similarity accurately and efficiently in a data driven manner.
is O(nd + V·v²·d). Once the similarity values are computed, using them in any
classification, regression or clustering task is a simple table look-up and is
hence O(1).
6 Conclusion
In this paper we have presented and evaluated DISC, a similarity measure for
categorical data. DISC is data intensive, generic and simple to implement. In
addition to these features, it doesn’t require any domain expert’s knowledge.
Finally our algorithm was evaluated against 14 competing algorithms on 24
standard real-life datasets, out of which 12 were used for classification and 12 for
regression. It outperformed all competing algorithms on almost all datasets. The
experimental results are especially significant since they demonstrate a reasonably
large improvement in accuracy by changing only the similarity measure while
keeping the algorithm and its parameters constant.
Apart from classification and regression, similarity computation is a pivotal
step in a number of applications such as clustering, distance-based outlier
detection and search. Future work includes applying our algorithm to these
techniques as well. We also intend to develop a weighting scheme for different
dimensions for calculating similarity, which will make the algorithm more robust.
References
1. Boriah, S., Chandola, V., Kumar, V.: Similarity Measures for Categorical Data: A
Comparative Evaluation. In: Proceedings of SDM 2008. SIAM, Atlanta (2008)
2. Sneath, P.H.A., Sokal, R.R.: Numerical Taxonomy: The Principles and Practice of
Numerical Classification. W. H. Freeman and Company, San Francisco (1973)
3. Anderberg, M.R.: Cluster Analysis for Applications. Academic Press, London
(1973)
4. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Englewood
Cliffs (1988)
5. Hartigan, J.A.: Clustering Algorithms. John Wiley & Sons, New York (1975)
6. Wilson, D.R., Martinez, T.R.: Improved heterogeneous distance functions. J. Artif.
Intell. Res. (JAIR) 6, 1–34 (1997)
7. Biberman, Y.: A context similarity measure. In: Bergadano, F., De Raedt, L. (eds.)
ECML 1994. LNCS, vol. 784, pp. 49–63. Springer, Heidelberg (1994)
8. Das, G., Mannila, H.: Context-based similarity measures for categorical databases.
In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI),
vol. 1910, pp. 201–210. Springer, Heidelberg (2000)
9. Palmer, C.R., Faloutsos, C.: Electricity based external similarity of categorical
attributes. In: Whang, K.-Y., Jeon, J., Shim, K., Srivastava, J. (eds.) PAKDD
2003. LNCS (LNAI), vol. 2637, pp. 486–500. Springer, Heidelberg (2003)
10. Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with
categorical values. Data Mining and Knowledge Discovery 2(3), 283–304 (1998)
11. Ganti, V., Gehrke, J., Ramakrishnan, R.: CACTUS–clustering categorical data
using summaries. In: KDD 1999. ACM Press, New York (1999)
12. Jones, W.P., Furnas, G.W.: Pictures of relevance: a geometric analysis of similarity
measures. J. Am. Soc. Inf. Sci. 38(6), 420–442 (1987)
13. Noreault, T., McGill, M., Koll, M.B.: A performance evaluation of similarity mea-
sures, document term weighting schemes and representations in a boolean environ-
ment. In: SIGIR 1980: Proceedings of the 3rd Annual ACM Conference on Research
and Development in Information Retrieval, Kent, UK, pp. 57–76. Butterworth &
Co. (1981)
14. Zwick, R., Carlstein, E., Budescu, D.V.: Measures of similarity among fuzzy con-
cepts: A comparative analysis. International Journal of Approximate Reason-
ing 1(2), 221–242 (1987)
15. Pappis, C.P., Karacapilidis, N.I.: A comparative assessment of measures of simi-
larity of fuzzy values. Fuzzy Sets and Systems 56(2), 171–174 (1993)
16. Wang, X., De Baets, B., Kerre, E.: A comparative study of similarity measures.
Fuzzy Sets and Systems 73(2), 259–268 (1995)
17. Gibson, D., Kleinberg, J.M., Raghavan, P.: Clustering categorical data: An ap-
proach based on dynamical systems. VLDB Journal 8(34), 222–236 (2000)
18. Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical
attributes. In: Proceedings of IEEE International Conference on Data Engineering
(1999)
19. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and tech-
niques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
20. Fayyad, U.M., Irani, K.B.: On the handling of continuous-valued attributes in
decision tree generation. Machine Learning 8, 87–102 (1992)
ListOPT: Learning to Optimize for XML Ranking
1 Key Laboratory of Machine Perception (Ministry of Education),
School of Electronic Engineering and Computer Science, Peking University
2 The State Key Lab of Computer Science, Institute of Software,
Chinese Academy of Sciences, Beijing 100190, China
1 Introduction
Search engines have become an indispensable part of life, and one of the key issues in
search engines is ranking. Given a query, the ranking module sorts the retrieved
documents to maximally satisfy the user's needs. Traditional ranking methods aim to
compute the relevance of a document to a query according to factors such as term
frequencies and links. The search result is a ranked list in which the documents
are sequenced by their relevance score in descending order. These kinds of methods
include content based functions such as TF*IDF [1] and BM25 [2], and link based
functions such as PageRank [3] and HITS [4].
Recently, machine learning technologies have been successfully applied to information
retrieval, known as "learning-to-rank". The main procedure of "learning-to-rank" is as
follows: in the learning module, a set of queries is given, and each of the
queries is associated with a ground-truth ranking list of documents. The process targets
at creating a ranking model that can precisely predict the order of documents in the
ground-truth list. Many learning-to-rank approaches have been proposed, and based on
the differences in their learning samples, these methods can be classified into three
categories [5]: pointwise, pairwise and listwise. Taking a single document as the
learning object, the pointwise based methods aim to compute the relevance score of
each document with respect to its closeness to the ground-truth. Pairwise based
approaches take document pairs as learning samples, and rephrase the learning problem
as a classification problem. Listwise based approaches take a ranked list as the
learning sample, and measure the differences between the current result list and the
ground-truth list using a loss function. The learning purpose of listwise methods is
to minimize this loss. The experimental results in [5] [11] [12] show that the listwise
based methods perform best among these three kinds of methods.
It is worth noting that, from the perspective of ranking, the aforementioned learning-
to-rank methods belong to the learning based ranking technologies. Here the search
results are directly obtained from the learning module, without considering the tradi-
tional content based or link based ranking functions. However, there is no evidence to
confirm that the learning based methods perform better than all the other classic content
based or link based methods. Accordingly, to substitute the other two kinds of ranking
technologies with the learning based methods might not be appropriate.
We hence consider a learning-to-optimize method, ListOPT, that combines the
benefits of learning-to-rank methods and traditional content based methods.
Here the ranking method is an extension of the widely known ranking function
BM25. In previous studies, the BM25 parameters yielding the best performance were
selected experimentally, typically after thousands of runs. However, this
simple but exhaustive procedure is only applicable to functions with few free
parameters. Besides, whether the best parameter values are among those tested is also
questionable. To address this defect, a listwise learning method to optimize the free
parameters is introduced.
As with learning-to-rank methods, the key issue of the learning-to-optimize method is
the definition of the loss function. In this paper, we discuss the effect of three
distinct definitions of loss in the learning process, and the experiments show that all
three loss functions converge. The experiments also reveal that the ranking function
using the tuned parameter set indeed performs better.
The primary contributions of this paper include: (1) we propose a learning-to-optimize
method which combines the traditional ranking function BM25 with the listwise
learning-to-rank approach; (2) we introduce three query-level loss functions based on
cosine similarity, Euclidean distance and cross entropy, which are confirmed to
converge by experiments; (3) we verify the effectiveness of the learning-to-optimize
approach on a large XML dataset, Wikipedia English [6].
The paper is organized as follows. In section 2, we introduce the related work. Sec-
tion 3 gives the general description on learning-to-optimize approach ListOPT. The
definition of the three loss functions are discussed in section 4. Section 5 reports our
experimental results. Section 6 is the conclusion and future work.
2 Related Work
2.1 Learning-to-Rank
In recent years, many machine learning methods were applied to the problem of
ranking for information retrieval. The existing learning-to-rank methods fall into three
categories: pointwise, pairwise and listwise. The pointwise approaches [7] were
proposed first, transforming the ranking problem into regression or classification on
single candidate documents. Pairwise approaches, published later, regard
the ranking process as a classification of document pairs. For example, given a query Q
and an arbitrary document pair P = (d1, d2) in the data collection, where di denotes the
i-th candidate document, if d1 shows higher relevance than d2, then the pair P is
labelled positive, otherwise P is labelled negative. The advantage of pointwise and
pairwise approaches is that existing classification or regression theories can be
directly applied. For instance, borrowing support vector machines, boosting and neural
networks as the classification model leads to the methods Ranking SVM [8], RankBoost [9]
and RankNet [10].
However, the objective of pointwise and pairwise learning methods is to minimize
errors in the classification of single documents or document pairs rather than to
minimize errors in the ranking of documents. To overcome this drawback of the
aforementioned two approaches, listwise methods, such as ListNet [5], RankCosine [11]
and ListMLE [12], have been proposed. In listwise approaches, the learning object is
the result list, and various kinds of loss functions are defined to measure the
similarity of the predicted result list and the ground-truth result list. ListNet,
the first listwise approach, proposed by Cao et al., uses cross entropy as the loss
function. Qin et al. discussed another listwise method called RankCosine, where cosine
similarity is defined as the loss function. Xia et al. introduced the likelihood loss
as the loss function in the listwise learning-to-rank method ListMLE.
In information retrieval, BM25 is a highly cited ranking function used by search
engines to rank matching documents according to their relevance to a given search query.
It is based on the probabilistic retrieval framework developed in the 1970s and 1980s.
Though BM25 was originally proposed to rank documents in HTML format, it has been
introduced to the area of XML document ranking in recent years. In the last three years
of the INEX [6] Ad Hoc track [17] [18] [19], all the search engines that performed best
used BM25 as the basic ranking function. (The Initiative for the Evaluation of XML
retrieval (INEX) is a global evaluation platform launched in 2002 for organizations from
the Information Retrieval, Database and other related research fields to compare the
effectiveness and efficiency of their XML search engines; in the Ad Hoc track,
participants compare the retrieval effectiveness of their XML search engines.) To
improve the performance of BM25, Taylor et al. introduced the pairwise learning-to-rank
method RankNet to tune the parameters in BM25, named the RankNet Tuning method [13] in
this paper. However, as mentioned in 2.1, the inherent disadvantages of pairwise methods
had a pernicious influence on the
Unlike HTML retrieval, the retrieval results in XML retrieval are elements, so the
definition of BM25 is different from the traditional BM25 formula used in HTML ranking.
The formal definition is as follows:

ps(e, Q) = Σ_{t∈Q} W_t · ( (k + 1) · tf(t,e) ) / ( k · (1 − b + b · len(e)/avel) + tf(t,e) ),
W_t = log( N_d / n(t) )    (1)

In the formula, tf(t,e) is the frequency with which keyword t appears in element e; N_d is the
number of files in the collection; n(t) is the number of files that contain keyword t;
len(e) is the length of element e; avel is the average length of elements in the collection; Q
is a set of keywords; ps(e, Q) is the predicted relevance score of element e with respect
to query Q; b and k are two free parameters.
As observed, the parameters in BM25 fall into three categories: constant parameters,
fixed parameters and free parameters. For example, parameters describing the features
of the data collection, like avel and Nd, are defined as constant parameters. Given a certain
query and a candidate element, tf(t,e) and len(e) in the formula are fixed values. These
parameters are called fixed parameters. Moreover, free parameters, such as k
and b in the function, are set to make the formula more adaptable to various kinds of
data collections. Therefore, the ultimate objective of learning-to-optimize approach is
to learn the optimal set of free parameters.
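A minimal sketch of formula (1) for scoring one element, assuming the element is represented by its keyword frequencies and its length; the data-structure layout and function name are illustrative.

```python
import math

def bm25_score(element, query, k, b, Nd, n, avel):
    """ps(e, Q) of formula (1).  element: {'tf': {keyword: frequency in e},
    'len': length of e}; Nd: number of files in the collection; n: {keyword:
    number of files containing it}; avel: average element length."""
    score = 0.0
    for t in query:
        tf = element["tf"].get(t, 0)
        if tf == 0 or n.get(t, 0) == 0:
            continue
        wt = math.log(Nd / n[t])                          # W_t = log(N_d / n(t))
        norm = k * (1 - b + b * element["len"] / avel)    # length normalisation term
        score += wt * (k + 1) * tf / (norm + tf)
    return score
```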
relevant, we apply the F measure [14] to evaluate the ground-truth score. Given a
query q_i, the ground-truth score of the j-th candidate element is defined as follows:

precision = relevant / (relevant + irrelevant),    recall = relevant / REL

g_j^i = ( (1 + 0.1²) · precision · recall ) / ( 0.1² · precision + recall )    (2)

In the formula, relevant is the length of the relevant contents highlighted by the user
in e, while irrelevant stands for the length of the irrelevant parts. REL indicates the
total length of relevant contents in the data collection. The general bias parameter is
set to 0.1, denoting that the weight of precision is ten times that of recall.
Furthermore, for each query q_i, we use the ranking function BM25 mentioned in 3.1
to get the predicted relevance score of each candidate element, recorded in
R^i = (r_1^i, r_2^i, ..., r_{n(i)}^i).
Each ground-truth score list G^i and predicted score list R^i then form an "instance".
The total loss is defined as the sum of the "distances" between the ground-truth lists
G^i and the predicted lists R^i:

Σ_{i=1}^{m} L(G^i, R^i)    (3)

In each training epoch, the ranking function BM25 is used to compute the predicted
scores R^i. Then the learning module replaces the current free parameters with new
parameters tuned according to the loss between G^i and R^i. The process stops
either when reaching the iteration limit or when the parameters do not change.
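A sketch of this training loop under stated assumptions: score_fn evaluates formula (1) for one element, grad_fn returns the adjustments Δk and Δb from one of the loss functions of Section 4, and the step sizes, default values and stopping tolerance are illustrative choices rather than values given in the paper.

```python
def tune_parameters(train_queries, ground_truth, score_fn, grad_fn,
                    k=1.2, b=0.75, eta_k=1e-3, eta_b=1e-3,
                    max_epochs=1000, tol=1e-9):
    """train_queries: list of (query, candidate elements); ground_truth holds the
    score lists G^i; score_fn computes formula (1); grad_fn returns (delta_k, delta_b)."""
    for _ in range(max_epochs):
        predicted = [[score_fn(e, q, k, b) for e in cands]       # R^i for every query
                     for q, cands in train_queries]
        delta_k, delta_b = grad_fn(ground_truth, predicted, k, b)
        new_k, new_b = k + eta_k * delta_k, b + eta_b * delta_b  # update as in formula (4)
        if abs(new_k - k) < tol and abs(new_b - b) < tol:        # parameters no longer change
            return new_k, new_b
        k, b = new_k, new_b
    return k, b                                                  # iteration limit reached
```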
4 Loss Functions
In this section, three query-level loss functions and the corresponding tuning formulas
are discussed. The three definitions of loss are based on cosine similarity, Euclidean
distance and cross entropy respectively. After computing the loss between the ground-truth
G^i and the predicted R^i, the two free parameters k and b in BM25 are tuned as in
formula (4), where η_k and η_b are set to control the learning speed:

k ← k + η_k · Δk,    b ← b + η_b · Δb    (4)
definition of the query-level loss function based on cosine similarity is:

L(G^i, R^i) = (1/2) · ( 1 − Σ_{j=1}^{n(i)} g_j^i r_j^i / ( √(Σ_{j=1}^{n(i)} (g_j^i)²) · √(Σ_{j=1}^{n(i)} (r_j^i)²) ) )    (5)
Note that in a large data collection, given a query, the number of relevant documents is
usually much smaller than the number of irrelevant documents, so a penalty function
ω_j^i is set to avoid a learning bias towards irrelevant documents. Formula (6) gives the
weight of relevant documents in the learning procedure, while formula (7) gives the
weight of irrelevant documents. The formal definition is as follows:

ω_j^i = (NR^i + NIR^i) / NR^i    if g_j^i > 0    (6)

ω_j^i = (NR^i + NIR^i) / NIR^i   if g_j^i = 0    (7)

where NR^i is the number of relevant elements for query q_i and NIR^i is the
number of irrelevant ones. After measuring the loss between the ground-truth results
and the predicted results, the adjustments Δk and Δb are determined
from the derivatives with respect to k and b:
With respect to k:

Δk = Σ_{q=1}^{m} ∂L(G^q, R^q)/∂k
   = −(1/2) Σ_{q=1}^{m} [ ( Σ_{j=1}^{n(q)} g_j^q · ∂r_j^q/∂k ) / ( √(Σ_{j=1}^{n(q)} (g_j^q)²) · √(Σ_{j=1}^{n(q)} (r_j^q)²) )
       − ( Σ_{j=1}^{n(q)} r_j^q g_j^q ) · ( Σ_{j=1}^{n(q)} r_j^q · ∂r_j^q/∂k ) / ( √(Σ_{j=1}^{n(q)} (g_j^q)²) · (Σ_{j=1}^{n(q)} (r_j^q)²)^{3/2} ) ]    (8)

In which:

∂r_j^q/∂k = Σ_{t∈Q} W_t · [ tf(t,e) · (tf(t,e) + k·(1 − b + b·len(e)/avel)) − tf(t,e) · (k + 1) · (1 − b + b·len(e)/avel) ]
            / ( tf(t,e) + k·(1 − b + b·len(e)/avel) )²    (9)
Δb analogously:

Δb = Σ_{q=1}^{m} ∂L(G^q, R^q)/∂b
   = −(1/2) Σ_{q=1}^{m} [ ( Σ_{j=1}^{n(q)} g_j^q · ∂r_j^q/∂b ) / ( √(Σ_{j=1}^{n(q)} (g_j^q)²) · √(Σ_{j=1}^{n(q)} (r_j^q)²) )
       − ( Σ_{j=1}^{n(q)} r_j^q g_j^q ) · ( Σ_{j=1}^{n(q)} r_j^q · ∂r_j^q/∂b ) / ( √(Σ_{j=1}^{n(q)} (g_j^q)²) · (Σ_{j=1}^{n(q)} (r_j^q)²)^{3/2} ) ]    (10)

In which:

∂r_j^q/∂b = Σ_{t∈Q} W_t · [ tf(t,e) · (k + 1) · k · (1 − len(e)/avel) ]
            / ( tf(t,e) + k·(1 − b + b·len(e)/avel) )²    (11)
L(G^i, R^i) = √( Σ_{j=1}^{n(i)} (ω_j^i)² (r_j^i − g_j^i)² )    (12)

As for the cosine similarity loss, we derive the derivatives of the loss function based on
Euclidean distance with respect to k and b. The definitions of ∂r_j^q/∂k and ∂r_j^q/∂b are
the same as in formula (9) and formula (11) respectively.
With respect to k:

Δk = Σ_{q=1}^{m} ∂L(G^q, R^q)/∂k
   = Σ_{q=1}^{m} ( Σ_{j=1}^{n(q)} (ω_j^q)² (r_j^q − g_j^q) · ∂r_j^q/∂k ) / √( Σ_{j=1}^{n(q)} (ω_j^q)² (r_j^q − g_j^q)² )    (13)

Δb analogously:

Δb = Σ_{q=1}^{m} ∂L(G^q, R^q)/∂b
   = Σ_{q=1}^{m} ( Σ_{j=1}^{n(q)} (ω_j^q)² (r_j^q − g_j^q) · ∂r_j^q/∂b ) / √( Σ_{j=1}^{n(q)} (ω_j^q)² (r_j^q − g_j^q)² )    (14)
L(G^i, R^i) = Σ_{j=1}^{n(i)} ω_j^i r_j^i log(g_j^i)    (15)

When considering cross entropy as the metric, the loss function turns into formula (15).
Moreover, the penalty parameter ω_j^i in the formula is the same as in formulas (6) and (7),
and the tuning adjustments Δk and Δb for this loss are given in formulas (16) and (17),
obtained by differentiating (15) with respect to k and b in the same manner as formulas
(8) and (10). Additionally, the definitions of ∂r_j^q/∂k and ∂r_j^q/∂b are the same as in
formula (9) and formula (11).
5 Experiment
In this section, the XML data set used in the comparison experiments is first introduced.
Then in section 5.2 we compare the effectiveness of the optimized ranking function
BM25 under two evaluation criteria: MAP [15] and NDCG [16]. Additionally,
in section 5.3, we focus on testing the association between the number of training
queries and the optimizing performance under the criterion of MAP.
Acknowledgement
This work was supported by the National High-Tech Research and Development Plan
of China under Grant No.2009AA01Z136.
References
1. Carmel, D., Maarek, Y.S., Mandelbrod, M., et al.: Searching XML documents via XML
fragments. In: SIGIR, pp. 151–158 (2003)
2. Theobald, M., Schenkel, R., Weikum, G.: An Efficient and Versatile Query Engine for TopX
Search. In: VLDB, pp. 625–636 (2005)
3. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order
to the web. Technical report, Stanford University (1998)
4. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. JACM, 604–632 (1998)
5. Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., Li, H.: Learning to rank: from pairwise approach to
listwise approach. In: ICML, pp. 129–136 (2007)
6. INEX,
7. Nallapati, R.: Discriminative models for information retrieval. In: SIGIR, pp. 64–71 (2004)
8. Cao, Y., Xu, J., Liu, T., Li, H., Huang, Y., Hon, H.: Adapting ranking SVM to document
retrieval. In: SIGIR, pp. 186–193 (2006)
9. Freund, Y., Iyer, R., Schapire, R.E., Singer, Y.: An efficient boosting algorithm for combining
preferences. JMLR, 933–969 (2003)
10. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.:
Learning to Rank using Gradient Descent. In: ICML, pp. 89–96 (2005)
11. Qin, T., Zhang, X.D., Tsai, M.F., Wang, D.S., Liu, T.Y., Li, H.: Query-level loss functions
for information retrieval. Information Processing and Management, 838–855 (2007)
12. Xia, F., Liu, T.Y., Wang, J., Zhang, W., Li, H.: Listwise approach to learning to rank: theory
and algorithm. In: ICML, pp. 1192–1199 (2008)
13. Taylor, M., Zaragoza, H., Craswell, N., Robertson, S., Burges, C.: Optimisation Methods for
Ranking Functions with Multiple Parameters. In: CIKM, pp. 585–593 (2006)
14. van Rijsbergen, C.J.: Information Retrieval. Butterworths, London (1979)
15. Baeza-Yates, R., Ribeiro-Neto, B.: Modern information retrieval (1999)
16. Jarvelin, K., Kekalainen, J.: IR evaluation methods for retrieving highly relevant documents.
In: SIGIR, pp. 41–48 (2000)
17. Geva, S., Kamps, J., Lethonen, M., Schenkel, R., Thom, J.A., Trotman, A.: Overview of the
INEX 2009 Ad Hoc Track. In: Geva, S., Kamps, J., Trotman, A. (eds.) INEX 2009. LNCS,
vol. 6203, pp. 4–25. Springer, Heidelberg (2010)
18. Itakura, K.Y., Clarke, C.L.A.: University of waterloo at INEX 2008: Adhoc, book, and link-
the-wiki tracks. In: Geva, S., Kamps, J., Trotman, A. (eds.) INEX 2008. LNCS, vol. 5631,
pp. 132–139. Springer, Heidelberg (2009)
19. Liu, J., Lin, H., Han, B.: Study on Reranking XML Retrieval Elements Based on Combining
Strategy and Topics Categorization. In: INEX, pp. 170–176 (2007)
Item Set Mining Based on Cover Similarity
Abstract. While in standard frequent item set mining one tries to find
item sets the support of which exceeds a user-specified threshold (mini-
mum support) in a database of transactions, we strive to find item sets
for which the similarity of their covers (that is, the sets of transactions
containing them) exceeds a user-specified threshold. Starting from the
generalized Jaccard index we extend our approach to a total of twelve
specific similarity measures and a generalized form. We present an effi-
cient mining algorithm that is inspired by the well-known Eclat algorithm
and its improvements. By reporting experiments on several benchmark
data sets we demonstrate that the runtime penalty incurred by the more
complex (but also more informative) item set assessment is bearable and
that the approach yields high quality and more useful item sets.
1 Introduction
Frequent item set mining and association rule induction are among the most
intensely studied topics in data mining and knowledge discovery in databases.
The enormous research efforts devoted to these tasks have led to a variety of
sophisticated and efficient algorithms, among the best-known of which are Apri-
ori [1], Eclat [27,28] and FP-growth [13]. However, these approaches, which find
item sets whose support exceeds a user-specified minimum in a given transac-
tion database, have the disadvantage that the support does not say much about
the actual strength of association of the items in the set: a set of items may
be frequent simply because its elements are frequent and thus their frequent
co-occurrence can even be expected by chance. As a consequence, the (usually
few) interesting item sets drown in a sea of irrelevant ones.
In order to improve this situation, we propose in this paper to change the
selection criterion, so that fewer irrelevant item sets are produced. For this
we draw on the insight that for associated items their covers—that is, the set
of transactions containing them—are more similar than for independent items.
Starting from the Jaccard index to illustrate this idea, we explore a total of
twelve specific similarity measures that can be generalized from pairs of sets
(or, equivalently, from pairs of binary vectors) as well as a generalized form.
By applying an Eclat-based mining algorithm to standard benchmark data sets
and to the 2008/2009 Wikipedia Selection for schools, we demonstrate that the
search times are bearable and that high quality item sets are produced.
The extent rT (I) of an item set I w.r.t. a transaction database T is the size of its
carrier, that is, rT (I) = |LT (I)|. Together with the notions of cover and support
(see above), we can define the generalized Jaccard index of an item set I w.r.t.
a transaction database T as its support divided by its extent, that is, as
J_T(I) = s_T(I) / r_T(I) = |K_T(I)| / |L_T(I)| = |⋂_{i∈I} K_T({i})| / |⋃_{i∈I} K_T({i})|.
The diffset approach as it was reviewed in the previous section can easily be
transferred in order to find an efficient scheme for computing the carrier and
thus the extent of item sets. To this end we define the extra set ET (a | I) as
E_T(a | I) = K_T({a}) − ⋃_{i∈I} K_T({i}) = {k ∈ ℕ_n | a ∈ t_k ∧ ∀i ∈ I: i ∉ t_k}.
That is, ET (a | I) is the set of indices of all transactions that contain a, but
no item in I, and thus identifies the extra transaction indices that have to be
added to the carrier if item a is added to the item set I. For extra sets we have
ET (a | I ∪ {b}) = ET (a | I) − ET (b | I), which corresponds to the analogous
formula for diffsets reviewed above. This relation is easily verified as follows:
E_T(a | I) − E_T(b | I)
= {k ∈ ℕ_n | a ∈ t_k ∧ ∀i ∈ I: i ∉ t_k} − {k ∈ ℕ_n | b ∈ t_k ∧ ∀i ∈ I: i ∉ t_k}
= {k ∈ ℕ_n | a ∈ t_k ∧ ∀i ∈ I: i ∉ t_k ∧ ¬(b ∈ t_k ∧ ∀i ∈ I: i ∉ t_k)}
= {k ∈ ℕ_n | a ∈ t_k ∧ ∀i ∈ I: i ∉ t_k ∧ (b ∉ t_k ∨ ∃i ∈ I: i ∈ t_k)}
= {k ∈ ℕ_n | (a ∈ t_k ∧ ∀i ∈ I: i ∉ t_k ∧ b ∉ t_k) ∨ (a ∈ t_k ∧ ∀i ∈ I: i ∉ t_k ∧ ∃i ∈ I: i ∈ t_k)}
  (the second disjunct is false, since ∀i ∈ I: i ∉ t_k and ∃i ∈ I: i ∈ t_k contradict each other)
= {k ∈ ℕ_n | a ∈ t_k ∧ ∀i ∈ I: i ∉ t_k ∧ b ∉ t_k}
= {k ∈ ℕ_n | a ∈ t_k ∧ ∀i ∈ I ∪ {b}: i ∉ t_k}
= E_T(a | I ∪ {b})
In order to see how extra sets can be used to compute the extent of item sets,
let I = {i1 , . . . , im }, with some arbitrary, but fixed order of the items that is
indicated by the index. This will be the order in which the items are used as
Table 1. Quantities in terms of which the considered similarity measures are specified,
together with their behavior as functions on the partially ordered set (2B , ⊆)
quantity                                              behavior
n_T                                                   constant
s_T(I) = |K_T(I)| = |⋂_{i∈I} K_T({i})|                anti-monotone
r_T(I) = |L_T(I)| = |⋃_{i∈I} K_T({i})|                monotone
q_T(I) = r_T(I) − s_T(I)                              monotone
z_T(I) = n_T − r_T(I)                                 anti-monotone
Thus we have a simple recursive scheme to compute the extent of an item set
from its parent in the search tree (as defined by the divide-and-conquer scheme).
The mining algorithm can now easily be implemented as follows: initially
we create a vertical representation of the given transaction database. The only
difference to the Eclat algorithm is that we have two transaction lists per item i:
one represents KT ({i}) and the other ET (i | ∅), which happens to be equal to
KT ({i}). (That is, for the initial transaction database the two lists are identical,
which, however, will obviously not be maintained in the recursive processing.)
In the recursion the first list for the split item is intersected with the first list
of all other items to form the list representing the cover of the corresponding
pair. The second list of the split item is subtracted from the second lists of all
other items, thus yielding the extra sets of transactions for these items given the
split item. From the sizes of the resulting lists the support and the extent of the
enlarged item sets and thus their generalized Jaccard index can be computed.
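A compact sketch of this recursion on plain Python sets (the actual implementation works on C arrays of transaction identifiers): each candidate item carries its cover and its extra set relative to the current prefix, the support is the size of the intersected cover, and the extent grows by the size of the extra set, so the generalised Jaccard index can be checked at every extension. Names and the set representation are illustrative.

```python
def jaccard_item_sets(transactions, items, threshold):
    """Find item sets whose generalised Jaccard index reaches the threshold."""
    base = []
    for i in items:
        cov = {idx for idx, t in enumerate(transactions) if i in t}
        base.append((i, cov, set(cov)))          # E_T(i | {}) equals K_T({i}) initially
    found = []

    def recurse(prefix, prefix_extent, cand):
        for pos, (a, cov_a, ext_a) in enumerate(cand):
            support = len(cov_a)                 # s_T(prefix + {a})
            extent = prefix_extent + len(ext_a)  # r_T grows by the size of the extra set
            if extent == 0 or support / extent < threshold:
                continue                         # anti-monotonicity allows pruning here
            itemset = prefix + [a]
            found.append((itemset, support, support / extent))
            rest = [(b, cov_a & cov_b, ext_b - ext_a)   # intersect covers, subtract extra sets
                    for b, cov_b, ext_b in cand[pos + 1:]]
            if rest:
                recurse(itemset, extent, rest)

    recurse([], 0, base)
    return found
```

For example, jaccard_item_sets([{'a', 'b'}, {'a'}, {'b'}], ['a', 'b'], 0.3) reports the pair {'a', 'b'} with support 1 and Jaccard index 1/3; singletons always have index 1 and would in practice also be filtered by a support threshold.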
Measures derived from inner product:
  Russel & Rao [21]                   S_R = s/n = s/(r+z)
  Kulczynski [19]                     S_K = s/q = s/(r−s)
  Jaccard [16] / Tanimoto [26]        S_J = s/(s+q) = s/r
  Dice [8] / Sørensen [25] /
    Czekanowski [7]                   S_D = 2s/(2s+q) = 2s/(r+s)
  Sokal & Sneath 1 [24,22]            S_S = s/(s+2q) = s/(r+q)

Measures derived from Hamming distance:
  Sokal & Michener [23,15] (Hamming)  S_M = (s+z)/n = (n−q)/n
  Faith [10]                          S_F = (2s+z)/(2n) = (s + z/2)/n
  AZZOO [5], σ ∈ [0,1]                S_Z = (s+σz)/n
  Rogers & Tanimoto [20]              S_T = (s+z)/(n+q) = (n−q)/(n+q)
  Sokal & Sneath 2 [24,22]            S_N = 2(s+z)/(n+s+z) = (n−q)/(n − q/2)
  Sokal & Sneath 3 [24,22]            S_O = (s+z)/q = (n−q)/q
  Baroni-Urbani & Buser [3]           S_B = (√(sz)+s)/(√(sz)+r)
Note that the Russel & Rao measure is simply normalized support, demon-
strating that our framework comprises standard frequent item set mining as a
special case. The Sokal & Michener measure is simply the normalized Hamming
similarity. The Dice/Sørensen/Czekanowski measure may be defined without the
factor 2 in the numerator, changing the range to [0, 0.5]. The Faith measure is
equivalent to the AZZOO measure (alter zero zero one one) for σ = 0.5 and
the Sokal & Michener measure results for σ = 1. AZZOO is meant to introduce
flexibility in how much weight should be placed on z, the number of transactions
which lack all items in I (zero zero) relative to s (one one).
All measures listed in Table 2 are anti-monotone on the partially ordered
set (2^B, ⊆), where B is the underlying item base. This is obvious if, in at least
one of the formulas given for a measure, the numerator is (a multiple of) a
constant or anti-monotone quantity or a (weighted) sum of such quantities, and
the denominator is (a multiple of) a constant or monotone quantity or a (weighted)
sum of such quantities (see Table 1). This is the case for all but S_D, S_N and S_B.
That S_D is anti-monotone can be seen by considering its reciprocal value
S_D^{-1} = (2s+q)/(2s) = 1 + q/(2s). Since q is monotone and s is anti-monotone,
S_D^{-1} is clearly monotone and thus S_D is anti-monotone. Applying the same
approach to S_B, we arrive at
S_B^{-1} = (√(sz)+r)/(√(sz)+s) = (√(sz)+s+q)/(√(sz)+s) = 1 + q/(√(sz)+s).
Since q is monotone and both s and √(sz) are anti-monotone, S_B^{-1} is clearly
monotone and thus S_B is anti-monotone. Finally, S_N can be written as
S_N = (2n−2q)/(2n−q) = 1 − q/(2n−q) = 1 − q/(n+s+z).
Since q is monotone, the numerator is monotone, and since n is constant and s
and z are anti-monotone, the denominator is anti-monotone. Hence the fraction
is monotone and since it is subtracted from 1, S_N is anti-monotone.
Note that all measures in Table 2 can be expressed as

S = ( c_0 s + c_1 z + c_2 n + c_3 √(sz) ) / ( c_4 s + c_5 z + c_6 n + c_7 √(sz) )    (1)

by specifying appropriate coefficients c_0, ..., c_7. For example, we obtain S_J for
c_0 = c_6 = 1, c_5 = −1 and c_1 = c_2 = c_3 = c_4 = c_7 = 0, since S_J = s/r = s/(n−z).
Similarly, we obtain S_O for c_0 = c_1 = c_6 = 1, c_4 = c_5 = −1 and c_2 = c_3 = c_7 = 0,
since S_O = (s+z)/q = (s+z)/(n−s−z). This general form allows for a flexible specification of
various similarity measures. Note, however, that not all selections of coefficients
lead to an anti-monotone measure and hence one has to carefully check this
property before using a measure that differs from the pre-specified ones.
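As a small illustration (not part of the published implementation), the general form (1) can be evaluated directly once s, z and n are known for an item set; the coefficient tuples shown are taken from the examples above, while the numerical input values are only placeholders.

```python
from math import sqrt

def similarity(s, z, n, coeffs):
    """Evaluate the general form (1) for an item set with
    s  transactions containing all items,
    z  transactions containing none of the items,
    n  transactions in total (so q = n - s - z and r = n - z).
    coeffs: the coefficient tuple (c0, ..., c7)."""
    c0, c1, c2, c3, c4, c5, c6, c7 = coeffs
    num = c0 * s + c1 * z + c2 * n + c3 * sqrt(s * z)
    den = c4 * s + c5 * z + c6 * n + c7 * sqrt(s * z)
    return num / den

# Jaccard S_J = s / (n - z):          c0 = c6 = 1, c5 = -1
jaccard = (1, 0, 0, 0, 0, -1, 1, 0)
# Sokal & Sneath 3 S_O = (s + z)/q:   c0 = c1 = c6 = 1, c4 = c5 = -1
sokal_sneath_3 = (1, 1, 0, 0, -1, -1, 1, 0)

print(similarity(s=30, z=50, n=100, coeffs=jaccard))          # 0.6
print(similarity(s=30, z=50, n=100, coeffs=sokal_sneath_3))   # 4.0
```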
7 Experiments
We implemented the described item set mining approach as a C program that
was derived from an Eclat implementation by adding the second transaction
identifier list for computing the extent of item sets. All similarity measures listed
in Table 2 are included as well as the general form (1). This implementation has
been made publicly available under the GNU Lesser (Library) Public License.¹
¹ See http://www.borgelt.net/jim.html
[Figure: logarithm of execution time over minimum support on the census and chess data sets, comparing jim (asc./desc. item order) with eclat (asc./desc. item order).]
Naturally, the execution times of JIM are always greater than those of the
corresponding Eclat runs (with the same order of the items), but the execution
times are still bearable. This shows that even if one does not use a similarity
measure to prune the search, this additional information can be computed fairly
efficiently. However, it should be kept in mind that the idea of the approach is to
set a threshold for the similarity measure, which can effectively prune the search,
so that the actual execution times found in applications are much lower. In our
own practice we basically always achieved execution times that were lower than
for the Eclat algorithm (but, of course, with a different output).
Table 3. Jaccard item sets found in the 2008/2009 Wikipedia Selection for schools
item set                                               s_T     J_T
Reptiles, Insects 12 1.0000
phylum, chordata, animalia 34 0.7391
planta, magnoliopsida, magnoliophyta 14 0.6667
wind, damag, storm, hurrican, landfal 23 0.1608
tournament, doubl, tenni, slam, Grand Slam 10 0.1370
dinosaur, cretac, superord, sauropsida, dinosauria 10 0.1149
decai, alpha, fusion, target, excit, dubna 12 0.1121
conserv, binomi, phylum, concern, animalia, chordata 14 0.1053
which standard frequent item set mining did not yield sufficiently good results.
This was carried out in the EU FP7 project BISON³ and is reported in [18].
8 Conclusions
We introduced the notion of a Jaccard item set as an item set for which the
(generalized) Jaccard index of its item covers exceeds a user-specified threshold.
In addition, we extended this basic idea to a total of twelve similarity measures
for sets or binary vectors, all of which can be generalized in the same way and
can be shown to be anti-monotone. By exploiting an idea that is similar to
the difference set approach for the well-known Eclat algorithm, we derived an
efficient search scheme that is based on forming intersections and differences of
sets of transaction indices in order to compute the quantities that are needed
to compute the similarity measures. Since it contains standard frequent item
set mining as a special case, mining item sets based on cover similarity yields a
flexible and versatile framework. Furthermore, the similarity measures provide
highly useful additional assessments of found item sets and thus help us to select
the interesting ones. By running experiments on standard benchmark data sets
we showed that mining item sets based on cover similarity can be done fairly
efficiently, and by evaluating the results obtained with a threshold for the cover
similarity measure we demonstrated that the output is considerably reduced,
while expressive and meaningful item sets are preserved.
Acknowledgements
This work was supported by the European Commission under the 7th Framework
Program FP7-ICT-2007-C FET-Open, contract no. BISON-211898.
References
1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Proc.
20th Int. Conf. on Very Large Databases (VLDB 1994), Santiago de Chile, pp.
487–499. Morgan Kaufmann, San Mateo (1994)
2. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. School of Infor-
mation and Computer Science, University of California at Irvine, CA, USA (2007),
http://www.ics.uci.edu/~mlearn/MLRepository.html
3. Baroni-Urbani, C., Buser, M.W.: Similarity of Binary Data. Systematic Zool-
ogy 25(3), 251–259 (1976)
4. Bayardo, R., Goethals, B., Zaki, M.J. (eds.): Proc. Workshop Frequent Item Set
Mining Implementations (FIMI 2004), Brighton, UK. CEUR Workshop Proceed-
ings 126, Aachen, Germany (2004), http://www.ceur-ws.org/Vol-126/
5. Cha, S.-H., Tappert, C.C., Yoon, S.: Enhancing Binary Feature Vector Similarity
Measures. J. Pattern Recognition Research 1, 63–77 (2006)
³ See http://www.bisonet.eu/
6. Choi, S.-S., Cha, S.-H., Tappert, C.C.: A Survey of Binary Similarity and Distance
Measures. Journal of Systemics, Cybernetics and Informatics 8(1), 43–48 (2010)
7. Czekanowski, J.: Zarys metod statystycznych w zastosowaniu do antropologii [An
Outline of Statistical Methods Applied in Anthropology]. Towarzystwo Naukowe
Warszawskie, Warsaw (1913)
8. Dice, L.R.: Measures of the Amount of Ecologic Association between Species. Ecol-
ogy 26, 297–302 (1945)
9. Dunn, G., Everitt, B.S.: An Introduction to Mathematical Taxonomy. Cambridge
University Press, Cambridge (1982)
10. Faith, D.P.: Asymmetric Binary Similarity Measures. Oecologia 57(3), 287–290
(1983)
11. Goethals, B. (ed.): Frequent Item Set Mining Dataset Repository. University of
Helsinki, Finland (2004), http://fimi.cs.helsinki.fi/data/
12. Goethals, B., Zaki, M.J. (eds.): Proc. Workshop Frequent Item Set Mining Imple-
mentations (FIMI 2003), Melbourne, FL, USA. CEUR Workshop Proceedings 90,
Aachen, Germany (2003), http://www.ceur-ws.org/Vol-90/
13. Han, J., Pei, J., Yin, Y.: Mining Frequent Patterns without Candidate Generation.
In: Proc. Conf. on the Management of Data (SIGMOD 2000), Dallas, TX, pp. 1–12.
ACM Press, New York (2000)
14. Hamann, V.: Merkmalbestand und Verwandtschaftsbeziehungen der Farinosae. Ein
Beitrag zum System der Monokotyledonen 2, 639–768 (1961)
15. Hamming, R.W.: Error Detecting and Error Correcting Codes. Bell Systems Tech.
Journal 29, 147–160 (1950)
16. Jaccard, P.: Étude comparative de la distribution florale dans une portion des Alpes
et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37, 547–579
(1901)
17. Kohavi, R., Bradley, C.E., Frasca, B., Mason, L., Zheng, Z.: KDD-Cup 2000 Or-
ganizers’ Report: Peeling the Onion. SIGKDD Exploration 2(2), 86–93 (2000)
18. Kötter, T., Berthold, M.R.: Concept Detection. In: Proc. 8th Conf. on Computing
and Philosophy (ECAP 2010). University of Munich, Germany (2010)
19. Kulczynski, S.: Classe des Sciences Mathématiques et Naturelles. Bulletin Int. de
l’Acadamie Polonaise des Sciences et des Lettres Série B (Sciences Naturelles) (Sup-
plement II), 57–203 (1927)
20. Rogers, D.J., Tanimoto, T.T.: A Computer Program for Classifying Plants. Sci-
ence 132, 1115–1118 (1960)
21. Russel, P.F., Rao, T.R.: On Habitat and Association of Species of Anopheline
Larvae in South-eastern Madras. J. Malaria Institute 3, 153–178 (1940)
22. Sneath, P.H.A., Sokal, R.R.: Numerical Taxonomy. Freeman Books, San Francisco
(1973)
23. Sokal, R.R., Michener, C.D.: A Statistical Method for Evaluating Systematic Re-
lationships. University of Kansas Scientific Bulletin 38, 1409–1438 (1958)
24. Sokal, R.R., Sneath, P.H.A.: Principles of Numerical Taxonomy. Freeman Books,
San Francisco (1963)
25. Sørensen, T.: A Method of Establishing Groups of Equal Amplitude in Plant Soci-
ology based on Similarity of Species and its Application to Analyses of the Vegeta-
tion on Danish Commons. Biologiske Skrifter / Kongelige Danske Videnskabernes
Selskab 5(4), 1–34 (1948)
26. Tanimoto, T.T.: IBM Internal Report, November 17 (1957)
27. Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W.: New Algorithms for Fast Dis-
covery of Association Rules. In: Proc. 3rd Int. Conf. on Knowledge Discovery and
Data Mining (KDD 1997), Newport Beach, CA, pp. 283–296. AAAI Press, Menlo
Park (1997)
28. Zaki, M.J., Gouda, K.: Fast Vertical Mining Using Diffsets. In: Proc. 9th ACM
SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2003), Wash-
ington, DC, pp. 326–335. ACM Press, New York (2003)
29. Synthetic Data Generation Code for Associations and Sequential Patterns. Intelli-
gent Information Systems, IBM Almaden Research Center,
http://www.almaden.ibm.com/software/quest/Resources/index.shtml
Learning to Advertise:
How Many Ads Are Enough?
1 Introduction
Sponsored search places ads on the result pages of web search engines for dif-
ferent queries. All major web search engines (Google, Microsoft, Yahoo!) derive
significant revenue from such ads. However, the advertisement problem is often
treated as the same problem as traditional web search, i.e., to find the most
relevant ads for a given query. One different and also usually ignored problem
is “how many ads are enough for a sponsored search”. Recently, a few research
works have been conducted on this problem [5,6,8,17]. For example, Broder et
al. study the problem of “whether to swing”, that is, whether to show ads for an
incoming query [3]; Zhu et al. propose a method to directly optimize the revenue
in sponsored search [22]. In most existing search engines, the problem has been
treated as an engineering issue. For example, some search engines always display a fixed number of ads, while others use heuristic rules to determine the number of displayed ads. However, the key question remains open: how can the number of displayed ads be optimized for an incoming query?
Motivation Example. Figure 1 (a) illustrates an example of sponsored search.
The query is “house” and the first one is a suggested ad with yellow background,
and the search results are listed in the bottom of the page. Our goal is to predict
the number of displayed ads for a given query. The problem is not easy, as it is usually difficult to accurately define the relevance between an ad and the query. We conducted several statistical studies on the log data of a commercial search engine using a two-stage procedure: first, for each query, we obtain all ads returned by the search engine; second, we use a method to remove unnecessarily displayed ads (detailed in Section 4). Figure 1(b) and (c) show the statistical results on a large click-through dataset (the DS BroadMatch dataset in Section 3). The number of "removed ads" refers to the total number of ads cut off in the second stage over all queries. Figure 1(b) shows how #clicks and the Click-Through-Rate (CTR) vary with the number of removed ads. We see that as the number of removed ads increases, #clicks decreases while CTR clearly increases. This matches our intuition: displaying more ads gains more clicks, but if many of them are irrelevant, CTR suffers. Figure 1(c) further shows how CTR increases as #clicks decreases. This is very interesting.
It is also reported that many clicks on the first displayed ad are done before the
users realize that it is not the first search result. A basic idea here is that we
can remove some displayed ads to achieve a better performance on CTR.
Fig. 1. (a) Sponsored search; (b) #removed ads vs. #clicks & CTR; (c) #clicks vs. CTR
Thus, the problem becomes how to predict the number of displayed ads for
an incoming query which is non-trivial and poses two unique challenges:
• Ad ranking. For a given query, a list of related ads will be returned. Ads
displayed at the top positions should be more relevant to the query. Thus,
the first challenge is how to rank these ads.
• Ad Number Prediction. After we get the ranking list of ads, it is necessary
to answer the question “how many ads should we show?”.
• We performed a deep analysis of the click-through data and found that when
the click entropy of a query exceeds a threshold, the CTR of that query will
be very near zero.
• We developed a method to determine the number of displayed ads for a given
query by an automatically selected threshold of click entropy.
• We conducted experiments on a commercial search engine and experimental
results validate the effectiveness of the proposed approach.
2 Problem Definition
Suppose we have the click-through data collected from a search engine, each
record can be represented by a triple {q, adq (p), cq (p)}, where for each query q,
adq (p) is the ad at the position p returned by the search engine and cq (p) is a
binary indicator which is 1 if this ad is clicked under this query, otherwise 0.
For each ad adq (p), there is an associated feature vector xq (p) extracted from a
query-ad pair (q, adq (p)) and can be utilized for ranking model learning.
Ad Ranking: Given the training data denoted by L = {q, AD_q, C_q}_{q∈Q}, in which Q is the query collection, for each q ∈ Q, AD_q = {ad_q(1), · · · , ad_q(n_q)} is its related ad list and C_q = {c_q(1), · · · , c_q(n_q)} are the click indicators, where n_q is the total number of displayed ads. Similarly, the test data can be denoted by T = {q′, AD_{q′}}_{q′∈Q′}, where Q′ is the test query collection. In this task, we try to learn a ranking function for displaying the query-related ads by relevance. For each query q′ ∈ Q′, the output of this task is the ranked ad list R_{q′} = {ad_{q′}(i_1), · · · , ad_{q′}(i_{n_{q′}})}, where (i_1, · · · , i_{n_{q′}}) is a permutation of (1, · · · , n_{q′}).
Ad Number Prediction: Given the ranked ad list R_{q′} for query q′, in this task we try to determine the number of displayed ads k and then display the top-k ads. The output of this task can be denoted by a tuple O = {q′, R_{q′}^k}_{q′∈Q′}, where R_{q′}^k are the top-k ads from R_{q′}.
Our problem is quite different from existing works on advertisement recom-
mendation. Zhu et al. propose a method to directly optimize the revenue in
sponsored search [22]. However, they only consider how to maximize the rev-
enue, but ignore the experience of users. Actually, when no ads are relevant to the users' interests, displaying irrelevant ads may lead to many complaints from the users and even train them to ignore ads. Broder et al. study the problem
of “whether to swing”, that is, whether to show ads for an incoming query [3].
However, they simplify the problem as a binary classification problem, while
in most real cases, the problem is more complex and often requires a dynamic
number for the displayed ads. Few works have been done about dynamically
predicting the number of displayed ads for a given query.
Fig. 2. How CTR varies with the positions
Fig. 3. How the number of removed ads varies with the click entropy of a query
where P(q) is the collection of ads clicked on query q and P(ad|q) = |Clicks(q, ad)| / |Clicks(q)| is the ratio of the number of clicks on ad to the number of clicks on query q.
¹ http://www.sogou.com
Fig. 4. How Max-Clicked-Position varies with the click entropy on the two datasets
A smaller click entropy means that the majority of users agree with each other on a small number of ads, while a larger click entropy indicates a greater query diversity, that is, many different ads are clicked for the same query.
Click Entropy vs. #Removed ads. Figure 3 shows how the number of removed
ads varies with the click entropy of a query on the dataset DS BroadMatch. By
this distribution, for a query, if we want to remove a given number of ads, we can
automatically obtain the threshold of the click entropy which can be utilized for
helping determine the number of displayed ads.
Click Entropy vs. Max-Clicked-Position. For a query, Max-Clicked-Position is the last position at which an ad is clicked. Figure 4 shows how the Max-Clicked-Position
varies with the click entropy on the two datasets. The observations are as follows:
Click Entropy vs. QueryCTR. Figure 5 shows how QueryCTR varies with
the click entropy of a query. QueryCTR is the ratio of the number of clicks of
a query to the number of impressions of this query. We can conclude that when
the click entropy of a query is greater than 3, the QueryCTR will be very near zero. This observation is very interesting: the QueryCTR is the summation of the ads' click entropy, so we can utilize this observation to help determine the number of displayed ads for a given query.
Fig. 5. How QueryCTR varies with the click entropy on the two datasets
Ad Ranking: In this task, we aim to rank all the related ads of a given query by
relevance. Specifically, given each record {q, adq (p), cq (p)} from the click-through
data L = {q, ADq , Cq }q∈Q , we can first extract its associated feature vector xq (p)
from the query-ad pair, then obtain one training instance {xq (p), cq (p)}. Simi-
larly, we can generate the whole training data L = {xq (p), cq (p)}q∈Q,p=1,···,nq ∈
Rd × {0, 1} from the click-through data where d is the number of features.
Let (x, c) ∈ L be an instance from the training data where x ∈ Rd is the
feature vector and c ∈ {0, 1} is the associated click indicator. In order to predict
the CTR of an ad, we can learn a logistic regression model as follows whose
output is the probability of that ad being clicked:
  P(c = 1|x) = 1 / (1 + e^(−Σ_i w_i x_i))     (2)
where xi is the i-th feature of x and wi is the weight for that feature. P (c = 1|x)
is the predicted CTR of that ad whose feature vector is x.
For training, we can use the maximum likelihood method for parameter learn-
ing; for test, given a query, we can use the learnt logistic regression model for
predicting the CTR of one ad.
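As an illustration only (the paper does not specify the optimizer or any library), a compact sketch of fitting Eq. (2) by maximum likelihood with plain batch gradient ascent could look as follows.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_ctr_model(X, c, lr=0.1, epochs=200):
    """Fit the weights w of Eq. (2), P(c = 1 | x) = 1 / (1 + exp(-sum_i w_i x_i)),
    by maximum likelihood using plain batch gradient ascent.

    X: (num_instances, d) feature matrix; c: 0/1 click indicators."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)                        # predicted CTR of every instance
        w += lr * X.T @ (c - p) / len(c)          # gradient of the log-likelihood
    return w

def predict_ctr(w, x):
    """Predicted CTR of an ad whose feature vector is x."""
    return float(sigmoid(np.dot(w, x)))
```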
Algorithm 1. Ad ranking and ad number prediction
Ad Ranking:
1: Function learning for predicting CTRs from L: P(c = 1|x) = 1 / (1 + e^(−Σ_i w_i x_i))
Ad Number Prediction:
2: for q′ ∈ Q′ do
3:   Rank AD_{q′} by the predicted CTRs P(c = 1|x)
4:   Let the number k = 0 and the click entropy CE = 0
5:   while CE ≤ η do
6:     k = k + 1
7:     if ad_{q′}(k) is predicted to be clicked then
8:       CE = CE − P(ad_{q′}(k)|q′) · log₂ P(ad_{q′}(k)|q′)
9:     end if
10:   end while
11:   R_{q′} = AD_{q′}(1 : k)
12:   Output k and O = {q′, R_{q′}}_{q′∈Q′}
13: end for
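A minimal Python sketch of the prediction loop above follows; the normalization of predicted CTRs into P(ad|q′) anticipates the approximation described in Section 5, and the click_threshold parameter is an assumption introduced here to decide when an ad counts as "predicted to be clicked".

```python
import math

def predict_num_ads(predicted_ctrs, eta, click_threshold=0.5):
    """Decide how many of a query's ads to display (cf. the listing above).

    predicted_ctrs:  CTRs of the query's ads, sorted in decreasing order.
    eta:             click-entropy threshold.
    click_threshold: cut-off deciding when an ad counts as 'predicted to be
                     clicked' (an assumption introduced for this sketch).
    """
    total = sum(predicted_ctrs) or 1.0
    p_ad = [ctr / total for ctr in predicted_ctrs]   # P(ad | q') via normalized CTRs
    k, ce = 0, 0.0
    while ce <= eta and k < len(predicted_ctrs):     # length guard added for safety
        k += 1
        if predicted_ctrs[k - 1] >= click_threshold: # 'predicted to be clicked'
            p = p_ad[k - 1]
            if p > 0:
                ce -= p * math.log2(p)               # accumulate click entropy
    return k                                         # display the top-k ads
```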
Table 1 lists all the 30 features extracted from query-ad title pair, query-ad pair,
and query-keyword pair which can be divided into three categories: Relevance-
related, CTR-related and Ads-related.
Relevance-related features. The relevance-related features consist of low-
level and high-level ones. The low-level features include highlight, TF, TF*IDF
and the overlap, which can be used to measure the relevance based on keyword
matching. The high-level features include cosine similarity, BM25 and LMIR,
which can be used to measure the relevance beyond keyword matching.
CTR-related features. AdCTR can be defined as the ratio of the number of ad clicks to the total number of ad impressions. Similarly, we can define keyCTR and titleCTR: keyCTR corresponds to the multiple advertising for a specific keyword, and titleCTR corresponds to multiple advertising with the same ad title. We also introduce the features keyTitleCTR and keyAdCTR because the assignment of a keyword to an ad is usually determined by the sponsors and the search engine company, and the quality of this assignment affects the ad CTR.
Ads-related features. We introduce some features for the ads themselves, such as the length of the ad title, the bidding price, the match type and the position.
5 Experimental Results
Evaluation. We quantitatively evaluate all the methods by the total number of clicks for all queries in the test dataset: #click(q) = Σ_{p=1}^{n_q} c_q(p).
Fig. 6. (a) How the total number of clicks varies with the number of removed ads for all three methods; (b) How the total number of clicks and the total number of removed ads vary with the threshold of click entropy
For evaluation, we first remove a certain number of ads for a query in the
test dataset by different ways, and then find the way which leads to the least
reduction of the number of clicks.
Baselines. In order to quantitatively evaluate our approach, we compare our
method with two other baselines. Assume that we want to cut down N ads
in total. For the first baseline LR CTR, for each query in the test dataset, we
predict the CTRs for the query-related ads, and then pool the returned ads for
all the queries and re-rank them by the predicted CTRs, finally remove the last
N ads with lowest CTRs. The major problem for LR CTR is that it cannot be
updated in an online manner, that is, we need to know all the predicted CTRs for
all the queries in the test dataset in advance. This is impossible for determining
the removed ads for a given query. For the second baseline LR RANDOM, we
predict the CTRs of the query-related ads for each query in the test dataset, and then only remove the last ad with some probability for each query. We can tune the probability to remove a certain number of ads, but the disadvantage is that there is no explicit correspondence between the two. For our proposed
approach LR CE, we first automatically determine the threshold of click entropy
for a query and then use Algorithm 1 to remove the ads. Our approach does not
suffer from the disadvantages of the above two baselines.
Experiment Setting. All the experiments are carried out on a PC running
Windows XP with AMD Athlon 64 X2 Processor(2GHz) and 2G RAM.
We use the predicted CTRs from the ad ranking task to approximate the term P(ad|q) in Eq. 1 in this way: P(ad|q) = CTR(ad) / Σ_i CTR(ad_i), where CTR(ad) and CTR(ad_i) are the predicted CTRs of the current ad and the i-th related ad for query q, respectively. For training, we use the feature "position"; for testing, we set the feature "position" to zero for all instances.
#Removed ads vs. #Clicks. Figure 6(a) shows all the results of two baselines
and our approach. From that, the main observations are as follows:
#Removed ads vs. CTR and #Clicks. Figure 6(b) shows how the total
number of clicks and the number of removed ads vary with the threshold of click
entropy. As the threshold of click entropy increases, the total number of clicks
increases while the number of removed ads decreases.
6 Related Work
7 Conclusion
In this paper, we study an interesting problem: how many ads should be displayed for a given query. There are two challenges: ad ranking and ad number
prediction. First, we conduct extensive analyses on real click-through data of ads
and the two main observations are 1) when the click entropy of a query exceeds
a threshold the CTR of that query will be very near zero; 2) the threshold of
click entropy can be automatically determined when the number of removed ads
is given. Second, we propose a learning approach to rank the ads and to predict
the number of displayed ads for a given query. Finally, the experimental results
on a commercial search engine validate the effectiveness of our approach.
Learning to recommend ads in sponsored search presents a new and interesting
research direction. One interesting issue is how to predict the user intention
before recommending ads [7]. Another interesting issue is how to exploit click-
through data in different domains where the click distributions may be different
for refining ad ranking [21]. It would also be interesting to study how collective
intelligence (social influence between users for sentiment opinions on an ad) can
help improve the accuracy of ad number prediction [20].
References
1. Agarwal, D., Chen, B.-C., Elango, P.: Spatio-temporal models for estimating click-
through rate. In: WWW 2009, pp. 21–30 (2009)
2. Arampatzis, A., Kamps, J., Robertson, S.: Where to stop reading a ranked list?:
threshold optimization using truncated score distributions. In: SIGIR 2009, pp.
524–531 (2009)
3. Broder, A., Ciaramita, M., Fontoura, M., Gabrilovich, E., Josifovski, V., Metzler,
D., Murdock, V., Plachouras, V.: To swing or not to swing: learning when (not) to
advertise. In: CIKM 2008, pp. 1003–1012 (2008)
4. Carterette, B., Jones, R.: Evaluating search engines by modeling the relationship
between relevance and clicks. In: NIPS 2007 (2007)
5. Chapelle, O., Zhang, Y.: A dynamic bayesian network click model for web search
ranking. In: WWW 2009, pp. 1–10 (2009)
6. Chen, Y., Pavlov, D., Canny, J.F.: Large-scale behavioral targeting. In: KDD 2009,
pp. 209–218 (2009)
7. Cheng, Z., Gao, B., Liu, T.-Y.: Actively predicting diverse search intent from user
browsing behaviors. In: WWW 2010, pp. 221–230 (2010)
8. Ciaramita, M., Murdock, V., Plachouras, V.: Online learning from click data for
sponsored search. In: WWW 2008, pp. 227–236 (2008)
9. Craswell, N., Zoeter, O., Taylor, M., Ramsey, B.: An experimental comparison of
click position-bias models. In: WSDM 2008, pp. 87–94 (2008)
10. Dembczyński, K., Kotlowski, W., Weiss, D.: Predicting ads’ click-through rate with
decision rules. In: TROA 2008, Beijing, China (2008)
11. Dou, Z., Song, R., Wen, J.: A large-scale evaluation and analysis of personalized
search strategies. In: WWW 2007, pp. 581–590 (2007)
12. Dupret, G.E., Piwowarski, B.: A user browsing model to predict search engine click
data from past observations. In: SIGIR 2008, pp. 331–338. ACM, New York (2008)
13. Guo, F., Liu, C., Kannan, A., Minka, T., Taylor, M., Wang, Y.-M., Faloutsos, C.:
Click chain model in web search. In: WWW 2009, pp. 11–20 (2009)
14. Gupta, M.: Predicting click through rate for job listings. In: WWW 2009, pp.
1053–1054 (2009)
15. König, A.C., Gamon, M., Wu, Q.: Click-through prediction for news queries. In:
SIGIR 2009, pp. 347–354 (2009)
16. Radlinski, F., Broder, A.Z., Ciccolo, P., Gabrilovich, E., Josifovski, V., Riedel, L.:
Optimizing relevance and revenue in ad search: a query substitution approach. In:
SIGIR 2008, pp. 403–410 (2008)
17. Richardson, M., Dominowska, E., Ragno, R.: Predicting clicks: estimating the click-
through rate for new ads. In: WWW 2007, pp. 521–530 (2007)
18. Shanahan, J., Roma, N.: Boosting support vector machines for text classification
through parameter-free threshold relaxation. In: CIKM 2003, pp. 247–254 (2003)
19. Streeter, M., Golovin, D., Krause, A.: Online learning of assignments. In: NIPS
2009, pp. 1794–1802 (2009)
20. Tang, J., Sun, J., Wang, C., Yang, Z.: Social influence analysis in large-scale net-
works. In: KDD 2009, pp. 807–816 (2009)
21. Wang, B., Tang, J., Fan, W., Chen, S., Yang, Z., Liu, Y.: Heterogeneous cross
domain ranking in latent space. In: CIKM 2009, pp. 987–996 (2009)
22. Zhu, Y., Wang, G., Yang, J., Wang, D., Yan, J., Hu, J., Chen, Z.: Optimizing
search engine revenue in sponsored search. In: SIGIR 2009, pp. 588–595 (2009)
23. Zhu, Z.A., Chen, W., Minka, T., Zhu, C., Chen, Z.: A novel click model and its
applications to online advertising. In: WSDM 2010, pp. 321–330 (2010)
TeamSkill: Modeling Team Chemistry in Online
Multi-player Games
1 Introduction
Skill assessment has long been an active area of research. Perhaps the most well-
known application is to the game of chess, where the need to gauge the skill
of one player versus another led to the development of the Elo rating system
[1]. Although mathematically simple, Elo performed well in practice, treating
skill assessment for individuals as a paired-comparison estimation problem, and
was subsequently adopted by the US Chess Federation (USCF) in 1960 and the
World Chess Federation (FIDE) in 1970. Other ranking systems have since been
developed, notably Glicko [2], [3], a generalization of Elo which sought to address
Elo’s ratings reliability issue, and TrueSkill [4], the well-known Bayesian model
used for player/team ranking on Microsoft’s Xbox Live gaming service.
With hundreds of thousands to millions of players competing on networks such
as Xbox Live, accurate estimations of skill are crucial because unbalanced games
- those giving a distinct advantage to one player or team over their opponent(s) -
ultimately lead to player frustration, reducing the likelihood they will continue to
play. For multiplayer-focused games, this is a particularly relevant issue as their
success or failure is tied to player interest sustained over a long period of time.
While previous work in this area [4] has been evaluated using data from a
general population of players, less attention has been paid to certain boundary
conditions, such as the case where the entire player population is highly-skilled
individually. As in team sports [5], [6], less tangible notions, such as “team chem-
istry”, are often cited as key differentiating factors, particularly at the highest
levels of play. However, in existing skill assessment approaches, player perfor-
mances are assumed to be independent from one another, summing individual
player ratings in order to arrive at an overall team rating.
In this work, we describe four approaches (TeamSkill-K, TeamSkill-AllK,
TeamSkill-AllK-EV, and TeamSkill-AllK-LS) which make use of the observed
performances of subsets of players on teams as a means of capturing “team chem-
istry” in the ratings process. These techniques use ensembles of ratings of these
subsets to improve prediction accuracy, leveraging Elo, Glicko, and TrueSkill as
“base learners” by extending them to handle entire groups of players rather than
strictly individuals. To the best of our knowledge, no similar approaches exist in
the domain of skill assessment.
For evaluation, we introduce a rich dataset compiled over the course of 2009
based on the Xbox 360 game Halo 3, developed by Bungie, LLC in Kirkland,
WA. Halo 3 is a first-person shooter (FPS) played competitively in Major League
Gaming (MLG), the largest professional video game league in the world, and is
the flagship game for the MLG Pro Circuit, a series of tournaments taking place
throughout the year in various US cities. Our evaluation shows that, in general,
predictive performance can be improved through the incorporation of subgroup
ratings into a team’s overall rating, especially in high-level gaming contexts,
such as tournaments, where teamwork is likely more prevalent. Additionally, the modeling of variance in each rating system is found to play a large role in determining what gain (or loss) in performance one can expect from using subgroup rating information. Elo, which uses a fixed variance, is found
to perform worse when used in concert with any TeamSkill approach. However,
when the Glicko and TrueSkill rating systems are used as base learners (both
of which model variance as player-level variables), several TeamSkill variants
achieve the highest observed prediction accuracy, particularly TeamSkill-AllK-
EV. Upon further investigation, we find this performance increase is especially
apparent for “close” games, consistent with the competitive gaming environment
in which the matches occur.
The paper is structured as follows. Section 2 reviews some of the relevant re-
lated work in the fields of player and team ratings/ranking systems and
2 Related Work
In games, the question of how to rank (or provide ratings of) players is old, trac-
ing its roots to the work of Louis Leon Thurstone in the mid-1920’s and Bradley-
Terry-Luce models in the 1950’s. In 1927 [7], Thurstone proposed the “law of
comparative judgement”, a means of measuring the mean distance between two
physical stimuli, Sa and Sb . Thurstone, working with stimuli such as the sense-
distance between levels of loudness, asserted that the distribution underlying
each stimulus process is normal and that as such, the mean difference between
the stimuli Sa and Sb can therefore be quantified in terms of their standard devi-
ation. This work laid the foundation for the formulation of Bradley-Terry-Luce
(BTL) models in 1952 [8], a logistic variant of Thurstone’s model which pro-
vided a rigorous mathematical examination of the paired comparison estimation
problem, using taste preference measurements as its experimental example.
The BTL model framework provided the basis for the Elo rating system,
introduced by Arpad Elo in 1959 [1]. Elo, himself a master chess player, developed
the Elo rating system to replace the US Chess Federation’s Harkness rating
system with one more grounded in statistical theory. Like Thurstone, the Elo
rating system assumes each player’s skill is normally distributed, where player
i’s expected performance is pi ∼ N (μi , β 2 ). Notably, though, Elo also assumes
players’ skill distributions share a constant variance β 2 , greatly simplifying the
mathematical calculation at the expense of capturing the relative certainty of
each player’s skill.
In 1993 [3], Mark Glickman sought to improve upon the Elo rating system by
addressing the ratings reliability issue in the Glicko rating system. By introduc-
ing a dynamic variance for each player, the confidence in a player’s skill rating
could be adjusted to produce more conservative skill estimates. However, the
inclusion of this information at the player level also incurred significant com-
putational cost in terms of updates, and so an approximate Bayesian updating
scheme was devised which estimates the marginal posterior distribution P r(θ|s),
where θ and s correspond to the player strengths and the set of game outcomes
observed thus far, respectively.
With the advent of large-scale console-based multiplayer gaming on the Mi-
crosoft Xbox in 2002 via Xbox Live, there was a growing need for a more gener-
alized ratings system not solely designed for individual players, but teams - and
any number of them - as well. TrueSkill [4], published in 2006 by Ralf Herbrich
and Thore Graepel of Microsoft Research, used a factor graph-based approach
to accomplish this. Like Glicko, TrueSkill also maintains a notion of variance for
each player, but unlike it, TrueSkill samples an expected performance pi given
a player’s expected skill, which is then summed for all players on i’s team to
represent the collective skill of that team. This expected performance pi is also
assumed to be distributed normally, but similar to Elo, a constant variance is
assumed across all players. Of note, TrueSkill’s summation of expected player
performances in quantifying a team’s expected performance assumes player per-
formances are independent of one another. In the case of team games, especially
those occurring at high levels of competition where team chemistry and cooper-
ative strategies play much larger roles, this assumption may prove problematic
in ascertaining which team has the true advantage a priori. We explore this topic
in more depth later on.
Other variants of the aforementioned approaches have also been proposed.
Coulom’s Whole History Rating (WHR) method [9] is, like other rating systems
such as Elo, based on the dynamic BTL model. Instead of incrementally updating
the skill distributions of each player after a match, it approximates the maximum
a posteriori over all previous games and opponents, resulting in a more accurate skill
estimation. This comes at the cost of some computational ease and efficiency,
which the authors argue is still minimal if deployed on large-scale game servers.
Others [10] have extended the BTL model to use group comparisons instead
of paired comparisons, but also assume player performance independence by
defining a team’s skill as the sum of its players’.
Birlutiu and Heskes [11] develop and evaluate variants of expectation propaga-
tion techniques for analysis of paired comparison data by rating tennis players,
stating that the methods are generalizable to more complex models such as
TrueSkill. Menke, et al. [12] develop a BTL-based model based on the logistic
distribution, asserting that weaker teams are more likely to win than what a
normally-distributed framework would predict. They also conclude that models
based on normal distributions, such as TrueSkill, lead to an exponential increase
in team ratings when one team has more players than another.
The field of game theory includes a number of related concepts, such as the
Shapley value [13], which considers the problem of how to fairly allocate gains
among a coalition of players in a game. In the traditional formulation of skill
assessment approaches, however, gains or losses are implicitly assumed to be
equal for all players given the limitation to win/loss/team formation history
during model construction and evaluation. That is, no additional information is
available to measure the contribution of each player to a team’s win or loss.
3 Proposed Approaches
As discussed, the characteristic common to existing skill assessment approaches
is that the estimated skill of a team is quantified by summing the individual skill
ratings of each player on the team. Though understandable from the perspective
of minimizing computational costs and/or model complexity, the assumption is
not well-aligned with either intuition or research in sports psychology [5], [6].
Only in cases where the configuration of players remains constant throughout a
team’s game history can the summation of individual skill ratings be expected
to closely approximate a team’s true skill. Where that assumption cannot be
made, as is the case in the dataset under study in this paper, it is difficult to
know how much of a player’s skill rating can be attributed to the individual and
how much is an artifact of the players he/she has teamed with in the past.
Closely related to this issue is the notion of team chemistry. “Team chemistry”
or “synergy” is a well-known concept [5], [6] believed to be a critical component of
highly-successful teams. It can be thought of as the overall dynamics of a team
resulting from a number of difficult-to-quantify qualities, such as leadership,
confidence, the strength of player/player relationships, and mutual trust. These qualities are also crucial to successful teams in Halo, which is sometimes described by its players as "real-time chess" and where teamwork is believed to be the key factor separating good teams from great ones.
The integration of any aspect of team chemistry into the modeling process
doesn’t suggest an obvious solution, though. However, a key insight is that one
need not maintain skill ratings only for individual players - they can be main-
tained for groups of players as well. The skill ratings of these groups can then be
combined to estimate the overall skill of a team. Here, we describe four methods
which make use of this approach - TeamSkill-K, TeamSkill-AllK, TeamSkill-
AllK-EV, and TeamSkill-AllK-LS.
3.1 TeamSkill-K
At a high level, this approach is simple: for a team of K players, choose a sub-
group size k ≤ K, calculate the average skill rating for all k-sized player groups
for that team using some “base learner” (such as Elo, Glicko, or TrueSkill), and
finally scale this average skill rating up by K/k to arrive at the team’s skill
rating. For k = 1, this approach is equivalent to simply summing the individual
player skill ratings together. As such, TeamSkill-K can be thought of as a gen-
eralized approach for combining skill ratings for any K-sized team given player
subgroup histories of size k.
Formally, let s*_i be the estimated skill of team i and f_i(k) be a function returning the set of skill ratings for player subgroups of size k in team i. Let each member of the set of skill ratings returned by f_i(k) be denoted as s_ikl, corresponding to the l-th configuration of size k for team i. Here, s_ikl is assumed to be a random variable drawn from some underlying distribution. Then, given some k, the collective strength of a team of size K can be estimated as follows:

  s*_i = (K/k) · E[f_i(k)]
       = ((k−1)!(K−k)!/(K−1)!) · Σ_{l=1}^{K!/(k!(K−k)!)} s_ikl     (3.1)
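To make Eq. (3.1) concrete, a small Python sketch follows; it is illustrative only, and the rating_of lookup and the default rating (used when a subgroup has no observed history) are assumptions introduced here rather than details from the paper.

```python
from itertools import combinations
from statistics import mean

def teamskill_k(players, k, rating_of, default_rating=25.0):
    """Estimate a team's skill from its k-sized subgroups, cf. Eq. (3.1).

    players:        the K players on the team.
    rating_of:      maps a frozenset of players to the base learner's group
                    rating, or None if that subgroup has no history.
    default_rating: per-player fallback rating (an assumed value), scaled to
                    the subgroup size when no history exists.
    """
    K = len(players)
    group_ratings = []
    for group in combinations(players, k):
        r = rating_of(frozenset(group))
        group_ratings.append(r if r is not None else default_rating * k)
    return (K / k) * mean(group_ratings)       # scale the average up by K/k
```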
3.2 TeamSkill-AllK
To address this issue, a second approach was developed. Here, all available player
subgroup information, 1 ≤ k ≤ K, is used to estimate the skill rating of a
team. The general idea is to model a team’s skill rating as a recursive summa-
tion over all player subgroup histories, building in the (k − 1)-level interactions
present in a player subgroup of size k in order to arrive at the final rating
estimate.
This approach can be expressed as follows. Let s*_ikl be the estimated skill rating of the l-th configuration of size k for team i and g_i(k) be a function returning the set of estimated skill ratings s*_ikl, where 1 ≤ l ≤ K!/(k!(K−k)!), for player sets of size k in team i. When k = 0, g_i(k) = {Ø} and s*_ikl = 0. As before, let s_ikl be the skill rating of the l-th configuration of size k for team i. Additionally, let α_k be a user-specified parameter in the range [0, 1] signifying the weight of the k-th level of estimated skill ratings. Then,

  s*_ikl = α_k · s_ikl + (1 − α_k) · (k/(k−1)) · E[g_i(k−1)]
         = α_k · s_ikl + (1 − α_k) · (k/(k−1)) · ( Σ_{s* ∈ g_i(k−1)} s* ) / |g_i(k−1)|     (3.2)
To compute s∗i , let s∗i = s∗ikl where k = K and l = 1 (since there is only one
player subset rating when k = K). This recursive approach ensures that all
player subset history is used.
3.3 TeamSkill-AllK-EV
In TeamSkill-AllK, if no history is available for a particular subgroup, default
values (scaled to size k) are used instead in order to continue the recursion.
Problematically, cases where limited player subset history is available will pro-
duce team skill ratings largely dominated by default rating values, potentially
resulting in inaccurate skill estimates. As such, another approach was developed,
called TeamSkill-AllK-EV. The core idea behind TeamSkill-AllK - the usage of
all available player subgroup histories - was retained, but the new implementa-
tion eschewed default values for all player subsets save those of individual players
(consistent with existing skill assessment approaches), instead focusing on the
evidence drawn solely from game history. Re-using notation, TeamSkill-AllK-EV
is as follows:
  s*_i = ( Σ_{k=1}^{K} (K/k) · E[h_i(k)] ) / ( Σ_{k=1}^{K} 1[h_i(k) ≠ ∅] )     (3.3)

Here, h_i(k) = f_i(k) if there exists at least one player subset history of size k; otherwise ∅ is returned.
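A brief sketch of Eq. (3.3) follows, assuming subgroup ratings are kept in a dictionary keyed by the player subset (an illustrative data structure, not taken from the paper):

```python
from itertools import combinations

def teamskill_allk_ev(players, subgroup_ratings):
    """Estimate a team's skill with TeamSkill-AllK-EV, cf. Eq. (3.3).

    subgroup_ratings: dict mapping a frozenset of players to its observed
                      group rating (only subgroups with game history appear).
    """
    K = len(players)
    levels = []
    for k in range(1, K + 1):
        ratings = [subgroup_ratings[frozenset(g)]
                   for g in combinations(players, k)
                   if frozenset(g) in subgroup_ratings]
        if ratings:                                   # h_i(k) is non-empty
            levels.append((K / k) * sum(ratings) / len(ratings))
    return sum(levels) / len(levels) if levels else 0.0
```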
3.4 TeamSkill-AllK-LS
In this context, it is natural to hypothesize that the most accurate team skill
ratings could be computed using the largest possible player subsets covering all
members of a team. That is, given some player subset X and its associated rating,
ratings for subsets of X should be disregarded since they represent lower-level
interaction information X would have already captured in its rating. Formally, such an approach can be represented as follows:

  s*_i = ( Σ_{k=K}^{1} E[h_i(k) ⊄ h_i(k < j ≤ K)] ) / ( Σ_{m=K}^{1} 1[{h_i(m) ⊄ h_i(m < j ≤ K)} ≠ ∅] )     (3.4)
One obvious advantage to this approach is its speed, since this method prunes
away from consideration ratings of subsets of previously-used supersets.
4 Dataset
The data under study in this paper was collected throughout 2009 as part of a
larger project to produce a high-quality, competitive gaming dataset. Halo 3, re-
leased in September 2007 on the Xbox 360 game console, is played professionally as
the flagship game in Major League Gaming (as were its predecessors Halo:Combat
Evolved and Halo 2). Major League Gaming (MLG) is the largest video gaming
league in the world and has grown rapidly since its inception in 2002, with Internet
viewership for 2009 events topping 750,000. After its release, Halo 3 replaced Halo
2 beginning with the 2008 season (known as the Pro Circuit).
The dataset contains Halo 3 multiplayer games between two teams of four
players each. Each game was played in one of two environments - over the Internet
on Microsoft’s Xbox Live service in custom games (known as scrimmages) or on
a local area network at an MLG tournament. Information on each game includes
the players and teams involved, the date of the game, the map and game type, the
result (win/loss) and score, and per-player statistics such as kills, deaths, assists
(where one player helps another player from the same team kill an opponent),
and score.
The dataset has several interesting characteristics, such as the high frequency
of team changes from one tournament to the next. With four players per team, it
is not uncommon for a team with a poor showing in one tournament to replace
one or two players before the next. As such, the resulting dataset lends itself
to analyses of skill at the group level since the diversity of player assignments
can aid in isolating interesting characteristics of teams who do well versus those
who do not. Additionally, since the players making up the top professional and semi-professional teams are all highly skilled individually, "basic" game familiarity (such as control mechanics) is not considered as important a factor in winning/losing as overall team strategy, execution, and adaptation to the opposi-
tion. This focus also helps mitigate issues pertaining to the personal motivations
of players since all must be dedicated to winning in order to have earned a spot
in the top 32 teams in the league, winnowing out those who might intentionally
lose games for their teams (as is commonplace in standard Halo 3 multiplayer
gaming). Taken together, these elements make for a very high quality research
dataset for those interested in studying competitive gaming, skill ratings sys-
tems, and teamwork.
The dataset has been made available on the HaloFit web site in two formats.
The first, http://stats.halofit.org, contains several views into the dataset similar
to statistics pages of professional sports leagues such as Major League Baseball.
Users can drill down into the dataset using a series of filters to find data rele-
vant to favorite teams or players. The second, http://halofit.org, contains partial
and full comma-separated exports of the dataset. The dataset currently houses
information on over 9,100 games, 566 players, and 186 teams.
5 Experimental Analysis
The four proposed TeamSkill approaches were evaluated by predicting the outcomes of games occurring prior to 10 Pro Circuit tournaments and comparing their accuracy to unaltered versions (k = 1) of their base learner rating systems - Elo, Glicko, and TrueSkill. For TeamSkill-K, all possible choices of k for teams of 4, 1 ≤ k ≤ 4, were used. Given two teams, t1 and t2, the prior probability of t1 winning is a straightforward derivation from the negative CDF at 0 of the distribution describing the difference between two independent, normally-distributed random variables, i.e., P(t1 beats t2) = Φ((μ1 − μ2) / √(σ1² + σ2²)) for team ratings distributed as N(μ1, σ1²) and N(μ2, σ2²).
tournament and the one preceding it (“recent”), and all data before the test
tournament (“complete”).
– 2 types of games - the full dataset and those games considered “close” (i.e.,
prior probability of one team winning close to 50%).
In the case where only tournament data is used as training set data, the most
recent tournament preceding the test tournament replaced the inter-tournament
scrimmage data for the “long” and “recent” game history configurations. Simi-
larly, “recent” game history when considering both tournament and scrimmage
data included the most recent tournament. “close” games were defined using a
slightly modified version of the “challenge” method [4] in which the top 20%
closest games were selected for one rating system and presented to the other
(and vice versa). In this evaluation, the closest games from the “vanilla” ver-
sions of each rating system (i.e., k = 1) were presented to each of the TeamSkill
approaches while the closest games from TeamSkill-AllK-EV were presented to
the “vanilla” versions. The reason these two were chosen is that all the TeamSkill approaches are intended to improve upon their respective “vanilla” versions and that repeated testing had shown TeamSkill-AllK-EV to be the best performing approach on full datasets in many cases. The default values used during the evaluation of Elo (α = 0.07, β = 193.4364, μ0 = 1500, σ0² = β²), Glicko (q = log(10)/400, μ0 = 1500, σ0² = 100²), and TrueSkill (ε = 0.5, μ0 = 25, σ0² = (μ0/3)², β = σ0²/2) correspond to the defaults outlined in [4] and [3].
Additionally, for Glicko, a rating period of one game was assigned due to the
continuity of game history over the course of 2008 and 2009, as well as to ap-
proximate an “apples to apples” comparison with respect to Elo and TrueSkill.
In the interest of space, a subset of the 3,780 total evaluations is presented, corresponding to the "complete" cases. The "long" results essentially mirrored
the “complete” results while the “recent” results were virtually identical across
all TeamSkill variations for all non-close games and produced no clear patterns
for close games (with differences only emerging after one or two tournaments, as
can be seen in the “complete” results).
Fig. 1. Prediction accuracy for both tournament and scrimmage/custom games using
complete history
5.2 Discussion
As mentioned, Elo doesn’t benefit from the inclusion of group-level ratings in-
formation. The reason stems from Elo’s use of a constant variance and as such,
Elo is not sensitive to the dynamics of a player’s skill over time. For groups of
players, this issue is compounded since the higher the k under consideration, the
less prior game history can be drawn on to infer their collective skill. With the
TeamSkill approaches, the net effect is that incorporating (k > 1)-level group ratings 'dilutes' the overall team rating, resulting in a higher number of closer
games since there is no provision for Elo’s constant variance to differ depending
on the size of the group under consideration.
Similarly, variance also accounts for much of the differences between Glicko
and TrueSkill’s performances. Both make use of player-level variances (and, thus,
group-level variances using the TeamSkill approaches). However, TrueSkill also
Fig. 4. Prediction accuracy for both tournament and scrimmage/custom games using
complete history, close games only
Fig. 5. Prediction accuracy for tournament games using complete history, close games
only
Fig. 6. Prediction accuracy for scrimmage/custom games using complete history, close
games only
6 Conclusions
Our experiments demonstrate that in many cases, the proposed TeamSkill ap-
proaches can outperform the “vanilla” versions of their respective base learner,
particularly in close games. We find that the ways in which skill variance is ad-
dressed by each base learner has a large effect on the prediction accuracy of the
TeamSkill approaches, the results suggesting that those employing a dynamic
variance (i.e., Glicko, TrueSkill) can benefit from group-level ratings. In our fu-
ture work, we intend to investigate ways of better representing skill uncertainty,
possibly by modeling the uncertainty itself as a distribution, and constructing
an online version of TeamSkill in order to improve skill estimates.
Acknowledgments
We would like to thank the Data Analysis and Management Research group, as well
as the reviewers, for their feedback and suggestions. We would also like to thank
Major League Gaming for making their 2008-2009 tournament data available.
References
1. Elo, A.: The Rating of Chess Players, Past and Present. Arco Publishing, New
York (1978)
2. Glickman, M.: Paired Comparison Model with Time-Varying Parameters. PhD
thesis. Harvard University, Cambridge (1993)
Computer Science Dept., Univ. of California, Los Angeles, CA, 90095-1596, USA
{danhe,stott}@cs.ucla.edu
1 Introduction
Research grants are critical to the development of science and the economy. Every
year billions of dollars are invested in diverse scientific research topics, yet there is
far from sufficient funding to support all researchers and their projects. Funding
of research projects is highly selective. For example, in the past roughly only
20% of submitted projects have been funded by NIH. As a result, to maximize
their chances of being funded, researchers often feel pressured to submit grant
projects on topics that have ‘increasing momentum’ — where ‘momentum’ is
defined as the rate of change of a certain measure such as popularity, impact or
significance. It would be helpful if one could model this pressure quantitatively.
contains. This gives what we believe is the first quantitative definition of fund-
ing momentum of a research project.
The remainder of the paper is structured as follows. In Section 2 we summarize
previous work on analyzing grant databases, and explore the relevance of stock
market momentum models to development of ‘research topic portfolios’ — such
as identifying burst periods and momentum trends. In Section 3, we define a
model of funding momentum for research topics and projects. We propose a
framework to predict funding momentum in Section 4. We report experimental
results and comparisons in Section 5, and conclude the paper in Section 6.
2 Related Work
There has been a great deal of work in the analysis of grant portfolios, such as
in the MySciPI project [1] and the NIH RePORTER system [2]. However, these
works have focused on development of databases to store information about
awards. For example, MySciPI [1] provides a database of projects that can be
queried by keywords or by research topics. When given a research topic, extensive
information about all projects related to the topic can be extracted. However,
beyond the extraction of project information, only basic statistics (such as the
number of hits of the topic in the database, etc.) are shown. The ability to mine
information in these databases has been lacking.
We try to analyze this information based on the indicators such as popular-
ity, impact and significance. We can easily calculate these indicators over a time
window for certain research topics. We are interested in the problem of identi-
fication and prediction of periods in which the indicators have a strong upward
momentum movement, or ‘burst’. Much work has been done to identify bursts of a topic over time series. Kleinberg [14] and Leskovec et al. [11] use
an automaton to model bursts, in which each state represents the arrival rate
of occurrences of a topic. Shasha and co-workers [19] defined bursts based on
hierarchies of fixed-length time windows. He and Parker [10] model bursts with
a ‘topic dynamics’ model, where bursts are intervals of positive change in mo-
mentum. Then they apply trend analysis indicators such as EMA and MACD
histogram, which are well-developed momentum computation methods used in
evaluating stocks, to identify burst periods. They show that their topic dynamics
model is successful in identifying bursts for individual topics, while Kleinberg’s
model is more appropriate to identify bursts from a set of topics. He and Parker
also point out that the topic dynamics model may permit adaptation of the
multitude of technical analysis methods used in analyzing market trends.
Of course, an enormous amount of work has gone into prediction of stock
prices and of market trends (such as upward or downward movement, turning
points, etc.), and a multitude of models has been proposed. For example, Al-Qaheri et al. [5] recently used rough sets to develop rules for predicting stock prices. Classical methods involve neural networks (Lawrence [15], Gryc [8], Nikooa et al. [17], Sadd [18]) for forecasting stock prices. Hassan and Nath
[9] applied Hidden Markov Models to forecast prices. Genetic Algorithms have
Table 1. Key symbols used in the paper

Symbol            Description
FM(T, m)          Funding momentum of the topic or project T over a period of m months
BS(T, m)          Burst strength of the topic or project T over a period of m months
FI(T, m)          Frequency increase indicator of the topic or project T over a period of m months
F(T)_i            Frequency of the topic or project T at month i
CF(T)             Current (start) frequency of the topic or project T
histogram(T)_i    MACD histogram value of the topic or project T at month i
cor(A, B)         Correlation between two topics A and B
co(A, B)          Number of co-occurrences of two topics A and B
freq(A)           Frequency of the topic A
Genetic Algorithms have also been applied [13,16]. Recently, Agrawal et al. [4] developed an adaptive Neuro-Fuzzy Inference System (ANFIS) to analyze stock momentum. Bao and Yang [7] built an intelligent stock trading system using confirmation of turning points and probabilistic reasoning. This paper shows how this wealth of mining experience might be adapted to analyzing grant funding histories.
3 Funding Momentum
3.1 Technical Analysis Indicators of Momentum
Key symbols in the paper are summarized in Table 1. To define the momentum of research topics and projects, we adapt stock market trend analysis indicators. We first review well-established background on technical analysis indicators of momentum.
– EMA — the exponential moving average of the momentum, or the ‘first derivative’ of the momentum over a time period:
$$\mathrm{EMA}(n)[x]_t = \alpha\, x_t + (1-\alpha)\,\mathrm{EMA}(n-1)[x]_{t-1} = \alpha \sum_{k=0}^{n} (1-\alpha)^k\, x_{t-k}$$
– MACD Histogram — the difference between MACD and its moving average,
or the ‘second derivative’ of the momentum:
signal(n1 , n2 , n3 ) = EMA(n3 )[MACD(n1 , n2 )]
histogram(n1 , n2 , n3 ) = MACD(n1 , n2 ) − signal(n1 , n2 , n3 )
– RSI — the relative strength index,
$$\mathrm{RSI}(n) = 100 - \frac{100}{1 + \mathrm{EMA}_U(n)/\mathrm{EMA}_D(n)}$$
where $\mathrm{EMA}_U(n)$ and $\mathrm{EMA}_D(n)$ are the EMAs of the time series U and D of upward and downward changes, respectively.
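The following is a minimal sketch, not taken from the paper, of how the EMA and MACD histogram can be computed for a monthly frequency series; the smoothing factor α = 2/(n+1) and the window lengths (n1, n2, n3) = (12, 26, 9) are conventional technical-analysis defaults and are assumptions here.

```python
import numpy as np

def ema(x, n):
    """Exponential moving average EMA(n)[x]_t with smoothing alpha = 2/(n+1) (assumed)."""
    alpha = 2.0 / (n + 1)
    out = np.empty(len(x), dtype=float)
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1 - alpha) * out[t - 1]
    return out

def macd_histogram(x, n1=12, n2=26, n3=9):
    """MACD histogram: MACD(n1, n2) minus its signal line EMA(n3)[MACD(n1, n2)]."""
    macd = ema(x, n1) - ema(x, n2)    # standard MACD(n1, n2) = EMA(n1)[x] - EMA(n2)[x]
    signal = ema(macd, n3)            # signal(n1, n2, n3)
    return macd - signal              # histogram(n1, n2, n3)
```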
$$\mathrm{FM}(\mathrm{topic}, m) = 1 - e^{-\,\mathrm{BS}(\mathrm{topic},\,m)\times \mathrm{FI}(\mathrm{topic},\,m)\,/\,\alpha} \qquad (1)$$
$$\mathrm{BS}(\mathrm{topic}, m) = \sum_{i=1}^{m} H_i(\mathrm{topic}) \qquad (2)$$
$$\mathrm{FI}(\mathrm{topic}, m) = \begin{cases} 1 & \mathrm{CF}(\mathrm{topic}) < \min_{1\le i\le m} F(\mathrm{topic})_i \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
$$H_i(\mathrm{topic}) = \begin{cases} \mathrm{histogram}(\mathrm{topic})_i & \mathrm{histogram}(\mathrm{topic})_i > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
The burst strength of a topic over an m-month period, BS(topic, m), is the sum of the histogram values H_i for the topic in burst periods, i.e., periods where the values are all positive. The frequency increase indicator FI(topic, m) indicates whether the frequency of the topic increases or drops within the m-month period. We define the funding momentum of a topic over an m-month period, FM(topic, m), with an exponential model that normalizes the value to the range [0, 1], with α as a decay parameter. In this model, a higher burst strength or a higher momentum increase ratio yields higher funding momentum, with α controlling the rate of increase with respect to these factors.
The definition encodes the following intuition about funding momentum: (1) if there is no burst in the m-month period, then no matter how high the frequency or popularity, the m-month period has no funding momentum; (2) if there is a burst, but prior to the burst there is a drop in momentum, the m-month period has no funding momentum (hence, it may be advantageous to invest in the topic after the drop ends); (3) if there is a burst, but after the burst the momentum levels off or drops, the period after the burst has no funding momentum.
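As an illustration of Formulas 1–4, the sketch below computes FM(topic, m) from a monthly frequency series, reusing the `ema`/`macd_histogram` helpers sketched in the previous subsection; the decay parameter α = 5 and placing the m-month window at the end of the series are our assumptions, not choices taken from the paper.

```python
import numpy as np

def funding_momentum(freq, m=6, alpha=5.0):
    """FM(topic, m) over the last m months of a monthly frequency series `freq`."""
    freq = np.asarray(freq, dtype=float)
    hist = macd_histogram(freq)[-m:]             # histogram(topic)_i over the m-month window
    bs = hist[hist > 0].sum()                    # Eqs. (2), (4): sum of positive histogram values
    cf = freq[-m - 1]                            # CF(topic): frequency at the start of the window
    fi = 1.0 if cf < freq[-m:].min() else 0.0    # Eq. (3): frequency increase indicator
    return 1.0 - np.exp(-bs * fi / alpha)        # Eq. (1)
```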
4 Methods
As mentioned earlier, we have adapted methods of technical analysis to compute
momentum. In the stock market, despite claims to the contrary, a common as-
sumption is that past performance is an indicator of likely future achievement.
Of course, this assumption is often violated, since the market is news-driven and
fluctuates rapidly. As we show in our experiments, for our definitions of mo-
mentum and funding momentum, the assumption often works well. Therefore,
training classifiers on past funding momentum makes sense, and in some cases
may even be adequate to forecast future funding momentum. Gryc [8] shows that
indicators such as EMA, the MACD histogram and RSI can help to improve prediction
accuracy.
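A minimal sketch of this prediction setup is given below. The experiments appear to use Weka [3] (J48 is Weka's C4.5 implementation); scikit-learn classifiers are substituted here purely for illustration, and the feature and label files are hypothetical placeholders — each row of X is assumed to hold past indicator values (EMA, MACD histogram, RSI, recent frequencies) for a topic, and y marks whether the topic's funding momentum over the following months exceeded the selection threshold.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier  # rough stand-in for Weka's J48

# Hypothetical pre-built arrays: technical-analysis features and momentum labels.
X = np.load("ta_features.npy")
y = np.load("momentum_labels.npy")

for name, clf in [("decision tree", DecisionTreeClassifier(max_depth=5)),
                  ("SVM", SVC(kernel="rbf"))]:
    error = 1.0 - cross_val_score(clf, X, y, cv=10).mean()
    print(f"{name}: 10-fold CV error rate = {error:.3f}")
```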
Although we are not able to validate whether the definitions accurately identify intervals that do not have increasing momentum, the prediction methods we propose work well for the selection criteria encoded in Formulas 1–4. In other words, when we select the increasing-momentum topics and projects based on these criteria, the prediction methods have good accuracy. When negative information about funding success becomes available and suggests additional or modified criteria, it can be assimilated into our model.
In our experiments, we have used four kinds of classifier to model whether
a topic has increasing momentum: Linear Regression, Decision Tree, SVM and
Artificial Neural Networks (ANN). The linear regression classifier estimates the
5 Experimental Results
5.1 Analyzing Bursts for Research Topics and Projects
The RePORTER database [2] provides information on NIH grant awards since 1986. Each award record includes a set of project terms, which can be considered as research topics in this work (we use ‘topic’ and ‘term’ interchangeably). Since the total number of years is only 24, we consider the funding momentum of the terms at each month, and term frequencies are calculated by month. As RePORTER imposes limits on the volume of downloaded data, we considered awards only for the state of California — a dataset containing 12,378 unique terms and 119,079 awards.
We use technical analysis indicators such as the MACD histogram value and
RSI to compute momentum and identify burst periods for the terms, adapting
the definition of bursts as intervals of increasing momentum studied in He and
Parker [10]. Burst periods for the terms ‘antineoplastics’, ‘complementary-DNA’,
and ‘oncogenes’ are shown in Figure 1 — as well as their frequencies and funding
momentum. The funding momentum covers the 6-month period beyond any time point. We set the threshold value to 0.2 for selecting years of increasing momentum (we call these years increasing momentum years). Clearly increasing
momentum years are highly correlated with bursts. Strong bursts usually define
intervals that have increasing momentum. Weak bursts, such as the one for ‘gene
expression’ around 1993, are omitted by the threshold filter. However, it is not
necessarily the case that a strong burst period has increasing momentum. For
example, ‘oncogenes’ has a strong burst from year 1997 to 1999, but the increas-
ing momentum years associated with the burst extended only from 1997 to the
middle of 1998, because the burst levels off. According to criterion (3) in Section 3.1, we say that after the middle of 1998, ‘oncogenes’ does not have increasing momentum. We can also observe how criteria (1) and (2) affect increasing
momentum. For example, for ‘antineoplastics’, the increasing momentum years
start after 1997, which is the end of its frequency plunge; criterion (2) avoids
labeling years in a plunge followed by a burst as having increasing momentum.
Fig. 1. Term frequency, investment potential and funding momentum by month (1986–2009) for the terms ‘antineoplastics’, ‘complementary-DNA’ and ‘oncogenes’
Table 2. The increasing momentum years for all terms related to ‘genetic’ and ‘AIDS’
Table 4. Prediction error rates of the classifiers J48, SVM, Linear Regression, ANN and Naive for 1,000 randomly selected terms, for different M-month periods from the current month
Classifier M=6 M=7 M=8 M=9 M=10 M=11 M=12
J48 0.084 0.085 0.086 0.086 0.088 0.088 0.088
SVM 0.084 0.085 0.086 0.086 0.089 0.091 0.092
Linear Regression 0.112 0.119 0.125 0.132 0.138 0.143 0.146
ANN 0.108 0.105 0.111 0.108 0.115 0.112 0.121
Naive 0.217 0.235 0.240 0.254 0.256 0.272 0.269
for news ‘memes’ presented in [11], where topics are more correlated and their clustering is stronger.
We can define the accuracy of our funding momentum definition for research projects as the degree of consistency between the increasing momentum years of the projects and their grant years: the more consistent they are with one another, the higher the accuracy. Varying the threshold over 0.2, 0.5 and 0.8 and randomly selecting 1,000 projects, we obtained the results shown in Table 3, which also reports the average number of terms contained in each project, around 16. Therefore, a threshold t = 0.2 requires at least 3 topics of a project to have increasing momentum, while a threshold t = 0.8 requires at least 13 such topics. Accuracy drops as the threshold increases. For t = 0.2, the accuracy was sufficiently high that we adopted t = 0.2 in all remaining experiments.
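Under this reading, the project-level rule can be sketched as follows; the function and argument names are ours, and the interpretation that a project has increasing momentum when at least a fraction t of its terms do is an assumption drawn from the discussion above.

```python
def project_has_momentum(topic_fms, t=0.2, fm_threshold=0.2):
    """topic_fms: FM values of every term the project contains (about 16 on average)."""
    increasing = sum(1 for fm in topic_fms if fm > fm_threshold)  # increasing-momentum terms
    return increasing >= t * len(topic_fms)                       # e.g. t = 0.2 -> roughly 3 of 16
```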
Fig. 2. Sensitivity vs. Specificity for classifiers J48, SVM, Linear Regression and ANN
Table 5. Prediction error rates with J48 and SVM for 1,000 randomly selected terms,
with and without TA (technical analysis indicators) for 6-month periods extending
beyond the current month
Table 6. Prediction accuracy of the classifiers J48, SVM, ANN, Linear Regression and Naive for 1,000 randomly selected projects with threshold 0.2, for the 6-month period from the current month
In this paper, by analyzing historical NIH grant award data (in RePORTER
[2]), we were able to model occurrence patterns of biomedical topics in successful
grant awards with a corresponding measure that we call funding momentum. We
also developed a classification method to predict funding momentum for these
topics in projects. We were able to show that this method achieved good predic-
tion accuracy. It seems possible that indicators such as impact and significance
could be addressed with variations on funding momentum. To our knowledge,
this is the first quantitative model of funding momentum for research projects.
We also show in this work that the classification problem is highly unbalanced; therefore the sensitivity of all the classifiers is not satisfactory, and unbalanced classification techniques might be used to improve performance.
We proposed a percentage model for the funding momentum of research projects. Other models are possible. For example, in the percentage model, the topics a project contains may have semantic correlations, in that some topics always show up together; a more complicated model may be needed to define the momentum of such a research project. Another possible model is, instead of considering the percentage of increasing-momentum topics in the project, to add up the frequencies of its topics as the ‘frequency’ of the project. We can then apply the same trend models to identify intervals of increasing momentum for the project. The intuition behind this additive model comes from stock market
References
1. MySciPI (2010), http://www.usgrd.com/myscipi/index.html
2. RePORTER (2010), http://projectreporter.nih.gov/reporter.cfm
3. Weka (2010), http://www.cs.waikato.ac.nz/ml/weka
4. Agrawal, S., Jindal, M., Pillai, G.N.: Momentum analysis based stock market pre-
diction using adaptive neuro-fuzzy inference system (anfis). In: Proc. of the In-
ternational MultiConference of Engineers and Computer Scientists, IMECS 2010
(2010)
5. Al-Qaheri, H., Hassanien, A.E., Abraham, A.: Discovering Stock Price Prediction
Rules Using Rough Sets. Neural Network World Journal (2008)
6. Andelin, J., Naismith, N.C.: Research Funding as an Investment: Can We Measure
the Returns? U.S. Government Printing Office, Washington, DC (1986)
7. Bao, D., Yang, Z.: Intelligent stock trading system by turning point confirming
and probabilistic reasoning. Expert Systems with Applications 34, 620–627 (2008)
8. Gryc, W.: Neural Network Predictions of Stock Price Fluctuations. Technical re-
port, http://i2r.org/nnstocks.pdf (accessed July 02, 2010)
9. Hassan, M.R., Nath, B.: Stock market forecasting using hidden Markov model: a
new approach.
10. He, D., Parker, D.S.: Topic Dynamics: an alternative model of ‘Bursts’ in Streams
of Topics. In: The 16th ACM SIGKDD Conference on Knowledge Discovery and
Data Mining, SIGKDD 2010, July 25-28 (2010)
11. Leskovec, J., Backstrom, L., Kleinberg, J.: Meme-tracking and the dynamics of the news cycle. In: Proceedings of the Fifteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France (July 2009)
12. Johnston, M.I., Hoth, D.F.: Present status and future prospects for HIV therapies.
Science 260(5112), 1286–1293 (1993)
13. Kaboudan, M.A.: Genetic programming prediction of stock prices. Computational
Economics 16(3), 207–236 (2000)
14. Kleinberg, J.M.: Bursty and hierarchical structure in streams. Data Min. Knowl.
Discov. 7(4), 373–397 (2003)
15. Lawrence, R.: Using neural networks to forecast stock market prices. University of
Manitoba (1997)
16. Li, J., Tsang, E.P.K.: Improving technical analysis predictions: an application of
genetic programming. In: Proceedings of The 12th International Florida AI Re-
search Society Conference, Orlando, Florida, pp. 108–112 (1999)
17. Nikooa, H., Azarpeikanb, M., Yousefib, M.R., Ebrahimpourb, R., Shahrabadia, A.:
Using A Trainable Neural Network Ensemble for Trend Prediction of Tehran Stock
Exchange. IJCSNS 7(12), 287 (2007)
18. Saad, E.W., Prokhorov, D.V., Wunsch, D.C.: Comparative study of stock trend
prediction using time delay, recurrent and probabilistic neural networks. IEEE
Transactions on Neural Networks 9(6), 1456–1470 (1998)
19. Zhu, Y., Shasha, D.: Efficient elastic burst detection in data streams. In: Proceed-
ings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, Washington, DC, USA, August 24-27, pp. 336–345 (2003)
Local Feature Based Tensor Kernel for Image
Manifold Learning
1 Introduction
Conventionally, raster grayscale images can be represented as vectors by stacking pixel brightness values row by row. This is convenient for computer processing and storage of image data. However, it is not natural for recognition and perception: human brains are more likely to handle images as collections of features lying on a highly nonlinear manifold [4]. In recent research, learning image manifolds has attracted great interest in the computer vision and machine learning communities. There are two major strategies for manifold learning of within-class variability, such as appearance manifolds from different views [5]: (1) Local Feature based methods; and (2) Key Point based methods.
There exist quite a few feature extraction algorithms, such as the colour histogram [6], auto-associator [7], shape context [8], etc. Among them, local appearance
based methods such as SIFT have drawn a lot of attention owing to their success in generic object recognition [3]. Nevertheless, it was also pointed out in [3] that it is quite difficult to study the image manifold from the local feature point of view; moreover, the descriptor itself poses an obstacle to learning a smooth manifold because it does not lie in a vector space. The authors of [3] proposed a learning framework called Feature Embedding (FE) which takes local features of images as input and constructs an interim layer of embedding in a metric space where the dis/similarity measure can be easily defined. This embedding is then utilized in subsequent processing such as classification and visualization. As another main stream, key point based methods have been successful in shape modeling, matching and recognition, as demonstrated by the Active Shape Models (ASM) [9] and the Shape Contexts [10].
Generally speaking, key points focus on the spatial information/arrangement of the objects of interest in images, while local features detail object characterization. An ideal strategy for image manifold learning should incorporate both kinds of information to assist the learning procedure. The combination of spatial and feature information has indeed been applied to object recognition in recent work, e.g., visual scene recognition in [11]. This kind of approach has close links with the task of learning from multiple sources; see [12,13]. Kernel methods are among the approaches that can be easily adapted to multiple sources via the so-called tensor kernels [14] and additive kernels [15,16]. The theoretical assumption is that the new kernel function is defined over a tensor space determined by the multiple source spaces. In our scenario, there are two sources, i.e., the source for the spatial information, denoted by y, and the source for the local feature, denoted by f. Thus each hyper “feature” is the tensor y ⊗ f. For the purpose of learning image manifolds, we aim at constructing appropriate kernels for these hyper features in this paper.
The paper is organized as follows. Section 2 proposes tensor kernels suitable for learning image manifolds. Section 3 gives a brief introduction to Twin Kernel Embedding, which is used for manifold embedding. In Section 4, we present several examples of using the proposed tensor kernels for visualization.
As we can see, the tensor kernel unifies the spatial and feature information in harmony. It is symmetric, positive semi-definite and still normalized.
We are particularly interested in the additive tensor kernel. The reason is that the product tensor kernel tends to produce very small values, forcing the Gram matrix to be close to the identity matrix in practice, which brings numerical difficulties for dimensionality reduction. The additive tensor kernel does not have this problem. However, the additive tensor kernel kt defined in (2) takes into account the spatial similarity between two different images, which makes little sense in practice. So we adopt a revised version of the additive tensor kernel:
$$k_t(y_{ik} \otimes f_{ik},\ y_{jl} \otimes f_{jl}) = \begin{cases} \rho_y\, k_y(y_{ik}, y_{jl}) + \rho_f\, k_f(f_{ik}, f_{jl}), & k = l \\ k_f(f_{ik}, f_{jl}), & k \neq l \end{cases} \qquad (3)$$
In both (2) and (3) we need to determine two extra parameters, ρy and ρf. To the best knowledge of the authors, there is no principled way to resolve this problem. In practice, we can optimize them using cross validation.
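A minimal sketch of the revised additive tensor kernel (3) is given below, assuming Gaussian base kernels for k_y and k_f; the bandwidth and the weights ρy = 0.3, ρf = 0.7 (the values used later in the experiments) are illustrative assumptions.

```python
import numpy as np

def gaussian(a, b, sigma=1.0):
    """Gaussian base kernel on vectors a, b (bandwidth sigma is an assumption)."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-d @ d / (2.0 * sigma ** 2)))

def tensor_kernel(y1, f1, img1, y2, f2, img2, rho_y=0.3, rho_f=0.7):
    """k_t of Eq. (3): the spatial term k_y only applies when both features come from the same image."""
    if img1 == img2:
        return rho_y * gaussian(y1, y2) + rho_f * gaussian(f1, f2)
    return gaussian(f1, f2)
```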
where k(·, ·) is the kernel function on the embedded data and kt(·, ·) the kernel function on the hyper feature data of the images. The first term performs the similarity matching; it shares some traits with Laplacian Eigenmaps in that it replaces Wij by kt(·, ·) and the Euclidean distance on the embedded data, $\|x_i - x_j\|^2$, by k(·, ·). The second and third terms are regularizers that control the norms of the kernel and the embeddings; λk and λx are tunable positive parameters controlling the strength of the regularization. The logic is to preserve the similarities among the input data and reproduce them in the lower dimensional latent space, expressed again as similarities among the embedded data.
k(·, ·) is normally a Gaussian kernel (Eq. (5), with hyperparameters γ and σ), because of its analytical form and strong relationship with the Euclidean distance.
A gradient-based algorithm has to be employed to minimize (4). The conjugate gradient (CG) algorithm [18] can be applied to obtain the optimal X, the matrix of the embeddings X = [x1, . . . , xN]. The hyperparameters of the kernel function k(·, ·), γ and σ, can also be optimized in the minimization procedure, which frees us from setting too many parameters. To start the CG, an initial state should be provided; any other dimensionality reduction method could supply it. However, if applicability to non-vectorial data is desired, only a few of them, such as KPCA [19] and KLE [20], would be suitable.
It is worth explaining the method of locality preservation in TKE. This is done by k-nearest neighboring: given an object oi, for any other input oj, kt(oi, oj) is artificially set to 0 if oj is not one of the k nearest neighbors of oi. The parameter k (> 1) in k-nearest neighboring controls the locality that the algorithm will preserve. This process is a kind of filtering that retains what we are interested in while leaving out minor details. However, the algorithm also works without filtering, in which case TKE becomes a global approach.
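A sketch of this filtering step, applied to a precomputed Gram matrix of kt, might look as follows; treating larger kernel values as “nearer” and symmetrising the result are our assumptions.

```python
import numpy as np

def knn_filter(Kt, k=10):
    """Zero out k_t(o_i, o_j) whenever o_j is not among the k nearest neighbours of o_i."""
    K = np.zeros_like(Kt, dtype=float)
    for i in range(Kt.shape[0]):
        nearest = np.argsort(Kt[i])[::-1][:k + 1]  # largest kernel values, typically including o_i itself
        K[i, nearest] = Kt[i, nearest]
    return np.maximum(K, K.T)                      # keep a pair if either object selects the other
```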
The out-of-sample problem [21] can be easily solved by introducing a kernel
mapping as
X = Kt A (6)
where Kt is the Gram matrix of kernel kt (·, ·) and A is a parameter matrix to
be determined. Substituting (6) into TKE and optimizing the objective function with respect to A instead of X gives us a mapping from the original space to the lower dimensional space. Once we have a new input, its embedding can be found by
xnew = kt(onew, O)A
where O denotes the collection of all the given training data. This algorithm
is called BCTKE in [22] where details were provided.
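A sketch of the out-of-sample mapping of Eq. (6) is shown below; here `kernel` stands for any kt on the original objects, and A is the parameter matrix learnt on the training collection O.

```python
import numpy as np

def embed_new(o_new, O, A, kernel):
    """x_new = k_t(o_new, O) A: map a new object into the learnt latent space."""
    k_row = np.array([kernel(o_new, o) for o in O])  # kernel values against the training objects
    return k_row @ A                                  # Eq. (6) applied to a single new object
```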
which differs from (4) in that the kernel kt(·, ·) is replaced by a distance metric. We can still minimize (7) with respect to zi. The logic is that when two images are close in the feature embedding space, they are also close on the manifold.
Another, easier way to learn the manifold using TKE is to convert the distance to a kernel by
$$k(\hat{P}_i, \hat{P}_j) = \exp\{-\sigma_k\, d(\hat{P}_i, \hat{P}_j)\} \qquad (8)$$
and substitute this kernel into (4) in TKE, where σk is a positive parameter. So we minimize the following objective function:
$$L = -\sum_{ij} k(z_i, z_j)\exp\{-\sigma_k\, d(\hat{P}_i, \hat{P}_j)\} + \lambda_k \sum_{ij} k^2(z_i, z_j) + \lambda_z \sum_i \|z_i\|^2. \qquad (9)$$
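The conversion (8) from a distance matrix on feature embeddings to a kernel is straightforward; a small sketch with an assumed scale σk follows.

```python
import numpy as np

def distance_to_kernel(D, sigma_k=1.0):
    """Eq. (8): turn pairwise distances d(P_i, P_j) into kernel values exp(-sigma_k * d)."""
    return np.exp(-sigma_k * np.asarray(D, dtype=float))
```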
4 Experimental Results
We applied the tensor kernel and TKE to image manifold learning on several image data sets: the ducks from the COIL data set, the Frey faces and handwritten digits. These are widely available online for machine learning and image processing tests. For TKE, we fixed λx = 0.001 and λk = 0.005 as stated in the original paper. We chose a Gaussian kernel for kx, Eq. (5), as described in Section 3; its hyperparameters were initially set to 1 and were updated at runtime. We used the additive tensor kernel (3) and set ρy = 0.3 and ρf = 0.7, which were picked by repeating the same experiment with different ρy and ρf until the best combination was found; this reflects a preference for local features over coordinates. The dimensionality of the feature embedding space de and the number of features extracted from the images are maximized according to the capacity of the computational platform. For demonstration purposes, we chose to visualize the images in the 2D plane to see the structure of the data.
Fig. 1. Embeddings of the duck images by (a) TKE, KLE and KPCA
methods like PCA and MDS can achieve a good embedding using the vectorial representation. As we can see from Fig. 1, the tensor kernel on local features can capture the intrinsic structure of the ducks, that is, the horizontal rotation of the toy duck. This is revealed successfully by KLE, which gives a perfect circle-like embedding; the order of the images shows the rotation. TKE seems to focus more on the classification information: its embedding shows 3 connected linear components, each of which represents a different facing direction. KPCA tries to do the same thing as KLE, but is not as satisfactory.
Fig. 2. Embeddings of the Frey faces ((a) TKE)
sad. However, an understanding like this is somewhat artificial and may not even be close to the truth, but we hope our algorithms can give some idea of these two dimensions. In this case, de = 30 and 80 features were extracted from each image. The choice of de reflects the high computational cost of TKE, which is a major drawback of this algorithm: as the number of samples grows, the number of variables to be optimized in TKE increases linearly, so when the number of images is doubled, de has to be halved owing to the limitation of computational resources. In this case, KLE does not reveal any meaningful patterns (see Fig. 2). On the other hand, TKE’s classification property is exhibited very well: it successfully separates happy and unhappy expressions into two different groups. In each group, from top to bottom, the face direction turns from right to left, so we can draw two perpendicular axes on TKE’s result, a horizontal one for mood and a vertical one for face direction. KPCA reveals a similar pattern to TKE; the only difference is that TKE’s result shows a clearer cluster structure.
Fig. 3. Embeddings of the handwritten digits ((a) TKE)
It is worth mentioning that for TKE plus the tensor kernel, we use KPCA in the last step instead of TKE because of the computational difficulty. Fig. 3 shows the final image manifolds learnt by three different algorithms with the tensor kernel. TKE shows its good classification capability even more clearly in this experiment: all classes have clear dominant clusters with some overlapping. Interestingly, by examining the visualization by TKE closely, we can see that the digit “1” class has two subclasses corresponding to two different drawing styles. They are properly separated; moreover, because they are all “1” from the local feature point of view, these two subclasses are very close to each other, thereby forming a whole digit “1” class. KLE does a very good job of separating digit “1” from the others; however, the other classes overlap significantly. KPCA has clear “2” and “4” classes, but the other classes are not distinguishable.
This experiment once again confirms the classification ability of TKE and the effectiveness of the tensor kernel on local features in depicting the structural relationships between images in terms of classification, recognition and perception.
5 Conclusion
In this paper, we proposed using a tensor kernel on local features together with TKE for image manifold learning. The tensor kernel provides a homogeneous kernel solution for images described as collections of local features instead of the conventional vector representation. The most attractive advantage of this kernel is that it integrates multiple sources of information in a uniform measure framework, such that the subsequent algorithm can be applied without difficulty in theoretical interpretation.
TKE shows very strong potential for classification when it is used in conjunction with a local-feature-focused kernel such as the tensor kernel, so it is interesting to explore further applications of this method in other areas such as bioinformatics. One drawback of TKE which may limit its application is its high computational cost: the number of parameters to be optimized is about O(n²), where n is the product of the target dimension and the number of samples. Further research on whether an efficient approximation is achievable would be very interesting.
References
1. Guo, Y., Gao, J., Kwan, P.W.: Twin kernel embedding. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(8), 1490–1495 (2008)
2. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings
of the International Conference on Computer Vision, pp. 1150–1157 (1999)
3. Torki, M., Elgammal, A.: Putting local features on a manifold. In: CVPR (2010)
4. Seung, H., Lee, D.: The manifold ways of perception. Science 290(22), 2268–2269
(2000)
5. Murase, H., Nayar, S.: Visual learning and recognition of 3D objects from appear-
ance. International Journal of Computer Vision 14, 5–24 (1995)
554 Y. Guo and J. Gao
6. Swain, M.J., Ballard, D.H.: Indexing via color histograms. In: Proceedings of the
International Conference on Computer Vision, pp. 390–393 (1990)
7. Verma, B., Kulkarni, S.: Texture feature extraction and classification. LNCS, pp.
228–235 (2001)
8. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition us-
ing shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 24(4), 509–522 (2002)
9. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models: Their
training and application. Computer Vision and Image Understanding 61(1), 38–59
(1995)
10. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using
shape contexts. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(4),
509–522 (2002)
11. Sudderth, E.B., Torralba, A., Freeman, W.T., Willsky, A.S.: Describing visual
scenes using transformed objects and parts. International Journal of Computer
Vision 77(1-3), 291–330 (2008)
12. Crammer, K., Kearns, M., Wortman, J.: Learning from multiple sources. Journal
of Machine Learning Research 9, 1757–1774 (2008)
13. Cesa-Bianchi, N., Hardoon, D.R., Leen, G.: Guest editorial: Learning from multiple
sources. Machine Learning 79, 1–3 (2010)
14. Hardoon, D.R., Shawe-Taylor, J.: Decomposing the tensor kernel support vector
machine for neuroscience data with structured labels. Machine Learning 79, 29–46
(2010)
15. Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regu-
larization, Optimization, and Beyond. The MIT Press, Cambridge (2002)
16. Evgeniou, T., Micchelli, C.A., Pontil, M.: Learning multiple tasks with kernel meth-
ods. Journal of Machine Learning Research 6, 615–637 (2005)
17. Gärtner, T., Lloyd, J.W., Flach, P.A.: Kernels for structured data. In: Proceedings
of the 12th International Conference on Inductive Logic Programming (2002)
18. Nabney, I.T.: NETLAB: Algorithms for Pattern Recognition. In: Advances in Pat-
tern Recognition. Springer, London (2004)
19. Schölkopf, B., Smola, A.J., Müller, K.: Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation 10, 1299–1319 (1998)
20. Guo, Y., Gao, J., Kwan, P.W.: Kernel laplacian eigenmaps for visualization of non-
vectorial data. In: Sattar, A., Kang, B.-h. (eds.) AI 2006. LNCS (LNAI), vol. 4304,
pp. 1179–1183. Springer, Heidelberg (2006)
21. Bengio, Y., Paiement, J., Vincent, P., Delalleau, O., Roux, N.L., Ouimet, M.: Out-
of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering. In:
Advances in Neural Information Processing Systems, vol. 16
22. Guo, Y., Gao, J., Kwan, P.W.: Twin Kernel Embedding with back constraints. In:
HPDM in ICDM (2007)
23. Cuturi, M., Fukumizu, K., Vert, J.P.: Semigroup kernels on measures. Journal of
Machine Learning Research 6, 1169–1198 (2005)
24. Geiger, A., Urtasun, R., Darrell, T.: Rank priors for continuous non-linear di-
mensionality reduction. In: IEEE Conference on Computer Vision and Pattern
Recognition, pp. 880–887 (2009)
25. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear em-
bedding. Science 290(22), 2323–2326 (2000)
Author Index