
Lecture Notes in Artificial Intelligence 6635

Edited by R. Goebel, J. Siekmann, and W. Wahlster

Subseries of Lecture Notes in Computer Science


Joshua Zhexue Huang Longbing Cao
Jaideep Srivastava (Eds.)

Advances in
Knowledge Discovery
and Data Mining

15th Pacific-Asia Conference, PAKDD 2011


Shenzhen, China, May 24-27, 2011
Proceedings, Part II

Series Editors

Randy Goebel, University of Alberta, Edmonton, Canada


Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany

Volume Editors

Joshua Zhexue Huang


Chinese Academy of Sciences
Shenzhen Institutes of Advanced Technology (SIAT)
Shenzhen 518055, China
E-mail: zx.huang@siat.ac.cn

Longbing Cao
University of Technology Sydney
Faculty of Engineering and Information Technology
Advanced Analytics Institute
Center for Quantum Computation and Intelligent Systems
Sydney, NSW 2007, Australia
E-mail: longbing.cao-1@uts.edu.au

Jaideep Srivastava
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, MN 55455, USA
E-mail: Srivasta@cs.umn.edu
ISSN 0302-9743 e-ISSN 1611-3349
ISBN 978-3-642-20846-1 e-ISBN 978-3-642-20847-8
DOI 10.1007/978-3-642-20847-8
Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2011926132

CR Subject Classification (1998): I.2, H.3, H.4, H.2.8, I.4, C.2

LNCS Sublibrary: SL 7 – Artificial Intelligence

© Springer-Verlag Berlin Heidelberg 2011


This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface

PAKDD has been recognized as a major international conference in the areas of
data mining (DM) and knowledge discovery in databases (KDD). It provides an
international forum for researchers and industry practitioners to share their new
ideas, original research results and practical development experiences from all
KDD-related areas including data mining, machine learning, artificial intelligence
and pattern recognition, data warehousing and databases, statistics, knowledge
engineering, behavioral sciences, visualization, and emerging areas such as social
network analysis.
The 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining
(PAKDD 2011) was held in Shenzhen, China, during May 24–27, 2011. PAKDD
2011 introduced a double-blind review process. It received 331 submissions af-
ter checking for validity. Submissions were from 45 countries and regions, which
shows a significant improvement in internationalization over PAKDD 2010 (34
countries and regions). All papers were assigned to at least four Program Com-
mittee members. Most papers received more than three review reports. As a
result of the deliberation process, only 90 papers were accepted, with 32 papers
(9.7%) for long presentation and 58 (17.5%) for short presentation.
The PAKDD 2011 conference program also included five workshops: the
Workshop on Behavior Informatics (BI 2011), Workshop on Advances and Issues
in Traditional Chinese Medicine Clinical Data Mining (AI-TCM), Quality Is-
sues, Measures of Interestingness and Evaluation of Data Mining Models (QIMIE
2011), Biologically Inspired Techniques for Data Mining (BDM 2011), and Work-
shop on Data Mining for Healthcare Management (DMHM 2011). PAKDD 2011
also featured talks by three distinguished invited speakers, six tutorials, and a
Doctoral Symposium on Data Mining.
The conference would not have been successful without the support of the
Program Committee members (203), external reviewers (168), Organizing Com-
mittee members, invited speakers, authors, tutorial presenters, workshop orga-
nizers, reviewers, and the conference attendees. We highly appreciate
the conscientious reviews provided by the Program Committee members and
external reviewers. We are indebted to the members of the PAKDD Steering
Committee for their invaluable suggestions and support throughout the orga-
nization process. Our special thanks go to the local arrangements team and
volunteers. We would also like to thank all those who contributed to the success
of PAKDD 2011 but whose names cannot be listed.
We greatly appreciate Springer LNCS for continuing to publish the main con-
ference and workshop proceedings. Thanks also to Andrei Voronkov for hosting
the entire PAKDD reviewing process on the EasyChair.org site.
Finally, we greatly appreciate the support from various sponsors and insti-
tutions. The conference was organized by the Shenzhen Institutes of Advanced
Technology, Chinese Academy of Sciences, China, and co-organized by the Uni-
versity of Hong Kong, China and the University of Technology Sydney, Australia.
We hope you enjoy the proceedings of PAKDD 2011, which presents cutting-
edge research in data mining and knowledge discovery. We also hope all partic-
ipants took this opportunity to exchange ideas with each other and enjoyed the
modern city of Shenzhen!

May 2011 Joshua Huang


Longbing Cao
Jaideep Srivastava
Organization

Organizing Committee
Honorary Chair
Philip S. Yu University of Illinois at Chicago, USA

General Co-chairs
Jianping Fan Shenzhen Institutes of Advanced Technology,
CAS, China
David Cheung University of Hong Kong, China

Program Committee Co-chairs


Joshua Huang Shenzhen Institutes of Advanced Technology,
CAS, China
Longbing Cao University of Technology Sydney, Australia
Jaideep Srivastava University of Minnesota, USA

Workshop Co-chairs
James Bailey The University of Melbourne, Australia
Yun Sing Koh The University of Auckland, New Zealand

Tutorial Co-chairs
Xiong Hui Rutgers, the State University of New Jersey,
USA
Sanjay Chawla The University of Sydney, Australia

Local Arrangements Co-chairs


Shengzhong Feng Shenzhen Institutes of Advanced Technology,
CAS, China
Jun Luo Shenzhen Institutes of Advanced Technology,
CAS, China

Sponsorship Co-chairs
Yalei Bi Shenzhen Institutes of Advanced Technology,
CAS, China
Zhong Ming Shenzhen University, China

Publicity Co-chairs
Jian Yang Beijing University of Technology, China
Ye Li Shenzhen Institutes of Advanced Technology,
CAS, China
Yuming Ou University of Technology Sydney, Australia

Publication Chair
Longbing Cao University of Technology Sydney, Australia

Steering Committee
Co-chairs
Rao Kotagiri University of Melbourne, Australia
Graham Williams Australian National University, Australia

Life Members
David Cheung University of Hong Kong, China
Masaru Kitsuregawa Tokyo University, Japan
Rao Kotagiri University of Melbourne, Australia
Hiroshi Motoda AFOSR/AOARD and Osaka University, Japan
Graham Williams (Treasurer) Australian National University, Australia
Ning Zhong Maebashi Institute of Technology, Japan

Members
Ming-Syan Chen National Taiwan University, Taiwan, ROC
Tu Bao Ho Japan Advanced Institute of Science and
Technology, Japan
Ee-Peng Lim Singapore Management University, Singapore
Huan Liu Arizona State University, USA
Jaideep Srivastava University of Minnesota, USA
Takashi Washio Institute of Scientific and Industrial Research,
Osaka University, Japan
Thanaruk Theeramunkong Thammasat University, Thailand
Kyu-Young Whang Korea Advanced Institute of Science and
Technology, Korea
Chengqi Zhang University of Technology Sydney, Australia
Zhi-Hua Zhou Nanjing University, China
Krishna Reddy IIIT, Hyderabad, India

Program Committee
Adrian Pearce The University of Melbourne, Australia
Aijun An York University, Canada
Aixin Sun Nanyang Technological University, Singapore
Akihiro Inokuchi Osaka University, Japan

Akira Shimazu Japan Advanced Institute of Science and Technology, Japan
Alfredo Cuzzocrea University of Calabria, Italy
Andrzej Skowron Warsaw University, Poland
Anirban Mondal IIIT Delhi, India
Aoying Zhou Fudan University, China
Arbee Chen National Chengchi University, Taiwan, ROC
Aristides Gionis Yahoo Research Labs, Spain
Atsuyoshi Nakamura Hokkaido University, Japan
Bart Goethals University of Antwerp, Belgium
Bernhard Pfahringer University of Waikato, New Zealand
Bo Liu University of Technology, Sydney, Australia
Bo Zhang Tsinghua University, China
Boonserm Kijsirikul Chulalongkorn University, Thailand
Bruno Cremilleux Université de Caen, France
Chandan Reddy Wayne State University, USA
Chang-Tien Lu Virginia Tech, USA
Chaveevan Pechsiri Dhurakijpundit University, Thailand
Chengqi Zhang University of Technology, Australia
Chih-Jen Lin National Taiwan University, Taiwan, ROC
Choochart Haruechaiyasak NECTEC, Thailand
Chotirat Ann Ratanamahatana Chulalongkorn University, Thailand
Chun-hung Li Hong Kong Baptist University, Hong Kong,
China
Chunsheng Yang NRC Institute for Information Technology,
Canada
Clement Yu University of Illinois at Chicago, USA
Dacheng Tao The Hong Kong Polytechnic University,
Hongkong, China
Daisuke Ikeda Kyushu University, Japan
Dan Luo University of Technology, Sydney, Australia
Daoqiang Zhang Nanjing University of Aeronautics and
Astronautics, China
Dao-Qing Dai Sun Yat-Sen University, China
David Albrecht Monash University, Australia
David Taniar Monash University, Australia
Di Wu Chinese University of Hong Kong, China
Diane Cook Washington State University, USA
Dit-Yan Yeung Hong Kong University of Science and
Technology, China
Dragan Gamberger Rudjer Boskovic Institute, Croatia
Du Zhang California State University, USA
Ee-Peng Lim Nanyang Technological University, Singapore
Eibe Frank University of Waikato, New Zealand
Evaggelia Pitoura University of Ioannina, Greece

Floriana Esposito Università di Bari, Italy


Gang Li Deakin University, Australia
George Karypis University of Minnesota, USA
Graham Williams Australian Taxation Office, Australia
Guozhu Dong Wright State University, USA
Hai Wang University of Aston, UK
Hanzi Wang University of Adelaide, Australia
Harry Zhang University of New Brunswick, Canada
Hideo Bannai Kyushu University, Japan
Hiroshi Nakagawa University of Tokyo, Japan
Hiroyuki Kawano Nanzan University, Japan
Hiroyuki Kitagawa University of Tsukuba, Japan
Hua Lu Aalborg University, Denmark
Huan Liu Arizona State University, USA
Hui Wang University of Ulster, UK
Hui Xiong Rutgers University, USA
Hui Yang San Francisco State University, USA
Huiping Cao New Mexico State University, USA
Irena Koprinska University of Sydney, Australia
Ivor Tsang Hong Kong University of Science and
Technology, China
James Kwok Hong Kong University of Science and
Technology, China
Jason Wang New Jersey Science and Technology University,
USA
Jean-Marc Petit INSA Lyon, France
Jeffrey Ullman Stanford University, USA
Jiacai Zhang Beijing Normal University, China
Jialie Shen Singapore Management University, Singapore
Jian Yin Sun Yat-Sen University, China
Jiawei Han University of Illinois at Urbana-Champaign,
USA
Jiuyong Li University of South Australia
Joao Gama University of Porto, Portugal
Jun Luo Chinese Academy of Sciences, China
Junbin Gao Charles Sturt University, Australia
Junping Zhang Fudan University, China
K. Selcuk Candan Arizona State University, USA
Kaiqi Huang Chinese Academy of Sciences, China
Kennichi Yoshida Tsukuba University, Japan
Kitsana Waiyamai Kasetsart University, Thailand
Kouzou Ohara Osaka University, Japan
Liang Wang The University of Melbourne, Australia
Ling Chen University of Technology Sydney, Australia
Lisa Hellerstein Polytechnic Institute of NYU, USA

Longbing Cao University of Technology Sydney, Australia


Manabu Okumura Tokyo Institute of Technology, Japan
Marco Maggini University of Siena, Italy
Marut Buranarach NECTEC, Thailand
Marzena Kryszkiewicz Warsaw University of Technology, Poland
Masashi Shimbo Nara Institute of Science and Technology,
Japan
Masayuki Numao Osaka University, Japan
Maurice van Keulen University of Twente, The Netherlands
Xiaofeng Meng Renmin University of China, China
Mengjie Zhang Victoria University of Wellington, New Zealand
Michael Berthold University of Konstanz, Germany
Michael Katehakis Rutgers Business School, USA
Michalis Vazirgiannis Athens University of Economics and Business,
Greece
Min Yao Zhejiang University, China
Mingchun Wang Tianjin University of Technology and
Education, China
Mingli Song Hong Kong Polytechnical University, China
Mohamed Mokbel University of Minnesota, USA
Naren Ramakrishnan Virginia Tech, USA
Ngoc Thanh Nguyen Wroclaw University of Technology, Poland
Ning Zhong Maebashi Institute of Technology, Japan
Ninghui Li Purdue University, USA
Olivier de Vel DSTO, Australia
Pabitra Mitra Indian Institute of Technology Kharagpur,
India
Panagiotis Karras University of Zurich, Switzerland
Pang-Ning Tan Michigan State University, USA
Patricia Riddle University of Auckland, New Zealand
Panagiotis Karras National University of Singapore, Singapore
Peter Christen Australian National University, Australia
Peter Triantafillou University of Patras, Greece
Philip Yu IBM T.J. Watson Research Center, USA
Philippe Lenca Telecom Bretagne, France
Pohsiang Tsai National Formosa University, Taiwan, ROC
Prasanna Desikan University of Minnesota, USA
Qingshan Liu Chinese Academy of Sciences, China
Rao Kotagiri The University of Melbourne, Australia
Richi Nayak Queensland University of Technology, Australia
Rui Camacho LIACC/FEUP University of Porto, Portugal
Ruoming Jin Kent State University, USA

S.K. Gupta Indian Institute of Technology, India


Salvatore Orlando University of Venice, Italy
Sameep Mehta IBM, India Research Labs, India
Sanjay Chawla University of Sydney, Australia
Sanjay Jain National University of Singapore, Singapore
Sanjay Ranka University of Florida, USA
San-Yih Hwang National Sun Yat-Sen University, Taiwan, ROC
Seiji Yamada National Institute of Informatics, Japan
Sheng Zhong State University of New York at Buffalo, USA
Shichao Zhang University of Technology at Sydney, Australia
Shiguang Shan Digital Media Research Center, ICT
Shoji Hirano Shimane University, Japan
Shu-Ching Chen Florida International University, USA
Shuigeng Zhou Fudan University, China
Songcan Chen Nanjing University of Aeronautics and
Astronautics, China
Srikanta Tirthapura Iowa State University, USA
Stefan Rueping Fraunhofer IAIS, Germany
Suman Nath Networked Embedded Computing Group,
Microsoft Research
Sung Ho Ha Kyungpook National University, Korea
Sungzoon Cho Seoul National University, Korea
Szymon Jaroszewicz Technical University of Szczecin, Poland
Tadashi Nomoto National Institute of Japanese Literature,
Tokyo, Japan
Taizhong Hu University of Science and Technology of China
Takashi Washio Osaka University, Japan
Takeaki Uno National Institute of Informatics (NII), Japan
Takehisa Yairi University of Tokyo, Japan
Tamir Tassa The Open University, Israel
Taneli Mielikainen Nokia Research Center, USA
Tao Chen Shenzhen Institutes of Advanced Technology,
Chinese Academy of Science, China
Tao Li Florida International University, USA
Tao Mei Microsoft Research Asia
Tao Yang Shenzhen Institutes of Advanced Technology,
Chinese Academy of Science, China
Tetsuya Yoshida Hokkaido University, Japan
Thepchai Supnithi National Electronics and Computer Technology
Center, Thailand
Thomas Seidl RWTH Aachen University, Germany
Tie-Yan Liu Microsoft Research Asia, China
Toshiro Minami Kyushu Institute of Information Sciences
(KIIS) and Kyushu University Library,
Japan

Tru Cao Ho Chi Minh City University of Technology, Vietnam
Tsuyoshi Murata Tokyo Institute of Technology, Japan
Vincent Lee Monash University, Australia
Vincent S. Tseng National Cheng Kung University, Taiwan, ROC
Vincenzo Piuri University of Milan, Italy
Wagner Meira Jr. Universidade Federal de Minas Gerais, Brazil
Wai Lam The Chinese University of Hong Kong, China
Warren Jin Australian National University, Australia
Wei Fan IBM T.J.Watson Research Center, USA
Weining Qian East China Normal University, China
Wen-Chih Peng National Chiao Tung University, Taiwan, ROC
Wenjia Wang University of East Anglia, UK
Wilfred Ng Hong Kong University of Science and
Technology, China
Wlodek Zadrozny IBM Research
Woong-Kee Loh Sungkyul University, Korea
Wynne Hsu National University of Singapore, Singapore
Xia Cui Chinese Academy of Sciences, China
Xiangjun Dong Shandong Institute of Light Industry, China
Xiaofang Zhou The University of Queensland, Australia
Xiaohua Hu Drexel University, USA
Xin Wang Calgary University, Canada
Xindong Wu University of Vermont, USA
Xingquan Zhu Florida Atlantic University, USA
Xintao Wu University of North Carolina at Charlotte, USA
Xuelong Li University of London, UK
Xuemin Lin University of New South Wales, Australia
Yan Zhou University of South Alabama, USA
Yang-Sae Moon Kangwon National University, Korea
Yao Tao The University of Auckland, New Zealand
Yasuhiko Morimoto Hiroshima University, Japan
Yi Chen Arizona State University, USA
Yi-Dong Shen Chinese Academy of Sciences, China
Yifeng Zeng Aalborg University, Denmark
Yihua Wu Google Inc.
Yi-Ping Phoebe Chen Deakin University, Australia
Yiu-ming Cheung Hong Kong Baptist University, Hong Kong,
China
Yong Guan Iowa State University, USA
Yonghong Peng University of Bradford, UK
Yu Jian Beijing Jiaotong University, China
Yuan Yuan Aston University, UK
Yun Xiong Fudan University, China
Yunming Ye Harbin Institute of Technology, China

Zheng Chen Microsoft Research Asia, China


Zhi-Hua Zhou Nanjing University, China
Zhongfei (Mark) Zhang SUNY Binghamton, USA
Zhongzhi Shi Chinese Academy of Sciences, China
Zili Zhang Deakin University, Australia

External Reviewers
Ameeta Agrawal York University, Canada
Arnaud Soulet Université Francois Rabelais Tours, France
Axel Poigne Fraunhofer IAIS, Germany
Ben Tan Fudan University, China
Bian Wei University of Technology, Sydney, Australia
Bibudh Lahiri Iowa State University
Bin Yang Aalborg University, Denmark
Bin Zhao East China Normal University, China
Bing Bai Google Inc.
Bojian Xu Iowa State University, USA
Can Wang University of Technology, Sydney, Australia
Carlos Ferreira University of Porto, Portugal
Chao Li Shenzhen Institutes of Advanced Technology,
CAS, China
Cheqing Jin East China Normal University, China
Christian Beecks RWTH Aachen University, Germany
Chun-Wei Seah Nanyang Technological University, Singapore
De-Chuan Zhan Nanjing University, China
Elnaz Delpisheh York University, Canada
Erez Shmueli The Open University, Israel
Fausto Fleites Florida International University, USA
Fei Xie University of Vermont, USA
Gaoping Zhu University of New South Wales, Australia
Gongqing Wu University of Vermont, USA
Hardy Kremer RWTH Aachen University, Germany
Hideyuki Kawashima Nanzan University, Japan
Hsin-Yu Ha Florida International University, USA
Ji Zhou Fudan University, China
Jianbo Yang Nanyang Technological University, Singapore
Jinfei Shenzhen Institutes of Advanced Technology,
CAS, China
Jinfeng Zhuang Microsoft Research Asia, China
Jinjiu Li University of Technology, Sydney
Jun Wang Southwest University, China
Jun Zhang Charles Sturt University, Australia
Ke Zhu University of New South Wales, Australia
Keli Xiao Rutgers University, USA
Ken-ichi Fukui Osaka University, Japan

Kui Yu University of Vermont, USA


Leonard K.M. Poon Shenzhen Institutes of Advanced Technology,
Chinese Academy of Science, China
Leting Wu University of North Carolina at Charlotte, USA
Liang Du Chinese Academy of Sciences, China
Lin Zhu Shanghai Jiaotong University, China
Ling Chen University of Technology Sydney, Australia
Linhao Xu Aalborg University, Denmark
Mangesh Gupte Google Inc.
Mao Qi Nanyang Technological University, Singapore
Marc Plantevit Université Lyon 1, France
Marcia Oliveira University Porto, Portugal
Ming Li Nanjing University, China
Mingkui Tan Nanyang Technological University, Singapore
Natalja Friesen Fraunhofer IAIS, Germany
Nguyen Le Minh Japan Advanced Institute of Science and
Technology, Japan
Ning Zhang Microsoft Research Asia, China
Nuno A. Fonseca LIACC/FEUP University of Porto, Portugal
Omar Odibat IIIT, Hyderabad, India
Peipei Li University of Vermont, USA
Peng Cai East China Normal University, China
Penjie Ye University of New South Wales, Australia
Peter Tischer Monash University, Australia
Petr Kosina University of Porto, Portugal
Philipp Kranen RWTH Aachen University, Germany
Qiao-Liang Xiang Nanyang Technological University, Singapore
Rajul Anand IIIT, Hyderabad, India
Roberto Legaspi Osaka University, Japan
Romain Vuillemot INSA Lyon, France
Sergej Fries RWTH Aachen University, Germany
Smriti Bhagat Google Inc.
Stephane Lallich Telecom Bretagne, France
Supaporn Spanurattana Tokyo Institute of Technology, Japan
Vitor Santos Costa LIACC/FEUP University of Porto, Portugal
Wang Xinchao Ecole Polytechnique Federale de Lausanne (EPFL),
Switzerland
Weifeng Su Shenzhen Institutes of Advanced Technology,
Chinese Academy of Science, China
Weiren Yu University of New South Wales, Australia
Wenjun Zhou Rutgers University, USA
Xiang Zhao University of New South Wales, Australia
Xiaodan Wang Fudan University, China
Xiaowei Ying University of North Carolina at Charlotte, USA
Xin Liu Tokyo Institute of Technology, Japan

Xuan Li Chinese Academy of Sciences, China


Xu-Ying Liu Nanjing University, China
Yannick Le Bras Telecom Bretagne, France
Yasayuki Okabe National Institute of Informatics, Japan
Yasufumi Takama National Institute of Informatics, Japan
Yi Guo Charles Sturt University, Australia
Yi Wang Shenzhen Institutes of Advanced Technology,
Chinese Academy of Science, China
Yi Xu SUNY Binghamton, USA
Yiling Zeng University of Technology, Sydney
Yimin Yang Florida International University, USA
Yoan Renaud INSA Lyon, France
Yong Deng Southwest University, China
Yong Ge Rutgers University, USA
Yuan Yuan Aston University, UK
Zhao Zhang East China Normal University, China
Zhenjie Zhang Aalborg University, Denmark
Zhenyu Lu University of Vermont, USA
Zhigang Zheng University of Technology, Sydney, Australia
Zhitao Shen University of New South Wales, Australia
Zhiyong Cheng Singapore Management University
Zhiyuan Chen The Open University, Israel
Zhongmou Li Rutgers University, USA
Zhongqiu Zhao University of Vermont, USA
Zhou Tianyi University of Technology, Sydney, Australia
Table of Contents – Part II

Graph Mining
Spectral Analysis of k-Balanced Signed Graphs . . . . . . . . . . . . . . . . . . . . . . 1
Leting Wu, Xiaowei Ying, Xintao Wu, Aidong Lu, and Zhi-Hua Zhou

Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation . . . . . . . . 13
U Kang, Brendan Meeder, and Christos Faloutsos

LGM: Mining Frequent Subgraphs from Linear Graphs . . . . . . . . . . . . . . . 26


Yasuo Tabei, Daisuke Okanohara, Shuichi Hirose, and Koji Tsuda

Efficient Centrality Monitoring for Time-Evolving Graphs . . . . . . . . . . . . . 38


Yasuhiro Fujiwara, Makoto Onizuka, and Masaru Kitsuregawa

Graph-Based Clustering with Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


Rajul Anand and Chandan K. Reddy

Social Network/Online Analysis


A Partial Correlation-Based Bayesian Network Structure Learning
Algorithm under SEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Jing Yang and Lian Li

Predicting Friendship Links in Social Networks Using a Topic Modeling Approach . . . . . . . . 75
Rohit Parimi and Doina Caragea

Info-Cluster Based Regional Influence Analysis in Social Networks . . . . . 87


Chao Li, Zhongying Zhao, Jun Luo, and Jianping Fan

Utilizing Past Relations and User Similarities in a Social Matching System . . . . . . . . 99
Richi Nayak

On Sampling Type Distribution from Heterogeneous Social Networks . . . 111


Jhao-Yin Li and Mi-Yen Yeh

Ant Colony Optimization with Markov Random Walk for Community Detection in Graphs . . . . . . . . 123
Di Jin, Dayou Liu, Bo Yang, Carlos Baquero, and Dongxiao He

Time Series Analysis


Faster and Parameter-Free Discord Search in Quasi-Periodic Time
Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Wei Luo and Marcus Gallagher

INSIGHT: Efficient and Effective Instance Selection for Time-Series Classification . . . . . . . . 149
Krisztian Buza, Alexandros Nanopoulos, and Lars Schmidt-Thieme

Multiple Time-Series Prediction through Multiple Time-Series Relationships Profiling and Clustered Recurring Trends . . . . . . . . 161
Harya Widiputra, Russel Pears, and Nikola Kasabov

Probabilistic Feature Extraction from Multivariate Time Series Using Spatio-Temporal Constraints . . . . . . . . 173
Michal Lewandowski, Dimitrios Makris, and Jean-Christophe Nebel

Sequence Analysis
Real-Time Change-Point Detection Using Sequentially Discounting
Normalized Maximum Likelihood Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Yasuhiro Urabe, Kenji Yamanishi, Ryota Tomioka, and Hiroki Iwai

Compression for Anti-Adversarial Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 198


Yan Zhou, Meador Inge, and Murat Kantarcioglu

Mining Sequential Patterns from Probabilistic Databases . . . . . . . . . . . . . . 210


Muhammad Muzammal and Rajeev Raman

Large Scale Real-Life Action Recognition Using Conditional Random Fields with Stochastic Training . . . . . . . . 222
Xu Sun, Hisashi Kashima, Ryota Tomioka, and Naonori Ueda

Packing Alignment: Alignment for Sequences of Various Length Events . . . . . . . . 234
Atsuyoshi Nakamura and Mineichi Kudo

Outlier Detection
Multiple Distribution Data Description Learning Algorithm for Novelty
Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Trung Le, Dat Tran, Wanli Ma, and Dharmendra Sharma

RADAR: Rare Category Detection via Computation of Boundary Degree . . . . . . . . 258
Hao Huang, Qinming He, Jiangfeng He, and Lianhang Ma

RKOF: Robust Kernel-Based Local Outlier Detection . . . . . . . . . . . . . . . . 270


Jun Gao, Weiming Hu, Zhongfei (Mark) Zhang,
Xiaoqin Zhang, and Ou Wu

Chinese Categorization and Novelty Mining . . . . . . . . . . . . . . . . . . . . . . . . . 284


Flora S. Tsai and Yi Zhang

Finding Rare Classes: Adapting Generative and Discriminative Models in Active Learning . . . . . . . . 296
Timothy M. Hospedales, Shaogang Gong, and Tao Xiang

Imbalanced Data Analysis


Margin-Based Over-Sampling Method for Learning From Imbalanced
Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
Xiannian Fan, Ke Tang, and Thomas Weise

Improving k Nearest Neighbor with Exemplar Generalization for Imbalanced Classification . . . . . . . . 321
Yuxuan Li and Xiuzhen Zhang

Sample Subset Optimization for Classifying Imbalanced Biological Data . . . . . . . . 333
Pengyi Yang, Zili Zhang, Bing B. Zhou, and Albert Y. Zomaya

Class Confidence Weighted kNN Algorithms for Imbalanced Data Sets . . . . . . . . 345
Wei Liu and Sanjay Chawla

Agent Mining
Multi-agent Based Classification Using Argumentation from
Experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
Maya Wardeh, Frans Coenen, Trevor Bench-Capon, and
Adam Wyner

Agent-Based Subspace Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370


Chao Luo, Yanchang Zhao, Dan Luo, Chengqi Zhang, and Wei Cao

Evaluation (Similarity, Ranking, Query)


Evaluating Pattern Set Mining Strategies in a Constraint Programming
Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
Tias Guns, Siegfried Nijssen, and Luc De Raedt

Asking Generalized Queries with Minimum Cost . . . . . . . . . . . . . . . . . . . . . 395


Jun Du and Charles X. Ling

Ranking Individuals and Groups by Influence Propagation . . . . . . . . . . . . 407


Pei Li, Jeffrey Xu Yu, Hongyan Liu, Jun He, and Xiaoyong Du

Dynamic Ordering-Based Search Algorithm for Markov Blanket Discovery . . . . . . . . 420
Yifeng Zeng, Xian He, Yanping Xiang, and Hua Mao

Mining Association Rules for Label Ranking . . . . . . . . . . . . . . . . . . . . . . . . . 432


Cláudio Rebelo de Sá, Carlos Soares, Alı́pio Mário Jorge,
Paulo Azevedo, and Joaquim Costa

Tracing Evolving Clusters by Subspace and Value Similarity . . . . . . . . . . . 444


Stephan Günnemann, Hardy Kremer, Charlotte Laufkötter, and
Thomas Seidl

An IFS-Based Similarity Measure to Index Electroencephalograms . . . . . 457


Ghita Berrada and Ander de Keijzer

DISC: Data-Intensive Similarity Measure for Categorical Data . . . . . . . . . 469


Aditya Desai, Himanshu Singh, and Vikram Pudi

ListOPT: Learning to Optimize for XML Ranking . . . . . . . . . . . . . . . . . . . 482


Ning Gao, Zhi-Hong Deng, Hang Yu, and Jia-Jian Jiang

Item Set Mining Based on Cover Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . 493


Marc Segond and Christian Borgelt

Applications
Learning to Advertise: How Many Ads Are Enough? . . . . . . . . . . . . . . . . . 506
Bo Wang, Zhaonan Li, Jie Tang, Kuo Zhang, Songcan Chen, and
Liyun Ru

TeamSkill: Modeling Team Chemistry in Online Multi-player Games . . . 519


Colin DeLong, Nishith Pathak, Kendrick Erickson, Eric Perrino,
Kyong Shim, and Jaideep Srivastava

Learning the Funding Momentum of Research Projects . . . . . . . . . . . . . . . 532


Dan He and D.S. Parker

Local Feature Based Tensor Kernel for Image Manifold Learning . . . . . . . 544
Yi Guo and Junbin Gao

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555


Table of Contents – Part I

Feature Extraction
An Instance Selection Algorithm Based on Reverse Nearest Neighbor . . . 1
Bi-Ru Dai and Shu-Ming Hsu

A Game Theoretic Approach for Feature Clustering and Its Application to Feature Selection . . . . . . . . 13
Dinesh Garg, Sellamanickam Sundararajan, and Shirish Shevade

Feature Selection Strategy in Text Classification . . . . . . . . . . . . . . . . . . . . . 26


Pui Cheong Gabriel Fung, Fred Morstatter, and Huan Liu

Unsupervised Feature Weighting Based on Local Feature Relatedness . . . 38


Jiali Yun, Liping Jing, Jian Yu, and Houkuan Huang

An Effective Feature Selection Method for Text Categorization . . . . . . . . 50


Xipeng Qiu, Jinlong Zhou, and Xuanjing Huang

Machine Learning
A Subpath Kernel for Rooted Unordered Trees . . . . . . . . . . . . . . . . . . . . . . 62
Daisuke Kimura, Tetsuji Kuboyama, Tetsuo Shibuya, and
Hisashi Kashima

Classification Probabilistic PCA with Application in Domain Adaptation . . . . . . . . 75
Victor Cheng and Chun-Hung Li

Probabilistic Matrix Factorization Leveraging Contexts for Unsupervised Relation Extraction . . . . . . . . 87
Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa

The Unsymmetrical-Style Co-training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100


Bin Wang, Harry Zhang, Bruce Spencer, and Yuanyuan Guo

Balance Support Vector Machines Locally Using the Structural Similarity Kernel . . . . . . . . 112
Jianxin Wu

Using Classifier-Based Nominal Imputation to Improve Machine Learning . . . . . . . . 124
Xiaoyuan Su, Russell Greiner, Taghi M. Khoshgoftaar, and
Amri Napolitano

A Bayesian Framework for Learning Shared and Individual Subspaces from Multiple Data Sources . . . . . . . . 136
Sunil Kumar Gupta, Dinh Phung, Brett Adams, and
Svetha Venkatesh

Are Tensor Decomposition Solutions Unique? On the Global Convergence HOSVD and ParaFac Algorithms . . . . . . . . 148
Dijun Luo, Chris Ding, and Heng Huang
Improved Spectral Hashing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Sanparith Marukatat and Wasin Sinthupinyo

Clustering
High-Order Co-clustering Text Data on Semantics-Based Representation
Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Liping Jing, Jiali Yun, Jian Yu, and Joshua Huang
The Role of Hubness in Clustering High-Dimensional Data . . . . . . . . . . . . 183
Nenad Tomašev, Miloš Radovanović, Dunja Mladenić, and
Mirjana Ivanović
Spatial Entropy-Based Clustering for Mining Data with Spatial
Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Baijie Wang and Xin Wang

Self-adjust Local Connectivity Analysis for Spectral Clustering . . . . . . . . 209


Hui Wu, Guangzhi Qu, and Xingquan Zhu
An Effective Density-Based Hierarchical Clustering Technique to
Identify Coherent Patterns from Gene Expression Data . . . . . . . . . . . . . . . 225
Sauravjyoti Sarmah, Rosy Das Sarmah, and
Dhruba Kumar Bhattacharyya
Nonlinear Discriminative Embedding for Clustering via Spectral
Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Yubin Zhan and Jianping Yin

An Adaptive Fuzzy k-Nearest Neighbor Method Based on Parallel Particle Swarm Optimization for Bankruptcy Prediction . . . . . . . . 249
Hui-Ling Chen, Da-You Liu, Bo Yang, Jie Liu, Gang Wang, and
Su-Jing Wang

Semi-supervised Parameter-Free Divisive Hierarchical Clustering of Categorical Data . . . . . . . . 265
Tengke Xiong, Shengrui Wang, André Mayers, and Ernest Monga

Classification
Identifying Hidden Contexts in Classification . . . . . . . . . . . . . . . . . . . . . . . . 277
Indrė Žliobaitė
Cross-Lingual Sentiment Classification via Bi-view Non-negative
Matrix Tri-Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Junfeng Pan, Gui-Rong Xue, Yong Yu, and Yang Wang
A Sequential Dynamic Multi-class Model and Recursive Filtering by
Variational Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
Xiangyun Qing and Xingyu Wang
Random Ensemble Decision Trees for Learning Concept-Drifting Data
Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
Peipei Li, Xindong Wu, Qianhui Liang, Xuegang Hu, and
Yuhong Zhang
Collaborative Data Cleaning for Sentiment Classification with Noisy
Training Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
Xiaojun Wan

Pattern Mining
Using Constraints to Generate and Explore Higher Order Discriminative
Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
Michael Steinbach, Haoyu Yu, Gang Fang, and Vipin Kumar
Mining Maximal Co-located Event Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
Jin Soung Yoo and Mark Bow
Pattern Mining for a Two-Stage Information Filtering System . . . . . . . . . 363
Xujuan Zhou, Yuefeng Li, Peter Bruza, Yue Xu, and
Raymond Y.K. Lau
Efficiently Retrieving Longest Common Route Patterns of Moving
Objects By Summarizing Turning Regions . . . . . . . . . . . . . . . . . . . . . . . . . . 375
Guangyan Huang, Yanchun Zhang, Jing He, and Zhiming Ding
Automatic Assignment of Item Weights for Pattern Mining on Data
Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
Yun Sing Koh, Russel Pears, and Gillian Dobbie

Prediction
Predicting Private Company Exits Using Qualitative Data . . . . . . . . . . . . 399
Harish S. Bhat and Daniel Zaelit
A Rule-Based Method for Customer Churn Prediction in
Telecommunication Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
Ying Huang, Bingquan Huang, and M.-T. Kechadi

Text Mining
Adaptive and Effective Keyword Search for XML . . . . . . . . . . . . . . . . . . . . 423
Weidong Yang, Hao Zhu, Nan Li, and Guansheng Zhu

Steering Time-Dependent Estimation of Posteriors with Hyperparameter Indexing in Bayesian Topic Models . . . . . . . . 435
Tomonari Masada, Atsuhiro Takasu, Yuichiro Shibata, and
Kiyoshi Oguri

Constrained LDA for Grouping Product Features in Opinion Mining . . . 448


Zhongwu Zhai, Bing Liu, Hua Xu, and Peifa Jia

Semantic Dependent Word Pairs Generative Model for Fine-Grained Product Feature Mining . . . . . . . . 460
Tian-Jie Zhan and Chun-Hung Li

Grammatical Dependency-Based Relations for Term Weighting in Text Classification . . . . . . . . 476
Dat Huynh, Dat Tran, Wanli Ma, and Dharmendra Sharma

XML Documents Clustering Using a Tensor Space Model . . . . . . . . . . . . . 488


Sangeetha Kutty, Richi Nayak, and Yuefeng Li

An Efficient Pre-processing Method to Identify Logical Components from PDF Documents . . . . . . . . 500
Ying Liu, Kun Bai, and Liangcai Gao

Combining Proper Name-Coreference with Conditional Random Fields for Semi-supervised Named Entity Recognition in Vietnamese Text . . . . . . . . 512
Rathany Chan Sam, Huong Thanh Le, Thuy Thanh Nguyen, and
Thien Huu Nguyen

Topic Analysis of Web User Behavior Using LDA Model on Proxy Logs . . . . . . . . 525
Hiroshi Fujimoto, Minoru Etoh, Akira Kinno, and
Yoshikazu Akinaga

SizeSpotSigs: An Effective Deduplicate Algorithm Considering the Size of Page Content . . . . . . . . 537
Xianling Mao, Xiaobing Liu, Nan Di, Xiaoming Li, and Hongfei Yan

Knowledge Transfer across Multilingual Corpora via Latent Topics . . . . . 549


Wim De Smet, Jie Tang, and Marie-Francine Moens

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561


Spectral Analysis of k-Balanced Signed Graphs

Leting Wu¹, Xiaowei Ying¹, Xintao Wu¹, Aidong Lu¹, and Zhi-Hua Zhou²

¹ University of North Carolina at Charlotte, USA
  {lwu8,xying,xwu,alu1}@uncc.edu
² National Key Lab for Novel Software Technology, Nanjing University, China
  zhouzh@nju.edu.cn

Abstract. Previous studies on social networks are often focused on networks
with only positive relations between individual nodes. As a signif-
icant extension, we conduct the spectral analysis on graphs with both
positive and negative edges. Specifically, we investigate the impacts of
introducing negative edges and examine patterns in the spectral space
of the graph’s adjacency matrix. Our theoretical results show that com-
munities in a k-balanced signed graph are distinguishable in the spectral
space of its signed adjacency matrix even if connections between commu-
nities are dense. This is quite different from recent findings on unsigned
graphs, where communities tend to mix together in the spectral space
when connections between communities increase. We further conduct
theoretical studies based on graph perturbation to examine spectral pat-
terns of general unbalanced signed graphs. We illustrate our theoretical
findings with various empirical evaluations.

1 Introduction

Signed networks were originally used in anthropology and sociology to model
friendship and enmity [2, 4]. The motivation for signed networks arose from the
fact that psychologists use -1, 0, and 1 to represent disliking, indifference, and
liking, respectively. Graph topology of signed networks can then be expressed as
an adjacency matrix where the entry is 1 (or −1) if the relationship is positive
(or negative) and 0 if the relationship is absent.
Spectral analysis that considers 0-1 matrices associated with a given network
has been well developed. As a significant extension, in this paper we investigate
the impacts of introducing negative edges in the graph topology and examine
community patterns in the spectral space of its signed adjacency matrix. We
start from k-balanced signed graphs which have been extensively examined in
social psychology, especially from the stability of sentiments perspective [5]. Our
theoretical results show that communities in a k-balanced signed graph are dis-
tinguishable in the spectral space of its signed adjacency matrix even if connec-
tions between communities are dense. This is very different from recent findings
on unsigned graphs [12, 9], where communities tend to mix together when con-
nections between communities increase. We give a theoretical explanation by
treating the k-balanced signed graph as a perturbed one from a disconnected
k-block network. We further conduct theoretical studies based on graph pertur-
bation to examine spectral patterns of general unbalanced signed graphs. We
illustrate our theoretical findings with various empirical evaluations.

2 Notation
A signed graph G can be represented as the symmetric adjacency matrix An×n
with aij = 1 if there is a positive edge between node i and j, aij = −1 if
there is a negative edge between node i and j, and aij = 0 otherwise. A has
n real eigenvalues. Let λi be the i-th largest eigenvalue of A with eigenvector
xi, λ1 ≥ λ2 ≥ · · · ≥ λn. Let xij denote the j-th entry of xi. The spectral
decomposition of A is A = Σi λi xi xi^T.
              x1          xi          xk          xn
           ⎛ x11   ···   xi1   ···   xk1   ···   xn1 ⎞
           ⎜  ..          ..          ..          .. ⎟
    αu  →  ⎜ x1u   ···   xiu   ···   xku   ···   xnu ⎟                      (1)
           ⎜  ..          ..          ..          .. ⎟
           ⎝ x1n   ···   xin   ···   xkn   ···   xnn ⎠

Formula (1) illustrates our notation. The eigenvector xi is represented as a
column vector. There usually exist k leading eigenvalues that are significantly
greater than the remaining ones for networks with k well-separated communities.
We call the row vector αu = (x1u, x2u, · · · , xku) the spectral coordinate of node
u in the k-dimensional subspace spanned by (x1, · · · , xk). This subspace re-
flects most of the topological information of the original graph. The eigenvectors
xi (i = 1, . . . , k) naturally form the canonical basis of the subspace, denoted by
ξi = (0, . . . , 0, 1, 0, . . . , 0), where the i-th entry of ξi is 1.
Let E be a symmetric perturbation matrix, and B be the adjacency matrix
after perturbation, B = A + E. Similarly, let μi be the i-th largest eigenvalue
of B with eigenvector yi, and let yij be the j-th entry of yi. The row vector
α̃u = (y1u, . . . , yku) is the spectral coordinate of node u after perturbation.
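As an editorial illustration (not part of the original paper), the following Python sketch computes spectral coordinates from a signed adjacency matrix; the toy graph and the function name are our own choices.

    import numpy as np

    def spectral_coordinates(A, k):
        # A: symmetric signed adjacency matrix; returns the k leading eigenvalues
        # (descending) and the n x k matrix whose u-th row is alpha_u = (x_1u, ..., x_ku).
        vals, vecs = np.linalg.eigh(A)        # eigh returns real eigenvalues in ascending order
        idx = np.argsort(vals)[::-1][:k]      # indices of the k largest eigenvalues
        return vals[idx], vecs[:, idx]

    # Toy signed graph: +1 positive edge, -1 negative edge, 0 no edge.
    A = np.array([[ 0,  1,  1, -1],
                  [ 1,  0,  1, -1],
                  [ 1,  1,  0, -1],
                  [-1, -1, -1,  0]], dtype=float)
    lam, X = spectral_coordinates(A, 2)
    print(lam)   # lambda_1 >= lambda_2
    print(X)     # row u is alpha_u; each eigenvector is only determined up to its sign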

3 The Spectral Property of k-Balanced Graph


The k-balanced graph is one type of signed graph that has received extensive
examination in social psychology. It was shown that the stability of sentiments
is equivalent to the graph being k-balanced (clusterable). A necessary and
sufficient condition for a signed graph to be k-balanced is that it does not
contain any cycle with exactly one negative edge [2].

Definition 1. Graph G is a k-balanced graph if the node set V can be divided
into k non-trivial disjoint subsets V1, . . . , Vk such that edges connecting any two
nodes from the same subset are all positive, and edges connecting any two nodes
from different subsets are all negative.

The k node sets V1, . . . , Vk naturally form k communities, denoted by C1, . . . , Ck
respectively. Let ni = |Vi| (Σi ni = n), and let Ai be the ni × ni adjacency matrix
of community Ci. After re-numbering the nodes properly, the adjacency matrix
B of a k-balanced graph is:

                      ⎛ A1        0  ⎞
    B = A + E,    A = ⎜     ...      ⎟ ,                                    (2)
                      ⎝ 0        Ak  ⎠

and E represents the negative edges across communities. More generally, euv =
1 (−1) if a positive (negative) edge is added between nodes u and v, and euv = 0
otherwise.

3.1 Non-negative Block-Wise Diagonal Matrix

For a graph with k disconnected communities, its adjacency matrix A is shown
in (2). Let νi be the largest eigenvalue of Ai with eigenvector zi of dimension
ni × 1. Without loss of generality, we assume ν1 > · · · > νk. Since the entries of
Ai are all non-negative, by the Perron-Frobenius theorem [10], νi is positive and
all the entries of its eigenvector zi are non-negative. When the k communities
are comparable in size, νi is the i-th largest eigenvalue of A (i.e., λi = νi),
and the eigenvectors of Ai can be naturally extended to the eigenvectors of A
as follows:

                             ⎛ z1   0    ···   0  ⎞
                             ⎜ 0    z2   ···   0  ⎟
    (x1 , x2 , · · · , xk ) =  ⎜ ..   ..    ..   ..  ⎟                        (3)
                             ⎝ 0    0    ···   zk ⎠

Now, consider a node u in community Ci. Note that all the entries in xi are non-
negative, and the spectral coordinate of node u is just the u-th row of the matrix
in (3). Then, we have

    αu = (0, · · · , 0, xiu, 0, · · · , 0),                                  (4)

where xiu > 0 is the only non-zero entry of αu. In other words, for a graph
with k disconnected comparable communities, the spectral coordinates of all
nodes lie on the k positive half-axes of ξ1, · · · , ξk, and nodes from the same
community lie on the same half-axis.
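The half-axis pattern can be checked numerically; the sketch below (ours, with an arbitrary random two-block graph as a stand-in) verifies that for a block-wise diagonal non-negative A each spectral coordinate has a single non-zero entry.

    import numpy as np

    rng = np.random.default_rng(0)
    n1, n2 = 60, 40                                        # two disconnected communities
    A1 = np.triu(rng.random((n1, n1)) < 0.3, 1).astype(float); A1 += A1.T
    A2 = np.triu(rng.random((n2, n2)) < 0.3, 1).astype(float); A2 += A2.T
    A = np.block([[A1, np.zeros((n1, n2))],
                  [np.zeros((n2, n1)), A2]])

    vals, vecs = np.linalg.eigh(A)
    X = vecs[:, np.argsort(vals)[::-1][:2]]                # two leading eigenvectors
    X *= np.sign(X.sum(axis=0))                            # Perron-Frobenius: make them non-negative

    # Nodes of C1 have a non-zero first coordinate only and nodes of C2 a non-zero
    # second coordinate only (for these sizes the larger block owns the larger eigenvalue).
    print(np.abs(X[:n1, 1]).max(), np.abs(X[n1:, 0]).max())  # both are numerically zero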

3.2 A General Perturbation Result

Let Γui (i = 1, . . . , k) be the set of nodes in Ci that are newly connected to node
u by perturbation E: Γui = {v : v ∈ Ci , euv = ±1}. In [11], we derived several
theoretical results on general graph perturbation. We include the approximation
of spectral coordinates below as a basis for our spectral analysis of signed graphs.
Please refer to [11] for proof details.

Theorem 1. Let A be a block-wise diagonal matrix as shown in (2), and E be
a symmetric perturbation matrix satisfying ‖E‖2 ≪ λk. Let βij = xi^T E xj. For
a graph with the adjacency matrix B = A + E, the spectral coordinate of an
arbitrary node u ∈ Ci can be approximated as

    α̃u ≈ xiu ri + ( (1/λ1) Σ_{v∈Γu^1} euv x1v , . . . , (1/λk) Σ_{v∈Γu^k} euv xkv )        (5)

where scalar xiu is the only non-zero entry in its original spectral coordinate
shown in (4), and ri is the i-th row of matrix R in (6):

        ⎛      1          β12/(λ2 −λ1 )   ···   β1k/(λk −λ1 ) ⎞
        ⎜ β21/(λ1 −λ2 )         1         ···   β2k/(λk −λ2 ) ⎟
    R = ⎜      ..               ..         ..         ..      ⎟ .                           (6)
        ⎝ βk1/(λ1 −λk )    βk2/(λ2 −λk )  ···          1      ⎠
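To make Theorem 1 concrete, here is a numerical sketch of our own (a random two-block A and a sparse all-negative E are assumptions, not the paper's data) that builds the approximation (5)-(6) in matrix form and compares it with the exact spectral coordinates of B = A + E.

    import numpy as np

    def leading(M, k):
        # k leading eigenvalues (descending) and eigenvectors of a symmetric matrix
        vals, vecs = np.linalg.eigh(M)
        idx = np.argsort(vals)[::-1][:k]
        return vals[idx], vecs[:, idx]

    rng = np.random.default_rng(1)
    n1, n2, k = 60, 40, 2
    A1 = np.triu(rng.random((n1, n1)) < 0.3, 1).astype(float); A1 += A1.T
    A2 = np.triu(rng.random((n2, n2)) < 0.3, 1).astype(float); A2 += A2.T
    A = np.block([[A1, np.zeros((n1, n2))], [np.zeros((n2, n1)), A2]])

    cross = (rng.random((n1, n2)) < 0.02).astype(float)    # a few inter-community pairs
    E = np.zeros((n1 + n2, n1 + n2))
    E[:n1, n1:] = -cross; E[n1:, :n1] = -cross.T           # all negative: a 2-balanced graph

    lam, X = leading(A, k)
    X *= np.sign(X.sum(axis=0))                            # non-negative leading eigenvectors of A
    beta = X.T @ E @ X                                     # beta_ij = x_i^T E x_j
    R = np.eye(k)
    for i in range(k):
        for j in range(k):
            if i != j:
                R[i, j] = beta[i, j] / (lam[j] - lam[i])   # off-diagonal entries of (6)
    approx = X @ R + (E @ X) / lam                         # row u: x_iu r_i plus the deviation term of (5)

    _, Y = leading(A + E, k)                               # exact spectral coordinates of B
    Y *= np.sign((Y * approx).sum(axis=0))                 # fix the arbitrary eigenvector signs
    print(np.abs(Y - approx).max())                        # small relative to typical entries (~ n^-1/2)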

3.3 Moderate Inter-community Edges

Proposition 1. Let B = A + E, where A has k disconnected communities,
‖E‖2 ≪ λk, and E is non-positive. We have the following properties:

1. If node u ∈ Ci is not connected to any Cj (j ≠ i), α̃u lies on the half-line
ri that starts from the origin, where ri is the i-th row of matrix R shown in
(6). The k half-lines are approximately orthogonal to each other.
2. If node u ∈ Ci is connected to node v ∈ Cj (j ≠ i), α̃u deviates from ri.
Moreover, the angle between α̃u and rj is an obtuse angle.

To illustrate Proposition 1, we now consider a 2-balanced graph. Suppose that
a graph has two communities and we add some sparse edges between the two
communities. For node u ∈ C1 and v ∈ C2, with (5), the spectral coordinates can
be approximated as

    α̃u ≈ x1u r1 + ( 0 , (1/λ2) Σ_{v∈Γu^2} euv x2v ),                        (7)
    α̃v ≈ x2v r2 + ( (1/λ1) Σ_{u∈Γv^1} euv x1u , 0 ),                        (8)

where r1 = (1, β12/(λ2 −λ1)) and r2 = (β21/(λ1 −λ2), 1).
Item 1 of Proposition 1 is apparent from (7) and (8). For those nodes
with no inter-community edges, the second parts of the right hand side (RHS)
of (7) and (8) are 0 since all euv are 0, and hence they lie on the two half-lines
r1 (nodes in C1) and r2 (nodes in C2). Note that r1 and r2 are orthogonal since
r1 r2^T = 0.
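As a tiny numeric illustration with hypothetical values (not taken from the paper), suppose λ1 = 20, λ2 = 12 and β12 = β21 = −2; then r1 and r2 take the form above and remain orthogonal, and the sign pattern already hints at the rotation discussed next.

    # Hypothetical values, for illustration only
    lam1, lam2 = 20.0, 12.0
    beta12 = beta21 = -2.0                      # negative inter-community edges give beta < 0
    r1 = (1.0, beta12 / (lam2 - lam1))          # (1.0,  0.25)
    r2 = (beta21 / (lam1 - lam2), 1.0)          # (-0.25, 1.0)
    print(r1[0] * r2[0] + r1[1] * r2[1])        # 0.0 -- r1 and r2 are orthogonal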

[Figure 1: three scatter plots of spectral coordinates (axes e1 and e2) for communities C1 and C2; panels: (a) Disconnected, (b) Add negative edges, (c) Add positive edges.]

Fig. 1. Synth-2: rotation and deviation with inter-community edges (p = 0.05)

Next, we explain Item 2 of Proposition 1. Consider the inner product

    ⟨α̃u, r2⟩ = α̃u r2^T = (1/λ2) Σ_{v∈Γu^2} euv x2v .

If node u ∈ C1 has some negative links to C2 (euv = −1), ⟨α̃u, r2⟩ is thus
negative. In other words, α̃u lies outside the two half-lines r1 and r2.
Another interesting pattern is the direction of rotation of the two half-lines.
For the 2-balanced graph, r1 and r2 rotate counter-clockwise from the axes ξ1
and ξ2. Notice that all the added edges are negative (euv = −1), and hence
β12 = β21 = x1^T E x2 = Σ_{u,v=1}^n euv x1u x2v < 0. Therefore β12/(λ2 −λ1) > 0
and β21/(λ1 −λ2) < 0, which implies that r1 and r2 have a counter-clockwise
rotation from the basis.

Comparison with adding positive edges. When the added edges are all
positive (euv = 1), we can deduce the following two properties in a similar
manner:

1. Nodes with no inter-community edges lie on the k half-lines. (When k = 2,
the two half-lines exhibit a clockwise rotation from the axes.)
2. For a node u ∈ Ci that connects to some node v ∈ Cj, α̃u and rj form an
acute angle.
Figure 1 shows the scatter plots of the spectral coordinates for a synthetic graph,
Synth-2. Synth-2 is a 2-balanced graph with 600 and 400 nodes in each commu-
nity. We generate Synth-2 and modify its inter-community edges via the same
method as Synthetic data set Synth-3 in Section 5.1. As we can see in Figure
1(a), when the two communities are disconnected, the nodes from C1 and C2 lie
on the positive parts of axes ξ1 and ξ2 respectively. We then add a small number
of edges connecting the two communities (p = 0.05). When the added edges are
all negative, as shown in Figure 1(b), the spectral coordinates of the nodes from
the two communities form two half-lines respectively. The two quasi-orthogonal
half-lines rotate counter-clockwise from the axes. Those nodes having negative
inter-community edges lie outside the two half-lines. On the contrary, if we add
positive inter-community edges, as shown in Figure 1(c), the nodes from two
communities display two half-lines with a clockwise rotation from the axes, and
nodes with inter-community edges lie between the two half-lines.

3.4 Increase the Magnitude of Inter-community Edges

Theorem 1 stands when the magnitude of the perturbation is moderate. When
dealing with a perturbation of large magnitude, we can divide the perturbation
matrix into several perturbation matrices of small magnitude and approximate
the eigenvectors step by step. More generally, the perturbed spectral coordinate
of a node u can be approximated as

    α̃u ≈ αu R + Σ_{v=1}^n euv αv Λ^{−1} ,                                   (9)

where Λ = diag(λ1 , . . . , λk ).
One property implied by (9) is that, after adding negative inter-community
edges, nodes from different communities are still separable in the spectral space.
Note that R is close to an orthogonal matrix, and hence the first part of RHS
of (9) specifies an orthogonal transformation. The second part of RHS of (9)
specifies a deviation away from the position after the transformation. Note that
when the inter-community edges are all negative (euv = −1), the deviation of
αu is just towards the negative direction of αv (each dimension being weighted
by λi^{−1}). Therefore, after perturbation, nodes u and v are further separated
from each other in the spectral space. The consequence of this repellency caused
by adding negative edges is that nodes from different communities stay away
from each other in the spectral space. As the magnitude of the noise increases,
more nodes deviate from the half-lines ri , and the line pattern eventually dis-
appears.

[Figure 2: six scatter plots of spectral coordinates (axes e1 and e2) for C1 and C2; panels: (a) Negative edges (p = 0.1), (b) Negative edges (p = 0.3), (c) Negative edges (p = 1), (d) Positive edges (p = 0.1), (e) Positive edges (p = 0.3), (f) Positive edges (p = 1).]

Fig. 2. Synth-2 with different types and magnitude of inter-community edges



Positive large perturbation. When the added edges are positive, we can sim-
ilarly conclude the opposite phenomenon: more nodes from the two communities
are “pulled” closer to each other by the positive inter-community edges and are
finally mixed together, indicating that the well separable communities merge
into one community.
Figure 2 shows the spectral coordinates of Synth-2 when we increase the mag-
nitude of inter-community edges (p = 0.1, 0.3 and 1). For the first row (Figure
2(a) to 2(c)), we add negative inter-community edges in Synth-2, and for the
second row (Figure 2(d) to 2(f)), we add positive inter-community edges. As
we add more and more inter-community edges, whether positive or negative,
more and more nodes deviate from the two half-lines, and finally the line pattern
diminishes in Figure 2(c) or 2(f). When adding positive inter-community edges,
the nodes deviate from the lines and finally mix together as shown in Fig-
ure 2(f), indicating that the two communities merge into one community. Different
from adding positive edges, which mixes the two communities in the spectral
space, adding negative inter-community edges “pushes” the two communities
away from each other. This is because nodes with negative inter-community
edges lie outside the two half-lines as shown in Figure 2(a) and 2(b). Even when
p = 1, as shown in Figure 2(c), two communities are still clearly separable in the
spectral space.

4 Unbalanced Signed Graph


Signed networks in general are unbalanced and their topologies can be consid-
ered as perturbations on balanced graphs with some negative connections within
communities and some positive connections across communities. Therefore, we
can divide an unbalanced signed graph into three parts
B = A + Ein + Eout , (10)
where A is a non-negative block-wise diagonal matrix as shown in (2), Ein rep-
resents the negative edges within communities, and Eout represents both the negative and positive inter-community edges.
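As a concrete illustration of the decomposition in (10), the short sketch below (not the authors' code) splits a signed adjacency matrix B into A, Ein, and Eout given one community label per node; the function name and interface are made up for the example.

import numpy as np

def decompose(B, labels):
    """Split a signed adjacency matrix into B = A + E_in + E_out.

    A:     positive edges inside communities (block-wise diagonal, as in (2))
    E_in:  negative edges inside communities
    E_out: all inter-community edges, positive or negative
    """
    labels = np.asarray(labels)
    same = np.equal.outer(labels, labels)      # True where both endpoints share a community
    A = np.where(same, np.maximum(B, 0.0), 0.0)
    E_in = np.where(same, np.minimum(B, 0.0), 0.0)
    E_out = np.where(~same, B, 0.0)
    return A, E_in, E_out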
Add negative inner-community edges. For the block-wise diagonal ma-
trix A, we first discuss the case where a small number of negative edges are
added within the communities. Ein is also block-wise diagonal. Hence βij = xTi Ein xj = 0 for all i ≠ j, and the matrix R caused by Ein in (6) is reduced to the identity matrix I.
Consider that we add one negative inner-community edge between nodes $u, v \in C_i$. Since $R = I$, only $\lambda_i$ and $x_i$ are involved in approximating $\tilde{\alpha}_u$ and $\tilde{\alpha}_v$:

$$\tilde{\alpha}_u = (0, \ldots, 0, y_{iu}, 0, \ldots, 0), \qquad y_{iu} \approx x_{iu} - \frac{x_{iv}}{\lambda_i},$$
$$\tilde{\alpha}_v = (0, \ldots, 0, y_{iv}, 0, \ldots, 0), \qquad y_{iv} \approx x_{iv} - \frac{x_{iu}}{\lambda_i}.$$
Without loss of generality, assume xiu > xiv , and we have the following proper-
ties when adding euv = −1:

[Figure 3: scatter plots of spectral coordinates (e1 vs. e2) for communities C1 and C2. Panels: (a) 2 disconnected communities, q = 0.1; (b) p = 0.1, q = 0.1; (c) p = 0.1, q = 0.2.]

Fig. 3. Spectral coordinates of unbalanced graphs generated from Synth-2

1. Both node u and v move towards the negative part of axis ξi after pertur-
bation: yiu < xiu and yiv < xiv .
2. Node v moves farther than u after perturbation: |yiv − xiv | > |yiu − xiu |.

The two preceding properties imply that, for those nodes close to the origin,
adding negative edges would “push” them towards the negative part of axis ξi ,
and a small number of nodes can thus lie on the negative part of axis ξi (i.e., yiu < 0).
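The two properties can be checked numerically with a small sketch (not the authors' code; the graph size and density are arbitrary): add one negative edge inside a single positive community and compare the perturbed leading eigenvector with the first-order approximation above.

import numpy as np

rng = np.random.default_rng(2)
n = 80
A = np.triu((rng.random((n, n)) < 0.2).astype(float), 1)
A = A + A.T                                # one positive community

u, v = 0, 1
A[u, v] = A[v, u] = 0.0                    # make sure (u, v) is not an existing edge

w, X = np.linalg.eigh(A)
lam, x = w[-1], X[:, -1]                   # leading eigenpair of the unperturbed community
if x.sum() < 0:
    x = -x                                 # fix the sign so entries are (mostly) positive

B = A.copy()
B[u, v] = B[v, u] = -1.0                   # add the negative inner-community edge e_uv = -1

wB, XB = np.linalg.eigh(B)
y = XB[:, -1]
if y @ x < 0:
    y = -y

print(y[u], x[u] - x[v] / lam)             # y_iu vs. x_iu - x_iv / lambda_i
print(y[v], x[v] - x[u] / lam)             # y_iv vs. x_iv - x_iu / lambda_i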
Add inter-community edges. The spectral perturbation caused by adding
Eout on to matrix A + Ein can be complicated. Notice that (A + Ein ) is still a
block-wise matrix, and we can still apply Theorem 1 and conclude that, when
Eout is moderate, the major nodes from k communities form k lines in the
spectral space and nodes with inter-community edges deviate from the lines.
It is difficult to give the explicit form of the lines and the deviations, because
xiu and the inter-community edges can be either positive or negative. However,
we expect that the effect of adding negative edges on positive nodes is still
dominant in determining the spectral pattern, because most nodes lie along the
positive part of the axes and the majority of inter-community edges are negative.
Communities are still distinguishable in the spectral space. The majority of nodes
in one community lie on the positive part of the line, while a small number
of nodes may lie on the negative part due to negative connections within the
community.
We make graph Synth-2 unbalanced by flipping the signs of a small proportion q of the edges. When the two communities are disconnected, as shown in Figure
3(a), after flipping q = 0.1 inner-community edges, a small number of nodes lie
on the negative parts of the two axes. Figure 3(b) shows the spectral coordinates
of the unbalanced graph generated from balanced graph Synth-2 (p = 0.1, q =
0.1). Since the magnitude of the inter-community edges is small, we can still
observe two orthogonal lines in the scatter plots. The negative edges within the communities cause a small number of nodes to lie on the negative parts of the two
lines. Nodes with inter-community edges deviate from the two lines. For Figure
3(c), we flip more edge signs (p = 0.1, q = 0.2). We can observe that more nodes
lie on the negative parts of the lines, since more inner-community edges are
changed to negative. The rotation angles of the two lines are smaller than that

in Figure 3(b). This is because the positive inter-community edges “pull” the
rotation clockwise a little, and the rotation we observe depends on the effects
from both positive and negative inter-community edges.

5 Evaluation
5.1 Synthetic Balanced Graph
Data set Synth-3 is a synthetic 3-balanced graph generated from the power law
degree distribution with parameter 2.5. The 3 communities of Synth-3 contain
600, 500, 400 nodes, and 4131, 3179, 2037 edges respectively. All the 13027 inter-
community edges are set to be negative. We delete the inter-community edges
randomly until a proportion p of them remain in the graph. The parameter p
is the ratio of the magnitude of inter-community edges to that of the inner-
community edges. If p = 0 there are no inter-community edges. If p = 1, inner-
and inter-community edges have the same magnitude.
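The paper does not give the exact generator for Synth-3, so the following sketch only mimics the setup under stated assumptions: a Chung-Lu-style block per community with power-law expected degrees (exponent 2.5), all inter-community edges negative, and the inter-community edges subsampled so that roughly a proportion p of a candidate set remains. Sizes, the base density, and the construction itself are illustrative, not the authors' code.

import numpy as np

rng = np.random.default_rng(1)

def powerlaw_block(n, gamma=2.5, avg_deg=8.0):
    """Chung-Lu style random graph whose expected degrees follow a power law."""
    w = np.arange(1, n + 1, dtype=float) ** (-1.0 / (gamma - 1.0))
    w *= n * avg_deg / w.sum()
    P = np.minimum(np.outer(w, w) / (n * avg_deg), 1.0)
    A = np.triu((rng.random((n, n)) < P).astype(float), 1)
    return A + A.T

sizes = [600, 500, 400]
offs = np.concatenate(([0], np.cumsum(sizes)))
B = np.zeros((offs[-1], offs[-1]))
for s, o in zip(sizes, offs):
    B[o:o + s, o:o + s] = powerlaw_block(s)

p, base = 0.1, 0.01            # keep a fraction p of the candidate negative inter-community edges
for a in range(3):
    for b in range(a + 1, 3):
        blk = -(rng.random((sizes[a], sizes[b])) < base * p).astype(float)
        B[offs[a]:offs[a + 1], offs[b]:offs[b + 1]] = blk
        B[offs[b]:offs[b + 1], offs[a]:offs[a + 1]] = blk.T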
Figure 4 shows the change of spectral coordinates of Synth-3, as we increase
the magnitude of inter-community edges. When there are no negative links (p = 0), the scatter plot of the spectral coordinates is shown in Figure 4(a). The
disconnected communities display 3 orthogonal half-lines. Figure 4(b) shows the
spectral coordinates when the magnitude of inter-community edges is moderate
(p = 0.1). We can see that the nodes form three half-lines rotated by a certain angle,
and some of the nodes deviate from the lines. Figures 4(c) and 4(d) show the
spectral coordinates when we increase the magnitude of inter-community edges
(p = 0.3, 1). We can observe that, as more inter-community edges are added,
more and more nodes deviate from the lines. However, nodes from different
communities are still separable from each other in the spectral space.
We also add positive inter-community edges on Synth-3 for comparison, and
the spectral coordinates are shown in Figures 4(e) and 4(f). We can observe
that, different from adding negative edges, as the magnitude of inter-community
edges (p) increases, nodes from the three communities get closer to each other,
and completely mix into one community in Figure 4(f).

5.2 Synthetic Unbalanced Graph


To generate an unbalanced graph, we randomly flip the signs of a small proportion
q of the inner- and inter-community edges of a balanced graph, i.e., the parameter
q is the proportion of unbalanced edges given the partition. We first flip edge signs
on the graph with small magnitude inter-community edges. Figure 5(a) and 5(b)
show the spectral coordinates after we flip q = 10% and q = 20% edge signs on
Synth-3 with p = 0.1. We can observe that, even though the graph is unbalanced, nodes
from the three communities exhibit three lines starting from the origin, and some
nodes deviate from the lines due to the inter-community edges.
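The sign-flipping procedure used throughout this section can be sketched in a few lines (illustrative; not the authors' code): pick a random fraction q of the existing edges and negate their signs symmetrically.

import numpy as np

def flip_signs(B, q, seed=3):
    """Return a copy of the signed adjacency matrix B with the signs of a
    random proportion q of its edges flipped (symmetrically)."""
    rng = np.random.default_rng(seed)
    B = B.copy()
    iu, ju = np.triu_indices_from(B, k=1)
    nz = np.flatnonzero(B[iu, ju])                       # existing edges, upper triangle
    pick = rng.choice(nz, size=int(q * nz.size), replace=False)
    B[iu[pick], ju[pick]] *= -1
    B[ju[pick], iu[pick]] *= -1
    return B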
We then flip edge signs on the graph with large magnitude inter-community
edges. Figure 5(c) shows the spectral coordinates after we flip q = 20% edge
signs on Synth-3 with p = 1. We can observe that the line pattern diminishes
because of the large number of inter-community edges. However, the nodes from

[Figure 4: 3-D scatter plots of spectral coordinates (e1, e2, e3) of Synth-3 for communities C1, C2, C3. Panels: (a) 3 disconnected communities; (b) negative p = 0.1; (c) negative p = 0.3; (d) negative p = 1; (e) positive p = 0.1; (f) positive p = 1.]

Fig. 4. The spectral coordinates of the 3-balanced graph Synth-3. (b)-(d): add negative inter-community edges; (e)-(f): add positive inter-community edges.

[Figure 5: 3-D scatter plots of spectral coordinates (e1, e2, e3) for communities C1, C2, C3. Panels: (a) p = 0.1, q = 0.1; (b) p = 0.1, q = 0.2; (c) p = 1, q = 0.2.]

Fig. 5. The spectral coordinates of an unbalanced synthetic graph generated via flipping signs of inner- and inter-community edges of Synth-3 with p = 0.1 or 1

3 communities are separable in the spectral space, indicating that the unbalanced
edges do not greatly change the patterns in the spectral space.

5.3 Comparison with Laplacian Spectrum


The signed Laplacian matrix is defined as $L = \bar{D} - A$, where $\bar{D}_{n \times n}$ is a diagonal degree matrix with $\bar{D}_{ii} = \sum_{j=1}^{n} |A_{ij}|$ [7]. Note that the unsigned Laplacian matrix is defined as $L = D - A$, where $D_{n \times n}$ is a diagonal degree matrix with $D_{ii} = \sum_{j=1}^{n} A_{ij}$. The eigenvectors corresponding to the k smallest eigenvalues
of Laplacian matrix also reflect the community structure of a signed graph:
the k communities form k clusters in the Laplacian spectral space. However,
eigenvectors associated with the smallest eigenvalues are generally unstable to noise according to matrix perturbation theory [10]. Hence, when it comes

[Figure 6: 3-D scatter plots of the Laplacian spectral coordinates (e1, e2, e3) for communities C1, C2, C3. Panels: (a) p = 0.1, q = 0 (balanced); (b) p = 0.1, q = 0.2; (c) p = 1, q = 0.2.]

Fig. 6. The Laplacian spectral space of signed graphs

to real-world networks, the communities may no longer form distinguishable clusters in the Laplacian spectral space.
Figure 6(a) shows the Laplacian spectrum of a balanced graph, Synth-3 with
p = 0.1. We can see that the nodes from the three communities form 3 clusters in
the spectral space. However, the Laplacian spectrum is less stable to the noise. Fig-
ure 6(b) and 6(c) plot the Laplacian spectra of the unbalanced graphs generated
from Synth-3. We can observe that C1 and C2 are mixed together in Figure 6(b)
and all the three communities are not separable from each other in Figure 6(c).
For comparison, the adjacency spectra of the corresponding graphs are shown in Figure 5(b) and Figure 5(c), respectively, where we can observe that the three
communities are well separable in the adjacency spectral space.
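For reference, the two embeddings being compared can be written down directly (a minimal sketch, not the authors' code): the adjacency embedding keeps the eigenvectors of the k largest eigenvalues of the signed adjacency matrix, while the signed-Laplacian embedding keeps the eigenvectors of the k smallest eigenvalues of L = D̄ − B with D̄ii = Σj |Bij|.

import numpy as np

def adjacency_embedding(B, k):
    """Top-k eigenvectors of the signed adjacency matrix (largest eigenvalues)."""
    w, X = np.linalg.eigh(B)
    return X[:, np.argsort(w)[::-1][:k]]

def signed_laplacian_embedding(B, k):
    """Eigenvectors of the k smallest eigenvalues of the signed Laplacian."""
    D_bar = np.diag(np.abs(B).sum(axis=1))
    L = D_bar - B
    w, X = np.linalg.eigh(L)
    return X[:, np.argsort(w)[:k]]         # k smallest eigenvalues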

6 Related Work

There are several studies on community partition in social networks with neg-
ative (or negatively weighted) edges [1, 3]. In [1], Bansal et al. introduced correlation clustering and showed that optimally partitioning a complete signed graph is NP-hard. In [3], Demaine and Immorlica gave an approximation algorithm and showed that the problem is APX-hard. Kunegis et al. in [6]
presented a case study on the signed Slashdot Zoo corpus and analyzed various
measures (including signed clustering coefficient and signed centrality measures).
Leskovec et al. in [8] studied several signed online social networks and developed
a theory of status to explain the observed edge signs. Laplacian graph kernels
that apply to signed graphs were described in [7]. However, the authors only
focused on 2-balanced signed graphs and many results (such as signed graphs’
definiteness property) do not hold for general k-balanced graphs.

7 Conclusion

We conducted theoretical studies based on graph perturbation to examine spec-


tral patterns of signed graphs. Our results showed that communities in a k-
balanced signed graph are distinguishable in the spectral space of its signed

adjacency matrix even if connections between communities are dense. To the best of our knowledge, these are the first reported findings showing the separability of
communities in the spectral space of the signed adjacency matrix. In our future
work, we will evaluate our findings using various real signed social networks.
We will also develop community partition algorithms exploiting our theoretical
findings and compare with other clustering methods for signed networks.

Acknowledgment

This work was supported in part by U.S. National Science Foundation (CCF-
1047621, CNS-0831204) for L. Wu, X. Wu, and A. Lu and by the Jiangsu Sci-
ence Foundation (BK2008018) and the National Science Foundation of China
(61073097) for Z.-H. Zhou.

References
1. Bansal, N., Chawla, S.: Correlation clustering. Machine Learning 56, 238–247
(2002)
2. Davis, J.A.: Clustering and structural balance in graphs. Human Relations 20,
181–187 (1967)
3. Demaine, E.D., Immorlica, N.: Correlation clustering with partial information. In:
Working Notes of the 6th International Workshop on Approximation Algorithms
for Combinatorial Optimization Problems, pp. 1–13 (2003)
4. Hage, P., Harary, F.: Structural models in anthropology, pp. 56–60. Cambridge
University Press, Cambridge (1983)
5. Inohara, T.: Characterization of clusterability of signed graph in terms of new-
comb’s balance of sentiments. Applied Mathematics and Computation 133, 93–104
(2002)
6. Kunegis, J., Lommatzsch, A., Bauckhage, C.: The slashdot zoo: mining a social
network with negative edges. In: WWW 2009, pp. 741–750 (2009)
7. Kunegis, J., Schmidt, S., Lommatzsch, A., Lerner, J., Luca, E.W.D., Albayrak, S.:
Spectral analysis of signed graphs for clustering, prediction and visualization. In:
SDM, pp. 559–570 (2010)
8. Leskovec, J., Huttenlocher, D., Kleinberg, J.: Signed networks in social media. In:
CHI, pp. 1361–1370 (2010)
9. Prakash, B.A., Sridharan, A., Seshadri, M., Machiraju, S., Faloutsos, C.: Eigen-
Spokes: Surprising patterns and scalable community chipping in large graphs. In:
Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds.) PAKDD 2010. LNCS, vol. 6119,
pp. 435–448. Springer, Heidelberg (2010)
10. Stewart, G.W., Sun, J.: Matrix perturbation theory. Academic Press, London
(1990)
11. Wu, L., Ying, X., Wu, X., Zhou, Z.-H.: Line orthogonality in adjacency eigenspace
and with application to community partition. Technical Report, UNC Charlotte
(2010)
12. Ying, X., Wu, X.: On randomness measures for social networks. In: SDM, pp.
709–720 (2009)
Spectral Analysis for Billion-Scale Graphs:
Discoveries and Implementation

U Kang, Brendan Meeder, and Christos Faloutsos

Carnegie Mellon University, School of Computer Science


{ukang,bmeeder,christos}@cs.cmu.edu

Abstract. Given a graph with billions of nodes and edges, how can we find pat-
terns and anomalies? Are there nodes that participate in too many or too few
triangles? Are there close-knit near-cliques? These questions are expensive to an-
swer unless we have the first several eigenvalues and eigenvectors of the graph
adjacency matrix. However, eigensolvers suffer from subtle problems (e.g., con-
vergence) for large sparse matrices, let alone for billion-scale ones.
We address this problem with the proposed HE IGEN algorithm, which we
carefully design to be accurate, efficient, and able to run on the highly scalable
M AP R EDUCE (H ADOOP) environment. This enables HE IGEN to handle matrices
more than 1000× larger than those which can be analyzed by existing algorithms.
We implement HE IGEN and run it on the M45 cluster, one of the top 50 super-
computers in the world. We report important discoveries about near-cliques and
triangles on several real-world graphs, including a snapshot of the Twitter social
network (38Gb, 2 billion edges) and the “YahooWeb” dataset, one of the largest
publicly available graphs (120Gb, 1.4 billion nodes, 6.6 billion edges).

1 Introduction
Graphs with billions of edges, or billion-scale graphs, are becoming common; Facebook
boasts about 0.5 billion active users, who-calls-whom networks can reach similar sizes
in large countries, and web crawls can easily reach billions of nodes. Given a billion-
scale graph, how can we find near-cliques, the count of triangles, and related graph
properties? As we discuss later, triangle counting and related expensive operations can
be computed quickly, provided we have the first several eigenvalues and eigenvectors.
In general, spectral analysis is a fundamental tool not only for graph mining, but also
for other areas of data mining. Eigenvalues and eigenvectors are at the heart of numer-
ous algorithms such as triangle counting, singular value decomposition (SVD), spectral
clustering, and tensor analysis [10]. In spite of their importance, existing eigensolvers
do not scale well. As described in Section 6, the maximum order and size of input
matrices feasible for these solvers is million-scale.
In this paper, we discover patterns on near-cliques and triangles, on several real-
world graphs including a Twitter dataset (38Gb, over 2 billion edges) and the “Ya-
hooWeb” dataset, one of the largest publicly available graphs (120Gb, 1.4 billion nodes,
6.6 billion edges). To enable discoveries, we propose HE IGEN, an eigensolver for
billion-scale, sparse symmetric matrices built on the top of H ADOOP, an open-source
M AP R EDUCE framework. Our contributions are the following:


1. Effectiveness: With HE IGEN we analyze billion-scale real-world graphs and report


discoveries, including a high triangle vs. degree ratio for adult sites and web pages
that participate in billions of triangles.
2. Careful Design: We choose among several serial algorithms and selectively paral-
lelize operations for better efficiency.
3. Scalability: We use the H ADOOP platform for its excellent scalability and imple-
ment several optimizations for HE IGEN, such as cache-based multiplications and
skewness exploitation. This results in linear scalability in the number of edges, the
same accuracy as standard eigensolvers for small matrices, and more than a 76×
performance improvement over a naive implementation.
Due to our focus on scalability, HE IGEN can handle sparse symmetric matrices with
billions of nodes and edges, surpassing the capability of previous eigensolvers (e.g.
[20] [16]) by more than 1,000 ×. Note that HE IGEN is different from Google’s PageR-
ank algorithm since HE IGEN computes top k eigenvectors while PageRank computes
only the first eigenvector. Designing a top-k eigensolver is much more difficult and subtle than designing a first-eigenvector solver, as we will see in Section 4. With this powerful tool we
are able to study several billion-scale graphs, and we report fascinating patterns on the
near-cliques and triangle distributions in Section 2.
The HE IGEN algorithm (implemented in H ADOOP ) is available at
http://www.cs.cmu.edu/∼ukang/HEIGEN. The rest of the paper presents the
discoveries in real-world networks, design decisions and details of our method, experi-
mental results, and conclusions.

2 Discoveries

In this section, we show discoveries on billion-scale graphs using HE IGEN. We focus


on the structural properties of networks: spotting near-cliques and finding triangles. The
graphs we used in this section and in Section 5 are described in Table 1 (see footnote 1).

2.1 Spotting Near-Cliques

In a large, sparse network, how can we find tightly connected nodes, such as those
in near-cliques or bipartite cores? Surprisingly, eigenvectors can be used for this pur-
pose [14]. Given an adjacency matrix W and its SVD W = U ΣV T , an EE-plot is
defined to be the scatter plot of the vectors Ui and Uj for any i and j. EE-plots of some
real-world graphs contain clear separate lines (or ‘spokes’), and the nodes with the
largest values in each spoke are separated from the other nodes by forming near-cliques
or bipartite cores. Figure 1 shows several EE-plots and spyplots (i.e., adjacency matrix
of induced subgraph) of the top 100 nodes in top eigenvectors of YahooWeb graph.
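An EE-plot is cheap to produce once the top singular vectors are available. The sketch below uses SciPy's sparse SVD on a placeholder random matrix; HE IGEN of course computes the decomposition on H ADOOP, so the matrix, sizes, and variable names here are assumptions made purely to show what is plotted.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

W = sp.random(10000, 10000, density=1e-3, format="csr", random_state=0)  # placeholder graph
U, S, Vt = svds(W, k=8)                       # top-8 singular triplets
order = np.argsort(S)[::-1]
U = U[:, order]                               # U[:, 0] is U1, U[:, 1] is U2, ...

i, j = 0, 3                                   # EE-plot of U1 vs. U4, as in Figure 1(a)
# import matplotlib.pyplot as plt; plt.scatter(U[:, i], U[:, j]); plt.show()
spoke = np.argsort(np.abs(U[:, i]))[::-1][:100]   # top-100 nodes of the U1 spoke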
In Figure 1 (a) - (d), we observe clear ‘spokes,’ or outstanding nodes, in the top
eigenvectors. Moreover, the top 100 nodes with largest values in U1 , U2 , and U4 form a
1 YahooWeb, LinkedIn: released under NDA.
Twitter: http://www.twitter.com/
Kronecker: http://www.cs.cmu.edu/∼ukang/dataset
Epinions: not public data.

Table 1. Order and size of networks

Name      | Nodes         | Edges            | Description
YahooWeb  | 1,413 M       | 6,636 M          | WWW pages in 2002
Twitter   | 62.5 M        | 2,780 M          | who follows whom in 2009/11
LinkedIn  | 7.5 M         | 58 M             | person-person in 2006
Kronecker | 59 K ∼ 177 K  | 282 M ∼ 1,977 M  | synthetic graph
Epinions  | 75 K          | 508 K            | who trusts whom

(a) U4 vs. U1 (b) U3 vs. U2 (c) U7 vs. U5 (d) U8 vs. U6

(e) U1 spoke (f) U2 spoke (g) U3 spoke (h) U4 spoke (i) Structure of
bi-clique
Fig. 1. EE-plots and spyplots from YahooWeb. (a)-(d): EE-plots showing the values of nodes in
the ith eigenvector vs. in the jth eigenvector. Notice the clear ‘spokes’ in top eigenvectors signify
the existence of a strongly related group of nodes in near-cliques or bi-cliques as depicted in (i).
(e)-(h): Spyplots of the top 100 largest nodes from each eigenvector. Notice that we see a near
clique in U3 , and bi-cliques in U1 , U2 , and U4 . (i): The structure of ‘bi-clique’ in (e), (f), and (h).

‘bi-clique’, shown in (e), (f), and (h), which is defined to be the combination of a clique
and a bipartite core as depicted in Figure 1 (i). Another observation is that the top seven
nodes shown in Figure 1 (g) belong to indymedia.org which is the site with the
maximum number of triangles in Figure 2.
Observation 1 (Eigenspokes). EE-plots of YahooWeb show clear spokes. Additionally,
the extreme nodes in the spokes belong to cliques or bi-cliques.

2.2 Triangle Counting


Given a particular node in a graph, how are its neighbors connected? Do they form
stars? Cliques? The above questions about the community structure of networks can
be answered by studying triangles (three nodes connected to each other). However,
directly counting triangles in graphs with billions of nodes and edges is prohibitively
expensive [19]. Fortunately, we can approximate triangle counts with high accuracy
using HE IGEN by exploiting its connection to eigenvalues [18]. In a nutshell, the total
number of triangles in a graph is related to the sum of cubes of eigenvalues, and the
first few eigenvalues provide extremely good approximations. A slightly more elaborate
analysis approximates the number of triangles in which a node participates, using the
cubes of the first few eigenvalues and the corresponding eigenvectors.
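Concretely, the total number of triangles equals trace(A^3)/6 = (1/6) Σi λi^3, and the number of triangles node u participates in is (A^3)uu/2; both are well approximated by keeping only the top-k eigenpairs [18]. The sketch below is a single-machine illustration using SciPy rather than HE IGEN.

import numpy as np
from scipy.sparse.linalg import eigsh

def triangle_estimates(A, k=10):
    """Approximate total and per-node triangle counts from the top-k eigenpairs."""
    lam, U = eigsh(A, k=k, which="LM")        # k eigenpairs of largest magnitude
    total = (lam ** 3).sum() / 6.0
    per_node = (U ** 2) @ (lam ** 3) / 2.0    # Delta(u) ~ 0.5 * sum_i lambda_i^3 * u_{i,u}^2
    return total, per_node

# total, per_node = triangle_estimates(A)     # A: scipy.sparse symmetric 0/1 adjacency matrix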

(a) LinkedIn (58M edges) (b) Twitter (2.8B edges) (c) YahooWeb (6.6B edges)
Fig. 2. The distribution of the number of participating triangles of real graphs. In general, they
obey the “triangle power-law.” Moreover, well-known U.S. politicians participate in many trian-
gles, demonstrating that their followers are well-connected. In the YahooWeb graph, we observe
several anomalous spikes which possibly come from cliques.

Using the top k eigenvalues computed with HE IGEN, we analyze the distribution
of triangle counts of real graphs including the Linkedin, Twitter social, and YahooWeb
graphs in Figure 2. We first observe that there exist several nodes with extremely large
triangle counts. In Figure 2 (b), Barack Obama is the person with the fifth largest num-
ber of participating triangles, and has many more than other U.S. politicians. In Figure 2
(c), the web page lists.indymedia.org contains the largest number of triangles;
this page is a list of mailing lists which apparently point to each other.
We also observe regularities in triangle distributions and note that the beginning part
of the distributions follows a power-law.

Observation 2 (Triangle power law). The beginning part of the triangle count distri-
bution of real graphs follows a power-law.

In the YahooWeb graph in Figure 2 (c), we observe many spikes. One possible reason for the spikes is that they come from cliques: a k-clique generates k nodes, each participating in $\binom{k-1}{2}$ triangles.

Observation 3 (Spikes in triangle distribution). In the Web graph, there exist several
spikes which possibly come from cliques.

The rightmost spike in Figure 2 (c) contains 125 web pages that each have about 1
million triangles in their neighborhoods. They all belong to the news site ucimc.org,
and are connected to a tightly coupled group of pages.
Triangle counts exhibit even more interesting patterns when combined with the de-
gree information as shown in the degree-triangle plot of Figure 3. We see that celebrities
have high degree and mildly connected followers, while accounts for adult sites have
many fewer, but extremely well connected, followers. Degree-triangle plots can be used
to spot and eliminate harmful accounts such as those of adult advertisers and spammers.

Observation 4 (Anomalous Triangles vs. Degree Ratio). In Twitter, anomalous


accounts have very high triangles vs. degree ratio compared to other regular accounts.

All of the above observations need a fast, scalable eigensolver. This is exactly what
HE IGEN does, and we describe our proposed design next.

Fig. 3. The degree vs. participating triangles of some ‘celebrities’ (rest: omitted, for clarity) in
Twitter accounts. Also shown are accounts of adult sites which have smaller degree, but belong to
an abnormally large number of triangles (= many, well connected followers - probably, ‘robots’).

3 Background - Sequential Algorithms


In the next two sections, we describe our method of computing eigenvalues and eigen-
vectors of billion-scale graphs. We first describe sequential algorithms to find eigenval-
ues and eigenvectors of matrices. We limit our attention to symmetric matrices due to
the computational difficulties; even the best methods for non-symmetric eigenproblems require much more computation than symmetric eigensolvers. We list the alternatives for computing the eigenvalues of a symmetric matrix and the reasoning behind our choice.
– Power method: the simplest and most famous method for computing the topmost
eigenvalue. However, it cannot find the top k eigenvalues.
– Simultaneous iteration (or QR): an extension of the Power method to find top
k eigenvalues. It requires large matrix-matrix multiplications that are prohibitively
expensive for billion-scale graphs.
– Lanczos-NO(No Orthogonalization): the basic Lanczos algorithm [5] which ap-
proximates the top k eigenvalues in the subspace composed of intermediate vectors
from the Power method. The problem is that while computing the eigenvalues, they
can ‘jump’ up to larger eigenvalues, thereby outputting spurious eigenvalues.
Although none of the above algorithms is suitable for calculations on billion-scale graphs using M AP R EDUCE, we present a tractable, M AP R EDUCE-based algorithm for computing the top k eigenvectors and eigenvalues in the next section.

4 Proposed Method
In this section we describe HE IGEN, a parallel algorithm for computing the top k eigen-
values and eigenvectors of symmetric matrices in M AP R EDUCE.

4.1 Summary of the Contributions


Efficient top k eigensolvers for billion-scale graphs require careful algorithmic con-
siderations. The main challenge is to carefully design algorithms that work well on
distributed systems and exploit the inherent structure of data, including block structure
and skewness, in order to be efficient. We summarize the algorithmic contributions here
and describe each in detail in later sections.

1. Careful Algorithm Choice: We carefully choose a sequential eigensolver algo-


rithm that is efficient for M AP R EDUCE and gives accurate results.
2. Selective Parallelization: We group operations into expensive and inexpensive
ones based on input sizes. Expensive operations are done in parallel for scalability,
while inexpensive operations are performed faster on a single machine.
3. Blocking: We reduce the running time by decreasing the input data size and the
amount of network traffic among machines.
4. Exploiting Skewness: We decrease the running time by exploiting skewness of
data.
4.2 Careful Algorithm Choice
In Section 3, we considered three algorithms that are not tractable for analyzing billion-
scale graphs with M AP R EDUCE. Fortunately, there is an algorithm suitable for such a
purpose. Lanczos-SO (Selective Orthogonalization) improves on the Lanczos-NO by
selectively reorthogonalizing vectors instead of performing full reorthogonalizations.

Algorithm 1. Lanczos-SO (Selective Orthogonalization)


Input: Matrix An×n , random n-vector b, maximum number of steps m, error threshold ε
Output: Top k eigenvalues λ[1..k], eigenvectors Y n×k
1: β0 ← 0, v0 ← 0, v1 ← b/||b||;
2: for i = 1..m do
3: v ← Avi ; // Find a new basis vector
4: αi ← viT v;
5: v ← v − βi−1 vi−1 − αi vi ; // Orthogonalize against two previous basis vectors
6: βi ← ||v||;
7: Ti ← (build tri-diagonal matrix from α and β);
8: QDQT ← EIG(Ti ); // Eigen decomposition of Ti
9: for j = 1..i do

10: if βi |Q[i, j]| ≤ ε ||Ti || then
11: r ← Vi Q[:, j];
12: v ← v − (r T v)r; // Selectively orthogonalize
13: end if
14: end for
15: if (v was selectively orthogonalized) then
16: βi ← ||v||; // Recompute normalization constant βi
17: end if
18: if βi = 0 then
19: break for loop;
20: end if
21: vi+1 ← v/βi ;
22: end for
23: T ← (build tri-diagonal matrix from α and β);
24: QDQT ← EIG(T ); // Eigen decomposition of T
25: λ[1..k] ← top k diagonal elements of D; // Compute eigenvalues
26: Y ← Vm Qk ; // Compute eigenvectors. Qk is the columns of Q corresponding to λ

The main idea of Lanczos-SO is as follows: We start with a random initial basis
vector b which comprises a rank-1 subspace. For each iteration, a new basis vector

is computed by multiplying the input matrix with the previous basis vector. The new
basis vector is then orthogonalized against the last two basis vectors and is added to the
previous rank-(m − 1) subspace, forming a rank-m subspace. Let m be the number of
the current iteration, Qm be the n × m matrix whose ith column is the ith basis vector,
and A be the matrix for which we want to compute eigenvalues. We also define $T_m = Q_m^{*} A Q_m$, an m × m matrix. Then, the eigenvalues of Tm are good approximations of the eigenvalues of A. Furthermore, multiplying Qm by the eigenvectors of Tm gives
good approximations of the eigenvectors of A. We refer to [17] for further details.
If we used exact arithmetic, the newly computed basis vector would be orthogonal
to all previous basis vectors. However, rounding errors from floating-point calculations
compound and result in the loss of orthogonality. This is the cause of the spurious eigen-
values in Lanczos-NO. Orthogonality can be recovered once the new basis vector is
fully re-orthogonalized to all previous vectors. However, doing this becomes expensive
as it requires O(m2 ) re-orthogonalizations, where m is the number of iterations. A bet-
ter approach uses a quick test (line 10 of Algorithm 1) to selectively choose vectors that
need to be re-orthogonalized to the new basis [6]. This selective-reorthogonalization
idea is shown in Algorithm 1.
The Lanczos-SO has all the properties that we need: it finds the top k largest eigen-
values and eigenvectors, it produces no spurious eigenvalues, and its most expensive
operation, a matrix-vector multiplication, is tractable in M AP R EDUCE. Therefore, we
choose Lanczos-SO as our choice of the sequential algorithm for parallelization.
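For readers who want a single-machine reference point, here is a compact NumPy sketch of Lanczos with selective orthogonalization. It mirrors Algorithm 1 in spirit only: the sqrt(machine-epsilon) test and other details are standard textbook choices [6], not necessarily HE IGEN's exact settings, and none of the M AP R EDUCE machinery appears here.

import numpy as np

def lanczos_so(A, k, m=60, seed=0):
    """Top-k eigenpairs of a symmetric matrix (anything supporting A @ v)
    via Lanczos with selective orthogonalization (single-machine sketch)."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    sqrt_eps = np.sqrt(np.finfo(float).eps)
    V, alphas, betas = [], [], []
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    v_prev, beta_prev = np.zeros(n), 0.0
    for _ in range(m):
        V.append(v)
        w = A @ v                                    # new basis direction
        alpha = float(v @ w)
        w = w - beta_prev * v_prev - alpha * v       # orthogonalize vs. two previous basis vectors
        alphas.append(alpha)
        T = np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)
        theta, Q = np.linalg.eigh(T)                 # Ritz values/vectors of the small T_i
        beta = np.linalg.norm(w)
        for j in range(theta.size):                  # selective reorthogonalization test
            if beta * abs(Q[-1, j]) <= sqrt_eps * np.abs(theta).max():
                r = np.column_stack(V) @ Q[:, j]
                w = w - (r @ w) * r
        beta = np.linalg.norm(w)
        if beta == 0.0:
            break
        betas.append(beta)
        v_prev, beta_prev, v = V[-1], beta, w / beta
    nb = len(alphas) - 1
    T = np.diag(alphas) + np.diag(betas[:nb], 1) + np.diag(betas[:nb], -1)
    theta, Q = np.linalg.eigh(T)
    top = np.argsort(theta)[::-1][:k]
    return theta[top], np.column_stack(V) @ Q[:, top]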

4.3 Selective Parallelization


Among many sub-operations in Algorithm 1, which operations should we parallelize?
A naive approach is to parallelize all the operations; however, some operations run
more quickly on a single machine rather than on multiple machines in parallel. The
reason is that the overhead incurred by using M AP R EDUCE exceeds gains made by
parallelizing the task; simple tasks where the input data is very small complete faster
on a single machine. Thus, we divide the sub-operations into two groups: those to be
parallelized and those to be run in a single machine. Table 2 summarizes our choice
for each sub-operation. Note that the last two operations in the table can be done with
a single-machine standard eigensolver since the input matrices are tiny; they have m
rows and columns, where m is the number of iterations.

4.4 Blocking
Minimizing the volume of information sent between nodes is important to designing ef-
ficient distributed algorithms. In HE IGEN, we decrease the amount of network traffic by
using the block-based operations. Normally, one would put each edge ”(source, desti-
nation)” in one line; H ADOOP treats each line as a data element for its ’map()’ function.
Instead, we propose to divide the adjacency matrix into blocks (and, of course, the cor-
responding vectors also into blocks), and put the edges of each block on a single line,
and compress the source- and destination-ids. This makes the map() function a bit more
complicated to process blocks, but it saves significant transfer time of data over the
network. We use these edge-blocks and the vector-blocks for many parallel operations
in Table 2, including matrix-vector multiplication, vector update, vector dot product,

Table 2. Parallelization Choices. The last column (P) indicates whether the operation is paral-
lelized in HE IGEN. Some operations are better to be run in parallel since the input size is very
large, while others are better in a single machine since the input size is small and the overhead of
parallel execution overshadows its decreased running time.

Operation              | Description                                                 | Input | P?
y ← y + ax             | vector update                                               | Large | Yes
γ ← xT x               | vector dot product                                          | Large | Yes
y ← αy                 | vector scale                                                | Large | Yes
||y||                  | vector L2 norm                                              | Large | Yes
y ← M n×n x            | large matrix - large, dense vector multiplication           | Large | Yes
y ← Ms n×m xs          | large matrix - small vector multiplication (n ≫ m)          | Large | Yes
As ← Ms n×m Ns m×k     | large matrix - small matrix multiplication (n ≫ m > k)      | Large | Yes
||T ||                 | matrix L2 norm (the largest singular value of the matrix)   | Tiny  | No
EIG(T )                | symmetric eigendecomposition producing QDQT                 | Tiny  | No

vector scale, and vector L2 norm. Performing operations on blocks is faster than doing
so on individual elements since both the input size and the key space decrease. This
reduces the network traffic and sorting time in the M AP R EDUCE Shuffle stage. As we
will see in Section 5, the blocking decreases the running time by more than 4×.
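A toy version of the block encoding follows; the exact on-disk line format of HE IGEN is not specified here, so the text layout below (block id as key, packed relative edge offsets as value) is an assumption made for illustration.

from collections import defaultdict

def block_encode(edges, b):
    """Group (src, dst) edges into b-by-b blocks; yield one text line per block."""
    blocks = defaultdict(list)
    for src, dst in edges:
        blocks[(src // b, dst // b)].append((src % b, dst % b))
    for (bi, bj), elems in sorted(blocks.items()):
        body = " ".join(f"{i},{j}" for i, j in elems)
        yield f"{bi} {bj}\t{body}"            # block id as key, compressed local ids as value

# e.g. list(block_encode([(0, 1), (2, 3), (5, 7)], b=4))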

Algorithm 2. CBMV(Cache-Based Matrix-Vector Multiplication) for HE IGEN


Input: Matrix M = {(idsrc , (iddst , mval))}, Vector x = {(id, vval)}
Output: Result vector y
1: Stage1-Map(key k, value v, Vector x) // Multiply matrix elements and the vector x
2: idsrc ← k;
3: (iddst , mval) ← v;
4: Output(idsrc , (mval × x[iddst ])); // Multiply and output partial results
5:
6: Stage1-Reduce(key k, values V []) // Sum up partial results
7: sum ← 0;
8: for v ∈ V do
9: sum ← sum + v;
10: end for
11: Output(k, sum);

4.5 Exploit Skewness: Matrix-Vector Multiplication


HE IGEN uses an adaptive method for sub-operations based on the size of the data. In
this section, we describe how HE IGEN implements different matrix-vector multiplica-
tion algorithms by exploiting the skewness pattern of the data. There are two matrix-
vector multiplication operations in Algorithm 1: the one with a large vector (line 3) and
the other with a small vector (line 11).

The first matrix-vector operation multiplies a matrix with a large and dense vector,
and thus it requires a two-stage standard M AP R EDUCE algorithm by Kang et al. [9]. In
the first stage, matrix elements and vector elements are joined and multiplied to make
partial results which are added together to get the result vector in the second stage.
The other matrix-vector operation, however, multiplies with a small vector. HE IGEN
uses the fact that the small vector can fit in a machine’s main memory, and distributes
the small vector to all the mappers using the distributed cache functionality of H ADOOP.
The advantage of the small vector being available in mappers is that joining edge ele-
ments and vector elements can be done inside the mapper, and thus the first stage of the
standard two-stage matrix-vector multiplication can be omitted. In this one-stage algo-
rithm the mapper joins matrix elements and vector elements to make partial results, and
the reducer adds up the partial results. The pseudo code of this algorithm, which we call
CBMV(Cache-Based Matrix-Vector multiplication), is shown in Algorithm 2. We want
to emphasize that this operation cannot be performed when the vector is large, as is
the case in the first matrix-vector multiplication. The CBMV is faster than the standard
method by 57× as described in Section 5.
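A single-process simulation of CBMV (illustrative only; the real implementation runs as H ADOOP mappers and reducers with the vector shipped via the distributed cache): the mapper side multiplies each matrix entry with the cached vector entry, and the reducer side sums the partial products per row.

from collections import defaultdict

def cbmv(matrix_entries, x):
    """matrix_entries: iterable of (src, dst, value); x: dict mapping dst -> value."""
    y = defaultdict(float)
    for src, dst, val in matrix_entries:      # "map": join with the in-memory vector
        y[src] += val * x.get(dst, 0.0)       # "reduce": sum partial products per src
    return dict(y)

# y = cbmv([(0, 1, 2.0), (0, 2, -1.0), (1, 2, 3.0)], {1: 10.0, 2: 5.0})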

4.6 Exploiting Skewness: Matrix-Matrix Multiplication


Skewness can also be exploited to efficiently perform matrix-matrix multiplication (line
26 of Algorithm 1). In general, matrix-matrix multiplication is very expensive. A stan-
dard, yet naive, way of multiplying two matrices A and B in M AP R EDUCE is to mul-
tiply A[:, i] and B[i, :] for each column i of A and sum the resulting matrices. This
algorithm, which we call MM(direct Matrix-Matrix multiplication), is very inefficient
since it generates huge matrices and sums them up many times. Fortunately, when one
of the matrices is very small, we can utilize the skewness to make an efficient M AP R E -
DUCE algorithm. This is exactly the case in HE IGEN; the first matrix is very large,
and the second is very small. The main idea is to distribute the second matrix by the
distributed cache functionality in H ADOOP, and multiply each element of the first ma-
trix with the corresponding rows of the second matrix. We call the resulting algorithm
Cache-Based Matrix-Matrix multiplication, or CBMM. There are other alternatives to
matrix-matrix multiplication: one can decompose the second matrix into column vec-
tors and iteratively multiply the first matrix with each of these vectors. We call the al-
gorithms, introduced in Section 4.5, Iterative matrix-vector multiplications (IMV) and
Cache-based iterative matrix-vector multiplications (CBMV). The difference between
CBMV and IMV is that CBMV uses cache-based operations while IMV does not. As
we will see in Section 5, the best method, CBMM, is faster than naive methods by 76×.
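Under the same assumption (the small matrix fits in memory and is broadcast), CBMM can be simulated locally as below: each element M[i, j] of the large matrix contributes val · N[j, :] to row i of the product, and partial rows are summed per row key. Names and the local data structures are illustrative, not HE IGEN's implementation.

from collections import defaultdict
import numpy as np

def cbmm(M_entries, N):
    """M_entries: iterable of (i, j, value) for the big sparse matrix; N: small dense (m x k)."""
    rows = defaultdict(lambda: np.zeros(N.shape[1]))
    for i, j, val in M_entries:
        rows[i] += val * N[j]                 # partial row, summed per output row i
    return rows

# A_s = cbmm([(0, 1, 2.0), (3, 0, -1.0)], np.eye(2))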

4.7 Analysis

We analyze the time and the space complexity of HE IGEN. In the lemmas below, m is
the number of iterations, |V | and |E| are the number of nodes and edges, and M is the
number of machines.

Lemma 1 (Time Complexity). HE IGEN takes $O\!\left(m \frac{|V|+|E|}{M} \log \frac{|V|+|E|}{M}\right)$ time.

Proof. (Sketch) The running time of one iteration of HE IGEN is dominated by the matrix-large vector multiplication, whose running time is $O\!\left(m \frac{|V|+|E|}{M} \log \frac{|V|+|E|}{M}\right)$. □


Lemma 2 (Space Complexity). HE IGEN requires O(|V | + |E|) space.


Proof. (Sketch) The maximum storage is required at the intermediate output of the two-stage matrix-vector multiplication, where O(|V | + |E|) space is needed. □


5 Performance
In this section, we present experimental results to answer the following questions:
– Scalability: How well does HE IGEN scale up?
– Optimizations: Which of our proposed methods give the best performance?
We perform experiments in the Yahoo! M45 H ADOOP cluster with total 480 hosts, 1.5
petabytes of storage, and 3.5 terabytes of memory. We use H ADOOP 0.20.1. The scala-
bility experiments are performed using synthetic Kronecker graphs [12] since realistic
graphs of any size can be easily generated.

5.1 Scalability
Figure 4(a,b) shows the scalability of HE IGEN-BLOCK, an implementation of HE IGEN
that uses blocking, and HE IGEN-PLAIN, an implementation which does not. Notice
that the running time is near-linear in the number of edges and machines. We also note
that HE IGEN-BLOCK performs up to 4× faster when compared to HE IGEN-PLAIN.

5.2 Optimizations
Figure 4(c) shows the comparison of running time of the skewed matrix-matrix mul-
tiplication and the matrix-vector multiplication algorithms. We used 100 machines for
YahooWeb data. For matrix-matrix multiplications, the best method is our proposed
CBMM which is 76× faster than repeated naive matrix-vector multiplications (IMV).
The slowest algorithm, MM, did not even finish; it failed due to the excessive amount of data.
For matrix-vector multiplications, our proposed CBMV is faster than the naive method
(IMV) by 48×.

6 Related Works
The related works form two groups: eigensolvers and M AP R EDUCE/H ADOOP.
Large-scale Eigensolvers: There are many parallel eigensolvers for large matrices: the
work by Zhao et al. [21], HPEC [7], PLANO [20], ARPACK [15], SCALAPACK [4], and PLAPACK [3] are several examples. All of them are based on MPI with message
passing, which has difficulty in dealing with billion-scale graphs. The maximum order
of matrices analyzed with these tools is less than 1 million [20] [16], which is far from
web-scale data. Very recently (March 2010), the Mahout project [2] started providing SVD on

Fig. 4. (a) Running time vs. number of edges in 1 iteration of HE IGEN with 50 machines. Notice
the near-linear running time proportional to the edges size. (b) Running time vs. number of ma-
chines in 1 iteration of HE IGEN . The running time decreases as number of machines increase. (c)
Comparison of running time between different skewed matrix-matrix and matrix-vector multipli-
cations. For matrix-matrix multiplication, our proposed CBMM outperforms naive methods by at
least 76×. The slowest matrix-matrix multiplication algorithm (MM) did not even finish and the
job failed due to excessive data. For matrix-vector multiplication, our proposed CBMV is faster
than the naive method by 57×.

top of H ADOOP . Due to insufficient documentation, we were not able to find the input
format and run a head-to-head comparison. But, reading the source code, we discov-
ered that Mahout suffers from two major issues: (a) it assumes that the vector (b, with
n=O(billion) entries) fits in the memory of a single machine, and (b) it implements the
full re-orthogonalization which is inefficient.
MapReduce and Hadoop: M AP R EDUCE is a parallel programming framework for
processing web-scale data. M AP R EDUCE has two major advantages: (a) it handles data
distribution, replication, and load balancing automatically, and furthermore (b) it uses
familiar concepts from functional programming. The programmer needs to provide only
the map and the reduce functions. The general framework is as follows [11]: The map
stage processes input and outputs (key, value) pairs. The shuffling stage sorts the map
output and distributes them to reducers. Finally, the reduce stage processes the values
with the same key and outputs the final result. H ADOOP [1] is the open source imple-
mentation of M AP R EDUCE. It also provides a distributed file system (HDFS) and data
processing tools such as PIG [13] and Hive . Due to its extreme scalability and ease of
use, H ADOOP is widely used for large scale data mining [9,8] .

7 Conclusion
In this paper we discovered patterns in real-world, billion-scale graphs. This was possi-
ble by using HE IGEN, our proposed eigensolver for the spectral analysis of very large-
scale graphs. The main contributions are the following:
– Effectiveness: We analyze spectral properties of real world graphs, including Twit-
ter and one of the largest public Web graphs. We report patterns that can be used
for anomaly detection and find tightly-knit communities.
– Careful Design: We carefully design HE IGEN to selectively parallelize operations
based on how they are most effectively performed.
– Scalability: We implement and evaluate a billion-scale eigensolver. Experimenta-
tion shows that HE IGEN is accurate and scales linearly with the number of edges.

Future research directions include extending the analysis and the algorithms for multi-
dimensional matrices, or tensors [10].

Acknowledgements

This material is based upon work supported by the National Science Foundation under
Grants No. IIS-0705359, IIS0808661, IIS-0910453, and CCF-1019104, by the Defense
Threat Reduction Agency under contract No. HDTRA1-10-1-0120, and by the Army
Research Laboratory under Cooperative Agreement Number W911NF-09-2-0053. This
work is also partially supported by an IBM Faculty Award, and the Gordon and Betty
Moore Foundation, in the eScience project. The views and conclusions contained in
this document are those of the authors and should not be interpreted as representing
the official policies, either expressed or implied, of the Army Research Laboratory or
the U.S. Government or other funding parties. The U.S. Government is authorized to
reproduce and distribute reprints for Government purposes notwithstanding any copy-
right notation here on. Brendan Meeder is also supported by a NSF Graduate Research
Fellowship and funding from the Fine Foundation, Sloan Foundation, and Microsoft.

References

[1] Hadoop information, http://hadoop.apache.org/


[2] Mahout information, http://lucene.apache.org/mahout/
[3] Alpatov, P., Baker, G., Edward, C., Gunnels, J., Morrow, G., Overfelt, J., van de Gejin, R.,
Wu, Y.-J.: Plapack: Parallel linear algebra package - design overview. In: SC 1997 (1997)
[4] Blackford, L., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I.: Scalapack
users’s guide. In: SIAM (1997)
[5] Lanczos, C.: An iteration method for the solution of the eigenvalue problem of linear dif-
ferential and integral operators. J. Res. Nat. Bur. Stand. (1950)
[6] Demmel, J.W.: Applied numerical linear algebra. SIAM, Philadelphia (1997)
[7] Guarracino, M.R., Perla, F., Zanetti, P.: A parallel block lanczos algorithm and its imple-
mentation for the evaluation of some eigenvalues of large sparse symmetric matrices on
multicomputers. Int. J. Appl. Math. Comput. Sci. (2006)
[8] Kang, U, Chau, D.H., Faloutsos, C.: Mining large graphs: Algorithms, inference, and dis-
coveries. In: IEEE International Conference on Data Engineering (2011)
[9] Kang, U, Tsourakakis, C., Faloutsos, C.: Pegasus: A peta-scale graph mining system - im-
plementation and observations. In: ICDM (2009)
[10] Kolda, T.G., Sun, J.: Scalable tensor decompositions for multi-aspect data mining. In: ICDM
(2008)
[11] Lämmel, R.: Google’s mapreduce programming model – revisited. Science of Computer
Programming 70, 1–30 (2008)
[12] Leskovec, J., Chakrabarti, D., Kleinberg, J.M., Faloutsos, C.: Realistic, mathematically
tractable graph generation and evolution, using kronecker multiplication. In: Jorge, A.M.,
Torgo, L., Brazdil, P.B., Camacho, R., Gama, J. (eds.) PKDD 2005. LNCS (LNAI),
vol. 3721, pp. 133–145. Springer, Heidelberg (2005)
[13] Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign
language for data processing. In: SIGMOD 2008 (2008)

[14] Prakash, B.A., Sridharan, A., Seshadri, M., Machiraju, S., Faloutsos, C.: EigenSpokes: Sur-
prising patterns and scalable community chipping in large graphs. In: Zaki, M.J., Yu, J.X.,
Ravindran, B., Pudi, V. (eds.) PAKDD 2010. LNCS, vol. 6119, pp. 435–448. Springer, Hei-
delberg (2010)
[15] Lampe, J., Lehoucq, R.B., Sorensen, D.C., Yang, C.: Arpack user’s guide: Solution of large-
scale eigenvalue problems with implicitly restarted arnoldi methods. SIAM, Philadelphia
(1998)
[16] Song, Y., Chen, W., Bai, H., Lin, C., Chang, E.: Parallel spectral clustering. In: ECML
(2008)
[17] Trefethen, L.N., Bau III., D.: Numerical linear algebra. SIAM, Philadelphia (1997)
[18] Tsourakakis, C.: Fast counting of triangles in large real networks without counting: Algo-
rithms and laws. In: ICDM (2008)
[19] Tsourakakis, C.E., Kang, U, Miller, G.L., Faloutsos, C.: Doulion: Counting triangles in
massive graphs with a coin. In: KDD (2009)
[20] Wu, K., Simon, H.: A parallel lanczos method for symmetric generalized eigenvalue prob-
lems. Computing and Visualization in Science (1999)
[21] Zhao, Y., Chi, X., Cheng, Q.: An implementation of parallel eigenvalue computation using
dual-level hybrid parallelism. LNCS (2007)
LGM: Mining Frequent Subgraphs from Linear
Graphs

Yasuo Tabei1 , Daisuke Okanohara2, Shuichi Hirose3 , and Koji Tsuda1,3


1
ERATO Minato Project, Japan Science and Technology Agency, Sapporo, Japan
2
Preferred Infrastructure, Inc, Tokyo, Japan
3
Computational Biology Research Center, National Institute of Advanced Industrial
Science and Technology (AIST), Tokyo, Japan
yasuo.tabei@gmail.com, hillbig@preferred.jp, shuichi.hirose@nagase.co.jp,
koji.tsuda@aist.go.jp

Abstract. A linear graph is a graph whose vertices are totally ordered.


Biological and linguistic sequences with interactions among symbols are
naturally represented as linear graphs. Examples include protein con-
tact maps, RNA secondary structures and predicate-argument struc-
tures. Our algorithm, linear graph miner (LGM), leverages the vertex
order for efficient enumeration of frequent subgraphs. Based on the re-
verse search principle, the pattern space is systematically traversed with-
out expensive duplication checking. Disconnected subgraph patterns are
particularly important in linear graphs due to their sequential nature.
Unlike conventional graph mining algorithms detecting connected pat-
terns only, LGM can detect disconnected patterns as well. The utility and
efficiency of LGM are demonstrated in experiments on protein contact
maps.

1 Introduction
Frequent subgraph mining is an active research area with successful applications
in, e.g., chemoinformatics [15], software science [4], and computer vision [13].
The task is to enumerate the complete set of frequently appearing subgraphs in
a graph database. Early algorithms include AGM [8], FSG [9] and gSpan [19].
Since then, researchers paid considerable efforts to improve the efficiency, for
example, by mining closed patterns only [20], or by early pruning that sacrifices
the completeness (e.g., leap search [18]). However, graph mining algorithms are
still too slow for large graph databases (see e.g.,[17]). The scalability of graph
mining algorithms is much worse than those for more restricted classes such as
trees [1] and sequences [14]. It is due to the fact that, for trees and sequences,
it is possible to design a pattern extension rule that does not create duplicate
patterns (e.g., rightmost extension) [1]. For general graphs, there are multiple
ways to generate the same subgraph pattern, and it is necessary to detect du-
plicate patterns and prune the search tree whenever duplication is detected. In
gSpan [19], a graph pattern is represented as a DFS code, and the duplication
check is implemented via minimality checking of the code. It is a very clever


[Figure 1: a linear graph on six totally ordered vertices 1-6 with vertex labels A, B, A, B, C, A and edge labels a, b, c.]

Fig. 1. An example of linear graph

mechanism, because one does not need to track back the patterns generated so
far. Nevertheless, the complexity of duplication checking is exponential in the
pattern size [19]. It harms efficiency substantially, especially when mining large
patterns.
A linear graph is a graph whose vertices are totally ordered [3,5] (Figure 1). For
example, protein contact maps, RNA secondary structures, alternative splicing
patterns in molecular biology and predicate-argument structures [11] in natural
languages can be represented as linear graphs. Amino acid residues of a protein
have natural ordering from N- to C-terminus, and English words in a sentence
are ordered as well. Davydov and Batzoglou [3] addressed the problem of aligning
several linear graphs for RNA sequences, assessed the computational complexity,
and proposed an approximate algorithm. Fertin et al. assessed the complexity
of finding a maximum common pattern in a set of linear graphs [5]. In this pa-
per, we develop a novel algorithm, linear graph miner (LGM), for enumerating
frequently appearing subgraphs in a large number of linear graphs. The advan-
tage of employing linear graphs is that we can derive a pattern extension rule
that does not cause duplication, which makes LGM much more efficient than
conventional graph mining algorithms.
We design the extension rule based on the reverse search principle [2]. Perhaps
confusingly, ’reverse search’ does not refer to a particular search method, but a
guideline for designing enumeration algorithms. A pattern extension rule specifies
how to generate children from a parent in the search space. In reverse search, one
specifies a rule that generates a parent uniquely from a child (i.e., reduction map).
The pattern extension rule is obtained by ’reversing’ the reduction map: When gen-
erating children from a parent, all possible candidates are prepared and those map-
ping back to the parent by the reduction map are selected. An advantage of reverse
search is that, given a reduction map, the completeness of the resulting pattern ex-
tension rule can easily be proved [2]. In data mining, LCM, one of the fastest closed
itemset miner, was designed using reverse search [16]. It is applied in the design
of a dense module enumeration algorithm [6] and a geometric graph mining algo-
rithm recently [12]. In computational geometry and related fields, there are many
successful applications1 . LGM’s reduction map is very simple: remove the largest
edge in terms of edge ordering. Fortunately, it is not necessary to take the “can-
didate preparation and selection” approach in LGM. We can directly reverse the
reduction map to an explicit extension rule here.
1
See a list of applications at
http://cgm.cs.mcgill.ca/~avis/doc/rs/applications/index.html

Linear graphs can be perceived as the fusion of graphs and sequences.


Sequence mining algorithms such as Prefixspan [14] can usually detect gapped sequence patterns. In applications like motif discovery in protein contact maps [7],
it is essential to allow “gaps” in linear graph patterns. More precisely, discon-
nected graph patterns should be allowed for such applications. Since conventional
graph mining algorithms can detect only connected graph patterns, their appli-
cation to contact maps is difficult. In this paper, we aim to detect connected and
disconnected patterns with a unified framework.
In experiments, we used a protein 3D-structure dataset from molecular biol-
ogy. We compared LGM with gSpan in efficiency, and found that LGM is more
efficient than gSpan. It is surprising to us, because LGM detects a much larger
number of patterns including disconnected ones. To compare the two methods
on the same basis, we added supplementary edges to help gSpan to detect a
part of disconnected patterns. Then, the efficiency difference became even more
significant.

2 Preliminaries
Let us first define linear graphs and associated concepts.
Definition 1 (Linear graph). Denote by Σ V and Σ E the set of vertex and
edge labels, respectively. A labeled and undirected linear graph g = (V, E, LV , LE )
consists of an ordered vertex set V ⊂ N, an edge set E ⊆ V ×V , a vertex labeling
LV : V → Σ V and an edge labeling LE : E → Σ E . Let the size of the linear
graph |g| be the number of its edges. Let G denote the set of all possible linear
graphs and let θ ∈ G denote the empty graph.
The difference from ordinary graphs is that the vertices are defined as a subset
of natural numbers, introducing the total order. Notice that we do not impose
connectedness here. The order of edges is defined as follows:
Definition 2 (Total order among edges). ∀e1 = (i, j), e2 = (k, l) ∈ Eg ,
e1 <e e2 if and only if i) i < k or ii) i = k, j < l.
Namely, one first compares the indices of the left nodes. If they are identical, the
right nodes are compared. The subgraph relationship between two linear graphs
is defined as follows.
Definition 3 (Subgraph). Given two linear graphs g1 = (V1 , E1 , LV1 , LE1 ),
g2 = (V2 , E2 , LV2 , LE2 ), g1 is a subgraph of g2 , g1 ⊆ g2 , if and only if there exists
an injective mapping m : V1 → V2 such that
1. ∀i ∈ V1 : LV1 (i) = LV2 (m(i)), vertex labels are identical,
2. ∀(i, j) ∈ E1 : (m(i), m(j)) ∈ E2 , LE1 (i, j) = LE2 (m(i), m(j)), all edges of g1
exist in g2 , and
3. ∀(i, j) ∈ E1 : i < j → m(i) < m(j), the order of vertices is conserved.
The difference from the ordinary subgraph relation is that the vertex order is
conserved. Finally, frequent subgraph mining is defined as follows.

Definition 4 (Frequent linear subgraph mining). For a set of linear graphs


G = {g1 , · · · , g|G|}, gi ∈ G, a minimum support threshold σ > 0 and a maximum
pattern size s > 0, find all g ∈ G such that g is frequent enough in G, i.e.,

|{i = 1, ..., |G| : g ⊆ gi }| ≥ σ, |g| ≤ s
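Definitions 3 and 4 can be made concrete with a deliberately brute-force check (a sketch for illustration only, not how LGM counts support): enumerate the order-preserving injections of the pattern's vertices into the target's vertices and test whether every pattern edge is present. Labels are ignored for brevity, and edges are assumed to be stored as pairs (i, j) with i < j.

from itertools import combinations

def is_subgraph(V1, E1, V2, E2):
    """Order-preserving subgraph test of Definition 3 (labels omitted)."""
    V1, V2 = sorted(V1), sorted(V2)
    E2set = set(E2)
    for image in combinations(V2, len(V1)):         # order-preserving injections V1 -> V2
        m = dict(zip(V1, image))
        if all((m[i], m[j]) in E2set for (i, j) in E1):
            return True
    return False

def support(pattern, database):
    """Number of database graphs containing the pattern (Definition 4)."""
    V1, E1 = pattern
    return sum(is_subgraph(V1, E1, V2, E2) for (V2, E2) in database)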

3 Enumeration of Linear Subgraphs


Before addressing the frequent pattern mining problem, let us design an algo-
rithm for enumerating all subgraphs of a linear graph. For simplicity, we do not
consider vertex and edge labels in this section, but inclusion of the labels is
straightforward.

3.1 Reduction Map


Suppose we would like to enumerate all subgraphs in a linear graph shown in the
bottom of Figure 2, left. All linear subgraphs form a natural graph-shaped search
space, where one can traverse upwards or downwards by deleting or adding an
edge (Figure 2, left). For enumeration, however, one has to choose edges in the
search graph to form a search tree (Figure 2, right). Once a search tree is defined,
the enumeration can be done either by depth-first or breadth-first traversal. To
this aim, we specify a reduction map f : G → G which transforms a child to its
parent uniquely. The mapping is chosen such that when it is applied repeatedly,
we eventually reduce it to an element of the solution set S ⊂ G. Formally, we
write ∀x ∈ G : ∃k ≥ 0 : f^k (x) ∈ S. In our case, the reduction map is defined as
removing the “largest” edge from the child graph. The largest edge is defined via
the total order introduced in Definition 2. By evaluating the mapping repeatedly
the graph is shrunk to the empty graph. Thus, here we have S = {θ}.
By applying f (g) for all possible g ∈ G, we can induce a search tree with θ ∈ G
being the root node, as shown in Figure 2, right. A question is whether we can always
define a unique search tree for any linear graph. The reverse search theorem [2]
says that the proposition is true iff any node in the graph-shaped search space
converges to the root node (i.e., empty graph) by applying the map a finite
number of times. For our reduction map, it is true, because each possible linear
graph g ∈ G is reduced to the empty graph by successively applying f to g.
A characteristic point of reverse search is that the search tree is implicitly
defined by the reduction map. In actual traversal, the search tree is created on
demand: when a traverser is at a node with graph g and would like to move
down, a set of children nodes is generated by extending g. More precisely, one
enumerates all linear graphs by inverting the reduction mapping such that the
tree is explored from the root node towards the leaves.
The inverse mapping f⁻¹ : G → G* generates, for a given linear graph g ∈ G,
a set of extended graphs X = {g′ | f(g′) = g}.
There are three types of extension patterns according to the number of added
nodes in the reduction mapping: (A) no-node-addition, (B) one-node-addition,

Fig. 2. (Left) Graph-shaped search space. (Right) Search tree induced by the reduction map.

Fig. 3. Example of children patterns (panels A-1; B-1–B-4; C-1–C-6). There are three types of extension with respect to the number of nodes: (A) no-node-addition, (B) one-node-addition, (C) two-nodes-addition.

(C) two-nodes-addition. Let us define the largest edge of g as (i, j), i < j. Then,
the enumeration of case A is done by adding an edge which is larger than (i, j).
For case B, a node is inserted at the position after i, and this node is connected to
every other node; if the resulting new edge is smaller than (i, j), the extension is canceled.
For case C, two nodes are inserted at the positions after i; in that case, the two added
nodes must be connected by a new edge. All patterns of valid extensions are
shown in Figure 3. This example does not include node labels, but for actual
applications, node labels need to be enumerated as well.
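As an illustration, the sketch below covers only case A (no-node-addition): it adds one edge over the existing vertices that is strictly larger than the current largest edge, so that applying the reduction map to each child gives back g. Cases B and C would additionally insert one or two new vertices after i together with the new edge; they are omitted here, and labels are ignored as in this section.

```python
from itertools import combinations

def extensions_case_A(g):
    """Case A (no-node-addition) of the inverse map f^-1: add a single edge over the
    existing vertices that is larger than the current largest edge (i, j).  Assuming
    every vertex of g already touches an edge, f(child) = g for every child returned."""
    existing = {e[0] for e in g.edges}
    top = largest_edge(g)[0] if g.edges else None
    children = []
    for i, j in combinations(sorted(g.vlabel), 2):
        if (i, j) in existing:
            continue
        if top is not None and not edge_lt(top, (i, j)):
            continue                              # new edge must be larger than the old largest edge
        children.append(LinearGraph(vlabel=dict(g.vlabel),
                                    edges=g.edges + [((i, j), None)]))
    return children
```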

4 Frequent Pattern Mining


In frequent pattern mining, we employ the same search tree described above,
but the occurrences of a pattern in all linear graphs are tracked in an occurrence
list LG (g) [19], defined as follows:
LG (g) = {(i, m) : gi ∈ G, g ⊆ gi with node correspondence m}.

Algorithm 1. Linear Graph Miner (LGM)
Input:
  A set of linear graphs: G = {g1 , ..., g|G| }
  Minimum support: σ ≥ 0
  Maximum pattern size: s ≥ 0
 1: function LGM(G, σ, s)                    ▷ the main function
 2:   Mine(G, θ, σ, s)
 3: end function
 4: function Mine(G, g, σ, s)
 5:   sup ← support(LG(g))
 6:   if sup < σ then                        ▷ check support condition
 7:     return
 8:   end if
 9:   Report occurrence of subgraph g
10:   if |g| = s then                        ▷ check pattern size
11:     return
12:   end if
13:   scan G once by using LG(g), find all extensions f⁻¹(g)
14:   for g′ ∈ f⁻¹(g) do
15:     Mine(G, g′, σ, s)                    ▷ call Mine for every extended pattern g′
16:   end for
17: end function

When a pattern g is extended, its occurrence list LG(g) is updated as well.
Based on the occurrence list, the support of each pattern g, i.e., the number of
linear graphs which contain the pattern, is calculated. Whenever the support is
smaller than the threshold σ, the search tree is pruned at this node. This pruning
is possible because of the anti-monotonicity of the support, namely, the support
of a graph is never larger than that of its subgraph. Algorithm 1 describes the
recursive algorithm for frequent mining. In line 13, each pattern g is extended
to larger graphs g′ ∈ f⁻¹(g) by the inverse reduction mapping f⁻¹. The possible
extensions f⁻¹(g) for each pattern g are found using the location list LG(g).
The function Mine is recursively called for each extended pattern g′ ∈ f⁻¹(g)
in line 15. The graph pruning happens in line 7 if the support for the pattern
g is smaller than the minimum support threshold σ, or in line 11 if the pattern
size |g| is equal to the maximum pattern size s.
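For readers who prefer an executable form, here is a hedged Python skeleton mirroring the recursion of Algorithm 1; support(g), extensions(g) and report(g) are assumed helpers (the occurrence-list bookkeeping behind them is not shown), and the LinearGraph sketch from Section 2 is reused.

```python
def lgm(sigma, s, support, extensions, report):
    """Skeleton of Algorithm 1.  Assumed helpers (not shown):
    support(g)    -- number of database graphs containing pattern g, via L_G(g),
    extensions(g) -- the children f^-1(g) that actually occur in the database,
    report(g)     -- consumes a frequent pattern."""
    def mine(g):
        if support(g) < sigma:          # line 6: support-based pruning
            return
        report(g)                       # line 9: output the frequent pattern
        if len(g.edges) == s:           # line 10: maximum pattern size reached
            return
        for child in extensions(g):     # lines 13-16: recurse on every extension
            mine(child)
    mine(LinearGraph(vlabel={}, edges=[]))   # start from the empty graph theta
```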

5 Complexity Analysis
The computational time of frequent pattern mining depends on the minimum
support and maximum pattern size thresholds [19]. Also, it depends on the “den-
sity” of the database: If all graphs are almost identical (i.e., a dense database),
the mining would take a prohibitive amount of time. Conventional worst-case
analysis is therefore not well suited to mining algorithms. Instead, the delay, i.e., the interval
between two consecutive solutions, is often used to describe the complexity. General
graph mining algorithms, including gSpan, are exponential-delay algorithms,
i.e., the delay is exponential in the size of the patterns [19]. The delay of our algo-
rithm is only polynomial, because no duplication checks are necessary thanks to
the vertex order.

Theorem 1 (Polynomial delay). For N linear graphs G, a minimum support


σ > 0, and a maximum pattern size s > 0, the time between two successive calls
to Report in line 9 is bounded by a polynomial of the size of input data.
Proof. Let M := maxi |Vgi| and F := maxi |Egi|. The number of matching locations
in the linear graphs G can only decrease when g is enlarged, because only the largest
edge is added. Considering the number of variations, it is easy to see that the
location list always satisfies |LG(g)| ≤ M²N. Therefore, the mapping f⁻¹(g) can
be produced in O(M²N) time, because the procedure searches the location
list in line 13.
The time complexity between two successive calls to Report can now be
bounded by considering two cases after Report has been called once.
– Case 1. There is an extension g′ fulfilling the minimum support condition, or
the size of g′ is s. Then Report is called within O(M²N) time.
– Case 2. There is no extension g′ fulfilling the minimum support condi-
tion. Then, no recursion happens and Mine returns in O(M²N) time to its
parent node in the search tree. The maximum number of times this can hap-
pen successively is bounded by the depth of the reverse search tree, which
is bounded by O(F), because each level in the search tree adds one edge.
Therefore, in O(M²NF) time the algorithm either calls Report again or
finishes.
Thus, the total time between two successive calls to Report is bounded by
O(M²NF).

6 Experiments
We performed a motif extraction experiment from protein 3D structures.
Frequent and characteristic patterns are often called “motifs” in molecular biol-
ogy, and we adopt that terminology here. All experiments were performed on a
Linux machine with an AMD Opteron processor (2 GHz and 4GB RAM).

Fig. 4. Example of gapped linear graphs: a 1-gap linear graph (left) and a 2-gap linear graph (right). Edges corresponding to gaps are shown in bold.

Fig. 5. Execution time (sec) versus minimum support for the protein data. The line labeled gSpan+g1 is the execution time of gSpan on the 1-gap linear graph dataset. gSpan does not work on the 2-gap linear graph dataset even if the minimum support threshold is 50.

6.1 Motif Extraction from Protein 3D Structures

We adopted the dataset of Glyakina et al. [7], which consists of pairs of homol-
ogous proteins: one is derived from a thermophilic organism and the other is
from a mesophilic organism. This dataset was made for understanding struc-
tural properties of proteins which are responsible for the higher thermostability
of proteins from thermophilic organisms compared to those from mesophilic or-
ganisms. In constructing a linear graph from a 3D structure, each amino acid is
represented as a vertex. Vertex labels are chosen from {1, . . . , 6}, which repre-
sents the following six classes: aliphatic {AVLIMC}, aromatic {FWYH}, polar
{STNQ}, positive {KR}, negative {DE}, special (reflecting their special con-
formation properties) {GP} [10]. An edge is drawn between each pair of amino
acid residues whose distance is within 5 angstroms. No edge labels are assigned.
In total, 754 graphs were made. The average numbers of vertices and edges are 371
and 498, respectively, and the number of labels is 6. To detect the motifs char-
acterizing the difference between the two organisms, we take the following two-step
approach. First, we employ LGM to find frequent patterns from all proteins of
both organisms. In this setting, we did not use (C-6) patterns in Figure 3. Fi-
nally, the patterns significantly associated with the organism difference are selected
via statistical tests.
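For illustration, the following hedged sketch builds such a linear graph from a protein, reusing the LinearGraph representation assumed in Section 2; which atom's coordinates define the residue distance is not stated in the text, so the single representative coordinate per residue is an assumption.

```python
import numpy as np

# The six residue classes described above (grouping from [10]).
CLASS = {**{a: 1 for a in "AVLIMC"}, **{a: 2 for a in "FWYH"},
         **{a: 3 for a in "STNQ"},   **{a: 4 for a in "KR"},
         **{a: 5 for a in "DE"},     **{a: 6 for a in "GP"}}

def protein_to_linear_graph(sequence, coords, cutoff=5.0):
    """Sketch: one vertex per residue (sequence order gives the vertex order), and an
    unlabeled edge between residues whose representative coordinates are within
    `cutoff` angstroms.  `coords` is an (n, 3) array with one row per residue."""
    coords = np.asarray(coords, dtype=float)
    vlabel = {i + 1: CLASS[aa] for i, aa in enumerate(sequence)}
    edges = []
    for i in range(len(sequence)):
        for j in range(i + 1, len(sequence)):
            if np.linalg.norm(coords[i] - coords[j]) <= cutoff:
                edges.append(((i + 1, j + 1), None))      # no edge labels
    return LinearGraph(vlabel=vlabel, edges=edges)
```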
We assess the execution time of our algorithm in comparison with gSpan. The
linear graphs from 3D-structure proteins are not always connected graphs, and
gSpan cannot be applied to such disconnected graphs. Hence, we made two
kinds of gapped linear graphs: 1-gap linear graphs and 2-gap linear graphs. A 1-gap
linear graph is a linear graph in which contiguous vertices in the protein sequence
are connected by an edge; a 2-gap linear graph is a 1-gap linear graph in which
vertices two positions apart in the protein sequence are additionally connected by an edge (Figure 4).
We ran gSpan on two datasets: one consists of 1-gap linear graphs and the other
consists of 2-gap linear graphs. We ran LGM on the original linear graphs. We
set the maximum execution time to 12 hours for both programs. Figure 5 shows
the execution time by changing minimum support thresholds. gSpan does not

Fig. 6. Significant subgraphs detected by LGM (left: significant subgraphs in TATA-binding protein; right: significant subgraphs in the human polII promotor). The p-value calculated by Fisher’s exact test is attached to each linear graph. The node labels 1, 2, 3, 4 and 5 represent aliphatic, aromatic, polar, positive and negative residues, respectively.

Fig. 7. 3D structures of TATA-binding protein (left) and human polII promotor protein (right). The spheres represent the amino acid residues corresponding to vertices forming subgraphs in Figure 6.

work on the 2-gap linear graph dataset even if the minimum support threshold
is 50. Our algorithm is faster than gSpan on the 1-gap linear graph dataset, and
its execution time is reasonable.
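A sketch of how such gapped linear graphs can be derived from the original contact linear graph (again an assumption on the representation): gap=1 adds edges between consecutive residues, and gap=2 additionally connects residues two positions apart.

```python
def add_gap_edges(g, gap):
    """Sketch: build a `gap`-gap linear graph by adding backbone edges between
    vertices at most `gap` sequence positions apart, on top of the contact edges."""
    edges = dict(g.edges)                     # (i, j) -> label for existing contacts
    ids = sorted(g.vlabel)
    for idx, i in enumerate(ids):
        for j in ids[idx + 1: idx + 1 + gap]:
            edges.setdefault((i, j), None)    # connect residues 1..gap apart
    return LinearGraph(vlabel=dict(g.vlabel), edges=sorted(edges.items()))
```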
Then, we assess a motif extraction ability of our algorithm. To choose signif-
icant subgraphs from the enumerated subgraphs, we use Fisher’s exact test. In
this case, a significant subgraph should distinguish thermophilic proteins from
mesophilic proteins. Thus, for each frequent subgraph, we count the number of
proteins containing this subgraph among the thermophilic and mesophilic proteins,
and generate a 2×2 contingency table, which includes the number of thermophilic
organisms that contain subgraph g, nTP, the number of thermophilic organisms
that do not contain subgraph g, nFP, the number of mesophilic organisms
that do not contain subgraph g, nFN, and the number of mesophilic organisms
that contain subgraph g, nTN. The probability representing the independence
in the contingency table is calculated as follows:

Pr = C(ng, nTP) · C(ng′, nFN) / C(n, nP) = (ng! ng′! nP! nN!) / (n! nTP! nFP! nFN! nTN!),

where C(a, b) denotes the binomial coefficient “a choose b”; nP is the number of thermophilic
proteins; nN the number of mesophilic
proteins; ng the number of proteins containing the subgraph g; and ng′ the number of proteins
not containing g. The p-value of the two-sided Fisher’s exact test on a table
can be computed by the sum of all probabilities of tables that are more extreme
than this table.
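The same two-sided p-value can be obtained with SciPy's fisher_exact; the 2×2 table layout below (rows: contains g / does not contain g; columns: thermophilic / mesophilic) follows the counts defined above, and the numbers in the usage example are made up for illustration.

```python
from scipy.stats import fisher_exact

def subgraph_pvalue(n_tp, n_tn, n_fp, n_fn):
    """Two-sided Fisher's exact test on the contingency table of a subgraph g.
    Rows: contains g / does not contain g; columns: thermophilic / mesophilic."""
    table = [[n_tp, n_tn],
             [n_fp, n_fn]]
    _, p = fisher_exact(table, alternative="two-sided")
    return p

# e.g. g occurs in 40 of 50 thermophilic and 10 of 50 mesophilic proteins
print(subgraph_pvalue(n_tp=40, n_tn=10, n_fp=10, n_fn=40))
```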
We ranked the frequent subgraphs according to the p-values, and obtained
103 subgraphs whose p-values are no more than 0.001. Here, we focused on a
pair of proteins, TATA-binding protein and human polII promotor protein, where
TATA-binding protein is derived from a thermophilic organism and human polII
promotor is from a mesophilic organism. The reason we chose these two proteins
is that they include a large number of statistically significant motifs which are
mutually exclusive between two organisms. These two proteins share the same
function as DNA-binding protein, but their thermostabilities are different. Fig-
ure 6 shows the top-3 subgraphs in terms of significance. Figure 7 shows the 3D structures of the two
proteins, TATA-binding protein (left) and human polII promotor protein (right);
the amino acid residues forming the top-3 subgraphs are represented by spheres.

7 Conclusion

We proposed an efficient algorithm for mining frequent subgraphs from linear graphs.
A key point is that the vertices of a linear graph are totally ordered. We designed a
fast enumeration algorithm for linear graphs based on this property. For an ef-
ficient enumeration without duplication, we defined a search tree based on reverse
search techniques. Unlike gSpan, our algorithm enumerates frequent sub-
graphs, including disconnected ones, by traversing this search tree. Many kinds
of data that can be represented as linear graphs, such as protein 3D-structures
and alternative splicing forms, include disconnected subgraphs as important
patterns. Our algorithm runs with polynomial delay.
We performed a motif extraction experiment on a protein 3D-structure dataset
from molecular biology. In the experiment, our algorithm could extract important
subgraphs as frequent patterns. By comparing our algorithm to gSpan with
respect to execution time, we have shown that our algorithm is fast enough for
real-world datasets.
Data which can be represented as linear graphs occur in many fields, for
instance bioinformatics and natural language processing. Our mining algorithm
for linear graphs provides a new way to analyze such data.

Acknowledgements
This work is partly supported by a JSPS Research Fellowship for Young Sci-
entists, MEXT Kakenhi 21680025 and the FIRST program. We would like to
thank M. Gromiha for providing the protein 3D-structure dataset, T. Uno and
H. Kashima for fruitful discussions.

References
1. Abe, K., Kawasoe, S., Asai, T., Arimura, H., Arikawa, S.: Optimized substructure
discovery for semi-structured data. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.)
PKDD 2002. LNCS (LNAI), vol. 2431, pp. 1–14. Springer, Heidelberg (2002)
2. Avis, D., Fukuda, K.: Reverse search for enumeration. Discrete Appl. Math. 65,
21–46 (1996)
3. Davydov, E., Batzoglou, S.: A computational model for RNA multiple sequence
alignment. Theoretical Computer Science 368, 205–216 (2006)
4. Eichinger, F., Böhm, K., Huber, M.: Mining edge-weighted call graphs to localise
software bugs. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD
2008, Part I. LNCS (LNAI), vol. 5211, pp. 333–348. Springer, Heidelberg (2008)
5. Fertin, G., Hermelin, D., Rizzi, R., Vialette, S.: Common structured patterns in
linear graphs: Approximation and combinatorics. In: Ma, B., Zhang, K. (eds.) CPM
2007. LNCS, vol. 4580, pp. 241–252. Springer, Heidelberg (2007)
6. Georgii, E., Dietmann, S., Uno, T., Pagel, P., Tsuda, K.: Enumeration of condition-
dependent dense modules in protein interaction networks. Bioinformatics 25(7),
933–940 (2009)
7. Glyakina, A.V., Garbuzynskiy, S.O., Lobanov, M.Y., Galzitskaya, O.V.: Different
packing of external residues can explain differences in the thermostability of pro-
teins from thermophilic and mosophilic organisms. Bioinformatics 23, 2231–2238
(2007)
8. Inokuchi, A., Washio, T., Motoda, H.: An apriori-based algorithm for mining fre-
quent substructures from graph data. In: Zighed, D.A., Komorowski, J., Żytkow,
J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp. 13–23. Springer, Heidelberg
(2000)
9. Kuramochi, M., Karypis, G.: Frequent subgraph discovery. In: Proceedings of the
2001 IEEE International Conference on Data Mining (ICDM 2001), pp. 313–320
(2001)
10. Mirny, L.A., Shakhnovich, E.I.: Universally Conserved Positions in Protein Folds:
Reading Evolutionary Signals about Stability, Folding Kinetics and Function. Jour-
nal of Molecular Biology 291, 177–196 (1999)
11. Miyao, Y., Sætre, R., Sagae, K., Matsuzaki, T., Tsujii, J.: Task-oriented evaluation
of syntactic parsers and their representations. In: 46th Annual Meeting of the
Association for Computational Linguistics (ACL), pp. 46–54 (2008)
12. Nowozin, S., Tsuda, K.: Frequent subgraph retrieval in geometric graph databases.
In: Perner, P. (ed.) ICDM 2008. LNCS (LNAI), vol. 5077, pp. 953–958. Springer,
Heidelberg (2008)
13. Nowozin, S., Tsuda, K., Uno, T., Kudo, T., Bakir, G.: Weighted substructure
mining for image analysis. In: IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos
(2007)

14. Pei, J., Han, J., Mortazavi-asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., Hsu,
M.: Mining sequential patterns by pattern-growth: The prefixspan approach. IEEE
Transactions on Knowledge and Data Engineering 16(11), 1424–1440 (2004)
15. Saigo, H., Nowozin, S., Kadowaki, T., Taku, K., Tsuda, K.: gBoost: a mathematical
programming approach to graph classification and regression. Machine Learning 75,
69–89 (2008)
16. Uno, T., Kiyomi, M., Arimura, H.: LCM ver.3: collaboration of array, bitmap and
prefix tree for frequent itemset mining. In: Proceedings of the 1st International
Workshop on Open Source Data Mining: Frequent Pattern Mining Implementa-
tions, pp. 77–86 (2005)
17. Wale, N., Karypis, G.: Comparison of descriptor spaces for chemical compound
retrieval and classification. In: Proceedings of the 2006 IEEE International Con-
ference on Data Mining, pp. 678–689 (2006)
18. Yan, X., Cheng, H., Han, J., Yu, P.S.: Mining significant graph patterns by leap
search. In: Proceedings of the ACM SIGMOD International Conference on Man-
agement of Data, pp. 433–444 (2008)
19. Yan, X., Han, J.: gSpan: Graph-based substructure pattern mining. In: Proceedings
of the 2002 IEEE International Conference on Data Mining (ICDM 2002), pp. 721–
724 (2002)
20. Yan, X., Han, J.: CloseGraph: mining closed frequent graph patterns. In: Proceed-
ings of 2003 International Conference on Knowledge Discovery and Data Mining
(SIGKDD 2003), pp. 286–295 (2003)
Efficient Centrality Monitoring for
Time-Evolving Graphs

Yasuhiro Fujiwara¹, Makoto Onizuka¹, and Masaru Kitsuregawa²

¹ NTT Cyber Space Laboratories, Japan
{fujiwara.yasuhiro,onizuka.makoto}@lab.ntt.co.jp
² The University of Tokyo, Japan
kitsure@tkl.iis.u-tokyo.ac.jp

Abstract. The goal of this work is to identify the nodes that have the
smallest sum of distances to other nodes (the lowest closeness central-
ity nodes) in graphs that evolve over time. Previous approaches to this
problem find the lowest centrality nodes efficiently at the expense of
exactness. The main motivation of this paper is to answer, in the affir-
mative, the question, ‘Is it possible to improve the search time without
sacrificing the exactness?’. Our solution is Sniper, a fast search method
for time-evolving graphs. Sniper is based on two ideas: (1) It computes
approximate centrality by reducing the original graph size while guaran-
teeing the answer exactness, and (2) It terminates unnecessary distance
computations early when pruning unlikely nodes. The experimental re-
sults show that Sniper can find the lowest centrality nodes significantly
faster than the previous approaches while it guarantees answer exactness.

Keywords: Centrality, Graph mining, Time-evolving.

1 Introduction
In graph theory, the facility location problem is quite important since it involves
finding good locations for one or more facilities in a given environment. Solving
this problem starts by finding the nodes whose total distance to the other nodes is the
smallest in the graph, since the cost of reaching all other nodes from these
nodes is expected to be low. In graph analysis, the centrality based on this
concept is closeness centrality. In this paper, the closeness centrality of node u, Cu, is
defined as the sum of the distances from the node to all other nodes.
The naive approach, the exact computation of centrality, is impractical; it
needs distances of all node pairs. This led to the introduction of approximate
approaches, such as the annotation approach [13] and the embedding approach
[12,11], to estimate centralities. These approaches have the advantage of speed
at the expense of exactness. However, approximate algorithms are not adopted
by many practitioners. This is because the optimality of the solution is not
guaranteed; it is hard for approximate algorithms to identify the lowest centrality
node exactly. Furthermore, the focus of traditional graph theory has been limited
to just ‘static’ graphs; the implicit assumption is that nodes and edges never

change. Recent years have witnessed a dramatic increase in the availability of


graph datasets that comprise many thousands and sometimes even millions of
time-evolving nodes; a consequence of the widespread availability of electronic
databases and the Internet. Recent studies on large-scale graphs are discovering
several important principles of time-evolving graphs [10,8]. Thus demands for the
efficient analysis of time-evolving graphs are increasing. We address the following
problem in this paper:
Given: graph G[t] = (V [t], E[t]) at time t where V [t] is a set of nodes and E[t]
is a set of edges at time t.
Find: the nodes that have the lowest closeness centrality in graph G[t].

We propose a novel method called Sniper that can efficiently identify the lowest
centrality nodes in time-evolving graphs. To the best of our knowledge, our
approach is the first solution to achieve both exactness and efficiency at the
same time in identifying the lowest centrality nodes from time-evolving graphs.

1.1 Problem Motivation


The problem tackled in this paper must be overcome to develop the following
important applications.
Networks of interaction have been studied for a long time by social science
researchers, where nodes correspond to people or organizations, and edges rep-
resent some type of social interaction. The question of ‘which is the most im-
portant node in a network?’ is being avidly pursued by scientific researchers.
An important example is the network obtained by considering scientific publica-
tions. Nodes in this case are researchers, papers, books, or entire journals, and
edges correspond to co-authorship or citations. This kind of network generally
grows very rapidly over time. For example, the collaboration network of scien-
tists in the database area contains several tens of thousands of authors and its
rate of growth is increasing year by year; there are several thousand new authors
each year [5]. The systematic construction of such networks was introduced by
Garfield, who later proposed a measure of standing for journals that is still in
use. This measure, called impact factor, is defined as the number of citations per
published item [6]. Basically, the impact factor is a very simple measure, since it
corresponds to the degree of the citation network. However the degree is a local
measure, because the value is only determined by the number of adjacent nodes.
That is, if a high-degree node lies in an isolated community of the network, the
influence of the node is very limited.
Closeness centrality is a global centrality measure since it is computed by
summing the distances to all other nodes in a graph. Therefore, it is an effective
measure of influence on other nodes. The most influential node can be effectively
detected as the lowest closeness centrality node by monitoring time-evolving
graphs. Nascimento et al. analyzed SIGMOD’s co-authorship graph [9]. They
successfully discovered that L. A. Rowe, M. Stonebraker, and M. J. Carey were
the most influential researchers from 1986 to 1988, 1989 to 1992, and 1993 to
2002, respectively. All these three are very famous and important researchers in
the database community.
The remainder of this paper is organized as follows. Section 2 describes re-
lated work. Section 3 overviews some of the background of this work. Section 4
introduces the main ideas of Sniper. Section 5 discusses some of the topics re-
lated to Sniper. Section 6 gives theoretical analyses of Sniper. Section 7 reviews
the results of our experiments. Section 8 provides our brief conclusion.

2 Related Work
Many papers have been published on approximations of node-to-node distances.
The previous distance approximation schemes are distinguished into two types:
annotation types and embedding types. Rattigan et al. studied two annotation
schemes [13]. They randomly select nodes in a graph and divide the graph into
regions that are connected, mutually exclusive, and collectively exhaustive. They
give a set of annotations to every node from the regions. Distances are computed
from the annotations. They demonstrated that their method can compute node dis-
tances more accurately than the embedding approaches. However, this method
can require O(n²) space and O(n³) time to estimate the lowest centrality nodes
as described in their paper.
The Landmark technique is an embedding approach [7,12], and estimates node-
to-node distances from selected nodes in O(n) time. The minimum distance via a
landmark node is utilized as the node distance in this method. Another embedding
technique is Global Network Positioning, which was studied by Ng et al. [11]. Node
distances are estimated from the Lp norm between node pairs. These embedding
techniques require O(n²) space since all n nodes hold distances to O(n) selected
nodes. Moreover, they require O(n³) time to identify the lowest centrality node.
This is because they take O(n) time to estimate a node pair distance and need the
distances of n² node pairs to compute the centralities of all nodes.

3 Preliminary
In this section, we introduce the background to this paper. Social networks and
others can be described as graph G = (V, E), where V is the set of nodes, and
E is the set of edges. We use n and m to denote the number of nodes and edges,
respectively. That is n = |V | and m = |E|. A path from node u to v is the
sequence of nodes linked by edges, beginning with node u and ending at node
v. A path from node u to v is the shortest path if and only if the number of
nodes in the path is the smallest possible among all paths from node u to v. The
distance between node u and v, d(u, v), is the number of edges in the shortest
path connecting them in the graph. Therefore d(u, u) = 0 for every u ∈ V , and
d(u, v) = d(v, u) for u, v ∈ V . The closeness centrality of node u, Cu, is the sum
of the distances from the node to all other nodes, and is computed as Cu = Σ_{v∈V} d(u, v).
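As a concrete reference point, the closeness centrality of a single node can be computed exactly with one breadth-first search; the sketch below (an assumption, not the authors' code) uses an adjacency-set representation of an unweighted graph.

```python
from collections import deque

def closeness(adj, u):
    """Exact closeness centrality C_u: sum of BFS distances from u to all other
    nodes.  `adj` maps each node to the set of its neighbours (unweighted graph);
    unreachable nodes are simply ignored in this sketch."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return sum(dist.values())
```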

4 Centrality Monitoring
In this section, we explain the two main ideas underlying Sniper. The main
advantage of Sniper is to exactly and efficiently identify the lowest closeness
centrality nodes in time-evolving graphs. While we focus on undirected and
unweighted graphs in this section, our approach can be applied to weighted
or directed graphs as described in Section 5.1. Moreover, we can handle range
queries (find the nodes whose centralities are less than a given threshold) and
K-best queries (find the K lowest centrality nodes) as described in Section 5.2.
For ease of explanation, we assume that no two nodes will have exactly the same
centrality value and that one node is added to a time-evolving graph at each
time tick. These assumptions can be eliminated easily. All proofs in this
section are omitted due to space limitations.

4.1 Ideas Behind Sniper


Our solution is based on the two ideas described below.

Node aggregation. We introduce approximations to reduce the high cost of the


existing approaches. Instead of computing the exact centrality of every node, we
approximate the centrality, and use the result to efficiently prune high-centrality
nodes.
For a given graph with n nodes and m edges, we create an approximate graph
of n′ nodes and m′ edges (n′ < n, m′ < m) by aggregating ‘similar’ nodes in
the graph. For the approximate graph, O(n′ + m′) time is required for Sniper to
compute the approximate centralities, while the existing approximate algorithm
requires O(n²) time as described in Section 2. We exploit the Jaccard coefficient
to find similar nodes, and then aggregate the original nodes to create node
groups. We refer to such groupings as aggregate nodes.
This new idea has the following two major advantages. First, we can find the
answer node exactly; the node that has the lowest centrality is never missed
by this approach. This is because our approximate graphs guarantee the lower
bounding distances. This means that we can safely discard unpromising nodes
at low CPU cost. The second advantage is that this idea can reduce the number
of nodes that must be processed to compute centralities, as well as reducing the
computation cost for each node. That is, we can identify the lowest centrality
node among a large number of nodes efficiently.

Tree estimation. Although our approximation technique is able to discard most


of the unlikely nodes, we still rely on exact centrality computation to guarantee
the correctness of the search results. Here we focus on reducing the cost of this
computation.
To compute the exact centrality of a node, distances to all other nodes from
the node have to be computed by breadth-first search (BFS). But clearly the
exhaustive exploration of nodes in a graph is not computationally feasible, espe-
cially for large graphs. Our proposal exploits the following idea: If a node cannot
be the lowest centrality node, we terminate subsequent distance computations


as being unnecessary.
Our search algorithm first holds a candidate node, which is expected to have
low centrality. We then estimate the distances of unexplored nodes in the dis-
tance computation from a single BFS-tree to obtain the lower centrality bound.
In the search process, if the lower centrality bound of a node gives a value larger
than the exact centrality of the candidate node, the node cannot be the lowest
centrality node in the original graph. Accordingly, unnecessary distance compu-
tations can be terminated early.

4.2 Node Aggregation

Our first idea involves aggregating nodes of the original graph, which enables us
to compute the lower centrality bound and thus realize reliable node pruning.

Graph Approximation. We reduce the original graph size in order to compute


approximate centralities at low computation cost. To realize efficient search,
given the original graph G with n nodes and m edges, we compute n′ nodes and m′
edges in the approximate graph G′. That is, the original graph G = (V, E) is
collapsed to yield the approximate graph G′ = (V′, E′). We first describe how
to compute the edges of the approximate graph, and then show our approach of
original node aggregation.
For the aggregate nodes u′ and v′, there is an edge {u′, v′} ∈ E′ if and only
if there is at least one edge between aggregated original nodes in u′ and v′. This
definition is important in computing the lower centrality bound. Formally, we
obtain the edges between aggregate nodes u′ and v′ as follows:
Definition 1 (Node aggregation). In the approximate graph G′, nodes u′ and
v′ have an edge if and only if:

(1) u′ ≠ v′,  (2) ∃{u, v} ∈ E s.t. u ∈ u′ ∧ v ∈ v′          (1)

where u ∈ u′ indicates that aggregate node u′ contains original node u.


To reduce the approximation error, we aggregate similar nodes. As described
above, the aggregate nodes share an edge if and only if there is at least one edge
between the original nodes that have been aggregated. Therefore, the approx-
imation error decreases as the number of neighbors shared by the aggregated
nodes increases. For this reason, we utilize the Jaccard coefficient since it is a
simple and natural measure of similarity between sets [4].
Let Nu and Nv be sets of neighbors (adjacent nodes) of nodes u and v, respec-
tively; the Jaccard coefficient is defined as |Nu ∩ Nv |/|Nu ∪ Nv |, i.e. the size of
the intersection of the sets divided by the size of their union. We aggregate nodes
u and v if the most similar node of u is node v; this yields good approximations.
Note that we do not aggregate nodes u and v if the size of their intersection is less
than one half of the size of their union; this avoids aggregating dissimilar nodes.

If one node is added to a time-evolving graph, we compute its most similar


node to update the approximate graph. The naive approach to compute the most
similar node for the added node is to compute the similarities for all nodes. We,
on the other hand, utilize the following lemma to efficiently update the most
similar node:
Lemma 1 (Update the most similar nodes). For the added node, the most
similar node is at most two hops apart.
Using the above lemma, we first obtain the nodes that are one or two hops
away from the added node in the search process. We then compute the similarities for
the added node, update its most similar node, and link the aggregate
nodes according to Definition 1.
Even though we assume that a single node is added to a time-evolving graph at
each time tick, Lemma 1 can also be applied to the case of single node deletion;
that is, we can efficiently update the most similar node with Lemma 1 for node
deletion as well. If several nodes are added, we iterate the above procedure for each node.
If one edge is added or deleted, we delete one of its endpoint nodes and then re-add that node.
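A sketch of this update step (assuming the graph is kept as a dict of neighbour sets, which is my representation, not the paper's): by Lemma 1 only the one- and two-hop neighbourhood of the added node has to be scanned for its most similar node.

```python
def jaccard(adj, u, v):
    """Jaccard coefficient |N_u & N_v| / |N_u | N_v| over neighbour sets."""
    nu, nv = adj[u], adj[v]
    return len(nu & nv) / len(nu | nv)

def most_similar_after_insertion(adj, added):
    """Sketch of the update behind Lemma 1: only nodes within two hops of the
    added node need to be examined as most-similar-node candidates."""
    one_hop = set(adj[added])
    two_hop = {w for v in one_hop for w in adj[v]} - one_hop - {added}
    candidates = one_hop | two_hop
    if not candidates:
        return None, 0.0
    best = max(candidates, key=lambda v: jaccard(adj, added, v))
    return best, jaccard(adj, added, best)

# The added node is aggregated with `best` only when the similarity is at least
# 0.5, i.e. the intersection is no smaller than half of the union (see above).
```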

Lower Bounding Centrality. Given an approximate graph, we compute the


approximate centrality of node u′ as follows:
Definition 2 (Approximate closeness centrality). For the approximate
graph, the approximate closeness centrality of node u′, C′u′, is computed as follows:

C′u′ = Σ_{v′∈V′} { d(u′, v′) · |v′| }          (2)

where d(u′, v′) is the node distance in the approximate graph (i.e., the number of
hops from node u′ to v′) and |v′| is the number of original nodes aggregated
within node v′.
We can provide the following lemma about the centrality approximation:
Lemma 2 (Approximate closeness centrality). For any node u′ in the
approximate graph, C′u′ ≤ Cu holds for every original node u ∈ u′.
Lemma 2 provides Sniper with the property of finding the exact answer as is
described in Section 6.
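A sketch of Equation (2): a single BFS from u′ on the aggregate graph, with each reached aggregate node weighted by the number of original nodes it contains (adj_agg and size are assumed inputs; this is an illustration, not the authors' implementation).

```python
from collections import deque

def approximate_centrality(adj_agg, size, u_agg):
    """Approximate closeness centrality of Definition 2: BFS distances on the
    aggregate graph, each weighted by size[v'] = |v'|, the number of original
    nodes aggregated inside v'."""
    dist = {u_agg: 0}
    queue = deque([u_agg])
    total = 0
    while queue:
        x = queue.popleft()
        total += dist[x] * size[x]
        for y in adj_agg[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return total
```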

4.3 Tree Estimation


We introduce an algorithm for computing original centralities efficiently. We ter-
minate subsequent distance computations from a node if the estimate centrality
of the node is larger than the exact centrality of the candidate node. In this
approach, we compute lower bounding distances of unexplored nodes via BFS
to estimate the lower centrality bound of a node. Estimations are obtained from
a single BFS-tree.

Notation. We first give some notations for the estimation. In the search process,
we construct the BFS-tree rooted at a selected node. As a result, the selected
node forms layer 0. The direct neighbors of the node form layer 1. All nodes
that are i hops apart from the selected node form layer i. We later describe our
approach to selecting the node.
Next, we check by BFS whether the exact centralities of other nodes in the tree are
lower than the exact centrality of the candidate node. We define the set of nodes
explored by BFS as Vex , and the set of unexplored nodes as Vun (= V − Vex ).
dmax (u) is the maximum distance of the explored node from node u, that is
dmax (u) = max{d(u, v) : v ∈ Vex }. Moreover, we define the explored layers of
the tree as Lex if and only if there exists at least one explored node in the layer.
Similarly we define the unexplored layers as Lun if and only if there exists no
explored node in the layer. The layer number of node u is denoted as l(u).

Centrality Estimation. We define how to estimate the centrality of a node.


We estimate the closeness centrality of node u via BFS as follows:
Definition 3 (Estimate closeness centrality). For the original graph, we
define the following centrality estimate of node u, Ĉu , to terminate distance
computation in BFS:
 
Ĉu = Σ_{v∈Vex} d(u, v) + Σ_{v∈Vun} e(u, v),          (3)

where
  e(u, v) = dmax(u)                                        if v ∈ Vun ∩ Lex,
  e(u, v) = dmax(u) + min{ |l(v) − l(w)| : w ∈ Lex } − 1   if v ∈ Lun.

The estimation is the same as exact centrality if all nodes are explored (i.e.
Vex = V ) in Equation (3). To show the property of the centrality estimate, we
introduce the following lemma:

Lemma 3 (Estimate closeness centrality). For the original graph, Ĉu ≤ Cu


holds in BFS.

This property enables Sniper to identify the lowest centrality node exactly.
The selection of the root node of the tree is important for efficient pruning.
We select the lowest centrality node of the previous time tick as the root node.
There are two reasons for this approach. The first is that this node and nearby
nodes are expected to have the lowest centrality value, and thus are likely to be
the answer node after node addition. In the case of time-evolving graphs, small
numbers of nodes are continually being added to the large number of already
existing nodes. Therefore, there is little difference between the graphs before and
after node addition. In addition, we can more accurately estimate the centrality
value of a node if the node is close to the root node; this is the second reason.
This is because our estimation scheme is based on distances from the root node.

Algorithm 1. Sniper
Input: G[t] = (V, E), a time-evolving graph at time t
       uadd, the node added at time t
       ulow[t − 1], the previous lowest centrality node
Output: ulow[t], the lowest centrality node
 1: //Update the approximate graph
 2: update the approximate graph by the update algorithm;
 3: //Search for the lowest centrality node
 4: Vexact ← empty set;
 5: compute the BFS-tree of node ulow[t − 1];
 6: compute θ, the exact centrality of node ulow[t − 1];
 7: for each node v′ ∈ V′ do
 8:   compute C′v′ in the approximate graph by the estimation algorithm;
 9:   if C′v′ ≤ θ then
10:     for each node v ∈ v′ do
11:       append node v → Vexact;
12:     end for
13:   end if
14: end for
15: for each node v ∈ Vexact do
16:   compute Cv in the original graph by the estimation algorithm;
17:   if Cv < θ then
18:     θ ← Cv;
19:     ulow[t] ← v;
20:   end if
21: end for
22: return ulow[t];

4.4 Search Algorithm


Our main approach to finding the lowest centrality node is to prune unlikely
nodes by using our approximation, and then confirm by exact centrality compu-
tations whether the viable nodes are the answer. However, an important question
is which node should be selected as the candidate in time-evolving graphs. We
select the previous lowest centrality node as the candidate. This node is likely to
have the lowest centrality, as described in Section 4.3. After we construct the BFS-
tree, the exact centrality of the candidate node can be directly obtained with
this approach.
Algorithm 1 shows the search algorithm that targets the lowest closeness cen-
trality node. In this algorithm, ulow [t], ulow [t − 1] and uadd indicate the lowest
centrality node, the previous lowest centrality node, and the added node, respec-
tively. Vexact represents the set of nodes for which we compute exact centralities.
The algorithm can be divided into two phases: update and search. In the
update phase, Sniper computes the approximate graph by the update algorithm
(line 2). In the search phase, Sniper first computes the BFS-tree of the answer
node of the last time tick (line 5) and θ (line 6). If the approximate centrality of
a node is larger than θ, we prune the node since it cannot be the lowest centrality
node. Otherwise, Sniper appends aggregated original nodes to Vexact (lines 9-13),
and then computes exact centralities to identify the lowest centrality node (lines
15-21).
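A Python skeleton of the search phase (lines 4–22 of Algorithm 1); approx_centrality, exact_centrality and members are assumed helpers that wrap the estimation algorithms and the aggregate-graph bookkeeping, so this is a sketch rather than the authors' implementation.

```python
def sniper_search(aggregate_nodes, members, approx_centrality,
                  exact_centrality, u_low_prev):
    """Skeleton of the search phase of Algorithm 1.  Assumed helpers (not shown):
    approx_centrality(v_agg) -- lower-bounding centrality on the approximate graph,
    exact_centrality(v)      -- exact centrality on the original graph,
    members(v_agg)           -- original nodes aggregated inside v_agg."""
    theta = exact_centrality(u_low_prev)          # line 6: candidate centrality
    v_exact = []
    for v_agg in aggregate_nodes:                 # lines 7-14: approximate pruning
        if approx_centrality(v_agg) <= theta:
            v_exact.extend(members(v_agg))
    u_low = u_low_prev
    for v in v_exact:                             # lines 15-21: exact confirmation
        c = exact_centrality(v)
        if c < theta:
            theta, u_low = c, v
    return u_low
```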

5 Extension
In this section, we give a discussion of possible extensions to Sniper.

5.1 Directed or Weighted Graphs


We focus on undirected and unweighted graphs in this paper, but Sniper can
also handle directed or weighted graphs effectively. As described in Section 4.2,
approximate graphs have an edge if and only if there is at least one edge between
aggregated nodes in an undirected and unweighted graph. However, we must
modify how approximate graphs are constructed if we are to handle other kinds
of graphs.
For directed graphs, we apply Definition 1 for each direction to handle directed
edges of approximate graphs. For weighted graphs, we choose the lowest value of
the weights of the original edges as the weight of the aggregated edge to compute
the lower bound of exact centralities.
To estimate centrality values for weighted graphs, we can directly apply Defini-
tion 2. For the tree estimation, however, we need a small modification: we estimate the
distance from node u to v as dmax(u) + min{ω(v, w) : w ∈ V \ {v}}, where ω(v, w)
is the weight of edge {v, w}.

5.2 Other Types of Queries


Although the search algorithm described here identifies the node that has the
lowest centrality, the proposed approach can be applied to range queries and
K-best queries. Range queries find the nodes whose centralities are less than a
given threshold, while K-best queries find the K lowest centrality nodes.
For range queries, we utilize the given search threshold as θ, instead of the
exact centrality of the previous time tick (i.e., we do not use the candidate). We
compute approximate centralities of all nodes and prune unlikely nodes using
the given θ; we confirm the answer nodes by calculating exact centralities.
For K-best queries, we first compute the exact centralities at time t of all
K answer nodes in the last time tick. Next, we select the K-th lowest exact
centrality as θ. Subsequent procedures are the same as for the case of identifying
the lowest centrality node.

6 Theoretical Analysis
This section provides theoretical analyses that confirm the accuracy and com-
plexity of Sniper. Let n be the number of nodes and m the number of edges.
We prove that Sniper finds the lowest centrality node accurately (without fail)
as follows:

Theorem 1 (Find the lowest centrality node). Sniper guarantees the exact
answer when identifying the node whose centrality is the lowest.
Proof. Let ulow be the lowest centrality node in the original graph, and θlow
be the exact centrality of ulow (i.e., θlow is the lowest centrality). Also let θ be
the candidate centrality in the search process.
In the approximate graph, since θlow ≤ θ, the approximate centrality of node
ulow is never greater than θ (Lemma 2). Similarly, in the original graph, the
estimate centrality of node ulow is never greater than θ (Lemma 3). The algorithm
discards a node if (and only if) its approximate or estimated centrality is greater
than θ. Therefore, the lowest centrality node ulow can never be pruned during
the search process. 
We now turn to the complexity of Sniper. Note that the previous approaches
need O(n²) space and O(n³) time to compute the lowest centrality node.

Theorem 2 (Complexity of Sniper). Sniper requires O(n + m) space and
O(n² + nm) time to compute the lowest centrality node.

Proof. We first prove that Sniper requires O(n + m) space. Sniper keeps the
approximate graph and the original graph. In the approximate graph, since the
number of nodes and edges are at most n and m, respectively, Sniper needs
O(n+m) space for the approximate graph; O(n+m) space is required for keeping
the original graph. Therefore, the space complexity of Sniper is O(n + m).
Next, we prove that Sniper requires O(n² + nm) time. To identify the lowest
centrality node, Sniper first updates the approximate graph and then computes
approximate and exact centralities. Sniper needs O(nm) time to update the
approximate graph, since it requires O(m) time to compute similarity for the
added node against each node in the original graph. It requires O(n² + nm)
time to compute the approximate and exact centralities since the numbers of
nodes and edges are at most n and m, respectively. Therefore, Sniper requires
O(n² + nm) time.
Theorem 2 shows, theoretically, that the space and time complexities of Sniper
are lower in order than those of the previous approximate approaches. In practice,
the search cost depends on the effectiveness of the approximation and estimation
techniques used by Sniper. In the next section, we show their effectiveness by
presenting the results of extensive experiments.

7 Experimental Evaluation
We performed experiments to demonstrate Sniper’s effectiveness in a comparison
to two annotation approaches: the Zone annotation scheme and the Distance to
zone annotation scheme (abbreviated to DTZ). These were selected since they
outperform the other embedding schemes on our datasets; the
same result is reported in [13]. Zone and DTZ annotation have two parameters:

Fig. 1. Efficiency of Sniper: wall clock time [s] on P2P, Social, and WWW.
Fig. 2. Scalability of Sniper: wall clock time [s] as a function of the number of nodes.
Fig. 3. The results of the annotation approaches (Zone and DTZ with 2–4 zones): (1) error ratio and (2) wall clock time [s] versus the number of dimensions.

zones and dimensions. Zones are divided regions of the entire graph, and di-
mensions are sets of zones1 . Note that these approaches can compute centrality
quickly at the expense of exactness.
We used the following three public datasets in the experiments: P2P [1], Social
[2], and WWW [3]. They are a campus P2P network for file sharing, a free on-
line social network, and web pages within the ‘nd.edu’ domain, respectively. We
extracted the largest connected component from the real data, and we added
single nodes one by one in the experiments.
We evaluated the search performance through wall clock time. All experiments
were conducted on a Linux quad 3.33 GHz Intel Xeon server with 32GB of main
memory. We implemented our algorithms using GCC.

7.1 Efficiency of Sniper


We assessed the search time needed for Sniper and the annotation approach.
Figure 1 shows the efficiency of Sniper where the number of nodes are 500, 000
for P2P and Social, and 100, 000 for WWW. We also show the scalability of
our approach in Figure 2; this figure shows the wall clock time as a function of
the number of nodes. We show only the result of P2P in Figure 2 due to space
limitations. These figures indicate Sniper’s total processing time (both update
and search time are included). We set the number of zones and the dimension
parameter to 2 and 1, respectively. Note that, these parameter values allow the
annotation approaches to estimate the lowest centrality node most efficiently.
These figures show that our method is much faster than the annotation ap-
proaches under all conditions examined. Specifically, Sniper is more than 110
times faster.
The annotation approaches require O(n²) time for computing centralities,
while Sniper requires O(n′ + m′) time for computing approximate centralities.
Even if Sniper computes the approximate centralities of all aggregate nodes to
prune the nodes, this cost does not alter the search cost since approximate com-
putations are effectively terminated. Sniper requires O(n + m) time to compute

¹ To compute the centralities of all nodes by the annotation approaches, we sampled half of the node pairs, which is the same setting as used in [13].

exact centralities for nodes that cannot be pruned through approximation. This
cost, however, has no effect on the experimental results. This is because a sig-
nificant number of nodes are pruned by approximation.

7.2 Exactness of the Search Results


One major advantage of Sniper is that it guarantees the exact answer, but this
raises the following simple question: ‘How successful are the previous approaches
in providing the exact answer even though they sacrifice exactness?’.
To answer this question, we conducted comparative experiments on the an-
notation approaches. As the metric of accuracy, we used the error ratio, which
is the error centrality value of the estimated lowest centrality node divided by
the centrality value of the exact answer node. Figure 3-(1) shows the error ratio
and the wall clock time of the annotation approaches with various parameter
settings. The number of nodes is 10,000 and the dataset used is P2P in these
figures.
As we can see from Figure 3, the error ratio of Sniper is 0 because it identi-
fies the lowest centrality node without fail. The annotation approaches, on the
other hand, have much higher error ratios. Therefore, it is not practical to use
the annotation approaches in identifying the lowest centrality node. Figure 3-(2)
shows that Sniper greatly reduces the computation time even though it guaran-
tees the exact answer. The efficiency of the annotation approaches depends on
the parameters used.
Furthermore, the results show that the annotation approaches force a trade-
off between speed and accuracy. That is, as the number of zones and dimensions
parameters decreases, the wall clock time decreases but the error ratio increases.
The annotation approaches are approximation techniques and so can miss the
lowest centrality node. Sniper also computes approximate centralities, but unlike
the annotation approaches, Sniper does not discard the lowest centrality node in
the search process. As a result, Sniper is superior to the annotation approaches
in not only accuracy, but also speed.

8 Conclusions
This paper addressed the problem of finding the lowest closeness centrality node
from time-evolving graphs efficiently and exactly. Our proposal, Sniper, is based
on two ideas: (1) It approximates the original graph by aggregating original nodes
to compute approximate centralities efficiently, and (2) It terminates unnecessary
distance computations early in finding the answer nodes, which greatly improves
overall efficiency. Our experiments show that Sniper works as expected; it can
find the lowest centrality node at high speed; specifically, it is significantly (more
than 110 times) faster than existing approximate methods.

References
1. http://kdl.cs.umass.edu/data/canosleep/canosleep-info.html
2. http://snap.stanford.edu/data/soc-LiveJournal1.html
3. http://vlado.fmf.uni-lj.si/pub/networks/data/ND/NDnets.htm
4. Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of
the web. Computer Networks 29(8-13), 1157–1166 (1997)
5. Elmacioglu, E., Lee, D.: On six degrees of separation in dblp-db and more. SIG-
MOD Record 34(2), 33–40 (2005)
6. Garfield, E.: Citation Analysis as a Tool in Journal Evaluation. Science 178, 471–
479 (1972)
7. Goldberg, A.V., Harrelson, C.: Computing the shortest path: search meets graph
theory. In: SODA, pp. 156–165 (2005)
8. Leskovec, J., Kleinberg, J.M., Faloutsos, C.: Graph evolution: Densification and
shrinking diameters. TKDD 1(1) (2007)
9. Nascimento, M.A., Sander, J., Pound, J.: Analysis of sigmod’s co-authorship graph.
SIGMOD Record 32(3), 8–10 (2003)
10. Newman, M.E.J.: The structure and function of complex networks. SIAM Review 45 (2003)
11. Ng, T.S.E., Zhang, H.: Predicting internet network distance with coordinates-based
approaches. In: INFOCOM (2002)
12. Potamias, M., Bonchi, F., Castillo, C., Gionis, A.: Fast shortest path distance
estimation in large networks. In: CIKM, pp. 867–876 (2009)
13. Rattigan, M.J., Maier, M., Jensen, D.: Using structure indices for efficient approx-
imation of network properties. In: KDD, pp. 357–366 (2006)
Graph-Based Clustering with Constraints

Rajul Anand and Chandan K. Reddy

Department of Computer Science,


Wayne State University, Detroit, MI, USA
rajulanand@wayne.edu, reddy@cs.wayne.edu

Abstract. A common way to add background knowledge to the clustering al-


gorithms is by adding constraints. Though there have been some algorithms that
incorporate constraints into the clustering process, not much focus has been given to
the topic of graph-based clustering with constraints. In this paper, we propose
a constrained graph-based clustering method and argue that adding constraints
in distance function before graph partitioning will lead to better results. We also
specify a novel approach for adding constraints by introducing the distance limit
criteria. We will also examine how our new distance limit approach performs in
comparison to earlier approaches that use a fixed distance measure for constraints.
The proposed approach and its variants are evaluated on UCI datasets and com-
pared with the other constrained-clustering algorithms which embed constraints
in a similar fashion.

Keywords: Clustering, constrained clustering, graph-based clustering.

1 Introduction
One of the primary forms of adding background knowledge for clustering the data is
to provide constraints during the clustering process [1]. Recently, data clustering using
constraints has received a lot of attention. Several works in the literature have demon-
strated improved results by incorporating external knowledge into clustering in differ-
ent applications such as document clustering and text classification. The addition of some
background knowledge can sometimes significantly improve the quality of the final
results obtained. The final clusters that do not obey the initial constraints are often in-
adequate for the end-user. Hence, adding constraints and respecting these constraints
during the clustering process plays a vital role in obtaining desired results in many
practical domains. Several methods are proposed in the literature for adding instance-
level and cluster-level constraints. Constrained versions of partitional [19,1,7], hierar-
chical [5,13] and more recently, density-based [17,15] clustering algorithms have been
studied thoroughly. However, there has been little work in utilizing the constraints in
the graph-based clustering methods [14].

1.1 Our Contributions


We propose an algorithm to systematically add instance-level constraints to the graph-
based clustering algorithm. In this work, we primarily focused our attention on one such
popular algorithm, CHAMELEON, an overview of which is provided in section 3.2.
Our contributions can be outlined as follows:


– Investigate the appropriate way of embedding constraints into the graph-based clustering algorithm for obtaining better results.
– Propose a novel distance limit criterion for must-links and cannot-links while embedding constraints.
– Study the effects of adding different types of constraints to graph-based clustering.

The remainder of the paper is organized as follows: we briefly review the current ap-
proaches for using constraints in different methods in Section 2. In Section 3, we will
describe the various notations used throughout our paper and also give an overview of
a graph-based clustering method, namely, CHAMELEON. Next, we propose our algo-
rithm and discuss our approach regarding how and where to embed constraints in Sec-
tion 4. We present several empirical results on different UCI datasets and comparisons
to the state-of-the-art methods in Section 5. Finally, Section 6 concludes our discussion.

2 Relevant Background

Constraint-based clustering has received a lot of attention in the data mining community in recent years [3]. In particular, instance-based constraints have been successfully used to guide the mining process. Instance-based constraints enforce constraints on data points, as opposed to ε- and δ-constraints, which work on complete clusters. The ε-constraint says that for a cluster X having more than two points, for each point x ∈ X there must be another point y ∈ X such that the distance between x and y is at most ε. The δ-constraint requires the distance between any two points in different clusters to be at least δ. This methodology has also been termed semi-supervised clustering [9] when the cluster memberships are available for some data. As pointed out in the literature [19,5], even adding a small number of constraints can help in improving the quality of the results.
Embedding instance-level constraints into the clustering method can be done in several ways. A popular method of incorporating constraints is to compute a new distance metric and perform clustering. Other methods directly embed constraints into the optimization criteria of the clustering algorithm [19,1,5,17]. Hybrid methods combining these two basic approaches have also been studied in the literature [2,10]. Adding instance-level constraints to density-based clustering methods has recently received some attention as well [17,15]. In spite of the popularity of graph-based clustering methods, not much attention has been given to the problem of adding constraints to these methods.

3 Preliminaries
Let us consider a dataset D, whose cardinality is denoted as |D|. The total number of classes in the dataset is K. A proximity graph is constructed from this dataset by computing the pair-wise Euclidean distance between the instances. A user-defined parameter k specifies the number of neighbors considered for each data point. The hyper-graph partitioning algorithm generates intermediate subgraphs (or sub-clusters), the number of which is denoted by κ.
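As a concrete illustration of this construction, the following minimal Python/NumPy sketch (our own illustration, not code from the paper; the function name knn_graph and the symmetrization choice are assumptions) builds such a k-nearest-neighbor graph from pairwise Euclidean distances.

import numpy as np

def knn_graph(X, k):
    # X: (m, d) data matrix; returns a symmetric (m, m) weight matrix whose
    # nonzero entries are Euclidean distances to each point's k nearest neighbors.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise Euclidean distances
    W = np.zeros_like(dist)
    for i in range(len(X)):
        nbrs = np.argsort(dist[i])[1:k + 1]       # skip index 0 (the point itself)
        W[i, nbrs] = dist[i, nbrs]
    return np.maximum(W, W.T)                     # symmetrize the graph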

3.1 Definitions
Given a dataset D with each point denoted as (x, y), where x represents the point and y represents the corresponding label, we define constraints as follows:
Definition 1: Must-Link Constraints (ML): Two instances (x1, y1) and (x2, y2) are said to be must-link constraints if and only if y1 = y2, where y1, y2 ∈ K.
Definition 2: Cannot-Link Constraints (CL): Two instances (x1, y1) and (x2, y2) are said to be cannot-link constraints if and only if y1 ≠ y2, where y1, y2 ∈ K.
Definition 3: Transitivity of ML-constraints: Let X, Y be two components formed using ML-constraints. Then, a new ML-constraint (x1 -must-link- x2), where x1 ∈ X and x2 ∈ Y, introduces the following new constraints: (xi -must-link- xj) for all xi, xj where xi ∈ X and xj ∈ Y, i ≠ j, X ≠ Y.
Definition 4: Entailment of CL-constraints: Let X, Y be two components formed using ML-constraints. Then, a new CL-constraint (x1 -cannot-link- x2), where x1 ∈ X and x2 ∈ Y, introduces the following new CL-constraints: (xi -cannot-link- xj) for all xi, xj where xi ∈ X and xj ∈ Y, i ≠ j, X ≠ Y.
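The transitivity and entailment properties in Definitions 3 and 4 can be computed mechanically. The following Python sketch (our own illustration; the function name propagate_constraints is hypothetical) closes a set of ML constraints under transitivity using union-find components and expands CL constraints between the resulting components.

from itertools import combinations

def propagate_constraints(ml_pairs, cl_pairs, points):
    # Union-find over must-link components (transitivity of ML, Definition 3)
    parent = {p: p for p in points}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for a, b in ml_pairs:
        parent[find(a)] = find(b)
    comps = {}
    for p in points:
        comps.setdefault(find(p), []).append(p)
    # Every pair inside a component becomes a must-link
    ml_closed = {frozenset(pair) for c in comps.values() for pair in combinations(c, 2)}
    # Entailment of CL (Definition 4): a CL between two components separates all their members
    cl_closed = set()
    for a, b in cl_pairs:
        for x in comps[find(a)]:
            for y in comps[find(b)]:
                cl_closed.add(frozenset((x, y)))
    return ml_closed, cl_closed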

3.2 Graph-Based Hierarchical Clustering


We chose to demonstrate the performance of adding constraints on the popularly studied and practically successful CHAMELEON clustering algorithm. Karypis et al. [11] proposed the CHAMELEON algorithm, which can find arbitrarily shaped clusters of varying density. It operates on sparse graphs containing similarities or dissimilarities between data points. Compared to various graph-based clustering methods [18] such as Minimum Spanning Tree clustering, OPOSSUM, ROCK, and SLINK, CHAMELEON is superior because it incorporates the best features of graph-based clustering (like the similarity measure on vertices as in ROCK) together with a hierarchical clustering phase that is similar to or better than SLINK. These features make CHAMELEON attractive for adding constraints to obtain better results. Moreover, CHAMELEON outperforms other algorithms like CURE [11] and also density-based methods like DBSCAN [18]. Thus, we believe adding constraints to CHAMELEON will not only give better results but also provide some insights on the performance of other similar algorithms in the presence of constraints.
Unlike other algorithms, which use given static modeling parameters to find clusters, CHAMELEON finds clusters by dynamic modeling. CHAMELEON uses both closeness and interconnectivity while identifying the most similar pair of clusters to be merged. CHAMELEON works in two phases. In the first phase, it finds the k-nearest neighbors based on the similarity between the data points. Then, using an efficient multi-level graph partitioning algorithm (such as METIS [12]), sub-clusters are created in such a way that similar data points are merged together. In the second phase, these sub-clusters are combined using a novel agglomerative hierarchical algorithm. Clusters are merged using the RI and RC metrics defined below. Let X, Y be two clusters. Mathematically, Relative Interconnectivity (RI) is defined as follows:
Mathematically, Relative Interconnectivity (RI) is defined as follows:
EC(X, Y )
RI = 1
(1)
2
(EC(X) + EC(Y ))

where EC(X, Y) is the sum of the edges that connect clusters X and Y in the k-nearest neighbor graph, EC(X) is the minimum sum of the cut edges if we bisect cluster X, and EC(Y) is the minimum sum of the cut edges if we bisect cluster Y. Let l_x and l_y represent the sizes of the clusters X and Y, respectively. Mathematically, Relative Closeness (RC) is defined as follows:

RC = \frac{\bar{S}_{EC}(X, Y)}{\frac{l_x}{l_x + l_y}\bar{S}_{EC}(X) + \frac{l_y}{l_x + l_y}\bar{S}_{EC}(Y)}    (2)

where \bar{S}_{EC}(X, Y) is the average weight of the edges connecting clusters X and Y in the k-nearest neighbor graph, and \bar{S}_{EC}(X), \bar{S}_{EC}(Y) represent the average weight of the edges if the clusters X and Y are bisected, respectively.
There are many schemes to account for both of these measures. The function used to combine them is (RI × RC^α). Here, an additional parameter α is included so as to give preference to one of the two measures. Thus, we maximize the function:

\arg\max_{\alpha \in (0, \infty)} (RI \times RC^{\alpha})    (3)
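For clarity, the merging criterion of Eqs. (1)-(3) can be written as a small function. The sketch below is our own illustration and assumes the edge-cut sums EC and the average edge weights S_EC have already been computed from the k-nearest neighbor graph; the function name and signature are not from the paper.

def merge_score(ec_xy, ec_x, ec_y, s_xy, s_x, s_y, l_x, l_y, alpha=1.0):
    # Relative interconnectivity, Eq. (1)
    ri = ec_xy / (0.5 * (ec_x + ec_y))
    # Relative closeness, Eq. (2)
    rc = s_xy / ((l_x / (l_x + l_y)) * s_x + (l_y / (l_x + l_y)) * s_y)
    # Combined criterion of Eq. (3): the pair of clusters maximizing this value is merged
    return ri * rc ** alpha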

4 Constrained Graph-Based Clustering


CHAMELEON, like other graph-based algorithms, is sensitive to its parameters, as a slight change in similarity values can dramatically increase or decrease the quality of the final outcome. For CHAMELEON, changes in similarity measures might result in different k-nearest neighbors. Overlapping clusters, or clusters with very small inter-cluster distance, sometimes lead to different class members in the same cluster. In this work, the primary emphasis is to demonstrate that adding constraints to graph-based clustering can potentially avoid this problem at least sometimes, if not always. Our basis for this assumption is the transitivity of ML constraints and the entailment property of CL constraints (Section 3.1).

4.1 Embedding Constraints

Using a distance (or dissimilarity) metric to enforce constraints [13] was claimed to be effective in practice, despite having some drawbacks. The main problem is caused by setting the distance between every must-linked pair of points to zero, i.e., let (p_i, p_j) be two instances in a must-link constraint, then

distance(p_i, p_j) = 0

To compensate for the distorted metric, we run an all-pairs-shortest-path algorithm so that the new distance measure remains similar to the original space. If we instead bring any two such points much closer to each other, i.e.,

\lim_{n \to distance(p_i, p_j)} distance(p_i, p_j) - n = \eta    (4)

At first look, it may seem that this minute change will not affect the results significantly. However, after running the all-pairs-shortest-path algorithm, the updated distance matrix in this case will respect the original distance measures better than setting the distance to zero. Similarly, for cannot-link constraints, let (q_i, q_j) be a pair of cannot-link constraints; then the points q_i and q_j are pushed apart as far as possible, i.e.,

\lim_{n \to \infty} distance(q_i, q_j) + n = \lambda    (5)

Thus, by varying the values of η and λ, we can push and pull points apart reasonably. It may seem that this creates a problem of finding optimal values of η and λ. However, our preliminary experiments show that basic limiting values for these parameters are enough in most cases. This addition of constraints (and thus the manipulation of the distance matrix) can be performed in the CHAMELEON algorithm primarily in either of the two phases. We can add these constraints before (or after) the graph partitioning step. After the graph partitioning, we can add constraints during agglomerative clustering. However, we prefer to add constraints before graph partitioning, primarily due to the following reasons:
– When the data points are already in sub-clusters, enforcing constraints through distances will not be beneficial unless we ensure that all such constraints are satisfied during the agglomerative clustering. However, constraint satisfaction might not lead to convergence every time. Especially with CL constraints, even determining whether a satisfying assignment exists is NP-complete.
– The intermediate clusters formed are based on the original distance metric. Hence, RI and RC computed on the original distance metric would be undermined by the new distance updates through constraints.

4.2 The Proposed Algorithm


Our approach for embedding constraints into the clustering algorithm is through learning a new distance (or dissimilarity) function. This measure is also adjusted to ensure that the new distance (or dissimilarity) function respects the original distance values to the maximum extent for unlabeled data points. We used the Euclidean distance for calculating dissimilarity. For embedding constraints, an important and intuitive question is: where should these constraints be embedded to achieve the best possible results? As outlined in the previous section, we choose to embed constraints in the first phase of CHAMELEON. Now, we present a step-by-step discussion of our algorithm.

Using Constraints. Our algorithm begins by using constraints to modify the distance matrix. To utilize the properties of the constraints (Section 3.1) and to restore the metricity of the distances, we propagate constraints. The must-links are propagated by running a fast version of the all-pairs-shortest-path algorithm. If u, v represent the source and destination, respectively, then the shortest path between u and v involves only the points u, v and x, where x must belong to some pair of ML constraints. Using this modification, the algorithm runs in O(n²m) time (here m is the number of unique points in ML). The complete-link clustering inherently propagates the cannot-link constraints. Thus, there is no need to propagate CL constraints during Step 1.

Algorithm 1. Constrained CHAMELEON (CC)

Input: Dataset D, set of must-link constraints ML = {ml1, ..., mln}, set of cannot-link constraints CL = {cl1, ..., cln}, number of desired clusters K, number of nearest neighbors k, number of intermediate clusters κ, significance factor α for RI or RC.
Output: Set of K clusters
1: Step 1: Embed constraints
2: for all (p1, p2) ∈ ML do
3:    lim_{n -> distance(p1, p2)} distance(p1, p2) - n = η
4: end for
5: fastAllPairShortestPaths(DistanceMatrix)
6: for all (q1, q2) ∈ CL do
7:    lim_{n -> ∞} distance(q1, q2) + n = λ
8: end for
9: Step 2: k-nn = build the k-nearest neighbor graph
10: Step 3: Partition the k-nn graph into κ clusters using edge-cut minimization
11: Step 4: Merge the κ clusters until K clusters remain, using the maximization of RI × RC^α as the criterion for merging clusters
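To make Step 1 concrete, the following Python/NumPy sketch (our own illustration, not the authors' implementation; embed_constraints is a hypothetical name) modifies a dense distance matrix with the η/λ limits and propagates the must-links with a Floyd-Warshall pass restricted to points appearing in ML, matching the O(n²m) variant described above.

import numpy as np

def embed_constraints(dist, ml_pairs, cl_pairs, eta, lam):
    # dist: dense (n, n) distance matrix; ml_pairs/cl_pairs: lists of index pairs
    d = dist.copy()
    # Pull must-linked points close together (distance -> eta)
    for i, j in ml_pairs:
        d[i, j] = d[j, i] = eta
    # Propagate ML constraints: Floyd-Warshall with intermediate nodes restricted
    # to points occurring in some must-link pair (O(n^2 m), m = unique ML points)
    ml_points = {p for pair in ml_pairs for p in pair}
    for x in ml_points:
        d = np.minimum(d, d[:, [x]] + d[[x], :])
    # Push cannot-linked points far apart (distance -> lambda)
    for i, j in cl_pairs:
        d[i, j] = d[j, i] = lam
    return d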

Importance of Constraint Positioning. Imposing the CL constraints just before Step 4 rather than in Step 1 might seem reasonable. We used the CL constraints in Step 1 due to the experimental observations stated below:
1. Hyper-graph partitioning with constraints is often better than constrained agglomerative clustering, when we are not trying to satisfy constraints in either one of them.
2. The clusters induced by graph partitioning have a stronger impact on the final clustering solution.
After Step 1, we create the k-nearest neighbor graph (Step 2) and partition the k-nn graph using a graph partitioning algorithm (METIS). The κ clusters are then merged using agglomerative clustering, where the aim is to maximize the product (RI × RC^α). Complete-link agglomerative clustering is used to propagate the CL constraints embedded earlier. The cut-off point in the dendrogram of the clustering is decided by the parameter K (see Algorithm 1). The time complexity of the unconstrained version of our algorithm is O(nκ + n log n + κ² log κ) [11]. The time complexity of Step 1 consists of adding constraints, which is O(l) (l = |ML| + |CL|), and O(n²m) for the propagation of ML constraints, giving an overall complexity of O(n²m) for Step 1. Therefore, the time complexity of our algorithm is O(nκ + n log n + n²m + κ² log κ).

5 Experimental Results
We will now present our experimental results obtained using the proposed method on benchmark datasets from the UCI Machine Learning Repository [8]. Our results on various versions of Constrained CHAMELEON (CC) were obtained with the same parameter settings for each dataset. These parameters were not tuned particularly for CHAMELEON; however, we did follow some specific guidelines for each dataset to obtain these parameters. We used the same default settings for all the internal parameters of the METIS hyper-graph partitioning package except κ, which is dataset dependent. We did not compare our results directly with constrained hierarchical clustering, since CC itself contains hierarchical clustering, which will be similar to or better than stand-alone hierarchical clustering algorithms. Instead, we compared with those algorithms that embed constraints into the distance function in the same manner as our approach. Our CC with fixed values of (0, ∞) for (η, λ) is similar to [13], except that we have a graph-partitioning step on the nearest-neighbor graph before agglomerative clustering. So, we ruled out this algorithm and instead compared our results with MPCK-means [4], as this algorithm also embeds constraints in the distance function. MPCK-means incorporates learning of the distance function on labeled data and on the data affected by constraints in each iteration. Thus, it learns a different distance function for each cluster.
For the performance measure, we used the Rand Index statistic [16], which measures the agreement between two sets of clusters X and Y for the same set of n objects as R = (a + b) / \binom{n}{2}, where a is the number of pairs of objects assigned to the same cluster in both X and Y, and b is the number of pairs of objects assigned to different clusters in both X and Y. All the parameter selection is done systematically. For all the clustering results, K is set to the true number of classes in the dataset. The value of α is chosen between 1 and 2 in increments of 0.1. We ran some basic experiments on CHAMELEON for each dataset to figure out the effect of α on the results, and we chose for each dataset the particular value of α that can potentially produce a better result. We used a similar procedure for k and κ. It is important to note that, among all the parameters, κ is dataset dependent. We assume that at least 10 data points should be present in a single cluster; thus K ≤ κ ≤ |D|/10. We used the class labels of the data points to generate constraints. We randomly select a pair of data points and check their labels: if the labels are the same, they are denoted as a must-link, else they are denoted as a cannot-link. To assess the impact of the constraints on the quality of the results, we varied the number of constraints. We generated results for ML only, CL only, and ML and CL constraints combined. The complete dataset is used to randomly select data points for constraints, thus removing any bias towards the generated constraints.
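A minimal sketch of the evaluation and constraint-generation procedure just described is given below (our own illustration; the helper names rand_index and random_constraints are not from the paper).

import numpy as np
from itertools import combinations

def rand_index(labels_a, labels_b):
    # R = (a + b) / C(n, 2): fraction of object pairs on which the two clusterings agree
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j]) for i, j in pairs)
    return agree / len(pairs)

def random_constraints(labels, num, rng=np.random.default_rng(0)):
    # Draw random pairs; same label -> must-link, different label -> cannot-link
    ml, cl = [], []
    while len(ml) + len(cl) < num:
        i, j = rng.choice(len(labels), size=2, replace=False)
        (ml if labels[i] == labels[j] else cl).append((int(i), int(j)))
    return ml, cl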

Table 1. Average Rand Index values for 100 ML + CL constraints on UCI datasets

Datasets    Instances  Attributes  Classes  MPCK-means  CC(p=1)  CC(fixed)
Ionosphere  351        34          2        0.5122      0.5245   0.5355
Iris        150        4           3        0.6739      0.7403   0.7419
Liver       345        6           2        0.5013      0.5034   0.5097
Sonar       208        60          2        0.5166      0.5195   0.5213
Wine        178        13          3        0.6665      0.6611   0.7162
Average                                     0.5741      0.5898   0.6049

We used five UCI datasets in our experiments, as shown in Table 1. The average Rand Index values for 100 must-link and cannot-link constraints clearly show that, on most occasions, MPCK-means [4] is outperformed by both variants of CC. CC(fixed) performed marginally better than CC(p=1). Also, we only show results for CC(p=5) and CC(p=15), since the results of CC(p=1) and CC(p=10) are similar to the other two.

For each dataset, we randomly select constraints and run the algorithm once per constraint set. This is done 10 times and we report the average Rand Index value over all 10 runs. We used this experimental setup for all the variants of CC and MPCK-means. The results are depicted in Figs. 1-3. We argue that the distance values for must-links and cannot-links can be varied instead of using the fixed values of 0 and ∞, respectively. CC(fixed) uses (0, ∞) for the distance measures. Ideally, the values of (η, λ) could be anything close to the extreme values of 0 and ∞, yet they have to be quantifiable. In order to quantify them in our experiments, we defined them as follows:

λ = D_max × 10^p    (6)
η = D_min × 10^{-p}    (7)

where D_max and D_min represent the maximum and minimum distance values in the data matrix, respectively. In order to study the effect of p, we varied its values: p = 1, 5, 10 and 15. Thus, we have CC(p = 1), CC(p = 5), CC(p = 10) and CC(p = 15), referring to different values of (η, λ). It is interesting to note that, for different values of p, the distance values for constraints are different for each dataset due to the different minimum and maximum distance values. In this manner, we respect the metricity of the original distances and vary our constraint values accordingly.
We tried various parameter settings and found that only a few selected ones made a significant difference in the quality of the final results. It is also important to note that these settings were found by running the basic CHAMELEON algorithm rather than CC. This is because finding optimal parameters for CC using various constraints would be constraint-specific and would not truly represent the algorithmic aspect. We then ran CC with a few selected settings for all the variants of CC and all constraint-set sizes, and we finally report the average values for the one set of parameters that shows better performance on average across all CC variants. The individual settings of the parameters (k, κ, α) for each dataset shown in the results are as follows: Ion(19,10,1.2), Iris(9,3,1), Liver(10,5,2), Sonar(6,3,1) and Wine(16,3,1). In summary, we selected the best results obtained by the basic version of the CHAMELEON algorithm, and have shown that these best results can be improved by adding constraints.
We consistently observed, across all the variants of CC and MPCK-means for all datasets, that the performance decreases as the number of constraints increases, except in some prominent cases (Figs. 1(d), 2(a), 2(b) and 3(d)). This observation is consistent with the results outlined in the recent literature [6]. We stated earlier that we did not attempt to satisfy constraints implicitly or explicitly. However, we observed during Step 3 of Algorithm 1 that, for fewer constraints, most of the time the constraint violation is zero in the intermediate clusters, which is often reflected in the final partitions. As the number of constraints increases, the number of constraint violations also increases. However, on average, violations are roughly between 10%-15% for must-link constraints, 20%-25% for cannot-link constraints, and about 15%-20% for must-links and cannot-links combined. We also observed that, a few times, the constraint violations are reduced after Step 4, i.e., after the final agglomerative clustering. Thus, we can conclude that the effect of constraints is significant in Step 3, and we re-iterate that embedding constraints earlier is always better for CC.

[Figure 1 plots: Rand Index (y-axis) vs. number of ML constraints (x-axis) for CC(fixed), CC(p=5), CC(p=15), and MPCK; panels (a)-(d).]

Fig. 1. Different versions of Constrained CHAMELEON (CC) compared with MPCK-means using Rand Index values averaged over 10 runs with ML constraints on different UCI datasets: (a) Ionosphere, (b) Iris, (c) Liver and (d) Sonar

Overall, the different variants of our algorithm CC outperformed MPCK-means. The Iris and Liver datasets are examples where, for all combinations of constraints, the CC results are clearly much better than MPCK-means. On the other two datasets, CC performed nearly the same as MPCK-means. Only in some cases did MPCK-means perform slightly better than the CC variants, as shown in Figs. 1(a), 2(a) and 2(d). Even in these particular scenarios, at least one variant of CC outperformed (or nearly matched) the result of MPCK-means. Surprisingly, CC(fixed) was only slightly better or worse compared to the other variants of CC. A direct comparison of CC(fixed) with MPCK-means reveals that only in two cases (Figs. 2(a) and 2(d)) did MPCK-means outperform CC(fixed); in the rest of the scenarios, CC(fixed) performed better. The primary reason for the wavering performance on the Ionosphere and Sonar datasets could be attributed to the large number of attributes in these datasets (Table 1). Due to the curse of dimensionality, the distance function loses its meaning, directly affecting the nearest neighbors. Adding constraints does provide some contrast for grouping similar objects, but the overall discernibility is still low. It is important to note that we did not search or tune for optimal values of (η, λ) for any particular dataset. During our initial investigation, we found that, for some changes in values, the results improved. We did some experiments on the Iris dataset and were able to achieve an average Rand Index value of 0.99, and quite often achieved perfect clustering

[Figure 2 plots: Rand Index (y-axis) vs. number of CL constraints (x-axis) for CC(fixed), CC(p=5), CC(p=15), and MPCK; panels (a)-(d).]

Fig. 2. Different versions of Constrained CHAMELEON (CC) compared with MPCK-means using Rand Index values averaged over 10 runs with CL constraints on different UCI datasets: (a) Ionosphere, (b) Iris, (c) Liver and (d) Sonar

(Rand Index = 1) during some of the runs for 190 constraints, with the same settings as used in all the other experiments shown. However, based on only some initial results, it would be too early to conclude that finding tuned values for (η, λ) will always increase performance; this will need further experimental evidence.
Based on our findings, we observed that changing the values of (η, λ) did sometimes increase performance, but not consistently, and it can also sometimes lead to a decrease in performance. We were also surprised by this phenomenon, demonstrated by both algorithms. In our case, carrying out more experiments with additional constraints revealed that this decrease in performance holds only up to a particular number of constraints. After that, we again see a rise in performance, and with a sufficient number of constraints (1% to 5% of constraints in our case with these datasets), we are able to recover the original clustering or come close to it (Rand Index close to 1.0). CC(fixed), compared to the other variants of CC, was only slightly different on average. CC(fixed) performed reasonably well against MPCK-means across all the datasets on nearly all settings. The other variants of CC were also better on average compared to MPCK-means. Thus, our algorithm performed better than MPCK-means in terms of handling the decrease in performance as the number of constraints increases. Most importantly, our algorithm performed well despite not trying to satisfy constraints implicitly or explicitly.

[Figure 3 plots: Rand Index (y-axis) vs. number of ML and CL constraints (x-axis) for CC(fixed), CC(p=5), CC(p=15), and MPCK; panels (a)-(d).]

Fig. 3. Different versions of Constrained CHAMELEON (CC) compared with MPCK-means using Rand Index values averaged over 10 runs with ML and CL constraints on different UCI datasets: (a) Ionosphere, (b) Iris, (c) Liver and (d) Sonar

6 Conclusion
In this work, we presented a novel constrained graph-based clustering method based on the CHAMELEON algorithm. We proposed a new framework for embedding constraints into the graph-based clustering algorithm to obtain promising results. Specifically, we thoroughly investigated the "how and when to add constraints" aspect of the problem. We also proposed a novel distance limit criterion for embedding constraints into the distance function. Our algorithm outperformed the popular MPCK-means method on several real-world datasets under various constraint settings.

References
1. Basu, S., Banerjee, A., Mooney, R.J.: Semi-supervised clustering by seeding. In: Proceedings
of the Nineteenth International Conference on Machine Learning (ICML 2002), pp. 27–34
(2002)
2. Basu, S., Bilenko, M., Mooney, R.J.: A probabilistic framework for semi-supervised cluster-
ing. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 59–68 (2004)

3. Basu, S., Davidson, I., Wagstaff, K.: Constrained Clustering: Advances in Algorithms, The-
ory, and Applications. Chapman & Hall/CRC (2008)
4. Bilenko, M., Basu, S., Mooney, R.J.: Integrating constraints and metric learning in semi-
supervised clustering. In: Proceedings of the Twenty-first International Conference on Ma-
chine Learning, ICML 2004 (2004)
5. Davidson, I., Ravi, S.S.: Agglomerative Hierarchical Clustering with Constraints: Theoreti-
cal and Empirical Results. In: Jorge, A.M., Torgo, L., Brazdil, P.B., Camacho, R., Gama, J.
(eds.) PKDD 2005. LNCS (LNAI), vol. 3721, pp. 59–70. Springer, Heidelberg (2005)
6. Davidson, I., Ravi, S.S., Shamis, L.: A sat-based framework for efficient constrained cluster-
ing. In: Jonker, W., Petković, M. (eds.) SDM 2010. LNCS, vol. 6358, pp. 94–105. Springer,
Heidelberg (2010)
7. Davidson, I., Wagstaff, K.L., Basu, S.: Measuring Constraint-Set Utility for Partitional Clus-
tering Algorithms. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) PKDD 2006. LNCS
(LNAI), vol. 4213, pp. 115–126. Springer, Heidelberg (2006)
8. Frank, A., Asuncion, A.: UCI machine learning repository (2010),
http://archive.ics.uci.edu/ml
9. Gunopulos, D., Vazirgiannis, M., Halkidi, M.: From unsupervised to semi-supervised learn-
ing: Algorithms and evaluation approaches. In: SIAM International Conference on Data Min-
ing: Tutorial (2006)
10. Halkidi, M., Gunopulos, D., Kumar, N., Vazirgiannis, M., Domeniconi, C.: A framework for
semi-supervised learning based on subjective and objective clustering criteria. In: Proceed-
ings of the 5th IEEE International Conference on Data Mining (ICDM 2005), pp. 637–640
(2005)
11. Karypis, G., Han, E.-H., Kumar, V.: Chameleon: Hierarchical clustering using dynamic mod-
eling. IEEE Computer 32(8), 68–75 (1999)
12. Karypis, G., Kumar, V.: Metis 4.0: Unstructured graph partitioning and sparse matrix order-
ing system. Tech. Report, Dept. of Computer Science, Univ. of Minnesota (1998)
13. Klein, D., Kamvar, S.D., Manning, C.D.: From instance-level constraints to space-level con-
straints: Making the most of prior knowledge in data clustering. In: Proceedings of the Nine-
teenth International Conference on Machine Learning (ICML 2002), pp. 307–314 (2002)
14. Kulis, B., Basu, S., Dhillon, I.S., Mooney, R.J.: Semi-supervised graph clustering: a ker-
nel approach. In: Proceedings of the Twenty-Second International Conference on Machine
Learning (ICML 2005), pp. 457–464 (2005)
15. Lelis, L., Sander, J.: Semi-supervised density-based clustering. In: Perner, P. (ed.) ICDM
2009. LNCS, vol. 5633, pp. 842–847. Springer, Heidelberg (2009)
16. Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the Amer-
ican Statistical Association 66(336), 846–850 (1971)
17. Ruiz, C., Spiliopoulou, M., Menasalvas, E.: Density based semi-supervised clustering. Data
Mining and Knowledge Discovery 21(3), 345–370 (2009)
18. Tan, P.-N., Steinbach, M., Kumar, V.: Introduction to Data Mining, US edition. Addison
Wesley, Reading (2005)
19. Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S.: Constrained k-means clustering with back-
ground knowledge. In: Proceedings of the Eighteenth International Conference on Machine
Learning (ICML 2001), pp. 577–584 (2001)
A Partial Correlation-Based Bayesian Network
Structure Learning Algorithm under SEM

Jing Yang and Lian Li

Department of Computer Science and Technology,
Hefei University of Technology, Hefei 230009, China
jsjyj0801@163.com
* Corresponding author.

Abstract. A new algorithm, the PCB (Partial Correlation-Based) algorithm, is presented for Bayesian network structure learning. The algorithm combines ideas from local learning with partial correlation techniques in an effective way. It reconstructs the skeleton of a Bayesian network based on partial correlation and then performs a greedy hill-climbing search to orient the edges. Specifically, we make three contributions. First, we give a proof that in a SEM (simultaneous equation model) with uncorrelated errors, when datasets are generated by the SEM, no matter what distribution the disturbances follow, we can use partial correlation as the criterion for CI tests. Second, we have done a series of experiments to find the best threshold value of partial correlation. Finally, we show how partial correlation can be used in Bayesian network structure learning under SEM. The effectiveness of the method is compared with current state-of-the-art methods on 8 networks. Simulation shows that the PCB algorithm outperforms existing algorithms in both accuracy and run time.

Keywords: partial correlation; Bayesian network structure learning; SEM (simultaneous equation model).

1 Introduction
Learning the structure of a Bayesian network from a dataset D is useful; unfortunately, it is an NP-hard problem [2]. Consequently, many heuristic techniques have been proposed. One of the most basic search algorithms is a local greedy hill-climbing search over all DAG structures. The size of the search space of greedy search is super-exponential in the number of variables. One class of approaches places constraints on the search to improve its efficiency, such as the K2 algorithm [3], the SC algorithm [4], the MMHC algorithm [15], and the L1MB algorithm [7].
One drawback of the K2 algorithm is that it requires a total variable ordering. The SC algorithm first introduced the local learning idea and proposed a two-phase framework consisting of a Restrict step and a Search step. In the Restrict step, the SC algorithm uses mutual information to find a set of potential neighbors for each node and achieves fast learning by restricting the search space. One drawback of the SC algorithm is that it only allows a variable to have at most k parents. However, a common parameter k for all nodes has to sacrifice either efficiency or quality of reconstruction [15]. The MMHC algorithm uses the max-min parents-children (MMPC) algorithm to identify a set of potential neighbors [15]. Experiments show that the MMHC algorithm has quite high accuracy; one drawback is that it needs conditional independence tests on exponentially large conditioning sets. The L1MB algorithm introduces L1 techniques to learn the DAG structure and uses the LARS algorithm to find a set of potential neighbors [7]. The L1MB algorithm has good time performance. However, the L1MB algorithm describes the correlation between a set of variables and a variable, not the correlation between two variables. Experiments show that the L1MB algorithm has low accuracy.
In fact, many algorithms, such as K2, SC, PC [13], TPDA [1], and MMHC, can be implemented efficiently for discrete variables, but are not straightforwardly applicable to continuous variables. The L1MB algorithm has been designed for continuous variables; however, its accuracy is not very high.
The partial correlation method can reveal the true correlation between two variables by eliminating the influences of other correlative variables [16]. It has been successfully applied to many fields such as medicine [8], economics [14], and geology [16]. In causal discovery, it has been used (transformed by Fisher's z [12]) as a continuous replacement for CI tests in the PC algorithm. Pellet et al. introduced a partial-correlation-based CI test into causal discovery with the assumption that the data follow a multivariate Gaussian distribution for continuous variables [9]. However, when the data do not follow a multivariate Gaussian distribution, can partial correlation still serve as a CI test?
Our first contribution is that we give a proof that partial correlation can be used as the criterion for CI tests under a linear simultaneous equation model (SEM), which includes the multivariate Gaussian distribution as a special case. Our second contribution is that we propose an effective algorithm, called PCB (Partial Correlation-Based), which combines ideas from local learning with partial correlation techniques in an effective way. The PCB algorithm works in the continuous variable setting under the assumption that data are generated by a SEM. The computational complexity of PCB is O(3mn² + n³) (n is the number of variables and m is the number of cases). The advantages of PCB are that it has quite good time performance and quite high accuracy; its time complexity is polynomially bounded by the number of variables. A third advantage of the PCB algorithm is that it uses a relevance threshold to evaluate the correlation, alleviating the drawback of the SC algorithm (a common parameter k for all nodes), and we also find the best relevance threshold through a series of extensive experiments. Empirical results show that PCB outperforms the above existing algorithms in both accuracy and time performance.
The remainder of the paper is structured as follows. In Section 2, we present the PCB algorithm and give a computational complexity analysis. Some empirical results are shown and discussed in Section 3. Finally, we conclude our work and address some issues for future work in Section 4.

2 PCB Algorithm

The PCB (Partial Correlation-Based) algorithm includes two steps: the Restrict step and the Search step.

2.1 Restrict Step

The Restrict step is analogous to the pruning step of the SC algorithm, the MMHC algorithm, and the L1MB algorithm. In this paper, partial correlation is used to identify the candidate neighbors. To a certain extent, there is a correlation between every two variables, but this correlation is affected by the other correlative variables. The simple correlation method does not consider these influences, so it cannot reveal the true correlation between two variables. Partial correlation can eliminate the influences of other correlative variables and reveal the true correlation between two variables. A larger magnitude of the partial correlation coefficient means a closer correlation [Xu et al., 2007]. So partial correlation is used to select the potential neighbors. Before we give our algorithm, we give some definitions and theorems.
Definition 2.1 [9] (SEM). A SEM (structural equation model) is a set of equations describing the value of each variable Xi in X as a function f_{Xi} of its parents Pa(Xi) and a random disturbance term u_{Xi}:

x_i = f_{X_i}(Pa(X_i), u_{X_i})    (1)

In our paper, without loss of generality, we take the function to be linear, so we multiply a weight vector W_{Xi} with the parent vector Pa(Xi), one weight for each parent. Here W_{Xi} and Pa(Xi) are vectors, and W^T_{Xi} is the transpose of W_{Xi}. We can thus obtain equation (2):

x_i = W^T_{X_i} Pa(X_i) + u_{X_i}    (2)

Equation (2) is a special case of the general SEM described by equation (1). The disturbances u_{Xi} are continuous random variables. In particular, when all u_{Xi} are Gaussian random variables, X follows a multivariate Gaussian distribution, and partial correlation is then a valid CI measure [9]. However, we want to deal with a more general case, where the u_{Xi} are continuous but come from an arbitrary distribution. Can partial correlation still be a valid CI measure?
Definition 2.2[9] (Conditional independence). In a variable set X, two
random variables Xi , Xj ∈ X are conditionally independent given Z ⊆ X \
{Xi , Xj }, if and only if P (Xi |Xj , Z) = P (Xi |Z) , denoted as Ind(Xi , Xj |Z) .
Definition 2.3 [9] (d-separation). In a DAG G, two nodes Xi, Xj are d-separated by Z ⊆ X \ {Xi, Xj} if and only if every path from Xi to Xj is blocked by Z, denoted as Dsep(Xi, Xj |Z). A path is blocked if it contains at least one diverging or serially connected node that is in Z, or at least one converging node such that neither it nor any of its descendants is in Z. If Xi and Xj are not d-separated by Z, they are d-connected, denoted as Dcon(Xi, Xj |Z).
Theorem 2.1 [12]. In a SEM with uncorrelated errors (that is, for any two random variables Xi, Xj ∈ X, u_{Xi} and u_{Xj} are uncorrelated), and Z ⊆ X \ {Xi, Xj}, the partial correlation ρ(Xi, Xj |Z) is entailed to be zero if and only if Xi and Xj are d-separated given Z.
Definition 2.4 [10] (Perfect map). If the Causal Markov and Faithfulness conditions hold together, a DAG G is a directed perfect map of the joint probability distribution P(X), and there is a bijection between d-separation in G and conditional independence in P:

∀Xi, Xj ∈ X, ∀Z ⊆ X \ {Xi, Xj} : Ind(Xi, Xj |Z) ⇔ Dsep(Xi, Xj |Z)    (3)
Definition 2.5 [5] (Linear correlation). In a variable set X, the linear correlation coefficient γ_{X_iX_j} between two random variables Xi, Xj ∈ X, which provides the most commonly used measure of the strength of the linear relationship between Xi and Xj, is defined by

γ_{X_iX_j} = σ_{X_iX_j} / (σ_{X_i} σ_{X_j})    (4)

where σ_{X_iX_j} denotes the covariance between Xi and Xj, and σ_{X_i} and σ_{X_j} denote the standard deviations of Xi and Xj, respectively. γ_{X_iX_j} is estimated by

\hat{\gamma}_{X_iX_j} = \frac{\sum_{k=1}^{m}(x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{m}(x_{ki} - \bar{x}_i)^2}\sqrt{\sum_{k=1}^{m}(x_{kj} - \bar{x}_j)^2}}    (5)

Here, m is the number of instances. xki means k-th realization (or case) of Xi ,
and x̄i is the mean of Xi . xkj means k-th case of Xj , and x̄j is the mean of Xj .
Definition 2.6 [10] (Partial correlation). In a variable set X, the partial correlation between two random variables Xi, Xj ∈ X given Z ⊆ X \ {Xi, Xj}, denoted ρ(Xi, Xj |Z), is the correlation of the residuals R_{Xi} and R_{Xj} resulting from the least-squares linear regression of Xi on Z and of Xj on Z, respectively. Partial correlation can be computed efficiently, without having to solve the regression problems, by inverting the correlation matrix R of X. With R^{-1} = (r^{ij}), where R^{-1} is the inverse matrix of R, we have:

ρ(Xi, Xj |X \ {Xi, Xj}) = -r^{ij} / \sqrt{r^{ii} r^{jj}}    (6)

In this case, we can compute all partial correlations with a single matrix inversion. This is the approach we use in our algorithm.
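The single-matrix-inversion computation of Eq. (6) can be sketched in a few lines of Python/NumPy (our own illustration, not the authors' code; the function name is hypothetical and the data layout of one column per variable is assumed as in the paper).

import numpy as np

def partial_correlations(data):
    # data: (m, n) array with one column per variable
    R = np.corrcoef(data, rowvar=False)   # correlation matrix of X, cf. Eq. (5)
    P = np.linalg.inv(R)                  # R^{-1} = (r^{ij})
    d = np.sqrt(np.diag(P))
    pcorr = -P / np.outer(d, d)           # Eq. (6): rho(Xi, Xj | rest) = -r^{ij} / sqrt(r^{ii} r^{jj})
    np.fill_diagonal(pcorr, 1.0)
    return pcorr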
Theorem 2.2. In a SEM with uncorrelated errors, when the data are generated by the SEM, no matter what distribution the disturbances follow, we can use partial correlation as the criterion for CI tests.
Proof: From Theorem 2.1, ∀Xi, Xj ∈ X, ∀Z ⊆ X \ {Xi, Xj}, the partial correlation ρ(Xi, Xj |Z) is entailed to be zero if and only if Xi and Xj are d-separated given Z. From Definition 2.4, there is a bijection between d-separation in G and conditional independence in P, Ind(Xi, Xj |Z) ⇔ Dsep(Xi, Xj |Z); thus the partial correlation ρ(Xi, Xj |Z) is entailed to be zero if and only if Xi and Xj are conditionally independent given Z. So we can use partial correlation as the criterion for CI tests in a SEM with uncorrelated errors.
Definition 2.7 (Strong relevance). ∀Xi, Xj ∈ X, ∀Z ⊆ X \ {Xi, Xj}, Xi is strongly relevant to Xj if the partial correlation ρ(Xi, Xj |Z) >= threshold.
Definition 2.8 (Weak relevance). ∀Xi, Xj ∈ X, ∀Z ⊆ X \ {Xi, Xj}, Xi is weakly relevant to Xj if the partial correlation ρ(Xi, Xj |Z) < threshold.

The outline of the Restrict step is shown in Fig. 1. The input of the step is a threshold k and a dataset D = {x1, ..., xm} of instances of X, where each xi is a complete assignment to the variables X1, ..., Xn in Val(X1, ..., Xn). Each column of the dataset represents one variable. The output of the step is a set of potential neighbors PN(Xj) for each Xj and the matrix of potential neighbors PNM. If PNM(i, j) is 1, Xi is Xj's potential neighbor; otherwise, if PNM(i, j) is 0, Xi is not Xj's potential neighbor. Initially, PN(Xj) (the potential neighbors of each variable Xj) is empty and all elements of PNM are set to 0 (step 1). Then we select a set of potential neighbors for each variable and obtain the final matrix of potential neighbors (steps 2 to 9). For each pair of variables Xi and Xj (Xi, Xj ∈ X, j = 1 to n, i = 1, ..., j, i ≠ j), with Z = X \ {Xi, Xj}, we calculate ρ(Xi, Xj |Z); if the absolute value of ρ(Xi, Xj |Z) is greater than k, we choose Xi as Xj's potential neighbor and set PNM(i, j) to 1, otherwise we set PNM(i, j) to 0. In fact, ρ(Xi, Xj |Z) (i < j) equals ρ(Xj, Xi |Z); however, if there is strong correlation between them, we only set PNM(i, j) to 1. PNM is thus an upper triangular matrix with zeros on the diagonal. Because the Search step includes a reverse-edge operator, the greedy hill-climbing search can orient the edges properly.
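The loop of Fig. 1 then reduces to a simple threshold test on the partial correlation matrix. The sketch below (our own illustration, reusing the partial_correlations helper from the earlier sketch) returns the potential neighbor sets and the upper-triangular matrix PNM.

import numpy as np

def restrict_step(data, k=0.1):
    pcorr = partial_correlations(data)    # see the earlier sketch for Eq. (6)
    n = pcorr.shape[1]
    pnm = np.zeros((n, n), dtype=int)     # upper-triangular potential neighbor matrix
    pn = {j: [] for j in range(n)}
    for j in range(n):
        for i in range(j):                # i < j, Z = X \ {Xi, Xj}
            if abs(pcorr[i, j]) >= k:
                pnm[i, j] = 1
                pn[j].append(i)
    return pn, pnm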

2.2 Search Step


After the Restrict step, we have found the candidate neighbors of each variable, and we then perform a greedy hill-climbing search. We assume that we have fully observed (complete) continuous data, and our goal is to find the DAG G that minimizes the MDL cost, defined as

MDL(G) = \sum_{i=1}^{n} \left( NLL(i, Pa(X_i), \hat{\theta}_i^{mle}) + \frac{|\hat{\theta}_i^{mle}|}{2} \log m \right)    (7)

NLL(i, Pa(X_i), \theta) = -\sum_{j=1}^{m} \log P(X_{j,i} \mid X_{j,Pa(X_i)}, \theta)    (8)

This formulation is the one used in [7], where m is the number of data cases, n is the number of variables, Pa(Xi) are the parents of node i in G, NLL(i, Pa(Xi), θ) is the negative log-likelihood of node i with parents Pa(Xi) and parameters θ,

Input: a dataset D = {x1, ..., xm}, threshold k
Output: a set of potential neighbors PN(Xj) of each variable Xj and the potential neighbor matrix PNM
1. PN(Xj) = ∅ (Xj ∈ X, j = 1 to n), PNM(i, j) = 0 (i = 1 to n, j = 1 to n)
2. for Xj ∈ X, j = 1 to n, do
3.   for Xi ∈ X, i = 1 to j, i ≠ j, Z = X \ {Xi, Xj}, do
4.     Calculate the partial correlation ρ(Xi, Xj |Z)
5.     if abs(ρ(Xi, Xj |Z)) >= k then PN(Xj) = PN(Xj) ∪ {Xi}, PNM(i, j) = 1
6.     else PNM(i, j) = 0
7.   end for
8. end for
9. return PN and PNM

Fig. 1. Outline of the Restrict step

and \hat{\theta}_i^{mle} = \arg\min_\theta NLL(i, Pa(X_i), \theta) is the maximum likelihood estimate of node i's parameters. The term |\hat{\theta}_i| is the number of free parameters in the CPD (conditional probability distribution) for node i. For linear regression, |\hat{\theta}_i| = |Pa(X_i)|, the number of parents.
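For intuition, the per-node term of Eqs. (7)-(8) for a linear-Gaussian CPD can be sketched as follows. This is our own hedged illustration: the Gaussian noise model with MLE variance is an assumption, and the L1MB code that the paper follows may differ in detail.

import numpy as np

def node_mdl(data, i, parents):
    # data: (m, n) array; i: node index; parents: list of parent column indices
    m = data.shape[0]
    y = data[:, i]
    if parents:
        X = data[:, parents]
        w, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares (MLE) weights
        resid = y - X @ w
    else:
        resid = y - y.mean()
    var = max(resid.var(), 1e-12)                     # MLE noise variance
    nll = 0.5 * m * (np.log(2 * np.pi * var) + 1.0)   # negative log-likelihood, Eq. (8)
    num_params = len(parents)                         # |theta_i| = |Pa(Xi)| as stated above
    return nll + 0.5 * num_params * np.log(m)         # per-node contribution to Eq. (7)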
The Search step performs a greedy hill-climbing search to obtain the final DAG. We follow the L1MB implementation (also to allow for a fair comparison). The search begins with an empty graph. The basic heuristic search procedure we use is a greedy hill-climbing that considers local moves in the form of edge addition, edge deletion, and edge reversal. At each iteration, the procedure examines the change in the score for each possible move, and applies the one that leads to the biggest decrease in the MDL score. These iterations are repeated until convergence. The important difference from standard greedy search is that the search is constrained to only consider adding an edge if it was discovered by PCB in the Restrict step. The remove-edge operator can only be used to remove edges that have actually been added to the graph. If the orientation of some edge is not right and reversing the edge leads to a decrease in the MDL score, the reverse-edge operator is used to reverse it. We terminate the procedure after some fixed number of changes has failed to result in an improvement over the best score so far. After termination, the procedure returns the best scoring structure it encountered.

2.3 Time Complexity of PCB Algorithm


A dataset with n variables and m cases is considered. For comparison with L1MB, we only consider the time complexity of the Restrict step, which is O(3mn² + n³).
The computations of the PCB algorithm mainly consist of calculating the correlation coefficient matrix R and calculating the inverse matrix of R. Multiplying two (n × n) matrices needs n² vector inner products, and the computational complexity of an inner product of vectors of length n is O(n), so the computational complexity of matrix multiplication is at most O(n³). From Definition 2.5, we know that calculating the correlation coefficient of two variables needs 3 vector inner products, and the correlation coefficient matrix has n² elements, so calculating the correlation coefficient matrix requires 3n² inner products; for m cases, the computational complexity of an inner product of vectors of length m is O(m), and thus the computational complexity of calculating the correlation coefficient matrix R is O(3mn²). Since computing the inverse of a matrix has the same complexity as matrix multiplication, the computational complexity of calculating the inverse of R (n × n) is at most O(n³). We conclude that the total time complexity of the Restrict step is O(3mn² + n³).

3 Experimental Results

3.1 Networks, Datasets and Measures of Performance

The experiments were conducted on a computer with Windows XP, an Intel(R) 2.6 GHz CPU and 2 GB memory. Altogether, 8 networks are selected from the Bayes net repository (BNR, http://www.cs.huji.ac.il/labs/compbio/Repository), except the factors network, which is synthetic. The networks, with their numbers of nodes and edges, are as follows: 1. alarm (37/46), 2. barley (48/84), 3. carpo (61/74), 4. factors (27/68), 5. hailfinder (56/66), 6. insurance (27/52), 7. mildew (35/46), 8. water (32/66).
The datasets used in our experiments are generated by SEMs. We adopt the following two kinds of SEMs:

(1) x_i = W^T_{X_i} Pa(X_i) + N(0, 1)        (2) x_i = W^T_{X_i} Pa(X_i) + rand(0, 1)

The weights are generated randomly, typically distributed uniformly [9] or normally [7]; we sampled the weights from ±1 + N(0, 1)/4. Datasets sampled from SEM (1) follow a multivariate Gaussian distribution and are continuous. Datasets sampled from SEM (2) do not follow a multivariate Gaussian distribution and are also continuous.
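The data-generation process just described can be sketched as follows (our own illustration, not the authors' generator; sample_sem is a hypothetical name, and the adjacency matrix is assumed to list nodes in topological order).

import numpy as np

def sample_sem(adj, m, gaussian=True, rng=np.random.default_rng(0)):
    # adj: (n, n) 0/1 matrix, adj[p, c] = 1 if p is a parent of c; nodes in topological order
    n = adj.shape[0]
    signs = rng.choice([-1.0, 1.0], size=(n, n))
    W = adj * (signs + rng.normal(size=(n, n)) / 4.0)   # weights from +/-1 + N(0,1)/4
    X = np.zeros((m, n))
    for c in range(n):
        noise = rng.normal(size=m) if gaussian else rng.uniform(size=m)  # SEM (1) vs. SEM (2)
        X[:, c] = X @ W[:, c] + noise                    # x_c = W^T Pa(X_c) + u_c
    return X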
We employ two metrics to compare the algorithms: run time and structural errors. Structural errors include all types of errors: missing edges, extra edges, missing orientations and wrong orientations. The number of structural errors is the number of incorrect edges in the estimated model compared to the true model [7].

3.2 Experimental Results and Analyses

We first evaluate the performance of the PCB algorithm under the above two cases, with different sample sizes, thresholds and networks. Fig. 2 shows the results under SEM (2); the X axis denotes the networks and the Y axis denotes the number of structural errors. The results for SEM (1) are omitted due to space limitations. From Fig. 2, we can see that the threshold has a great effect on the performance of the PCB algorithm. The results for the different SEMs are similar. When the dataset size is
[Figure 2 plot: structural errors of the PCB algorithm under SEM(2) for different sample sizes, thresholds, and networks.]

Fig. 2. Structural errors of the PCB algorithm. The X axis denotes the networks: 1. alarm, 2. barley, 3. carpo, 4. factors, 5. hailfinder, 6. insurance, 7. mildew, 8. water. The Y axis denotes the number of structural errors. The PCB algorithm has been tested with different sample sizes (1000, 5000, 10000, 20000, 100000), thresholds (0, 0.1, 0.3, m, where m is the mean) and networks (the above 8 networks).

small (1000, 5000), PCB (0.1) has the fewest structural errors on average; as the dataset size gets larger, PCB(m) and PCB(0.1) have similar performance. So when the threshold is 0.1, the PCB algorithm achieves the best performance and has the fewest structural errors on average on almost all the networks. Zero partial correlation is not the best choice for the CI test: zero partial correlation means independence, whereas relevance comes in different degrees, such as strong relevance and weak relevance. The threshold is hard to select and may depend on the adopted networks. We have done a series of extensive experiments and found the best threshold on average.
The second experiment is to compare existing structure learning methods with the PCB algorithm. We adopt DAG, SC (5), SC (10), L1MB, PC (0.05), TPDA (0.05) and PCB (0.1). PCB (0.1) means running DAG-Search after PCB pruning, DAG means running DAG-Search without pruning, SC (5) and SC (10) mean running DAG-Search after SC pruning (where we set the fan-in bound to 5 and 10, respectively), and L1MB means running DAG-Search after L1MB pruning. For DAG, SC (5), SC (10) and L1MB, we use Murphy's DAGsearch implementation in the DAGLearn software (http://people.cs.ubc.ca/~murphyk/). For PC (0.05) and TPDA (0.05), we used "causal explorer" (http://www.dsl-lab.org/).
Fig. 3 shows the structural errors and time performance of the seven algorithms on the above networks under SEM (2). The results for SEM (1) are omitted due to space limitations. We give detailed analyses as follows.
(1) PCB (0.1) vs. the DAG algorithm. DAG has worse performance on all the networks. The PCB algorithm achieves higher accuracy on all the networks under all

[Figure 3 plots: (a) structural errors and (b) run time of the seven algorithms under SEM(2), for different sample sizes and networks.]

Fig. 3. Structural errors and run times under SEM(2). Under SEM(2), with different sample sizes (1000, 5000, 10000, 20000, 100000) and networks (1. alarm, 2. barley, 3. carpo, 4. factors, 5. hailfinder, 6. insurance, 7. mildew, 8. water), the 7 algorithms (DAG, SC(5), SC(10), L1MB, PCB(0.1), PC(0.05), TPDA(0.05)) have been tested. (a) shows the structural errors and (b) shows the run times.

the SEMs. For time performance, PCB (0.1) wins 5, ties 2, and loses 1 under SEM (1), and wins 5, ties 3, and loses 0 under SEM (2). Under the two SEMs, the results are similar. For DAG, the potential neighbors of each variable are all the other variables. In the Search step, because we set the maximum number of iterations to 2500, which may be too small, the Search step may terminate before finding the best DAG, so there are more structural errors. Without the pruning step, the time performance of the DAG algorithm is also worse than that of PCB (0.1), for the following reason: the Search step examines the change in the score for each possible move, and without pruning, the number of potential neighbors for each variable is large, so the number of moves under consideration is also large and the cost of the Search step is higher.
(2) PCB (0.1) vs. SC(5) and SC(10). The PCB algorithm achieves both better time performance and higher accuracy on almost all the networks under all the SEMs. The SC algorithm needs to specify the maximum fan-in in advance; however, some nodes in the true structure may have much higher connectivity than others, so a common parameter for all nodes is not reasonable. In addition, the SC algorithm in the DAGLearn software selects the top k (maximum fan-in) candidate neighbors based on the correlation coefficient and does not consider the symmetry of the correlation coefficient; this leads to redundant potential-neighbor information and sacrifices either efficiency or performance. The PCB algorithm does not have the above problems. From Section 2.3, we know that the computational complexity of calculating the correlation coefficient matrix is O(3mn²); in order to select the top k candidate neighbors, we must sort each row of the correlation coefficient matrix, whose computational complexity is O(n³), so the total complexity is O(3mn² + n³), which is equal to that of the PCB Restrict step (O(3mn² + n³)). However, the total time performance of the SC algorithm is worse than that of PCB (0.1): due to the unreasonable selection of potential neighbors and the redundant potential-neighbor information, the cost of the search step is increased. So the SC algorithm has worse time performance and accuracy.
(3) PCB (0.1) vs. L1MB. The PCB algorithm achieves both better time performance and higher accuracy on all the networks under all the SEMs. The L1MB algorithm adopts the LARS algorithm to select potential neighbors. For a variable, L1MB selects the set of variables that has the best predictive accuracy as a whole; that is, L1MB evaluates the effect of a set of variables, not of a single variable. Using this method to select potential neighbors has some shortcomings: it can describe the correlation between a set of variables and a variable, but not the correlation between two variables. There may exist variables that do not belong to the selected set of potential neighbors but have strong relevance to the target variable. The partial correlation method, in contrast, can reveal the true correlation between two variables by eliminating the influences of other correlative variables. The PCB algorithm selects potential neighbors based on partial correlation and evaluates the effect of a single variable. So the PCB algorithm is more reasonable, and the experimental results also indicate that it has fewer structural errors.
PCB (0.1) also has better time performance than L1MB. From Section 2.3, we know that the time complexity of PCB is O(3mn² + n³) (n is the number of variables and m is the number of cases). For L1MB, the time complexity of computing the L1-regularization path is O(mn²) in the Gaussian case (SEM (1) and SEM (2)) [7]. In addition, L1MB also includes computing the maximum likelihood parameters for all non-zero sets of variables encountered along this path and selecting the set of variables that achieves the best MDL score. So L1MB has worse time performance than PCB (0.1) under all the SEMs.
(4) PCB (0.1) algorithm vs PC (0.05) algorithm. PCB (0.1) achieves both better time
performance and higher accuracy on all the networks under all the SEMs. The PC (0.05)
algorithm was designed for discrete variables, or imposes restrictions on which
variables may be continuous. PC first identifies the skeleton of a Bayesian network
and then orients the edges; however, it may fail to orient some edges, and in our
experiments we count such edges as wrong, so PC has more structural errors. The PC
algorithm needs O(n^(k+2)) CI tests, where k is the maximum degree of any node in the
true structure [13]. The time complexity of a CI test is at least O(m), so the time
complexity of the PC algorithm is O(mn^(k+2)), which is exponential in the worst
case. The time complexity of PCB (0.1) is O(3mn^2 + n^3). Obviously, the PC algorithm
has worse time performance.
(5) PCB (0.1) algorithm vs TPDA (0.05) algorithm. The PCB algorithm achieves both
better time performance and higher accuracy on all the networks. TPDA was designed
for discrete variables, or imposes restrictions on which variables may be continuous,
so the TPDA algorithm has more structural errors. TPDA requires at most O(n^4) CI
tests to discover the edges, and in some special cases only O(n^2) CI tests [1]. The
time complexity of a CI test is at least O(m), so the time complexity of TPDA is
O(mn^4) or O(mn^2). Compared with the PC algorithm, TPDA has better time performance;
however, compared with the PCB algorithm (O(mn^2)), the time complexity of TPDA is
still high. So PCB (0.1) has better time performance than TPDA (0.05).

4 Conclusions and Future Work


The contributions of this paper are two-fold.
(1) We prove that partial correlation can be used as a CI test under SEM, which
includes the multivariate Gaussian distribution as a special case. We redefine strong
relevance and weak relevance, and based on a series of experiments we find the best
relevance threshold.
(2) We propose the PCB algorithm; theoretical analysis and empirical results show
that it performs better than the other existing algorithms in both accuracy and
running time.
In future work, we will seek a way of automatically determining the best threshold
and extend our algorithm to higher-dimensional and larger datasets.

Acknowledgement
The research has been supported by the 973 Program of China under award
2009CB326203 and the National Natural Science Foundation of China under awards 61073193
and 61070131. The authors are very grateful to the anonymous reviewers for their
constructive comments and suggestions that have led to an improved version of
this paper.

References
1. Cheng, J., Greiner, R., Kelly, J., Bell, D.A., Liu, W.: Learning Bayesian networks
from data: An information-theory based approach. Doctoral Dissertation. Depart-
ment of Computing Science, University of Alberta and Faculty of Informatics,
University of Ulster, November 1 (2001)
2. Chickering, D.: Learning Bayesian networks is NP-Complete. In: AI/Stats V (1996)
3. Cooper, G.F., Herskovits, E.: A Bayesian method for the induction of probabilistic
networks from data. Machine Learning 9(4), 309–347 (1992)
4. Friedman, N., Nachman, I., Peer, D.: Learning Bayesian network structure from
massive datasets: The ”sparse candidate” algorithm. In: UAI (1999)
5. Kleijnena, J.P.C., Heltonb, J.C.: Statistical analyses of scatterplots to identify im-
portant factors in largescale simulations, 1: Review and comparison of techniques.
Reliability Engineering and System Safety 65, 147–185 (1999)
6. Lam, W., Bacchus, F.: Learning Bayesian belief networks: An approach based on
the MDL principle. Comp. Int. 10, 269–293 (1994)
7. Schmidt, M., Niculescu-Mizil, A., Murphy, K.: Learning Graphical Model Struc-
ture Using L1-Regularization Paths. In: Proceedings of Association for the Ad-
vancement of Artificial Intelligence (AAAI), pp. 1278–1283 (2007)
8. Ogawa, T., Shimada, M., Ishida, H.: Relation of stiffness parameter b to carotid
arteriosclerosis and silent cerebral infarction in patients on chronic hemodialysis.
Int. Urol. Nephrol. 41, 739–745 (2009)
9. Pellet, J.P., Elisseeff, A.: Partial Correlation and Regression-Based Approaches to
Causal Structure Learning, IBM Research Technical Report (2007)
10. Pellet, J.P., Elisseeff, A.: Using Markov Blankets for Causal Structure Learning.
Journal of Machine Learning Research 9, 1295–1342 (2008)
11. Rissanen, J.: Stochastic complexity. Journal of the Royal Statistical Society, Series
B 49, 223–239 (1987)
12. Scheines, R., Spirtes, P., Glymour, C., Meek, C., Richardson, T.: The tetrad
project: Constraint based aids to causal model specification. Technical report,
Carnegie Mellon University, Dpt. of Philosophy (1995)
13. Spirtes, P., Glymour, C., Scheines, R.: Causation, prediction, and search, 2nd edn.
The MIT Press, Cambridge (2000)
14. Sun, Y., Negishi, M.: Measuring the relationships among university, industry and
other sectors in Japan’s national innovation system: a comparison of new ap-
proaches with mutual information indicators. Scientometrics 82, 677–685 (2010)
15. Tsamardinos, I., Brown, L., Aliferis, C.: The max-min hill-climbing bayesian net-
work structure learning algorithm. Machine Learning 65, 31–78 (2006)
16. Xu, G.R., Wan, W.X., Ning, B.Q.: Applying partial correlation method to analyz-
ing the correlation between ionospheric NmF2 and height of isobaric level in the
lower atmosphere. Chinese Science Bulletin 52(17), 2413–2419 (2007)
Predicting Friendship Links in Social Networks
Using a Topic Modeling Approach

Rohit Parimi and Doina Caragea

Computing and Information Sciences,


Kansas State University, Manhattan, KS, USA 66506
{rohitp,dcaragea}@ksu.edu

Abstract. In recent years, the number of social network users has
increased dramatically. The resulting amount of data associated with
users of social networks has created great opportunities for data mining
problems. One data mining problem of interest for social networks is the
friendship link prediction problem. Intuitively, a friendship link between
two users can be predicted based on their common friends and interests.
However, using user interests directly can be challenging, given the large
number of possible interests. In the past, approaches that make use of
an explicit user interest ontology have been proposed to tackle this prob-
lem, but the construction of the ontology proved to be computationally
expensive and the resulting ontology was not very useful. As an alterna-
tive, we propose a topic modeling approach to the problem of predicting
new friendships based on interests and existing friendships. Specifically,
we use Latent Dirichlet Allocation (LDA) to model user interests and,
thus, we create an implicit interest ontology. We construct features for
the link prediction problem based on the resulting topic distributions.
Experimental results on several LiveJournal data sets of varying sizes
show the usefulness of the LDA features for predicting friendships.

Keywords: Link Mining, Topic Modeling, Social Networks, Learning.

1 Introduction

Social networks such as MySpace, Facebook, Orkut, LiveJournal and Bebo have
attracted millions of users [1], some of these networks growing at a rate of more
than 50 percent during the past year [2]. Recent statistics have suggested that
social networks have overtaken search engines in terms of usage [3]. This shows
how Internet users have integrated social networks into their daily practices.
Many social networks, including LiveJournal online services [4] are focused
on user interactions. Users in LiveJournal can tag other users as their friends.
In addition to tagging friends, users can also specify their demographics and
interests in this social network. We can see LiveJournal as a graph structure with
users (along with their specific information, e.g. user interests) corresponding to
nodes in the graph and edges corresponding to friendship links between the users.
In general, the graph corresponding to a social network is undirected. However,


in LiveJournal, the edges are directed, i.e., if a user ‘A’ specifies another user ‘B’
as a friend, user ‘B’ need not have specified user ‘A’ as a friend in return. One
desirable feature of an online social network is to be able to suggest potential
friends to its users [8]. This task is known as the link prediction problem, where
the goal is to predict the existence of a friendship link from user ‘A’ to user ‘B’.
The large amounts of social network data accumulated in the recent years have
made the link prediction problem possible, although very challenging.
In this work, we aim at using the ability of machine learning algorithms to take
advantage of the content (data from user profiles) and graph structure of social
network sites, e.g., LiveJournal, to predict friendship links. User profiles in such
social networks consist of data that can be processed into useful information.
For example, interests specified by users of LiveJournal act as good indicators
to whether two users can be friends or not. Thus, if two users ‘A’ and ‘B’ have
similar interests, then there is a good chance that they can be friends. However,
the number of interests specified by users can be very large and similar interests
need to be grouped semantically. To achieve this, we use a topic modeling ap-
proach. Topic models provide an easy and efficient way of capturing semantics of
user interests by grouping them into categories, also known as topics, and thus
reducing the dimensionality of the problem. In addition to using user interests,
we also take advantage of the graph structure of the LiveJournal network and
extract graph information (e.g., mutual friends of two users) that is helpful for
predicting friendship links [9]. The contributions of this paper are as follows: (i)
an approach for applying topic modeling techniques, specifically LDA, on user
profile data in a social network; and (ii) experimental results on LiveJournal
datasets showing that a) the best performance results are obtained when in-
formation from interest topic modeling is combined with information from the
network graph of the social network; and b) the performance of the proposed approach
improves as the number of users in the social network increases.
The rest of the paper is organized as follows: We discuss related work in
Section 2. In Section 3, we review topic modeling techniques and Latent Dirichlet
Allocation (LDA). We provide a detailed description of our system’s architecture
in Section 4 and present the experimental design and results in Section 5. We
conclude the paper with a summary and discussion in Section 6.

2 Related Work
Over the past decade, social network sites have attracted many researchers as
sources of interesting data mining problems. Among such problems, the link
prediction problem has received a lot of attention in the social network domain
and also in other graph structured domains.
Hsu et al. [9] have considered the problems of predicting, classifying, and an-
notating friendship relations in a social network, based on the network struc-
ture and user profile data. Their experimental results suggest that features
constructed from the network graph and user profiles of LiveJournal can be
effectively used for predicting friendships. However, the interest features pro-
posed in [9] (specifically, counts of individual interests and the common interests
of two users) do not capture the semantics of the interests. As opposed to that,
in this work, we create an implicit interest ontology to identify the similarity be-
tween interests specified by users and use this information to predict unknown
links.
A framework for modeling link distributions, taking into account object fea-
tures and link features is also proposed in [5]. Link distributions describe the
neighborhood of links around an object and can capture correlations among links.
In this context, the authors have proposed an Iterative Classification Algorithm
(ICA) for link-based classification. This algorithm uses logistic regression models
over both links and content to capture the joint distributions of the links. The
authors have applied this approach on web and citation collections and reported
that using link distribution improved accuracy in both cases.
Taskar et al. [8] have studied the use of a relational Markov network (RMN)
framework for the task of link prediction. The RMN framework is used to define a
joint probabilistic model over the entire link graph, which includes the attributes
of the entities in the network as well as the links. This method is applied to
two relational datasets, one involving university web pages, and the other a
social network. The authors have reported that the RMN approach significantly
improves the accuracy of the classification task as compared to a flat model.
Castillo et al. [7] have also shown the importance of combining features
computed using the content of web documents and features extracted from the
corresponding hyperlink graph, for web spam detection. In their approach, sev-
eral link-based features (such as degree related measures) and various ranking
schemes are used together with content-based features such as corpus precision
and recall, query precision, etc. Experimental results on large public datasets of
web pages have shown that the system was accurate in detecting spam pages.
Caragea et al. [10], [11] have studied the usefulness of a user interest ontology
for predicting friendships, under the assumption that ontologies can provide a
crisp semantic organization of the user information available in social networks.
The authors have proposed several approaches to construct interest ontologies
over interests of LiveJournal users. They have reported that organizing user in-
terests in a hierarchy is indeed helpful for predicting links, but computationally
expensive in terms of both time and memory. Furthermore, the resulting ontolo-
gies are large, making it difficult to use concepts directly to construct features.
With the growth of data on the web, as new articles, web documents, social
networking sites and users are added daily, there is an increased need to ac-
curately process this data for extracting hidden patterns. Topic modeling tech-
niques are generative probabilistic models that have been successfully used to
identify inherent topics in collections of data. They have shown good perfor-
mance when used to predict word associations, or the effects of semantic as-
sociations on a variety of language-processing tasks [12], [13]. Latent Dirichlet
Allocation (LDA) [15] is one such generative probabilistic model used over dis-
crete data such as text corpora. LDA has been applied to many tasks such as
word sense disambiguation [16], named entity recognition [17], tag recommen-
dation [18], community recommendation [19], etc. In this work, we apply LDA
on user profile data with the goal of producing a reduced set of features that
capture user interests and improve the accuracy of the link prediction task in
social networks. To the best of our knowledge, LDA had not been used for this
problem before.

3 Topic Modeling and Latent Dirichlet Allocation (LDA)

Topic models [12], [13] provide a simple way to analyze and organize large vol-
umes of unlabeled text. They express semantic properties of words and docu-
ments in terms of probabilistic topics, which can be seen as latent structures
that capture semantic associations among words/documents in a corpus. Topic
models treat each document in a corpus as a distribution over topics and each
topic as a distribution over words. A topic model, in general, is a generative
model, i.e. it specifies a probabilistic way in which documents can be generated.
One such generative model is Latent Dirichlet Allocation, introduced by Blei
et al. [15]. LDA models a collection of discrete data such as text corpora. Fig-
ure 1 (adapted from [15]) illustrates a simplified graphical model representing
LDA. We assume that the corpus consists of M documents denoted by D =
{d1 , d2 · · · dM }. Each document di in the corpus is defined as a sequence of Ni
words denoted by di = (wi1 , wi2 · · · wiNi ), where each word wij belongs to a vo-
cabulary V . A word in a document di is generated by first choosing a topic zij
according to a multinomial distribution and then choosing a word wij according
to another multinomial distribution, conditioned on the topic zij . Formally, the
generative process of the LDA model can be described as follows [15]:

1. Choose the topic distribution θi ∼ Dirichlet(α).


2. For each of the Ni words wij :
(a) Choose a topic zij ∼ Multinomial(θi).
(b) Choose a word wij from p(wij |zij , β) (multinomial conditioned on zij ).

From Figure 1, we can see that the LDA model has a three level representation.
The parameters α and β are corpus level parameters, in the sense that they
are assumed to be sampled once in the process of generating a corpus. The
variables θi are document-level variables, sampled once per document, and the
variables zij and wij are word-level variables, sampled once for each word in each
document.

Fig. 1. Graphical representation of the LDA model

For the work in this paper, we have used the
LDA implementation available in MALLET, A Machine Learning for Language
Toolkit [20]. MALLET uses Gibbs sampling for parameter estimation.
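To make the generative process concrete, the toy numpy sketch below samples a small
synthetic corpus from an LDA model with symmetric priors; it mirrors only steps 1 and
2 above and is not the Gibbs-sampling inference that MALLET performs (vocabulary
size, topic count and document lengths are arbitrary choices of ours).

import numpy as np

rng = np.random.default_rng(0)
V, K, M = 50, 5, 10              # vocabulary size, number of topics, number of documents
alpha, beta = 0.1, 0.01          # symmetric Dirichlet hyper-parameters

phi = rng.dirichlet(np.full(V, beta), size=K)          # one word distribution per topic

corpus = []
for i in range(M):
    theta_i = rng.dirichlet(np.full(K, alpha))         # step 1: topic mixture of document i
    N_i = rng.poisson(20) + 1                          # document length (illustrative)
    z = rng.choice(K, size=N_i, p=theta_i)             # step 2(a): a topic per word position
    words = [int(rng.choice(V, p=phi[k])) for k in z]  # step 2(b): a word given its topic
    corpus.append(words)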

4 System Architecture
As can be seen in Figure 2, the architecture of the system that we have designed
is divided into two modules. The first module of the system is focused on iden-
tifying and extracting features from the interests expressed by each user of the
LiveJournal. These features are referred to as interest based features. The second
module uses the graph network (formed as a result of users tagging other users
in the network as ‘friends’) to calculate certain features which have been shown
to be helpful at the task of predicting friendship links in LiveJournal [9]. We
call these features, graph based features. We use both types of features as input
to learning algorithms (as shown in Section 5). Sections 4.1 and 4.2 describe in
detail the construction of interest based and graph based features, respectively.

Fig. 2. Architecture of the system used for link prediction

4.1 Interest Based Features


Each user in a social network has a profile that contains information character-
istic to himself or herself. Users most often tend to describe themselves, their
likes, dislikes, interests/hobbies in their profiles. For example, users of LiveJour-
nal can specify their demographics and interests along with tagging other users
of the social network as friends. Data from the user profiles can be processed into
useful information for predicting/recommending potential friends to the users.


In this work, we use a topic modeling technique to capture semantic informa-
tion associated with the user profiles, in particular, with interests of LiveJournal
users. Interests of the users act as good indicators to whether they can be friends
or not. The intuition behind interest based features is that two users ‘A’ and ‘B’
might be friends if ‘A’ and ‘B’ have some similar interests. We try to capture
this intuition through the feature set that we construct using the user interests.
Our goal is to organize interests into “topics”. To do that, we model user in-
terests in LiveJournal using LDA by treating LiveJournal as a document corpus,
with each user in the social network representing a “document”. Thus, interests
specified by each user form the content of the “user document”. We then run
the MALLET implementation of LDA on the collection of such user documents.
LDA allows us to input the number of inherent topics to be identified in the
collection used. In this work, we vary the number of topics from 20 to 200. In
general, the smaller the number of topics, the more abstract will be the inherent
topics identified. Similarly, the larger the number of topics, the more specific the
topics identified will be. Thus, by varying the number of topics, we are implicitly
simulating a hierarchical ontology: a particular number of topics can be seen as a
cut through the ontology. The topic probabilities obtained as a result of modeling
user interests with LDA provide an explicit representation of each user and are
used to construct the interest based features for the friendship prediction task,
as described in what follows: suppose that A [1 · · · n] represents the topic distri-
bution for user ‘A’ and B [1 · · · n] represents the topic distribution for user ‘B’ at
a particular topic level n. The feature vector, F (A, B) for the user pair (A, B) is
constructed as: F (A, B) = (|A [1]−B [1] |, |A [2]−B [2] |, · · · , |A [n]−B [n] |). This
feature vector is meant to capture the intuition that the smaller the difference
between the topic distributions, the more semantically related the interests are.
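A minimal sketch of this feature construction follows; it substitutes gensim's
LdaModel for MALLET purely for illustration (an assumption on our part), and the toy
user documents and topic count are hypothetical.

import numpy as np
from gensim import corpora, models

user_docs = {"A": ["ArtificialNeuralNetworks", "hiking", "jazz"],
             "B": ["MachineLearning", "jazz", "chess"]}     # preprocessed interests per user

dictionary = corpora.Dictionary(user_docs.values())
bows = {u: dictionary.doc2bow(doc) for u, doc in user_docs.items()}

n_topics = 20
lda = models.LdaModel(list(bows.values()), id2word=dictionary,
                      num_topics=n_topics, random_state=0)

def topic_vector(user):
    # Dense topic distribution of one "user document".
    vec = np.zeros(n_topics)
    for k, p in lda.get_document_topics(bows[user], minimum_probability=0.0):
        vec[k] = p
    return vec

def interest_features(a, b):
    # F(A, B) = (|A[1] - B[1]|, ..., |A[n] - B[n]|)
    return np.abs(topic_vector(a) - topic_vector(b))

print(interest_features("A", "B"))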

4.2 Graph Based Features

Previous work by Hsu et al. [9] and Caragea et al. [10], [11], among others, have
shown that the graph structure of the LiveJournal social network acts as a good
source of information for predicting friendship links. In this work, we follow the
method described in [9] to construct graph-based features. For each user pair
(A, B) in the network graph, we calculate in-degree of ‘A’, in-degree of ‘B’, out-
degree of ‘A’, out-degree of ‘B’, mutual friends of ‘A’ and ‘B’, backward deleted
distance from ‘B’ to ‘A’ (see [9] for detailed descriptions of these features).
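Most of these features can be computed directly from the directed friendship graph,
as in the hedged networkx sketch below; the backward deleted distance is omitted
because its precise definition is given only in [9], and "mutual friends" is taken
here as the common out-neighbors, which is one plausible reading.

import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("A", "C"), ("C", "B"), ("D", "A"), ("D", "B")])

def graph_features(g, a, b):
    # Users that both a and b tag as friends (one reading of "mutual friends").
    mutual = len(set(g.successors(a)) & set(g.successors(b)))
    return {"in_deg_a": g.in_degree(a), "in_deg_b": g.in_degree(b),
            "out_deg_a": g.out_degree(a), "out_deg_b": g.out_degree(b),
            "mutual_friends": mutual}

print(graph_features(G, "A", "D"))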

5 Experimental Design and Results

This section describes the dataset used in this work and the experiments de-
signed to evaluate our approach of using LDA for the link prediction task. We
have conducted various experiments with several classifiers to investigate their
performance at predicting friendship links between the users of LiveJournal.

5.1 Dataset Description and Preprocessing


We used three subsets of the LiveJournal dataset with 1000, 5000 and 10,000
users, respectively, to test the performance and scalability of our approach. As
part of the preprocessing step, we clean the interest set to remove symbols, num-
bers, and foreign-language terms. Interests with frequency less than 5 in the dataset are
also removed. Strings of words in a multi-word interest are concatenated into
a single “word,” so that MALLET treats them as a single entity. For example,
the interest ‘artificial neural networks’ is transformed into ’ArtificialNeuralNet-
works’ after preprocessing. Users whose in-degree and out-degree is zero, as well
as users who do not have any interests declared are removed from the dataset.
We are left with 801, 4026 and 8107 users in the three datasets, respectively, and
approximately 14,000, 32,000 and 39,700 interests for each dataset after prepro-
cessing. Furthermore, there are around 4,400, 40,000, 49,700 declared friendship
links in the three datasets. We generate topic distributions for the users in the
dataset using LDA; hyper-parameters (α, β) are set to the default values.
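A rough sketch of this preprocessing, under our own assumptions about the input
format (a dict of raw interest lists per user and a dict of tagged friends per user),
might look as follows; it is not the authors' pipeline.

import re
from collections import Counter

def preprocess(raw_interests, friends):
    # raw_interests: {user: [raw interest strings]}; friends: {user: set of tagged friends}.
    cleaned = {}
    for user, interests in raw_interests.items():
        kept = []
        for it in interests:
            it = re.sub(r"[^A-Za-z ]", "", it).strip()     # drop symbols, numbers, non-Latin text
            if it:
                # 'artificial neural networks' -> 'ArtificialNeuralNetworks'
                kept.append("".join(w.capitalize() for w in it.split()))
        cleaned[user] = kept

    # Remove interests that occur fewer than 5 times in the whole dataset.
    counts = Counter(i for ints in cleaned.values() for i in ints)
    cleaned = {u: [i for i in ints if counts[i] >= 5] for u, ints in cleaned.items()}

    # Remove users with no interests, or with zero in-degree and zero out-degree.
    in_deg = Counter(f for fs in friends.values() for f in fs)
    keep = {u for u in cleaned if cleaned[u] and (friends.get(u) or in_deg[u] > 0)}
    return {u: cleaned[u] for u in keep}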
We make the assumption that the graph is complete, i.e. all declared friendship
links are positive examples and all non declared friendships are negative examples
[10], although this assumption does not hold in the real world. The user network
graph is partitioned into two subsets with 2/3rd of the users in the first set and
1/3rd of the users in the second set (this process is repeated five times for cross-
validation purposes). We used the subset with 2/3rd of the users for training
and the subset with 1/3rd of the users for test. We ensure that the training and
the test datasets are independent by removing the links that go across the two
datasets. We also balance the data in the training set, as the original distribution
is highly skewed towards the negative class.
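Our reading of this evaluation protocol can be sketched as follows; the helper names
are hypothetical and the negative-class balancing is only indicated in a comment.

import random

def split_users(users, seed=0):
    users = list(users)
    random.Random(seed).shuffle(users)
    cut = 2 * len(users) // 3
    return set(users[:cut]), set(users[cut:])       # training users, test users

def link_examples(subset, friends):
    # Complete-graph assumption: every ordered pair inside the subset is an example,
    # positive iff the friendship link is declared; pairs across the two subsets are
    # never generated, which keeps train and test independent.
    return {(a, b): int(b in friends.get(a, set()))
            for a in subset for b in subset if a != b}

# The training examples would then be balanced, e.g. by downsampling the
# (far more numerous) negative pairs before learning.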

5.2 Experiments
The following experiments have been performed in this work.
1. Experiment 1: In the first experiment, we test the performance of several
predictive models trained on interest features constructed from topic distri-
butions. The number of topics to be modeled is varied from 20 to 200. The
1000 user dataset described above is used in this experiment.
2. Experiment 2: In the second experiment, we test several predictive models
that are trained on graph features, for the 1000 user dataset. To be able to
construct the graph features for test data, we assume that a certain per-
centage of links is known [8] (note that this is a realistic assumption, as it
is expected that some friends are already known for each user). Specifically,
we explore scenarios where 10%, 25% and 50% links are known, respectively.
Thus, we construct features for the unknown links using the known links.
3. Experiment 3: In the third experiment, graph based features are used
in combination with interest-based features to see if they can improve the
performance of the models trained with graph features only on the 1000 user
dataset. For the test set graph features constructed by assuming 10%, 25%
and 50% known links, respectively, are combined with interest features.

We repeat the above mentioned experiments for the 5000 user dataset. The
corresponding experiments are referred to as Experiment 4, Experiment 5
and Experiment 6, respectively. For the 10,000 user dataset, we build predictive
models using just interest based features (construction of graph features for the
10,000 user dataset was computationally infeasible, given our resources). This
experiment is referred to as Experiment 7. We use results from Experiments
1, 4 and 7 to study the performance and the scalability of the LDA approach
to link prediction based on interests, as the number of users increases. For all
the experiments, we used WEKA implementations of the Logistic Regression,
Random Forest and Support Vector Machine (SVM) algorithms.

5.3 Results
Importance of the Interest Features for Predicting Friendship Links.
As mentioned above, several experiments have been conducted to test the use-
fulness of the topic modeling approach on user interests for the link prediction
problem in LiveJournal. As expected, interest features (i.e., topic distributions
obtained by modeling user interests) combined with graph features produced the
most accurate models for the prediction task. This can be seen from Tables 1
and 2. In both tables, we can see that interest+graph features with 50% known
links outperform interest or graph features alone in terms of AUC values1 , for
all three classifiers used. Interesting results can be seen in Table 2, where inter-
est features alone are better than graph features alone when only 10% links are
known, and sometimes better also than interest+graph features with 10% links
known, thus, showing the importance of the user profile data, captured by LDA,
for link prediction in social networks. Furthermore, a comparison between our
results and the results presented in [21], which uses an ontology-based approach
to construct interest features, shows that the LDA features are better than the
ontology features on the 1,000 user dataset. As another drawback, the ontology
based approach is not scalable (no more than 4,000 users could be used) [21].
Figure 3 depicts the AUC values obtained using interest, graph and inter-
est+graph features with Logistic Regression and SVM classifiers across all num-
bers of topics modeled for the 1,000 and 5,000 user datasets, respectively. We can
see that the AUC value obtained using interest+graph features is better than
the corresponding value obtained using graph features alone across all numbers
of topics, for all scenarios of known links, in the case of the 5000 user dataset.
This shows that the contribution of interest features increases with the number
of users. Also based on Figure 3, it is worth noting that the graphs do not show
significant variation with the number of topics used.
Performance of the Proposed Approach with the Number of Users.
In addition to studying the importance of the LDA interest features for the link
prediction task, we also study the performance and scalability of the approaches
considered in this work (i.e., graph-based versus LDA interest based, and com-
binations) as the number of users increases. We are interested in both a) the
1
All AUC values reported are averaged over five different train and test datasets.

Table 1. AUC values for Logistic Regression (LR), Random Forests (RF) and Sup-
port Vector Machines (SVM) classifiers with interest, graph and interest+graph based
features for the 1,000 user dataset. k% links are known in the test set, where k is 10,
25 and 50, respectively. The known links are used to construct graph features.

Exp# Features Logistic Regression Random Forest SVM


1 Interest 0.625 ± 0.03 0.5782 ± 0.04 0.6198 ± 0.04
2 (10%) Graph 10% 0.74 ± 0.08 0.578 ± 0.04 0.7738 ± 0.05
3 (10%) Interest+Graph 10% 0.6226 ± 0.05 0.6664 ± 0.04 0.6606 ± 0.02
2 (25%) Graph 25% 0.7684 ± 0.07 0.7106 ± 0.05 0.8104 ± 0.05
3 (25%) Interest+Graph 25% 0.7406 ± 0.04 0.8188 ± 0.03 0.7983 ± 0.03
2 (50%) Graph 50% 0.8526 ± 0.03 0.8008 ± 0.03 0.8692 ± 0.03
3 (50%) Interest+Graph 50% 0.8648 ± 0.03 0.877 ± 0.04 0.8918 ± 0.03

Table 2. AUC values similar to those in Table 1, for the 5,000 user dataset.

Exp# Features Logistic Regression Random Forest SVM


4 Interest 0.6954 ± 0.01 0.6276 ± 0.01 0.7008 ± 0.01
5 (10%) Graph 10% 0.649 ± 0.03 0.5936 ± 0.02 0.692 ± 0.02
6 (10%) Interest+Graph 10% 0.6718 ± 0.02 0.6566 ± 0.01 0.6998 ± 0.01
5 (25%) Graph 25% 0.7022 ± 0.05 0.6716 ± 0.02 0.7896 ± 0.03
6 (25%) Interest+Graph 25% 0.7384 ± 0.03 0.7846 ± 0.03 0.7986 ± 0.03
5 (50%) Graph 50% 0.8456 ± 0.02 0.7086 ± 0.02 0.883 ± 0.02
6 (50%) Interest+Graph 50% 0.8696 ± 0.02 0.8908 ± 0.02 0.9046 ± 0.01

quality of the predictions that we get for the LiveJournal data as the number of
users increases; and b) the time and memory requirements for each approach.
From Figure 4, we can see that the prediction performance (expressed in terms
of AUC values) is improved in the 5,000 user dataset as compared to the 1,000
user dataset, across all numbers of topics modeled. Similarly, the prediction
performance for the 10,000 user dataset is better than the performance for the
5,000 user dataset, for all topics from 20 to 200. One reason for better predictions
with more users in the dataset is that, when we add more users, we also add the
interests specified by the newly added users to the interest set on which topics
are modeled using LDA. Thus, we get better LDA probability estimates for the
topics associated with each user in the dataset, as compared to the estimates that
we had for a smaller set of data, and hence better prediction results. However,
as expected, both the amount of time it takes to compute features for the larger
dataset, as well as the memory required increase with the number of users in the
data set. The amount of time it took to construct features for the 10,000 user
dataset for all numbers of topics modeled in the experiments is around 14 hours
on a system with Intel core 2 duo processor running at 3.16GHz and 20GB of
RAM. This time requirement is due to our complete graph assumption (which
results in feature construction for 10,000*10,000 user pairs in the case of a 10,000
user dataset) and can be relaxed if we relax the completeness assumption. Still
the LDA feature construction is more efficient than the construction of graph
features, which was not possible for the 10,000 user dataset used in our study.

Fig. 3. Graph of reported AUC values versus number of topics used for modeling, using
Logistic Regression and SVM classifiers, for the 1,000 user dataset (top-left and top-
right, respectively) and 5,000 user dataset (bottom-left and bottom-right, respectively)

Fig. 4. AUC values versus number of topics for LR (left) and SVM (right) classifiers
for the 1,000, 5,000 and 10,000 user datasets using interest-based features

6 Summary and Discussion


We have proposed an architecture, which takes advantage of both user profile
data and network structure to predict friendship links in a social network. We
have shown how one can model topics from user profiles in social networks using
LDA. Experimental results suggest that the usefulness of the interest features
constructed using the LDA approach increases with an increase in the number
of users. Furthermore, the results suggest that the LDA based interest features
can help improve the prediction performance when used in combination with
graph features, in the case of the LiveJournal dataset. Although in some cases
the improvement in performance due to interest features is not very significant
compared with the performance when graph features alone are used, the fact that
computation of graph features becomes intractable for 10,000 users or beyond
emphasizes the importance of the LDA based approach.
However, while the proposed approach is effective and shows improvement in
performance as the number of users increases, it also suffers from some limita-
tions. First, adding more users to the dataset increases the memory and time
requirements. Thus, as part of the future work, we plan to take advantage of
the MapReduce framework to support distributed computing for large datasets.
Secondly, our approach considers only a static snapshot of the LiveJournal
social network. Obviously, this assumption does not hold in the real world. Based
on user interactions in the social network, the graph might change rapidly due
to the addition of more users as well as friendship links. Also, users may change
their demographics and interests regularly. Our approach does not take into ac-
count such changes. Hence, the architecture of the proposed approach has to be
changed to accommodate the dynamic nature of a social network. We also spec-
ulate that the approach of modeling user profile data using LDA will be effective
for tasks such as citation recommendation in scientific document networks, iden-
tifying groups in online scientific communities based on their research/tasks and
recommending partners in internet dating, ideas that are left as future work.

References
1. Boyd, M.D., Ellison, B.N.: Social Network Sites: Definition, History, and Scholar-
ship. Journal of Computer-Mediated Communication 13 (2007)
2. comScore Press Release, http://www.comscore.com/Press Events/Press
Releases/2007/07/Social Networking Goes Globa
3. TechCrunch Report, http://eu.techcrunch.com/2010/06/08/report-social-
networks-overtake-search-engines-in-uk-should-google-be-worried
4. Fitzpatrick, B.: LiveJournal: Online Service, http://www.livejournal.com
5. Getoor, L., Lu, Q.: Link-based Classification. In: Twelfth International Conference
on Machine Learning (ICML 2003), Washington DC (2003)
6. Na, J.C., Thet, T.T.: Effectiveness of web search results for genre and sentiment
classification. Journal of Information Science 35(6), 709–726 (2009)
7. Castillo, C., Donato, D., Gionis, A., Murdock, V., Silvestri, F.: Know your Neigh-
bors: Web Spam Detection using the web Topology. In: Proceedings of SIGIR 2007,
Amsterdam, Netherlands (2007)
8. Taskar, B., Wong, M., Abbeel, P., Koller, D.: Link Prediction in Relational Data.
In: Proc. of 17th Neural Information Processing Systems, NIPS (2003)
9. Hsu, H.W., Weninger, T., Paradesi, R.S.M., Lancaster, J.: Structural link analy-
sis from user profiles and friends networks: a feature construction approach. In:
Proceedings of International Conference on Weblogs and Social Media (ICWSM),
Boulder, CO, USA (2007)
10. Caragea, D., Bahirwani, V., Aljandal, W., Hsu, H.W.: Link Mining: Ontology-
Based Link Prediction in the LiveJournal Social Network. In: Proceedings of As-
sociation of the Advancement of Artificial Intelligence, pp. 192–196 (2009)
11. Haridas, M., Caragea, D.: Link Mining: Exploring Wikipedia and DMoz as Knowl-
edge Bases for Engineering a User Interests Hierarchy for Social Network Appli-
cations. In: Proceedings of the Confederated International Conferences on On the
Move to Meaningful Internet Systems: Part II, Portugal, pp. 1238–1245 (2009)
12. Steyvers, M., Griffiths, T.: Probabilistic Topic Models. In: Landauer, T., Mcna-
mara, D., Dennis, S., Kintsch, W. (eds.) Handbook of Latent Semantic Analysis.
Lawrence Erlbaum Associates, Mahwah (2007)
13. Steyvers, M., Griffiths, T., Tenenbaum, J.B.: Topics in Semantic Representation.
American Psychological Association 114(2), 211–244 (2007)
14. Steyvers, M., Griffiths, T.: Finding Scientific Topics. Proceedings of National
Academy of Sciences, U.S.A, 5228–5235 (2004)
15. Blei, D., Ng, Y.A., Jordan, I.M.: Latent Dirichlet Allocation. Journal of Machine
Learning Research 3, 993–1022 (2003)
16. Blei, D., Boyd-Graber, J., Zhu, X.: A Topic Model for Word Sense Disambigua-
tion. In: Proc. of the 2007 Joint Conf. on Empirical Methods in Natural Language
Processing and Comp. Natural Language Learning, pp. 1024–1033 (2007)
17. Guo, J., Xu, G., Cheng, X., Li, H.: Named Entity Recognition in Query. In: Pro-
ceedings of SIGIR 2009, Boston, USA (2009)
18. Krestel, R., Fankhauser, P., Nejdl, W.: Latent Dirichlet Allocation for Tag Recom-
mendation. In: Proceedings of RecSys 2009, New York, USA (2009)
19. Chen, W., Chu, J., Luan, J., Bai, H., Wang, Y., Chang, Y.E.: Collaborative Fil-
tering for Orkut Communities: Discovery of User Latent Behavior. In: Proceedings
of International World Wide Web Conference (2009)
20. McCallum, A.K.: MALLET: A Machine Learning for Language Toolkit (2002),
http://mallet.cs.umass.edu
21. Phanse, S.: Study on the Performance of Ontology Based Approaches to Link
Prediction in Social Networks as the Number of Users Increases. M.S. Thesis (2010)
Info-Cluster Based Regional Influence Analysis
in Social Networks

Chao Li1,2,3 , Zhongying Zhao1,2,3 , Jun Luo1 , and Jianping Fan1


1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences,
Shenzhen 518055, China
2 Institute of Computing Technology, Chinese Academy of Sciences,
Beijing 100080, China
3 Graduate School of Chinese Academy of Sciences, Beijing 100080, China
{chao.li1,zy.zhao,jun.luo,jp.fan}@siat.ac.cn

Abstract. Influence analysis and expert finding have received a great


deal of attention in social networks. Most existing works, however, aim
to maximize influence based on the community structure of social networks.
They ignore the location information, which often implies abundant in-
formation about individuals or communities. In this paper, we propose
Info-Cluster, an innovative concept to describe how the information orig-
inating from a location cluster propagates within or between communities.
According to this concept, we propose a framework for identifying the
Info-Cluster in social networks, which uses both location information
and community structure. Taking the location information into con-
sideration, we first adopt the K-Means algorithm to find location clus-
ters. Next, we identify the communities for the whole network data set.
Given the location clusters and communities, we present the informa-
tion propagation based Info-Cluster detection algorithm. Experiments
on Renren networks show that our method can reveal many meaningful
results about regional influence analysis.

1 Introduction
Web-based social networks have attracted more and more research efforts in re-
cent years. In particular, community detection is one of the major directions in
social network analysis where a community can be simply defined as a group
of objects sharing some common properties. Nowadays, with the rapid devel-
opment of positioning techniques (eg., GPS), one can easily collect and share
his/her positions. Furthermore, with a large amount of shared positions or tra-
jectories, individuals expect to form their social network based on positions.
On the other hand, a social network, the graph of relationships and interac-
tions within a group of individuals, plays a fundamental role as a medium for
disseminating information, ideas, and influence among its members. Most peo-
ple consider the problem of how to maximize influence propagation in social


networks, by targeting certain influential individuals that have the potential to


influence many others. This problem has attracted some recent attention due to
potential applications in viral marketing, which is based on the idea of leveraging
existing social structures for word-of-mouth advertising of products [4,9]. How-
ever, here we consider a related problem of maximizing influence propagation in
networks. We propose Info-Cluster, an innovative concept to describe how effec-
tively the information originating from a location cluster propagates within or between
communities. According to this concept, we propose a framework for identifying
the Info-Cluster in social networks, which uses both location information and
community structure. Taking the location information into consideration, we
first adopt the K-Means algorithm [10] to find location clusters. Next, we iden-
tify the communities for the whole network data set. Given the location clusters
and communities[3], we present the information propagation based Info-Cluster
detection algorithm (IPBICD).
The paper is organized as follows. We give the related work in Section 2. In
Section 3, we first present the data model with locations taken into consideration.
Then we formulate the Info-Cluster detection problem and propose our framework.
Section 4 details the main algorithms. Experiments on the Renren data set are shown
in Section 5. Finally, we conclude the whole paper in Section 6.

2 Related Work
The success of large-scale online social network sites, such as Facebook and Twit-
ter, has attracted a large number of researchers. Many of them focus on
modeling the information diffusion patterns within social networks. Domingos
and Richardson [4] are the first ones to study the information influence in social
networks. They used probabilistic theory to maximize the influence in social net-
work. However, Kempe, Kleinberg and Tardos [8] are the first group to formulate
the problem as a discrete optimization problem. They proposed namely the in-
dependent cascade model, the weight cascade model, and the linear threshold
model. Chen et al. [2] collected the blog dataset to identify five features (namely
the number of friends, popularity of participants, number of participants, time
elapsed since the genesis of the cascade, and citing factor of the blog) that may
play an important role in predicting blog cascade affinity, so as to identify the most
easily influenced bloggers. However, since the influence cascade models are dif-
ferent, they do not directly address the efficiency issue of the greedy algorithms
for the cascade models studied in [1].
With the growth of the web and social network, communities mining (com-
munity detection) are of great importance recently. In social network graph, the
community is with high concentrations of edges within special groups of vertices,
and low concentrations between these groups [6]. Therefore, Galstyan and Mu-
soyan [5] show that simple strategies that work well for homogenous networks
can be overly sub-optimal, and suggest simple modification for improving the
performance, by taking into account the community structure.
Spatial clustering is the process of grouping a set of objects into classes or
clusters so that objects within a cluster are close to each other, but are far away
to objects in other clusters. As a branch of statistics, spatial clustering algorithms
have been studied extensively for many years. According to Han et al.
[7], those algorithms can be categorized into four categories: density-based algo-
rithms, algorithms for hierarchical clustering, grid-based algorithms, partitioning
algorithms. As we all know, individuals have spatial location in social networks.
Therefore, we present an innovative concept to describe the information propa-
gation in social networks, by taking into account the individual location and the
community structure in social network.

3 Frameworks
A large volume of work has been done on community discovery as discussed
above. Most of them, however, ignored the location information of individuals.
The location information often plays a very important role in the community
formulation and evolution. Therefore, it should be paid attention to. In this
paper, we take the location of individuals into consideration to guide us detecting
the Info-Clusters of communities. In this section, we first give our model for the
social networks with location information. And then we present the problem
formulation. The frameworks of our solutions are described in section 3.3.

3.1 Modeling the Social Network Data


Taking the location into consideration, we model the social network data as an
undirected graph (see Fig. 1), denoted by G = (V, Ev , L, El ), where:
V : is the set of individuals, and V = {v1 , v2 , ...vm }. We use circle to represent
each individual.
Ev : is the set of edges that represent the interactions between individuals. We
use solid lines to represent edges.
L: is the set of individuals’ locations or positions, and L = {l1 , l2 , ...ln }. We
use square to represent each location.
El : is the set of links which refer to the associations of the individuals and
his/her locations. One location often corresponds to many individuals, which
means these individuals belong to the same location. We use dotted lines to
represent them.
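One possible in-memory representation of this model (a hypothetical sketch, not the
authors' data structures) is:

from dataclasses import dataclass, field

@dataclass
class SocialNetwork:
    V: set = field(default_factory=set)      # individuals v1..vm
    Ev: set = field(default_factory=set)     # undirected interaction edges
    L: dict = field(default_factory=dict)    # location id -> (latitude, longitude)
    El: dict = field(default_factory=dict)   # individual -> location id

net = SocialNetwork()
net.V.update({"v1", "v2"})
net.Ev.add(frozenset({"v1", "v2"}))          # undirected edge between v1 and v2
net.L["l1"] = (22.54, 114.05)                # an illustrative coordinate
net.El["v1"] = "l1"
net.El["v2"] = "l1"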

3.2 Problem Formulation


In this part, we first give some definitions to formulate the problem. And then,
we present four key steps for regional influence analysis.
Definition 1. Location-Cluster-Set (Slc ): Slc = {LC1 , LC2 , ..., LCk } where LCi
is a cluster of locations resulting from a spatial clustering algorithm, and LCi ∩
LCj = ∅ (i, j = 1, 2, ..., k; i ≠ j).
Definition 2. Communities-Set (Scom ): Scom = {Com1 , Com2 , ..., Comp },
where Comi is a community identified by community detection method, and
Comi ∩ Comj = ∅ (i, j = 1, 2, ..., p; i ≠ j).


Fig. 1. The model of social networks with location information. In this example, there
are 12 individuals that belong to 5 locations. The individuals are connected with each
other through 17 edges.

Definition 3. Capital of Info-Cluster (Scapital): Scapital = {Scapital^1, Scapital^2,
..., Scapital^k}, where k denotes the number of location clusters, and Scapital^i
represents the set of individuals whose locations belong to LCi: Scapital^i =
{vj | L(vj) ∈ LCi}, where L(vj) denotes the location of vj.

Definition 4. Influence of Info-Cluster (Sinfluence): Sinfluence = {Sinfluence^1,
Sinfluence^2, ..., Sinfluence^k}, where k denotes the number of location clusters,
and Sinfluence^i represents the set of individuals who learn the information from
active individuals and are activated. The information is created or obtained by the
individuals of Scapital^i, who are the initially active individuals.

Definition 5. Know of Info-Cluster (Sknow): Sknow = {Sknow^1, Sknow^2, ...,
Sknow^k}, where k denotes the number of location clusters, and Sknow^i represents
the set of individuals who learn the information from the active individuals of
Sinfluence^i but remain inactive.

Definition 6. Info-Cluster (SInfo-Cluster): SInfo-Cluster = {SIC^1, SIC^2, ...,
SIC^k}, where k is the number of location clusters, and SIC^i = Scapital^i ∪
Sinfluence^i ∪ Sknow^i.

Definition 7. Covering Rate (CR) and Average Covering Rate (ACR):

    CR(LCi) = |SIC^i| / |V|   and   ACR = (1/K) Σ_{i=1..K} CR(LCi)

Definition 8. Influence Power (IP): IP(LCi) = |Sinfluence^i| / |Scapital^i|.
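To make Definitions 7 and 8 concrete, a minimal Python sketch (with our own function
names) is:

def covering_rate(info_cluster, V):
    # CR(LCi) = |SIC^i| / |V|
    return len(info_cluster) / len(V)

def average_covering_rate(info_clusters, V):
    # ACR = (1/K) * sum over i of CR(LCi)
    return sum(covering_rate(ic, V) for ic in info_clusters) / len(info_clusters)

def influence_power(influence_set, capital_set):
    # IP(LCi) = |Sinfluence^i| / |Scapital^i|
    return len(influence_set) / len(capital_set)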

According to the above definitions, we present the main works about the regional
influence analysis as follows:

– Location clustering: It aims to find the clusters (Slc ) of locations through


some spatial clustering algorithm.
– Community detection: Community detection in social network aims to find
groups (Scom ) of vertices within which connections are dense, but between
which connections are sparse.

– Info-Cluster identification: This process focuses on the identification of Info-


Cluster based on influence propagation in internal/external communities.
With the location clusters (Slc) we can obtain the corresponding Scapital, but
how to use influence propagation to find Sinfluence and Sknow, and then
discover the Info-Cluster (SInfo-Cluster), is the third major task.
– Regional influence analysis: Analyzing the Covering Rate (CR), Average Cover-
ing Rate (ACR) and Influence Power (IP) based on SInfo-Cluster for each
LCi.

3.3 Framework of Our Solutions


In this part we present the framework of our method. The whole process for
Info-Cluster detection includes four steps:

1. Data Preparation: We store the data to be processed in some kind of


database, such as spatial location database, social network database.
2. Preprocessing: With a proper pre-processing approach, it is possible to
improve the performance and speed. In our framework, we use two models
(Data Conversion and Data Fusion) to process the location data and social
network. The main function of Data Conversion is to rewrite the spatial
data, while Data Fusion is used to merge the location data and the social
network data together. At last, a new data file resulting from preprocessing
is sent to step 3.
3. Algorithm Design: It is the main part of the framework. And it consists
of three key components: location clustering, community detection and Info-
Cluster detection. (Details in section 4)
4. Result Visualization: This step aims to view and analyze the results.
It contains the visualization platform and the results analysis, which are
detailed in section 5.

4 Algorithms
In this section, we describe two main algorithms which aim to solve clustering
and Info-Cluster detection based on influence propagation. Firstly, the K-Means
clustering algorithm is used to cluster locations, and community detection algo-
rithm based on modularity maximization is used to cluster individuals. Secondly,
influence propagation based Info-Cluster detection (IPBICD) algorithm is pre-
sented in section 4.2.

4.1 Clustering
K-Means [10] is one of the simplest unsupervised learning algorithms and is often
employed to solve the clustering problem. In this paper, we adopt the K-Means
method to cluster the locations of the social network. Note that other clustering
methods could also be used for location clustering; we adopt K-Means here only
to show the feasibility of our
algorithms. The K-Means algorithm requires us to specify the parameter K,


which means the number of clusters. As to the location clustering, K often
reflects the scale of the locations. The larger K means the finer scale. That is,
if we set K = 10, each location cluster may roughly correspond to a province or city,
while a larger K may make each cluster as fine-grained as a street.
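A hedged sketch of this step with scikit-learn's KMeans, using made-up coordinates,
is shown below; it is one possible implementation rather than the authors' code.

import numpy as np
from sklearn.cluster import KMeans

coords = np.array([[22.54, 114.05], [22.55, 114.06],     # (latitude, longitude) of each location in L
                   [39.90, 116.40], [39.91, 116.41]])

K = 2                                                     # smaller K: province/city scale; larger K: street scale
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(coords)
location_cluster = km.labels_                             # LC index assigned to each location
print(location_cluster)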
Modularity [3] is used to evaluate the quality of a particular partition of a network
into communities. It motivates various kinds of methods for detecting communities,
which aim to maximize the modularity function. The modularity is defined as follows:

    Q = (1/2m) Σ_{vw} (Avw − kv kw / 2m) Σ_{i} δ(cv, i) δ(cw, i) = Σ_{i} (eii − ai^2)    (1)

where
– v, w are vertices within V;
– i indexes the i-th community;
– cv is the community to which vertex v is assigned;
– Avw is an element of the adjacency matrix corresponding to G = (V, Ev);
– m = (1/2) Σ_{vw} Avw;
– kv = Σ_{u} Avu, where u ranges over the vertices;
– eij = (1/2m) Σ_{vw} Avw δ(cv, i) δ(cw, j);
– ai = (1/2m) Σ_{v} kv δ(cv, i);
– δ(x, y) = 1 if x = y, and 0 otherwise.
We start off with each vertex being its own community containing a single member.
The process then repeatedly computes the change in Q for each possible merge of two
communities, chooses the largest increase, and performs that merge.
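This greedy modularity-maximization scheme is available, for example, in networkx;
the sketch below (an illustrative substitute for the authors' implementation, run on
a built-in toy graph) assigns every vertex a community label Com(v) and reports the
Q value of Equation (1).

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.karate_club_graph()                       # stand-in for the interaction graph (V, Ev)
communities = greedy_modularity_communities(G)   # greedy merging that maximizes Q

Com = {v: i for i, members in enumerate(communities) for v in members}
print(modularity(G, communities))                # the resulting Q value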

4.2 Info-Cluster Detection


Before our Info-Cluster detection, we show the data after being processed by the
K-Means and community detection. We use L(vi ) to represent the location of the
vertex vi , and LC(vi ) to represent the location cluster of the vertex vi . LC(vi ) is
assigned from the result of the K-Means algorithm. Similarly, we use Com(vi ) to
denote the community of the vertex vi . Com(vi ) is assigned from the result of the
community detection algorithm. Table 1 shows an example of location clusters
and communities computed by K-Means and community detection methods.
For the social network G = (V, Ev, L, El), represented by an undirected graph, an
idea or innovation can be spread from some source individuals to others. Here we
consider each individual to have two states: active and inactive. If an individual
accepts or adopts the idea/innovation, we say this individual is activated; otherwise,
it is inactive. According to Kempe et al. [8], each individual's tendency to become
active increases monotonically as more of its neighbors become active.

Table 1. An example of location clusters and communities computed by K-Means
and community detection methods

v1 v2 v3 v4 ... vn
L(vi ) l1 l2 l3 l4 ... ln
LC(vi ) 1 2 1 2 ... k
Com(vi ) 1 1 2 2 ... m

During the activation process, an inactive individual will be activated depending on
the comparison between the threshold and a probability that depends on the states of
its neighboring individuals. If the probability is larger than the threshold, the
individual will be activated (in our paper, we put it into the set SInfluence). If
the probability is smaller than the threshold, the individual will not be activated
(in our paper, if the probability is larger than zero, we put it into Sknow;
otherwise, we put it into Snothing).
We first define the threshold θ, which depends on α1, α2, β1 and β2, as follows:

    θ = λ · e^((α1+α2)·lg(2+α1+α2)+(β1+β2)) / (3 + e^((α1+α2)·lg(2+α1+α2)+(β1+β2)))    (2)
where
(1) α1 is the influence probability from vi to vj, where vi ∈ Scapital^p and
Com(vi) = Com(vj); Scapital^p are the initially active individuals;
(2) α2 is the influence probability from vi to vj, where vi ∉ Scapital^p and
Com(vi) = Com(vj);
(3) β1 is the influence probability from vi to vj, where vi ∈ Scapital^p and
Com(vi) ≠ Com(vj);
(4) β2 is the influence probability from vi to vj, where vi ∉ Scapital^p and
Com(vi) ≠ Com(vj);
(5) λ is a regulation parameter. Generally λ = 1. If the network is super-interactive
between the individuals, we can set λ < 1; otherwise, we set λ > 1. In this paper,
we fix λ = 1.
Fig. 2 illustrates α1, α2, β1 and β2. Table 2 shows the ranges of α1, α2, β1 and β2
and gives three examples of θ.
In this paper, the initially activated individuals are known in advance, based on the
active location cluster, which differs from previous work.

Table 2. The range of the α1 , α2 , β1 and β2 and three examples

α1 α2 β1 β2 θ
(0 < α1 < 1) (0 < α2 < α1 ) (0 < β1 < α1 ) (0 < β2 < β1 ) (0 < θ < 1)
0.9 0.8 0.7 0.6 0.7670
0.8 0.6 0.6 0.4 0.6560
0.3 0.2 0.2 0.1 0.3544
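Reading lg in Equation (2) as the base-10 logarithm (an assumption that reproduces
the last two rows of Table 2 above), θ can be computed as follows.

import math

def theta(a1, a2, b1, b2, lam=1.0):
    # Equation (2): theta = lambda * e^x / (3 + e^x),
    # with x = (a1 + a2) * log10(2 + a1 + a2) + (b1 + b2).
    x = (a1 + a2) * math.log10(2 + a1 + a2) + (b1 + b2)
    return lam * math.exp(x) / (3 + math.exp(x))

print(round(theta(0.8, 0.6, 0.6, 0.4), 4))   # 0.656  (second row of Table 2)
print(round(theta(0.3, 0.2, 0.2, 0.1), 4))   # 0.3544 (third row of Table 2)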

Fig. 2. Illustration of α1, α2, β1 and β2. The original figure shows capital node
groups Scapital A–F and non-capital nodes in several communities, with a legend
marking active nodes, inactive nodes, the boundary between capital and non-capital
nodes, and the dividing line between different communities
Fig. 2. Illustration of α1 , α2 , β1 and β2

Second, we define each individual's activation probability, which depends on the
states of its neighboring individuals. In Watts' original model [11] this probability
is determined by the number of active neighbors, the total number of neighboring
individuals and the activation threshold.
However, in our framework we have four types of probability (α1, α2, β1 and β2)
between individuals. We define the function Y(z) as the probability that the z-th
individual is activated, which depends on its neighboring individuals. The function
Y(z) is defined as follows:

Y(z) = ( Σ_{N(z)} (α1 + β1) + Σ_{N(z)} (α2 + β2) ) · Activenum(N(z)) / Num(N(z))    (3)

where:
– N(z): the set of neighboring individuals of the z-th individual.
– Num(N(z)): the number of neighbors of the z-th individual.
– Activenum(N(z)): the number of active individuals among the neighbors of the z-th individual.
Finally, according to Y(z) and θ, we can generate the Info-Clusters in our social
network graph. Specifically, for each location cluster, we first set all individuals
from that location cluster to be active and add them into the Capital group
(S^i_capital). Then, for each inactive individual, we calculate its Y(z) and compare
it with θ. If Y(z) > θ, we add it into the Influence group S^i_influence. If
0 < Y(z) ≤ θ, we add the z-th node into the Know group S^i_know. If Y(z) = 0,
we add it into the Nothing group. At last, we merge the Capital group (S^i_capital),
the Influence group (S^i_influence) and the Know group (S^i_know) into one
Info-Cluster, and repeat the process for the next location cluster. The Info-Cluster
detection process is shown in Algorithm 1.
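To make the detection procedure concrete, the following minimal Python sketch mirrors the grouping loop of Algorithm 1 for a single location cluster. The graph is a plain adjacency dictionary, and the activation probability Y(z) of Equation (3) is passed in as a callable, since its value depends on α1, α2, β1, β2 and on the community structure; all identifiers here are illustrative rather than part of the original implementation.

def detect_info_cluster(graph, location_cluster, cluster_id, theta, activation_score):
    # graph            : dict node -> set of neighbour nodes
    # location_cluster : dict node -> location-cluster id (K-Means result)
    # cluster_id       : the location cluster whose members are activated first
    # theta            : threshold from Equation (2)
    # activation_score : callable(node, active_set, graph) returning Y(z)

    # Step 1: activate every individual of the chosen location cluster (Capital group).
    capital = {v for v in graph if location_cluster[v] == cluster_id}
    influence, know = set(), set()

    # Step 2: label each remaining node by comparing Y(z) with theta.
    for v in graph:
        if v in capital:
            continue
        y = activation_score(v, capital, graph)
        if y > theta:
            influence.add(v)      # Influence group
        elif y > 0:
            know.add(v)           # Know group
        # y == 0 -> Nothing group, excluded from the Info-Cluster

    # Step 3: the Info-Cluster is the union of Capital, Influence and Know.
    return capital | influence | know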

5 Experiments
In order to test the performance of our algorithm, we conduct an experiment on
real social networks. We first obtain the information of 5000 individuals from the

Algorithm 1. Influence Propagation Based Info-Cluster Detection (IPBICD)
Input: G = (V, Ev, L, El);
Output: Info-Cluster: S_Info-Cluster;
1: Calculate θ according to Equation 2 and Slc, Scom according to Section 4.1;
2: for each C from Qlc do
3:   Activate vi if LC(vi) == C, and set vi.label = Capital, vi.value = 1;
4:   Compute the values of all nodes according to Equation 3 for graph G based on Breadth-First Search (BFS);
5:   for each node v from G do
6:     if (v.value > θ) then
7:       v.label = Influence;
8:     end if
9:     if (0 < v.value < θ) then
10:      v.label = Know;
11:    end if
12:    if (v.value == 0) then
13:      v.label = Nothing;
14:    end if
15:  end for
16:  Build Info-Cluster (S^i_Info-Cluster) from all v with v.label == Capital ∨ v.label == Influence ∨ v.label == Know;
17:  return Info-Cluster: S^i_Info-Cluster;
18: end for

Renren friend network by crawling the Renren web site (www.renren.com, which
is similar to Facebook). After preprocessing, the final Renren data set contains
2314 circle vertices, 1400 square vertices and 56456 edges. Each circle vertex
denotes an individual registered on the Renren web site, while each square vertex
represents the location of the corresponding individual. Each edge between circle
vertices denotes a friendship between two individuals. We then conduct experiments
on this data set, using the three settings of influence probabilities shown in Table 2.
Fig. 3 illustrates how the Average Covering Rate (ACR) changes with different K.
From this figure, we can see that, macroscopically, the Average Covering Rate
decreases as K increases. One main reason may be that a higher K value often
leads to fewer people in each location cluster; that is, the information propagates
from fewer sources.
In order to study the relation between the covering rate and the number of sources,
we randomly select some K values and then analyze the experimental results
microscopically. Suppose that all the locations are grouped into 50 clusters
(K = 50). Then we can get 50 capitals denoted by S^i_capital, i = 1, 2, . . . , K,
as described in Definition 3. After the experiment, we get 50 Info-Clusters, each
of which is composed of capital individuals, influence individuals and know indi-
viduals. Fig. 4 shows how the Covering Rate changes with the number of individuals

Fig. 3. The change of Average Covering Rate (ACR) with increasing K. Here, K ∈ [1, 1400] since only 1400 locations are involved in the Renren data set.

of 50 capitals. According to Fig. 4, we find that more people as sources generally
results in a higher covering rate. However, this is not always true. With the third
parameter setting, the covering rate of 30 individuals is higher than that of 100
individuals. For the other two parameter settings, the covering rate of 50 individuals
is nearly equal to that of 150 individuals. This implies that the former individuals
have stronger influential power than the latter ones. Even with the same number of
individuals, the covering rates often differ. One such example is when the number
of individuals is 20: from Fig. 4, we can see that there are two capitals composed
of 20 individuals, and each reaches a different covering rate even with the same
parameter settings.

Fig. 4. The change of Covering Rate with the number of individuals of 50 capitals

To better understand the scope of information propagation from different clusters,
we depict the covering rate of each Info-Cluster on a map of China, as shown in
Fig. 5. The sub-figures (Fig. 5(a), Fig. 5(b) and Fig. 5(c)) correspond to the three
parameter settings respectively. For each of them, we adopt five representative
colors to differentiate different covering rates; the corresponding values are labeled
at the bottom right of each sub-figure. For an Info-Cluster whose covering rate
lies between two of the selected values, we paint it with a transitional color,
whose darkness is determined by the

Fig. 5. The distribution of Covering Rate on the map of China: (a) (0.9, 0.8, 0.7, 0.6); (b) (0.8, 0.6, 0.6, 0.4); (c) (0.3, 0.2, 0.2, 0.1)

Fig. 6. The Influential Power of each capital set

value of the covering rate. According to Fig. 5, we find that the information of the
eastern Info-Clusters spreads more widely than that of the western ones. In
particular, the region of Beijing has the highest covering rate, which may be
attributed to its higher population density.
The influential power of each capital set is shown in Fig. 6. From this figure, we
find that the last cluster (clusterid = 50) achieves the highest influential power
under the parameter settings (0.9, 0.8, 0.7, 0.6) and (0.8, 0.6, 0.6, 0.4). Therefore,
the region containing those individuals is an influential region. On the contrary,
the 15th cluster (clusterid = 15) has the lowest influential power, which means the
region containing those individuals is a weakly influential region.

6 Conclusion
In this paper, we propose the new concept of Info-Cluster. Based on information
propagation, we then present a framework for identifying Info-Clusters, which uses
both community and location information. Given a social network data set, we
first adopt the K-Means algorithm to find location clusters. Next, we identify the
communities of the whole network. Given the location clusters and communities,
we present the information propagation based Info-Cluster detection algorithm
(IPBICD). Experiments on the Renren data set show that the identified Info-
Clusters exhibit many interesting characteristics and have many potential applications,
such as analyzing and predicting the influential range of information or advertisements
from a certain location.

References
1. Cha, M., Mislove, A., Gummadi, K.P.: A measurement-driven analysis of informa-
tion propagation in the flickr social network. In: WWW 2009: Proceedings of the
18th International Conference on World Wide Web, pp. 721–730. ACM, New York
(2009)
2. Chen, W., Wang, Y., Yang, S.: Efficient influence maximization in social networks.
In: KDD 2009: Proceedings of the 15th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 199–208. ACM, New York (2009)
3. Clauset, A., Newman, M.E.J., Moore, C.: Finding community structure in very
large networks. Physical Review E 70, 066111 (2004)
4. Domingos, P., Richardson, M.: Mining the network value of customers. In: KDD
2001: Proceedings of the Seventh ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 57–66. ACM, New York (2001)
5. Galstyan, A., Musoyan, V., Cohen, P.: Maximizing influence propagation in net-
works with community structure. Physical Review E 79(5), 56102 (2009)
6. Girvan, M., Newman, M.E.J.: Community structure in social and biological net-
works. Proceedings of the National Academy of Sciences of the United States of
America 99, 7821 (2002)
7. Han, J., Kamber, M., Tung, A.K.H.: Spatial clustering methods in data mining:
A survey. In: Geographic Data Mining and Knowledge Discovery, Research Mono-
graphs in GIS. Taylor & Francis, Abington (2001)
8. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through
a social network. In: KDD 2003: Proceedings of the Ninth ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining, pp. 137–146. ACM,
New York (2003)
9. Leskovec, J., Adamic, L.A., Huberman, B.A.: The dynamics of viral marketing.
ACM Trans. Web 1(1), 5 (2007)
10. MacQueen, J.: Some methods for classification and analysis of multivariate obser-
vations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics
and Probability, pp. 281–297. University of California Press, Berkeley (1967)
11. Watts, D.: A simple model of global cascades on random networks. Proceedings
of the National Academy of Sciences of the United States of America 99(9), 5766
(2002)
Utilizing Past Relations and User Similarities in
a Social Matching System

Richi Nayak

Faculty of Science and Technology


Queensland University of Technology
GPO Box 2434, Brisbane Qld 4001, Australia
r.nayak@qut.edu.au

Abstract. Due to higher user expectations, more and more online matching
companies adopt recommender systems with content-based, collaborative
filtering or hybrid techniques. However, these techniques focus on users'
explicit contact behaviors and ignore the implicit relationships among users
in the network. This paper proposes a personalized social matching system
for generating recommendations of potential partners that exploits not only
users' explicit information but also the implicit relationships among users.
The proposed system is evaluated on a dataset collected from an online
dating network. Empirical analysis shows that the recommendation success
rate increases to 31% compared to the baseline success rate of 19%.

1 Introduction

With improved Web technology and increased Web popularity, users commonly
use online social networks to contact new friends or like-minded users. Similarly,
people from various demographics have increased the customer base of online
dating networks [9]. It is reported [1] that there are around 8 million singles in
Australia and 54.32% of them use online dating services. Users of online dating
services are overwhelmed by the number of choices returned by these services.
The process of selecting the right partner among a vast number of candidates
becomes tedious and nearly ineffective without an automatic selection process.
Therefore, a matching
system, utilizing data mining to predict behaviors and attributes that could lead
to successful matches, becomes a necessity.
Recommendation systems have existed for a long time to suggest users a prod-
uct according to their web visit histories or based on the product selections of
other similar users [2],[7]. In most cases, the recommendation is an item recom-
mendation which is inanimate. On the contrary, the recommendation in dating
networks is made about people who are animate. Different from item recommen-
dation, people recommendation is a form of two way matching where a person
can refuse an invitation but products cannot refuse to be sold. In other words, a
product does not choose the buyer but dating service users can choose the dating


candidates. The goal of an e-commerce recommendation system is to find products
most likely to interest a user, whereas the goal of a social matching system is to
find users who are likely to respond favorably to them. Current recommendation
systems cannot handle this well [6]. There are few published examples of
recommendation systems applied explicitly to online dating. The authors of [4] use
traditional user-user and item-item algorithms with user rating data for online
dating recommendation, and fail to use many factors, such as age, job, ethnicity
and education, that play important roles in match making. The authors of [6]
proposed a theoretical generic recommendation algorithm for social networks that
can easily be applied to an online dating context. Their system is based on the
concept of social capital, which combines direct similarity of static attributes,
complementary relationships, general activity and the strength of relationships.
However, this work is at a theoretical level and no experiments have been carried
out to prove the effectiveness of the theory. The many weight factors in the
proposed algorithm may prevent it from being an effective algorithm. Efficiency is
another problem for these pairwise algorithms, which have very high computational
complexity.
This paper proposes a social matching system that combines social network
knowledge with content-based and collaborative filtering techniques [7], utilizing
users' past relations and user similarities to improve recommendation quality. The
system includes a nearest-neighbour algorithm that provides an add-on layer to
group similar users in order to deal with the cold-start problem (i.e., handling new
users). It also includes a relationship-based user similarity prediction algorithm
that is applied to calculate similarity scores and generate candidates. Finally, a
support vector machine [3] based algorithm is employed to determine the
compatibility between matching pairs. The similarity scores and the compatibility
scores are combined to propose a ranked list of potential partners to the network
users. The proposed system is evaluated on a dataset collected from a popular
dating network. Empirical analysis shows that the proposed system is able to
recommend the top-N users with high accuracy. The recommendation success rate
increases to 31% compared to the baseline success rate of 19%. The baseline recall
of the underlying dating network also increases from 0.3% to 9.2%.

2 The Proposed Social Matching Method

Data required by a dating network for recommending potential partners can be


divided into the following features: (1) Personal profile for each user which in-
cludes self details on demographic; fixed-choice responses on Physical, Identity,
Lifestyle, Career and Education, Politics and Religion and other attributes; free-
text responses to various interests such as sport, music etc; and optionally, one
or more photographs; (2) Ideal partner profile for each user which includes infor-
mation about what user prefers in Ideal partner, usually the multiple choices on
the attributes discussed before; (3) User activities on the network such as view-
ing the profiles of other members, sending pre-typed messages to other users;

sending emails or chat invitations; and (4) Measure of relationships with other
users such as willingness to initialize relationships and responding to invita-
tions, and frequency and intensity with which all relationships are maintained.
A relationship is called successful for the purpose of match making when a user
initiates a pre-typed message as a token of interest and the target user sends
back a positive reply.
Let U be the set of m users in the network, U = u1 , . . . , um . Let X be a user
personal profile that includes a list of personal profile attributes, X = x1 , . . . , xn
where each attribute xi is an item such as body type, dietary preferences, po-
litical persuasion and so on. Consider the list of user’s ideal partner profile
attributes as a set Y = y1 , . . . , yn where each attribute yi is an item such as
body type, dietary preferences, political persuasion and so on. For a user uj ,
value of xi is unary, however, the values of yi can be multiple. Let P = X + Y
denote a user profile containing both the personal profile attributes and partner
preference attributes. The profile vector of a user is shown as P (uj ). There can
be many types of user activities in a network that can be used in the matching
process. Some of the main activities are “viewing profiles”, “initiating and/or
responding pre-defined messages (or kisses1 , “sending and/or receiving emails”
and “buying stamps”. The profile viewing is a one-sided interaction from the
viewer perspective; therefore it is hard to define the viewers’ interests. The “kiss
interactions” are more promising to be considered as an effective way to show the
distinct interests between two potential matches. A user is able to show his/her
interest by sending a “kiss”. The receiver is able to ignore the ”kiss” received or
return a positive or negative reply. When a receiver replies a kiss with positive
predefined message it is considered as a “successful” kiss or a “positive” kiss.
Otherwise, it is judged as an “unsuccessful” kiss or a “negative” kiss.

Generation of Small Social Networks. A number of small social networks, which
describe the past relations between users and their previously contacted users, are
derived. Let user ub be a user who has successfully interacted with more than a
certain number of previous partners in a particular period. Let GrA be the set of
users, GrA ⊆ U, with whom user ub has positively interacted. Let user ua be the
user with whom user ub has positively interacted most recently, ua ∈ GrA. Let
GrB be the set of users who are ex-partners of user ua, ub ∈ GrB. Note that
gender(ub) = gender(GrB) and gender(ua) = gender(GrA). Users ub and ua are
called seed users as they provide us the network. Users in GrA and GrB are called
relationship-based users. The relationship between user ub and a user in GrB, and
the relationship between user ua and a user in GrA, reflect the personal profile
similarity between two users of the same gender. This similarity value is evaluated
by using an instance-based learning algorithm.
Figure 1 summarizes the proposed method. The process starts with selecting
a number of seed pairs. A network of relationship-based users (GrA, GrB) is
generated by locating the ex-partners of the seed users. The size of GrA and
GrB is increased by applying clustering to include new users and to overcome
1 We call a pre-defined message a "kiss" in this paper.

the lack of relationship-based users for a seed pair. The similarity between ub
and users in GrB, and the similarity between ua and users in GrA, are calculated
to find "closer" members in terms of profile attributes. This step determines the
users whose profiles match, since they are users of the same gender. Each pair in
(GrA, GrB) is also checked for compatibility using two-way matching. These three
similarity scores are combined using a weighted linear strategy. Finally, a ranked
list of potential partner matches from GrB is formed for each user in GrA.

The Personalised Social Matching Approach


Input: Network Users: U = { u1…. um}; Users Profiles : {P(u1),…P(um)} ; User
Clusters (Female or Male): C = {(u1..ui), (uj,,uk),…(uk…um)} ;Users’
communication
Output: Matching pairs: { (ui , uj)..... (ul, um)}
Begin
a. Select good seed pair of users based on communication between users U;
b. For each unique good seed pair (ub, ua):
a. Form GrA by finding ex-partners of ub;
b. Form GrB by finding ex-partners of ua;
c. Extend GrA and GrB by similar users with the corresponding
gender from the clusters C using the k-means algorithm;
d. For each user in GrA:
i. Find ex-partner GrAi whose profiles are similar to ua
using the instance-based learning algorithm;
ii. Assign SimScore(GrAi, ua)
e. For each user in GrB:
i. Find ex-partner GrBi whose profiles are similar to ub
using the instance-based learning algorithm;
ii. Assign SimScore(GrBi, ub)
f. For each pair (GrAi, GrBj) in GrA, GrB;
i. Apply the user compatibility algorithm to score the
compatibility score;
g. Combine the three scores and rank the matching pairs;
End

Fig. 1. High level definition of the proposed method

Personal Profile Similarity. An instance-based learning algorithm is developed to
calculate the similarity between a seed user and the relationship-based users.
Attributes in the personal profile are categorical, so an overlap function is used to
determine how close two users are in terms of attribute xi:

Sxi(u1, u2) = 1, if xi(u1) = xi(u2);  0, otherwise.    (1)

where u1 is a seed user and u2 ∈ GrB or u2 ∈ GrA. This matching process is
conducted between a seed user ub and GrB users, as well as between the
corresponding partner seed user ua and GrA users. Not all attributes are equally
important when selecting a potential match [5],[8]. For example, analysis of the
dataset of a popular dating network2 shows that attributes such as height, body
type and have children are specified more frequently in user personal profiles than
attributes such as nationality, industry and have pets. Therefore, each attribute
score is assigned a weight when the scores are combined. The weight is set
according to the percentage of all members in the network who have indicated
that attribute in their personal profiles. Including the weight values according to
this network statistic allows us to reflect the user interest in the network.
SimScore(u1, u2) = Σ_{i=1}^{n} Sxi(u1, u2) × weightxi    (2)
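A small Python sketch of Equations (1) and (2) is given below; the attribute weights are assumed to be pre-computed from the percentage of members specifying each attribute, and all variable names are illustrative.

def sim_score(profile1, profile2, weights):
    # profile1, profile2: dict attribute -> categorical value
    # weights           : dict attribute -> weight in [0, 1]
    score = 0.0
    for attr, w in weights.items():
        # Overlap function S_xi: 1 if both users state the same value, else 0.
        if profile1.get(attr) is not None and profile1.get(attr) == profile2.get(attr):
            score += w
    return score

# Example: two users compared on three attributes.
weights = {"body_type": 0.8, "diet": 0.4, "politics": 0.3}
u1 = {"body_type": "athletic", "diet": "vegetarian", "politics": "moderate"}
u2 = {"body_type": "athletic", "diet": "omnivore", "politics": "moderate"}
print(round(sim_score(u1, u2, weights), 2))  # 0.8 + 0.3 = 1.1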

Solving the cold-start problem in the network. A recommendation system can
suffer from the cold-start problem [2] when the number of relationship-based users
in the network is very low or when new users are to be included in the matching
process. This research utilizes the k-means clustering algorithm, which helps to
increase the size of GrA and GrB by finding similar users according to the seed
users ua and ub respectively. Users present in the network for a specified duration
are grouped according to their personal profiles. Let C^m = {C1, . . . , Cc}^m be the
clusters of male members of the network, where ck is the centroid vector of cluster
C^m_l. Let C^f = {C1, . . . , Cc}^f be the clusters of female members of the network,
where ck is the centroid vector of cluster C^f_l. The user personal profile and
preference attributes P = X + Y, where X = {x1, . . . , xn} and Y = {y1, . . . , yn},
are used in the clustering process. The clusters corresponding to the gender of the
seed user (ub or ua) are examined to find the cluster that matches the seed user
best, i.e., max_{ck ∈ C^(m|f)} S_k(P(ub), ck), where S_k denotes the similarity
between a centroid vector and a user profile vector; cosine similarity is employed
in the process. Members of the matched cluster are used to extend the size of GrA
or GrB.
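The following sketch illustrates this cluster-matching step under the assumption that each profile P = X + Y has already been encoded as a fixed-length numeric vector; the encoding itself is not prescribed by the paper, so the names used here are hypothetical.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def extend_group(seed_vector, clusters, group, extra):
    # clusters: list of (centroid_vector, member_list) for the seed user's gender.
    # Returns `group` extended with up to `extra` members of the best-matching cluster.
    best_centroid, best_members = max(clusters, key=lambda c: cosine(seed_vector, c[0]))
    new_members = [m for m in best_members if m not in group][:extra]
    return group + new_members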

The User Compatibility Algorithm. A recommendation system for social


matching should consider two-way matching. For each personal attribute, the
user’s preference is compared to the potential match’s stated value for the at-
tribute. The result of the comparison is a single match score for that attribute
that incorporates (1) the user’s preference for the match, (2) the potential
match’s preference for the user, and (3) the importance of the attribute to both.
The attribute cross-match score can be thought of as a distance measure be-
tween two users along a single dimension. The attribute cross match scores for
all attributes are combined into a vector that indicates how closely the match
is between a user and their potential match. The first step is to calculate the
attribute cross match score CSxi (ub , ua ) that quantifies how closely a potential
2 Due to privacy reasons, the details of this network are not given.

match fits the preferences of user ub based on profile attribute xi. That is, does
a potential match's stated value for an attribute fit the user's preference yi for
that attribute? If the user has explicitly stated a preference for the attribute then
the measure becomes trivial. If the user has not explicitly stated a preference,
then a preference can be inferred from the preferences of other members in the
same age and gender group. Though a user may not explicitly state their
preference, an assumption can be made that their preference is similar to others
in the same age and gender group. The score then becomes the likelihood that a
potential match ua meets the preferences of user ub for attribute xi:

CSxi(ub, ua) =
  1,  if xi(ua) ∈ yi(ub)
  N(xi(ub) = x, xi(ua) ∈ yi(ub) | xiGender(ub), xiAge(ub)) / [ N(xi(ub) = x | xiGender(ub), xiAge(ub)) − N(xi(ub) = x, yi(ub) = "Not Specified" | xiGender(ub), xiAge(ub)) ],  if yi(ub) = φ
  0,  otherwise                                                        (3)

where xi(ub) is user ub's profile value for attribute xi and yi(ub) is user ub's
preferred match value for attribute xi. By the definition in the above equation,
scores range from 0 to 1. The attribute cross-match score is moderated by a
comparative measure of how important the attribute xi is to users within the
same age and gender demographics as user ub. This measure, called the
importance, is estimated from the frequency with which users in the same age
band and gender as the user specify a preference for the attribute. This is done
to ensure that all attributes are not treated as equally important when selecting
a potential match. Attributes such as height, body type and have children are
specified more frequently than attributes such as nationality, industry and have
pets. If a user explicitly specifies a preference then it is assumed the attribute is
highly important to them (e.g. when a user makes an explicit religious preference).
When it is not specified, a good proxy for the importance of the attribute is the
complement of the proportion of users in the same age and gender group who did
specify a preference for the attribute. Mathematically, this is defined by

Ixi(ub) =
  1,  if xi(ua) ∈ yi(ub)
  1 − N(xi(ub) = x, yi(ub) = "Not Specified" | xiGender(ub), xiAge(ub)) / N(xi(ub) = x | xiGender(ub), xiAge(ub)),  if yi(ub) = φ
  0,  otherwise                                                        (4)

By the definition in this equation, scores range from 0 to 1. The attribute score
for xi between potential partners is calculated as follows:
Axi (ub , ua ) = Ixi (ub ) × CSxi (ub , ua ) (5)

Including the importance information upfront may simplify the task of training
an optimisation model to map the attribute scores to a target variable. By
reducing complexity of the model, accuracy may be improved. An alternative
to the importance measure would be to leave the weightings for an optimisation
model to estimate. It is assumed including the importance measure as part of
the score calculations will assist training of the optimisation model.

Both the attribute match score and the importance are also calculated from the
perspective of the potential match ua's preference towards user ub. Finally, a
single attribute match score Mxi between two users for attribute xi is obtained
as follows:
Mxi(ub, ua) = Mxi(ua, ub) = 1/2 (Axi(ub, ua) + Axi(ua, ub))    (6)

By combining the four measures per attribute into one cross-match score per
attribute the search space is reduced by three quarters.
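As a hedged illustration of Equations (3)-(6), the sketch below computes the per-attribute scores when the demographic counts N(.) have been pre-aggregated into a simple statistics table. The exact layout of that table is an assumption made for this example; only the overall logic (explicit preference, demographic estimate, importance weighting and the symmetric combination) follows the equations above.

def cross_match(user, match, attr, stats):
    # CS_xi(user, match): 1 if the match's value is in the user's stated preference,
    # a demographic estimate when the preference is unstated, and 0 otherwise.
    pref = user["prefs"].get(attr)        # y_i(u_b); None means "not specified"
    value = match["profile"].get(attr)    # x_i(u_a)
    if pref and value in pref:
        return 1.0
    key = (user["gender"], user["age_band"], attr)
    if pref is None and key in stats:
        s = stats[key]
        denom = s["total"] - s["not_specified"]
        return s["pref_contains_value"].get(value, 0) / denom if denom > 0 else 0.0
    return 0.0

def importance(user, attr, stats):
    # I_xi(u_b): 1 if the preference is stated, otherwise the complement of the
    # "not specified" proportion within the same age and gender group.
    if user["prefs"].get(attr):
        return 1.0
    key = (user["gender"], user["age_band"], attr)
    if key in stats and stats[key]["total"] > 0:
        return 1.0 - stats[key]["not_specified"] / stats[key]["total"]
    return 0.0

def attribute_match(u_b, u_a, attr, stats):
    # Symmetric single-attribute match score M_xi (Equations (5) and (6)).
    a_ba = importance(u_b, attr, stats) * cross_match(u_b, u_a, attr, stats)
    a_ab = importance(u_a, attr, stats) * cross_match(u_a, u_b, attr, stats)
    return 0.5 * (a_ba + a_ab)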
Finding user compatibility requires a measurement that allows different potential
matches to be compared. The measure should allow a user's list of potential
matches to be ranked in order of "closest" matches. This is achieved by combining
all attribute cross-match scores into a singular match score.
M(ub, ua) = [Mx1(ub, ua), . . . , Mxn(ub, ua)]    (7)
The goal then becomes to intelligently summarise the vector M(ub, ua) in a way
that increases the score for matches that are likely to lead to a relationship.
Technically, this becomes a search for an optimal mapping from a user ub and a
potential match ua, based on their shared attribute cross-match vector M(ub, ua),
to a target variable that represents a good match. We call this target variable the
compatibility score Comp(ub, ua), such that:
Comp(ub, ua) = f(M(ub, ua))    (8)

An optimal mapping can be found by training a predictive data mining algorithm
to learn the mapping function f, provided a suitable target variable can be
identified. Ideally the target relationship score would be a variable based on user
activities that (1) identifies successful relationships, and (2) increases the company
revenue through more contacts. In this research we use the "kiss" communication
between users as the target to learn successful relationships. The calculation of
the mapping function f from a potential match's attribute cross-match vector to
the compatibility score is performed using a support vector machine (SVM)
algorithm [3]. Each input to the SVM is a real value ranging from -1 to 1. The
trained SVM has a single real-valued output that becomes the compatibility score.
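A minimal sketch of learning the mapping f of Equation (8) is shown below. scikit-learn's SVR is used purely for illustration, and the training data is synthetic; the paper only states that an SVM with real-valued inputs in [-1, 1] and a single real-valued output is trained with the "kiss" outcomes as target.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(200, 20))        # cross-match vectors M(u_b, u_a)
y_train = (X_train.mean(axis=1) > 0).astype(float)  # 1 = positive kiss, 0 = negative

model = SVR(kernel="rbf", gamma=0.5)                # Gaussian kernel, cf. Figure 3
model.fit(X_train, y_train)

# The predicted value is used as the compatibility score Comp(u_b, u_a).
comp_scores = model.predict(rng.uniform(-1, 1, size=(5, 20)))
print(comp_scores)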

Putting it all together. Once the three similarity scores, SimScore(ub, GrBj)
identifying profile similarity between the seed user and a potential match,
SimScore(ua, GrAi) identifying profile similarity between the seed partner and a
recommendation object, and the compatibility score Comp(GrAi, GrBj) between
a potential match pair (GrAi, GrBj), are obtained, these scores are combined
using a weighted linear strategy:

Match(GrAi, GrBj) = w1 × SimScore(ub, GrBj) + w2 × SimScore(ua, GrAi) + w3 × Comp(GrAi, GrBj)    (9)

To determine these weight settings, a decision tree model was built using 300
unique seed users, 20 profile attributes and about 300,711 recommendations
generated from the developed social matching system, along with an indicator of
their success. The resulting decision tree showed that a higher percentage of
positive kisses is produced when w1 ≥ 0.5, w2 ≥ 0.3 and w3 ≥ 0.2. Therefore w1,
w2 and w3 are set to 0.5, 0.3 and 0.2 respectively. It is interesting to note the
lower value of w3: it means that when two members are interested in each other,
there is a high probability that both of them are similar to their respective
ex-partners.
For each recommendation object GrAi, matching partners are ranked according
to their Match(GrAi, GrBj) score and the top-n partners from GrB become the
potential matches of GrAi.
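The ranking step of Equation (9) can be sketched as follows, with the weights fixed to the values reported above; the data structures holding the three scores are assumptions made for this illustration.

W1, W2, W3 = 0.5, 0.3, 0.2

def rank_matches(gr_a, gr_b, sim_to_ub, sim_to_ua, comp, top_n=20):
    # sim_to_ub[j] : SimScore(u_b, GrB_j)
    # sim_to_ua[i] : SimScore(u_a, GrA_i)
    # comp[(i, j)] : Comp(GrA_i, GrB_j)
    # Returns a dict mapping each GrA member to its top-n GrB members.
    result = {}
    for i in gr_a:
        scored = [(W1 * sim_to_ub[j] + W2 * sim_to_ua[i] + W3 * comp[(i, j)], j)
                  for j in gr_b]
        scored.sort(reverse=True)
        result[i] = [j for _, j in scored[:top_n]]
    return result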

3 Empirical Analysis

3.1 Dataset: The Online Dating Network

The proposed method is tested with a dataset collected from a real-life online
dating network with about 2 million users. We used three months of data to
generate and test networks of relationship-based users and recommendations. The
activity and measure of relationship between two users in this research is the
"kiss". The number of positive kisses is used in testing the proposed social
matching system. Figure 2 lists the details of the users and kisses in the network.
A user who has logged on to the website during the chosen three-month period is
called an "active" user. The seed users and relationship-based users come from
this set of users. A kiss sender is called "successful" when the target user sends
back a positive kiss reply. There are about 50 predefined messages (short texts of
up to 150 characters) used in the dating network. These kiss messages are
manually labeled as positive or negative, indicating the user's interest towards
another member. A large number of kisses in the network have never been replied
to by the target users; they are called "null kisses".

3 Months Data Value


# of distinct active users 163,050
(female + male) (82,500 + 80, 550)
# unique kiss senders 122,396
# unique successful senders 91,487
# unique kiss recipients in the network 198,293
# unique kiss recipients who are active 83,865
during the chosen period
# unique kisses 886,396
# unique successful kisses 171,158
# unique negative kisses 346,193
# unique null kisses 369,045

Fig. 2. User and Kiss Statistics for the three months chosen period

It can be noted that each kiss sender receives about 4 kiss replies (successful and
negative) on average. It can also be seen that about 75% of kiss senders have
received at least one positive kiss reply. The number of successful kisses is less
than one fourth of the sum of negative and null kisses. A further kiss analysis
shows a strong indication that male members in the network initiate the first
activities such as sending kisses (78.9% vs 21.1%); they are defined as
proactive-behavior users in this paper, while female members, who are
reactive-behavior users, usually wait to receive kisses.

3.2 Evaluation Criteria


Let U be the set of the network's active users. Let GrA be the group of users
who are going to receive potential partners' recommendations and GrB be the
group of users who become the potential partners. Let U = GrA ∪ GrB, where
GrA ∩ GrB = φ. The recommendation performance is tested by whether the
user has made initial contact with the users in the recommendation list.
SuccessRate (SR) = (Number of unique successful kisses from GrA to GrB) / (Number of unique kisses from GrA to GrB)    (10)

BaselineSuccessRate (BSR) = (Number of unique successful kisses from GrA to U) / (Number of unique kisses from GrA to U)    (11)

Success Rate Improvement (SRI) = Success Rate (SR) / Baseline Success Rate (BSR)    (12)

Recall = (Number of (Kissed Partners ∩ Recommended Partners)) / (Number of Kissed Partners)    (13)
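A small Python sketch of the four measures is given below; kisses are represented as (sender, receiver, successful) triples and the per-user partner sets are Python sets, which is an assumption made only for this illustration.

def success_rate(kisses, gr_a, targets):
    # SR when `targets` is GrB (Eq. 10); BSR when `targets` is the whole user set U (Eq. 11).
    sent = [(s, r, ok) for s, r, ok in kisses if s in gr_a and r in targets]
    return sum(ok for _, _, ok in sent) / len(sent) if sent else 0.0

def success_rate_improvement(kisses, gr_a, gr_b, all_users):
    # Eq. (12): SRI = SR / BSR.
    bsr = success_rate(kisses, gr_a, all_users)
    return success_rate(kisses, gr_a, gr_b) / bsr if bsr else 0.0

def recall(kissed_partners, recommended_partners):
    # Eq. (13): fraction of actually kissed partners that were recommended.
    if not kissed_partners:
        return 0.0
    return len(kissed_partners & recommended_partners) / len(kissed_partners)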

Kernel     Kernel  Standard   Correctly Predicted (%)      Correctly Predicted (%)
Function   Size    Deviation  Training Dataset             Test Dataset
                              (Mismatch)   (Match)         (Mismatch)   (Match)
Linear     -       -          44.4         62.4            10           90
Gaussian   40      0.5        79.6         63.9            73.3         60.1
Gaussian   40      1          79.5         63.9            67.2         59.8
Gaussian   50      0.5        70.5         62.1            58.4         53.7
Gaussian   50      1          62.7         60.4            58.5         56.6
Gaussian   70      0.5        68.9         64.5            64.6         59.9
Gaussian   70      1          59.7         51.0            56.9         48.2

Fig. 3. The SVM Model Performance

3.3 Results and Discussion


The SVM models were trained by varying parameters on the datasets, as shown
in Figure 3. In the dataset, only a very small set of samples are positive matches.
To avoid the model being overwhelmed by negative samples, a stratified training
set was created. A set of at least 3500 unique users with about 20

positive and 20 negative kiss responses per user were chosen. This created a
sample training set of about 144,430 records, which were used to train the SVM
models. The test dataset was created with 24,498 records by randomly choosing
users. 10-fold cross-validation experiments were performed and the average
performance is shown in Figure 3. The best-performing SVM model was used in
the proposed matching system.
Figure 4 shows that the Success Rate (SR) decreases as the number of potential
matches (GrB) recommended to a user in GrA increases. This result confirms that
the higher the total score generated by the proposed matching system,
Match(GrAi, GrBj), the more relevant and accurate the matches are. For example,
users with a higher total score in the top-5 recommendation list received the
highest percentage of positive kiss replies. There are a number of null kiss replies
in the dataset; a null kiss reply may eventually turn into a positive or a negative
kiss reply. If all null kiss replies could be transformed into positive kiss replies,
the success rate (SR) would reach 66% for the top-20 users. The BSR of the
underlying online dating network is 19%, and the proposed system always
performs better. This result indicates that the potential matches offered by the
system interest the user (as shown by Figure 5), and that the receivers also show
high interest towards these users by sending positive kiss messages back, as shown
by the SR in Figure 4 and by the increased recall (Figure 6). However, it can be
seen that the value of SRI decreases as the number of recommendations increases,
as shown in Figure 5. This indicates that more matching recommendations attract
more user attention and trigger more kisses to be sent, but also lead to
lower-quality recommendations. When recommending potential matches, the user
is more interested in examining a small set of recommendations rather than a
long list of candidates. Based on all results, high-quality top-20 recommendations
maximize SRI without letting recall drop unsatisfactorily.
Experiments have also been performed to determine which kind of users are more
important for generating high-quality matches for the dating network: the similar
users from clusters or the relationship-based users. Two sets of experiments are
performed.
– In the first setting, the size of GrA and GrB is fixed at 200. The usual size of
GrA is about 30 to 50, populated with ex-partners. More similar users obtained
from the respective clusters are added into these two groups, in comparison to
relationship-based users.
– In the second setting, the difference between the two groups, Diff(#GrA,
#GrB), is covered by adding members from the respective cluster. In addition,
the size of GrA and GrB is increased by only 10% through clustering to add new
members and to increase the user coverage.
Results show that when more similar users rather than relationship-based users
are added, the success rate improvement (SRI) is lower than when more
relationship-based users are added to the current pairs. The SR and SRI obtained
from the first setting are 0.19 and 1.0 respectively, whereas in the second setting
SR and SRI are 0.29 and 1.4 respectively, considering all suggested matching pairs.

Fig. 4. Top-n success rate and success rate improvement

Fig. 5. Sender's Interests Prediction Accuracy

Fig. 6. Top-n recall performance

Empirical analysis ascertains that by utilising clustering to increase the size of
GrA and GrB by a small amount and equalising the two groups, the
recommendation quality is improved and new users are considered in the matching.
Due to the use of small networks of relationship-based users, the proposed
personalized social matching system is able to generate recommendations in an
acceptable time frame: it takes about 2 hours to generate recommendations for
100,000 users, excluding offline activities such as clustering of users, training of
the SVM model and calculation of the importance table for members according to
gender and age for use in the SVM. The proposed social matching system is able
to generate high-quality recommendations for users. The quality of the
recommendations is enhanced by the following techniques: 1) All the
recommendations are generated from good seed users who have more than thirty
previous partners in a defined period. 2) The recommendations are among
relationship-based users, who are obtained by utilizing the social network's
background knowledge. 3) To solve the cold-start issue while still ensuring
recommendation quality, the system add-on layer only groups users who are
similar to the seed pairs; this avoids introducing random users and keeps the
relationships among users. 4) Three similarity scores are utilised to determine

the matching pair quality, by measuring the similarity level against seed pairs and
relationship-based users and the compatibility between the matching pair. A
decision tree model is used to produce the weights for these similarity scores.

4 Conclusion
The proposed system gathers relationship-based users, forms relationship-based
user networks, explores the similarity level between relationship-based users and
seed users, explores the compatibility between potential partners and then makes
partner recommendations in order to increase the likelihood of a successful reply.
This innovative system combines the following three algorithms to generate the
potential partners: (1) an instance-based similarity algorithm for predicting
similarity between the seed users and relationship-based users, which yields
potentially high-quality recommendations and reduces the number of users that
the matching system needs to consider; (2) a K-means similar-user checking
algorithm that helps to overcome the problems that standard recommender
techniques usually suffer from, including the absence of knowledge, the cold-start
problem and sparse user data; and (3) a user compatibility algorithm that
conducts two-way matching between users by utilising the SVM predictive data
mining algorithm. Empirical analysis shows that the success rate has been
improved from the baseline of 19% to 31% by using the proposed system.

Acknowledgment: We would like to acknowledge the industry partners and


the Cooperative Research Centre for Smart Services.

References
1. 2006 census quickstats. Number March 2010 (2006)
2. Anand, S.S., Mobasher, B.: Intelligent techniques for web personalization. Online
Information Review (2005)
3. Bennett, K.P., Campbel, C.: Support vector machines: Hype or hallelujah? SIGKDD
Explorations 2, 1–13 (2000)
4. Brozovsky, L., Petricek, V.: Recommender system for online dating service (2005)
5. Fiore, A., Shaw Taylor, L., Zhong, X., Mendelsohn, G., Cheshire, C.: Who’s right
and who writes: People, profiles, contacts, and replies in online dating. In: Hawai’i
International Conference on System Sciences 43, Persistent Conversation Minitrack
(2010)
6. Kazienko, P., Musial, K.: Recommendation framework for online social networks. In:
4th Atlantic Web Intelligence Conference (AWIC 2006). IEEE Internet Computing
(2006)
7. Linden, G., Smith, B., York, J.: Amazon.com recommendations: Item-to-item col-
laborative filtering. IEEE Internet Computing 7 (2003)
8. Markey, P.M., Markey, C.N.: Romantic ideals, romantic obtainment, and relation-
ship experiences: The complementarity of interpersonal traits among romantic part-
ners. Journal of Social and Personal Relationships 24, 517–534 (2007)
9. Smith, A.: Exploring online dating and customer relationship management. Online
Information Review 29, 18–33 (2005)
On Sampling Type Distribution from Heterogeneous
Social Networks

Jhao-Yin Li and Mi-Yen Yeh

Institute of Information Science, Academia Sinica


128 Academia Road, Section 2, Nankang, Taipei 115, Taiwan
{louisjyli,miyen}@iis.sinica.edu.tw

Abstract. Social network analysis has drawn the attention of many researchers
recently. With the advance of communication technologies, the scale of social
networks grows rapidly. To capture the characteristics of very large social
networks, graph sampling is an important approach that does not require visiting
the entire network. Prior studies on graph sampling focused on preserving
properties such as the degree distribution and clustering coefficient of a
homogeneous graph, where each node and edge is treated equally. However, a
node in a social network usually has its own attribute indicating a specific group
membership or type. For example, people are of different races or nationalities.
The links between individuals of the same or different types can thus be classified
into intra- and inter-connections. Therefore, it is important whether a sampling
method can preserve the node and link type distribution of a heterogeneous social
network. In this paper, we formally address this issue. Moreover, we apply five
algorithms to real Twitter data sets to evaluate their performance. The results
show that respondent-driven sampling works well even when the sample sizes are
small, while random node sampling works best only under large sample sizes.

1 Introduction
Social network analysis has drawn more and more attention of the data mining com-
munity in recent years. By modeling the social network as a graph structure, where a
node is an individual and an edge represents the relationship between individuals, many
studies have addressed graph mining techniques to discover interesting knowledge in
social networks. With the advance of communication technologies and the explosion
of social web applications such as Facebook and Twitter, the scale of the generated
network data is usually very large. It is often infeasible to explore and store the
entire network before extracting the characteristics of these social networks.
Therefore, it is critical to develop an efficient and systematic approach to gathering
data of an appropriate size while keeping the properties of the original network.
To scale down the network data to be processed, there are two possible strategies:
graph summarization and graph sampling. Graph summarization [1,2,3,4,5,6] aims to
condense the original graph in a more compact form. There are lossless methods, where
the original graph can be recovered from the summary graph, and loss-tolerant methods,
where some information may be lost during the summarization. To obtain the summary
graph, these methods usually need to examine the entire network first. On the other


hand, sampling is a way of data collection by selecting a subset of the original data. By
following some rules of sampling nodes and edges, a subgraph can be constructed with
the characteristics of the original graph preserved. In contrast to graph summarization,
a big advantage of sampling is that only a controlled number of nodes, instead of the
entire network, are visited. In this work, as a result, we want to focus on sampling from
large social networks.
Prior studies on graph sampling [7,8], however, focused only on preserving statis-
tics such as degree distribution, hop-plot, and clustering coefficient on homogeneous
graphs, where each node and link is treated equally. In reality, the social network is
heterogeneous, where each individual has its own attribute indicating a specific group
membership or type. For example, people are of different races or nationalities. The
link between individuals of the same or different types can thus be classified into
intra-connection and inter-connection. The type distribution of nodes and the proportion of
intra/inter-connection links is also key information that should be preserved to under-
stand the heterogeneous social network, which, to the best of our knowledge, has not yet
been addressed in the previous graph sampling works in the data mining community.
To this end, we propose two goals on the heterogeneous social network. First is the
type distribution preserving goal. Given a desired number of nodes of the sample size,
a subgraph Gs is generated by some sampling method. The type distribution of Gs ,
Dist(Gs ), is expected to be the same as that of the original graph G. The second goal is
the intra-relationship preserving goal. We expect that the ratio of the intra-connection
numbers to the total edges of Gs should be preserved.
In search of a better solution, we adopt five possible methods: Random Node
Sampling (RNS), Random Edge Sampling (RES), Ego-Centric Exploration Sampling
(ECE) [9], Multiple-Ego-Centric Sampling (MES) and Respondent-Driven Sampling
(RDS) [10], to see their effects on sampling type distribution in heterogeneous social
networks. RNS and RES are two methods of selecting nodes and edges randomly until some
criteria are met. ECE is a chain-referral-based sampling proposed in [9]. Chain-referral
sampling usually starts from a node called ego and selects neighbor nodes uniformly at
random wave by wave [9]. MES is a variation of ECE we designed in which the sampling
starts with multiple initial egos. Finally, we adopt RDS, which is a sampling method
used in social science for studying the hidden populations [10]. Many works on the
social network analysis focus on the majority, i.e., the greatest or the second greatest
connected components, of the network. However, sometimes the small or hidden group
of a network hints more interesting messages. For example, the population of drug users
or patients with rare diseases is usually hidden and relatively small. Essentially, RDS
is a method combining snowball sampling, of which the recruiting of future samples is
from acquaintances of the current subject, with a Markov Chain model to generate un-
biased samples. In our implementation, we adopt RDS for the simulation of the human
recruiting process and indicate how the Markov Chain is computed from the collected
samples.
To evaluate the sampling quality of the above five methods, we conduct experi-
ments on the Twitter data sets provided in [11]. We measure the difference of the type
distribution between the sampling results and the original network by two indexes:
error ratio and D-statistic of Kolmogorov-Smirnov Test. In addition, we measure the

difference of the intra-connection percentage between the samples and the original net-
work. The results show that RDS works best in terms of preserving the type distribu-
tion and the intra-connection percentage when the sample size is small. MES and ECE
perform next best, while MES has small improvement over ECE in the node type distri-
bution. Finally, the sampling quality of RNS and RES are less stable. RNS outperforms
other methods only when the sample size is large.
The remainder of the paper is organized as follows. The related work is discussed
in Section 2. The problem statement is formally defined in Section 3. The detailed
implementation of the five sampling algorithms is described in Section 4. In Section 5,
we show the experimental results. Finally, the paper is concluded in Section 6.

2 Related Work
As the scale of social network data is getting very large, graph sampling is a useful
technique to collect a smaller subgraph, without visiting the entire original network,
but with some properties of the original network preserved. Krishnamurthy et al. [12]
found that a simple random node selection to a 30% sampling size is already able to
preserve some properties of an undirected graph. Leskovec and Faloutsos [7] provided
a survey on three major kinds of sampling methods on graphs: sampling by random
node selection, sampling by random edge selection and sampling by exploration. The
sampling quality of preserving many graph properties such as degree distribution, hop-
plot, and clustering coefficient are examined. Moreover, they proposed a Forest Fire
sampling algorithm, which preserves the power law property of a network very well
during the sampling process. They concluded that there is no perfect sampling method
for preserving every property under any conditions. The sampling performance depends
on different criteria and graph structures. Hübler et al. [8] further proposed Metropolis
algorithms to obtain representative subgraphs. However, all these sampling works did
not concern the heterogeneous network where each node may have its own attribute
indicating a specific group membership or type. The link (edge) may be a connection
between two nodes of the same or different types. The type distribution of nodes and
the proportion of intra/inter-connection links is also key information to understand the
heterogeneous social network, which, to the best of our knowledge, has not yet been
addressed in the previous graph sampling works in the data mining community.
To sample the type distribution in a heterogeneous network, we further introduce
the Respondent-Driven Sampling (RDS) proposed by Heckathorn [10]. RDS is a well-
known sampling approach for studying the hidden population, which combines snow-
ball sampling and the Markov Chain process to produce unbiased sampling results for
the hidden population. Furthermore, a newer estimator for the sampling results is de-
signed [13] based on the reciprocity model assumption, i.e., the number of edges from
group A to group B is equal to that from group B to group A in a directed graph with two
groups (an undirected graph naturally complies with this assumption). In the real case
of directed heterogeneous network, however, this assumption does not usually hold.
For example, the Twitter data sets we use in the experiments do not have the reciprocity
property between the tweets among users. In our study, as a result, we apply and sim-
ulate the original RDS [10] to sample the type distribution in a large heterogeneous
social network.

3 Problem Statement

Given a graph G =< V, E >, V denotes a set of n vertexes (nodes, individuals) vi and
E is a set of m directed edges (link, relationships) ei . First, we define the heterogeneous
graph which models the heterogeneous social network we are interested in.

Definition 1. A heterogeneous graph G with k types is a graph where each node
belongs to exactly one of k types. More specifically, given a finite set
L = {L1, . . . , Lk} denoting the k types, the type of each node vi is T(vi) = Li,
where Li ∈ L. Suppose the number of vertexes of G is n and the number of nodes
belonging to type Li is Ni; then the condition Σ_{i=1}^{k} Ni = n must hold. In
other words, (nodes ∈ Li) ∩ (nodes ∈ Lj) = ∅ for i ≠ j.

The edges between nodes of different types are defined as follows.

Definition 2. An edge ei connecting two nodes vi and v j is an intra-connection edge if


T (vi ) = T (v j ). Otherwise, it is an inter-connection edge.

With the above two definitions, our problem statements are presented as follows.

Problem 1. Type distribution preserving goal Given a desired number of nodes, i.e.,
the sample size, a subgraph Gs is generated by some sampling method. The type distri-
bution of Gs , Dist(Gs ), is expected to be the same as that of the original graph G. That
is, d(Dist(Gs ), Dist(G)) = 0, where d() denotes the difference between two distribu-
tions. In other words, the percentage of each Ni in Gs is expected to be the same as that
of G.

Problem 2. Intra-relationship preserving goal Given a desired number of nodes, i.e.,


the sample size, a subgraph Gs is generated by some sampling method. The ratio of the
intra-connection numbers to the total edges should be preserved. That is,

d(IR(Gs ), IR(G)) = 0.

On the other hand, the inter-relationship is equal to 1 − IR(Gs) which is also preserved.

An example illustrates these two problems. Consider a social network with 180
nodes (n = 180) and 320 edges (m = 320). Suppose there are 3 groups (k = 3)
containing 20, 100, and 60 people respectively. The type distribution of the
network, Dist(G), is then (0.11, 0.56, 0.33). Also suppose there are 200
intra-connection edges; the intra-connection ratio is thus 0.625. Our goal is to find
a sampling method that preserves the type distribution and the intra-connection
ratio best. Suppose that a subgraph Gs is sampled under a 10% sampling rate,
i.e., 18 nodes. If the numbers of nodes of groups 1, 2, and 3 are 5, 8, and 5, then
the sampled type distribution is (0.28, 0.44, 0.28). In addition, if there are 30
intra-connection edges out of 50 sampled edges, the sampled intra-connection
ratio is 0.6. In the experiment section, we provide several indexes to compute the
difference between these distributions and ratios.
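The two quantities to be preserved can be computed directly from a typed graph, as in the short Python sketch below; on the example above it would give (up to rounding) Dist(G) = (0.11, 0.56, 0.33) and IR(G) = 0.625. Representing the graph as a type dictionary and an edge list is an assumption made for this illustration.

from collections import Counter

def type_distribution(node_types):
    # node_types: dict node -> type label. Returns dict type -> fraction of nodes.
    counts = Counter(node_types.values())
    n = len(node_types)
    return {t: c / n for t, c in counts.items()}

def intra_connection_ratio(node_types, edges):
    # edges: iterable of (u, v) pairs; an edge is intra-connection if both ends share a type.
    edges = list(edges)
    intra = sum(1 for u, v in edges if node_types[u] == node_types[v])
    return intra / len(edges) if edges else 0.0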

4 Sampling Algorithms

The five algorithms for sampling large heterogeneous networks described below are divided into three categories: random-based sampling, chain-referral sampling and indirect inference sampling. For random-based sampling, we conduct Random Node Sampling and Random Edge Sampling. Chain-referral sampling includes Ego-Centric Exploration Sampling and Multiple Ego-Centric-Exploration Sampling. Finally, we adopt Respondent-Driven Sampling, an indirect sampling method that originated in the social sciences.

4.1 Random-Based Sampling

Random Node Sampling (RNS) is an intuitive procedure that selects the desired number of nodes uniformly at random from the given graph. RNS first picks a set of nodes into a list. It then constructs the vertex-induced subgraph by checking whether edges exist between the selected nodes in the original graph.
The logic of Random Edge Sampling (RES) is also intuitive and similar to RNS, except that it selects edges uniformly at random. Once an edge is selected during the sampling process, the two nodes it connects, head and tail, are also included. Note that if the number of nodes would exceed the desired one when the latest edge is selected, one of its endpoints, say the head node, is excluded.
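As one possible reading of the two procedures, the sketch below implements RNS and RES for a graph given as a node collection and a directed edge list; the head-node exclusion rule in RES follows the description above, but the exact tie-breaking is our assumption.

```python
import random

def random_node_sampling(nodes, edges, sample_size):
    """RNS: pick nodes uniformly at random, keep the vertex-induced subgraph."""
    picked = set(random.sample(list(nodes), sample_size))
    induced = [(u, v) for u, v in edges if u in picked and v in picked]
    return picked, induced

def random_edge_sampling(nodes, edges, sample_size):
    """RES: pick edges uniformly at random and include their endpoints
    until the node budget is reached."""
    picked_nodes, picked_edges = set(), []
    for u, v in random.sample(list(edges), len(edges)):
        if len(picked_nodes) >= sample_size:
            break
        candidate = picked_nodes | {u, v}
        if len(candidate) > sample_size:
            candidate.discard(u)          # drop the head node, as in the text
        else:
            picked_edges.append((u, v))
        picked_nodes = candidate
    return picked_nodes, picked_edges
```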

4.2 Chain-Referral Sampling

Chain-referral sampling is also known as exploration-based sampling. First, we illustrate the Ego-Centric Exploration Sampling (ECE) method proposed by Ma et al. [9]; then we propose a variation that improves on ECE.
Essentially, ECE is based on the Random Walk (RW) methods [14]. Starting from a random node, RW chooses exactly one neighbor of that start node as the next stop. Repeating the same step, RW visits as many nodes as the desired sample size. The visited nodes and the edges along the walking path are collected. Similar to RW, ECE first randomly chooses a starting node called the ego. Then, in contrast to RW where only one neighbor is considered, ECE examines every neighbor of the ego and selects each one independently with probability p. The number of nodes chosen is thus expected to be p times the degree of the ego. Next, each newly selected node becomes a new ego and the algorithm repeats the same step iteratively until the desired sample size is reached. Whenever the sampling process cannot move to the next wave, we select a new ego and restart this procedure to continue the sampling.
Consider the case where the starting ego of ECE falls into a strongly connected component in which individuals tend to be connected to those of the same type, e.g., the same race or nationality. This may trap ECE into sampling only one or very few types of nodes. To deal with this issue, we further propose the Multiple Ego-centric-exploration Sampling (MES) method, which starts the sampling with multiple egos. In this way, we have a better chance of reaching nodes of different types and can avoid the bias of over-sampling nodes of a particular type.
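A hedged sketch of ECE and MES, assuming the graph is stored as an adjacency mapping from each node to its list of neighbors and that the sample size does not exceed the number of nodes; setting num_egos = 1 gives ECE, while num_egos > 1 gives MES.

```python
import random
from collections import deque

def ego_centric_sampling(adj, sample_size, p=0.8, num_egos=1):
    """ECE/MES: expand from egos, keeping each neighbor of the current
    ego independently with probability p; restart when stuck."""
    nodes = list(adj)
    sampled = set(random.sample(nodes, num_egos))
    frontier = deque(sampled)
    while len(sampled) < sample_size:
        if not frontier:  # cannot move to the next wave: pick a fresh ego
            ego = random.choice([v for v in nodes if v not in sampled])
            sampled.add(ego)
            frontier.append(ego)
        ego = frontier.popleft()
        for nb in adj[ego]:
            if len(sampled) >= sample_size:
                break
            if nb not in sampled and random.random() < p:
                sampled.add(nb)
                frontier.append(nb)
    return sampled
```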

Although the chain-referral sampling algorithms can both produce a reasonably connected subgraph and preserve community structure, a "rich get richer" flavor is inherent in this family of sampling techniques.

4.3 Respondent-Driven Sampling


To study hidden populations in social science, Respondent-Driven Sampling (RDS) [10], a non-probability method, has been proposed. Generally, RDS contains two phases: the snowball sampling phase and the Markov Chain process. Snowball sampling works similarly to ECE/MES in that future samples are recruited from acquaintances of the current subject. To compensate for collecting the data in a non-random way, the second phase of RDS, the Markov Chain process, helps to generate unbiased estimates. As opposed to conventional sampling methods, the statistics are not obtained directly from the samples, but are indirectly inferred from the social network information constructed through them.
We simulate the snowball sampling phase of RDS as follows. First, the initial seeds, or individuals, must be chosen from a limited number of convenience samples; we simply select these initial seeds at random. In the original RDS, each chosen seed is rewarded to encourage further recruiting; here, we simply make all recruited nodes continue to recruit their peers. In addition, we set a coupon limit, the maximum number of peers an individual can recruit, to prevent the sampling from being biased in favor of individuals who have many acquaintances.
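The following sketch simulates this snowball phase under the assumptions just stated (random seeds, a coupon limit, all recruits keep recruiting); it also records (recruiter type, recruit type) pairs from which the recruitment matrix of the next step can be tallied.

```python
import random
from collections import deque

def rds_snowball(adj, node_types, sample_size, num_seeds=5, coupon_limit=5):
    """Snowball phase of RDS: each sampled individual recruits at most
    `coupon_limit` of its unsampled acquaintances."""
    seeds = random.sample(list(adj), num_seeds)
    sampled, queue, recruitments = set(seeds), deque(seeds), []
    while queue and len(sampled) < sample_size:
        recruiter = queue.popleft()
        peers = [v for v in adj[recruiter] if v not in sampled]
        for peer in random.sample(peers, min(coupon_limit, len(peers))):
            if len(sampled) >= sample_size:
                break
            sampled.add(peer)
            queue.append(peer)
            recruitments.append((node_types[recruiter], node_types[peer]))
    return sampled, recruitments
```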
Then, we simulate the Markov Chain process. Suppose there are in total k types of people in the network we study. From the collected samples we can organize a k-by-k recruitment matrix M, where the element S_{i,j} of M represents the percentage of type-j people among those recruited by people of type i. An example is illustrated by the following sample matrix.
M = \begin{pmatrix} S_{1,1} & \cdots & S_{1,k} \\ \vdots & \ddots & \vdots \\ S_{k,1} & \cdots & S_{k,k} \end{pmatrix}
Suppose that the recruiting reaches an equilibrium state as more samples are recruited beyond those we currently have. That is, the type distribution will stabilize at E = (E_1, ..., E_i, ..., E_k), where E_i is the proportion of type i at equilibrium. The law of large numbers for regular Markov Chain processes provides a way of computing the equilibrium state of M. It is computed by solving the following linear equations.

E_1 + E_2 + · · · + E_k = 1
S_{1,1}E_1 + S_{2,1}E_2 + · · · + S_{k,1}E_k = E_1
S_{1,2}E_1 + S_{2,2}E_2 + · · · + S_{k,2}E_k = E_2
  ...
S_{1,k−1}E_1 + S_{2,k−1}E_2 + · · · + S_{k,k−1}E_k = E_{k−1}.

For instance, if there are only two groups in a social network, Male (m) and Female (f), the solution is E_m = S_{fm} / (1 − S_{mm} + S_{fm}) and E_f = 1 − E_m, thus providing the type distribution of the social network. According to [15], if M is a regular transition matrix, there is a unique E. Also, M raised to the power N, M^N, approaches a probability matrix in which every row is identical and equal to E. Therefore, we can alternatively estimate E by finding a large enough N that makes M^N converge and taking any row of it.
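As a small illustration (with a made-up two-group recruitment matrix, not data from the paper), the equilibrium E can be estimated either by solving the linear system or, as described above, by raising the row-stochastic matrix M to a large power until its rows agree.

```python
import numpy as np

def equilibrium_distribution(M, tol=1e-9, max_iter=100):
    """Estimate E by repeated squaring of the recruitment matrix M
    until M^N converges; every row of the limit equals E."""
    Mn = np.asarray(M, dtype=float)
    for _ in range(max_iter):
        nxt = Mn @ Mn
        if np.allclose(nxt, Mn, atol=tol):
            break
        Mn = nxt
    return Mn[0]

# Hypothetical two-group example (rows: recruiter type m, f).
M = [[0.6, 0.4],   # S_mm, S_mf
     [0.3, 0.7]]   # S_fm, S_ff
E_m = equilibrium_distribution(M)[0]   # = S_fm / (1 - S_mm + S_fm) ≈ 0.4286
```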

5 Evaluation
In this section, we introduce the data sets we used and present our experimental results. We show the results of all five sampling methods for both the type distribution preserving goal and the intra-relationship preserving goal, and we also discuss the effects on the statistics as the number of types and the sample size are varied. The sampling probability p was set to 0.8 for ECE and MES based on the suggestion in [12]. The coupon limit for RDS was set to 5. We implemented all algorithms in VC++ and ran them on a PC equipped with 2.66 GHz dual CPUs and 2 GB of memory. Moreover, we ran each experiment 200 times for each setting and computed the average to obtain stable and valid results.

5.1 Twitter Data Sets


We conducted our experiments on the Twitter data sets provided in [11]. This data set contains about 10.5 million tweets from 200,000 users, collected between 2006 and 2009, along with user information such as time zone and location. The complete following/follower information constitutes the directed heterogeneous graph we used. However, we kept only those users with a location label. Starting from those 200,000 user IDs, we included all their neighbors that had valid location labels. We therefore constructed a social network containing more than 200,000 nodes. The resulting social network consists of n = 403874 accounts and e = 689541 relationships (tweeting behavior between followers and followees) among those accounts.
According to the location label that records the city/country of each user, we divide the users into several types by geographical area. The division has three settings: 3, 5, and 7 types (groups); the terms group and type are used interchangeably in the following. The characteristics of these settings are listed in Table 1. The 7 groups are: East Asia, West Asia, Europe, North America, South America, Australia and Africa. The overall intra-connection ratio is 0.303, which means that about 30% of users contact users of the same region type. In the second setting, the 5 region groups are Asia, Europe, America, Australia and Africa; the overall intra-connection ratio is 0.466. In the third setting, the 3 region groups are Asia, Europe and America; the overall intra-connection ratio is 0.486, meaning that almost half of the users contact those in the same area. Detailed information for the different groups is presented in Table 1.

5.2 Evaluation Index


For the type distribution preserving goal, we used two statistics to measure the type
distribution difference between the subgraph Gs extracted by the five algorithms and

Table 1. Summary of the Twitter data sets

7-group setting              group 1   group 2   group 3   group 4   group 5   group 6   group 7
  group ratio                0.24      0.246     0.149     0.196     0.142     0.023     0.004
  node count                 97053     99177     60318     79290     57357     9206      1473
  intra-connection ratio     0.324     0.332     0.335     0.265     0.258     0.209     0.02
  intra-edge count           55943     53170     33132     38360     21558     2798      70
  edge count                 185242    160094    98819     144920    83531     13378     3530

5-group setting              group 1   group 2   group 3   group 4   group 5
  group ratio                0.486     0.149     0.338     0.023     0.004
  node count                 196230    60318     136647    9206      1473
  intra-connection ratio     0.574     0.335     0.381     0.209     0.02
  intra-edge count           198306    33132     86943     2798      70
  edge count                 345336    98819     228451    13378     3530

3-group setting              group 1   group 2   group 3
  group ratio                0.509     0.153     0.381
  node count                 205436    61791     136647
  intra-connection ratio     0.598     0.334     0.381
  intra-edge count           214351    34149     86943
  edge count                 358714    102349    228451

the original graph G. First, the Error Ratio (ER) sums up the proportion difference over all types. It is defined as ER = (∑_{i=1}^{k} |O(i) − E(i)|) / (2 · SN), where O(i) is the number of nodes in the i-th group of Gs, E(i) is the theoretical number of nodes the sampled graph should contain according to type i's real proportion in G, and SN is the sample size. Another evaluation statistic is the D-statistic of the Kolmogorov-Smirnov test. We simply use it as an index rather than conducting a hypothesis test. The D-statistic, which measures the agreement between two distributions, is defined as D = sup_x |F'(x) − F(x)|, where F'(x) is the type distribution of Gs and F(x) is that of G. ER provides a percentage-like form of the total error between the type distributions of Gs and G, whereas the D-statistic provides information about the cumulative error within the structure of Gs and G.
For the intra-relationship preserving goal, we used the Intra-Relation Error (IRE) to measure the difference in the intra-relationship ratio between Gs and G. It is defined as IRE = |I'/m' − I/m|, where I' and I denote the number of intra-connection edges in Gs and G respectively, and m' and m are the total number of edges in Gs and G respectively.
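A direct transcription of the three indices into Python (our own helper names; the D-statistic assumes the k groups are compared in a fixed order via their cumulative sums):

```python
import numpy as np

def error_ratio(sample_counts, true_proportions, sample_size):
    """ER = sum_i |O(i) - E(i)| / (2 * SN)."""
    O = np.asarray(sample_counts, dtype=float)
    E = np.asarray(true_proportions, dtype=float) * sample_size
    return np.abs(O - E).sum() / (2 * sample_size)

def d_statistic(sample_proportions, true_proportions):
    """D = sup_x |F'(x) - F(x)| over the cumulative type distributions."""
    F_s = np.cumsum(sample_proportions)
    F = np.cumsum(true_proportions)
    return float(np.max(np.abs(F_s - F)))

def intra_relation_error(intra_s, total_s, intra, total):
    """IRE = |I'/m' - I/m|."""
    return abs(intra_s / total_s - intra / total)
```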

5.3 Results of the Type Distribution Preserving Goal


This goal is to keep the type distribution of the sampled graph Gs as similar as possible to that of the original graph G. The sample size varied from 50 to 200000 nodes, i.e., a sampling rate of about 0.1% to 50%. The experimental results in Fig. 1(a) and (b) show the error in the type distribution for the 7-group Twitter data set. In general, the error decreased as the sample size increased. Fig. 1(a) shows the ER of all five sampling methods; we found that RDS performed best when the sample size was very small, but improved slowly at large sample sizes. Because of the fast convergence of the Markov transition, RDS can provide accurate results even when the information from the samples is limited. However, since the Markov process converged very quickly, the result was essentially determined once a sufficient number of nodes had been reached, and the subsequently selected entities failed to improve the accuracy further. On the other hand, MES outperformed ECE

Fig. 1. ER and D-statistic at different sample sizes when the group number is 3, 5 and 7 (panels (a)/(c)/(e): Error Ratio vs. sample size for 7, 5 and 3 groups; panels (b)/(d)/(f): D Statistic vs. sample size; curves: RNS, RES, ECE, MES, RDS).

only at small sample sizes. This was because MES could avoid getting stuck among the members of a particular group and thus provided more accurate results. As the sample size increased, ECE had a higher chance of traveling from group to group and thus gave results similar to MES. Finally, RNS and RES behaved very unstably and were sensitive to the sample size. When the sample size was very small, both methods produced poor results; however, they improved significantly and obtained the best sampling results when the sample size was large enough. In Fig. 1(b), we found similar behavior patterns for all sampling methods except for RDS at small sample sizes. This indicated that RDS relies heavily on the information provided by the recruitment matrix: when the sample size was very small, the recruitment matrix could not provide enough information for the Markov Chain process and thus produced a worse result.

For the 5-group Twitter data set, the patterns of all five methods were similar to those of the 7-group data set, as shown in Fig. 1(c) and (d). This is also true for the 3-group setting, as shown in Fig. 1(e) and (f). Only at small sample sizes did the results show that the error decreased as the number of groups became smaller. We further discuss these results in Section 5.5.
Note that the results were similar for both ER and the D-statistic. This is due to the properties of the Twitter data sets. Since ER is an index measuring the total error, it is sensitive to the performance on the largest or relatively large groups. On the other hand, the D-statistic measures the cumulative error, which in most cases is also dominated by the larger groups. For these reasons, we observed similar patterns between ER and the D-statistic.

5.4 Results of the Intra-Relationship Preserving Goal

Our second goal is to preserve the relationships among different groups in a network. Fig. 2 presents the experimental results for this goal. We found that RDS produced the best result even at small sample sizes. This indicates that the sampling phase of RDS not only provided the network information to the Markov Chain process, but also preserved the relationship information (different tie types) to some extent. Still, its improvement slowed down when the sample size became very large. On the other hand, MES had slightly higher errors than ECE at small sample sizes. Since the original purpose of MES is to avoid the sampling bias of the chain-referral procedure in the type distribution, it does not consider the relationships among individuals (edges of the graph). However, we can observe the advantage of MES as the sample size increases. RES outperformed RNS since it is an edge-based random selection and thus has an advantage over node-based random selection here. Finally, RNS failed to describe the relationships among individuals at small sample sizes, since it tended to produce a set of unconnected nodes, especially when the network was sparse, which drove the intra-connection ratio to 0. However, the situation changed as the sample size increased. Because RNS performs a vertex-induced procedure after sampling enough nodes into the sample pool, this process includes both in-edges and out-edges between two selected nodes. Therefore, more selected edges resulted in better performance on the intra-relationship preserving goal. We omit the results of the 5-group data set due to the space limit; its IRE values were between those of the 7-group and 3-group settings.

5.5 Analysis on the Effects of the Number of Groups and the Sample Size

Here we provide some remarks on the performance for different numbers of groups. The sample size chosen here was 100. We only present ER in Fig. 3(a) and omit the results for the D-statistic since they show similar patterns. We found that both ER and the D-statistic are positively affected by the number of groups (k). This is reasonable: the larger k is in a social network, the more errors we observe, leading to a lower accuracy. On the other hand, as shown in Fig. 3(b), the number of groups k is almost independent of the intra-relationship error. Note that since RNS cannot sample any edge in the small

Fig. 2. IRE at different sample sizes when the group number is 7 and 3 (panels (a) and (b): Intra-Relation Error vs. sample size for RNS, RES, ECE, MES, RDS).

Fig. 3. Error Ratio and IRE at different group numbers (panel (a): group number versus ER; panel (b): group number versus IRE; bars for RNS, RES, ECE, MES, RDS at k = 3, 5, 7).

sample setting, its IRE equals the IR of the original graph. As a result, the IRE of RNS in Fig. 3(b) differs significantly across the different group settings.
One of the most important issues in the sampling problem is how large a sample size should be chosen to obtain sufficiently accurate results with respect to our two preservation goals. According to all of our experimental results, Fig. 1 and Fig. 2, we conclude that 15% is a good operating point. We found that when the sample size grew to more than 15% (around 60000 nodes) of the population, all statistics were below 0.05 no matter which sampling method was used. In other words, the sampling quality improved only marginally when the sample size was even larger (up to 50% in this study). Although the purposes and research targets are different, our finding is similar to that in [7], which reached the same conclusion.

6 Conclusion
In this study, we proposed a novel and meaningful sampling problem, comprising the type distribution preserving and the intra-relationship preserving problems, on heterogeneous social networks, and evaluated five algorithms for it. In preserving the type distribution, we found that RDS was a good method, especially at small sample

sizes. MES helped ECE a little at small sample sizes. In addition, the random-based methods were sensitive to the sample size and failed to provide reasonable results at small sample sizes. For the link-relationship preserving goal, we reached a similar conclusion, with some differences as discussed above. Furthermore, we discussed the results under different numbers of groups. Finally, based on our findings we provided a rule of thumb that a 15% sample size should be large enough for both the type distribution preserving and the intra-relationship preserving sampling problems.

References
1. Navlakha, S., Rastogi, R., Shrivastava, N.: Graph summarization with bounded error. In:
Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 419–432 (2008)
2. Gibson, D., Kumar, R., Tomkins, A.: Discovering large dense subgraphs in massive graphs.
In: Proc. of Int. Conf. on Very Large Data Bases, p. 732 (2005)
3. Raghavan, S., Garcia-Molina, H.: Representing web graphs. In: Proc. of IEEE Int. Conf. on
Data Engineering, pp. 405–416 (2003)
4. Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Extracting large-scale knowledge
bases from the web. In: Proc. of Int. Conf. on Very Large Data Bases, pp. 639–650 (1999)
5. Li, C.T., Lin, S.D.: Egocentric Information Abstraction for Heterogeneous Social Networks.
In: Proc. of Int. Conf. on Advances in Social Network Analysis and Mining, pp. 255–260
(2009)
6. Tian, Y., Hankins, R., Patel, J.: Efficient aggregation for graph summarization. In: Proc. of
ACM SIGMOD Int. Conf. on Management of Data, pp. 567–580 (2008)
7. Leskovec, J., Faloutsos, C.: Sampling from large graphs. In: Proc. of ACM SIGKDD Int.
Conf. on Knowledge Discovery and Data Mining, p. 636 (2006)
8. Hübler, C., Kriegel, H., Borgwardt, K., Ghahramani, Z.: Metropolis algorithms for represen-
tative subgraph sampling. In: Proc. of IEEE Int. Conf. on Data Mining, pp. 283–292 (2008)
9. Ma, H., Gustafson, S., Moitra, A., Bracewell, D.: Ego-centric Network Sampling in Viral
Marketing Applications. In: Int. Conf. on Computational Science and Engineering, pp. 777–
781 (2009)
10. Heckathorn, D.: Respondent-driven sampling: a new approach to the study of hidden popu-
lations. Social problems 44, 174–199 (1997)
11. Choudhury, M.D.: Social datasets by munmun de choudhury (2010),
http://www.public.asu.edu/~mdechoud/datasets.html
12. Krishnamurthy, V., Faloutsos, M., Chrobak, M., Lao, L., Cui, J.-H., Percus, A.G.: Re-
ducing large internet topologies for faster simulations. In: Boutaba, R., Almeroth, K.C.,
Puigjaner, R., Shen, S., Black, J.P. (eds.) NETWORKING 2005. LNCS, vol. 3462, pp. 328–
341. Springer, Heidelberg (2005)
13. Heckathorn, D.: Respondent-driven sampling II: deriving valid population estimates from
chain-referral samples of hidden populations. Social Problems 49, 11–34 (2002)
14. Lovász, L.: Random walks on graphs: A survey. Combinatorics, Paul Erdos is Eighty 2, 1–46
(1993)
15. Kemeny, J.G., Snell, J.L.: Finite Markov Chains, pp. 69–72. Springer, Heidelberg (1960)
Ant Colony Optimization with Markov Random
Walk for Community Detection in Graphs

Di Jin^{1,2}, Dayou Liu^1, Bo Yang^1, Carlos Baquero^2, and Dongxiao He^1

^1 College of Computer Science and Technology, Jilin University, Changchun
{jindi.jlu,hedongxiaojlu}@gmail.com, {liudy,ybo}@jlu.edu.cn
^2 CCTD/DI, University of Minho, Braga, Portugal
cbm@di.uminho.pt

Abstract. The network clustering problem (NCP) is the problem associated with the detection of network community structures. Building on Markov random walks, we address this problem with a new ant colony optimization strategy, named ACOMRW, which improves prior results on the NCP and does not require knowledge of the number of communities present in a given network. Ant colony optimization is taken as the basic framework of the ACOMRW algorithm. At each iteration, a Markov random walk model is taken as the heuristic rule; all of the ants' local solutions are aggregated into a global one through clustering ensemble, which is then used to update a pheromone matrix. The strategy relies on the progressive strengthening of within-community links and the weakening of between-community links. Gradually this converges to a solution where the underlying community structure of the complex network becomes clearly visible. The performance of ACOMRW was tested on a set of benchmark computer-generated networks, as well as on real-world network data sets. Experimental results confirm the validity and improvements of this approach.

Keywords: Network Clustering, Community Detection, Ant Colony Optimization, Clustering Ensemble, Markov Random Walk.

1 Introduction
Many complex systems in the real world exist in the form of networks, such as
social networks, biological networks, Web networks, etc., which are also often
classified as complex networks. Complex network analysis has been one of the
most popular research areas in recent years due to its applicability to a wide
range of disciplines [1,2,3]. While a considerable body of work addressed basic
statistical properties of complex networks such as the existence of “small world
effects” and the presence of “power laws” in the link distribution, another prop-
erty that has attracted particular attention is that of “community structure”:
nodes in a network are often found to cluster into tightly-knit groups with a
high density of within-group edges and a lower density of between-group edges
[3]. Thus, the goal of network clustering algorithms is to uncover the underlying
community structure in given complex networks.


Research on complex network clustering problems is of fundamental importance. It has both theoretical significance and practical applications in analyzing network topology, comprehending network function, unfolding network patterns and forecasting network activities. It has been used in many areas, such as terrorist organization recognition, organization management, biological network analysis, Web community mining, etc. [4].
So far, many network clustering algorithms have been developed. In terms of the basic strategies they adopt, they fall into two main categories: optimization-based and heuristic-based methods. The former solve the NCP by transforming it into an optimization problem and trying to find an optimal solution for a predefined objective function, such as the network modularity employed in several algorithms [5,6,7,8]. In contrast, there are no explicit optimization objectives in the heuristic-based methods, which solve the NCP based on intuitive assumptions or heuristic rules, such as in the Girvan-Newman (GN) algorithm [3], the Clique Percolation Method (CPM) [9], Finding and Extracting Communities (FEC) [10], Community Detection with Propinquity Dynamics (CDPD) [11], Opinion Dynamics with Decaying Confidence (ODDC) [12], etc.
Although many network clustering algorithms have been proposed, how to further improve clustering accuracy, and especially how to discover a reasonable network community structure without prior knowledge (such as the number of clusters in the network), is still an open problem. To address this problem, a random walk based ant colony optimization for the NCP, inspired by [10], is proposed in this paper. In this algorithm, each ant detects its community by using the transition probability of a random walk as the heuristic rule; in each iteration, all the ants collectively produce the current solution via the concept of clustering ensemble [13] and update their pheromone matrix by using this solution; finally, after the algorithm has converged, the pheromone matrix is analyzed in order to obtain the clustering solution for the target network.

2 Algorithm

2.1 The Main Idea

Let N = (V, E) denote a network, where V is the set of nodes (or vertices) and E is the set of edges (or links). Let a k-way partition of the network be defined as π = {N_1, N_2, . . . , N_k}, where N_1, N_2, . . . , N_k satisfy ∪_{1≤i≤k} N_i = N and ∩_{1≤i≤k} N_i = ∅. If partition π has the property that within-community edges are dense and between-community edges are sparse, it is called a well-defined community structure of this network.
In a network, let p_{ij} be the probability that an agent freely walks from any node i to its neighbor node j within one step; this is also called the transition probability of a random walk. In terms of the adjacency matrix of N, A = (a_{ij})_{n×n}, p_{ij} is defined by

p_{ij} = a_{ij} / ∑_r a_{ir}.   (1)


Let D = diag(d1 , . . . , dn ) where di = j aij denotes the degree of node i. Let
P be the transition probability matrix of random walk, we have

P = D−1 ∗ A. (2)
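As a quick numerical illustration of Eqs. (1) and (2), the transition matrix can be obtained by row-normalizing the adjacency matrix (the explicit inverse of D is unnecessary in practice):

```python
import numpy as np

def random_walk_matrix(A):
    """P = D^{-1} A: divide each row of the adjacency matrix by the node degree."""
    A = np.asarray(A, dtype=float)
    return A / A.sum(axis=1, keepdims=True)
```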

From the view of a Markov random walk, when a complex network has community structure, a random walk agent should find it difficult to move outside its own community boundary, whereas it should be easy for it to reach other nodes within its community, as link density within a community is, by definition, high. In other words, the probability of remaining in the same community, that is, the probability that an agent starting from any node stays in its own community after freely walking some number of steps, should be greater than the probability of leaving for a different community.
It’s for the reason above, that in this ant colony optimization (ACO) strategy,
each ant (only different from agent in the sense that it can consult and update
a “pheromone” variable in each link) takes the transition probability of random
walk as heuristic rule and is directed by pheromone to find its solution. At each
iteration, the solution found by each ant only expresses its local view, but one
can derive a global solution when all of the ants’ solutions are aggregated to one
through a clustering ensemble, which will be used to update the pheromone ma-
trix. As the process evolves, the cluster characteristic of the pheromone matrix
will gradually become sharper and algorithm ACOMRW converges to a solu-
tion where the community structure can be accurately detected. In short, the
pheromone matrix can be regarded as the final clustering result which aggregates
the information of all the ants at all iterations in this algorithm.
To further clarify the above idea, an intuitive description is presented as follows. Given a network N with community structure, one lets some ants crawl freely along the links of the network. The ants have a given life-cycle, and a new ant colony is generated immediately when all of the former ants die. At the beginning of the algorithm, pheromone has no impact on network N yet. Owing only to the restriction imposed by the community structure, an ant's probability of remaining in its own community should be greater than that of moving to other communities, but at this moment there is no difference between these ants and random walk agents, since the pheromone distribution is still homogeneous. As the ants move, with the accumulation and evaporation of pheromone left by former ants, the pheromone on within-community links becomes thicker and thicker, while the pheromone on between-community links becomes thinner and thinner. In fact, pheromone is simply a mechanism that registers past walks in the network and leads to more informed decisions in subsequent walks. This strengthens the tendency of any ant to remain in its own community. Finally, when the pheromone matrix converges, the clustering result of network N emerges naturally. In a word, the idea behind ACOMRW is that, by strengthening within-community links and weakening between-community links, the underlying community structure of the network gradually becomes visible.

2.2 Algorithm Description

A Solution by One Ant. Each ant leads to a solution in ant colony optimization, so a solution produced by one ant should be a clustering solution of the network for the NCP.
Given network N = (V, E), consider a stochastic process defined on network
N with pheromone, in which an ant freely crawls from one node to another
along the links between them. After the ant arrives at one node, directed by
pheromone left by the former ants on the links, it will rationally select one of its
neighbors and move there.
Let X = {X_t, t ≥ 0} denote the ant's positions, and P{X_t = i}, 1 ≤ i ≤ n, be the probability that the ant arrives at node i after walking t steps. For all i_t ∈ V we have P{X_t = i_t | X_0 = i_0, X_1 = i_1, . . . , X_{t−1} = i_{t−1}} = P{X_t = i_t | X_{t−1} = i_{t−1}}. That is, the next state of the ant is decided only by its previous state, which is the Markov property. So this stochastic process is a discrete Markov chain whose state space is V. Furthermore, X_t is homogeneous because P{X_t = j | X_{t−1} = i} = m_{ij}, where m_{ij} is the ant's transition probability from node i to node j.
Let the transition probability matrix of the random walk, which is regarded as the heuristic rule, be P = (p_{ij})_{n×n}, and let the current pheromone matrix be B = (b_{ij})_{n×n}. Then the probability m_{ij} that an ant walks from node i to its neighbor node j within one step can be computed by Eq. (3), and the transition probability matrix of the ants is M = (m_{ij})_{n×n}.

m_{ij} = b_{ij} p_{ij} / ∑_r b_{ir} p_{ir}.   (3)

Consider the Markov dynamics of each ant described above. Let the start position of one ant be node s, the step number limitation be l, and V_s^t denote the t-step (t ≤ l) transition probability distribution of the ant, in which V_s^t(j) denotes the probability that this ant walks from node s to node j within t steps. Initially, V_s^0 = (0, . . . , 0, 1, 0, . . . , 0), where V_s^0(s) = 1. Directed by matrix M, and also taking into account the influence of the power-law degree distribution of complex networks, V_s^t is given by

V_s^t = V_s^{t−1} M^T.   (4)
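A minimal numpy sketch of Eqs. (3) and (4): combining the heuristic matrix P with the pheromone matrix B gives the ants' transition matrix M, and the l-step distribution of an ant started at node s is obtained by repeated multiplication (the "set the top t + 1 entries to 1" adjustment used in Produce_V below is omitted here).

```python
import numpy as np

def ant_transition_matrix(P, B):
    """Eq. (3): element-wise product of heuristic and pheromone, row-normalized."""
    W = P * B
    return W / W.sum(axis=1, keepdims=True)

def visit_distribution(s, M, l):
    """Eq. (4): V_s^l computed from the indicator vector of s."""
    V = np.zeros(M.shape[0])
    V[s] = 1.0
    for _ in range(l):
        V = V @ M.T
    return V
```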
In this algorithm, all the ants take the transition probability of the random walk as the heuristic rule and are directed by pheromone at the same time. Thus, as the link density within a community is in general much higher than that between communities, an ant starting from the source node s should have more paths to choose from to reach nodes in its own community within l steps, where the value of l cannot be too large. Conversely, the ant should have a much lower probability of reaching nodes outside its community. In other words, it will be hard for an ant that falls into a community to pass those "bottleneck" links and leave that community. Furthermore, as algorithm ACOMRW evolves, the pheromone on within-community links will become thicker and thicker, and the pheromone on between-community links will become thinner

and thinner. This makes the tendency of any ant to remain in its own community more and more obvious.
Here we define Eq. (5), where C_s denotes the community in which node s is situated. More formally, Eq. (5) should be satisfied better and better as the pheromone matrix evolves. When algorithm ACOMRW finally converges, Eq. (5) will be completely satisfied, and the underlying community structure will become visible. A detailed analysis of parameter l is given later.

∀ i ∈ C_s, ∀ j ∉ C_s :  V_s^l(i) > V_s^l(j).   (5)

The algorithm each ant adopts to compute its l-step transition probability distribution V_s^l is given below. It is described using Matlab-style pseudocode.

Procedure Produce_V
/* Each ant has already visited at least t + 1 nodes after any t steps,
   so the largest t + 1 elements of V are set to 1 after each intermediate step. */
input:  s   /* start position of this ant */
        B   /* current pheromone matrix */
        P   /* transition probability matrix of the random walk */
        l   /* limitation on the number of steps */
output: V   /* l-step transition probability distribution of this ant */
begin
1   V ← zeros(1, n);
2   V(s) ← 1;
3   M ← P .* B;                 /* combine heuristic and pheromone, Eq. (3) */
4   D ← sum(M, 2);
5   D ← diag(D);
6   M ← inv(D) * M;             /* row-normalize */
7   M ← M';                     /* transpose, so that V ← V * M implements Eq. (4) */
8   for i = 1 : l
9       V ← V * M;
10      if i ~= l                /* intermediate steps only */
11          [sorted_V, ix] ← sort(V, 'descend');
12          V(ix(1 : i + 1)) ← 1;
13      end
14  end
end

After obtaining V_s^l, the remaining problem is how to find the ant's solution, which should also be a clustering solution of the network. However, each ant can only indicate that it visits the nodes in its own community with high probability; the nodes with low visit probability are not necessarily in a single community and may belong to several different communities. Therefore one ant can only find its own community from its local view.
The algorithm sorts V_s^l in descending order, and then calculates the differences between adjacent elements of the sorted V_s^l, finding the point corresponding to the maximum difference. Clearly, the point corresponding to the biggest "valley" of the sorted V_s^l is the most suitable cutoff point for identifying the community of this ant. Finally, we regard the nodes whose visit probability is greater than that of the cutoff point as belonging to the same community, but we do not consider which communities the remaining nodes belong to. Thus a solution produced by one ant is its own community.
Given V_s^l, the algorithm that divides V_s^l and finds this ant's solution is as follows.

Procedure Divide_V
/* As each ant has visited at least l + 1 nodes,
   the index of the cutoff point should not be less than l + 1. */
input:  V          /* l-step transition probability distribution of this ant */
output: solution   /* solution of this ant */
begin
1   [sorted_V, ix] ← sort(V, 'descend');
2   diff_V ← -diff(sorted_V);
3   diff_V ← diff_V(l + 1 : length(diff_V));
4   [max_diff, cut_pos] ← max(diff_V);
5   cut_pos ← cut_pos + l;
6   cluster ← ix(1 : cut_pos);
7   solution ← zeros(n, n);
8   solution(cluster, cluster) ← 1;
9   I ← eye(cut_pos, cut_pos);
10  solution(cluster, cluster) ← solution(cluster, cluster) − I;
end

In the network, let the number of nodes be n and the number of edges be m. If the network is represented by its adjacency matrix, the time complexity of Produce_V is O(l·n²). Divide_V needs to sort all nodes according to their probability values; because linear-time sorting algorithms (such as bin sort and counting sort) can be adopted, the time complexity of Divide_V is O(n). Thus, the overall complexity for one ant to produce its solution is O(l·n²). However, if the network is represented by an adjacency list, the time complexity can be decreased to O(l(m + n)). As most complex networks are sparse graphs, this can be very efficient.

Algorithm ACOMRW. There are two main parts in algorithm ACOMRW:


exploration phase and partition phase. The goal of the first phase is to attain the
pheromone matrix when the algorithm converges. The goal of the second phase
is to analyze this pheromone matrix in order to get the clustering solution for
the network. The algorithm of the exploration phase is given by:

Procedure Exploration_Phase
input:  A, T, S, ρ   /* A is the adjacency matrix of the network,
                        T is the limitation on the number of iterations,
                        S is the size of the ant colony,
                        ρ is the update rate of the ants' pheromone matrix */
output: B            /* the pheromone matrix */
begin
1   D ← sum(A, 2);
2   D ← diag(D);
3   P ← inv(D) * A;              /* produce the transition probability matrix of the random walk */
4   B ← ones(n, n) / n;          /* initialize the pheromone matrix */
5   for i = 1 : T
6       solution ← zeros(n, n);
7       for j = 1 : S
8           solution ← solution + one_ant(P, B);   /* one_ant returns a solution */
9       end                       /* aggregate the local solutions of all ants into one */
10      D ← sum(solution, 2);
11      D ← diag(D);
12      solution ← inv(D) * solution;              /* normalize the solution */
13      B ← (1 − ρ) * B + ρ * solution;            /* update the pheromone matrix */
14  end
end

As we can see, at each iteration the algorithm aggregates the local solutions of all ants into a global one and then uses it to update the pheromone matrix B. As the iterations proceed, the pheromone matrix gradually evolves, which makes the ants more and more directed and makes the tendency of any ant to stay in its own community more and more obvious. When the algorithm finally converges, the pheromone matrix B can be regarded as the final clustering result, aggregating the information of all the ants over all iterations.
The next step is to analyze the produced pheromone matrix B in order to obtain the clustering solution of the network. Because of the convergence property of ACO, the cluster characteristic of matrix B is very pronounced. A simple partition phase algorithm is described as follows.

Procedure Partition_Phase
input:  B   /* pheromone matrix after the algorithm converges */
output: C   /* final clustering solution, i.e., community structure */
begin
1   Compute the cutoff value ε;   /* ε is 1/n, where n is the number of nodes */
2   Take the first row of B, and take the nodes whose values are greater
    than ε as a community;
3   From matrix B, delete all the rows and columns corresponding to the
    nodes in the above community;
4   If B is not empty, go to step 2; otherwise return the clustering solution
    C, which includes all the communities;
end

Because of the convergence properties of the exploration phase algorithm, we can use a simple algorithm for the partition phase. As is known from the convergence property of ACO, in the pheromone matrix B the rows corresponding to nodes in the same community should be equal. Therefore, by choosing any row of B, we can identify its community by using a small positive number ε as a cutoff value to divide this row.
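The sketch below mirrors this partition procedure on a converged pheromone matrix B; it is our own rendering of the steps above, and it conservatively always keeps the pivot node in the community it defines.

```python
import numpy as np

def partition_from_pheromone(B):
    """Peel off communities: the first remaining row of B defines one community
    as the nodes whose entries exceed eps = 1/n; remove them and repeat."""
    B = np.asarray(B, dtype=float)
    n = B.shape[0]
    eps = 1.0 / n
    remaining = list(range(n))
    communities = []
    while remaining:
        pivot = remaining[0]
        row = B[pivot]
        members = {v for v in remaining if row[v] > eps} | {pivot}
        communities.append(sorted(members))
        remaining = [v for v in remaining if v not in members]
    return communities
```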

2.3 Parameter Setting


This algorithm has four parameters, T, S, ρ and l, which denote the iteration number limitation, the ant colony size, the update rate of the ants' pheromone matrix and the step number limitation, respectively. The first three parameters are easy to determine; in general they can be set as T = 20, S = n (where n is the number of nodes in the network) and ρ = 0.5. However, the setting of parameter l is difficult and important.
As most social networks are small-world networks, the average distance be-
tween two nodes was shown to be around 6, according to the theory of “six
degrees of separation” [14]. For scale-free networks, the average distance is usu-
ally small too. The World Wide Web is one of the biggest scale-free networks
that we have found so far. However, the average distance of such a huge network
is actually about 19 [15]; that is, we can get to anywhere we want through 19
clicks on average. Thus, based on the above general observations, good options
for the value of l that we propose, in practice, should somehow fall in the range
6 ≤ l ≤ 19. Additionally, the l-value that we are considering is actually the aver-
age distance between nodes within any community instead of the whole network,
thus it can be even smaller.
In addition, by using this algorithm we can attain a reasonable hierarchi-
cal community structure of the network with the change of parameter l. The
experimental section of the paper will give a detailed analysis on parameter l.

3 Experiments
To quantitatively analyze the performance of algorithm ACOMRW, we tested it on both computer-generated and real-world networks. We conclude by analyzing the parameter l defined in this algorithm.
In these experiments our algorithm is compared with the GN algorithm [3], the Fast Newman (FN) algorithm [5] and the FEC algorithm [10], all of which are well-known and competitive network clustering algorithms. To compare the clustering accuracy of the different algorithms more fairly, we adopt two widely used accuracy measures: the Fraction of Vertices Classified Correctly (FVCC) [5] and Normalized Mutual Information (NMI) [16].

3.1 Computer-Generated Networks


To test the performance of ACOMRW, we adopt random networks with known
community structure, which have been used as benchmark datasets for test-
ing complex network clustering algorithms [3]. This kind of random network is

defined as RN (C, s, d, zout ), where C is the number of communities, s is the


number of nodes in each community, d is the degree of nodes in the network,
and each node has zin edges connecting it to members of its own community
and zout edges to members of other communities.
Parameter l is set to 6 for algorithm ACOMRW, and the benchmark random network RN(4, 32, 16, zout) is used in this experiment. Obviously, as zout increases from zero, the community structures of the networks become more diffuse and the resulting networks pose greater and greater challenges to network clustering algorithms. In particular, a network no longer has community structure when zout is greater than 8 [3]. Fig. 1 shows the results, in which the y-axis denotes clustering accuracy and the x-axis denotes zout. For each zout and each algorithm, we compute the average accuracy over 50 clustered random networks. As we can see from Fig. 1, our algorithm significantly outperforms the other three algorithms in terms of both accuracy measures. Furthermore, as zout becomes larger, the superiority of our algorithm becomes more significant. In particular, when zout equals 8, which means the numbers of within-community and between-community edges per vertex are the same, our algorithm still classifies 100% of the vertices into their correct communities, while the accuracy of the other algorithms is low in this case.

Fig. 1. Compare ACOMRW with GN, FN and FEC against benchmark random networks. (a) NMI as accuracy measure; (b) FVCC as accuracy measure. (Both panels plot accuracy against the number of inter-community edges per vertex zout.)

3.2 Real-World Networks


As a further test of algorithm ACOMRW, we applied it to three widely used real-world networks with known community structure. These actual networks may have different topological properties than the artificial ones. They are the well-known karate network [17], the dolphin network [18] and the football network [3]. In our algorithm, parameter l is set to 7, 3 and 11 for these three networks, respectively. The clustering results (over 50 runs) of algorithms ACOMRW, GN, FN and FEC on the three real-world networks are shown in Table 1. It can be seen that the clustering accuracy of ACOMRW is clearly higher than that of the other algorithms in terms of both accuracy measures.

Table 1. Compare ACOMRW with GN, FN and FEC on three real-world networks

karate network dolphin network football network


Algs NMI FVCC C Num NMI FVCC C Num NMI FVCC C Num
GN 57.98% 97.06% 5 44.17% 98.39% 13 87.89% 83.48% 10
FN 69.25% 97.06% 3 50.89% 96.77% 5 75.71% 63.48% 7
FEC 69.49% 97.06% 3 52.93% 96.77% 4 80.27% 77.39% 9
ACO 100% 100% 2 88.88% 98.39% 2 92.69% 93.04% 12

Fig. 2. Sensitivity analysis of parameter l. (a) Cluster number obtained by ACOMRW as a function of parameter l. (b) Q-values obtained by ACOMRW as a function of parameter l, compared with the Q-value of the real community structure. (c) Clustering solution obtained by ACOMRW varying with parameter l.

3.3 Parameter Analysis

Parameter l is very important in algorithm ACOMRW. In Sec. 2.3, we gave a reasonable indication of the range of parameter l by considering small-world and scale-free networks. Taking the dolphin network as an example, this section gives a more detailed analysis. Here we adopt the network modularity function (Q), which was proposed by Newman and has been widely accepted by the scientific community [19], as a measure of the compactness of communities. Fig. 2 shows how the cluster number, the Q value and the clustering result obtained by algorithm ACOMRW vary with parameter l. From Fig. 2(a) and (c), we find that the algorithm divides the network into 5 small, tight communities when l is small. As parameter l increases, the communities with more edges between them begin to merge. Finally, the network is divided into 2 large, tight communities. Note that in the real community structure of the dolphin network, the red nodes form one community and the blue nodes form the other. Furthermore, from Fig. 2(b), the Q values obtained by ACOMRW for different l values are all greater than that of the real community structure of this network; thus we can also say that algorithm ACOMRW can produce well-defined hierarchical community structures of a network as its parameter l changes.

4 Conclusions and Future Work


The main contribution of this paper is to propose a high-accuracy network clustering algorithm, ACOMRW. Because the real community structures of most current large-scale networks are still unknown, we have adopted computer-generated benchmark networks and some widely used real-world networks, whose community structures are known, to test its performance. In the future, we also wish to apply it to the analysis of real-world large-scale networks, and try to uncover and interpret the meaningful community structures that are expected to be found in them.

Acknowledgments. This work was supported by National Natural Science


Foundation of China under Grant Nos. 60773099, 60703022, 60873149, 60973088,
the National High-Tech Research and Development Plan of China under Grant
No. 2006AA10Z245, the Open Project Program of the National Laboratory of
Pattern Recognition, the Fundamental Research Funds for the Central Universi-
ties under Grant No. 200903177 and the Project BTG of European Commission.

References
1. Watts, D.J., Strogatz, S.H.: Collective Dynamics of Small-World Networks. Na-
ture 393(6638), 440–442 (1998)
2. Barabási, A.L., Albert, R., Jeong, H., Bianconi, G.: Power-law distribution of the
World Wide Web. Science 287(5461), 2115a (2000)

3. Girvan, M., Newman, M.E.J.: Community Structure in Social and Biological Net-
works. Proceedings of National Academy of Science 9(12), 7821–7826 (2002)
4. Santo, F.: Community Detection in Graphs. Physics Reports 486(3-5), 75–174
(2010)
5. Newman, M.E.J.: Fast Algorithm for Detecting Community Structure in Networks.
Physical Review E 69(6), 066133 (2004)
6. Guimera, R., Amaral, L.A.N.: Functional cartography of complex metabolic net-
works. Nature 433(7028), 895–900 (2005)
7. Barber, M.J., Clark, J.W.: Detecting Network Communities by Propagating Labels
under Constraints. Phys. Rev. E 80(2), 026129 (2009)
8. Jin, D., He, D., Liu, D., Baquero, C.: Genetic algorithm with local search for
community mining in complex networks. In: Proc. of the 22th IEEE International
Conference on Tools with Artificial Intelligence (ICTAI 2010), pp. 105–112. IEEE
Press, Arras (2010)
9. Palla, G., Derenyi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community
structures of complex networks in nature and society. Nature 435(7043), 814–818
(2005)
10. Yang, B., Cheung, W.K., Liu, J.: Community Mining from Signed Social Networks.
IEEE Trans. on Knowledge and Data Engineering 19(10), 1333–1348 (2007)
11. Zhang, Y., Wang, J., Wang, Y., Zhou, L.: Parallel Community Detection on Large
Networks with Propinquity Dynamics. In: Proc. the 15th ACM SIGKDD Int. Conf.
on Knowledge Discovery and Data Mining, pp. 997–1005. ACM Press, Paris (2009)
12. Morarescu, C.I., Girard, A.: Opinion Dynamics with Decaying Confidence: Appli-
cation to Community Detection in Graphs. arXiv:0911.5239v1 (2010)
13. Strehl, A., Ghosh, J.: Cluster ensembles-a knowledge reuse framework for combin-
ing partitionings. Journal of Machine Learning Research 3, 583–617 (2002)
14. Milgram, S.: The Small World Problem. Psychology Today 1(1), 60–67 (1967)
15. Albert, R., Jeong, H., Barabasi, A.L.: Diameter of the World Wide Web. Na-
ture 401, 130–131 (1999)
16. Danon, L., Duch, J., Diaz-Guilera, A., Arenas, A.: Comparing community structure
identification. J. Stat. Mech., P09008 (2005)
17. Zachary, W.W.: An Information Flow Model for conflict and Fission in Small
Groups. J. Anthropological Research 33, 452–473 (1977)
18. Lusseau, D.: The Emergent Properties of a Dolphin Social Network. Proc. Biol.
Sci. 270, S186–S188 (2003)
19. Newman, M.E.J., Girvan, M.: Finding and Evaluating Community Structure in
Networks. Phys. Rev. E 69(2), 026113 (2004)
Faster and Parameter-Free Discord Search in
Quasi-Periodic Time Series

Wei Luo and Marcus Gallagher

The University of Queensland, Australia


{luo,marcusg}@itee.uq.edu.au

Abstract. Time series discord has proven to be a useful concept for time-
series anomaly identification. To search for discords, various algorithms
have been developed. Most of these algorithms rely on pre-building an
index (such as a trie) for subsequences. Users of these algorithms are typ-
ically required to choose optimal values for word-length and/or alphabet-
size parameters of the index, which are not intuitive. In this paper, we
propose an algorithm to directly search for the top-K discords, without the
requirement of building an index or tuning external parameters. The al-
gorithm exploits quasi-periodicity present in many time series. For quasi-
periodic time series, the algorithm gains significant speedup by reducing
the number of calls to the distance function.

Keywords: Time Series Discord, Minimax Search, Time Series Data


Mining, Anomaly Detection, Periodic Time Series.

1 Introduction
Periodic and quasi-periodic time series appear in many data mining applications,
often due to internal closed-loop regulation or external phase-locking forces on
the data sources. A time series’ temporary deviation from a periodic or quasi-
periodic pattern constitutes a major type of anomalies in many applications. For
example, an electrocardiography (ECG) recording is nearly periodic, as one’s
heartbeat. Figure 1 shows an ECG signal where a disruption of periodicity is
highlighted. This disruption of periodicity actually indicates a Premature Ven-
tricular Contraction (PVC) arrhythmia [3]. As another example, Figure 4 shows
the number of beds occupied in a tertiary hospital. The time series suggests a
weekly pattern—busy weekdays followed by quieter weekends. If the weekly pat-
tern is disrupted, then chaos often follows with elective surgeries being canceled
and the emergency department being over-crowded, greatly impacting patient
satisfaction and health care quality.
Time Series Discord captures the idea of anomalous subsequences in time
series and has proven to be useful in a diverse range of applications (see for
example [5,1,11]). Intuitively, a discord of a time series is a subsequence with
the largest distance from all other non-overlapping subsequences in the time se-
ries. Similarly, the 2nd discord is a subsequence with the second largest distance
from all other non-overlapping subsequences. And more generally one can search


Fig. 1. An ECG time series that demonstrates periodicity, baseline shift, and a discord. The time series is the second-lead signal from dataset xmitdb_x108_0 of [6]. According to [3], the ECG was taken at a frequency of 360 Hz. The unit of measurement is unknown to the author.

Fig. 2. Illustration of Proposition 1. The blue solid line represents the true d for the time series xmitdb_x108_0 (with subsequence length 360). The red dashed line represents an estimate d̂ for d. Although at many locations d̂ is very different from d, the maximum of d̂ coincides with the maximum of d.

for the top-K discords [1]. Finding the discord of a time series in general requires comparisons among O(m²) pair-wise distances, where m is the length of the time series. Despite past efforts at building heuristics (e.g., [5,1]), searching for the discord still requires expensive computation, making real-time interaction with domain experts difficult. In addition, most existing algorithms are based on the idea of indexing subsequences with a data structure such as a trie. Such data structures often have unintuitive parameters (e.g., word length and alphabet size) to tune. This means time-consuming trial and error that compromises the efficiency of the algorithms.
also proposed later [11]. HOT SAX builds on the idea of discretizing and index-
ing time series subsequences. To select the lengths for index keys, wavelet decom-
position can be used ([2,1]). Most recently, adaptive discretization has been pro-
posed to improve the index for efficient discord search ([8]). In this paper, we pro-
pose a fast algorithm to find the top-K discords in a time series without prebuild-
ing an index or tuning parameters. For periodic or quasi-periodic time series, the
algorithm finds the discord with much less computation, compared to results pre-
viously reported in the literature (e.g., [5]). After finding the 1st discord, our algo-
rithm finds subsequent discords with even less computation—often 50% less. We
tested our algorithm with a collection of datasets from [6] and [4]. The diversity
of the collection shows the definition of “quasi-periodicity” can be very relaxed
for our algorithm to achieve search efficiency. Periodicity of a time series can be
easily assessed through visual inspection. The experiments with artificially gen-
erated non-periodic random walk time series showed increased running time, but
the algorithm is still hundreds of times faster than the brute-force search, without
tuning any parameter.
The paper is organized as follows. Section 2 reviews the definition of time-
series discord and existing algorithms for discord search. Section 3 introduces our
direct search algorithm and explains ideas behind it. Section 4 presents empirical
evaluation for the new algorithm and a comparison with the results of HOT SAX
from [5]. Section 5 concludes the paper.

2 Time Series Discords


This section reviews the definition of time-series discord and major search algo-
rithms.
Notation. In this paper, T = (t1 , . . . , tm ) denotes a time series of length m. In
addition, T [p; n] denotes the length-n subsequence of T with beginning position
p. The distance between two length-n subsequences T [p; n] and T [q; n] is denoted
distT,n (p, q). Following [5], we consider by default the Euclidean distance between
two standardized subsequences—all subsequences are standardized to have a
mean of 0 and a standard deviation of 1. Nevertheless, the results in this paper
apply to other definitions of distance. Given a subsequence T [p; n], the minimum
distance between T [p; n] and any non-overlapping subsequence T [q; n] is denoted
dp,n (i.e., dp,n = minq:|p−q|≥n distT,n (p, q)). As n is a constant, we often write dp
for dp,n . Finally we use d to denote the vector (d1 , d2 , . . . , dm−n+1 ) and use d̂
and dˆp to denote estimates for d and dp respectively.
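For concreteness, the standardized distance used throughout this paper can be computed as in the following short sketch (an illustrative Python example, not the authors' code; the helper names znorm and dist are ours and are reused in the later sketches):

import numpy as np

def znorm(s):
    # Standardize a subsequence to mean 0 and standard deviation 1.
    s = np.asarray(s, dtype=float)
    sd = s.std()
    return (s - s.mean()) / sd if sd > 0 else s - s.mean()

def dist(T, n, p, q):
    # Euclidean distance between the z-normalized length-n subsequences
    # starting at positions p and q (0-based indexing).
    T = np.asarray(T, dtype=float)
    return float(np.linalg.norm(znorm(T[p:p + n]) - znorm(T[q:q + n])))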
For a time series of length m, there are at most (1/2)(m − n − 1)(m − n −
2) + 1 distinct distT,n (p, q) values. In particular distT,n (p, q) = distT,n (q, p) and
distT,n (p, p) = 0. Figure 3 shows a heatmap of the distances distT,n (p, q) for
all p and q values of the time series xmitdb_x108_0 (see Figure 1).
Fig. 3. Distribution of distT,360 (p, q) where T is the time series xmitdb_x108_0

Fig. 4. Hourly bed occupancy in a tertiary hospital for two months
The following definition is reformulated from [5].

Definition 1 (Discord). Let T be a sequence of length m. A subsequence
T [p(1) ; n] is the first discord (or simply the discord) of length n for T if

p(1) = argmax_p {dp : 1 ≤ p ≤ m − n + 1} .    (1)

Intuitively, a discord is the most “isolated” length-n subsequence in the space
R^n. Subsequent discords—the second discord, the third discord, and so on—of
a time series are defined inductively as follows.
Definition 2. Let T [p(1) ; n], T [p(2) ; n], . . . , T [p(k−1) ; n] be the top k − 1 discords
of length n for a time series T . Subsequence T [p(k) ; n] is the k-th discord of
length n for T if

p(k) = argmax_p { dp : |p − p(i) | ≥ n for all i < k }

Note that the values for both n and k should be determined by the application;
they are independent of the search algorithm. If a user was looking for three
most unusual weeks in the bed occupancy example (Figure 4), k would be 3
and n would be 7 ∗ 24, assuming the time series is sampled hourly. Strictly
speaking, the discord is not well defined as there may be more than one location
p that maximizes dp (i.e., dp1 = dp2 = maxp {dp : 1 ≤ p ≤ m − n + 1}). But
the ambiguity rarely matters in most applications, especially when the top-K
discords are searched in a batch. In this paper, we shall follow the existing
literature [5] and assume that all dp ’s have distinct values.
The discord has a formulation similar to the minimax problem in game theory.
Note that

max_p min_q {distT,n (p, q) : |p − q| ≥ n} ≤ min_p max_q {distT,n (p, q) : |p − q| ≥ n} .

According to Sion’s minimax theorem [9], the equality holds if distT,n (p, ·) is
quasi-concave on q for every p and distT,n (·, q) is quasi-convex on p for every q.
Figure 3 indicates, however, that in general neither distT,n (p, ·) is quasi-concave
nor distT,n (·, q) is quasi-convex, and no global saddle point exists. That suggests
searching for discords requires a strategy different from those used in game the-
ory. In the worst case, searching for the discord has the complexity O(m^2), essen-
tially requiring brute-force computation of the pair-wise distances of all length-n
subsequences of the time series. When m = 10^4, that means 100 million calls
to the distance function. Nevertheless, the following sufficient condition for the
discord suggests a search strategy better than the brute-force computation.
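For reference, the brute-force baseline implied by Definition 1 could look like the sketch below (an illustration only, reusing the dist helper sketched in Section 2; variable names are ours):

def brute_force_discord(T, n):
    # Returns (p*, d_{p*}): the discord start and its distance to the nearest
    # non-overlapping subsequence. Requires O(m^2) calls to dist.
    m = len(T)
    best_p, best_d = None, -1.0
    for p in range(m - n + 1):
        nearest = float('inf')
        for q in range(m - n + 1):
            if abs(p - q) >= n:          # skip overlapping subsequences
                nearest = min(nearest, dist(T, n, p, q))
        if nearest > best_d:
            best_p, best_d = p, nearest
    return best_p, best_d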
Observation 1. Let T be a time series. A subsequence T [p∗ ; n] is the discord
of length n if there exists d∗ such that

∀q : |p∗ − q| > n ⇒ distT,n (p∗ , q) ≥ d∗ , and    (2)

∀p ≠ p∗ , ∃q : (|p − q| > n) ∧ (distT,n (p, q) < d∗ ).    (3)
In general, there are infinitely many d∗ that satisfy Clause (2) and Clause (3).
Suppose we have a good guess d∗ . Clause (3) implies that a false candidate
of the discord can be refuted, potentially in fewer than m steps. Clause (2)
implies that, given all false candidates have been refuted, the true candidate
for the discord can be verified in m − n + 1 steps. Hence in the best case,
(m − n + 1) + (m − 1) = 2m − n calls to the distance function are sufficient to
verify the discord. To estimate d∗ , we can start with the value of dp where p is
a promising candidate for the discord, and later increase the guess to a larger
value dp′ if p′ is not refuted (i.e., distT,n (p′ , q) > dp for every non-overlapping q)
and becomes the next candidate. This hill-climbing process goes on until all but
one of the subsequences are refuted with the updated value of d∗ .
This idea forms the basis of most existing discord search algorithms (e.g., HOT
SAX in [5] and WAT in [1]); the common structure of these algorithms is shown in
Figure 5. With this base algorithm, the efficiency of a search then depends on the

1: Select a p0 and let d∗ ← dp0 and p∗ ← p0 . {Initialization}
2: for all the remaining locations p ordered by certain heuristic Outer do {Outer Loop}
3: for all locations q ordered by some heuristic Inner such that |p − q| ≥ n do {Inner Loop}
4: if distT,n (p, q) < d∗ then
5: According to Clause (3) in Observation 1, T [p; n] cannot be the discord; break to next p.
6: end if
7: end for
8: if minq distT,n (p, q) > d∗ then
9: As Clause (2) in Observation 1 is not met, update d∗ ← minq distT,n (p, q) and p∗ ← p.
10: end if
11: end for
12: return p∗ and d∗

Fig. 5. Base algorithm for HOT SAX and WAT
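A generic code rendering of this base algorithm is sketched below (our illustration of Figure 5, not the HOT SAX or WAT implementation; the Outer and Inner orderings are passed in as parameters, and the dist helper from Section 2 is assumed):

def base_discord_search(T, n, outer_order, inner_order):
    # Skeleton of Figure 5. outer_order is a sequence of candidate positions p;
    # inner_order(p) must enumerate every position q (most promising first),
    # so that early abandoning via Clause (3) is possible.
    positions = list(outer_order)
    p0 = positions[0]
    d_star = min(dist(T, n, p0, q) for q in inner_order(p0) if abs(p0 - q) >= n)
    p_star = p0
    for p in positions[1:]:
        nearest = float('inf')
        refuted = False
        for q in inner_order(p):
            if abs(p - q) < n:
                continue
            d = dist(T, n, p, q)
            nearest = min(nearest, d)
            if d < d_star:               # Clause (3): p cannot be the discord
                refuted = True
                break
        if not refuted and nearest > d_star:
            d_star, p_star = nearest, p  # p becomes the new candidate (Line 9)
    return p_star, d_star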

order of subsequences in the Outer and Inner loops (see lines 2 and 3). Intuitively,
the Outer loop should rank p according to the singularity of subsequence T [p; n];
the Inner loop should rank q according to the proximity between subsequences
T [p; n] and T [q; n]. Both HOT SAX and WAT adopt the following strategy.
Firstly all subsequences of length n are discretized and compressed into shorter
strings. Then the strings are indexed with a suffix trie—in the ideal situation,
subsequences close in distance also share an index key or occupy neighboring
index keys in the trie. This is not so different from the idea of hashing to achieve
O(1) search time. In the end, all subsequences will be indexed into a number of
buckets on the terminal nodes. The hope is that, with careful selection of string
length and alphabet size, the discord will fall into a bucket containing very few
subsequences while a non-discord subsequence will fall into a bucket shared with
similar subsequences. Then the uneven distribution of subsequences among the
buckets can be exploited to devise efficient ordering for the Outer and Inner
loops.
This ingenious approach, however, has two drawbacks. Firstly, one needs to
select optimal parameters that balance the index size and the bucket size, which
are critical to the search efficiency. For example, to use HOT SAX, one needs
to set the alphabet size and the word size for the discretized subsequences [5,
Section 4.2]; WAT automates the selection of word size, but still requires setting
the alphabet size [1, Section 3.2]. Such parameters are not always intuitive to a
user, as the difficulty of building a usable trie has been discussed in [11, Section
2]. Secondly, the above approach uses fixed/random order in the outer loop
to search for all top-K discords. A dynamic ordering for the outer loop could
potentially make better use of the information gained in the previous search
steps. Also it is not clear how knowledge gained in finding the k-th discord can
help in finding the (k + 1)-th discord. In [1, Section 3.6], partial information about
d̂ is cached so that the inner loop may break quickly. But as caching works on
the “easy” part of the search space—where dp is small—it is not clear how much
computation is saved.
In the following section, we address the above issues by proposing a direct
way to search for multiple discords. In particular, our algorithm requires no
ancillary index (and hence no parameters to tune), and the algorithm reuses the
knowledge gained in searching for the first k discords to speed up the search for
the (k + 1)-th discord.

3 Direct Discord Search


In Definition 1, the formula p(1) = argmax_p {dp : 1 ≤ p ≤ m − n + 1} suggests a
direct way to search for the discord with the following two steps:

Step 1: Compute an estimate d̂p of dp for each p.
Step 2: Let p∗ = argmax_p {d̂p : 1 ≤ p ≤ m − n + 1}, and verify that T [p∗ ; n] is
the discord.
Step 2 can be carried out by testing the condition dp∗ ≥ maxp dˆp , as justified by
the following proposition.
Proposition 1. Let d̂ be an estimate of d such that d̂ ⪰ d (i.e., d̂p ≥ dp for
every p). If dp∗ ≥ maxp d̂p , then dp∗ ≥ maxp dp .

Proof. With d̂ ⪰ d, we have dp∗ ≥ maxp d̂p ≥ maxp dp .

Proposition 1 gives a sufficient condition for verifying the discord of a time series.
It shows that d̂ does not have to be close to d at every location p. To verify the
discord, it suffices to have d̂ ⪰ d and max d ≥ max d̂. This point is illustrated
in Figure 2.
To estimate dp = minq dist(p, q) in Step 1, we can use d̂p = minq∈Qp dist(p, q).
Here Qp is a subset of {q : |p − q| > n}; hence d̂p ≥ dp . As Qp includes more
locations, the error d̂p − dp becomes smaller. If Qp = {q : |p − q| > n}, then
d̂p − dp = 0. By controlling the size of Qp , we can control the accuracy of d̂p for
different p. Therefore Proposition 1 justifies the search strategy shown in Figure 6.
For top-K discord search, the while-loop (Lines 2–10) is repeated K times (with
proper bookkeeping to exclude overlapping subsequences). As d̂ keeps decreasing
in the computation, every time we start with a better estimate d̂ in Line 3.
1: For each p, estimate d̂p = minq∈Qp dist(p, q), where Qp is a subset of {q : |p − q| > n}.
2: while the discord has not been found, do
3: p∗ ← argmax_p {d̂p }.
4: Compute dp∗ = minq dist(p∗ , q).
5: if dp∗ > d̂p for all p ≠ p∗ then
6: return p∗ as the discord starting location.
7: else
8: Decrease d̂ by enlarging the sets Qp .
9: end if
10: end while

Fig. 6. Base algorithm for direct discord search
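The verification loop of Figure 6 can be sketched in code as follows (an illustration under our own naming, not the authors' implementation: d_hat is assumed to be an element-wise upper bound of d, refine stands for the estimate-tightening step of Line 8, and termination relies on the paper's assumption that all dp values are distinct):

import numpy as np

def direct_discord_search(T, n, d_hat, refine):
    # Sketch of Figure 6. d_hat[p] is assumed to satisfy d_hat[p] >= d[p];
    # refine(p_star, d_hat) models Line 8 (enlarging the sets Q_p).
    n_pos = len(T) - n + 1
    while True:
        p_star = int(np.argmax(d_hat))
        # Exhausting: exact distance from the candidate to all non-overlapping subsequences.
        d_exact = min(dist(T, n, p_star, q)
                      for q in range(n_pos) if abs(p_star - q) >= n)
        if all(d_exact > d_hat[p] for p in range(n_pos) if p != p_star):
            return p_star, d_exact        # verified by Proposition 1
        d_hat[p_star] = d_exact           # keep the exact value as a by-product
        d_hat = refine(p_star, d_hat)     # tighten the estimates and retry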

3.1 Efficient Way to Estimate d̂


Figure 2 suggests that to find the discord, it is not necessary to have a highly
accurate estimate d̂p for every p. Instead, a highly accurate d̂p is needed only
when dp is relatively large. To minimize the total computation cost, we should
distribute computational resources according to the importance of dˆp .
We propose three operations to estimate dp , with increasing level of compu-
tation cost.
1. Traversing: Suppose that dist(p, qp ) is known to be small for some qp . For
small integers k, let Qp+k contain one location qp + k. Intuitively, traversing
translates to searching along the 45 degree lines in Figure 3.
2. Sampling: Let Qp be a set of locations if dp is likely to be large or knowledge
of dp is unavailable. We shall see a way to construct such Qp using periodicity
of time series.
3. Exhausting: Let Qp be all possible locations if the exact value of dp is
required.
Note that the most expensive Exhausting operation is needed only in verifying
the discord (Line 4 in Figure 6).
The Traversing operation can be justified with the following argument. For a
relatively large n, distT,n (p, q) ≈ distT,n (p+1, q+1). If distT,n (p, qp ) is small, then
distT,n (p+1, qp +1) is likely to be small as well. The argument can be “telescoped”
to other k values as long as k/n is small enough. This is demonstrated in Figure 7,
where local minima for sp ’s tend to cluster around some “sweet spots” (the red
circle). Therefore, in Traversing, a good estimate dˆp = distT,n (p, q) suggests a
“sweet spot” q around which good estimates dˆp+k for neighboring positions (p+k)
can be found.
The Sampling operation may be implemented with local search with a set of ran-
dom starting points. But when the time series is nearly periodic or quasi-periodic,
a more efficient implementation exists. This will be discussed in the next section.

3.2 Quasi-Periodic Time Series


Suppose a time series T is nearly periodic with a period l (i.e., tp ≈ tp±k·l ).
Then distT,n (p, p ± k · l) ≈ 0 implies dp = minq:|p−q|≥n distT,n (p, q) ≈ 0 as long
Fig. 7. Distance profiles dist(p, ·) of time series xmitdb_x018_0. Each line plots the
sequence sp = (distT,n (p, 1), . . . , distT,n (p, 1000)) for some p, where n = 360. The 10
lines in the plot correspond to p being 10, 20, . . . , 100 respectively.

Fig. 8. Locations of qp 's for time series xmitdb_x108_0. Each location (p, qp ) is colored
according to the value dp . Dashed lines are a period (360) apart. Hence if a location
(p, qp ) falls on a dashed line, then qp − p is a multiple of the period 360.

as k · l ≥ n for some k. Small distances associated with multiples of the
time-series period can be seen in Figure 7—at locations around p + 360 and
p + 2 × 360 for each p in {10, 20, . . . , 100}.
With this observation, the following heuristic can be used to implement the
Sampling operation for nearly periodic time series: a location q multiple periods
away from p is likely to be near a local minimum for {distT,n (p, q) : q}. Figure 8
shows the location qp = argmin_q {distT,n (p, q)} for all locations p of the time
series in Figure 1. It shows that in most cases a minimum-location qp is roughly
multiple periods away from p.
There are a number of ways to estimate the period of a time series. For
example, the autocorrelation function (see Figure 9) and phase coherence analysis
[7] are often used to estimate the period.
As suggested in Figure 7, the gaps between local minima of a distance profile
{distT,n (p, q) : 1 ≤ q ≤ m − n + 1} approximate the period of a time series,
for distT,n (p, p + k · l) ≈ 0. We use this observation to estimate the period in this
paper (see Figure 11). Figure 10 shows the collection of gaps {Δk } for local
minima of {distT,n (1000, q) : q}, where T is the time series in Figure 1 and
n = 360. Taking the median of {Δk } gives the estimate 354 for the period of the
time series. Note that the period needs to be estimated only once (with the distance
profile {distT,n (p, q) : 1 ≤ q ≤ m − n + 1} for only one location p). Hence it takes
only m − n calls to the distance function to estimate the period of a time series
of length m. As a by-product, the exact value of dp is also obtained.
Fig. 9. Autocorrelation function of time series xmitdb_x108_0. The plot shows multiple
peaks corresponding to multiples of the period.

Fig. 10. The density plot for gaps between local minima and the estimated period for
time series xmitdb_x108_0

1: Randomly pick a location p.
2: Compute dist(p, q) for every q.
3: cp ← the lower 5% quantile of all distances calculated in the previous step.
4: Q ← all local minima q such that dist(p, q) ≤ cp .
5: Sort Q in increasing order (Q1 , Q2 , . . . , Q|Q| ).
6: Δk = Qk+1 − Qk for all k < |Q|.
7: l ← the median of {Δk : k < |Q|}.
8: return l as the estimated period.

Fig. 11. Estimating period with the median gap between two neighboring local minima
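A possible rendering of Figure 11 in code is sketched below (an illustration using numpy; the simple local-minima detection and the function name estimate_period are ours, and the dist helper from Section 2 is assumed):

import numpy as np

def estimate_period(T, n, p=None, quantile=0.05):
    # Sketch of Figure 11: the period is estimated as the median gap between
    # neighboring local minima of a single distance profile dist(p, .).
    n_pos = len(T) - n + 1
    if p is None:
        p = np.random.randint(0, n_pos)
    profile = np.array([dist(T, n, p, q) for q in range(n_pos)])
    c = np.quantile(profile, quantile)            # lower 5% quantile (Line 3)
    minima = [q for q in range(1, n_pos - 1)
              if profile[q] <= c
              and profile[q] <= profile[q - 1]
              and profile[q] <= profile[q + 1]]   # local minima below the cutoff
    gaps = np.diff(minima)
    return int(np.median(gaps)) if len(gaps) > 0 else None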

3.3 Implementation of the Search Strategy

With heuristics for both Traversing and Sampling, Figure 12 implements Line 1
in Figure 6. The procedure uses a sequential covering strategy to estimate d̂p for
each p. In each iteration (the while loop from Line 2 to Line 10), a Sampling
operation is done to find a “sweet spot”. Then a Traversing operation exploits
that location to cover as many neighboring locations as possible.
The verification stage of our algorithm (Lines 2-10 in Figure 6) consists of
a while loop which resembles the outer loop in HOT SAX and WAT. But here
the order of locations is dynamic, determined by the ever-improving estimate d̂.
Line 8 in Figure 6 further improves d̂ when the initial guess for the discord
turns out to be incorrect. The improvement can be achieved by traversing with
a better starting location qp∗ produced in Line 4 of Figure 6 (see Figure 13). As
suggested by Figure 8, the “best” locations tend to cluster along the 45 degree
lines. Moreover the large value of the initial estimate dˆp∗ suggests the neighbor-
hood of p∗ is a high-payoff region for further refinement of d̂. As the traversing
is done locally, the improvement step is relatively fast compared to the initial
estimation step for d.
To sum up, we have described a new algorithm for discord search that con-
sists of an estimation stage followed by a verification stage. The estimation stage
1: traversed[p] ← FALSE and mindist[p] ← ∞, for each p.
2: while traversed[p] = FALSE for some p do
3: Randomly pick a location p from {p : traversed[p] = FALSE}.
4: Q ← {q : |p − q| = k · l for some integer k}.
5: Do local search for the optimal qp with starting points in Q. {Sampling}
6: cp ← the lower 5% quantile of all distances calculated in the previous step.
7: Find the largest numbers L and R such that dist(p − i, qp − i) < cp for all
i ≤ L and dist(p + i, qp + i) < cp for all i ≤ R. {Traversing}
8: traversed[p − L : p + R] ← TRUE.
9: mindist[p − L : p + R] ← {dist(p + i, qp + i) : −L ≤ i ≤ R}.
10: end while

Fig. 12. Implementation of d estimation (Line 1 in Figure 6)
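The estimation stage can be sketched in code as follows (a simplified illustration of Figure 12, not the authors' implementation: the local search of Line 5 is replaced by simply taking the best sampled location, the estimated period l from Figure 11 and the dist helper are assumed, and the time series is assumed long enough that non-overlapping sampled locations exist):

import numpy as np

def estimate_d_hat(T, n, l):
    # Sketch of Figure 12: estimate d_hat[p] for every position p by Sampling
    # at whole multiples of the period l and Traversing along the diagonal.
    n_pos = len(T) - n + 1
    d_hat = np.full(n_pos, np.inf)
    traversed = np.zeros(n_pos, dtype=bool)
    while not traversed.all():
        p = int(np.random.choice(np.flatnonzero(~traversed)))
        # Sampling: probe non-overlapping locations that are multiples of l away.
        Q = [q for q in range(n_pos) if abs(p - q) >= n and abs(p - q) % l == 0]
        if not Q:   # fall back to all non-overlapping locations
            Q = [q for q in range(n_pos) if abs(p - q) >= n]
        dists = {q: dist(T, n, p, q) for q in Q}
        qp = min(dists, key=dists.get)                      # the "sweet spot"
        cutoff = np.quantile(list(dists.values()), 0.05)
        # Traversing: extend (p+i, qp+i) in both directions while distances stay small.
        for step in (1, -1):
            i = 0 if step == 1 else -1
            while 0 <= p + i < n_pos and 0 <= qp + i < n_pos:
                d = dist(T, n, p + i, qp + i)
                if i != 0 and d >= cutoff:
                    break
                d_hat[p + i] = min(d_hat[p + i], d)
                traversed[p + i] = True
                i += step
    return d_hat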

1: Let qp be the minimum location returned with Exhausting.
2: cp ← the lower 5% quantile of all distances calculated in Exhausting.
3: Find the largest numbers L and R such that mindist[p − L : p + R] ≥ dp ,
dist(p − i, qp − i) < cp for all i ≤ L, and dist(p + i, qp + i) < cp for all i ≤ R.
4: mindist[p − L : p + R] ← {dist(p + i, qp + i) : −L ≤ i ≤ R}.

Fig. 13. Traversing with a better starting point to improve d̂

achieves efficiency by dynamically differentiating locations p according to their
potential influence on maxp d̂p . Further reduction in computation cost comes from
the periodicity of a time series. In general, the Traversing heuristic works best
when a time series is smooth (or equivalently densely sampled), while the Sam-
pling heuristic works best when the periodicity of the time series is pronounced.
The algorithm is guaranteed to halt and to return the discord by Observation 1.
The efficiency of the algorithm has been evaluated in the following section.

4 Empirical Evaluation
In this section, we first compare the performance of our direct-discord-search
algorithm with the results reported for HOT SAX in [5]. We then report the
performance of our algorithm on a collection of time series which are publicly
available. Following the tradition established in [5] and [1], the efficiency of our
algorithm was measured by the number of calls to the distance function, as
opposed to wall clock or CPU time. Since our algorithm entails no overhead of
constructing an index (in contrast to the algorithms in [5] and [1]), the number
of calls to the distance function is roughly proportional to the total computation
time involved. As shown in [2] and [1], the performance of HOT SAX depends
on the parameters selected. Here we assume that the metrics reported in [5] were
based on optimal parameter values.
To compare to HOT SAX, we use the dataset qtdbsel102 from [6]. Although
several datasets were used in [5] to evaluate the performance of HOT SAX, this is
the only one readily available to us. The dataset qtdbsel102 contains two time
series of length 45,000; we use the first one as the two are highly correlated.
Fig. 14. Search costs for the direct search algorithm and HOT SAX. For HOT SAX,
the mean numbers of distance calls were visually estimated from [5]; interval estimates
were used to account for potential estimation error.

Fig. 15. Time series nprs44 and its d̂ vector

Following [5], we created random excerpts of length {1000 × 2^k : 0 ≤ k ≤ 5}
from the original time series1 . For each length configuration, 100 random ex-
cerpts were created and the top 3 discords of length 128 were searched for.
Table 1 shows the mean and the standard error for numbers of calls to the
distance function. The rightmost column of the table contains the mean perfor-
mance metric visually estimated from Figure 13 of [5]. Similar information is
also visualized in Figure 14. The figure plots the numbers of calls to the distance
function for 6 × 3 × 100 runs of the direct-discord-search algorithm. Each point
corresponds to one run of discord search; horizontal jitter was applied to reduce
overlaps among points. The dashed intervals estimate the average number of calls
to the distance function by HOT SAX. Loess lines for the costs of searching for
top-3 discords are also plotted. We can see that for the 1st discord, the average
number of calls by the direct search algorithm (the red line) is roughly linear
to the size of the time series excerpts. Moreover, these numbers are significant
smaller than the numbers reported for HOT SAX (summarized with the dashed
intervals). For subsequent discords, the average numbers of calls to the distance
function (the blue line and the green line) decrease significantly, due to informa-
tion gained from prior computation. The metrics for the second and the third
discords also show larger variance: some points are significantly higher or lower
than the loess lines. A likely cause is that the complete time series contains only
a small number of truly anomalous subsequences (discords): When a random excerpt
of the time series includes only one (or two) of these discords, searching for the
second (or the third, respectively) discord will be difficult. (Note the plot uses
the log scale for x and y axes.)
1 Experiments on the length 64,000 were not carried out because qtdbsel102 has only
45,000 points and we chose not to pad the time series with hypothetical values.
Table 1. Numbers of calls to the distance function with random excerpts from qtdb-
sel102, for the direct-discord-search algorithm and HOT SAX

Time Series   Direct Search Cost (Standard Error)                       Aver. Cost for HOT SAX
Length        1st discord        2nd discord        3rd discord         (visual estimates)
1,000         4,020 (1,441)      1,072 (705)        998 (690)           16,000 to 40,000
2,000         11,159 (4,641)     4,120 (2,532)      3,493 (2,780)       40,000 to 100,000
4,000         30,938 (12,473)    13,963 (10,633)    13,399 (12,473)     60,000 to 160,000
8,000         77,381 (33,064)    29,711 (32,651)    38,632 (40,974)     100,000 to 160,000
16,000        168,277 (70,071)   94,855 (107,128)   141,038 (143,553)   250,000 to 400,000
32,000        365,900 (184,540)  198,797 (95,960)   105,911 (107,992)   400,000 to 1 × 10^6

In the second set of experiments, we search for the top 3 discords for a col-
lection of time series from [6]2 and [4], using the proposed algorithm. For time
series from [6], the discord lengths are chosen to be consistent with configura-
tions used in [5]. The results are shown in Table 2. Many of these datasets, in
particular 2h_radioactivity, demonstrate little periodicity. The results show
that our algorithm has reasonable performance even for such time series.

Table 2. Numbers of calls to the distance function for top-3 discord search

Time Series       Length   Discord Length   Search Cost (Standard Error)
                                            1st discord        2nd discord        3rd discord
nprs44            24,085   320              249,283 (12,454)   231,350 (19,949)   208,539 (34,640)
nprs43            18,012   320              188,095 (11,820)   24,588 (2,785)     109,147 (29,516)
power data        35,000   750              158,235 (13,546)   34,680 (874)       37,992 (3,460)
chfdbchf15        15,000   256              79,683 (5,606)     21,400 (2,224)     134,734 (18,967)
2h_radioactivity  4,370    128              157,495 (8,799)    20,286 (5,725)     16,463 (4,657)

In Table 2, the results for the time series nprs44 are particularly interesting.
For nprs44, no significant reduction in computation is observed for computing
the 2nd and the 3rd discords. To find out why, we plot the time series and the
estimated d vector in Figure 15. The figure shows that the 2nd and the 3rd
discords are not noticeably different from other subsequences.

Completely nonperiodic case. Completely nonperiodic time series rarely exist


in applications, and they can be easily identified through visual inspection of
the time series or their autocorrelation function. In an unlikely situation where
our algorithm is blindly applied to a completely nonperiodic time series, a bad
estimation of period will reduce the efficiency of the algorithm.
p To demonstrate
this, we generate two random walk time series T with tp = i=1 Zi , where Zi are
independent normally-distributed random variables with mean 0 and variance 1
2 For datasets containing more than one time series, we take the first one in each data
file.
(a) Random Walk 1    (b) Random Walk 2

Fig. 16. Random walk time series used in the experiments for completely nonperiodic
data

Table 3. Number of calls to the distance function for top-3 discord search (random
walk time series)

Time Series    Length   Discord Length   Direct Search Cost (Standard Error)
                                         1st                2nd                3rd
random walk 1  15,000   256              136,395 (7,410)    54,994 (10,144)    34,355 (7,696)
random walk 2  30,000   128              441,685 (35,695)   329,380 (50,432)   636,930 (164,842)

(see Figure 16). Random walk time series are interesting in two respects: firstly, a
random walk time series is completely nonperiodic; secondly, every subsequence
of a random walk can be regarded as equally anomalous.
We applied the algorithm to find the top-3 discords in the two random-walk
time series. The results are shown in Table 3. Without tuning any parameter,
the algorithm is still hundreds of times faster than the brute-force computation
of all pair-wise distances.
To sum up, our experiments show clear performance improvement on quasi-
periodic time series by the proposed direct discord-search algorithm. Our algo-
rithm also demonstrates consistent performance across a broad range of time
series, with varying degrees of periodicity.

5 Conclusions and Future Work


The paper has introduced a parameter-free algorithm for top-K discord search.
When a time series is nearly periodic or quasi-periodic, the algorithm demon-
strated significant reduction in computation time. Many applications generate
quasi-periodic time series, and the assumption of quasi-periodicity can be assessed
by simple visual inspection. Therefore our algorithm has wide applicability.
Our results have shown that periodicity is a useful feature in time-series anomaly
detection. More theoretical study is needed to better understand the effect of peri-
odicity on the search space of time-series discords. We are also interested in know-
ing to what extent the results in this paper can be generalized to chaotic time series
[10].
One limitation of the proposed algorithm is that the time series needs to fit
into the main memory. Hence the algorithm requires O(m) memory. One future
direction is to explore disk-aware approximations to the direct-discord-search
algorithm. When the time series is too large to fit into the main memory,
one needs to minimize the number of disk scans as well as the number of calls to
the distance function (see [11]).
Another direction is to explore alternative ways of estimating the d vector so
that the number of iterations for refining d̂ is minimized. We also are looking for
ways to extend the algorithm so that the periodicity assumption can be removed.

Acknowledgment
Support for this work was provided by an Australian Research Council Linkage
Grant (LP 0776417). We would like to thank anonymous reviewers for their
helpful comments.

References
1. Bu, Y., Leung, T.W., Fu, A.W.C., Keogh, E., Pei, J., Meshkin, S.: WAT: Finding
top-k discords in time series database. In: Proceedings of 7th SIAM International
Conference on Data Mining (2007)
2. Fu, A.W.-c., Leung, O.T.-W., Keogh, E.J., Lin, J.: Finding time series discords
based on Haar transform. In: Li, X., Zaïane, O.R., Li, Z.-h. (eds.) ADMA 2006.
LNCS (LNAI), vol. 4093, pp. 31–41. Springer, Heidelberg (2006)
3. Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P.C., Mark,
R.G., Mietus, J.E., Moody, G.B., Peng, C.-K., Stanley, H.E.: PhysioBank, Phys-
ioToolkit, and PhysioNet: Components of a new research resource for complex
physiologic signals. Circulation 101(23), e215–e220 (2000), Circulation Electronic
Pages: http://circ.ahajournals.org/cgi/content/full/101/23/e215
4. Hyndman, R.J.: Time Series Data Library, http://www.robjhyndman.com/TSDL
(accessed on April 15, 2010)
5. Keogh, E., Lin, J., Fu, A.: HOT SAX: Efficiently finding the most unusual time
series subsequence. In: Proc. of the 5th IEEE International Conference on Data
Mining, pp. 226–233 (2005)
6. Keogh, E., Lin, J., Fu, A.: The UCR Time Series Discords Homepage,
http://www.cs.ucr.edu/~eamonn/discords/
7. Lindström, J., Kokko, H., Ranta, E.: Detecting periodicity in short and noisy time
series data. Oikos 78(2), 406–410 (1997)
8. Pham, N.D., Le, Q.L., Dang, T.K.: HOT aSAX: A novel adaptive symbolic repre-
sentation for time series discords discovery. In: Nguyen, N.T., Le, M.T., Świątek,
J. (eds.) ACIIDS 2010. LNCS, vol. 5990, pp. 113–121. Springer, Heidelberg (2010)
9. Sion, M.: On general minimax theorems. Pacific J. Math. 8(1), 171–176 (1958)
10. Sprott, J.C.: Chaos and time-series analysis. Oxford Univ. Pr., Oxford (2003)
11. Yankov, D., Keogh, E., Rebbapragada, U.: Disk aware discord discovery: Finding
unusual time series in terabyte sized datasets. Knowledge and Information Sys-
tems 17(2), 241–262 (2008)
INSIGHT: Efficient and Effective Instance Selection for
Time-Series Classification

Krisztian Buza, Alexandros Nanopoulos, and Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
{buza,nanopoulos,schmidt-thieme}@ismll.de

Abstract. Time-series classification is a widely examined data mining task with


various scientific and industrial applications. Recent research in this domain has
shown that the simple nearest-neighbor classifier using Dynamic Time Warping
(DTW) as distance measure performs exceptionally well, in most cases outper-
forming more advanced classification algorithms. Instance selection is a com-
monly applied approach for improving efficiency of nearest-neighbor classifier
with respect to classification time. This approach reduces the size of the train-
ing set by selecting the best representative instances and using only them during
classification of new instances. In this paper, we introduce a novel instance se-
lection method that exploits the hubness phenomenon in time-series data, which
states that some few instances tend to be much more frequently nearest neighbors
compared to the remaining instances. Based on hubness, we propose a frame-
work for score-based instance selection, which is combined with a principled
approach of selecting instances that optimize the coverage of training data. We
discuss the theoretical considerations of casting the instance selection problem as
a graph-coverage problem and analyze the resulting complexity. We experimen-
tally compare the proposed method, denoted as INSIGHT, against FastAWARD,
a state-of-the-art instance selection method for time series. Our results indicate
substantial improvements in terms of classification accuracy and drastic reduction
(orders of magnitude) in execution times.

1 Introduction
Time-series classification is a widely examined data mining task with applications in
various domains, including finance, networking, medicine, astronomy, robotics, bio-
metrics, chemistry and industry [11]. Recent research in this domain has shown that the
simple nearest-neighbor (1-NN) classifier using Dynamic Time Warping (DTW) [18]
as distance measure is “exceptionally hard to beat” [6]. Furthermore, 1-NN classifier is
easy to implement and delivers a simple model together with a human-understandable
explanation in form of an intuitive justification by the most similar train instances.
The efficiency of nearest-neighbor classification can be improved with several meth-
ods, such as indexing [6]. However, for very large time-series data sets, the execution
time for classifying new (unlabeled) time-series can still be affected by the significant
computational requirements posed by the need to calculate DTW distance between the
new time-series and several time-series in the training data set (O(n) in worst case,
where n is the size of the training set). Instance selection is a commonly applied ap-
proach for speeding-up nearest-neighbor classification. This approach reduces the size


of the training set by selecting the best representative instances and using only them dur-
ing classification of new instances. Due to its advantages, instance selection has been
explored for time-series classification [20].
In this paper, we propose a novel instance-selection method that exploits the re-
cently explored concept of hubness [16], which states that some few instances tend to
be much more frequently nearest neighbors than the remaining ones. Based on hub-
ness, we propose a framework for score-based instance selection, which is combined
with a principled approach of selecting instances that optimize the coverage of training
data, in the sense that a time series x covers an other time series y, if y can be classi-
fied correctly using x. The proposed framework not only allows better understanding
of the instance selection problem, but helps to analyze the properties of the proposed
approach from the point of view of coverage maximization. For the above reasons, the
proposed approach is denoted as Instance Selection based on Graph-coverage and Hub-
ness for Time-series (INSIGHT). INSIGHT is evaluated experimentally with a collec-
tion of 37 publicly available time series classification data sets and is compared against
FastAWARD [20], a state-of-the-art instance selection method for time series classifi-
cation. We show that INSIGHT substantially outperforms FastAWARD both in terms
of classification accuracy and execution time for performing the selection of instances.
The paper is organized as follows. We begin with reviewing related work in section
2. Section 3 introduces score-based instance selection and the implications of hubness
to score-based instance selection. In section 4, we discuss the complexity of the in-
stance selection problem, and the properties of our approach. Section 5 presents our
experiments followed by our concluding remarks in section 6.

2 Related Work
Attempts to speed up DTW-based nearest neighbor (NN) classification [3] fall into 4
major categories: i) speed-up the calculation of the distance of two time series, ii) reduce
the length of time series, iii) indexing, and iv) instance selection.
Regarding the calculation of the DTW-distance, the major issue is that implement-
ing it in the classic way [18], the comparison of two time series of length l requires the
calculation of the entries of an l × l matrix using dynamic programming, and therefore
each comparison has a complexity of O(l^2). A simple idea is to limit the warping win-
dow size, which eliminates the calculation of most of the entries of the DTW-matrix:
only a small fraction around the diagonal remains. Ratanamahatana and Keogh [17]
showed that such reduction does not negatively influence classification accuracy, in-
stead, it leads to more accurate classification. More advanced scaling techniques include
lower-bounding, like LB Keogh [10].
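To make the warping-window idea concrete, DTW between two equal-length series with a band of width w can be sketched as follows (a generic illustration, not the exact implementation used in the experiments):

import numpy as np

def dtw_window(x, y, w):
    # DTW distance between equal-length sequences x and y with warping window w:
    # cell (i, j) is filled only when |i - j| <= w, reducing the cost from
    # O(l^2) to O(l * w).
    l = len(x)
    D = np.full((l + 1, l + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, l + 1):
        for j in range(max(1, i - w), min(l, i + w) + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[l, l]))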
Another way to speed-up time series classification is to reduce the length of time
series by aggregating consecutive values into a single number [13], which reduces the
overall length of time series and thus makes their processing faster.
Indexing [4], [7] aims at fast finding the most similar training time series to a given
time series. Due to the “filtering” step that is performed by indexing, the execution
time for classifying new time series can be considerable for large time-series data sets,
since it can be affected by the significant computational requirements posed by the need
to calculate DTW distance between the new time-series and several time-series in the
training data set (O(n) in worst case, where n is the size of the training set). For this
reason, indexing can be considered complementary to instance selection, since both
these techniques can be applied to improve execution time.
Instance selection (also known as numerosity reduction or prototype selection) aims
at discarding most of the training time series while keeping only the most informative
ones, which are then used to classify unlabeled instances. While instance selection is
well explored for general nearest-neighbor classification, see e.g. [1], [2], [8], [9], [14],
there are just a few works for the case of time series. Xi et al. [20] present the Fast-
AWARD approach and show that it outperforms state-of-the-art, general-purpose in-
stance selection techniques applied for time series.
FastAWARD follows an iterative procedure for discarding time series: in each it-
eration, the rank of all the time series is calculated and the one with lowest rank is
discarded. Thus, each iteration corresponds to a particular number of kept time
series. Xi et al. argue that the optimal warping window size depends on the number of
kept time series. Therefore, FastAWARD calculates the optimal warping window size
for each number of kept time series.
FastAWARD follows some decisions whose nature can be considered as ad-hoc (such
as the application of an iterative procedure or the use of tie-breaking criteria [20]). Con-
versely, INSIGHT follows a more principled approach. In particular, INSIGHT gener-
alizes FastAWARD by being able to use several formulae for scoring instances. We
will explain that the suitability of such formulae is based on the hubness property that
holds in most time-series data sets. Moreover, we provide insights into the fact that the
iterative procedure of FastAWARD is not a well-formed decision, since its large com-
putation time can be saved by ranking instances only once. Furthermore, we observed
the warping window size to be less crucial, and therefore we simply use a fixed window
size for INSIGHT (that outperforms FastAWARD using adaptive window size).

3 Score Functions in INSIGHT

INSIGHT performs instance selection by assigning a score to each instance and select-
ing instances with the highest scores (see Alg. 1). In this section, we examine how to
develop appropriate score functions by exploiting the property of hubness.

3.1 The Hubness Property

In order to develop a score function that selects representative instance for nearest-
neighbor time-series classification, we have to take into account the recently explored
property of hubness [15]. This property states that for data with high (intrinsic) dimen-
sionality, as most of the time-series data1 , some objects tend to become nearest neigh-
bors much more frequently than others. In order to express hubness in a more precise
way, for a data set D we define the k-occurrence of an instance x ∈ D, denoted fNk (x),
that is the number of instances of D having x among their k nearest neighbors. With
the term hubness we refer to the phenomenon that the distribution of fNk (x) becomes
1 In case of time series, consecutive values are strongly interdependent, thus instead of the length
of time series, we have to consider the intrinsic dimensionality [16].
Fig. 1. Distribution of fG1 (x) for some time series datasets. The horizontal axis corresponds to the
values of fG1 (x), while on the vertical axis we see how many instances have that value.

significantly skewed to the right. We can measure this skewness, denoted by S_{fNk(x)},
with the standardized third moment of fNk (x):

S_{fNk(x)} = E[( fNk (x) − μ_{fNk(x)} )^3 ] / σ^3_{fNk(x)}    (1)

where μ_{fNk(x)} and σ_{fNk(x)} are the mean and standard deviation of fNk (x). When S_{fNk(x)}
is higher than zero, the corresponding distribution is skewed to the right and starts
presenting a long tail.
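As an illustration, the k-occurrence counts and their skewness can be computed as in the following sketch (it assumes a precomputed pairwise DTW distance matrix; the function names are ours):

import numpy as np
from scipy.stats import skew

def k_occurrence(dist_matrix, k):
    # f_N^k(x): number of instances that have x among their k nearest neighbors.
    m = dist_matrix.shape[0]
    counts = np.zeros(m, dtype=int)
    for i in range(m):
        order = [j for j in np.argsort(dist_matrix[i]) if j != i]
        counts[order[:k]] += 1
    return counts

# Skewness of the k-occurrence distribution (Eq. 1), e.g.:
# S = skew(k_occurrence(D, k=1))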
In the presence of labeled data, we distinguish between good hubness and bad hub-
ness: we say that the instance y is a good (bad) k-nearest neighbor of the instance x
if (i) y is one of the k-nearest neighbors of x, and (ii) both have the same (different)
class labels. This allows us to define good (bad) k-occurrence of a time series x, fGk (x)
(and fBk (x) respectively), which is the number of other time series that have x as one
of their good (bad) k-nearest neighbors. For time series, both distributions fGk (x) and
fBk (x) are usually skewed, as is exemplified in Figure 1, which depicts the distribution
of fG1 (x) for some time series data sets (from the collection used in Table 1). As shown,
the distributions have long tails, in which the good hubs occur.
We say that a time series x is a good (bad) hub, if fGk (x) (and fBk (x) respectively)
is exceptionally large for x. For the nearest neighbor classification of time series, the
skewness of good occurrence is of major importance, because a few time series (i.e.,
the good hubs) are able to correctly classify most of the other time series. Therefore, it
is evident that instance selection should pay special attention to good hubs.

3.2 Score Functions Based on Hubness

Good 1-occurrence score — In the light of the previous discussion, INSIGHT can use
scores that take the good 1-occurrence of an instance x into account. Thus, a simple
score function that follows directly is the good 1-occurrence score fG (x):

fG (x) = fG1 (x) (2)

Henceforth, when there is no ambiguity, we omit the upper index 1.



While x is being a good hub, at the same time it may appear as bad neighbor of several
other instances. Thus, INSIGHT can also consider scores that take bad occurrences into
account. This leads to scores that relate the good occurrence of an instance x to either
its total occurrence or to its bad occurrence. For simplicity, we focus on the following
relative score, however other variations can be used too:

Relative score fR (x) of a time series x is the fraction of good 1-occurrences and total
occurrences plus one (plus one in the denominator avoids division by zero):

fR (x) = fG1 (x) / ( fN1 (x) + 1 )    (3)
Xi’s score — Interestingly, fGk (x) and fBk (x) allows us to interpret the ranking criterion
of Xi et al. [20], by expressing it as another form of score for relative hubness:

fXi (x) = fG1 (x) − 2 fB1 (x) (4)
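Given the good and bad 1-nearest neighbors within the training set, all three scores can be computed in one pass, as in the following sketch (an illustration only; the array names are ours):

import numpy as np

def hubness_scores(nn_index, labels):
    # nn_index[i]: index of the 1-nearest neighbor (under DTW) of training
    # instance i within the training set; labels[i]: its class label.
    m = len(labels)
    f_N = np.zeros(m, dtype=int)    # total 1-occurrences
    f_G = np.zeros(m, dtype=int)    # good 1-occurrences, Eq. (2)
    f_B = np.zeros(m, dtype=int)    # bad 1-occurrences
    for i in range(m):
        j = nn_index[i]
        f_N[j] += 1
        if labels[i] == labels[j]:
            f_G[j] += 1
        else:
            f_B[j] += 1
    f_R = f_G / (f_N + 1)           # relative score, Eq. (3)
    f_Xi = f_G - 2 * f_B            # Xi's score, Eq. (4)
    return f_G, f_R, f_Xi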

4 Coverage and Instance Selection


Based on scoring functions, such as those described in the previous section, INSIGHT
selects top-ranked instances (see Alg. 1). However, while ranking the instances, it is also
important to examine the interactions between them. For example, suppose that the 1st
top-ranked instance allows correct 1-NN classification of almost the same instances as
the 2nd top-ranked instance. The contribution of the 2nd top-ranked instance is, there-
fore, not important with respect to the overall classification. In this section we describe
the concept of coverage graphs, which helps to examine the aforementioned aspect
of interactions between the selected instances. In Section 4.1 we examine the general
relation between coverage graphs and instance-based learning methods, whereas in Sec-
tion 4.2 we focus on the case of 1-NN time-series classification.

4.1 Coverage Graphs for Instance-Based Learning Methods


We first define coverage graphs, which in the sequel allow us to cast the instance-
selection problem as a graph-coverage problem:
Definition 1 (Coverage graph). A coverage graph Gc = (V, E) is a directed graph,
where each vertex v ∈ VGc , corresponds to a time series of the (labeled) training set. A
directed edge from vertex vx to vertex vy , denoted as (vx , vy ) ∈ EGc states that instance
x contributes to the correct classification of instance y.

Algorithm 1. INSIGHT
Require: Time-series dataset D, Score Function f , Number of selected instances N
Ensure: Set of selected instances (time series) D′
1: Calculate score function f (x) for all x ∈ D
2: Sort all the time series in D according to their scores f (x)
3: Select the top-ranked N time series and return the set containing them
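A minimal rendering of Algorithm 1 in code, reusing the hubness_scores sketch from Section 3.2 (an illustration only, not the authors' implementation):

def insight_select(nn_index, labels, score, N):
    # Algorithm 1 (INSIGHT): score every training instance, sort by the score,
    # and keep the N top-ranked instances.
    f_G, f_R, f_Xi = hubness_scores(nn_index, labels)
    scores = {'good': f_G, 'relative': f_R, 'xi': f_Xi}[score]
    ranked = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    return ranked[:N]   # indices of the selected time series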
We first examine coverage graphs for the general case of instance-based learning
methods, which include the k-NN (k ≥ 1) classifier and its generalizations, such as
adaptive k-NN classification where the number of nearest neighbors k is chosen adap-
tively for each object to be classified [12], [19].2 In this context, the contribution of
an instance x to the correct classification of an instance y refers to the case when x is
among the nearest neighbors of y and they have the same label.
Based on the definition of the coverage graph, we can next define the coverage of a
specific vertex and of set of vertices:

Definition 2 (Coverage of a vertex and of vertex-set). A vertex v covers another
vertex v′ if there is an edge from v′ to v; C(v) is the set of all vertices covered by v:
C(v) = {v′ | v′ ≠ v ∧ (v′ , v) ∈ EGc }. Moreover, a set of vertices S0 covers all the vertices
that are covered by at least one vertex v ∈ S0 : C(S0 ) = ∪_{v∈S0} C(v).

Following the common assumption that the distribution of the test (unlabeled) data is
similar to the distribution of the training (labeled) data, the more vertices are covered,
the better prediction for new (unlabeled) data is expected. Therefore, the objective of an
instance-selection algorithm is to have the selected vertex-set S (i.e., selected instances)
covering the entire set of vertices (i.e., the entire training set), i.e., C(S) = VGc . This,
however, may not be always possible, such as when there exist vertices that are not
covered by any other vertex. If a vertex v is not covered by any other vertex, this means
that the out-degree of v is zero (there are no edges going from v to other vertices).
Denote the set of such vertices by VG0c . Then, an ideal instance selection algorithm
should cover all coverable vertices, i.e., for the selected vertices S an ideal instance
selection algorithm should fulfill:

∪_{v∈S} C(v) = VGc \ VG0c    (5)

In order to achieve the aforementioned objective, the trivial solution is to select all
the instances of the training set, i.e., chose S = VGc . This, however is not an effective
instance selection algorithm, as the major aim of discarding less important instances
is not achieved at all. Therefore, the natural requirement regarding the ideal instance
selection algorithm is that it selects the minimal amount of those instances that together
cover all coverable vertices. This way we can cast the instance selection task as a cov-
erage problem:

Instance selection problem (ISP) — We are given a coverage graph Gc = (V, E). We
aim at finding a set of vertices S ⊆ VGc so that: i) all the coverable vertices are covered
(see Eq. 5), and ii) the size of S is minimal among all those sets that cover all coverable
vertices.
Next we will show that this problem is NP-complete, because it is equivalent to the
set-covering problem (SCP), which is NP-complete [5]. We proceed with recalling the
set-covering problem.
2 Please notice that in the general case the resulting coverage graph has no regularity regarding
both the in- and out-degrees of the vertices (e.g., in the case of k-NN classifier with adaptive k).
Set-covering problem (SCP) — “An instance (X, F ) of the set-covering problem con-
sists of a finite set X and a family F of subsets of X, such that every element of X
belongs to at least one subset in F . (...) We say that a subset F ∈ F covers its ele-
ments. The problem is to find a minimum-size subset C ⊆ F whose members cover all
of X” [5]. Formally: the task is to find C ⊆ F , so that |C| is minimal and X = ∪_{F∈C} F.
Theorem 1. ISP and SCP are equivalent. (See Appendix for the proof.)

4.2 1-NN Coverage Graphs


In this section, we introduce 1-nearest neighbor (1-NN) coverage graphs which is moti-
vated by the good performance of the 1-NN classifier for time series classification. We
show the optimality of INSIGHT for the case of 1-NN coverage graphs and how the
NP-completeness of the general case (Section 4.1) is alleviated for this special case.
We first define the specialization of the coverage graph based on the 1-NN relation:
Definition 3 (1-NN coverage graph). A 1-NN coverage graph, denoted by G1NN is a
coverage graph where (vx , vy ) ∈ EG1NN if and only if time series y is the first nearest
neighbor of time series x and the class labels of x and y are equal.
This definition states that an edge points from each vertex v to the nearest neighbor of
v, only if this is a good nearest neighbor (i.e., their labels match). Thus, vertices are not
connected with their bad nearest neighbors.
From the practical point of view, to account for the fact that the size of selected
instances is defined apriori (e.g., a user-defined parameter), a slightly different version
of the Instance Selection Problem (ISP) is the following:

m-limited Instance Selection Problem (m-ISP) — If we wish to select exactly m la-


beled time series from the training set, then, instead of selecting the minimal amount
of time series that ensure total coverage, we select those m time series that maximize
the coverage. We call this variant m-limited Instance Selection Problem (m-ISP). The
following proposition shows the relation between 1-NN coverage graphs and m-ISP:
Proposition 1. In 1-NN coverage graphs, selecting m vertices v1 , ..., vm that have the
largest covered sets C(v1 ), ..., C(vm ) leads to the optimal solution of m-ISP.
The validity of this proposition stems from the fact that, in 1-NN coverage graphs, the
out-degree of all vertices is 1. This implies that each vertex is covered by at most one
other vertex, i.e., the covered sets C(v) are mutually disjoint for each v ∈ VG1NN .
Proposition 1 describes the optimality of INSIGHT, when the good 1-occurrence
score (Equation 2) is used, since the size of the set C(vi ) is the number of vertices having
vi as first good nearest neighbor. It has to be noted that the described framework of coverage
graphs can be extended to other scores too, such as relative scores (Equations 3 or 4).
In such cases, we can additionally model bad neighbors and introduce weights on the
edges of the graph. For example, for the score of Equation 4, the weight of an edge
e is +1, if e denotes a good neighbor, whereas it is −2, if e denotes a bad neighbor.
We can define the coverage score of a vertex v as the sum of weights of the incoming
edges to v and aim to maximize this coverage score. The detailed examination of this
generalization is addressed as future work.
Fig. 2. Accuracy as a function of the number of selected instances (in % of the entire training data)
for some datasets for FastAWARD and INSIGHT

5 Experiments
We experimentally examine the performance of INSIGHT with respect to effectiveness,
i.e., classification accuracy, and efficiency, i.e., execution time required by instance se-
lection. As baseline we use FastAWARD [20].
We used 37 publicly available time series datasets3 [6]. We performed 10-fold-cross
validation. INSIGHT uses fG (x) (Eq. 2) as the default score function, however fR (x)
(Eq. 3) and fXi (x) (Eq. 4) are also being examined. The resulting combinations are
denoted as INS- fG (x), INS- fR (x) and INS- fXi (x), respectively.
The distance function for the 1-NN classifier is DTW that uses warping windows [17].
In contrast to FastAWARD, which determines the optimal warping window size ropt ,
INSIGHT sets the warping-window size to a constant of 5%. (This selection is justified
by the results presented in [17], which show that relatively small window sizes lead to
higher accuracy.) In order to speed-up the calculations, we used the LB Keogh lower
bounding technique [10] for both INSIGHT and FastAWARD.
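For reference, a minimal sketch of a DTW computation constrained by a Sakoe-Chiba warping window (5% of the series length, as above) is given below. The LB Keogh lower bound and other optimisations are omitted, and the sketch is an illustration rather than the implementation used in the experiments.

import numpy as np

def dtw_window(a, b, window_frac=0.05):
    # DTW with a Sakoe-Chiba band whose width is a fraction of the series length
    # (5% here, following the warping-window setting described above).
    n, m = len(a), len(b)
    w = max(int(window_frac * max(n, m)), abs(n - m))
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2        # squared point-wise difference
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]                                   # accumulated warping cost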

Results on Effectiveness — We first compare INSIGHT and FastAWARD in terms


of classification accuracy that results when using the instances selected by these two
methods. Table 1 presents the average accuracy and corresponding standard deviation
for each data set, for the case when the number of selected instances is equal to 10%
of the size of the training set (for INSIGHT, the INS- fG (x) variation is used). In the
vast majority of cases, INSIGHT substantially outperforms FastAWARD. In the few remaining cases, their differences are remarkably small (in the order of the second or third decimal digit, which is not significant in terms of standard deviations).
We also compared INSIGHT and FastAWARD in terms of the resulting classification
accuracy for varying number of selected instances. Figure 2 illustrates that INSIGHT
compares favorably to FastAWARD. Due to space constraints, we cannot present such results for all data sets, but an analogous conclusion is drawn for all cases of Table 1 for which INSIGHT outperforms FastAWARD.
Besides the comparison between INSIGHT and FastAWARD, what is also interesting
is to examine their relative performance compared to using the entire training data (i.e.,
no instance selection is applied). Indicatively, for 17 data sets from Table 1 the accuracy
3 For StarLightCurves the calculations had not been completed for FastAWARD by the time of submission; therefore we omit this dataset.

Table 1. Accuracy ± standard deviation for INSIGHT and FastAWARD (bold font: winner)

Dataset FastAWARD INS- fG (x) Dataset FastAWARD INS- fG (x)


50words 0.526±0.041 0.642±0.046 Lighting7 0.447±0.126 0.510±0.082
Adiac 0.348±0.058 0.469±0.049 MALLAT 0.551±0.098 0.969±0.013
Beef 0.350±0.174 0.333±0.105 MedicalImages 0.642±0.033 0.693±0.049
Car 0.450±0.119 0.608±0.145 Motes 0.867±0.042 0.908±0.027
CBF 0.972±0.034 0.998±0.006 OliveOil 0.633±0.100 0.717±0.130
Chlorinea 0.537±0.023 0.734±0.030 OSULeaf 0.419±0.053 0.538±0.057
CinC 0.406±0.089 0.966±0.014 Plane 0.876±0.155 0.981±0.032
Coffee 0.560±0.309 0.603±0.213 Sonyd 0.924±0.032 0.976±0.017
Diatomb 0.972±0.026 0.966±0.058 SonyIIe 0.919±0.015 0.912±0.033
ECG200 0.755±0.113 0.835±0.090 SwedishLeaf 0.683±0.046 0.756±0.048
ECGFiveDays 0.937±0.027 0.945±0.020 Symbols 0.957±0.018 0.966±0.016
FaceFour 0.714±0.141 0.894±0.128 SyntheticControl 0.923±0.068 0.978±0.026
FacesUCR 0.892±0.019 0.934±0.021 Trace 0.780±0.117 0.895±0.072
FISH 0.591±0.082 0.666±0.085 TwoPatterns 0.407±0.027 0.987±0.007
GunPoint 0.800±0.124 0.935±0.059 TwoLeadECG 0.978±0.013 0.989±0.012
Haptics 0.303±0.068 0.435±0.060 Wafer 0.921±0.012 0.991±0.002
InlineSkate 0.197±0.056 0.434±0.077 WordsSynonyms 0.544±0.058 0.637±0.066
Italyc 0.960±0.020 0.957±0.028 Yoga 0.550±0.017 0.877±0.021
Lighting2 0.694±0.134 0.670±0.096
a ChlorineConcentration, b DiatomSizeReduction, c ItalyPowerDemand,
d SonyAIBORobotSurface, e SonyAIBORobotSurfaceII

resulting from INSIGHT (INS- fG (x)) is worse by less than 0.05 compared to using the
entire training data. For FastAWARD this number is 4, which clearly shows that INSIGHT selects more representative instances of the training set than FastAWARD.
Next, we investigate the reasons for the presented difference between INSIGHT and FastAWARD. In Section 3.1, we identified the skewness of the good k-occurrence, fGk (x), as
a crucial property for instance selection to work properly, since skewness renders good
hubs to become representative instances. In our examination, we found that using the
iterative procedure applied by FastAWARD, this skewness has a decreasing trend from
iteration to iteration. Figure 3 exemplifies this by illustrating the skewness of fG1 (x) for
two data sets as a function of iterations performed in FastAWARD. (In order to quanti-
tatively measure skewness we use the standardized third moment, see Equation 1.) The
reduction in the skewness of fG1 (x) means that FastAWARD is not able to identify representative instances in the end, since there are no pronounced good hubs remaining.
To further understand that the reduced effectiveness of FastAWARD stems from its
iterative procedure and not from its score function, fXi (x) (Eq. 4), we compare the
accuracy of all variations of INSIGHT including INS- fXi (x), see Tab. 2. Remarkably,
INS- fXi (x) clearly outperforms FastAWARD for the majority of cases, which verifies
our previous statement. Moreover, the differences between the three variations are not
large, indicating the robustness of INSIGHT with respect to the scoring function.

Fig. 3. Skewness of the distribution of fG1 (x) as a function of the number of iterations performed in FastAWARD. The skewness tends to decrease from iteration to iteration.

Table 2. Number of datasets where different versions of INSIGHT win/lose against FastAWARD

INS- fG (x) INS- fR (x) INS- fXi (x)


Wins 32 33 33
Loses 5 4 4

Table 3. Execution times (in seconds, averaged over 10 folds) of instance selection using IN-
SIGHT and FastAWARD for some datasets

Dataset FastAWARD INS- fG (x) Dataset FastAWARD INS- fG (x)


50words 94 464 203 Lighting7 5 511 8
Adiac 32 935 75 Mallat 4 562 881 19 041
Beef 1 273 3 MedicalImages 13 495 55
Car 11 420 18 Motes 17 937 55
CBF 37 370 67 OliveOil 3 233 5
ChlorineConcentration 16 920 1 974 OSULeaf 80 316 118
CinC 3 604 930 16 196 Plane 1 527 4
Coffee 499 1 SonyAIBORobotS. 4 608 11
DiatomSizeReduction 18 236 44 SonyAIBORobotS.II 10 349 23
ECG200 634 2 SwedishLeaf 37 323 89
ECGFiveDays 20 455 60 Symbols 165 875 514
FaceFour 4 029 6 SyntheticControl 3 017 8
FacesUCR 150 764 403 Trace 3 606 11
FISH 59 305 93 TwoPatterns 360 719 1 693
GunPoint 1 107 4 TwoLeadECG 12 946 45
Haptics 152 617 869 Wafer 923 915 4 485
InlineSkate 906 472 4 574 WordsSynonyms 101 643 203
ItalyPowerDemand 1 855 6 Yoga 1 774 772 6 114
Lighting2 15 593 23

Results on Efficiency — The computational complexity of INSIGHT depends on the


calculation of the scores of the instances of the training set and on the selection of the
top-ranked instances. Thus, for the examined score functions, the computational com-
plexity is O(n^2), n being the number of training instances, since it is determined by the
calculation of the distance between each pair of training instances. For FastAWARD, its

first step (leave-one-out nearest neighbor classification of the training instances) already requires O(n^2) execution time. However, FastAWARD performs additional computa-
tionally expensive steps, such as determining the best warping-window size and the
iterative procedure for excluding instances. For this reason, INSIGHT is expected to
require reduced execution time compared to FastAWARD. This is verified by the re-
sults presented in Table 3, which show the execution time needed to perform instance
selection with INSIGHT and FastAWARD. As expected, INSIGHT outperforms Fast-
AWARD drastically. (Regarding the time for classifying new instances, please notice
that both methods perform 1-NN using the same number of selected instances, there-
fore the classification times are equal.)

6 Conclusion and Outlook

We examined the problem of instance selection for speeding-up time-series classifica-


tion. We introduced a principled framework for instance selection based on coverage
graphs and hubness. We proposed INSIGHT, a novel instance selection method for
time series. In our experiments we showed that INSIGHT outperforms FastAWARD, a
state-of-the-art instance selection algorithm for time series.
In our future work, we aim at examining the generalization of coverage graphs for
considering weights on edges. We also plan to extend our approach for other instance-
based learning methods besides 1-NN classifier.

Acknowledgements. Research partially supported by the Hungarian National Research


Fund (Grant Number OTKA 100238).

References
1. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Machine Learn-
ing 6(1), 37–66 (1991)
2. Brighton, H., Mellish, C.: Advances in Instance Selection for Instance-Based Learning Al-
gorithms. Data Mining and Knowledge Discovery 6, 153–172 (2002)
3. Buza, K., Nanopoulos, A., Schmidt-Thieme, L.: Time-Series Classification based on Indi-
vidualised Error Prediction. In: IEEE CSE 2010 (2010)
4. Chakrabarti, K., Keogh, E., Sharad, M., Pazzani, M.: Locally adaptive dimensionality reduc-
tion for indexing large time series databases. ACM Transactions on Database Systems 27,
188–228 (2002)
5. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT
Press, Cambridge (2001)
6. Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.: Querying and Mining of
Time Series Data: Experimental Comparison of Representations and Distance Measures. In:
VLDB 2008 (2008)
7. Gunopulos, D., Das, G.: Time series similarity measures and time series indexing. ACM
SIGMOD Record 30, 624 (2001)
8. Jankowski, N., Grochowski, M.: Comparison of instance selection algorithms I. Algorithms survey. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC
2004. LNCS (LNAI), vol. 3070, pp. 598–603. Springer, Heidelberg (2004)

9. Jankowski, N., Grochowski, M.: Comparison of instance selection algorithms II. Results and
Comments. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC
2004. LNCS (LNAI), vol. 3070, pp. 580–585. Springer, Heidelberg (2004)
10. Keogh, E.: Exact indexing of dynamic time warping. In: VLDB 2002 (2002)
11. Keogh, E., Kasetty, S.: On the Need for Time Series Data Mining Benchmarks: A Survey
and Empirical Demonstration. In: SIGKDD (2002)
12. Ougiaroglou, S., Nanopoulos, A., Papadopoulos, A.N., Manolopoulos, Y., Welzer-Druzovec,
T.: Adaptive k-Nearest-Neighbor Classification Using a Dynamic Number of Nearest Neigh-
bors. In: Ioannidis, Y., Novikov, B., Rachev, B. (eds.) ADBIS 2007. LNCS, vol. 4690, pp.
66–82. Springer, Heidelberg (2007)
13. Lin, J., Keogh, E., Lonardi, S., Chiu, B.: A Symbolic Representation of Time Series, with
Implications for Streaming Algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop
on Research Issues in Data Mining and Knowledge Discovery (2003)
14. Liu, H., Motoda, H.: On Issues of Instance Selection. Data Mining and Knowledge Discov-
ery 6, 115–130 (2002)
15. Radovanovic, M., Nanopoulos, A., Ivanovic, M.: Nearest Neighbors in High-Dimensional
Data: The Emergence and Influence of Hubs. In: ICML 2009 (2009)
16. Radovanovic, M., Nanopoulos, A., Ivanovic, M.: Time-Series Classification in Many Intrin-
sic Dimensions. In: 10th SIAM International Conference on Data Mining (2010)
17. Ratanamahatana, C.A., Keogh, E.: Three myths about Dynamic Time Warping. In: SDM
(2005)
18. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recog-
nition. IEEE Trans. Acoustics, Speech and Signal Proc. 26, 43–49 (1978)
19. Wettschereck, D., Dietterich, T.: Locally Adaptive Nearest Neighbor Algorithms. Advances
in Neural Information Processing Systems 6 (1994)
20. Xi, X., Keogh, E., Shelton, C., Wei, L., Ratanamahatana, C.A.: Fast Time Series Classifica-
tion Using Numerosity Reduction. In: Airoldi, E.M., Blei, D.M., Fienberg, S.E., Goldenberg,
A., Xing, E.P., Zheng, A.X. (eds.) ICML 2006. LNCS, vol. 4503. Springer, Heidelberg (2007)

Appendix: Proof of Theorem 1


We show the equivalence in two steps. First we show that ISP is a subproblem of SCP, i.e. for
each instance of ISP a corresponding instance of SCP can be constructed (and the solution of the
SCP-instances directly gives the solution of the ISP-instance). In the second step we show that
SCP is a subproblem of ISP. Both together imply equivalence.
For each ISP-instance we construct a corresponding SCP-instance: X := VGc \VG0c and F := {C(v)|v ∈ VGc }. We say that vertex v is the seed of the set C(v). The solution of SCP is a set F′ ⊆ F. The set of seeds of the subsets in F′ constitutes the solution of ISP: S = {v | C(v) ∈ F′}.
While constructing an ISP-instance for an SCP-instance, we have to be careful, because
the number of subsets in SCP is not limited. Therefore in the coverage graph Gc there are
two types of vertices. Each first-type-vertex vx corresponds to one element x ∈ X, and each
second-type-vertex vF corresponds to a subset F ∈ F. Edges go only from first-type-vertices
to second-type-vertices, thus only first-type-vertices are coverable. There is an edge (vx , vF ) from
a first-type-vertex vx to a second-type-vertex vF if and only if the corresponding element of X
is included in the corresponding subset F, i.e. x ∈ F. When the ISP is solved, all the coverable vertices (first-type-vertices) are covered by a minimal set of vertices S. In this case, S obviously consists only of second-type-vertices. The solution of the SCP consists of the subsets corresponding to the vertices included in S: C = {F | F ∈ F ∧ vF ∈ S}.
Multiple Time-Series Prediction
through Multiple Time-Series Relationships
Profiling and Clustered Recurring Trends

Harya Widiputra , Russel Pears, and Nikola Kasabov

The Knowledge Engineering and Discovery Research Institute,


Auckland University of Technology, New Zealand
{harya.widiputra,rpears,nkasabov}@aut.ac.nz
http://kedri.info

Abstract. Time-series prediction has been very well researched by both


the Statistical and Data Mining communities. However the multiple time-
series problem of predicting simultaneous movement of a collection of
time sensitive variables which are related to each other has received much
less attention. Strong relationships between variables suggests that tra-
jectories of given variables that are involved in the relationships can be
improved by including the nature and strength of these relationships into
a prediction model. The key challenge is to capture the dynamics of the
relationships to reflect changes that take place continuously over time.
In this research we propose a novel algorithm for extracting profiles of
relationships through an evolving clustering method. We use a form of
non-parametric regression analysis to generate predictions based on the
profiles extracted and historical information from the past. Experimental
results on real-world climatic data reveal that the proposed algorithm outperforms well-established methods of time-series prediction.

Keywords: time-series inter-relationships, multiple time-series predic-


tion, evolving clustering method, recurring trends.

1 Introduction

Previous studies have found that in multiple time-series data relating to real-world phenomena in the Biological and Economic domains, dynamic relationships between series exist, and, governed by these relationships, the series move together
through time. For instance, it is well known that movement of a stock market
index in a specific country is affected by the movements of other stock market
indexes across the world or in that particular region [1],[2],[3]. Likewise, in a
Gene Regulatory Network (GRN) the expression level of a Gene is determined
by its time varying interactions with other Genes [4],[5].
However, even though time-series prediction has been extensively researched,
and some prominent methods from the machine learning and data mining arenas

Corresponding author.


such as the Multi-Layer Perceptron and Support Vector Machines have been
developed, there has been no research so far into developing a method that
can predict multiple time-series simultaneously based on interactions between
the series. The closest researches that take into account multiple time-series
variables are that of ([6],[7],[8],[9]) which generally used historical values of some
independent variables as inputs to a model that estimates future values of a
dependent variable. Consequently, these methods do not have the capability to
capture and model the dynamics of relationships in multiple time-series dataset
and to predict their future values simultaneously.
This research proposes a new method for modeling the dynamics of relation-
ships in multiple time-series and to simultaneously predict their future values
without the need for generating multiple models. The work thus focuses on the
discovery of profiles of relationships in multiple time-series dataset and the recur-
ring trends of movement that occur in a specific relationship’s form to construct
a knowledge repository of the system under evaluation. The identification and
exploitation of these profiles and recurring trends is expected to provide knowl-
edge to perform simultaneous multiple time-series prediction, and that it would
also significantly improve the accuracy of time-series prediction.
The rest of the paper is organized as follows; in the next section we briefly re-
view the issues involved in time-series modeling and cover the use of both Global
and Localized models. Section 3 describes and explains the method proposed in
this paper to discover profiles of relationships in multiple time-series dataset
and their recurring trends of movement. In section 4 we present our experimen-
tal findings, and to end with, in section 5 we conclude the paper summarizing
the main achievements of the research and briefly outline some directions for
future research.

2 Global and Local Model of Time-Series Prediction


In the last few decades, the use of a single global model to forecast future events
based on known past events has been a very popular approach in the data mining
and knowledge discovery area [10]. Global models are built using all available
historical data and thus can be used to predict future trends. However, the
trajectories that global models produce often fail to track localized changes that
take place at discrete points in time. This is due to the fact that trajectories tend
to smooth localized deviations by averaging the effects of such deviations over a
long period of time. In reality localized disturbances may be of great significance
as they capture the conditions under which a time-series deviates from the norm.
Furthermore, it is of interest to capture similar deviations from a global tra-
jectory that take place repeatedly over time, in other words to capture recurring
deviations from the norm that are similar in shape and magnitude. Such local-
ized phenomena can only be captured accurately by localized models that are
built only on data that define the phenomenon under consideration and are not
contaminated by data outside the underlying localized phenomenon.
Local models can be built by grouping together data that has similar behavior.
Different types of phenomena will define their own clusters. Models can then be

developed for each cluster (i.e. local regressions) that will yield better accuracy
over the local problem space covered by the model in contrast to a global model.
Having a set of local models also offers greater flexibility as predictions can be
made either on the basis of a single model or, if needed, on a global level by
combining the predictions made by the individual local models [11].

3 Local Modeling of Multiple Time-Series Data

In this research, a method is proposed to construct local models of multiple time-series that contain profiles of relationships between the series, by utilizing a non-parametric regression model in combination with the Evolving Clustering Method (ECM) [12]. The construction of local models in the proposed methodology consists of two main steps, namely the extraction of profiles of relationships between time-series and the detection and clustering of recurring trends of movement in time-series when a particular profile emerges. The prin-
cipal objective of the methodology is to construct a repository of profiles and
recurring trends as the knowledge-base and key data resource to perform multiple
time-series prediction.
To attain this objective, a two-level local modeling process is utilized in the pro-
posed methodology. The first level of local modeling which deals with the extrac-
tion of profiles of relationships between series in a sub-space of the given multi-
ple time-series data is outlined in Section 3.1. The second level of local modeling,
which is the procedure to detect and cluster recurring trends that take place in
time-series when a particular profile is emerging, is described in Section 3.2.
Constructed local models are expected to be able to capture the underly-
ing behavior of the multiple time-series under examination, in terms of their
tendency to move collectively in a similar fashion over time. Knowledge about
such underlying behavior is expected to be helpful in predicting their upcoming
movements as new data becomes available. This premise has been experimentally
verified in the work presented in this paper.

3.1 Extracting Profiles of Relationship of Multiple Time-Series

Most of the work in clustering time-series data has concentrated on sample clus-
tering rather than variable clustering [13]. However, one of the key tasks in
our work is to group together series that are highly correlated and have similar
shapes of movement (variable clustering), as we believe that these local models
representing clusters of similar profiles will provide a better basis than a single
global model for predicting future movement of the multiple time-series.
The first step in extracting profiles of relationships between multiple time-
series is the computation of cross-correlation coefficients between the observed
time-series using the Pearson’s correlation analysis. Additionally, only those sta-
tistically significant correlation coefficients identified by the t-test with confi-
dence level of 95% are taken into account. The following step of the algorithm
is to measure dissimilarity between time-series from the Pearson’s correlation

Fig. 1. The Pearson’s correlation coefficient matrix is calculated from a given multiple
time-series data (TS-1,TS-2,TS-3,TS-4), and then converted to normalized correlation
[Equation 1] before the profiles are finally extracted

coefficient (line 1, Algorithm 1) by calculating the Rooted Normalized One-


Minus Correlation [13] (known hence as normalized correlation in this paper)
given by,

\mathrm{RNOMC}(a, b) = \sqrt{\frac{1 - \mathrm{corr}(a, b)}{2}}    (1)
where a and b are the time-series being analyzed. The normalized correlation
coefficient ranges from 0 to 1, in which 0 denotes high similarity and 1 signifies
the opposite condition.
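As an illustration of the two steps just described, the following sketch computes pairwise Pearson correlations, keeps only the coefficients that pass the 95% significance test, and converts them with Equation 1. The function name and the use of scipy for the significance test are illustrative choices, not part of the original method description.

import numpy as np
from scipy import stats

def rnomc_matrix(X, alpha=0.05):
    # X: array of shape (n_series, n_points). Pairs whose Pearson correlation is
    # not significant at the 95% level are left as NaN; the rest are converted
    # to the normalized correlation of Equation 1.
    n = X.shape[0]
    R = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(i + 1, n):
            r, p = stats.pearsonr(X[i], X[j])    # correlation and its p-value
            if p < alpha:
                R[i, j] = R[j, i] = np.sqrt((1.0 - r) / 2.0)
    return R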
Thereafter, the last stage of the algorithm is to extract profiles of relationships
from the normalized correlation matrix. The methodology used in this step is
outlined in lines 3 to 24 of Algorithm 1. The whole process of extracting profiles of relationships is illustrated in Figure 1. In any case, the fundamental concept of this algorithm is to group multiple time-series with a comparable fashion of movement whilst validating that all time-series belonging to the same cluster are correlated and hold a significant level of similarity.
The underlying concept of Algorithm 1 is closely comparable to the CAST,
Clustering Affinity Search Technique, clustering algorithm [14]. However, Al-
gorithm 1 works by dynamically creating new clusters, deleting and merging
existing clusters as it evaluates the coefficient of similarity between time-series
or observed variables. Therefore, Algorithm 1 is considerably different compared
to CAST, which creates a single cluster at a time and performs updates by adding new elements to the cluster from a pool of elements, or by removing elements from the cluster and returning them to the pool, as it evaluates the affinity factor of the cluster to which the elements belong.
After the profiles have been extracted, the next step of the methodology is to mine and cluster series' trends of movement from each profile. This process is outlined and explained in the next section. Additionally, as the time-complexity of Algorithm 1 is O((n^2 − n)/2), to avoid expensive re-computation

Algorithm 1. Extracting profiles of relationship of multiple time-series


Require: X, where X1 , X2 , ..., Xn are observed time-series
Ensure: profiles of relationships between multiple time-series
1: calculate the normalized correlation coefficient [Equation 1] of X
2: for each time-series X1 , X2 , ..., Xn do
3: //pre-condition: Xi ,Xj do not belong to any cluster
4: if (Xi ,Xj are correlated) AND (Xi , Xj do not belong to any cluster) then
5: allocate Xi ,Xj together in a new cluster
6: end if
7: //pre-condition: Xi belongs to a cluster; Xj does not belong to any cluster
8: if (Xi ,Xj are correlated) AND (Xi belongs to a cluster) then
9: if (Xj is correlated with all Xi cluster member) then
10: allocate Xj to cluster of Xi
11: else if (Xi ,Xj correlation > max(correlation) of Xi with its cluster member)
AND (Xj is not correlated with any of Xi cluster member) then
12: remove Xi from its cluster; allocate Xi ,Xj together in a new cluster
13: end if
14: end if
15: //pre-condition: Xi and Xj belong to different cluster
16: if (Xi ,Xj are correlated) AND (Xi , Xj belong to different cluster) then
17: if (Xi is correlated with all Xj cluster member) AND (Xj is correlated with
all Xi cluster member) then
18: merge cluster of Xi ,Xj together
19: else if (Xi ,Xj correlation > max(correlation) of Xj with its cluster member)
AND (Xj is correlated with all Xi cluster member) then
20: remove Xj from its cluster; allocate Xj to cluster of Xi
21: else if (Xi ,Xj correlation > max(correlation) of both Xi ,Xj with their cluster
member) AND (Xi is not correlated with one of Xj cluster member) AND
(Xj is not correlated to any of Xi cluster member) then
22: remove Xi ,Xj from their cluster; allocate Xi ,Xj together in a new cluster
23: end if
24: end if
25: end for
26: return clusters of multiple time-series

and extraction of profiles, the extracted profiles of relationships are stored and updated dynamically instead of being computed on the fly.

3.2 Clustering Recurring Trends of a Time-Series

To detect and cluster recurring trends of movement from localized sets of time-series data, an algorithm that extracts patterns of movement in the form of a polynomial regression function and groups them on the basis of similarity in the regression coefficients was proposed in a previous study [15]. However, in the algorithm proposed here, to eliminate the assumption that the data is drawn from a Gaussian distribution when estimating the regression function, a non-parametric regression analysis is used in place of the polynomial regression.

The process to cluster recurring trends of a time-series by using kernel regression


as the non-parametric regression method is outlined in Algorithm 2 as follows,
– Step 1, perform autocorrelation analysis on the time-series dataset from which trends of movement will be extracted and clustered. The lag (lag > 0) with the highest autocorrelation coefficient is then taken as the size of the snapshot window n.
– Step 2, create the first cluster C1 by simply taking the trend of movement of the first snapshot X(1) = (X1(1), X2(1), ..., Xn(1)) from the input stream as the first cluster centre Cc1, and set the cluster radius Ru1 to 0.
In this methodology, the i-th trend of movement, represented by the kernel weight vector wi = (wi1, wi2, ..., win) as the outcome of the non-parametric regression analysis, is calculated using the Nadaraya-Watson kernel weighted average formula defined as follows,

\hat{X}_j^{(i)} = f(x_j^{(i)}, w_i) = \frac{\sum_{k=1}^{n} w_{ik}\, x_{jk}}{\sum_{k=1}^{n} x_{jk}}    (2)

Here x_j^{(i)} = (x_{ij1}, ..., x_{ijk}) is the original data X(i) extended over the domain j with a certain small step dx, where j = 1, 2, ..., (n/dx + 1). x_j = (x_{j1}, ..., x_{jk}) is calculated using the Gaussian MF equation as follows,

x_{jk} = K(x_j, k) = \exp\left(-\frac{(x_j - k)^2}{2\alpha^2}\right)    (3)
where xj = dx×(j−1), k = 1, 2, ..., n and α is a pre-defined kernel bandwidth.
The kernel weight wi is estimated using common ordinary least squares (OLS) such that the following objective function is minimized,

\mathrm{SSR} = \sum_{k=1}^{n} \left(X_k^{(i)} - \hat{X}_j^{(i)}\right), \quad \forall\, \hat{X}_j^{(i)} \text{ where } x_j = X_k    (4)

To gain knowledge about the upcoming trend of movement when a particular trend emerges in a locality of time, the algorithm also models the next trajectories of a data snapshot, defined by,

\hat{X}_j^{(i)(u)} = f(x_j^{(u)}, w_i^{(u)}) = \frac{\sum_{k=1}^{n+1} w_{ik}^{(u)}\, K(x_j^{(u)}, k)}{\sum_{k=1}^{n+1} K(x_j^{(u)}, k)}    (5)

where x_j^{(u)} = dx \times (j^{(u)} - 1); j^{(u)} = 1, 2, ..., (\frac{n+1}{dx} + 1); k = 1, 2, ..., n+1, and the kernel weights w_i^{(u)} = (w_{i1}^{(u)}, w_{i2}^{(u)}, ..., w_{i(n+1)}^{(u)}).

– Step 3, if there are no more data snapshots, then the process stops (go to Step 7); else the next snapshot, X(i), is taken. Extract the trend of movement from
X(i) as in Step 2, and calculate distances between current trend and all m
already created cluster centres defined as,

Di,l = 1 − CorrelationCoefficient(wi , Ccl ), (6)

where l = 1, 2, ..., m. If a cluster centre Ccl is found where Di,l ≤ Rul, then the current trend joins cluster Cl and the step is repeated; else continue to the next step.
– Step 4, find a cluster Ca (with centre Cca and cluster radius Rua ) from all
m existing cluster centres by calculating the values of Si,a given by,

Si,a = Di,a + Rua = min(Si,l ), (7)

where Si,l = Di,l + Rul and l = 1, 2, ..., m.


– Step 5, if Si,a > 2 × Dthr, where Dthr is a clustering parameter to limit the
maximum size of a cluster radius, then current trend of X(i) , wi , does not
belong to any existing clusters. A new cluster is then created in the same
way as described in Step 2, and the algorithm returns to Step 3.
– Step 6, if Si,a ≤ 2 × Dthr, current trend of X(i) , wi , joins cluster Ca .
Cluster Ca is updated by moving its centre, Cca , and increasing the value of
its radius, Rua. The updated radius Ru_a^new is set to Si,a/2 and the new centre Cc_a^new is now the mean value of all trends of movement belonging to cluster Ca. The distance from the new centre Cc_a^new to the current trend wi is equal to Ru_a^new.
The algorithm then returns to Step 3.
– Step 7, end of the procedure.

In the procedure of clustering trends the following indexes are used,

– number of data snapshots: i = 1, 2, ...;


– number of clusters: l = 1, 2, ..., m;
– number of input and output variables: k = 1, 2, ..., n.

Clusters of trends of movement are then stored in each extracted profile of rela-
tionship. This information about the profiles and the trends of movement inside them will then be exploited as the knowledge repository to perform simultaneous multiple time-series prediction.
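A minimal sketch of Steps 2-6 is given below. The kernel-weight representation of a snapshot is fitted by ordinary least squares at the snapshot's own time points (a simplification of Equations 2-4 that omits the finer dx grid), and trend vectors are then assigned to clusters using the correlation-based distance of Equation 6 and the Dthr rule of Steps 5-6. Parameter defaults follow the values reported in Section 4.2, and the code is an illustration rather than the authors' implementation.

import numpy as np

def fit_trend(snapshot, alpha=0.1):
    # Represent a snapshot by kernel weights w (Eqs. 2-4): w is fitted by least
    # squares so the kernel-weighted average reproduces the snapshot values.
    n = len(snapshot)
    t = np.arange(1, n + 1, dtype=float)
    K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2.0 * alpha ** 2))  # Gaussian kernels
    A = K / K.sum(axis=1, keepdims=True)
    w, *_ = np.linalg.lstsq(A, np.asarray(snapshot, dtype=float), rcond=None)
    return w

def cluster_trends(trends, dthr=0.5):
    # ECM-style clustering of trend vectors (Steps 2-6) with the
    # correlation-based distance of Eq. 6.
    centres, radii, members = [], [], []
    for w in trends:
        if not centres:                                   # Step 2: first cluster
            centres.append(np.array(w)); radii.append(0.0); members.append([w])
            continue
        d = np.array([1.0 - np.corrcoef(w, c)[0, 1] for c in centres])
        if np.any(d <= np.array(radii)):                  # Step 3: falls inside a cluster
            members[int(np.argmin(d))].append(w)
            continue
        s = d + np.array(radii)
        a = int(np.argmin(s))                             # Step 4: closest cluster
        if s[a] > 2 * dthr:                               # Step 5: create a new cluster
            centres.append(np.array(w)); radii.append(0.0); members.append([w])
        else:                                             # Step 6: join and update cluster a
            members[a].append(w)
            radii[a] = s[a] / 2.0
            centres[a] = np.mean(members[a], axis=0)
    return centres, radii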

3.3 Knowledge Repository and Multiple Time-Series Prediction


To visualize how Algorithms 1 and 2 extract profiles of relationships and cluster
recurring trends from multiple time-series to construct a knowledge repository,
a pre-created simple synthetic dataset is presented to the algorithms, as shown in Figure 2. The most suitable snapshot window size is determined through autocorrelation analysis, as explained in the previous section and indicated in the first step of Algorithm 2.

Fig. 2. Creation of knowledge repository (profiles of relationships and recurring trends)

Fig. 3. Multiple time-series prediction using profiles of relationships and recurring trends

Figure 2 illustrates how a repository of profiles of relationships and recurring


trends is built from the first snapshot of the observed multiple time-series and
how it is being updated dynamically after the algorithm has processed the third
snapshot. This knowledge repository of profiles and recurring trends acts as the
knowledge-base and is the key data resource to perform multiple time-series
prediction.

After the repository has been built, there are two further steps that need
to be performed before prediction can take place. The first step is to extract
current profiles of relationships between the multiple series. Thereafter, matches
are found between the current trajectory and previously stored profiles from the
past. Predictions are then made by implementing a weighting scheme that gives
more importance to pairs of series that belong to the same profile and retain
comparable trends of movement. The weight wij for a given pair (i, j) of series is given by the degree of similarity between them.
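The weighting scheme is not spelled out in detail here; one plausible, heavily simplified reading is sketched below: every series that shares the target's current profile contributes a candidate next value taken from its matched recurring trend, weighted by its similarity to the target series. The function name and data layout are illustrative assumptions, not the authors' implementation.

import numpy as np

def weighted_next_value(candidates):
    # candidates: (predicted_value, similarity) pairs, one per series that shares
    # the target's current profile; similarity in [0, 1] plays the role of w_ij.
    preds = np.array([p for p, _ in candidates], dtype=float)
    sims = np.array([s for _, s in candidates], dtype=float)
    if sims.sum() == 0.0:
        return float(preds.mean())            # fall back to a plain average
    return float(np.average(preds, weights=sims))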

4 Experiments and Evaluation of Results

4.1 New Zealand Climate Data

Air pressure data collected from various locations in New Zealand by the Na-
tional Institute of Weather and Atmosphere (NIWA, http://www.niwa.co.nz)
constitutes the multiple time-series in this research. The data covers a period of
three years, ranging from 1st January 2007 to 31st December 2009.
Findings from a previous study of the global weather system [16], which argue that a small change to one part of the system can lead to a complete change in the weather system as a whole, are the key reason that drives us to use this dataset. Consequently, being able to reveal profiles of relationship between air pressure in different locations at various time-points would help us to understand more about the behavior of our weather system and would also facilitate constructing a reliable prediction model.

4.2 Prediction Accuracy Evaluation

To examine the performance of the proposed algorithm in predicting changes


in weather patterns across the multiple stations, a prediction of future air pres-
sure level in four locations from 1st October 2009 to 31st December 2009 was
performed. Additionally, to confirm that forecasting movement of multiple time-
series simultaneously by exploiting relationships between multiple series would
provide better accuracy, prediction outcomes are also compared to results from
the commonly-used single time-series prediction methods, such as multiple linear
regression and the Multi-Layer Perceptron.
Throughout the conducted experiments, the window size used to perform
bootstrap sampling in extracting profiles of relationships and recurring trends
from the training dataset is set to 5, as suggested by the results of autocorrelation
analysis performed on the New Zealand air pressure dataset. Furthermore, the
parameters in clustering recurring trends process (Algorithm 2) are set to dx =
0.1, α = 0.1, and Dthr = 0.5.
Generally, the prediction error rates of the estimated trajectories (RMSE, Root Mean Square Error), as outlined in Table 1 and Figure 4, reveal that the proposed algorithm is superior in predicting the movement of dynamic and oscillating time-series data, with a consistently high degree of accuracy in relation to multiple linear regression and the Multi-Layer Perceptron.

Table 1. RMSE of the proposed algorithm compared to the commonly-used single


time-series prediction method, multiple linear regression (MLR) and the Multi-Layer
Perceptron (MLP)

Location Proposed Algorithm MLR MLP


Paeroa 1.3890 3.4257 3.1592
Auckland 1.3219 3.5236 3.0371
Hamilton 1.4513 3.7263 3.4958
Reefton 1.8351 4.1725 3.9125

Fig. 4. Results of 100 days (1st October 2009 to 31st December 2009) air pressure level
prediction at four observation locations (Paeroa, Auckland, Hamilton and Reefton) in
New Zealand

These results confirm the proposals in previous studies which state that, by being able to reveal and understand the characteristics of relationships between variables in multiple time-series data, one can predict their future states or behavior accurately [2],[3],[4].
In addition, as expected, the proposed algorithm is not only able to provide excellent accuracy in predicting future values, but it is also capable of extracting knowledge about profiles of relationship between different locations in New Zealand (in terms of movement of air pressure level) and clustering recurring trends which exist in the series, as illustrated in Figures 5 and 6. Consequently, our study also reveals that the air pressure levels in the four locations are highly correlated and tend to move in a similar fashion through time. This is shown by the circle in the lower left corner of Figure 5, where the number of occurrences of this profile is 601.

Fig. 5. Extracted profiles of relationship from air pressure data. The radius represents
average normalized correlation coefficient, while N indicates number of occurrences of
a distinct profile.

Fig. 6. Created clusters of recurring trends when Paeroa, Auckland, Hamilton and
Reefton are detected to be progressing in a highly correlated similar manner

5 Conclusion and Future Work

The outcomes of the conducted experiments clearly show that predicting the movement of multiple time-series of the same real-world phenomenon by using profiles of relationship between the series improves prediction accuracy.
Additionally, the algorithm proposed in this study demonstrates the ability
to: (1) extract profiles of relationships and recurring trends from a multiple time-
series data i.e., the air pressure dataset from four observation stations in New
Zealand; (2) perform prediction of multiple time-series all together with excel-
lent precision; and (3) evolve, by continuing to extract profiles of relationships
and recurring trends over time. As future work, we plan to explore the use of

correlation analysis methods which are capable of detecting non-linear correlations between observed variables (e.g. correlation ratio, Copula) in place of
the Pearson’s correlation coefficient to extract profiles of relationships of multiple
time-series.

References
1. Collins, D., Biekpe, N.: Contagion and Interdependence in African Stock Markets.
The South African Journal of Economics 71(1), 181–194 (2003)
2. Masih, A., Masih, R.: Dynamic Modeling of Stock Market Interdependencies: An
Empirical Investigation of Australia and the Asian NICs. Working Papers 98-18,
pp. 1323–9244. University of Western Australia (1998)
3. Antoniou, A., Pescetto, G., Violaris, A.: Modelling International Price Relation-
ships and Interdependencies between the Stock Index and Stock Index Future Mar-
kets of Three EU Countries: A Multivariate Analysis. Journal of Business, Finance
and Accounting 30, 645–667 (2003)
4. Kasabov, N., Chan, Z., Jain, V., Sidorov, I., Dimitrov, D.: Gene Regulatory Net-
work Discovery from Time-series Gene Expression Data: A Computational Intelli-
gence Approach. In: Pal, N.R., Kasabov, N., Mudi, R.K., Pal, S., Parui, S.K. (eds.)
ICONIP 2004. LNCS, vol. 3316, pp. 1344–1353. Springer, Heidelberg (2004)
5. Friedman, L., Nachman, P.: Using Bayesian Networks to Analyze Expression Data.
Journal of Computational Biology 7, 601–620 (2000)
6. Liu, B., Liu, J.: Multivariate Time Series Prediction via Temporal Classification.
In: Proc. IEEE ICDE 2002, pp. 268–275. IEEE, Los Alamitos (2002)
7. Kim, T., Adali, T.: Approximation by Fully Complex Multilayer Perceptrons. Neu-
ral Computation 15, 1641–1666 (2003)
8. Yang, H., Chan, L., King, I.: Support Vector Machine Regression for Volatile Stock
Market Prediction. In: Yellin, D.M. (ed.) Attribute Grammar Inversion and Source-
to-source Translation. LNCS, vol. 302, pp. 143–152. Springer, Heidelberg (1988)
9. Zanghui, Z., Yau, H., Fu, A.M.N.: A new stock price prediction method based on
pattern classification. In: Proc. IJCNN 1999, pp. 3866–3870. IEEE, Los Alamitos
(1999)
10. Holland, J.H., Holyoak, K.J., Nisbett, R.E., Thagard, P.R.: Induction: Processes
of Inference, Learning and Discovery, Cambridge, MA, USA (1989)
11. Kasabov, N.: Global, Local and Personalised Modelling and Pattern Discovery in
Bioinformatics: An Integrated Approach. Pattern Recognition Letters 28, 673–685
(2007)
12. Song, Q., Kasabov, N.: ECM - A Novel On-line Evolving Clustering Method and Its
Applications. In: Posner, M.I. (ed.) Foundations of Cognitive Science, pp. 631–682.
MIT Press, Cambridge (2001)
13. Rodrigues, P., Gama, J., Pedroso, P.: Hierarchical Clustering of Time-Series Data
Streams. IEEE TKDE 20(5), 615–627 (2008)
14. Ben-Dor, A., Shamir, R., Yakhini, Z.: Clustering Gene Expression Patterns. Journal
of Computational Biology 6(3/4), 281–297 (1999)
15. Widiputra, H., Kho, H., Lukas, Pears, R., Kasabov, N.: A Novel Evolving Clus-
tering Algorithm with Polynomial Regression for Chaotic Time-Series Prediction.
In: Leung, C.S., Lee, M., Chan, J.H. (eds.) ICONIP 2009. LNCS, vol. 5864, pp.
114–121. Springer, Heidelberg (2009)
16. Vitousek, P.M.: Beyond Global Warming: Ecology and Global Change. Ecol-
ogy 75(7), 1861–1876 (1994)
Probabilistic Feature Extraction from
Multivariate Time Series Using Spatio-Temporal
Constraints

Michal Lewandowski, Dimitrios Makris, and Jean-Christophe Nebel

Digital Imaging Research Centre, Kingston University, London, United Kingdom


{m.lewandowski,d.makris,j.nebel}@kingston.ac.uk

Abstract. A novel nonlinear probabilistic feature extraction method,


called Spatio-Temporal Gaussian Process Latent Variable Model, is in-
troduced to discover generalised and continuous low dimensional rep-
resentation of multivariate time series data in the presence of stylistic
variations. This is achieved by incorporating a new spatio-temporal con-
straining prior over latent spaces within the likelihood optimisation of
Gaussian Process Latent Variable Models (GPLVM). As a result, the
core pattern of multivariate time series is extracted, whereas a style
variability is marginalised. We validate the method by qualitative com-
parison of different GPLVM variants with their proposed spatio-temporal
versions. In addition we provide quantitative results on a classification
application, i.e. view-invariant action recognition, where imposing spatio-
temporal constraints is essential. Performance analysis reveals that our
spatio-temporal framework outperforms the state of the art.

1 Introduction
A multivariate time series (MTS) is a sequential collection of high dimensional
observations generated by a dynamical system. The high dimensionality of MTS
creates challenges for machine learning and data mining algorithms. To tackle
this, feature extraction techniques are required to obtain computationally effi-
cient and compact representations.
Gaussian Process Latent Variable Model [5] (GPLVM) is one of the most
powerful nonlinear feature extraction algorithms. GPLVM emerged in 2004 and
instantly made a breakthrough in dimensionality reduction research. The novelty
of this approach is that, in addition to the optimisation of low dimensional
coordinates during the dimensionality reduction process as other methods did,
it marginalises parameters of a smooth and nonlinear mapping function from low
to high dimensional space. As a consequence, GPLVM defines a continuous low
dimensional representation of high dimensional data, which is called latent space.
Since GPLVM is a very flexible approach, it has been successfully applied in a
range of application domains including pose recovery [2], human tracking [20],
computer animation [21], data visualization [5] and classification [19].
However, extensive study of the GPLVM framework has revealed some essen-
tial limitations of the basic algorithm [5, 6, 8, 12, 19, 21, 22]. First, since GPLVM


aims at retaining the global structure of data in the latent space, there is no guaran-
tee that local features are preserved. As a result, the natural topology of the
data manifold may not be maintained. This is particularly problematic when
data, such as MTS, have a strong and meaningful intrinsic structure. In addi-
tion, when data are captured from different sources, even after normalisation,
GPLVM tends to produce latent spaces which fail to represent common local fea-
tures [8, 21]. This prevents successful utilisation of the GPLVM framework for
feature extraction of MTS. In particular, GPLVM cannot be applied in many
classification applications such as speech and action recognition, where latent
spaces should be inferred from time series generated by different subjects and
used to classify data produced by unknown individuals. Another drawback of
GPLVM is its computationally expensive learning process [5, 6, 21] which may
converge towards a local minimum if the initialization of the model is poor [21].
Although recent extensions of GPLVM, i.e. back constrained GPLVM [12] (BC-GPLVM) and Gaussian Process Dynamical Model [22] (GPDM), allow a satisfactory representation of time series, the creation of generalised latent spaces from data issued from several sources is still an unsolved problem which has never been addressed by the research community. In this paper, we define ’style’ as
the data variations between two or more datasets representing a similar phe-
nomenon. They can be produced by different sources and/or repetitions from
a single source. Here, we propose an extension of the GPLVM framework, i.e.
Spatio-Temporal GPLVM (ST-GPLVM), which produces generalised and prob-
abilistic representation of MTS in the presence of stylistic variations. Our main
contribution is the integration of a spatio-temporal ’constraining’ prior distribu-
tion over the latent space within the optimisation process.
After a brief review of the state of art, we introduce the proposed methodology.
Then, we validate qualitatively our method on a real dataset of human behavioral
time series. Afterwards, we apply our method to a challenging view independent
action recognition task. Finally, conclusions are presented.

2 Related Work
Feature extraction methods can be divided into two general categories, i.e., deter-
ministic and probabilistic frameworks. The deterministic methods can be further
classified into two main classes: linear and nonlinear methods. Linear methods
like PCA cannot model the curvature and nonlinear structures embedded in ob-
served spaces. As a consequence, nonlinear methods, such as Isomap [17], locally
linear embedding [13] (LLE), Laplacian Eigenmaps [1] (LE) and kernel PCA [14],
were proposed to address this issue. Isomap, LLE and LE aim at preserving a
specific geometrical property of the underlying manifold by constructing graphs
which encapsulate nonlinear relationships between points. However, they do not
provide any mapping function between spaces. In contrast, kernel PCA obtains
embedded space through nonlinear kernel based mapping from a high to low
dimensional space. In order to deal with MTS, extensions of Isomap [3], LE [8]
and the kernel based approach [15] were proposed.

Since previously described methods do not model uncertainty, another class of


feature extraction methods evolved, the so-called latent variable models (LVM).
They define a probability distribution over the observed space that is conditioned
on a latent variable and mapping parameters. Consequently, it produces a prob-
abilistic generative mapping from the latent space to the data space. In order to
address intrinsic limitations of probabilistic linear methods, such as probabilis-
tic principal component analysis [18] (PPCA), Lawrence [5] reformulated the
PPCA model to the nonlinear GPLVM by establishing nonlinear mappings from
the latent variable space to the observed space. From a Bayesian perspective,
the Gaussian process prior is placed over these mappings rather than the latent
variables with a nonlinear covariance function. As a result, GPLVM produces a
complete joint probability distribution over latent and observed variables.
Recently, many researchers have exploited GPLVM in a variety of appli-
cations [2, 5, 19, 20, 21], thus designing a number of GPLVM-based extensions
which address some of the limitations of standard GPLVM. Lawrence [5, 6] pro-
posed to use sparse approximation of the full Gaussian process which allows de-
creasing the complexity of the learning process. Preservation of observed space
topology was supported by imposing high dimensional constraints on the latent
space [12, 21]. BC-GPLVM [12] enforces local distance preservation through the
form of a kernel-based regression mapping from the observed space to the latent
space. Locally linear GPLVM [21] (LL-GPLVM) extends this concept by defining
explicitly a cylindrical topology to maintain. This is achieved by constructing
advanced similarity measures for the back constrained mapping function and
incorporation of the LLE objective function [13] into the GPLVM framework to
reflect a domain specific prior knowledge about observed data. Another line of
work, i.e. GPDM [22], augments GPLVM with a nonlinear dynamical mapping
on the latent space based on the auto-regressive model to take advantage of
temporal information provided with time series data.
Current GPLVM based approaches have proven very effective when modelling
of MTS variability is desired in the latent space. However, these methods are
inappropriate in a context of recognition based applications where the discovery
of a common content pattern is more valuable than modelling stylistic variations.
In this work, we tackle this fundamental problem by introducing the idea of a
spatio-temporal interpretation of GPLVM. This concept is formulated by incor-
porating a constraining prior distribution over the latent space in the GPLVM
framework. In contrast to LL-GPLVM and BC-GPLVM, we aim at implicitly
preserving a spatio-temporal structure of the observed time series data in order
to discard style variability and discover a unique low dimensional representation
for a given set of MTS. The proposed extension is easily adaptable to any variant
of GPLVM, for instance BC-GPLVM or GPDM.

3 Methodology
Let a set of multivariate time series Y consist of multiple repetitions (or cycles)
of the same phenomenon from the same or different sources and all data points
{yi }(i=1..N ) in this set are distributed on a manifold in a high dimensional space

(yi ∈ RD ). ST-GPLVM is able to discover low dimensional representation X =


{xi }(i=1..N ) (xi ∈ Rd with d ≪ D) by giving a Gaussian process prior to mapping
functions from the latent variable space X to the observed space Y under a
constraint L to preserve the spatio-temporal patterns of the underlying manifold.
The entire process is summarized in figure 1.
Initially the spatio-temporal constraints L are constructed. These constraints
are exploited twofold. First they are used to better initialise the model by
discovering a low dimensional embedded space which is close to the expected rep-
resentation. Secondly, they constrain the GPLVM optimisation process so that
it converges faster and maintains the spatio-temporal topology of the data. The
learning process is performed using the standard two stage maximum a posteriori
(MAP) estimation used in GPLVM. Latent positions X, and the hyperparameters
Φ are optimised iteratively until the optimal solution is reached under the intro-
duced constraining prior p(X|L). The key novelty of the proposed methodology is
its style generalisation potential. ST-GPLVM discovers a coherent and continuous
low dimensional representation by identifying common spatio-temporal patterns
which result in discarding style variability among all conceptually similar time
series.

3.1 Gaussian Process Latent Variable Model


GPLVM [5] was derived from the observation that a particular probabilistic in-
terpretation of PCA is a product of Gaussian Process (GP) models, where each
of them is associated with a linear covariance function (i.e. kernel function). Con-
sequently, the design of a non-linear probabilistic model could be achieved by re-
placing the linear kernel function with a non-linear covariance function. From a
Bayesian perspective, by marginalising over the mapping function [5], the com-
plete joint likelihood of all observed data dimensions given the latent positions is:
p(Y|X, \Phi) = \frac{1}{(2\pi)^{DN/2}\,|K|^{D/2}} \exp\left(-0.5\,\mathrm{tr}(K^{-1} Y Y^T)\right)    (1)
where Φ denotes the kernel hyperparameters and K is the kernel matrix of
the GP model which is assembled with a combination of covariance functions:
K = {k(xi , xj )}(i,j=1..N ) . The covariance function is usually expressed by the
nonlinear radial basis function (RBF):
k(x_i, x_j) = \alpha \exp\left(-\frac{\gamma}{2}\,\|x_i - x_j\|^2\right) + \beta^{-1}\delta_{x_i x_j}    (2)
where the kernel hyperparameters Φ = {α, β, γ} respectively determine the out-
put variance, the variance of the additive noise and the RBF width. δxi xj is the
Kronecker delta function.
Learning is performed using two stage MAP estimation. First, latent variables
are initialized, usually using PPCA. Secondly, latent positions and the hyperpa-
rameters are optimised iteratively until the optimal solution is reached. This can
be achieved by maximising the likelihood (1) with respect to the latent positions,
X, and the hyperparameters, Φ using the following posterior:
p(X, Φ|Y ) ∝ p(Y |X, Φ)p(X)p(Φ) (3)


where the priors over the unknowns are p(X) = N(0, I) and p(\Phi) \propto \prod_i \Phi_i^{-1}. The maximisation of the above posterior is equivalent to minimising the negative log posterior of the model:

-\ln p(X, \Phi|Y) = 0.5\left((DN+1)\ln 2\pi + D\ln|K| + \mathrm{tr}(K^{-1} Y Y^T) + \sum_i \|x_i\|^2\right) + \sum_i \Phi_i    (4)
This optimization process can be achieved numerically using the scaled conju-
gate gradient [11] (SCG) method with respect to Φ and X. However, the learning
process is computationally very expensive, since O(N^3) operations are required
in each gradient step to inverse the kernel matrix K [5]. Therefore, in practice,
a sparse approximation to the full Gaussian process, such as ’fully independent
training conditional’ (FITC) approximation [6] or active set selection [5], is ex-
ploited to reduce the computational complexity to a more manageable O(k^2 N)
where k is the number of points involved in the lower rank sparse approximation
of the covariance [6].
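To make the objective concrete, a minimal numpy sketch of the data term of Equations 1, 2 and 4 is given below: it assembles the RBF covariance over the latent positions and evaluates the corresponding negative log likelihood. The latent prior and hyperparameter terms, the SCG optimisation and any sparse approximation are omitted, and the function names are illustrative rather than taken from an existing GPLVM library.

import numpy as np

def rbf_kernel(X, alpha, beta, gamma):
    # RBF covariance of Eq. 2 over latent positions X (N x d), with white noise.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return alpha * np.exp(-0.5 * gamma * sq) + np.eye(len(X)) / beta

def gplvm_data_term(X, Y, alpha, beta, gamma):
    # Negative log likelihood of Eq. 1 (the data term of Eq. 4):
    # 0.5 * (D*N*log(2*pi) + D*log|K| + tr(K^{-1} Y Y^T)).
    N, D = Y.shape
    K = rbf_kernel(X, alpha, beta, gamma)
    _, logdet = np.linalg.slogdet(K)
    Kinv_Y = np.linalg.solve(K, Y)
    return 0.5 * (D * N * np.log(2.0 * np.pi) + D * logdet + np.trace(Y.T @ Kinv_Y))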

3.2 Spatio-Temporal Gaussian Process Latent Variable Model


The proposed ST-GPLVM relies on the novel concept of a spatio-temporal con-
straining prior which is introduced into the standard GPLVM framework in
order to maintain temporal coherence and marginalise style variability. This is
achieved by designing an objective function, where the prior p(X) in Eq. 3 is
replaced by the proposed conditioned prior p(X|L):

p(X, Φ|Y, L) ∝ p(Y |X, Φ)p(X|L)p(Φ) (5)

where L denotes the spatio-temporal constraints imposed on the latent space.


Although p(X|L) is not a proper prior, conceptually it can be seen as an equiv-
alent of a prior for a given set of weights L [21]. These constraints are derived
from graph theory, since neighbourhood graphs have been powerful in design-
ing nonlinear geometrical constraints for dimensionality reduction using spectral
based approaches [1,13,17]. In particular, the Laplacian graph allows preserving
approximated distances between all data points in the low dimensional space [1].
This formulation is extensively exploited in our approach by constructing cost
matrix, L, which emphasizes spatio-temporal dependencies between similar time
series. This is achieved by designing two types of neighbourhood for each high
dimensional data point Pi (figure 2):
– Temporal neighbours (T): the 2m closest points in the sequential order of
input (figure 2a): Ti ∈ {Pi−m , ..., Pi−1 , Pi , Pi+1 , ..., Pi+m }
– Spatial neighbours (S): let’s associate to each point, Pi , 2s temporal neigh-
bours which define a time series fragment Fi . The spatial neighbours, Si , of
Pi are the centres of the qi time series fragments, Fik , which are similar to
Fi (figure 2b): Si ∈ {Fi,1 (C), ..., Fi,qi (C)}. Here Fi,k (C) returns the centre
point of Fi,k . The neighbourhood Si is determined automatically using either
dynamic time warping [8] or motion pattern detection [7].

Fig. 1. Spatio-Temporal Gaussian Process Latent Variable Model pipeline

Fig. 2. Temporal (a) and spatial (b) neighbours (green dots) of a given data point, Pi ,
(red dots)

Neighbourhood connections defined in the Laplacian graphs implicitly impose closeness of points in the latent space. Consequently, the temporal neighbours allow the temporal continuity of MTS to be modelled, whereas the spatial neighbours remove style variability by aligning MTS in the latent space.
The constraint matrix, L, is obtained, first, by assigning weights, W, to the
edges of each graph, G ∈ {T, S}, using the standard LE heat kernel [1]:

W_{ij}^{G} = \begin{cases} \exp(-\|y_i - y_j\|^2) & \text{if } i, j \text{ are connected} \\ 0 & \text{otherwise} \end{cases} \qquad (6)

Then, the information from both graphs is combined as L = L_T + L_S, where, for G ∈ {T, S}, L_G = D^G − W^G is the Laplacian matrix and D^G = diag{D_{11}^G, D_{22}^G, . . . , D_{NN}^G} denotes a diagonal matrix with entries D_{ii}^G = \sum_{j=1}^{N} W_{ij}^G. The prior probability of the latent variables, which forces each latent point to preserve the spatio-temporal topology of the observed data, is expressed by:

p(X|L) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\mathrm{tr}(XLX^T)}{2\sigma^2}\right) \qquad (7)
where σ is a global scaling factor that controls the ’strength’ of the constraining prior. Note that, although the distance between neighbours (especially spatial ones) may be large in L, it is infinite between unconnected points.
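To make the construction concrete, the following Python sketch (our illustration, not the authors' code) builds the constraint matrix L = L_T + L_S with the heat-kernel weights of Eq. (6); the spatial-neighbour index lists are assumed to be supplied by dynamic time warping or motion pattern detection as described above.

import numpy as np

def graph_laplacian(Y, neighbours):
    # neighbours[i]: indices connected to point i; weights follow Eq. (6).
    N = len(Y)
    W = np.zeros((N, N))
    for i in range(N):
        for j in neighbours[i]:
            W[i, j] = W[j, i] = np.exp(-np.sum((Y[i] - Y[j]) ** 2))
    D = np.diag(W.sum(axis=1))
    return D - W                                  # L^G = D^G - W^G

def constraint_matrix(Y, m, spatial_neighbours):
    # Temporal neighbours: the 2m closest points in sequential order.
    N = len(Y)
    temporal = [[j for j in range(max(0, i - m), min(N, i + m + 1)) if j != i]
                for i in range(N)]
    return graph_laplacian(Y, temporal) + graph_laplacian(Y, spatial_neighbours)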
The maximisation of the new objective function (5) is equivalent to minimising
the negative log posterior of the model:

-\ln p(X,\Phi|Y,L) = \frac{1}{2}\Big(D\ln|K| + \mathrm{tr}(K^{-1}YY^T) + \sigma^{-2}\,\mathrm{tr}(XLX^T) + C\Big) + \sum_i \ln\Phi_i \qquad (8)
where C is the constant (DN + 1) ln 2π + ln σ^2. Following the standard GPLVM approach, the learning process involves minimising Eq. (8) with respect to Φ and X iteratively using the SCG method [11] until convergence.
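For illustration only, a naive evaluation of the objective in Eq. (8) could look as follows (our sketch, reusing the rbf_covariance helper given after Eq. (2)); a practical implementation would instead supply analytic gradients to the SCG optimiser and use the sparse FITC approximation.

import numpy as np

def st_gplvm_objective(X, Y, L, alpha, beta, gamma, sigma):
    # X: N x q latent positions, Y: N x D observations, L: N x N constraint matrix.
    K = rbf_covariance(X, alpha=alpha, gamma=gamma, beta=beta)
    D = Y.shape[1]
    _, logdet = np.linalg.slogdet(K)
    data_term = D * logdet + np.trace(np.linalg.solve(K, Y @ Y.T))
    prior_term = np.trace(X.T @ L @ X) / sigma ** 2   # the tr(XLX^T) term of Eq. (8)
    hyper_term = np.log(alpha) + np.log(beta) + np.log(gamma)
    return 0.5 * (data_term + prior_term) + hyper_term  # additive constant C omitted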
ST-GPLVM is initialised using a nonlinear feature extraction method, i.e.
temporal LE [8] which is able to preserve the constraints L in the produced em-
bedded space. Consequently, compared to the standard usage of linear PPCA,
initialisation is more likely to be closer to the global optimum. In addition, the
enhancement of the objective function (3) with the prior (7) constrains the op-
timisation process and therefore further mitigates the problem of local minima.
The topological structure in terms of spatio-temporal dependencies is implic-
itly preserved in the latent space without enforcing any domain specific prior
knowledge.
The proposed methodology can be applied to other GPLVM based approaches,
such as BC-GPLVM [12] and GPDM [22] by integrating the prior (7) in their cost
function. The extension of BC-GPLVM results in a spatio-temporal model which
provides bidirectional mapping between latent and high dimensional spaces. Al-
ternatively, ST-GPDM produces a spatio-temporal model with an associated
nonlinear dynamical process in the latent space. Finally, the proposed extension
is compatible with a sparse approximation of the full Gaussian process [5, 6]
which allows reducing further processing complexity.

4 Validation of ST-GPLVM Approach


Our new approach is evaluated qualitatively through a comparative analysis
of latent spaces discovered by standard non-linear probabilistic latent variable
models, i.e. GPLVM, BC-GPLVM and GPDM and their extensions, i.e., ST-
GPLVM, ST-BC-GPLVM and ST-GPDM, where the proposed spatio-temporal
constraints have been included.

Fig. 3. 3D models learned from walking sequences of 3 different subjects with corre-
sponding first 2 dimensions and processing times: a) GPLVM, b)ST-GPLVM, c) BC-
GPLVM, d) ST-BC-GPLVM, e) GPDM and f) ST-GPDM. Warm-coloured regions
correspond to high reconstruction certainty.

Our evaluation is conducted using time series of MoCap data, i.e. repeated actions provided by the HumanEva dataset [16]. The MoCap time series are first converted into normalized sequences of poses, i.e. made invariant to the subject’s rotation and translation. Then each pose is represented as a set of quaternions, i.e. a 52-dimensional feature vector. In this experiment, we consider three different subjects performing a walking action comprising 500 frames each. The dimensionality of the walking action space is reduced to 3 dimensions [20, 22]. During the learn-
ing process, the computational complexity is reduced using FITC [6] where the
number of inducing variables is set to 10% of the data. The global scaling of the
constraining prior, σ, and the width of the back constrained kernel [12] were set
empirically to 10^4 and 0.1 respectively. Values of all the other parameters of the
models were estimated automatically using maximum likelihood optimisation.
The back constrained models used a RBF kernel [12].
The learned latent spaces for the walking sequences with the corresponding
first two dimensions and processing times are presented in figure 3. Qualitative
analysis confirms the generalisation property of the proposed extension. Stan-
dard GPLVM based approaches discriminate between subjects in the spatially
distinct latent space regions. Moreover, action repetitions by a given subject
are represented separately. In contrast, the introduction of our spatio-temporal

constraint in the objective functions allows a consistent and smooth representation to be produced by discarding style variability in all considered models. In addition, the extended algorithms converge significantly faster than the standard versions. Here, we achieve a speed-up by a factor of 4 to 6.

5 Application of ST-GPLVM to Activity Recognition

We demonstrate the effectiveness of the novel methodology in a realistic com-


puter vision application by integrating ST-GPLVM within the view independent
human action recognition framework proposed in [7]. Here, the training data
comprises time series of action images obtained from evenly spaced views
located around the vertical axis of a subject. In order to deal with this compli-
cated scenario, the introduced methodology is extended by a new initialisation
procedure and a new advanced constraining prior. The learning process of action
recognition framework is summarised in figure 4.
First, for each view (z=1..Z ), silhouettes are extracted from videos, normal-
ized and represented as 3364-dimensional vectors of local space-time saliency
features [7]. Then, spatio-temporal constraints Lz are calculated. During initial-
isation, style invariant one-dimensional action manifolds Xz are obtained using
temporal LE [8] and subsequently all these view-dependent models are combined
to generate a coherent view invariant representation of the action [7]. The out-
come of this procedure reveals a torus-like structure which is used to initialise
GPLVM and encapsulates both style and view. Finally, the latent space and
parameters of the model are optimised jointly under a new combined prior p(X|L).
This prior is derived by taking into account constraints associated with each
view:
p(X|L) = \prod_{z=1}^{Z} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\mathrm{tr}(X_z L_z X_z^T)}{2\sigma^2}\right) \qquad (9)
where L is a block diagonal matrix formed by all Lz . Action recognition is
performed by maximum likelihood estimation. Performance of the system is
evaluated using the multi-view IXMAS dataset [23] which is considered as the
benchmark for view independent action recognition. This dataset is comprised
of 12 actions which are performed 3 times by 12 different actors. In this dataset,
actors’ positions and orientations in videos are arbitrary since no specific instruc-
tion was given during acquisition. As a consequence, the action viewpoints are
arbitrary and unknown. Here, we use one action repetition of each subject for
training, whereas testing is performed with all action repetitions. Experiments
are conducted using the popular leave-one-out scheme [4,23,24]. Two recognition
tasks were evaluated using either a single view or multiple views. In line with
other experiments made on this dataset [9,10,24], the top view was discarded for
testing. The global scaling of the constraining prior and the number of inducing
variables in FITC [6] were set to 10^4 and 25% of the data respectively. Values
of all the other parameters of the models were estimated automatically using
maximum likelihood optimisation.

Fig. 4. Pipeline for generation of probabilistic view and style invariant action
descriptor

Table 1. Left, average recognition accuracy over all cameras using either single or
multiple views for testing. Right, class-confusion matrix using multiple views.

                   Subjects/Actions   Average accuracy (%)
                                      Single view   All views
Weinland [23]           10 / 11           63.9         81.3
Yan [24]                12 / 11           64.0         78.0
Junejo [4]              10 / 11           74.1           -
Liu [9]                 12 / 13           71.7         78.5
Liu [10]                12 / 13           73.7         82.8
Lewandowski [7]         12 / 12           73.2         83.1
Our                     12 / 12           76.1         85.4

Action recognition results are compared with the state of the art in table 1
(top view excluded). Examples of learned view and style invariant action descrip-
tors using ST-GPLVM are shown in figure 5. Although different approaches may
use slightly different experimental settings, table 1 shows that our framework
produces the best performance. In particular, it improves on the accuracy of the standard framework [7]. The confusion matrix of recognition for the ’all-view’
experiment reveals that our framework performed better when dealing with mo-
tions involving the whole body, i.e. ”walk”, ”sit down”, ”get up”, ”turn around”

Fig. 5. Probabilistic view and style invariant action descriptors obtained using ST-
GPLVM for a) sit down, b) cross arms, c) turn around and d) kick

and ”pick up”. As expected, the best recognition rates, 78.7% and 80.7%, are obtained for cameras 2 and 4 respectively, since those views are similar to the ones used for training, i.e. side views. Moreover, when dealing with either a different view, i.e. camera 1, or an even significantly different view, i.e. camera 3, our framework still achieves good recognition rates, i.e. 75.2% and 69.9% respectively.

6 Conclusion

This paper introduces a novel probabilistic approach for nonlinear feature extrac-
tion called Spatio-Temporal GPLVM. Its main contribution is the inclusion of
spatio-temporal constraints in the form of a conditioned prior into the standard
GPLVM framework in order to discover generalised latent spaces of MTS. All
conducted experiments confirm the generalisation power of the proposed concept
in the context of classification applications where marginalising style variabil-
ity is crucial. We applied the proposed extension on different GPLVM variants
and demonstrated that their Spatio-Temporal versions produce smoother, co-
herent and visually more convincing descriptors at a lower computational cost.
In addition, the methodology has been validated in a view independent action
recognition framework and produced state of the art accuracy. Consequently,
the concept of consistent representation of time series should benefit many
other applications beyond action recognition such as gesture, sign-language and
speech recognition.

References
1. Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding
and clustering. In: Proc. NIPS, vol. 14, pp. 585–591 (2001)
2. Ek, C., Torr, P., Lawrence, N.D.: Gaussian process latent variable models for
human pose estimation. Machine Learning for Multimodal Interaction, 132–143
(2007)
3. Jenkins, O., Matarić, M.: A spatio-temporal extension to isomap nonlinear dimen-
sion reduction. In: Proc. ICML, pp. 441–448 (2004)
4. Junejo, I., Dexter, E., Laptev, I., Pérez, P.: Cross-view action recognition from
temporal self-similarities. In: Proc. ECCV, vol. 12 (2008)

5. Lawrence, N.: Gaussian process latent variable models for visualisation of high
dimensional data. In: Proc. NIPS, vol. 16 (2004)
6. Lawrence, N.: Learning for larger datasets with the Gaussian process latent variable
model. In: Proc. AISTATS (2007)
7. Lewandowski, J., Makris, D., Nebel, J.C.: View and style-independent action man-
ifolds for human activity recognition. In: Daniilidis, K., Maragos, P., Paragios, N.
(eds.) ECCV 2010. LNCS, vol. 6316, pp. 547–560. Springer, Heidelberg (2010)
8. Lewandowski, M., Martinez-del-Rincon, J., Makris, D., Nebel, J.-C.: Temporal
extension of laplacian eigenmaps for unsupervised dimensionality reduction of time
series. In: Proc. ICPR (2010)
9. Liu, J., Ali, S., Shah, M.: Recognizing human actions using multiple features. In:
Proc. CVPR (2008)
10. Liu, J., Shah, M.: Learning human actions via information maximization. In: Proc.
CVPR (2008)
11. Möller, M.: A scaled conjugate gradient algorithm for fast supervised learning.
Neural Networks 6(4), 525–533 (1993)
12. Lawrence, N.D., Quinonero-Candela, J.: Local Distance Preservation in the
GP-LVM Through Back Constraints. In: Proc. ICML, pp. 513–520 (2006)
13. Roweis, S., Saul, L.: Nonlinear dimensionality reduction by locally linear embed-
ding. Science 290(5500), 2323–2326 (2000)
14. Schölkopf, B., Smola, A., Müller, K.: Kernel principal component analysis. In:
ICANN, pp. 583–588 (1997)
15. Shyr, A., Urtasun, R., Jordan, M.: Sufficient dimension reduction for visual se-
quence classification. In: Proc. CVPR (2010)
16. Sigal, L., Black, M.: HumanEva: Synchronized Video and Motion Capture Dataset
for Evaluation of Articulated Human Motion. Brown University (2006)
17. Tenenbaum, J., Silva, V., Langford, J.: A global geometric framework for nonlinear
dimensionality reduction. Science 290(5500), 2319–2323 (2000)
18. Tipping, M., Bishop, C.: Probabilistic principal component analysis. Journal of the
Royal Statistical Society, Series B 61, 611–622 (1999)
19. Urtasun, R., Darrell, T.: Discriminative Gaussian process latent variable model for
classification. In: Proc. ICML, pp. 927–934 (2007)
20. Urtasun, R., Fleet, D.J., Fua, P.: 3D people tracking with gaussian process dynam-
ical models. In: Proc. CVPR, vol. 1, pp. 238–245 (2006)
21. Urtasun, R., Fleet, D., Geiger, A., Popović, J., Darrell, T., Lawrence, N.:
Topologically-constrained latent variable models. In: Proc. ICML (2008)
22. Wang, J., Fleet, D., Hertzmann, A.: Gaussian process dynamical models. In: Proc.
NIPS, vol. 18, pp. 1441–1448 (2006)
23. Weinland, D., Boyer, E., Ronfard, R.: Action recognition from arbitrary views
using 3D exemplars. In: Proc. ICCV, vol. 5(7), p. 8 (2007)
24. Yan, P., Khan, S., Shah, M.: Learning 4D action feature models for arbitrary view
action recognition. In: Proc. CVPR, vol. 12 (2008)
Real-Time Change-Point Detection Using
Sequentially Discounting Normalized Maximum
Likelihood Coding

Yasuhiro Urabe1 , Kenji Yamanishi1 , Ryota Tomioka1 , and Hiroki Iwai2


1 The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
sfs8681@student.miyazaki-u.ac.jp, yamanishi@mist.i.u-tokyo.ac.jp, tomioka@mist.i.u-tokyo.ac.jp
(The first author’s current affiliation is the Faculty of Medicine, University of Miyazaki.)
2 Little eArth Corporation Co., Ltd, 2-16-1 Hirakawa-cho, Chiyoda-ku, Tokyo, Japan
iwai@lac.co.jp

Abstract. We are concerned with the issue of real-time change-point de-


tection in time series. This technology has recently received considerable attention
in the area of data mining since it can be applied to a wide variety of im-
portant risk management issues such as the detection of failures of com-
puter devices from computer performance data, the detection of masquer-
aders/malicious executables from computer access logs, etc. In this paper
we propose a new method of real-time change point detection employing
the sequentially discounting normalized maximum likelihood coding (SD-
NML). Here the SDNML is a method for sequential data compression of a
sequence, which we newly develop in this paper. It attains the least code
length for the sequence and the effect of past data is gradually discounted
as time goes on, hence the data compression can be done adaptively to non-
stationary data sources. In our method, the SDNML is used to learn the
mechanism of a time series, then a change-point score at each time is mea-
sured in terms of the SDNML code-length. We empirically demonstrate
the significant superiority of our method over existing methods, such as
the predictive-coding method and the hypothesis testing method, in terms
of detection accuracy and computational efficiency for artificial data sets.
We further apply our method into real security issues called malware de-
tection. We empirically demonstrate that our method is able to detect un-
seen security incidents at significantly early stages.

1 Introduction
1.1 Motivation
We are concerned with the issue of detecting change points in time series. Here
a change-point is the time point at which the statistical nature of time series
suddenly changes. Hence the detection of that point may lead to the discovery
of a novel event. The issue of change-point detection has recently received considerable attention in the area of data mining ([1],[9],[2], etc.). This is because it can be


applied to a wide variety of important data mining problems such as the de-
tection of failures of computer devices from computer performance data such as
CPU loads, the detection of malicious executables from computer access logs.
We require that the change-point detection be conducted in real-time. This
requirement is crucial in real environments as in security monitoring, system
monitoring, etc. Hence we wish to design a real-time change-point detection al-
gorithm such that, every time a datum is input, it gives a score measuring to what
extent it is likely to be a change-point. Further it is desired that such an algo-
rithm detects change-points as early as possible with least false alarms.
We attempt to design a change-point detection algorithm on the basis of data
compression. The basic idea is that a change point may be considered as a time point at which the data can no longer be compressed well using the statistical nature that has been observed so far. An important notion of sequentially normal-
ized maximum likelihood (SNML) coding has been developed in the scenario of
sequential source coding [4],[6],[5]. It has turned out to attain the shortest code-
length among possible coding methods. Hence, from the information-theoretic
view point, it is intuitively reasonable that the time point when the SNML code-
length suddenly changes can be thought of as a change point. However, SNML
coding has never been applied to the issue of change-point detection. Further in
the case where data sources are non-stationary, SNML should be extended so
the data compression is adaptive to the time-varying nature of the sources.

1.2 Purpose and Significances of This Paper


The purpose of this paper is twofold. One is to propose a new method of real-
time change point detection using the sequentially discounting normalized maxi-
mum likelihood coding(SDNML). SDNML is a variant of SNML, which we newly
develop in this paper. It is obtained by extending SNML so that out-of-date
statistics is gradually discounted as time goes on, and eventually the coding
can be adaptive to the time-varying nature of data sources. In our method, we
basically employ the two-stage learning framework proposed in [9] for real-time
change-point detection. In it there are two-stages for learning; one is the stage for
learning a probabilistic model from an original data sequence and giving a score
for each time on the basis of the learned model, and the other is the stage for
learning another probabilistic model from a score sequence obtained by smooth-
ing scores calculated at the first stage and giving a change-point score at each
time. In this framework we use SDNML code-length as a change-point score.
Note that in [9], the predictive code-length was used as a change-point score
instead of SDNML code-length. Since the SDNML coding is optimal as shown
in [6],[5], we expect that our method will lead to a better strategy than the one
proposed in [9]. The theoretical background behind this intuition is Rissanen’s
minimum description length(MDL) principle [4], which asserts that the shorter
code-length leads to the better estimation of an underlying statistical model.
The other purpose is to empirically demonstrate the significant superiority of
our method over existing methods in terms of detection accuracy and computa-
tional efficiency. We demonstrate that using both artificial data and real data. As

for artificial data demonstration, we evaluate the performance of our method for
two types of change-points; continuous change points and discontinuous ones. As
for real data demonstration, we apply our method into real security issues called
malware detection. We empirically demonstrate that our method is able to detect
unseen security incidents or their symptoms at significantly early stages. Through
this demonstration we develop a new method for discovering unseen malware by
means of change-point detection from web server logs.

1.3 Related Works

There exist several earlier works on change-point detection. A standard approach


to this issue has been to employ the hypothesis testing method [3], [8], i.e., testing
whether the probabilistic models before and after the change-point are identical
or not. Guralnik and Srivastava proposed a hypothesis-testing based event detec-
tion [2]. In it a piecewise segmented function was used to fit the time-dependent
data and a change-point was detected by finding the point such that the total
errors of local model fittings of segments to the data before and after that point
is minimized. However, it is basically computationally expensive to find such a
point since the local model fitting task is required as many times as the number
of points between the successive points every time a datum is input.
As for real-time change-point detection, Takeuchi and Yamanishi [9] (see also
[11]) have proposed ChangeFinder, in which the two-stage learning framework
has been employed. It has been reported in [9] that ChangeFinder outperforms
the hypothesis testing-based method both in detection accuracy and compu-
tational efficiency. In the two-stage learning framework the choice of scoring
function is crucial. In ChangeFinder the score is calculated as the predictive
code-length, which will be replaced with the SDNML code-length in this paper.
The technology of change-point detection has been applied to a variety of
application domains (failure detection, marketing, security, etc.). The security
has been recognized as one of most important application areas among them
since it has critical issues of how to detect cyber-threat caused by malicious
hackers. Although various classification-based pattern matching methods have
been applied to security issues [10],[12], to the best of our knowledge, there are few works that give a clear relation of security issues to change-point detection.
The rest of this paper is organized as follows: Section 2 introduces the notion of
the SDNML. Section 3 describes our proposed method. Section 4 gives empirical
evaluation of the proposed method for artificial data sets. Section 5 shows an
application of our method to security. Section 6 yields concluding remarks.

2 Sequentially Discounting Normalized Maximum


Likelihood Coding

This section introduces the SDNML coding. Suppose that we observe a discrete-time series, which we denote as {x_t : t = 1, 2, · · · }. We denote x^t = x_1 · · · x_t. Consider the parametric class F = {p(x_t|x^{t−1} : θ) : t = 1, 2, · · · } of conditional probability density functions, where θ denotes the k-dimensional parameter vector. For this class, letting θ̂(x · x^{t−1}) be the maximum likelihood estimate of θ from x · x^{t−1} = x_1 · · · x_{t−1} · x (i.e., θ̂ = arg max_θ {p(x|x^{t−1} : θ) \prod_{j=1}^{t−1} p(x_j|x^{j−1} : θ)}), we consider the following minimax problem:

\min_{q(x|x^{t-1})} \max_{x} \log \frac{p(x \cdot x^{t-1} \,|\, \hat{\theta}(x \cdot x^{t-1}))}{q(x|x^{t-1})}. \qquad (1)

This is known as the conditional minimax criterion [5], which is a conditional variant of Shtarkov’s minimax risk [7]. The solution to this yields the distribution having the shortest code-length relative to the model class F. It is known from [4] that the minimum in (1) is achieved by the sequentially normalized maximum likelihood (SNML) density function defined as:

p_{\mathrm{SNML}}(x_t|x^{t-1}) \stackrel{\mathrm{def}}{=} \frac{p(x^t \,|\, \hat{\theta}(x^t))}{K_t(x^{t-1})}, \qquad K_t(x^{t-1}) \stackrel{\mathrm{def}}{=} \int p(x \cdot x^{t-1} \,|\, \hat{\theta}(x \cdot x^{t-1}))\, dx.

We call the quantity − log pSNML (xt |xt−1 ) the SNML code-length. It is known
from [5],[6] that the cumulative SNML code-length, which is the sum of SNML
code-length over the sequence, is optimal in the sense that it asymptotically
achieves the shortest code-length. According to Rissanen’s MDL principle [4],
the SNML leads to the best statistical model for explaining data.
We employ here the AR model as a probabilistic model and introduce SD-
NML(sequentially discounting normalized maximum likelihood) coding for this
model by extending SNML so that the effect of past data can gradually be dis-
counted as time goes on. The function of ”discounting” is important in real
situations where the data source is non-stationary and the coding should be
adaptive to it.
Let X ⊂ R be 1-dimensional and let x_t ∈ X for each t. We define the kth-order auto-regression (AR) model as follows:

p(x_t \,|\, x_{t-k}^{t-1} : \theta) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left(-\frac{1}{2\sigma^2}(x_t - w)^2\right), \qquad (2)

where w = \sum_{i=1}^{k} A^{(i)} x_{t-i} and θ = (A^{(1)}, · · · , A^{(k)}, σ^2).
Let r (0 < r < 1) be the discounting coefficient. Let m be the least sample size such that Eq. (3) is uniquely solved. Let Â_t = (Â_t^{(1)}, · · · , Â_t^{(k)})^T be the discounting maximum likelihood estimate of the parameter A_t = (A_t^{(1)}, · · · , A_t^{(k)})^T from x^t, i.e.,

\hat{A}_t = \arg\min_{A} \sum_{j=m+1}^{t} r(1-r)^{t-j} (x_j - A^T \bar{x}_j)^2, \qquad (3)

where x̄_j = (x_{j−1}, x_{j−2}, . . . , x_{j−k})^T. Here the discounting maximum likelihood estimate can be thought of as a modified variant of the maximum likelihood estimate in which the weighted likelihood is maximised, the weight of the jth past datum being r(1 − r)^{t−j}. Hence the larger the discounting coefficient r is, the exponentially smaller the effect of past data becomes.
We further let ê_t \stackrel{\mathrm{def}}{=} x_t − Â_t^T x̄_t. Then let us define the discounting maximum likelihood estimate of the variance from x^t by

\hat{\tau}_t \stackrel{\mathrm{def}}{=} \arg\max_{\sigma^2} \prod_{j=m+1}^{t} p(x_j \,|\, x_{j-k}^{j-1} : \hat{A}_t, \sigma^2) = \frac{1}{t-m} \sum_{j=m+1}^{t} \hat{e}_j^2.

Below we give a method for the sequential computation of Â_t and τ̂_t so that they can be computed every time a datum x_t is input. Let X_t \stackrel{\mathrm{def}}{=} (x̄_{k+1}, x̄_{k+2}, . . . , x̄_t). Let us recursively define Ṽ_t and M_t as follows:

\tilde{V}_t^{-1} \stackrel{\mathrm{def}}{=} (1-r)\tilde{V}_{t-1}^{-1} + r\bar{x}_t\bar{x}_t^T, \qquad M_t \stackrel{\mathrm{def}}{=} (1-r)M_{t-1} + r\bar{x}_t x_t.

Then we obtain the following iterative relations for the parameter estimation:

\tilde{V}_t = \frac{1}{1-r}\tilde{V}_{t-1} - \frac{r}{1-r}\,\frac{\tilde{V}_{t-1}\bar{x}_t\bar{x}_t^T\tilde{V}_{t-1}}{1-r+\tilde{c}_t}, \qquad \hat{A}_t = \tilde{V}_t M_t,
\hat{e}_t = x_t - \hat{A}_t^T\bar{x}_t, \qquad \tilde{c}_t = r\bar{x}_t^T\tilde{V}_{t-1}\bar{x}_t, \qquad \tilde{d}_t = \frac{\tilde{c}_t}{1-r+\tilde{c}_t}. \qquad (4)
Setting r = 1/(t − m) yields the iteration developed by Rissanen et al. [5] and Roos et al. [6]. We employ (4) for parameter estimation. Define s_t by

s_t \stackrel{\mathrm{def}}{=} \sum_{j=m+1}^{t} \hat{e}_j^2 = (t-m)\hat{\tau}_t. \qquad (5)
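As an illustration, one step of the recursion in Eqs. (4)-(5) might be implemented as in the Python sketch below (our own rendering, not the authors' code); V, M and s carry the running statistics and xbar is the lag vector (x_{t−1}, . . . , x_{t−k})^T.

import numpy as np

def sdnml_update(V, M, s, xbar, x_t, r):
    c = r * xbar @ V @ xbar                       # c~_t
    V_new = (V - (r / (1.0 - r + c)) * np.outer(V @ xbar, xbar @ V)) / (1.0 - r)
    M_new = (1.0 - r) * M + r * xbar * x_t
    A_hat = V_new @ M_new                         # A^_t
    e = x_t - A_hat @ xbar                        # e^_t
    d = c / (1.0 - r + c)                         # d~_t
    s_new = s + e ** 2                            # accumulates Eq. (5)
    return V_new, M_new, A_hat, e, d, s_new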

We define the SDNML density function by normalizing the discounting maximum likelihood, which is given by

p_{\mathrm{SDNML}}(x_t|x^{t-1}) = K_t^{-1}(x^{t-1})\, \frac{s_t^{-(t-m)/2}}{s_{t-1}^{-(t-m-1)/2}}, \qquad (6)

where the normalizing factor K_t(x^{t-1}) is calculated as follows:

K_t(x^{t-1}) = \sqrt{\frac{\pi}{1-\tilde{d}_t}}\; \frac{\Gamma((t-m-1)/2)}{\Gamma((t-m)/2)}. \qquad (7)

The SDNML code-length for x_t is calculated as follows:

-\log p_{\mathrm{SDNML}}(x_t|x^{t-1}) = \log\!\left(\sqrt{\frac{\pi}{1-\tilde{d}_t}}\; \frac{\Gamma((t-m-1)/2)}{\Gamma((t-m)/2)}\right) + \frac{1}{2}\log\big((t-m-1)\hat{\tau}_{t-1}\big) + \frac{t-m}{2}\log\frac{(t-m)\hat{\tau}_t}{(t-m-1)\hat{\tau}_{t-1}}. \qquad (8)
We may employ the SDNML code-length (8) as the scoring function in the
context of change-point detection.
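For concreteness, the quantity in Eq. (8) can be computed directly from s_t, s_{t−1} and d~_t using the identity s_t = (t − m)τ̂_t; the short sketch below is our illustration (natural logarithms, gammaln for numerical stability), not the authors' implementation.

import numpy as np
from scipy.special import gammaln

def sdnml_code_length(s_t, s_prev, d_t, t, m):
    n = t - m
    log_K = (0.5 * np.log(np.pi / (1.0 - d_t))            # log K_t of Eq. (7)
             + gammaln((n - 1) / 2.0) - gammaln(n / 2.0))
    # Eq. (8), rewritten in terms of s_t = (t-m)*tau_t and s_{t-1}.
    return log_K + 0.5 * n * np.log(s_t) - 0.5 * (n - 1) * np.log(s_prev)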

3 Proposed Method
The main features of our proposed method are summarized as follows:
1) Two-stage learning framework with SDNML code-length: We basically employ
the two-stage learning framework proposed in [9] to realize real-time change-
point detection. The key idea is that probabilistic models are learned at the
two stages; in the first stage a probabilistic model is learned from the original
time series and a score is given for each time on the basis of the model, and in
the second stage another probabilistic model is learned from a score sequence
obtained by smoothing scores calculated at the first stage and a change-point
score is calculated on the basis of the learned model. We use the SDNML
code-length for the scoring in each stage.
2) Efficiently computing the estimates of parameters: Although the Yule-Walker equation must be solved for parameter estimation in ChangeFinder [9], we use an iterative relation to estimate the parameters more efficiently than ChangeFinder does.
Below we give details of our version of the two-stage learning framework.

Two-stage Learning Based Change-point Detection:


We observe a discrete-time series, which we denote as, {xt : t = 1, 2, · · · }. The
following steps are executed every time xt is input.
Step 1 (First Learning Stage). We employ the AR model as in (2) to learn
from the time series {xt } a series of the SDNML density functions, which
we denote as {pSDNML (xt+1 |xt ) : t = 1, 2, · · · }. This is computed according
to the iterative learning procedure (4),(5),(6),(7).
Step 2 (First Scoring Stage). A score for xt is calculated in terms of the
SDNML code-length of xt relative to pSDNML (·|xt−1 ), according to (8):

Score(xt ) = − log pSDNML (xt |xt−1 ). (9)

This score measures how much x_t deviates relative to p_SDNML(·|x^{t−1}).


Step 3 (Smoothing). We construct another time series {yt } on the basis of
the score sequence as follows: For a fixed size T (> 0),

y_t = \frac{1}{T} \sum_{i=t-T+1}^{t} \mathrm{Score}(x_i). \qquad (10)

This implies that {yt : t = 1, 2, · · · } is obtained by smoothing: i.e., taking an


average of scores over a window of fixed size T and then sliding the window
over time. The role of smoothing is to reduce influence from isolated outliers.
Our goal is to detect bursts of outliers rather than isolated ones.
Step 4 (Second Learning Stage). We sequentially learn SDNML density functions associated with the AR model from the score sequence {y_t}, which we denote as {q_SDNML(y_{t+1}|y^t) : t = 1, 2, · · · }. This is also computed according to the iterative relations (4),(5),(6),(7).

Step 5 (Second Scoring Stage). We calculate the SDNML code-length for


y_t according to (8) and smooth the new score values over a window of fixed size T to obtain the following change-point score:

\mathrm{Score}(t) = \frac{1}{T} \sum_{i=t-T+1}^{t} \Big(-\log q_{\mathrm{SDNML}}(y_i|y^{i-1})\Big). \qquad (11)

This indicates how drastically the nature of the original sequence xt has
changed at time point t.
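The whole procedure can be summarised by the condensed Python sketch below (ours, not the authors' code); sdnml_stream is a hypothetical online scorer wrapping the update and code-length sketches of Section 2 for a single AR model.

import numpy as np

def trailing_mean(v, T):
    # Moving average over the window {t-T+1, ..., t}, as in Eqs. (10) and (11).
    return np.array([v[max(0, t - T + 1): t + 1].mean() for t in range(len(v))])

def change_point_scores(x, k=4, r=0.01, T=3):
    first = sdnml_stream(order=k, discount=r)           # Step 1: first learning stage
    raw = np.array([first.score(x_t) for x_t in x])     # Step 2: Eq. (9)
    y = trailing_mean(raw, T)                           # Step 3: Eq. (10)
    second = sdnml_stream(order=k, discount=r)          # Step 4: second learning stage
    cp = np.array([second.score(y_t) for y_t in y])     # Step 5: code-length of y_t
    return trailing_mean(cp, T)                         # Step 5: Eq. (11)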

In [9],[11], the score is calculated as the negative logarithm of the plug-in density
defined as: − log p(xt |θ̂(t−1) ), where θ̂(t−1) is the estimate of θ obtained by using
the discounting learning algorithm from xt−1 . In our method, it is replaced by
the SDNML code-length.
In updating each parameter estimate, there is only one iteration every time a datum is input. The computation time of our method is O(k^2 n), while that of the original two-stage learning-based method, ChangeFinder, is O(k^3 n).

4 Empirical Evaluation for Artificial Datasets

4.1 Methods to Be Compared

We evaluated the proposed method in comparison with existing algorithms:


ChangeFinder in [9], and the hypothesis testing method.
Guralnik and Srivastava [2] proposed an algorithm for hypothesis testing-
based change-point detection, which we denote as HT. In it the square loss
function was used as an error measure. It was extended in [9] to the one using
the predictive code-length as an error measure. Below we briefly sketch the basic
idea of HT using the predictive code-length. Let x_u^t = x_u · · · x_t. Let us define the cumulative predictive code-length, which we denote as

I(x_u^t) \stackrel{\mathrm{def}}{=} \sum_{i=u}^{t} -\log p(x_i \,|\, x_{i-k}^{i-1} : \theta(x_u^{i-1})),

where θ(x_u^{i−1}) is the maximum likelihood estimator of θ from x_u^{i−1}. In HT, if there exists a change point v (u < v < t) in x_u^t, the total amount of predictive
code-length will be reduced by fitting different statistical models before and after
the change point, that is, I(xvu ) + I(xtv+1 ) is significantly smaller than I(xtu ).
On the basis of the principle as above, we can detect change points in an
incremental manner. Let ta be the last data point detected as a change point
and tb be the latest data point. Every time a datum is input, we examine whether
the following inequality holds or not:

\frac{1}{t_b - t_a + 1}\left(I(x_{t_a}^{t_b}) - \min_{i:\, t_a < i < t_b}\big(I(x_{t_a}^{i}) + I(x_{i+1}^{t_b})\big)\right) > \delta,

where δ is a predetermined threshold. If it holds, we recognize the time point


giving the minimum on the left-hand side as a change point. Once a change point is detected, which we denote as t_cp, the detection process restarts with t_cp as the most recently detected change point. This procedure continues recursively. The computation time is O(n^2), where n is the data size.
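A schematic rendering of this incremental test is given below (our sketch; pred_code_length(x, u, t) is assumed to return the cumulative predictive code-length I(x_u^t)).

def ht_detect(x, t_a, t_b, delta, pred_code_length):
    # Test whether some split point between t_a and t_b reduces the total
    # predictive code-length by more than delta per sample.
    whole = pred_code_length(x, t_a, t_b)
    best_i, best_split = None, float("inf")
    for i in range(t_a + 1, t_b):
        split = pred_code_length(x, t_a, i) + pred_code_length(x, i + 1, t_b)
        if split < best_split:
            best_i, best_split = i, split
    if (whole - best_split) / (t_b - t_a + 1) > delta:
        return best_i                 # recognised as a change point
    return None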

4.2 Discontinuous Change-Point Detection

In our experimental setting, we formally define two types of change-points: one is a discontinuous type of change point and the other is a continuous type. Letting p_B and p_A be the probability density functions for data generation before and after the change point t, respectively, we define the dissimilarity between p_B and p_A in terms of the Kullback-Leibler divergence:

\Delta(t) \stackrel{\mathrm{def}}{=} D(p_A \,\|\, p_B) = \lim_{n\to\infty} \frac{1}{n}\, E_{p_A}\!\left[\log \frac{p_A(X^n)}{p_B(X^n)}\right].

First we consider the case where change points are discontinuous in the sense
that the value of Δ(t) discontinuously changes at a change point t.
As for the data-generation model we employed the following AR model:

xt = A1 xt−1 + A2 xt−2 + ε,

where A1 = 0.6, A2 = −0.5, and ε ∼ N (μ, 1). We generated 1,000 records and
set change-points so that the jump of the mean value μ occurred at x × 100 (x =
1, 2, · · · ). Let the amount of jump of mean at the kth change-point be Dk . We
set Dk = 10 − k. The dissimilarity at the ith change point is given by

\Delta_i = \frac{D_i^2\,(1 - A_1 - A_2)^2}{2\sigma^2} = 0.405\, D_i^2.
The amount of jump in this model is discrete. Hence we call the change points in
this case discontinuous change points. Figure 1(a) shows the data set including
discontinuous change points. Note that the change points tend to become more difficult to identify as i increases, since they are more affected by noise. Hence it is non-trivial to detect all of them even though they are discontinuous.
In the applications of our method and ChangeFinder, we set k = 4 (the
degree of AR model), r = 0.01 (discounting parameter), and T = 3 (smoothing
parameter). Through the paper all of the parameters are systematically chosen
so that they are best fit for a fixed percentage (say, 5%) of training data.
Below we give a measure of performance for change-point detection. By setting a threshold on the scores, an alarm is raised whenever the score value exceeds the threshold. Letting t* be the true change-point, we define the benefit of a time point t at which an alarm is raised as follows:

\mathrm{benefit}(t) \stackrel{\mathrm{def}}{=} \begin{cases} 1 - (t - t^*)/20 & 0 \le t - t^* \le 20 \\ 0 & \text{otherwise} \end{cases}
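In code, this is simply (our sketch, with the 20-step horizon of the definition above):

def benefit(t, t_star, horizon=20):
    delay = t - t_star
    return 1.0 - delay / horizon if 0 <= delay <= horizon else 0.0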

(a) Data: Discontinuous Change Points (b) Benefit-FDR Curves

Fig. 1. Detecting Discontinuous Change Points


Fig. 2. Comparison of computation times of SDNML and CF

The benefit measures how early the true change-point is detected. It takes the
maximum value 1 when the true change-point is detected at that point, and is
zero when |t − t∗ | exceeds 20. The false discovery rate (FDR) is the ratio of the
number of false positive alarms over the number of total alarms. Considering the
trade-off between benefit and FDR, we used the benefit-FDR curve as proposed
in [1] for the performance comparison. It is a concept similar to the ROC curve.
Figure 1(b) shows the results of the benefit-FDR curves for our method and
existing methods. The horizontal axis shows FDR while the vertical axis shows
the average benefit where the average was taken over all of the change points.
SDNML is our method, CF is ChangeFinder, the conventional two-stage learning
based method. HT is the hypothesis-testing based method in which the fourth-
degree AR model is used for model fitting and the score is measured in terms of
the logarithmic loss. We observe from Figure 1(b) that SDNML performs better
than HT and CF. The AUC (Area Under Curve) for SDNML was about 12 %
larger than that for CF.
Figure 2 shows the computation time (sec) of SDNML in comparison with CF
for this data set. We see that SDNML is significantly more efficient than CF.


(a) Data: Continuous Change Points (b) Benefit-FDR Curves

Fig. 3. Detecting Continuous Change Points

4.3 Continuous Change-Point Detection

Next we consider the case where change points are continuous in the sense that
the value of dissimilarity Δ continuously changes at each of the points. We
consider the following data generation model xt = v(t) + ε where ε ∼ N (0, 1)
and v(t) = 0 for 0 ≤ t ≤ 100, and v(t) = c(t − 100)(t − 99)/2 for t > 100.
Letting the dissimilarity at time t be Δ(t), it is calculated as follows: for a given c > 0, Δ(t) = 0 for 0 ≤ t ≤ 100, and Δ(t) = c^2(t − 100)^2/2 for t > 100.
This shows that the dissimilarity of change points is continuous with respect
to t. We call such change points continuous change points. They are more difficult to identify than discontinuous ones.
We generated 200 records 6 times according to the above model.
Figure 3(a) shows an example of such data sets. We evaluated the detection
accuracies for CF, HT, and SDNML for this data set. Parameter values for all
of the methods are systematically chosen as with the discontinuous case. Figure
3(b) shows the results of the benefit-FDR curves for CF and SDNML where
the average-benefit was computed as the average of the benefits taken over the
6-times randomized data generation. Note that HT was much worse than CF,
and was therefore omitted from Figure 3(b). We observe from Figure 3(b) that SD-
NML performs significantly better than CF. The AUC for SDNML is about 46
% larger than that for CF.
Through the empirical evaluation using artificial data sets including contin-
uous and discontinuous change-points, we observe that our method performs
significantly better than the existing methods both in detection accuracy and
computational efficiency.
The superiority of SDNML over CF may be justified from the view of the
minimum description length (MDL) principle. Indeed, SDNML is designed as the
optimal strategy that sequentially attains the least code-length while CF using
the predictive code produces longer code-lengths than SDNML. It is theoretically
guaranteed by the theory of the MDL principle that the shorter the code-length for a data sequence, the better the model learned from the data. Hence a better strategy in the light of the MDL principle yields better statistical modeling, and eventually leads to a better strategy for change-point detection. This


insight was demonstrated experimentally.
The reason why SDNML and CF are significantly better than HT is that
SDNML and CF are more adaptive to non-stationary data sources than HT.
Indeed, SDNML and CF have the function of sequential discounting learning
while HT has no such function. This was also demonstrated experimentally.
It is interesting to see that the difference between SDNML and CF becomes
much larger when detecting continuous change points than discontinuous ones. This is due to the fact that statistical modeling is more critical in the cases where the change-points are more difficult to detect.

5 Applications to Malware Detection


We show an application of our method to malware detection. Malware is a generic
term indicating unwanted software (e.g., viruses, backdoors, spyware, trojans, worms, etc.). Most conventional methods against malware are signature-based
ones such as anti-virus software. Here signature is a short string characterizing
malware’s features. In the signature-based methods, pattern matching of any
given input data with signatures is conducted to detect malware. Hence unseen
malware or those whose signatures are difficult to describe may not be detected
by the signature-based methods. Furthermore it is desired that the symptom of
malware is detected earlier than its malicious action actually occurs. We expect
that change-point detection from access log data is one of promising technologies
for detecting such malware at early stages.
We are concerned with the issue of detecting backdoors, a typical kind of malware. In our experiment we used access logs, each of which consists of
a number of attributes, including time stamp, IP address, URL, server name,
kinds of action, etc. All of the data were collected at a server. URL means the
URL accessed by a user. We used only three attributes from among them; time
stamp, IP address, URL. We constructed two kinds of time series. One is a
time series of IP address counts, where a datum was generated every 1 minutes
and its value is the maximum number of identical IPs which occurred within
past 15 minutes. The other is a time series of URL counts, where a datum was
generated every 1 minutes and its value was the maximum number of identical
URLs which occurred within past 15 minutes.
In the data set there are a number of bursts of logs including the message
500ServerError, which are considered as actions related to backdoors. From
the view of security analysts, they can be thought of as a symptom of them.
Hence we are interested in how early our method is able to detect such bursts
without any knowledge of the message 500ServerError.
We applied our method to the two time series as above. The original data set
consisted of 5538 records, and the length of the time series obtained after the pre-
processing as above was 536. The original data set included three bursts of logs containing the message 500ServerError. Figure 4(a) shows graphs of
a time series of IP address counts, SDNML score curve, and CF score curve. Figure
4(b) shows graphs of URL counts data, SDNML score curve, and CF score curve.


(a) IP access counting data (b) URL counting data

Fig. 4. Malware detection results

Table 1. Malware Detection

                          Alert Time                       Alarms
ServerError Time   Begin  8:49:15    9:17:15    9:59:45       –
                   End    8:51:15    9:27:15    9:59:45       –
SDNML Alert Time   IP     8:48:45    9:10:15   10:15:45       7
                   URL    8:48:45    9:10:15   10:00:45      12
CF Alert Time      IP        –       9:10:15   10:04:15       4
                   URL       –       9:10:15   10:00:45       5

Table 1 summarizes the performance of SDNML and CF for the two time
series (IP counting data and URL counting data) in terms of alert time and
the total number of alarms. In the row of ServerError Time, for each burst of
messages 500ServerError, the starting time point and ending time point of the
burst are shown. In the table ”-” indicates the fact that the burst associated
with the message: 500ServerError was not detected.
We observe from Table 1 that our method was able to detect all of the bursts
associated with the message: 500ServerError, while CF overlooked some of
them. It was confirmed by security analysts that all of the detected bursts were
related to backdoor, and were considered as symptoms of backdoor. Further
there were no logs related to backdoor other than the bursts of the message:
500ServerError. It implies that our method was able to detect backdoor at
early stages when its symptoms appeared. This demonstrates the validity of our
method in the scenario of malware detection.

6 Conclusion
We have proposed a new method of real-time change point detection, in which we
employ the sequentially discounting normalized maximum likelihood (SDNML)
coding as a scoring function within the two-stage learning framework. The intu-
ition behind the design of this method is that SDNML coding, which sequentially
attains the shortest code-length, would improve the accuracy of change-point

detection. This is because according to the theory of the minimum description


length principle, the shorter code-length leads to the better statistical model-
ing. This paper has empirically demonstrated the validity of our method using
artificial data sets and real data sets. It has turned out that our method is able
to detect change-points with significantly higher accuracy and efficiency than
the existing real-time change-point detection method and the hypothesis-testing
based method. Specifically, through the application of our method to malware
detection, we have shown that real-time change-point detection is a promising
approach to the detection of symptoms of malware at early stages.

Acknowledgments
This research was supported by Microsoft Corporation (Microsoft Research
CORE Project) and NTT Corporation.

References
1. Fawcett, T., Provost, F.: Activity monitoring: noticing interesting changes in be-
havior. In: Proc. of ACM-SIGKDD Int’l Conf. Knowledge Discovery and Data
Mining, pp. 53–62 (1999)
2. Guralnik, V., Srivastava, J.: Event detection from time series data. In: Proc. ACM-
SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 33–42 (1999)
3. Hawkins, D.M.: Point estimation of parameters of piecewise regression models. J.
Royal Statistical Soc. Series C 25(1), 51–57 (1976)
4. Rissanen, J.: Information and Complexity in Statistical Modeling. Springer, Hei-
delberg (2007)
5. Rissanen, J., Roos, T., Myllymäki, P.: Model selection by sequentially normalized
least squares. J. Multivariate Analysis 101(4), 839–849 (2010)
6. Roos, T., Rissanen, J.: On sequentially normalized maximum likelihood models.
In: Proc. of 1st Workshop on Information Theoretic Methods in Science and En-
gineering, WITSME 2008 (2009)
7. Shtarkov, Y.M.: Universal sequential coding of single messages. Problems of Infor-
mation Transmission 23(3), 175–186 (1987)
8. Song, X., Wu, M., Jermaine, C., Ranka, S.: Statistical change detection for multi-
dimensional data. In: Proc. Fifteenth ACM-SIGKDD Int’l Conf. Knowledge Dis-
covery and Data Mining, pp. 667–675 (2009)
9. Takeuchi, J., Yamanishi, K.: A unifying framework for detecting outliers and
change-points from time series. IEEE Transactions on Knowledge and Data Engi-
neering 18(44), 482–492 (2006)
10. Wang, J., Deng, P., Fan, Y., Jaw, L., Liu, Y.: Virus detection using data mining
techniques. In: Proc. of ICDM 2003 (2003)
11. Yamanishi, K., Takeuchi, J.: A unifying approach to detecting outliers and change-
points from nonstationary data. In: Proc. of the Eighth ACM SIGKDD Int’l Conf.
Knowledge Discovery and Data Mining (2002)
12. Ye, Y., Li, T., Jiang, Q., Han, Z., Wan, L.: Intelligent file scoring system for malware
detection from the gray list. In: Proc. of the Fifteenth ACM SIGKDD Int’l Conf.
Knowledge Discovery and Data Mining (2009)
Compression for Anti-Adversarial Learning

Yan Zhou1 , Meador Inge2 , and Murat Kantarcioglu1


1
Erik Jonsson School of Engineering and Computer Science
University of Texas at Dallas
Richardson, TX 75080
{yan.zhou2,muratk}@utdallas.edu
2
Mentor Graphics Corporation
739 N University Blvd.
Mobile, AL 36608
meadori@gmail.com

Abstract. We investigate the susceptibility of compression-based learn-


ing algorithms to adversarial attacks. We demonstrate that compression-
based algorithms are surprisingly resilient to carefully plotted attacks
that can easily devastate standard learning algorithms. In the worst case
where we assume the adversary has a full knowledge of training data,
compression-based algorithms failed as expected. We tackle the worst
case with a proposal of a new technique that analyzes subsequences
strategically extracted from given data. We achieved near-zero perfor-
mance loss in the worst case in the domain of spam filtering.

Keywords: adversarial learning, data compression, subsequence


differentiation.

1 Introduction

There is an increasing use of machine learning algorithms for more efficient


and reliable performance in areas of computer security. Like any other defense
mechanisms, machine learning based security systems have to face increasingly
aggressive attacks from the adversary. The attacks are often carefully crafted
to target specific vulnerabilities in a learning system, for example, the data
set used to train the system, or the internal logic of the algorithm, or often
both. In this paper we focus on both causative attacks and exploratory attacks.
Causative attacks influence a learning system by altering its training set while
their exploratory counterparts do not alter training data but rather exploit mis-
classifications for new data. When it comes to consider security violations, we
focus on integrity violations where the adversary’s goal is injecting hostile input
into the system to cause more false negatives. In contrast, availability violations
refer to cases where the adversary aims at increasing false positives by preventing
“good” input into the system.¹
¹ For a complete taxonomy defining attacks against machine learning systems, the readers are referred to the Berkeley technical report [1].


Many security problems, such as intrusion detection, malware detection, and


spam filtering, involve learning on strings. Recent studies demonstrate that su-
perior classification performance can be achieved with modern statistical data
compression models instrumented for such learning tasks [2,3,4]. Unlike tra-
ditional machine learning methods, compression-based learning models can be
used directly on raw strings without error-prone preprocessing, such as tokeniza-
tion, stemming, and feature selection. The methods treat every string input as
a sequence of characters instead of a vector of terms (words). This effectively
eliminates the need for defining word boundaries. In addition, compression based
methods naturally take into consideration both alphabetic and non-alphabetic
characters, which prevents information loss as a result of preprocessing.
The robustness of compression-based learning algorithms was briefly discussed
recently [5]. To the best of our knowledge, there has been no full-scale investigation of the susceptibility of this type of learning algorithm to adversarial at-
tacks, and thus no counter-attack techniques have been developed to address
potential attacks against compressors trained in learning systems. In this paper,
we demonstrate that compression-based learning algorithms are surprisingly re-
silient to carefully plotted attacks that can easily devastate standard learning
algorithms. We further demonstrate that as we launch more aggressive attacks
as in the worst case where the adversary has a full knowledge of training data,
compression-based algorithms failed as expected. We tackle the worst case with
a proposal of a new technique that analyzes subsequences strategically extracted
from given data. We achieved near-zero performance loss in the worst case in
the domain of spam filtering.
The remainder of this paper is organized as follows. In Section 2, we briefly
review the current state-of-the-art data compression model that has been fre-
quently used in machine learning and data mining techniques. Section 3 demon-
strates that learning systems with modern compressors are resilient to attacks
that have a significant impact on standard learning algorithms. We show that
modern compressors are susceptible to attacks when the adversary alters data
with negative instances in training data. We propose a counter-attack learning
method that enhances the compression-based algorithm in Section 4. Section 5
presents the experimental results. Section 6 concludes the work and discusses
future directions.

2 Context-Based Data Compression Model—Prediction


by Partial Matching

Statistical data modeling plays an important role in arithmetic encoding [6]


which turns a string of symbols into a rational number between [0, 1). The
number of bits needed to encode a symbol depends on the probability of its
appearance in the current context. Finite-context modeling estimates the prob-
ability of an incoming symbol based on a finite number of symbols previously
encountered. The current state-of-the-art adaptive model is prediction by partial
matching (PPM) [7,8,9]. PPM is one of the best dynamic finite-context models

that provide a good estimate of the true entropy of data by using symbol-level dynamic Markov modeling. PPM predicts the symbol probability conditioned on its k immediately prior symbols, forming a kth-order Markov model. For example, the context c_i^k of the ith symbol x_i in a given sequence is {x_{i−k}, . . . , x_{i−1}}.
The total number of contexts of an order-k model is O(|Σ|k+1 ), where Σ is
the alphabet of input symbols. As the order of the model increases the number
of contexts increases exponentially. High-order models are more likely to cap-
ture longer-range correlations among adjacent symbols, if they exist; however,
an unnecessarily high order can result in context dilution leading to inaccurate
probability estimate. PPM solves the dilemma by using dynamic context match
between the current sequence and the ones that occurred previously. It uses
high-order predictions if they exist, otherwise “drops gracefully back to lower
order predictions” [10]. More specifically, the algorithm first looks for a match
of an order-k context. If such a match does not exist, it looks for a match of
an order k − 1 context, until it reaches order-0. Whenever a match is not found
in the current context, the model falls back to a lower-order context and the
total probability is adjusted by what is called an escape probability. The escape
probability models the probability that a symbol will be found in a lower-order

context. When an input symbol x_i is found in a context c_i^{k'} with k' ≤ k, the conditional probability of x_i given its kth-order context c_i^k is:

p(x_i \,|\, c_i^k) = \left(\prod_{j=k'+1}^{k} p(\mathrm{Esc} \,|\, c_i^j)\right) \cdot p(x_i \,|\, c_i^{k'})

where p(Esc|c_i^j) is the escape probability conditioned on context c_i^j. If the sym-
bol is not predicted by the order-0 model, a probability defined by a uniform
distribution is predicted.
PPMC [11] and PPMD [12] are two well known variants of the PPM algorithm.
Their difference lies in the estimate of the escape probabilities. In both PPMC
and PPMD, an escape event is counted every time a symbol occurs for the first
time in the current context. In PPMC, the escape count and the new symbol
count are each incremented by 1 while in PPMD both counts are incremented
by 1/2. Therefore, in PPMC, the total symbol count increases by 2 every time
a new symbol is encountered, while in PPMD the total count only increases by
1. When implemented on a binary computer, PPMD sets the escape probability to |d|/(2|t|), where |d| is the number of distinct symbols in the current context and
|t| is the total number of symbols in the current context.
Now, given an input X = x_1 x_2 . . . x_d of length d, where x_1, x_2, . . . , x_d is a sequence of symbols, its probability given a compression model M can be estimated as

p(X|M) = \prod_{i=1}^{d} p(x_i \,|\, x_{i-k}^{i-1}, M)

where x_i^j = x_i x_{i+1} x_{i+2} . . . x_j for i < j.
where xji = xi xi+1 xi+2 . . . xj for i < j.



3 Compression-Based Classification and Adversarial


Attacks

Consider binary classification problems: X → Y where Y ∈ {+, −}. Given a


set of training data, compression-based classification works as follows. First, the
algorithm builds two compression models, one from each class. Each compression
model maintains a context tree, together with context statistics, for training data
in one of the two classes. Then, to classify an unlabeled instance, it requires the
instance to run through both compression models. The model that provides a
better compression of the instance makes the prediction. A common measure for
classification based on compression is minimum cross-entropy [13,3]:
c = \arg\min_{c \in \{+,-\}} -\frac{1}{|X|} \sum_{i=1}^{|X|} \log p(x_i \,|\, x_{i-k}^{i-1}, M_c)

where |X| is the length of the instance, xi−k , . . . , xi is a subsequence in the in-
stance, k is the length of the context, and Mc is the compression model associated
with class c.
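The decision rule can be made concrete with a short sketch that compares average code lengths under the two models. The `cond_prob` method below is a hypothetical interface onto a trained PPM compressor (not the paper's implementation), and `k` is the assumed model order.

```python
# A sketch of the minimum cross-entropy decision rule; `model.cond_prob(context,
# symbol)` is a hypothetical stand-in for a trained PPM model.
import math

def cross_entropy(x, model, k):
    """Average negative log-probability of the symbol sequence x under `model`."""
    total = 0.0
    for i, symbol in enumerate(x):
        context = x[max(0, i - k):i]            # the (up to) k preceding symbols
        total += -math.log(model.cond_prob(context, symbol))
    return total / len(x)

def classify(x, model_pos, model_neg, k=6):
    """Predict the class whose compression model gives the lower cross-entropy."""
    return '+' if cross_entropy(x, model_pos, k) < cross_entropy(x, model_neg, k) else '-'
```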
When classifying an unlabeled instance, a common practice is to compress it
with both compression models and check to see which one compresses it more
efficiently. However, PPM is an incremental algorithm, which means once an
unlabeled instance is compressed, the model that compresses it will be updated
as well. This requires the changes made to both models be reverted every time
after an instance is classified. Although, the PPM algorithm has a linear time
complexity, the constants in its complexity are by no means trivial. It is desirable
to eliminate the redundancy of updating and then reverting changes made to
the models. We propose an approximation algorithm (See Algorithm 1) that
we found works quite well in practice. Given context C = xi−k . . . xi−1 in an
unlabeled instance, if any suffix c of C has not occurred in the context trees of the compression models, we set p(Esc|c) = 1/|A|, thus the probability of x_i is discounted by 1/|A|, where |A| ≫ |Σ|, the size of the alphabet. More aggressive discount factors set the prediction further away from the decision boundary.
Empirical results will be discussed in Section 5.
Compression-based algorithms have demonstrated superior classification per-
formance in learning tasks where the input consists of strings [3,4]. However, it
is not clear whether this type of learning algorithm is susceptible to adversarial
attacks. We investigate several ways to attack the compression-based classifier
on real data in the domain of e-mail spam filtering. We choose this domain in
our study for the following reasons: 1.) previous work has demonstrated great
success of compression-based algorithms in this domain [3]; 2.) it is conceptu-
ally simple to design various adversarial attacks and establish a ground truth;
3.) there have been studies of several variations of adversarial attacks against
standard learning algorithms in this domain [14,5].
Good word attacks are designed specifically to target the integrity of statisti-
cal spam filters. Good words are those that appear frequently in normal e-mail
Algorithm 1. Symbol Probability Estimate


Input: x_{i−k} . . . x_{i−1}, x_i, M_c
Output: p(x_i | x_{i−k}^{i−1})

p = 1.0;
foreach s = suffix(x_{i−k} . . . x_{i−1}) do
    if s ∉ M_c then
        p = p · 1/|A|;
    else
        p = p · p(x_i | s, M_c);
    end
end
return p;
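A direct transliteration of Algorithm 1 into Python may make the approximation easier to follow. The `contains` and `cond_prob` methods are hypothetical interfaces onto the context tree of a trained model M_c, and `alphabet_size` plays the role of |A|; none of these names come from the authors' code.

```python
# A sketch of Algorithm 1: approximate p(symbol | context) without updating the model.
def symbol_prob_estimate(context, symbol, model, alphabet_size):
    p = 1.0
    for start in range(len(context)):            # each suffix of the context, longest first
        s = context[start:]
        if not model.contains(s):
            p *= 1.0 / alphabet_size             # unseen suffix: escape probability 1/|A|
        else:
            p *= model.cond_prob(symbol, s)
    return p
```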

but rarely in spam e-mail. Existing studies show that good word attacks are
very effective against the standard learning algorithms that are considered the
state of the art in text classification [14]. The attacks against multinomial naïve
Bayes and support vector machines with 500 good words caused more than 45%
decrease in recall while the precision was fixed at 90% in a previous study [5].
We repeated the test on the 2006 TREC Public Spam Corpus [15] using our
implementation of PPMD compressors. It turns out that the PPMD classifier is
surprisingly robust against good word attacks. With 500 good words added to
50% of the spam in e-mail, we observed no significant change in both precision
and recall (see Figure 1)². This remains true even when 100% of spam is ap-
pended with 500 highly ranked good words. Its surprising resilience to good word
attacks led us to more aggressive attacks against the PPMD classifier. We ran-
domly chose a legitimate e-mail message from the training corpus and appended
it to a spam e-mail during testing. 50% of the spam was altered this way. This
time we were able to bring the average recall value down to 57.9%. However, the
precision value remains above 96%. Figure 1 shows the accuracy, precision and
recall values when there are no attacks, 500-goodword attacks, and in the worst
case, attacks with legitimate training e-mail. Details on experimental setup will
be given in Section 5.

4 Robust Classification via Subsequence Differentiation


In this section, we present a robust anti-adversarial learning algorithm that is
resilient to attacks with no assumptions on the adversary’s knowledge of the
system. We do, however, assume that the adversary cannot alter the negative
instances, such as legitimate e-mail messages, normal user sessions, and benign
software, in training data. This is a reasonable assumption in practice since in
many applications such as intrusion detection, malware detection, and spam
² It has been reported previously that PPM compressors appeared to be vulnerable
to good word attacks [5]. The results were produced by using the PPMD implemen-
tation in the PSMSLib C++ library [16]. Since we do not have access to the source
code, we cannot investigate the cause of the difference and make further conclusions.
Fig. 1. The accuracy/precision/recall values of the PPMD classifier with no attacks, 500-goodword attacks, and attacks where original training data is used (PPMD−No−Attack, PPMD−500GW−Attack, PPMD−WorstCase−Attack)

filtering, the adversary’s attempts are mostly focused on altering positive in-
stances to make them less distinguishable among ordinary data in that domain.
Now that we know the adversary can alter the “bad” data to make it appear
to be good, our goal is to find a way to separate the target from its innocent-looking disguise. Suppose we have two compression models M+ and M−. Intuitively,
a positive instance would compress better with the positive model M+ than it
would with M− , the negative model. When a positive instance is altered with fea-
tures that would ordinarily appear in negative data, we expect the subsequences
in the altered data that are truly positive to retain relatively higher compression
rates when compressed against the positive model. We apply a sliding window
approach to scan through each instance and extract subsequences in the sliding
window that require a smaller number of bits when compressed against M+ than
M− . Ideally, more subsequences would be identified in a positive instance than
in a negative instance. In practice, there are surely exceptional cases where the
number of subsequences in a negative instance would exceed its normal average.
For this reason, we decide to compute the difference between the total number of
bits required to compress the extracted subsequences S using M− and M+ , re-
spectively. If an instance is truly positive, we expect Bits_{M−}(S) ≫ Bits_{M+}(S), where Bits_{M−}(S) is the number of bits needed to compress S using the negative compression model, and Bits_{M+}(S) is the number of bits needed using the positive model. Now for a positive instance, not only do we expect a longer list of subsequences to be extracted, but also a greater discrepancy between the bits after they are
compressed using the two different models.
For the adversary, any attempt to attack this counter-attack strategy will
always boil down to finding a set of “misleading” features and seamlessly blend
them into the target (positive instance). To break down the first step of our
counter-attack strategy, that is, extracting subsequences that compress better
against the positive compression model, the adversary would need to select a
set of good words {wi |BitsM+ (wi ) < BitsM− (wi )} so that the good words can
pass, together with the “bad” ones, our first round screening. To attack the
second step, the adversary must find good words that compress better against
the negative compression model, that is, {wi |BitsM+ (wi ) > BitsM− (wi )}, to
offset the impact of the “bad” words in the extracted subsequences. These two
goals inevitably contradict each other, thus making strategically attacking the
system much more difficult.
We now formally present our algorithm. Given a set of training data T , where
T = T+ ∪ T− , we first build two compression models M+ and M− from T+ and
T− , respectively. For each training instance t in T , we scan t using a sliding
window W of size n, and extract the subsequence si in the current sliding window
W if Bits_{M+}(s_i) < Bits_{M−}(s_i). This completes the first step of our algorithm—subsequence extraction. Next, for each instance t in the training set, we compute d_t = Σ_{s_i} (Bits_{M−}(s_i) − Bits_{M+}(s_i)), where s_i is a subsequence in t that has been extracted in the first step. We then compute the classification threshold by maximizing the information gain:

$$r = \operatorname*{argmax}_{r \in \{d_1, \ldots, d_{|T|}\}} \mathrm{InfoGain}(T).$$

For a more accurate estimate, r should be computed using k-fold cross validation.
To classify an unlabeled instance u, we first extract the set of subsequences S
from u in the same manner, then compute du = BitsM− (S) − BitsM+ (S). If
du ≤ r, u is classified as a negative instance, otherwise, u is positive. Detailed
description of the algorithm is given in Algorithm 2.
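The sliding-window extraction and the bit-difference score d_t can be sketched as follows. The `bits_pos`/`bits_neg` callables are assumed to return the number of bits needed to compress a string with M+ and M− respectively, and the threshold `r` is assumed to have been fixed by the cross-validation described above; all names are illustrative, not the authors' code.

```python
# A sketch of subsequence extraction and classification by bit difference.
def extract_subsequences(t, bits_pos, bits_neg, n):
    """Slide a window of size n over t; keep windows that compress better under M+."""
    subseqs, i = [], 0
    while i + n <= len(t):
        w = t[i:i + n]
        if bits_pos(w) < bits_neg(w):
            subseqs.append(w)
            i += n          # shift the window n symbols, as in Algorithm 2
        else:
            i += 1          # otherwise shift by a single symbol
    return subseqs

def classify(t, bits_pos, bits_neg, n, r):
    s = extract_subsequences(t, bits_pos, bits_neg, n)
    d = sum(bits_neg(w) - bits_pos(w) for w in s)
    return 'positive' if d > r else 'negative'
```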

5 Experimental Results
We evaluate our counter-attack algorithm on e-mail data from the 2006 TREC
Public Spam Corpus [15]. The data consists of 36,674 spam and legitimate email
messages, sorted chronologically by receiving date and evenly divided into 11
subsets D1 , · · · , D11 . Experiments were run in an on-line fashion by training on
the ith subset and testing on the subset i + 1. The percentage of spam messages
in each subset varies from approximately 60% to a little bit over 70%. The
good word list consists of the top 1,000 unique words from the entire corpus
ranked according to the frequency ratio. In order to allow a true apples-to-
apples comparison among compression-based algorithms and standard learning
algorithms, we preprocessed the entire corpus by removing HTML and non-
textual parts. We also applied stemming and stop-list to all terms. The to,
from, cc, subject, and received headers were retained, while the rest of the
headers were removed. Messages that had an empty body after preprocessing
were discarded. In all of our experiments, we used 6th -order context and a sliding
window of 5 where applicable.
We first test our counter-attack algorithm under the circumstances where
there are no attacks and there are only good word attacks. In the case of good
words attacks, the adversary has a general knowledge about word distributions in
the entire corpus, but lacks a full knowledge about the training data. As discussed
in Section 3, the PPMD-based classifier demonstrated superior performance in
these two cases. We need to ensure that our anti-adversarial classifier would
perform the same. Figure 2 shows the comparison between the PPMD-based
classifier and our anti-adversarial classifier when there are no attacks and 500-
goodword attacks on 50% of spam in each test set.
Algorithm 2. Anti-AD Learn


Input: T = T+ ∪ T− — a set of training data
Output: c(u) — the class label of the unlabeled instance u

// Training
Build M+ from T+ and M− from T−;
// K-fold cross validation for finding threshold r
Partition T into {T1, . . . , TK};
i = 0;
while i < K do
    Build M+^i and M−^i from {T1, · · · , T_{i−1}, T_{i+1}, · · · , TK};
    S = ∅;  /* sequence vector */
    D = ∅;  /* bit difference vector */
    foreach t ∈ Ti do
        initialize W to cover the first n symbols s_j in t;
        while W ≠ ∅ do
            if Bits_{M+^i}(s_j) < Bits_{M−^i}(s_j) then
                S = S ∪ {s_j};
                Shift W n symbols down the input;
            else
                Shift W 1 symbol down the input;
            end
        end
        // bit difference of instance t
        D = D ∪ { Σ_{s_j ∈ S} (Bits_{M−^i}(s_j) − Bits_{M+^i}(s_j)) };
    end
    i++;
end
r = argmax_{r ∈ D} InfoGain(T);  /* threshold r */
// Classifying
Extract S from u such that Bits_{M+}(S) < Bits_{M−}(S);
if Bits_{M−}(S) − Bits_{M+}(S) ≤ r then
    return Negative;
else
    return Positive;
end

As can be observed, the performance is comparable between the two algorithms, with and without good word attacks. For each algorithm, both precision
and recall values remain nearly unchanged before and after attacks. Note that
for PPMD-based classifiers, we tried both incremental compression and bit ap-
proximation. The results shown were obtained using incremental compression.
With bit approximation, we achieved a significant run-time speedup, however at the cost of a 5% performance drop, from 94.2% to 89.2%, in terms of recall values. Due to
its run-time efficiency, we used bit approximation in all experiments involving
our anti-adversarial classifiers. We used 10-fold cross validation to determine
the threshold value r by maximizing the information gain. In practice, overfit-
ting may occur and several k (number of folds) values should be tried in cross
Fig. 2. The accuracy, precision and recall values of the PPMD classifier and the anti-adversarial classifier (AADL) with no attacks and 500-goodword attacks

Fig. 3. The accuracy, precision and recall values of the anti-adversarial classifier (AADL) with no attacks, 500 good word attacks, 1 legit mail attacks, and 5 legit mail attacks

validation. We also experimented with the number of good words varying from 50 to 500, and the percentage of altered spam varying from 50% to 100% in each test set; the results remained similar.
We conducted more aggressive attacks with exact copies of legitimate mes-
sages randomly chosen from the training set. We tried two attacks of increasing
strength by attaching one legitimate message and five legitimate messages, re-
spectively, to spam in the test set. In total 50% of the spam in the test set
was altered in each attack. Figure 3 illustrates the results. As can be observed,
our anti-adversarial classifier is very robust to any of the attacks, while the
PPMD-based classifier, for which the results are shown in Figure 4, is obviously
vulnerable to the more aggressive attacks. Furthermore, similar results were ob-
tained when: 1.) the percentage of altered spam increased to 100%; 2.) legitimate
messages used to alter spam were randomly selected from the test set, and 3.) in-
jected legitimate messages were randomly selected from data sets that are neither
training nor test sets.
To make the matter more complicated, we also tested our algorithm when
legitimate messages were randomly scattered into spam. This was done in two
different fashions. In the first case, we first randomly picked a position in spam;
then we took a random length (no greater than 10% of the total length) of a
legitimate message and inserted it to the selected position in spam. The two steps
Fig. 4. The accuracy, precision and recall values of the PPMD-based classifier with no attacks, 500 good word attacks, 1 legit mail attacks, and 5 legit mail attacks
Fig. 5. The accuracy, precision and recall values of the PPMD-based classifier and the anti-adversarial classifier (AADL) with 500-goodword attacks and 1 legit mail attacks to 50% spam in both training and test data

were repeated until the entire legitimate message has been inserted. Empirical
results show that random insertion does not appear to affect the classification
performance. In the second case, we inserted terms from a legitimate message
in a random order after every ℓ terms in spam. The process is as follows: 1.) tokenize the legitimate message and randomly shuffle the tokens; 2.) insert a random number of tokens (less than 10% of the total number of tokens) after every ℓ terms in spam. Repeat 1) and 2) until all tokens are inserted into the spam. We observed little performance change when ℓ ≥ 3. When ℓ ≤ 2, nearly all local context is completely lost in the altered spam, and the average recall values dropped to below 70%. Note that in the latter case (ℓ ≤ 2), the attack is most likely
useless to the adversary in practice since the scrambled instance would also fail
to deliver the malicious attacks the adversary has set out to accomplish.
Previous studies [14,17,5] show that retraining on altered data as a result
of adversarial attacks may improve the performance of classifiers against the
attacks. This observation is further verified in our experiments, in which we ran-
domly select 50% of spam in the training set and appended good words and legit-
imate e-mail to it, separately. Figure 5 shows the results of the PPMD classifier
and the anti-adversarial classifier with 500-goodword attacks and 1-legitimate-
mail attacks in both training and test data. As can be observed, retraining
improved, to an extent, the classification results in all cases.

6 Concluding Remarks
We demonstrate that compression-based classifiers are much more resilient to
good word attacks compared to standard learning algorithms. On the other
hand, this type of classifier is vulnerable to attacks when the adversary has a
full knowledge of the training set. We propose a counter-attack technique that
extracts and analyzes subsequences that are more informative in a given instance.
We demonstrate that the proposed technique is robust against any attacks, even
in the worst case where the adversary can alter positive instances with exact
copies of negative instances taken directly from the training set.
A fundamental theory needs to be developed to explain the strength of the
compression-based algorithm and the anti-adversarial learning algorithm. It re-
mains less clear, in theory, why the compression-based algorithms are remarkably
resilient to strategically designed attacks that would normally defeat classifiers
trained using standard learning algorithms. It is certainly of great interest to us
to find out how well the proposed counter-attack strategy performs in other
domains, and under what circumstances this seemingly bullet-proof algorithm
would break down.

Acknowledgement
The authors would like to thank Zach Jorgensen for his valuable input. This work
was partially supported by Air Force Office of Scientific Research MURI Grant
FA9550-08-1-0265, National Institutes of Health Grant 1R01LM009989, National
Science Foundation (NSF) Grant Career-0845803, and NSF Grant CNS-0964350,
CNS-1016343.

References
1. Barreno, M., Nelson, B.A., Joseph, A.D., Tygar, D.: The security of machine learn-
ing. Technical Report UCB/EECS-2008-43, EECS Department, University of Cal-
ifornia, Berkeley (April 2008)
2. Sculley, D., Brodley, C.E.: Compression and machine learning: A new perspective
on feature space vectors. In: DCC 2006: Proceedings of the Data Compression
Conference, pp. 332–332. IEEE Computer Society, Washington, DC (2006)
3. Bratko, A., Filipič, B., Cormack, G.V., Lynam, T.R., Zupan, B.: Spam filtering
using statistical data compression models. J. Mach. Learn. Res. 7, 2673–2698 (2006)
4. Zhou, Y., Inge, W.: Malware detection using adaptive data compression. In: AISec
2008: Proceedings of the 1st ACM Workshop on Artificial Intelligence and Security,
Alexandria, Virginia, USA, pp. 53–60 (2008)
5. Jorgensen, Z., Zhou, Y., Inge, M.: A multiple instance learning strategy for com-
bating good word attacks on spam filters. Journal of Machine Learning Research 9,
1115–1146 (2008)
6. Witten, I., Neal, R., Cleary, J.: Arithmetic coding for data compression. Commu-
nications of the ACM, 520–540 (June 1987)
7. Cleary, J., Witten, I.: Data compression using adaptive coding and partial string
matching. IEEE Transactions on Communications COM-32(4), 396–402 (1984)
8. Cormack, G., Horspool, R.: Data compression using dynamic Markov modeling. The Computer Journal 30(6), 541–550 (1987)
9. Cleary, J., Witten, I.: Unbounded length contexts for PPM. The Computer Journal 40(2/3), 67–75 (1997)
10. Moffat, A., Turpin, A.: Compression and Coding Algorithms. Kluwer Academic
Publishers, Boston (2002)
11. Moffat, A.: Implementing the ppm data compression scheme. IEEE Trans. Comm.
38, 1917–1921 (1990)
12. Howard, P.: The design and analysis of efficient lossless data compression systems.
Technical report, Brown University (1993)
13. Teahan, W.J.: Text classification and segmentation using minimum cross-entropy.
In: RIAO 2000, 6th International Conference Recherche d’Informaiton Assistee par
ordinateur (2000)
14. Lowd, D., Meek, C.: Good word attacks on statistical spam filters. In: Proceedings
of the 2nd Conference on Email and Anti-Spam (2005)
15. Cormack, G.V., Lynam, T.R.: Spam track guidelines – TREC 2005-2007 (2006), http://plg.uwaterloo.ca/~gvcormac/treccorpus06/
16. Bratko, A.: Probabilistic sequence modeling shared library (2008),
http://ai.ijs.si/andrej/psmslib.html
17. Webb, S., Chitti, S., Pu, C.: An experimental evaluation of spam filter performance
and robustness against attack. In: The 1st International Conference on Collabora-
tive Computing: Networking, Applications and Worksharing, pp. 19–21 (2005)
Mining Sequential Patterns from Probabilistic
Databases

Muhammad Muzammal and Rajeev Raman

Department of Computer Science, University of Leicester, UK


{mm386,r.raman}@mcs.le.ac.uk

Abstract. We consider sequential pattern mining in situations where


there is uncertainty about which source an event is associated with. We
model this in the probabilistic database framework and consider the prob-
lem of enumerating all sequences whose expected support is sufficiently
large. Unlike frequent itemset mining in probabilistic databases [C. Aggar-
wal et al. KDD’09; Chui et al., PAKDD’07; Chui and Kao, PAKDD’08],
we use dynamic programming (DP) to compute the probability that a
source supports a sequence, and show that this suffices to compute the ex-
pected support of a sequential pattern. Next, we embed this DP algorithm
into candidate generate-and-test approaches, and explore the pattern lat-
tice both in a breadth-first (similar to GSP) and a depth-first (similar to
SPAM) manner. We propose optimizations for efficiently computing the
frequent 1-sequences, for re-using previously-computed results through in-
cremental support computation, and for eliminating candidate sequences
without computing their support via probabilistic pruning. Preliminary
experiments show that our optimizations are effective in improving the
CPU cost.

Keywords: Mining Uncertain Data, Mining complex sequential data,


Probabilistic Databases, Novel models and algorithms.

1 Introduction

The problem of sequential pattern mining (SPM), or finding frequent sequences of


events in data with a temporal component, has been studied extensively [23,17,4]
since its introduction in [18,3]. In classical SPM, the data to be mined is deter-
ministic, but it it recognized that data obtained from a wide range of data
sources is inherently uncertain [1]. This paper is concerned with SPM in proba-
bilistic databases [19], a popular framework for modelling uncertainty. Recently
several data mining and ranking problems have been studied in this framework,
including top-k [24,8] and frequent itemset mining (FIM) [2,5,6,7]. In classical
SPM, the event database consists of tuples ⟨eid, e, σ⟩, where e is an event, σ is
a source and eid is an event-id which incorporates a time-stamp. A tuple may
record a retail transaction (event) by a customer (source), or an observation of an
object/person (event) by a sensor/camera (source). Since event-ids have a time-
stamp, the event database can be viewed as a collection of source sequences, one

per source, containing a sequence of events (ordered by time-stamp) associated


with that source, and classical SPM problem is to find patterns of events that
have a temporal order that occur in a significant number of source sequences.
Uncertainty in SPM can occur in three different places: the source, the event
and the time-stamp may all be uncertain (in contrast, in FIM, only the event
can be uncertain). In a companion paper [16] the first two kinds of uncertainty
in SPM were formalized as source-level uncertainty (SLU) and event-level un-
certainty (ELU), which we now summarize.
In SLU, the “source” attribute of each tuple is uncertain: each tuple contains
a probability distribution over possible sources (attribute-level uncertainty [19]).
As noted in [16], this formulation applies to scenarios such as the ambiguity
arising when a customer makes a retail transaction, but the customer is either
not identified exactly, or the customer database itself is probabilistic as a result
of “deduplication” or cleaning [11]. In ELU, the source of the tuple is certain,
but the events are uncertain. For example, the PEEX system [13] aggregates
unreliable observations of employees using RFID antennae at fixed locations
into uncertain higher-level events such as “with probability 0.4, at time 103,
Alice and Bob had a meeting in room 435”. Here, the source (Room 435) is
deterministic, but the event ({Alice, Bob}) only occurred with probability 0.4.
Furthermore, in [16] two measures of “frequentness”, namely expected support
and probabilistic frequentness, used for FIM in probabilistic databases [5,7], were
adapted to SPM, and the four possible combinations of models and measures
were studied from a computational complexity viewpoint. This paper is focussed
on efficient algorithms for the SPM problem in SLU probabilistic databases,
under the expected support measure, and the contributions are as follows:
1. We give a dynamic-programming (DP) algorithm to determine efficiently the
probability that a given source supports a sequence (source support proba-
bility), and show that this is enough to compute the expected support of a
sequence in an SLU event database.
2. We give depth-first and breadth-first methods to find all frequent sequences
in an SLU event database according to the expected support criterion.
3. To speed up the computation, we give subroutines for:
(a) highly efficient computation of frequent 1-sequences,
(b) incremental computation of the DP matrix, which allows us to minimize
the amount of time spent on the expensive DP computation, and
(c) probabilistic pruning, where we show how to rapidly compute an upper
bound on the probability that a source supports a candidate sequence.
4. We empirically evaluate our algorithms, demonstrating their efficiency and
scalability, as well as the effectiveness of the above optimizations.

Significance of Results. The source support probability algorithm ((1) above)


shows that in probabilistic databases, FIM and SPM are very different – there
is no need to use DP for FIM under the expected support measure [2,6,7].
Although the proof that source support probability allows the computation
of the expected support of a sequence in an SLU database is simple, it is
unexpected, since in SLU databases, there are dependencies between different


sources – in any possible world, a given event can only belong to one source. In
contrast, determining if a given sequence is probabilistically frequent in an SLU
event database is #P-complete because of the dependencies between sources [16].
Also, as noted in [16], (1) can be used to determine if a sequence is frequent
in an ELU database using both expected support and probabilistic frequentness.
This implies efficient algorithms for enumerating frequent sequences under both
frequentness criteria for ELU databases, and by using the framework of [10], we
can also find maximal frequent sequential patterns in ELU databases.
The breadth-first and depth-first algorithms (2) have a high-level similarity to
GSP [18] and SPADE/SPAM [23,4], but checking the extent to which a sequence
is supported by a source requires an expensive DP computation, and major
modifications are needed to achieve good performance. It is unclear how to use
either the projected database idea of PrefixSpan [17], or bitmaps as in SPAM); we
instead use the ideas ((3) above) of incremental computation, and probabilistic
pruning. Although there is a high-level similiarity between this pruning and a
technique of [6] for FIM in probabilistic databases, the SPM problem is more
complex, and our pruning rule is harder to obtain.

Related Work. Classical SPM has been studied extensively [18,23,17,4].


Modelling uncertain data as probabilistic databases [19,1] has led to several rank-
ing/mining problems being studied in this context. The top-k problem (a rank-
ing problem) has been studied intensively (see [12,24,8] and references therein).
FIM in probabilistic databases was studied under the expected support measure
in [2,7,6] and under the probabilistic frequentness measure in [5]. To the best
of our knowledge, apart from [16], the SPM problem in probabilistic databases
has not been studied. Uncertainty in the time-stamp attribute was considered in
[20] – we do not consider time to be uncertain. Also [22] studies SPM in “noisy”
sequences, but the model proposed there is very different to ours and does not
fit in the probabilistic database framework.

2 Problem Statement
Classical SPM [18,3]. Let I = {i1 , i2 , . . . , iq } be a set of items and S =
{1, . . . , m} be a set of sources. An event e ⊆ I is a collection of items. A database
D = ⟨r1, r2, . . . , rn⟩ is an ordered list of records such that each ri ∈ D is of the
form (eid i , ei , σi ), where eid i is a unique event-id, including a time-stamp (events
are ordered by this time-stamp), ei is an event and σi is a source.
A sequence s = ⟨s1, s2, . . . , sa⟩ is an ordered list of events. The events si in the sequence are called its elements. The length of a sequence s is the total number of items in it, i.e. Σ_{j=1}^{a} |sj|; for any integer k, a k-sequence is a sequence of length k. Let s = ⟨s1, s2, . . . , sq⟩ and t = ⟨t1, t2, . . . , tr⟩ be two sequences. We say that s is a subsequence of t, denoted s ⪯ t, if there exist integers 1 ≤ i1 < i2 < · · · < iq ≤ r such that sk ⊆ t_{i_k}, for k = 1, . . . , q. The source sequence corresponding to a
source i is just the multiset {e|(eid, e, i) ∈ D}, ordered by eid. For a sequence
s and source i, let Xi (s, D) be an indicator variable, whose value is 1 if s is
a subsequence of the source sequence for source i, and 0 otherwise. For any sequence s, define its support in D, denoted Sup(s, D) = Σ_{i=1}^{m} Xi(s, D). The objective is to find all sequences s such that Sup(s, D) ≥ θm for some user-defined threshold 0 ≤ θ ≤ 1.

Probabilistic Databases. We define an SLU probabilistic database Dp to be an


ordered list ⟨r1, . . . , rn⟩ of records of the form (eid, e, W), where eid is an event-id, e is an event and W is a probability distribution over S; the list is ordered by eid. The distribution W contains pairs of the form (σ, c), where σ ∈ S and 0 < c ≤ 1 is the confidence that the event e is associated with source σ, and Σ_{(σ,c)∈W} c = 1. An example can be found in Table 1(L).

Table 1. A source-level uncertain database (L) transformed to p-sequences (R). Note


that events like e1 (marked with † on (R)) can only be associated with one of the
sources X and Y in any possible world.

eid  event   W                               p-sequence
e1   (a, d)  (X : 0.6)(Y : 0.4)              D_X^p  (a, d : 0.6)†(a, b : 0.3)(b, c : 0.7)
e2   (a)     (Z : 1.0)                       D_Y^p  (a, d : 0.4)†(a, b : 0.2)
e3   (a, b)  (X : 0.3)(Y : 0.2)(Z : 0.5)     D_Z^p  (a : 1.0)(a, b : 0.5)(b, c : 0.3)
e4   (b, c)  (X : 0.7)(Z : 0.3)

The possible worlds semantics of D^p is as follows. A possible world D* of D^p is generated by taking each event ei in turn, and assigning it to one of the possible sources σi ∈ Wi. Thus every record ri = (eidi, ei, Wi) ∈ D^p takes the form ri = (eidi, ei, σi), for some σi ∈ S in D*. By enumerating all such possible combinations, we get the complete set of possible worlds. We assume that the distributions associated with each record ri in D^p are stochastically independent; the probability of a possible world D* is therefore Pr[D*] = Π_{i=1}^{n} Pr_{Wi}[σi]. For
example, a possible world D∗ for the database of Table 1 can be generated by
assigning events e1 , e3 and e4 to X with probabilities 0.6, 0.3 and 0.7 respectively,
and e2 to Z with probability 1.0, and Pr[D ∗ ] = 0.6 × 1.0 × 0.3 × 0.7 = 0.126.
As every possible world is a (deterministic) database, concepts like the support
of a sequence in a possible world are well-defined. The definition of the expected
support of a sequence s in D p follows naturally:

$$ES(s, D^p) = \sum_{D^* \in PW(D^p)} \Pr[D^*] \cdot Sup(s, D^*), \qquad (1)$$

The problem we consider is:


Given an SLU probabilistic database D p , determine all sequences s such
that ES(s, D^p) ≥ θm, for some user-specified threshold θ, 0 ≤ θ ≤ 1.
Since there are potentially an exponential number of possible worlds, it is infea-
sible to compute ES(s, Dp ) directly using Eq. 1; next we show how to do this
computation more efficiently using linearity of expectation and DP.
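For very small instances, Eq. 1 can still be evaluated directly by enumerating possible worlds, which is useful as a check on the DP of the next section. The sketch below does exactly this on a toy database such as Table 1; the record and sequence encodings are illustrative choices, not the paper's.

```python
# Brute-force evaluation of Eq. 1 (exponential in the number of records).
from itertools import product

def is_subsequence(seq, events):
    """Greedy check that seq (a list of item-sets) is contained in an event list."""
    it = iter(events)
    return all(any(set(element) <= set(event) for event in it) for element in seq)

def expected_support_bruteforce(records, seq, sources):
    """Sum Pr[D*] * Sup(seq, D*) over every possible world D*."""
    total = 0.0
    for world in product(*[list(w.items()) for (_, _, w) in records]):
        prob, assignment = 1.0, {src: [] for src in sources}
        for (_, event, _), (src, conf) in zip(records, world):
            prob *= conf
            assignment[src].append(event)       # events remain ordered by eid
        total += prob * sum(is_subsequence(seq, assignment[s]) for s in sources)
    return total

# Database of Table 1; the expected support of (a)(b) comes out as 1.288,
# matching the DP-based computation in Section 3.
db = [('e1', ('a', 'd'), {'X': 0.6, 'Y': 0.4}),
      ('e2', ('a',),     {'Z': 1.0}),
      ('e3', ('a', 'b'), {'X': 0.3, 'Y': 0.2, 'Z': 0.5}),
      ('e4', ('b', 'c'), {'X': 0.7, 'Z': 0.3})]
print(expected_support_bruteforce(db, [['a'], ['b']], ['X', 'Y', 'Z']))
```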
3 Computing Expected Support


p-sequences. A p-sequence is analogous to a source sequence in classical SPM,
and is a sequence of the form (e1 , c1 ) . . . (ek , ck ), where ej is an event and cj is
a confidence value. In examples, we write a p-sequence ({a, d}, 0.4), ({a, b}, 0.2)
as (a, d : 0.4)(a, b : 0.2). An SLU database D p can be viewed as a collection of p-
sequences D_1^p, . . . , D_m^p, where D_i^p is the p-sequence of source i, and contains a list
of those events in Dp that have non-zero confidence of being assigned to source i,
ordered by eid, together with the associated confidence (see Table 1(R)). How-
ever, the p-sequences corresponding to different sources are not independent,
as illustrated in Table 1(R). Thus, one may view an SLU event database as a
collection of p-sequences with dependencies in the form of x-tuples [8]. Never-
theless, we show that we can still process the p-sequences independently for the
purposes of expected support computation:

$$ES(s, D^p) = \sum_{D^* \in PW(D^p)} \Pr[D^*] \cdot Sup(s, D^*) = \sum_{D^*} \Pr[D^*] \cdot \sum_{i=1}^{m} X_i(s, D^*) = \sum_{i=1}^{m} \sum_{D^*} \Pr[D^*] \cdot X_i(s, D^*) = \sum_{i=1}^{m} E[X_i(s, D^p)], \qquad (2)$$

where E denotes the expected value of a random variable. Since Xi is a 0-1


variable, E[Xi(s, D^p)] = Pr[s ⪯ D_i^p], and we calculate the right-hand quantity, which we refer to as the source support probability. This cannot be done naively: e.g., if D_i^p = (a, b : c1)(a, b : c2) . . . (a, b : cq), then there are O(q^{2k}) ways in which (a)(a, b) . . . (a)(a, b) (repeated k times) could be supported by source i, and so we use DP.

Computing the Source Support Probability. Given a p-sequence Dip =


(e1, c1), . . . , (er, cr) and a sequence s = ⟨s1, . . . , sq⟩, we create a (q + 1) × (r + 1) matrix A_{i,s}[0..q][0..r] (we omit the subscripts on A when the source and sequence are clear from the context). For 1 ≤ k ≤ q and 1 ≤ ℓ ≤ r, A[k, ℓ] will contain Pr[⟨s1, . . . , sk⟩ ⪯ (e1, c1), . . . , (eℓ, cℓ)], so A[q, r] is the desired value of Pr[s ⪯ D_i^p]. We set A[0, ℓ] = 1 for all ℓ, 0 ≤ ℓ ≤ r, and A[k, 0] = 0 for all 1 ≤ k ≤ q, and compute the other values row-by-row. For 1 ≤ k ≤ q and 1 ≤ ℓ ≤ r, define:

$$c^*_{k\ell} = \begin{cases} c_\ell & \text{if } s_k \subseteq e_\ell \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

The interpretation of Eq. 3 is that c^*_{kℓ} is the probability that eℓ allows the element sk to be matched in source i; this is 0 if sk ⊄ eℓ, and is otherwise equal to the probability that eℓ is associated with source i. Now we use the equation:

$$A[k, \ell] = (1 - c^*_{k\ell}) \cdot A[k, \ell - 1] + c^*_{k\ell} \cdot A[k - 1, \ell - 1]. \qquad (4)$$

Table 2 shows the computation of the source support probability of an example sequence s = (a)(b) for source X in the probabilistic database of Table 1. Similarly, we can compute Pr[s ⪯ D_Y^p] = 0.08 and Pr[s ⪯ D_Z^p] = 0.65, so the expected support of (a)(b) in the database of Table 1 is 0.558 + 0.08 + 0.65 = 1.288, the same value obtained by direct application of Eq. 1.
Table 2. Computing Pr[s ⪯ D_X^p] for s = (a)(b) using DP in the database of Table 1

         (a, d : 0.6)               (a, b : 0.3)                 (b, c : 0.7)
(a)      0.4 × 0 + 0.6 × 1 = 0.6    0.7 × 0.6 + 0.3 × 1 = 0.72   0.72
(a)(b)   0                          0.7 × 0 + 0.3 × 0.6 = 0.18   0.3 × 0.18 + 0.7 × 0.72 = 0.558

The reason Eq. 4 is correct is that if sk ⊄ eℓ then the probability that ⟨s1, . . . , sk⟩ ⪯ ⟨e1, . . . , eℓ⟩ is the same as the probability that ⟨s1, . . . , sk⟩ ⪯ ⟨e1, . . . , eℓ−1⟩ (note that if sk ⊄ eℓ then c^*_{kℓ} = 0 and A[k, ℓ] = A[k, ℓ − 1]). Otherwise, c^*_{kℓ} = cℓ, and we have to consider two disjoint sets of possible worlds: those where eℓ is not associated with source i (the first term in Eq. 4) and those where it is (the second term in Eq. 4). In summary:

Lemma 1. Given a p-sequence D_i^p and a sequence s, by applying Eq. 4 repeatedly, we correctly compute Pr[s ⪯ D_i^p].
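A minimal implementation of the DP of Eqs. 3 and 4 might look as follows (a sketch; the event and sequence encodings are our own). On source X of Table 1 it reproduces the value 0.558 of Table 2.

```python
# DP of Eqs. 3-4: p_seq is a list of (event, confidence) pairs, s a list of item-sets.
def source_support_prob(s, p_seq):
    """Return Pr[s is supported by the source whose p-sequence is p_seq]."""
    q, r = len(s), len(p_seq)
    A = [[0.0] * (r + 1) for _ in range(q + 1)]
    A[0] = [1.0] * (r + 1)                 # A[0, l] = 1: the empty sequence is supported
    for k in range(1, q + 1):
        for l in range(1, r + 1):
            event, conf = p_seq[l - 1]
            c_star = conf if set(s[k - 1]) <= set(event) else 0.0            # Eq. 3
            A[k][l] = (1 - c_star) * A[k][l - 1] + c_star * A[k - 1][l - 1]  # Eq. 4
    return A[q][r]

# Source X of Table 1: prints 0.558 (up to floating-point rounding), as in Table 2.
p_X = [(('a', 'd'), 0.6), (('a', 'b'), 0.3), (('b', 'c'), 0.7)]
print(source_support_prob([['a'], ['b']], p_X))
```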

4 Optimizations

We now describe three optimized sub-routines for computing all frequent 1-sequences, for incremental support computation, and for probabilistic pruning.

Fast L1 Computation. Given a 1-sequence s = {x}, a simple closed-form expression for Pr[s ⪯ D_i^p] is $1 - \prod_{\ell=1}^{r}(1 - c^*_{1\ell})$. It is easy to verify by induction that Eq. 4 gives the same answer, since $\bigl(1 - \prod_{\ell=1}^{t-1}(1 - c^*_{1\ell})\bigr)(1 - c^*_{1t}) + c^*_{1t} = 1 - \prod_{\ell=1}^{t}(1 - c^*_{1\ell})$ – recall that A[0, ℓ − 1] = 1 for all ℓ ≥ 1.
compute ES(s, Dp ) for all 1-sequences s in just one (linear-time) pass through
Dp . Initialize two arrays F and G, each of size q = |I|, to zero and consider
each source i in turn. If Dip = (e1 , c1 ), . . . , (er , cr ), for k = 1, . . . , r take the
pair (ek , ck ) and iterate through each x ∈ ek , setting F [x] := 1 − ((1 − F [x]) ∗
(1 − ck )). Once we are finished with source i, if F [x] is non-zero, we update
G[x] := G[x] + F [x] and reset F [x] to zero (we use a linked list to keep track
of which entries of F are non-zero for a given source). At the end, for any item
x ∈ I, G[x] = ES(x, D p ).
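A sketch of this one-pass procedure is given below, using dictionaries in place of the fixed-size arrays F and G (which makes the linked list of non-zero entries implicit); `db` is assumed to map each source to its p-sequence, with the same encoding as above.

```python
# Fast L1 computation: expected support of every 1-sequence in one pass per source.
from collections import defaultdict

def expected_support_of_items(db):
    """Return G with G[x] = ES(<{x}>, D^p) for every item x."""
    G = defaultdict(float)
    for source, p_seq in db.items():
        F = defaultdict(float)                       # F[x] = 1 - prod(1 - c) so far
        for event, conf in p_seq:
            for x in event:
                F[x] = 1.0 - (1.0 - F[x]) * (1.0 - conf)
        for x, prob in F.items():                    # add Pr[<{x}> supported by source] to total
            G[x] += prob
    return G
```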

Incremental Support Computation. Let s and t be two sequences of length


j and j + 1 respectively. Say that t is an S-extension of s if t = s · {x} for some
item x, where · denotes concatenation (i.e. we obtain t by appending a single
item as a new element to s). We say that t is an I-extension of s if s = ⟨s1, . . . , sq⟩ and t = ⟨s1, . . . , sq ∪ {x}⟩ for some x ∉ sq, and x is lexicographically not less than any item in sq (i.e. we obtain t by adding a new item to the last element of s).
For example, if s = (a)(b, c) and x = d, S- and I-extensions of s are (a)(b, c)(d)
and (a)(b, c, d) respectively. Similar to classical SPM, we generate candidate
sequences t that are either S- or I-extensions of existing frequent sequences s, and
compute ES(t, D p ) by computing Pr[t  Dip ] for all sources i. While computing
Pr[t  Dip ] for source i, we would like to exploit the similarity between s and t
to compute Pr[t  Dip ] more rapidly.
Let i be a source, D_i^p = (e1, c1), . . . , (er, cr), and s = ⟨s1, . . . , sq⟩ be any sequence. Now let A_{i,s} be the (q + 1) × (r + 1) DP matrix used to compute Pr[s ⪯ D_i^p], and let B_{i,s} denote the last row of A_{i,s}, that is, B_{i,s}[ℓ] = A_{i,s}[q, ℓ] for ℓ = 0, . . . , r. We now show that if t is an extension of s, then we can quickly compute B_{i,t} from B_{i,s}, and thereby obtain Pr[t ⪯ D_i^p] = B_{i,t}[r]:
Lemma 2. Let s and t be sequences such that t is an extension of s, and let i
be a source whose p-sequence has r elements in it. Then, given Bi,s and Dip , we
can compute Bi,t in O(r) time.
Proof. We only discuss the case where t is an I-extension of s, i.e. t = ⟨s1, . . . , sq ∪ {x}⟩ for some x ∉ sq. Firstly, observe that since the first q − 1 elements of s and t are pairwise equal, the first q − 1 rows of A_{i,s} and A_{i,t} are also equal. The (q − 1)-st row of A_{i,s} is enough to compute the q-th row of A_{i,t}, but we only have B_{i,s}, the q-th row of A_{i,s}. If tq = sq ∪ {x} ⊄ eℓ, then A_{i,t}[q, ℓ] = A_{i,t}[q, ℓ − 1], and we can move on to the next value of ℓ. If tq ⊆ eℓ, then sq ⊆ eℓ and so:

$$A_{i,s}[q, \ell] = (1 - c_\ell) \cdot A_{i,s}[q, \ell - 1] + c_\ell \cdot A_{i,s}[q - 1, \ell - 1]$$

Since we know B_{i,s}[ℓ] = A_{i,s}[q, ℓ], B_{i,s}[ℓ − 1] = A_{i,s}[q, ℓ − 1] and cℓ, we can compute A_{i,s}[q − 1, ℓ − 1]. But this value is equal to A_{i,t}[q − 1, ℓ − 1], which is the value from the (q − 1)-st row of A_{i,t} that we need to compute A_{i,t}[q, ℓ]. Specifically, we compute:

$$B_{i,t}[\ell] = (1 - c_\ell) \cdot B_{i,t}[\ell - 1] + \bigl(B_{i,s}[\ell] - B_{i,s}[\ell - 1] \cdot (1 - c_\ell)\bigr)$$

if tq ⊆ eℓ (otherwise B_{i,t}[ℓ] = B_{i,t}[ℓ − 1]). The (easier) case of S-extensions and
an example illustrating incremental computation can be found in [15].
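The incremental update of Lemma 2 can be sketched as follows. The I-extension case follows the formula in the proof; the S-extension case is deferred to [15] in the text, so the version below is our own straightforward derivation (the previous last row B_{i,s} now plays the role of the row above the new one) and may differ from the authors' in detail.

```python
# Incremental computation of the last DP row for an extension t of s (Lemma 2).
def i_extend_row(B_s, last_element_of_t, p_seq):
    """t is an I-extension of s: its last element is s_q union {x}."""
    B_t = [0.0] * (len(p_seq) + 1)
    for l, (event, conf) in enumerate(p_seq, start=1):
        if set(last_element_of_t) <= set(event):
            B_t[l] = (1 - conf) * B_t[l - 1] + (B_s[l] - B_s[l - 1] * (1 - conf))
        else:
            B_t[l] = B_t[l - 1]
    return B_t

def s_extend_row(B_s, new_element, p_seq):
    """t = s . <new_element>; B_s is the row directly above the new last row (assumed)."""
    B_t = [0.0] * (len(p_seq) + 1)
    for l, (event, conf) in enumerate(p_seq, start=1):
        c = conf if set(new_element) <= set(event) else 0.0
        B_t[l] = (1 - c) * B_t[l - 1] + c * B_s[l - 1]
    return B_t
```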

Probabilistic Pruning. We now describe a technique that allows us to prune


non-frequent sequences s without fully computing ES(s, Dp ). For each source i,
we obtain an upper bound on Pr[s  Dip ] and add up all the upper bounds; if
the sum is below the threshold, s can be pruned. We first show (proof in [15]):
Lemma 3. Let s = ⟨s1, . . . , sq⟩ be a sequence, and let D_i^p be a p-sequence. Then: Pr[s ⪯ D_i^p] ≤ Pr[⟨s1, . . . , sq−1⟩ ⪯ D_i^p] · Pr[⟨sq⟩ ⪯ D_i^p].
We now indicate how Lemma 3 is used. Suppose, for example, that we have a
candidate sequence s = (a)(b, c)(a), and a source X. By Lemma 3:
$$\begin{aligned}
\Pr[(a)(b,c)(a) \preceq D_X^p] &\le \Pr[(a)(b,c) \preceq D_X^p] \cdot \Pr[(a) \preceq D_X^p] \\
&\le \Pr[(a) \preceq D_X^p] \cdot \Pr[(b,c) \preceq D_X^p] \cdot \Pr[(a) \preceq D_X^p] \\
&\le (\Pr[(a) \preceq D_X^p])^2 \cdot \min\{\Pr[(b) \preceq D_X^p], \Pr[(c) \preceq D_X^p]\}
\end{aligned}$$
Note that the quantities on the RHS are computed for each source by the fast
L1 computation, and can be stored in a small data structure. However, the last line is the least accurate upper bound: if Pr[(a)(b, c) ⪯ D_X^p] is available when pruning, a tighter bound is Pr[(a)(b, c) ⪯ D_X^p] · Pr[(a) ⪯ D_X^p].
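A sketch of how this bound might be applied in practice: Lemma 3 is applied repeatedly to bound Pr[s ⪯ D_i^p] by a product over the elements of s, with each element bounded by its rarest item. The per-item probabilities are assumed to come from the fast L1 computation, and the data layout (one dictionary per source) is an assumption rather than the paper's data structure.

```python
# Probabilistic pruning via the coarse upper bound derived from Lemma 3.
def pruning_upper_bound(candidate, item_prob, threshold):
    """Return True if `candidate` (a list of item-sets) can be pruned."""
    total = 0.0
    for per_item in item_prob:                    # one dict of item -> Pr[<{x}>] per source
        bound = 1.0
        for element in candidate:
            bound *= min(per_item.get(x, 0.0) for x in element)
        total += bound
    return total < threshold                      # threshold is theta * m
```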
5 Candidate Generation
We now describe two candidate generation methods for enumerating all frequent
sequences, one each based on breadth-first and depth-first exploration of the
sequence lattice, which are similar to GSP [18,3] and SPAM [4] respectively. We
first note that an “Apriori” property holds in our setting:
Lemma 4. Given two sequences s and t, and a probabilistic database Dp , if s
is a subsequence of t, then ES(s, D^p) ≥ ES(t, D^p).
Proof. In Eq. 1 note that for all D ∗ ∈ P W (Dp ), Sup(s, D∗ ) ≥ Sup(t, D∗ ).

Breadth-First Exploration. An overview of our BFS approach is in Fig. 1(L).


We now describe some details. Each execution of lines (6)-(10) is called a phase.
Line 2 is done using the fast L1 computation (see Section 4). Line 4 is done as
in [18,3]: two sequences s and s in Lj are joined iff deleting the first item in s
and the last item in s results in the same sequence, and the result t comprises
s extended with the last item in s . This item is added the way it was in s i.e.
either a separate element (t is an S-extension of s) or to the last element of s (t
is an I-extension of s). We apply apriori pruning to the set of candidates in the
(j + 1)-st phase, Cj+1 , and probabilistic pruning can additionally be applied to
Cj+1 (note that for C2 , probabilistic pruning is the only possibility).
In Lines 6-7, the loop iterates over all sources, and for the i-th source, first
consider only those sequences from Cj+1 that could potentially be supported
by source i, Ni,j+1 , (narrowing). For the purpose of narrowing, we put all the
sequences in Cj+1 in a hashtree, similar to [18]. A candidate sequence t ∈ Cj+1
is stored in the hashtree by hashing on each item in t upto the j-th item, and the
leaf node contains the (j + 1)-st item. In the (j + 1)-st phase, when considering
source i, we recursively traverse the hashtree by hashing on every item in Li,1
until we have traversed all the leaf nodes, thus obtaining Ni,j+1 for source i.
Given N_{i,j+1} we compute the support of t in source i as follows. Let s = ⟨s1, . . . , sq⟩ and t = ⟨t1, . . . , tr⟩ be two sequences; if s and t have a common prefix, i.e. for k = 1, 2, . . . , z, sk = tk, then we start the computation
of Pr[t  Dip ] from tz+1 . Observe that our narrowing method naturally tends to
place sequences with common prefixes in consecutive positions of Ni,j+1 .

Depth-First Exploration. An overview of our depth-first approach is in Fig. 1


(R) [23,4]. We first compute the set of frequent 1-sequences, L1 (Line 1) (assume
L1 is in ascending order). We then explore the pattern sub-lattice as follows.
Consider a call of TraverseDFS(s), where s is some k-sequence. We first check
that all lexicographically smaller k-subsequences of t are frequent, and reject t
as infrequent if this test fails (Line 7). We can then apply probabilistic pruning
to t, and if t is still not pruned we compute its support (Line 8). If at any stage
t is found to be infrequent, we do not consider x, the item used to extend s to t,
as a possible alternative in the recursive tree under s (as in [4]). Observe that for
sequences s and t, where t is an S- or I- extension of s, if Pr[s  Dip ] = 0, then
Pr[t  Dip ] = 0. When computing ES(s, Dp ), we keep track of all the sources
BFS (L):
1: j ← 1
2: L1 ← ComputeFrequent-1(Dp)
3: while Lj ≠ ∅ do
4:    Cj+1 ← Join Lj with itself
5:    Prune Cj+1
6:    for all s ∈ Cj+1 do
7:        Compute ES(s, Dp)
8:    Lj+1 ← all sequences s ∈ Cj+1 {s.t. ES(s, Dp) ≥ θm}
9:    j ← j + 1
10: Stop and output L1 ∪ . . . ∪ Lj

DFS (R):
1: L1 ← ComputeFrequent-1(Dp)
2: for all sequences x ∈ L1 do
3:    Call TraverseDFS(x)
4: Output all frequent sequences
5: function TraverseDFS(s)
6:    for all x ∈ L1 do
7:        t ← s · {x}    {S-extension}
8:        Compute ES(t, Dp)
9:        if ES(t, Dp) ≥ θm then
10:          TraverseDFS(t)
11:       t ← ⟨s1, . . . , sq ∪ {x}⟩    {I-extension}
12:       Compute ES(t, Dp)
13:       if ES(t, Dp) ≥ θm then
14:          TraverseDFS(t)
15: end function

Fig. 1. BFS (L) and DFS (R) Algorithms. Dp is the input database and θ the threshold.

where Pr[s ⪯ D_i^p] > 0, denoted by S^s. If s is frequent, then when computing ES(t, D^p) we only need to visit the sources in S^s.
Furthermore, with every source i ∈ S s , we assume that the array Bi,s (see
Section 4) has been saved prior to calling TraverseDFS(s), allowing us to use
incremental computation. By implication, the arrays Bi,r for all prefixes r of s
are also stored for all sources i ∈ S r ), so in the worst case, each source may
store up to k arrays, if s is a k-sequence. The space usage of the DFS traversal
is quite modest in practice, however.

6 Experimental Evaluation
We report on an experimental evaluation of our algorithms. Our implementations
are in C# (Visual Studio .Net), executed on a machine with a 3.2GHz Intel CPU
and 3GB RAM running XP (SP3). We begin by describing the datasets used for
experiments. Then, we demonstrate the scalability of our algorithms (reported
running times are averages from multiple runs), and also evaluate probabilis-
tic pruning. In our experiments, we use both real (Gazelle from Blue Martini
[14]) and synthetic (IBM Quest [3]) datasets. We transform these deterministic
datasets to probabilistic form in a way similar to [2,5,24,7]; we assign probabili-
ties to each event in a source sequence using a uniform distribution over (0, 1],
thus obtaining a collection of p-sequences. Note that we in fact generate ELU
data rather than SLU data: a key benefit of this approach is that it tends to
preserve the distribution of frequent sequences in the deterministic data.
We follow the naming convention of [23]: a dataset named CiDjK means that
the average number of events per source is i and the number of sources is j (in
thousands). Alphabet size is 2K and all other parameters are set to default.
We study three parameters in our experiments: the number of sources D,
the average number of events per source C, and the threshold θ. We test our
algorithms for one of the three parameters by keeping the other two fixed. Ev-
idently, all other parameters being fixed, increasing D and C, or decreasing θ,
all make an instance harder. We choose our algorithm variants according to two
“axes”:

– Lattice traversal could be done using BFS or DFS.


– Probabilistic Pruning (P) could be ON or OFF.

We thus report on four variants in all, for example “BFS+P” represents the
variant with BFS lattice traversal and with probabilistic pruning ON.

Probabilistic Pruning. To show the effectiveness of probabilistic pruning, we


kept statistics on the number of candidates both for BFS and for DFS. Due to
space limitations, we report statistics only for the dataset C10D20K here. For
more details, see [15]. Table 3 shows that probabilistic pruning is highly effective
at eliminating infrequent candidates in phase 2 — for example, in both BFS
and DFS, over 95% of infrequent candidates were eliminated without support
computation. However, probabilistic pruning was less effective in BFS compared
to DFS in the later phases. This is because we compute a coarser upper bound
in BFS than in DFS, as we only store Li,1 probabilities in BFS, whereas we
store both Li,1 and Li,j probabilities in DFS. We therefore turn probabilistic pruning OFF after Phase 2 in BFS in our experiments. If we could also store Li,j probabilities in BFS, a more refined upper bound could be attained (as mentioned after Lemma 3 and shown in [15], Section 6).

Table 3. Effectiveness of probabilistic pruning at θ = 2%, for dataset C10D20K in BFS


(L) and in DFS (R). The columns from L to R indicate the numbers of candidates cre-
ated by joining, remaining after apriori pruning, remaining after probabilistic pruning,
and deemed as frequent, respectively.

BFS (L):
Phase  Joining  Apriori  Prob. prun.  Frequent
2      15555    15555    246          39
3      237      223      208          91

DFS (R):
Phase  Joining  Apriori  Prob. prun.  Frequent
2      15555    15555    246          39
3      334      234      175          91

Fig. 2. Effectiveness of probabilistic pruning for decreasing values of θ, for synthetic dataset (C10D10K) (L) and for real dataset Gazelle (R). (Plots of running time in seconds vs. θ values (in %age) for BFS, BFS+P, DFS, DFS+P.)
Fig. 3. Scalability of our algorithms for increasing number of sources D (L, with C = 10, θ = 1%), and for increasing number of events per source C (R, with D = 10K, θ = 25%). (Plots of running time in seconds for BFS, BFS+P, DFS, DFS+P.)

In Fig. 2, we show the effect of probabilistic pruning on overall running time


as θ decreases, for both synthetic (C10D10K) and real (Gazelle) datasets. It can
be seen that pruning is effective particularly for low θ, for both datasets.

Scalability Testing. We test the scalability of our algorithms by fixing C = 10


and θ = 1%, for increasing values of D (Fig. 3(L)), and by fixing D = 10K
and θ = 25%, for increasing values of C (Fig. 3(R)). We observe that all our
algorithms scale essentially linearly in both sets of experiments.

7 Conclusions and Future Work


We have considered the problem of finding all frequent sequences in SLU databases.
This is a first study on efficient algorithms for this problem, and naturally a
number of open directions remain e.g. exploring further the notion of ”interest-
ingness”. In this paper, we have used the expected support measure which has
the advantage that it can be computed efficiently for SLU databases – probabilis-
tic frequentness [5] is provably intractable for SLU databases [16]. Our approach
yields (in principle) efficient algorithms for both measures in ELU databases, and
comparing both measures in terms of computational cost versus solution quality
is an interesting future direction. A number of longer-term challenges remain,
including creating a data generator that gives an “interesting” SLU database
and considering more general models of uncertainty (e.g. it is not clear that the
assumption of independence between successive uncertain events is justified).

References
1. Aggarwal, C.C. (ed.): Managing and Mining Uncertain Data. Springer, Heidelberg
(2009)
2. Aggarwal, C.C., Li, Y., Wang, J., Wang, J.: Frequent pattern mining with uncertain
data. In: Elder et al. [9], pp. 29–38
3. Agrawal, R., Srikant, R.: Mining sequential patterns. In: Yu, P.S., Chen, A.L.P.
(eds.) ICDE, pp. 3–14. IEEE Computer Society, Los Alamitos (1995)
4. Ayres, J., Flannick, J., Gehrke, J., Yiu, T.: Sequential pattern mining using a
bitmap representation. In: KDD, pp. 429–435 (2002)
5. Bernecker, T., Kriegel, H.P., Renz, M., Verhein, F., Züfle, A.: Probabilistic frequent
itemset mining in uncertain databases. In: Elder et al. [9], pp. 119–128
6. Chui, C.K., Kao, B.: A decremental approach for mining frequent itemsets from
uncertain data. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD
2008. LNCS (LNAI), vol. 5012, pp. 64–75. Springer, Heidelberg (2008)
7. Chui, C.K., Kao, B., Hung, E.: Mining frequent itemsets from uncertain data. In:
Zhou, Z.-H., Li, H., Yang, Q. (eds.) PAKDD 2007. LNCS (LNAI), vol. 4426, pp.
47–58. Springer, Heidelberg (2007)
8. Cormode, G., Li, F., Yi, K.: Semantics of ranking queries for probabilistic data
and expected ranks. In: ICDE, pp. 305–316. IEEE, Los Alamitos (2009)
9. Elder, J.F., Fogelman-Soulié, F., Flach, P.A., Zaki, M.J. (eds.): Proceedings of the
15th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, Paris, France, June 28-July 1. ACM, New York (2009)
10. Gunopulos, D., Khardon, R., Mannila, H., Saluja, S., Toivonen, H., Sharm, R.S.:
Discovering all most specific sentences. ACM Trans. DB Syst. 28(2), 140–174 (2003)
11. Hassanzadeh, O., Miller, R.J.: Creating probabilistic databases from duplicated
data. The VLDB Journal 18(5), 1141–1166 (2009)
12. Hua, M., Pei, J., Zhang, W., Lin, X.: Ranking queries on uncertain data: a proba-
bilistic threshold approach. In: Wang [21], pp. 673–686
13. Khoussainova, N., Balazinska, M., Suciu, D.: Probabilistic event extraction from
RFID data. In: ICDE, pp. 1480–1482. IEEE, Los Alamitos (2008)
14. Kohavi, R., Brodley, C., Frasca, B., Mason, L., Zheng, Z.: KDD-Cup 2000 orga-
nizers’ report: Peeling the onion. SIGKDD Explorations 2(2), 86–98 (2000)
15. Muzammal, M., Raman, R.: Mining sequential patterns from probabilistic
databases. Tech. Rep. CS-10-002, Dept. of Comp. Sci. Univ. of Leicester, UK
(2010), http://www.cs.le.ac.uk/people/mm386/pSPM.pdf
16. Muzammal, M., Raman, R.: On probabilistic models for uncertain sequential pat-
tern mining. In: Cao, L., Feng, Y., Zhong, J. (eds.) ADMA 2010, Part I. LNCS,
vol. 6440, pp. 60–72. Springer, Heidelberg (2010)
17. Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., Hsu,
M.: Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE
Trans. Knowl. Data Eng. 16(11), 1424–1440 (2004)
18. Srikant, R., Agrawal, R.: Mining sequential patterns: Generalizations and perfor-
mance improvements. In: Apers, P.M.G., Bouzeghoub, M., Gardarin, G. (eds.)
EDBT 1996. LNCS, vol. 1057, pp. 3–17. Springer, Heidelberg (1996)
19. Suciu, D., Dalvi, N.N.: Foundations of probabilistic answers to queries. In: Özcan,
F. (ed.) SIGMOD Conference, p. 963. ACM, New York (2005)
20. Sun, X., Orlowska, M.E., Li, X.: Introducing uncertainty into pattern discovery in
temporal event sequences. In: ICDM, pp. 299–306. IEEE Computer Society, Los
Alamitos (2003)
21. Wang, J.T.L. (ed.): Proceedings of the ACM SIGMOD International Conference
on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12.
ACM, New York (2008)
22. Yang, J., Wang, W., Yu, P.S., Han, J.: Mining long sequential patterns in a noisy
environment. In: Franklin, M.J., Moon, B., Ailamaki, A. (eds.) SIGMOD Confer-
ence, pp. 406–417. ACM, New York (2002)
23. Zaki, M.J.: SPADE: An efficient algorithm for mining frequent sequences. Machine
Learning 42(1/2), 31–60 (2001)
24. Zhang, Q., Li, F., Yi, K.: Finding frequent items in probabilistic data. In: Wang
[21], pp. 819–832
Large Scale Real-Life Action Recognition Using Conditional Random Fields with Stochastic Training

Xu Sun¹, Hisashi Kashima¹, Ryota Tomioka¹, and Naonori Ueda²

¹ Department of Mathematical Informatics, The University of Tokyo
  {xusun,kashima,tomioka}@mist.i.u-tokyo.ac.jp
² NTT Communication Science Laboratories, Kyoto, Japan
  ueda@cslab.kecl.ntt.co.jp

Abstract. Action recognition is usually studied with limited lab set-


tings and a small data set. Traditional lab settings assume that the start
and the end of each action are known. However, this is not true for the
real-life activity recognition, where different actions are present in a con-
tinuous temporal sequence, with their boundaries unknown to the recog-
nizer. Also, unlike previous attempts, our study is based on a large-scale
data set collected from real world activities. The novelty of this paper is
twofold: (1) Large-scale non-boundary action recognition; (2) The first
application of the averaged stochastic gradient training with feedback
(ASF) to conditional random fields. We find the ASF training method
outperforms a variety of traditional training methods in this task.

Keywords: Continuous Action Recognition, Conditional Random Fields, Online Training.

1 Introduction

Acceleration sensor based action recognition is useful in practical applications


[1,2,3,4]. For example, in some medical programmes, researchers hope to prevent
lifestyle diseases from being exacerbated. However, the traditional way of coun-
seling is ineffective both in time and accuracy, because it requires many manual
operations. In sensor-based action recognition, an accelerometer is employed
(e.g., attached on the wrist of people) to automatically capture the acceleration
statistics (e.g., a temporal sequence of three-dimension acceleration data) in the
daily life of counselees, and the corresponding categories of behaviors (actions)
can be automatically identified with a certain level of accuracy.
Although there is a considerable literature on action recognition, most of
the prior work discusses action recognition in a pre-defined limited environment
[1,2,3]. It is unclear whether or not the previous methods perform well in a more
natural real-life environment. For example, most of the prior work assumes that
the beginning and ending time of each action are known to the target recogniz-
ing system, and the produced system only performs simple classifications to the


Fig. 1. An example of real-life continuous actions in our data, in which the corresponding 3D acceleration signals (x-, y- and z-axis signal strength in g, plotted over time in seconds) are collected from the attached sensors. See Section 5 for the meaning of the ‘g’ and the action types, act-0 to act-5.

action signals [1,2,3]. However, this is not the case for real-life action sequences
of human beings, in which different types of actions are performed one by one
without an explicit segmentation on the boundaries. For example, people may
first walk, and then take a taxi, and then take an elevator, in which the bound-
aries of the actions are unknown to the target action recognition system. An
example of real-life actions with continuous sensor signals is shown in Figure 1.
To address this concern, it is necessary and important to develop a more powerful sys-
tem not only to predict the types of the actions, but also to disambiguate the
boundaries of those actions.
With this motivation, we collected a large-scale real-life action data (continu-
ous sensor-based three-dimension acceleration signals) from about one hundred
people for continuous real-life action recognition. We adopt a popular structured
classification model, conditional random fields (CRFs), for recognizing the ac-
tion types and at the same time disambiguate the action boundaries. Moreover,
good online training methods are necessary for training CRFs on a large-scale
data in our task. We will compare different online training methods for training
CRFs on this action recognition data.

2 Related Work and Motivations


Most of the prior work on action recognition treated the task as a single-label
classification problem [1,2,3]. Given a sequence of sensor signals, the action recog-
nition system predicts a single label (representing a type of action) for the whole

sequence. Ravi et al. [3] used decision trees, support vector machines (SVMs)
and K-nearest neighbors (KNN) models for classification. Bao and Intille [1] and
Pärkkä et al. [2] used decision trees for classification. A few other works treated
the task as a structured classification problem. Huynh et al. [4] tried to discover
latent activity patterns by using a Bayesian latent topic model.
Most of the prior work of action recognition used a relatively small data set.
For example, in Ravi et al. [3], the data was collected from two persons. In Huynh
et al. [4], the data was collected from only one person. In Pärkkä et al. [2], the
data was collected from 16 persons.
There are two major approaches for training conditional random fields: batch
training and online training. Standard gradient descent methods are normally
batch training methods, in which the gradient computed by using all training in-
stances is used to update the parameters of the model. The batch training meth-
ods include, for example, steepest gradient descent, conjugate gradient descent
(CG), and quasi-Newton methods like Limited-memory BFGS (LBFGS) [5]. The
true gradient is usually the sum of the gradients from each individual training
instance. Therefore, batch gradient descent requires the training method to go
through the entire training set before updating parameters. Hence, the batch
training methods are slow on training CRFs.
A promising fast online training method is the stochastic gradient method,
for example, the stochastic gradient descent (SGD) [6,7]. The parameters of the
model are updated much more frequently, and much fewer iterations are needed
before the convergence. For large-scale data sets, the SGD can be much faster
than batch gradient-based training methods. However, there are problems in
the current SGD literature: (1) The SGD is sensitive to noise. The accuracy
of the SGD training is limited when the data is noisy (for example, the data
inconsistency problem that we will discuss in the experiment section). (2) The
SGD is not robust: it has many hyper-parameters (not only the regularization
but also the learning rate) and it is quite sensitive to them. Tuning the hyper-
parameters for SGD is not an easy task.
To deal with the problems of the traditional training methods, we use a new
online gradient-based learning method, the averaged SGD with feedback (ASF)
[8], for training conditional random fields. According to the experiments, the
ASF training method is quite robust for training CRFs for the action recognition
task.

3 Conditional Random Fields

Many traditional structured classification models may suffer from a problem,


which is usually called “the label bias problem” [9,10]. Conditional random fields
(CRFs) were proposed as an alternative solution for structured classification by
solving “the label bias problem” [10]. Assuming a feature function that maps a
pair of observation sequence x and label sequence y to a global feature vector f ,
the probability of a label sequence y conditioned on the observation sequence x
is modeled as follows [10,11,12]:

P(y|x, Θ) = exp[Θ·f(y, x)] / Σ_{y'} exp[Θ·f(y', x)],     (1)

where Θ is a parameter vector.
Typically, computing Σ_{y'} exp[Θ·f(y', x)] could be computationally intractable:
it is too large to explicitly sum over all possible label sequences. However, if
the dependencies between labels have a linear-chain structure, this summation
can be computed using dynamic programming techniques [10]. To make the
dynamic programming techniques applicable, the dependencies of labels must be
chosen to obey the Markov property. More precisely, we use the Forward-Backward
algorithm to compute the summation in a dynamic programming style. This
has a computational complexity of O(NK^M), where N is the length of the sequence,
K is the size of the label set, and M is the Markov order used by the local features.
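As a concrete illustration of this dynamic program, the following is a minimal sketch (not the authors' implementation) of the forward pass for a first-order linear-chain CRF, which computes the log of the normalizer Σ_y exp[Θ·f(y, x)] in O(NK²) time; node_score(i, y) and edge_score(y_prev, y) are hypothetical callbacks returning the parts of Θ·f contributed by the node and edge features.

import math

def log_sum_exp(vals):
    """Numerically stable log(sum(exp(v) for v in vals))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def log_partition(node_score, edge_score, N, K):
    """Forward pass for a first-order linear-chain CRF.

    node_score(i, y): part of Theta . f contributed by the node features of
    label y at position i (hypothetical callback).
    edge_score(y_prev, y): part contributed by the transition y_prev -> y.
    Returns log of the sum over all K**N label sequences of exp(total score),
    in O(N * K**2) time instead of enumerating the sequences.
    """
    # alpha[y] = log-sum of scores of all label prefixes ending in label y
    alpha = [node_score(0, y) for y in range(K)]
    for i in range(1, N):
        alpha = [node_score(i, y) +
                 log_sum_exp([alpha[yp] + edge_score(yp, y) for yp in range(K)])
                 for y in range(K)]
    return log_sum_exp(alpha)

The backward pass needed for feature expectations during training is analogous.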
Given a training set consisting of n labeled sequences, (xi , yi ), for i = 1 . . . n,
parameter estimation is performed by maximizing the objective function,

L(Θ) = Σ_{i=1}^{n} log P(y_i|x_i, Θ) − R(Θ).     (2)

The first term of this equation represents a conditional log-likelihood of a training


data. The second term is a regularizer for reducing overfitting. In what follows, we
denote the conditional log-likelihood of each sample log P (yi |xi , Θ) as Ls (i, Θ),
and therefore:
L(Θ) = Σ_{i=1}^{n} L_s(i, Θ) − R(Θ).     (3)

3.1 Stochastic Gradient Descent


The SGD uses a small randomly-selected subset of the training samples to ap-
proximate the gradient of the objective function given by Equation 3. The num-
ber of training samples used for this approximation is called the batch size. By
using a smaller batch size, one can update the parameters more frequently and
speed up the convergence. The extreme case is a batch size of 1, and it gives the
maximum frequency of updates, which we adopt in this work. Then, the model
parameters are updated in such a way:

Θ_{k+1} = Θ_k + γ_k (∂/∂Θ)[L_s(i, Θ) − R(Θ)],
where k is the update counter and γk is the learning rate. A proper learning rate
can guarantee the convergence of the SGD method [6,7]. A typical convergent
choice of learning rate can be found in Collins et al. [13]:
γ_k = γ_0 / (1 + k/n),
where γ0 is a constant. This scheduling guarantees ultimate convergence [6,7].
In this paper we adopt this learning rate schedule for the SGD.
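For illustration only (this is not the authors' code), one SGD pass with batch size 1 and this decaying learning rate might look as follows; grad_sample and grad_regularizer are hypothetical callbacks returning the gradients of L_s(i, Θ) and R(Θ) as plain Python lists.

import random

def sgd_epoch(theta, n, grad_sample, grad_regularizer, k, gamma0=1.0):
    """One SGD pass with batch size 1 and learning rate
    gamma_k = gamma0 / (1 + k/n), where k is the global update counter
    (a sketch under the assumptions stated above)."""
    for _ in range(n):
        i = random.randrange(n)            # pick one training sample at random
        gamma = gamma0 / (1.0 + k / n)     # decaying learning rate
        g_l = grad_sample(theta, i)        # gradient of L_s(i, theta)
        g_r = grad_regularizer(theta)      # gradient of R(theta)
        theta = [t + gamma * (gl - gr) for t, gl, gr in zip(theta, g_l, g_r)]
        k += 1
    return theta, k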

Notes: m is the number of periods when the ASF reaches convergence;
       b is the current period; c is the current iteration counter;
       n is the number of training samples.
       The learning rate γ ← γ_0/(1 + b/Z) is only for theoretical analysis. In practice we
       can simply set γ ← 1, i.e., remove the learning rate.

Procedure ASF-train
    Initialize Θ with random values
    c ← 0
    for b ← 1 to m
        γ ← γ_0/(1 + b/Z) with Z ≫ n, or simply γ ← 1
        for 1 to b
            Θ ← SGD-update(Θ)
        c ← c + b
        Θ ← Θ̄^{iter(c)} as in Eq. 4
    Return Θ

Procedure SGD-update(Θ)
    for 1 to n
        select a sample j randomly
        Θ ← Θ + γ (∂/∂Θ) L_s(j, Θ)
    Return Θ

Fig. 2. The major steps of the ASF training

4 Averaged SGD with Feedback


Averaged SGD with feedback (ASF) is a modification and extension of the tra-
ditional SGD training method [8]. The naive version of averaged SGD is inspired
by the averaged perceptron technique [14]. Let Θ^{iter(c),sample(d)} be the parameters
after the d'th training example has been processed in the c'th iteration over
the training data. We define the averaged parameters at the end of iteration c' as

Θ̄^{iter(c')} = ( Σ_{c=1...c', d=1...n} Θ^{iter(c),sample(d)} ) / (n c').     (4)
However, a straightforward application of parameter averaging is not ade-
quate. A potential problem of traditional parameter averaging is that the model
parameters Θ receive no information from the averaged parameters: the model
parameters Θ are trained exactly the same as before (SGD without averaging).
Θ could be misleading as the training goes on. To solve this problem, a natural
idea is to reset Θ by using the averaged parameters, which are more reliable.
The ASF refines the averaged SGD by applying a “periodic feedback”.
The ASF periodically resets the parameters Θ by using the averaged param-
eters Θ̄. The interval between a feedback operation and its previous operation is
called a training period or simply a period. It is important to decide when to do

the feedback, i.e., the length of each period should be adjusted reasonably as the
training goes on. For example, at the early stage of the training, the Θ is highly
noisy, so that the feedback operation to Θ should be performed more frequently.
As the training goes on, less frequent feedback operation would be better in order
to adequately optimize the parameters. In practice, the ASF adopts a schedule
of linearly slowing-down feedback : the number of iterations increases linearly in
each period, as the training goes on.
Figure 2 shows the steps of the ASF. We denote Θ^{b,c,d} as the model parameters
after the d'th sample is processed in the c'th iteration of the b'th period. Without
making any difference, we denote Θ^{b,c,d} more simply as Θ^{b,cn+d}, where n is
the number of samples in the training data. Similarly, we use g^{b,cn+d} to denote
(∂/∂Θ) L_s(d, Θ) in the c'th iteration of the b'th period. Let γ^{(b)} be the learning rate
in the b'th period. Let Θ̄^{(b)} be the averaged parameters produced by the b'th
period. We can induce the explicit form of Θ̄^{(1)}:

Θ̄^{(1)} = Θ^{1,0} + γ^{(1)} Σ_{d=1...n} ((n − d + 1)/n) g^{1,d}.     (5)

When the 2nd period ends, the parameters are again averaged over all previous
model parameters, Θ^{1,0}, ..., Θ^{1,n}, Θ^{2,0}, ..., Θ^{2,2n}, and it can be expressed as:

Θ̄^{(2)} = Θ^{1,0} + γ^{(1)} Σ_{d=1...n} ((n − d + 1)/n) g^{1,d}
                 + γ^{(2)} Σ_{d=1...2n} ((2n − d + 1)/(3n)) g^{2,d}.     (6)

Similarly, the averaged parameters produced by the b’th period can be expressed
as follows:
Θ̄^{(b)} = Θ^{1,0} + Σ_{i=1...b} Σ_{d=1...in} γ^{(i)} ((in − d + 1)/(n i(i + 1)/2)) g^{i,d}.     (7)

The best possible convergence result for stochastic learning is the “almost
sure convergence”: to prove that the stochastic algorithm converges towards
the solution with probability 1 [6]. The ASF guarantees to achieve almost sure
convergence [8]. The averaged parameters produced at the end of each period of
the optimization procedure of the ASF training are “almost surely convergent”
towards the optimum Θ* [8]. On the implementation side, there is no need to
keep all the past gradients for computing the averaged parameters Θ̄: we
can compute Θ̄ on the fly, just like in the averaged perceptron case.
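The following is a rough Python sketch of the ASF loop of Figure 2 under these assumptions; grad(theta, x) is a hypothetical callback returning the per-sample gradient of L_s, the learning rate is fixed to γ = 1 as suggested in the figure, and the running average of the parameter vectors is maintained incrementally instead of storing past gradients. It is a sketch of the idea, not the authors' implementation.

import random

def asf_train(theta, data, grad, m, gamma=1.0):
    """Sketch of the ASF loop of Fig. 2 (not the authors' implementation).
    grad(theta, x) is a hypothetical callback returning the per-sample
    gradient of L_s as a list.  The average of all parameter vectors produced
    so far is maintained on the fly; at the end of each period, theta is reset
    to this average (the feedback step)."""
    avg = list(theta)                        # running average of parameter vectors
    count = 1                                # how many vectors the average covers
    for b in range(1, m + 1):                # period b ...
        for _ in range(b):                   # ... contains b passes over the data
            for _ in range(len(data)):
                x = random.choice(data)
                g = grad(theta, x)
                theta = [t + gamma * gi for t, gi in zip(theta, g)]
                count += 1                   # fold the new vector into the average
                avg = [a + (t - a) / count for a, t in zip(avg, theta)]
        theta = list(avg)                    # feedback: reset theta to the average
    return theta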

5 Experiments and Discussion


We use one month of data from the ALKAN dataset [15] for experiments. This is a
new dataset, and it contains 2,061 sessions with a total of 3,899,155 samples

Table 1. Features used in the action recognition task. For simplicity, we only describe
the features on the x-axis; the features on the y-axis and z-axis follow the same setting
as the x-axis. A × B means the Cartesian product of the set A and the set B.
The time interval feature does not record the absolute time from the beginning to the
current window; it only records the time difference between two neighboring
windows (sometimes there is a jump of time between two neighboring windows).

Signal strength features:
  {s_{i−2}, s_{i−1}, s_i, s_{i+1}, s_{i+2}, s_{i−1}s_i, s_is_{i+1}} × {y_i, y_{i−1}y_i}
Time interval features:
  {t_{i+1} − t_i, t_i − t_{i−1}} × {y_i, y_{i−1}y_i}
Mean, standard deviation, energy, covariance features:
  m_i × {y_i, y_{i−1}y_i}
  d_i × {y_i, y_{i−1}y_i}
  e_i × {y_i, y_{i−1}y_i}
  {c_{x,y,i}, c_{y,z,i}, c_{x,z,i}} × {y_i, y_{i−1}y_i}

(in a temporal sequence). The data was collected by iPod accelerometers with a
sampling frequency of 20 Hz. A sample contains 4 values: {time (the seconds passed
from the beginning of a session), x-axis acceleration, y-axis acceleration, z-axis
acceleration}, for example, {539.266(s), 0.091(g), -0.145(g), -1.051(g)}¹. There
are six kinds of action labels: act-0 means “walking or running”, act-1 means
“on an elevator or escalator”, act-2 means “taking car or bus”, act-3 means
“taking train”, act-4 means “up or down stairs”, and act-5 means “standing
or sitting”.

5.1 How to Design and Implement Good Features

We split the data into a training data (85%), a development data for hyper-
parameters (5%), and the final evaluation data (10%). The evaluation metric
is sample-accuracy (%) (equal to recall in this task: the number of correctly
predicted samples divided by the total number of samples). Following previous
work on action recognition [1,2,3,4], we use acceleration features, mean features,
standard deviation, energy, and correlation (covariance between different axis)
features. Features are extracted from the iPod accelerometer data by using a
window size of 256. Each window is about 13 seconds long. For two consecutive
windows (each one contains 256 samples), they have 128 samples overlapping to
each other. Feature extraction on windows with 50% of the window overlapping
was shown to be effective in previous work [1]. The features are listed in Table 1.
All features are used without pruning. We use exactly the same feature set for
all systems.
¹ In the example, ‘g’ is the acceleration of gravity.

The mean feature is simply the averaged signal strength in a window:

m_i = ( Σ_{k=1}^{|w|} s_k ) / |w|,

where s_1, s_2, ... are the signal magnitudes in a window. The energy feature is
defined as follows:

e_i = ( Σ_{k=1}^{|w|} s_k² ) / |w|.

The deviation feature is defined as follows:

d_i = sqrt( Σ_{k=1}^{|w|} (s_k − m_i)² / |w| ),

where m_i is the mean value defined before. The correlation feature is defined
as follows:

c_i(x, y) = covariance_i(x, y) / ( d_i(x) d_i(y) ),

where d_i(x) and d_i(y) are the deviation values on the i'th window of the
x-axis and the y-axis, respectively. The covariance_i(x, y) is the covariance value
between the i'th windows of the x-axis and the y-axis. In the same way, we can
define c_i(y, z) and c_i(x, z).
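As a sketch of how these window statistics might be computed (assuming 256-sample windows with 50% overlap and the usual square-root form of the standard deviation), consider the following; this is an illustration, not the authors' feature extractor.

import math

def window_features(x, y, z, win=256, step=128):
    """Per-window mean, energy, standard deviation and inter-axis correlation
    for 256-sample windows with 50% overlap (step = win/2); x, y, z are
    equal-length lists of per-axis acceleration values."""
    def stats(sig):
        m = sum(sig) / len(sig)                                    # mean m_i
        e = sum(v * v for v in sig) / len(sig)                     # energy e_i
        d = math.sqrt(sum((v - m) ** 2 for v in sig) / len(sig))   # deviation d_i
        return m, e, d
    def corr(a, b, ma, mb, da, db):
        cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)
        return cov / (da * db) if da > 0 and db > 0 else 0.0       # correlation c_i
    feats = []
    for start in range(0, len(x) - win + 1, step):
        wx, wy, wz = x[start:start + win], y[start:start + win], z[start:start + win]
        (mx, ex, dx), (my, ey, dy), (mz, ez, dz) = stats(wx), stats(wy), stats(wz)
        feats.append({
            'mean': (mx, my, mz), 'energy': (ex, ey, ez), 'stdev': (dx, dy, dz),
            'corr': (corr(wx, wy, mx, my, dx, dy),
                     corr(wy, wz, my, mz, dy, dz),
                     corr(wx, wz, mx, mz, dx, dz)),
        })
    return feats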
A naive implementation of the proposed features is to design several real-valued
feature templates representing the mean, standard deviation, energy, correlation,
and so on. However, in preliminary experiments, we found that the model accuracy
is low with such a straightforward implementation of real-valued features. A possible
reason is that different values of a real-valued feature (e.g., the standard deviation)
may carry different indications about the action, and this difference of indications
cannot be captured directly by comparing the raw values. The easiest way to deal with
this problem is to split an original real-valued feature into multiple features (which
can still be real-valued features). In our case, the feature template function automatically
splits the original real-valued features into multiple real-valued features by
using a heuristic splitting interval of 0.1. For example, the standard deviations
of 0.21 and 0.31 correspond to two different feature IDs, and therefore
to two different model parameters. The standard deviations of 0.21 and 0.29
correspond to an identical feature ID and differ only in their feature values.
In our experiment, we found that splitting the real-valued features improves the accuracy
(by more than 1%).
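A minimal sketch of this splitting heuristic is given below; the feature-ID format and the function name are hypothetical, but the 0.1-wide binning follows the description above.

def split_real_feature(name, value, interval=0.1):
    """Map a real-valued feature to a (feature_id, value) pair whose id encodes
    the 0.1-wide bin of the value (hypothetical id format); e.g. a standard
    deviation of 0.21 and one of 0.29 share an id, while 0.31 gets another."""
    bin_index = int(value // interval)
    return "%s:bin%d" % (name, bin_index), value

# split_real_feature("stdev_x", 0.21) -> ("stdev_x:bin2", 0.21)
# split_real_feature("stdev_x", 0.29) -> ("stdev_x:bin2", 0.29)
# split_real_feature("stdev_x", 0.31) -> ("stdev_x:bin3", 0.31)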
It is important to describe the implementation of edge features, which are
based on the label transitions y_{i−1}y_i. In traditional implementations of CRF
systems (e.g., the HCRF package), the edge features usually contain only the
information of y_{i−1} and y_i, without the information of the observation sequence
(i.e., x). The major reason for this simple implementation of edge features is
to reduce the dimension of the feature space; otherwise, there can be an explosion of
edge features in some tasks. For our action recognition task, since the feature

Table 2. Comparisons among methods on the sensor-based action recognition task.
The number of iterations is decided when a training method reaches its empirical
convergence state. The deviation is the standard deviation of the accuracy over
four repeated experiments.

Methods          Accuracy  Iterations  Deviation  Training time
ASF              58.97%    60          0.56%      0.6 hour
Averaged SGD     57.95%    50          0.28%      0.5 hour
SGD              55.20%    130         0.69%      1.3 hours
LBFGS (batch)    57.85%    800         0.74%      8 hours

dimension is quite small, we can combine observation information of x with label
transitions y_{i−1}y_i, and therefore make “rich edge features”. We simply used the
same observation templates of the node features for building rich edge features (see
Table 1). We found that the rich edge features significantly improve the prediction
accuracy of CRFs.
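The layout of such rich edge features can be sketched as follows; the template and label names are hypothetical, and the point is only that each observation value is paired both with y_i (node feature) and with the bigram y_{i−1}y_i (rich edge feature).

def node_and_edge_features(obs_values, y_prev, y_cur):
    """Pair each observation template value with the current label (node
    features) and with the label transition y_prev -> y_cur (rich edge
    features).  obs_values is assumed to be a dict mapping a template name
    to its (possibly binned) value for the current window."""
    feats = {}
    for name, value in obs_values.items():
        feats["%s|y=%s" % (name, y_cur)] = value              # node feature
        feats["%s|yy=%s>%s" % (name, y_prev, y_cur)] = value  # rich edge feature
    return feats

# e.g. node_and_edge_features({"stdev_x:bin2": 0.21}, "act-5", "act-0")
#   -> {"stdev_x:bin2|y=act-0": 0.21, "stdev_x:bin2|yy=act-5>act-0": 0.21}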

5.2 Experimental Setting


Three baselines are adopted to make a comparison with the ASF method, in-
cluding the traditional SGD training (SGD), the SGD training with parameter-
averaging but without feedback (averaged SGD), and the popular batch training
method, limited memory BFGS (LBFGS).
For the LBFGS batch training method, which is considered to be one of the
best optimizers for log-linear models like CRFs, we use the OWLQN open source
package [16]². The hyper-parameters for learning were left unchanged from the
default settings of the software: the convergence tolerance was 1e-4; and the
LBFGS memory parameter was 10.
To reduce overfitting, we employed an L2 prior R(Θ) = ||Θ||²/(2σ²) for both SGD
and LBFGS, by setting σ = 5. For the ASF and the averaged SGD, we did not
employ regularization priors, assuming that the ASF and the averaged SGD con-
tain implicit regularization by performing parameter averaging. For the stochas-
tic training methods, we set γ_0 to 1.0. We will also test the speed of the
various methods. The experiments are run on an Intel Xeon 3.0 GHz CPU, and
the time for feature generation and data input/output is excluded from the
training cost.

5.3 Results and Discussion


The experimental results are listed in Table 2, and the more detailed results of the
respective action categories are listed in Table 3. Since recognizing actions from
real-life continuous signals requires action identification and boundary
disambiguation at the same time, it is expected to be much more difficult than
² Available online at: https://www.cs.washington.edu/homes/galen/ or
http://research.microsoft.com/en-us/um/people/jfgao/

Table 3. Comparisons among methods on different action labels

action types          walk/run  on elevat.  car/bus  train  stairs  stand/sit  overall
# samples             2,246     37          3,848    2,264  221     5,275      13,891
Acc.(%) ASF           62.09     0           77.07    25.43  0       62.31      58.97
Acc.(%) Averaged SGD  60.84     0           76.35    25.88  0       61.36      57.95
Acc.(%) SGD           62.40     0           73.75    15.31  1.50    58.08      55.20
Acc.(%) LBFGS         51.66     0           76.41    22.98  1.02    66.57      57.85

the simple action identification of previous work. An additional difficulty is that


the data is quite noisy. The number of iterations is decided when a training
method reaches its empirical convergence state³.
Note that the ASF training achieved better sample-accuracy than the other
online training methods. The ASF method is relatively stable among different iter-
ations as the training goes on, while the SGD training fluctuates severely
as the training proceeds. The averaged SGD training reached its empirical con-
vergence state faster than the ASF training. The ASF training converged much
faster than the SGD training. All of the online training methods converged faster
than the batch training method, LBFGS.
In Figure 3, we show the curves of sample-accuracies on varying the number of
training iterations of the ASF, the averaged SGD, and the traditional SGD. As
can be seen, the ASF training is much more stable/robust than the SGD training.
The fluctuation of the SGD is quite severe, probably due to the noisy data of
the action recognition task. The robustness of the ASF method relates to the
stable nature of the averaging technique with feedback. The ASF outperformed
the averaged SGD, which indicates that the feedback technique is helpful to
the naive parameter averaging. The ASF also outperformed the LBFGS batch
training with far fewer iterations (and therefore much faster training
speed), which is surprising.

5.4 A Challenge in Real-Life Action Recognition: Axis Rotation

One of the tough problems in this action recognition task is the rotation of the x-
axis, y-axis, and z-axis in the collected data. Since different people attached the
iPod accelerometer in different orientations, the x-axis, y-axis, and z-axis risk
being inconsistent across the collected data. Take an extreme case: while the
x-axis may represent a horizontal direction for one instance, the same x-axis may
represent a vertical direction for another instance. As a result, the acceleration
signals of the same axis may be mutually inconsistent. We suppose this is an
important reason that prevented the experimental results from reaching a higher
level of accuracy. A candidate solution for keeping the consistency is to instruct
people to attach the sensor in a standard orientation when
³ Here, the empirical convergence state means an empirical evaluation of the convergence.

Fig. 3. Curves of accuracies (%) of the different stochastic training methods (ASF, averaged SGD, SGD) by varying the number of iterations

collecting the data. However, this method will make the collected data not “nat-
ural” or “representative”, because usually people put the accelerometer sensor
(e.g., in iPod or iPhone) randomly in their pocket in daily life.

6 Conclusions and Future Work

In this paper, we studied automatic non-boundary action recognition with a


large-scale data set collected in real-life activities. Different from traditional
simple classification approaches to action recognition, we tried to investigate real-
life continuous action recognition, and adopted a sequential labeling approach
by using conditional random fields. To achieve good performance in continuous
action recognition, we presented how to design and implement useful features in
this task.
We also compared different online optimization methods for training condi-
tional random fields in this task. The ASF training method was demonstrated to
be a very robust training method on this task with noisy data, and achieved good
performance. As future work, we plan to deal with the axis rotation problem
through a principled statistical approach.

Acknowledgments
X.S., H.K., and N.U. were supported by the FIRST Program of JSPS. We thank
Hirotaka Hachiya for helpful discussion.

References
1. Bao, L., Intille, S.S.: Activity recognition from user-annotated acceleration data.
In: Ferscha, A., Mattern, F. (eds.) PERVASIVE 2004. LNCS, vol. 3001, pp. 1–17.
Springer, Heidelberg (2004)
2. Pärkkä, J., Ermes, M., Korpipää, P., Mäntyjärvi, J., Peltola, J., Korhonen, I.: Ac-
tivity classification using realistic data from wearable sensors. IEEE Transactions
on Information Technology in Biomedicine 10(1), 119–128 (2006)
3. Ravi, N., Dandekar, N., Mysore, P., Littman, M.L.: Activity recognition from ac-
celerometer data. In: AAAI, pp. 1541–1546 (2005)
4. Huynh, T., Fritz, M., Schiele, B.: Discovery of activity patterns using topic models.
In: Proceedings of the 10th International Conference on Ubiquitous Computing,
pp. 10–19. ACM, New York (2008)
5. Nocedal, J., Wright, S.J.: Numerical optimization. Springer, Heidelberg (1999)
6. Bottou, L.: Online algorithms and stochastic approximations. In: Saad, D. (ed.)
Online Learning and Neural Networks. Cambridge University Press, Cambridge
(1998)
7. Spall, J.C.: Introduction to stochastic search and optimization. Wiley-IEEE (2005)
8. Sun, X., Kashima, H., Matsuzaki, T., Ueda, N.: Averaged stochastic gradient de-
scent with feedback: An accurate, robust and fast training method. In: Proceedings
of the 10th International Conference on Data Mining (ICDM 2010), pp. 1067–1072
(2010)
9. Bottou, L.: Une Approche théorique de l’Apprentissage Connexionniste: Applica-
tions à la Reconnaissance de la Parole. PhD thesis, Université de Paris XI, Orsay,
France (1991)
10. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. In: Proceedings of the 18th
International Conference on Machine Learning (ICML 2001), pp. 282–289 (2001)
11. Daumé III, H.: Practical Structured Learning Techniques for Natural Language
Processing. PhD thesis, University of Southern California (2006)
12. Sun, X.: Efficient Inference and Training for Conditional Latent Variable Models.
PhD thesis, The University of Tokyo (2010)
13. Collins, M., Globerson, A., Koo, T., Carreras, X., Bartlett, P.L.: Exponentiated
gradient algorithms for conditional random fields and max-margin markov net-
works. J. Mach. Learn. Res. (JMLR) 9, 1775–1822 (2008)
14. Collins, M.: Discriminative training methods for hidden markov models: Theory
and experiments with perceptron algorithms. In: Proceedings of EMNLP 2002, pp.
1–8 (2002)
15. Hattori, Y., Takemori, M., Inoue, S., Hirakawa, G., Sudo, O.: Operation and base-
line assessment of large scale activity gathering system by mobile device. In: Pro-
ceedings of DICOMO 2010 (2010)
16. Andrew, G., Gao, J.: Scalable training of L1 -regularized log-linear models. In:
Proceedings of ICML 2007, pp. 33–40 (2007)
Packing Alignment: Alignment for Sequences
of Various Length Events

Atsuyoshi Nakamura and Mineichi Kudo

Hokkaido University, Kita 14 Nishi 9 Kita-ku Sapporo 060-0814, Japan


{atsu,mine}@main.ist.hokudai.ac.jp

Abstract. We study an alignment called a packing alignment that is an


alignment for sequences of various length events like musical notes. One
event in a packing alignment can have a number of consecutive opposing
events unless the total length of them exceeds the length of that one
event. Instead of using a score function that depends on event length,
which was studied by Mongeau and Sankoff [5], packing alignment deals
with event lengths explicitly using a simple score function. This makes
the problem clearer as an optimization problem. Packing alignment can
be calculated efficiently using dynamic programming. As an application
of packing alignment, we conducted experiments on frequent approx-
imate pattern extraction from MIDI files of famous musical variations.
The patterns and occurrences extracted from the variations using packing
alignment have more appropriate boundaries than those using conven-
tional string alignments from the viewpoints of the repetition structure
of the variations.

1 Introduction
Sequence alignment is now one of the most popular tools to compare sequences.
In molecular biology, various types of alignments are used in various kinds of
problems: global alignments of pairs of proteins related by common ancestry
throughout their length, local alignments involving related segments of proteins,
multiple alignments of members of protein families, and alignments made during
database searches to detect homologies [4]. Dynamic time warping (DTW), a
kind of alignments between two time series, are often used in speech recognition
[7] and aligning of audio recordings [1].
Most previous work on alignment has dealt with strings in which each
component (letter) is assumed to have the same length. For the comparison of musical
sequences, there is research on an alignment that considers the length of each
note [5]. In that research, the general alignment framework is adapted to deal with
note sequences by using a score (distance) function between notes that depends
on note length. Their method is very flexible, but it heuristically defines its score
function so as to reflect note length.
In this paper, we study packing alignment that explicitly treats the length
of each component (event) together with a constraint on length. One event in
a packing alignment can have a number of consecutive opposing events unless


the total length of them exceeds the length of that one event. Compared to the
method using a length-dependent score function, our setting reduces flexibility
but makes the problem clearer as an optimization problem. We can show that
an optimal solution of this extended alignment problem for two event sequences
s and t can be obtained in O(p(s, t)n(s)n(t)) time and O(p(s, t)(n(s) + n(t)))
space using dynamic programming, where n(s) and n(t) are the numbers of events
in sequences s and t, respectively, and p(s, t) is the maximum packable number
that is defined as the maximum number of events in s or t which can be opposed
to any one event in the other sequence in packing alignment.
Alignment distance can be shown to be equivalent to edit distance even in
packing alignment if two additional 0-cost edit operations, partition and con-
catenation, are introduced. Alignment of various length events is also possible
indirectly by general string alignment or DTW if all events are partitioned uni-
formly in preprocessing. There are two significant differences between packing
alignment and these conventional alignments. First, one event must not be di-
vided in packing alignment while gaps can be inserted in the middle of one event
divided by uniform partitioning in preprocessed conventional alignment. Second,
an optimal solution in packing alignment can be calculated faster than that in
preprocessed conventional alignment when the number of events increases sig-
nificantly by uniform partitioning. DTW also allows one event to be opposed to
more than one event, but packing alignment is more strict on length of oppos-
ing events and more flexible at the point that it allows gap insertions. Though
alignment becomes flexible by virtue of gap insertion, alignment with long con-
secutive gaps are not desirable for many applications. So, we also developed
gap-constraint version algorithm of packing alignment.
In our experiments, we applied packing alignment to frequent approximate
pattern extraction from a note sequence of a musical piece. We used mining
algorithm EnumSubstrFLOO [6], which heavily uses an alignment algorithm as
a subprocedure. For two MIDI files of Bach’s musical pieces, EnumSubstrFLOO
using packing alignment, which is directly applied to the original sequence, was
more than four times faster than that using DTW and general alignment, which
are applied to the sequence made by uniform partitioning. We also applied Enum-
SubstrFLOO to the melody tracks in MIDI files of three musical variations in
order to check whether themes and variations can be extracted as patterns and
occurrences. 80% of the patterns and occurrences extracted by EnumSubstr-
FLOO with packing alignment were nearly whole themes, nearly whole varia-
tions or whole two consecutive variations while the algorithm using DTW and
general alignment, which were directly applied without uniform partitioning by
ignoring note length, could extract no such appropriate ranges except a few.

2 Packing Aliment of Event Sequences

Let Σ denote a finite set of event types. The gap ‘-’ is a special event type
that does not belong to Σ. Assume existence of a real-valued score function w
on (Σ ∪ {-}) × (Σ ∪ {-}), which measures similarity between two event types.

Fig. 1. Parts of the score of 12 Variations on “Ah Vous Dirai-je Maman (Twinkle Twinkle Little Star)” K.265: measures 2-5 (Theme), measures 50-53 (Variation 1), and measures 245-248 (Variation 5)

An event (a, l) ∈ Σ × R⁺ is a pair of type a and length l, where R⁺ denotes the
set of positive real numbers. For an event b = (a, l), ⟨b⟩ and |b| are used instead of
a and l, respectively. An event sequence s is a sequence s[1]s[2]···s[n(s)] whose
component s[i] is an event for all i = 1, 2, ..., n(s), where n(s) is the number of
events in s. When gap events are also allowed to be components of a sequence, we
call such an event sequence a gaped event sequence. The length of an event sequence
s is defined as Σ_{i=1}^{n(s)} |s[i]|. The range r(s, j) of the jth event of an event sequence s
is [ Σ_{i=1}^{j−1} |s[i]|, Σ_{i=1}^{j} |s[i]| ).
Example 1. The melody in measures 2-5 of Twelve Variations on “Twinkle Twinkle
Little Star” shown in Figure 1 is represented as

(C5, 1/4)(C5, 1/4)(G5, 1/4)(G5, 1/4)(A5, 1/4)(A5, 1/4)(G5, 1/4)(G5, 1/4)

using event sequence representation, where scientific pitch notation is used as
event type notation. Let s denote this event sequence and let s[i] denote the ith
event in s. Then, the length of s is 8/4, n(s) = 8, s[3] = (G5, 1/4), ⟨s[3]⟩ = G5,
|s[3]| = 1/4 and r(s, 3) = [2/4, 3/4).
The melody in measures 245-248 is represented as

(C5, 1/4)(R, 1/8)(C5, 1/8)(G5, 1/4)(R, 1/8)(G5, 1/8)(A5, 1/4)(R, 1/8)(A5, 1/8)(G5, 1/4)(R, 1/8)(G5, 1/8)

using event sequence representation, where event type ‘R’ denotes a rest.
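Under the representation above, the two melodies of Example 1 can be written down directly, for instance as lists of (type, length) pairs with exact fractional lengths; this is only an illustrative encoding, not code from the paper.

from fractions import Fraction

# Event sequences as lists of (event type, length) pairs, with exact lengths.
Q, E = Fraction(1, 4), Fraction(1, 8)          # quarter- and eighth-note lengths
s = [('C5', Q), ('C5', Q), ('G5', Q), ('G5', Q),
     ('A5', Q), ('A5', Q), ('G5', Q), ('G5', Q)]                       # measures 2-5
u = [('C5', Q), ('R', E), ('C5', E), ('G5', Q), ('R', E), ('G5', E),
     ('A5', Q), ('R', E), ('A5', E), ('G5', Q), ('R', E), ('G5', E)]   # measures 245-248

def total_length(seq):
    """Length of an event sequence: the sum of its event lengths."""
    return sum(l for _, l in seq)

assert total_length(s) == Fraction(8, 4) and total_length(u) == Fraction(8, 4)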
A gap insertion into an event sequence s is an operation that inserts (-, l) right
before or after s[i] for some i ∈ {1, 2, ..., n(s)} and l ∈ R⁺. We define a packing
alignment of two event sequences as follows.
Definition 1. A packing alignment of two event sequences s and t is a pair
(s′, t′) that satisfies the following conditions.
1. s′ and t′ are gaped event sequences with the same length that are made from
s and t, respectively, by repeated gap insertions.
2. For all (j, k) ∈ {1, 2, ..., n(s′)} × {1, 2, ..., n(t′)},
r(s′, j) ⊆ r(t′, k) or r(s′, j) ⊇ r(t′, k) holds if r(s′, j) ∩ r(t′, k) ≠ ∅.
3. For all (j, k) ∈ {1, 2, ..., n(s′)} × {1, 2, ..., n(t′)},
r(s′, j) ∩ r(t′, k) = ∅ if ⟨s′[j]⟩ = ⟨t′[k]⟩ = -.

Fig. 2. Examples of packing alignments, with the length of each event drawn as the length of a bar: (s, t) and (s′, t′) are packing alignments of s and t, but (s′′, t′′) is NOT

Example 2 (Continued from Example 1). Let s and t denote event sequence representations
for the melodies in measures 2-5 and measures 50-53, respectively.
By representing the length of each event as the length of a bar, s and t can be
illustrated using the diagram shown in Figure 2. In the figure, pair (s′′, t′′) is NOT
a packing alignment of s and t because r(s′′, 2) ∩ r(t′′, 1) ≠ ∅ but neither one of
them is contained in the other, which violates condition 2 of Definition 1.
Pairs (s, t) and (s′, t′) are packing alignments of s and t.

For event sequences s and t, let A(s, t) denote the set of packing alignments of
s and t.
Score S(s′, t′) between s and t for a packing alignment (s′, t′) is defined as

S(s′, t′) = Σ_{(j,k): r(s′,j) ⊆ r(t′,k)} |s′[j]| w(⟨s′[j]⟩, ⟨t′[k]⟩) + Σ_{(j,k): r(s′,j) ⊃ r(t′,k)} |t′[k]| w(⟨s′[j]⟩, ⟨t′[k]⟩).

Definition 2. The packing alignment score S∗(s, t) between event sequences s and t
is the maximum score S(s′, t′) among those for all packing alignments (s′, t′) ∈
A(s, t), namely,

S∗(s, t) = max_{(s′,t′) ∈ A(s,t)} S(s′, t′).

Problem 1. For given event sequences s and t, calculate the packing alignment
score between s and t (and the alignment (s′, t′) that achieves the score).

Example 3 (Continued from Example 2). Define w(a, b) as 1 if a = b and −1
otherwise. Then, S(s, t) = −1/2 and S(s′, t′) = −7/16. In this case, alignment
(s′, t′) is one of the optimal packing alignments for s and t. Let u denote an
event sequence representation for the melody in measures 245-248. Then, the
unique optimal packing alignment for s and u is alignment (s, u) and S(s, u) = 1.

Let s[i..j] denote s[i]s[i + 1] · · · s[j]. The following proposition holds. The proof
is omitted due to space limitations.

Proposition 1.
S∗(s[1..m], t[1..n]) = Σ_{i=1}^{m} |s[i]| w(⟨s[i]⟩, -)   if n = 0,
S∗(s[1..m], t[1..n]) = Σ_{i=1}^{n} |t[i]| w(-, ⟨t[i]⟩)   if m = 0,
and otherwise S∗(s[1..m], t[1..n]) is the maximum of the following two families of values:

S∗(s[1..m−1], t[1..n_0]) + Σ_{i=n_0+1}^{n} |t[i]| w(⟨s[m]⟩, ⟨t[i]⟩) + ( |s[m]| − Σ_{i=n_0+1}^{n} |t[i]| ) w(⟨s[m]⟩, -)
    for all n_0 ≤ n with Σ_{i=n_0+1}^{n} |t[i]| ≤ |s[m]|, and

S∗(s[1..m_0], t[1..n−1]) + Σ_{j=m_0+1}^{m} |s[j]| w(⟨s[j]⟩, ⟨t[n]⟩) + ( |t[n]| − Σ_{j=m_0+1}^{m} |s[j]| ) w(-, ⟨t[n]⟩)
    for all m_0 ≤ m with Σ_{j=m_0+1}^{m} |s[j]| ≤ |t[n]|.

Remark 1. Mongeau and Sankoff [5] have already proposed a method using the
recurrence equation with the same search space constrained by the lengths of
s[m] and t[n]. They heuristically introduced the constraint for efficiency while
the constraint is necessary to solve the packing alignment problem.
Let s and t be event sequences with l_s = max_{1≤i≤n(s)} |s[i]| and l_t = max_{1≤i≤n(t)} |t[i]|.
The maximum packable number p(s, t) is the maximum of the following two numbers:
(1) the maximum number of events s[i], s[i+1], ..., s[j] with Σ_{k=i}^{j} |s[k]| ≤ l_t, and
(2) the maximum number of events t[i], t[i+1], ..., t[j] with Σ_{k=i}^{j} |t[k]| ≤ l_s.
Proposition 2. The optimal packing alignment problem for event sequences s
and t can be solved in O(p(s, t)n(s)n(t)) time and O(p(s, t)(n(s) + n(t))) space.
Proof. Dynamic programming using an n(s)×n(t) table can achieve the bounds.
Entry (i, j) of the table is filled by S∗ (s[1..i], t[1..j]) in the dynamic programming.
By Proposition 1, this is done using at most p(s, t)+2 entry values that have been
already calculated so far. Thus, totally, O(p(s, t)n(s)n(t)) time and O(n(s)n(t))
space are enough. The space complexity can be reduced to O(p(s, t)(n(s)+n(t)))
using a technique of a linear space algorithm proposed in [3].
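A minimal Python sketch of this dynamic program (quadratic space, i.e., without the linear-space refinement of [3]) is given below, assuming events are represented as (type, length) pairs as in the sketch after Example 1 and that w is a type-level score function with '-' standing for the gap; it follows the recurrence of Proposition 1 but is not the authors' implementation.

def packing_alignment_score(s, t, w, GAP='-'):
    """Sketch of the dynamic program behind Propositions 1 and 2.
    s, t: event sequences as lists of (type, length); w(a, b): score between
    event types, with GAP standing for '-'.  Uses the full n(s) x n(t) table.
    S[i][j] holds S*(s[1..i], t[1..j])."""
    m, n = len(s), len(t)
    S = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):                 # s[1..i] aligned against gaps only
        S[i][0] = S[i - 1][0] + s[i - 1][1] * w(s[i - 1][0], GAP)
    for j in range(1, n + 1):                 # t[1..j] aligned against gaps only
        S[0][j] = S[0][j - 1] + t[j - 1][1] * w(GAP, t[j - 1][0])
    for i in range(1, m + 1):
        a, la = s[i - 1]
        for j in range(1, n + 1):
            b, lb = t[j - 1]
            best = float('-inf')
            # Case 1: s[i] opposes t[j0+1..j] (their total length fits in |s[i]|);
            # any remaining part of s[i] is opposed to a gap.
            total, packed, j0 = 0, 0.0, j
            while True:
                best = max(best, S[i - 1][j0] + packed + (la - total) * w(a, GAP))
                if j0 == 0 or total + t[j0 - 1][1] > la:
                    break
                total += t[j0 - 1][1]
                packed += t[j0 - 1][1] * w(a, t[j0 - 1][0])
                j0 -= 1
            # Case 2: t[j] opposes s[i0+1..i] (their total length fits in |t[j]|).
            total, packed, i0 = 0, 0.0, i
            while True:
                best = max(best, S[i0][j - 1] + packed + (lb - total) * w(GAP, b))
                if i0 == 0 or total + s[i0 - 1][1] > lb:
                    break
                total += s[i0 - 1][1]
                packed += s[i0 - 1][1] * w(s[i0 - 1][0], b)
                i0 -= 1
            S[i][j] = best
    return S[m][n]

# With w(a, b) = 1 if a == b else -1 and the sequences s and u from the sketch
# after Example 1, this returns 1, matching Example 3.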

3 Properties of Packing Alignment


3.1 Relation to Edit Distance
By using a non-negative score function w satisfying w(a, b) = 0 ⇔ a = b and
defining d(s, t) = min_{(s′,t′) ∈ A(s,t)} S(s′, t′), we can obtain the packing alignment distance

d. (Weighted) alignment distance is known to be equivalent to (weighted) edit


distance in general. In fact, packing alignment distance can be seen as a spe-
cial one of more general edit distance proposed in [5]. The specialty of packing
alignment allows us to simplify its corresponding edit operations.
(Weighted) edit distance between two event sequences s and t is the minimum
cost needed for s to be transformed into t using five edit operations: insertion,
deletion, substitution and partition of an event and concatenation of more than
one event of the same type. The last two operations are newly introduced 0-cost
operations for dealing with event sequences. A partition of an event b is to replace
it with an event sequence b_1 b_2 ··· b_l composed of l (> 1) events satisfying ⟨b_i⟩ = ⟨b⟩
for all i = 1, 2, ..., l and |b_1| + |b_2| + ··· + |b_l| = |b|. A concatenation of more than
one consecutive event of the same type is the operation in the opposite direction.
Define the cost of insertion, deletion and substitution as follows: |b| w(-, ⟨b⟩)
when b is inserted, |b| w(⟨b⟩, -) when b is deleted, and |b| w(⟨b⟩, ⟨b′⟩) when b is
substituted by b′ with |b′| = |b|. Note that substitution is only allowed for the event
type, and the event length cannot be changed. Then, the alignment distance between
s and t is equal to the edit distance between them, because an alignment (s′, t′) of
s and t has a one-to-one correspondence to a set of edit operations transforming s
to t with the total cost of S(s′, t′). Note that any event created by a partition must
not be involved in a concatenation, and any event created by a concatenation vice
versa. A partition operation that divides s[i] into l events b_1, b_2, ..., b_l corresponds
to s[i] having an opposing event subsequence composed of l consecutive events
whose jth event has length |b_j| in alignment (s′, t′). A substitution operation that
substitutes b by b′ corresponds to (a part of) some s[i] of type ⟨b⟩ whose opposing
event is b′ in alignment (s′, t′). A deletion operation that deletes b corresponds
to (a part of) some s[i] of type ⟨b⟩ whose opposing event is a gap event in
alignment (s′, t′). An insertion operation that inserts b corresponds to a gap event
whose opposing event sequence contains b in alignment (s′, t′). A concatenation
operation that concatenates l consecutive events b_1, b_2, ..., b_l of the same type
⟨b_1⟩ corresponds to some event subsequence s′[i+1..i+l] with |s′[i+j]| = |b_j|
(j = 1, 2, ..., l) whose opposing event is (⟨b_1⟩, Σ_{j=1}^{l} |b_j|) in alignment (s′, t′).

Remark 2. Packing alignment is stricter on length than the edit distance defined
by Mongeau and Sankoff [5]. Operations called fragmentation and consolidation
introduced by them correspond to the partition and concatenation, respectively.
One event can be replaced with any consecutive events in fragmentation and vice
versa in consolidation regardless of event type and length while the total length
and event types are kept in the partition and concatenation. Besides, their sub-
stitution can be allowed with any event of any length. Partition, concatenation
and our substitution can be seen as special fragmentation, consolidation and
substitution they defined, and each of these their operations can be realized by
a series of our operations. Thus, our operations are more basic ones, and their
cost can be more easily and naturally determined using score function w on
Σ ∪ {-} compared to their operations.

3.2 Comparison with General String Alignment


Event sequences can be seen as mere strings if all events are partitioned uniformly,
namely, partitioned into events with the same length (after quantization if necessary).
Then, general string alignment can be applied to them. What is the
difference between this method and packing alignment?
First, the alignment score (multiplied by the unit length) obtained by this string
alignment is no smaller, and possibly larger, than that obtained by packing alignment.
This is because, for every packing alignment (s′, t′), the pair composed of uniformly
partitioned s′ and t′ can be regarded as a string alignment with the same
score. Furthermore, one event before preprocessing of uniform partitioning can
be divided into noncontiguous events in some string alignment. For example, let
s = (C, 1)(D, 1)(C, 1) and t = (C, 2). Then, (C, 1)(D, 1)(C, 1) and (C, 1)(-, 1)(C, 1)
is an alignment of (C, 1)(D, 1)(C, 1) and (C, 1)(C, 1) but not a packing alignment
of s and t. As a result, packing alignment score of s and t is −1 while string
alignment score of them is 1 for w(a, b) defined as 1 if a = b and −1 otherwise.
Thus, packing alignment is favorable when gaps should not be inserted into the
middle of some events in alignment.
Another point to mention is the time and space efficiency of the algorithms for the
problems. The number of events in an event sequence s can become very large
when uniform partitioning is applied to s. Let s′ and t′ be the uniformly partitioned
s and t, respectively. Then, the O(n(s′)n(t′)) time and O(n(s′) + n(t′)) space used by
a string alignment algorithm can be significantly larger than the O(p(s, t)n(s)n(t))
time and O(p(s, t)(n(s) + n(t))) space used by a packing alignment algorithm.

3.3 Comparison with DTW


The most popular alignment in music is dynamic time warping (DTW), which was
first developed for speech recognition. DTW is a kind of string alignment, so it
cannot deal with event sequences directly. However, in exchange for prohibiting
gap insertions, DTW allows more than one contiguous symbol in a string to
be opposed to one symbol in the other string, just like packing alignment does.
Unlike packing alignment, there is no limitation on the number of contiguous
symbols opposed to one symbol in DTW. Packing alignment's strictness on
length does not seem to be a bad property, because some constraint on length is effective
in practice. In fact, as constraints on a path in an alignment graph, the adjustment
window condition (the condition that the path must lie within a fixed distance
of the diagonal) and the slope constraint condition (the condition that the path
must go pm steps in the diagonal direction after m consecutive steps in one
axis direction for fixed p) [7] were proposed for DTW. Packing alignment is more
flexible than DTW in that it allows gap insertions.

4 Gap Constraint
When we use alignment score as similarity measure, one problem is that score
can be high for alignments with long contiguous gaps. However, in many real

applications, two sequences with the best alignment with long contiguous gaps
should not be considered to be similar. So, we consider a gap-constraint version
of packing alignment score defined as follows. For non-negative real number
g ≥ 0, let Ag (s, t) denote the set of packing alignments of s and t in which the
length of every contiguous gap subsequence defined below is at most g. Then, the
gap-constraint version of the packing alignment score S∗^g is defined as

S∗^g(s, t) = max_{(s′,t′) ∈ A^g(s,t)} S(s′, t′)  if A^g(s, t) ≠ ∅, and −∞ otherwise.

We call the parameter g the maximum contiguous gap length. For a packing alignment
(s′, t′) of s and t, a contiguous subsequence s′[i..j] is called a contiguous gap
subsequence of s′ if ⟨s′[i]⟩ = ··· = ⟨s′[j]⟩ = - and no non-gap event in s′ is opposed
to t′[h] and t′[k] that are opposed to s′[i] and s′[j], respectively. A contiguous gap
subsequence of t′ can be defined similarly. For example, when s = (C, 1)(E, 1)
and t = (C, 2)(D, 1)(E, 2), the pair of s′ = (C, 1)(-, 1)(-, 1)(-, 1)(E, 1) and t is a
packing alignment, but none of s′[2..4], s′[2..3] and s′[3..4] is a contiguous gap
subsequence of s′ because t[1] and t[3], which are opposed to s′[2] and s′[4], are
also opposed to s′[1] and s′[5], respectively. Let

p^g(s, t) = max{ l : Σ_{k=i}^{i+l−1} |s[k]| ≤ g or Σ_{k=i}^{i+l−1} |t[k]| ≤ g for some i }.

Then, the following proposition holds. The proof is omitted.

Proposition 3. The optimal packing alignment problem for event sequences


s and t with maximum contiguous gap length g can be solved in O((p(s, t) +
p^g(s, t))n(s)n(t)) time and O((p(s, t) + p^g(s, t))(n(s) + n(t))) space.

5 Experiments
5.1 Frequent Approximate Pattern Extraction
By local alignment using packing alignment, we can define similar parts in event
sequences, so we can extract frequent approximate patterns in event sequences.
Here, we consider the task of extracting approximate patterns that appear frequently
in one event sequence. In the note sequence of a musical sheet, such a pattern can
be regarded as a most typical and impressive part. We conducted an experiment
on this task using MIDI files of famous classical music pieces.
As a frequent mining algorithm based on local alignment, we used EnumSub-
strFLOO in [6]. For a given event sequence and a minimum support σ, Enum-
SubstrFLOO extracts contiguous event sequences as approximate patterns that
have minimal locally optimal occurrences with frequency of at least σ. Local
optimality is first introduced to local alignment by Erickson and Sellers [2], and
locally optimal occurrences of approximate patterns are expected to have appro-
priate boundaries. Unfortunately, EnumSubstrFLOO with packing alignment is
not so fast; it is an O(kn³)-time and O(n³)-space algorithm, where n is the num-
ber of events in a given sequence s and k is the maximum packable number

p(s, s). Since EnumSubstrFLOO keeps all occurrence candidates in memory for
efficiency and there are a lot of frequent patterns with short length, we prevented
memory shortage by setting a parameter called the minimum length θ; only the
occurrences for patterns with length of at least θ quarter notes were extracted.
In our experiment, we used the following score function:

w(a, b) = 3 if a = b;  0 if a is close to b;  −1 if a = - or b = -;  −2 otherwise.
Here, we say that a is close to b if one of the following conditions is satisfied:
(1) just one of them is rest ’R’, (2) the pitch difference between them is at most
two semitones or an integral multiple of octaves.
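A sketch of this score function is given below; it assumes, purely for illustration, that pitches are encoded as MIDI note numbers (so that semitone and octave differences are integer arithmetic) and that 'R' and '-' denote a rest and the gap, respectively.

def is_close(a, b):
    """'a is close to b': (1) just one of them is the rest 'R', or (2) the pitch
    difference is at most two semitones or an integral multiple of octaves.
    Pitches are assumed (for illustration) to be MIDI note numbers."""
    if (a == 'R') != (b == 'R'):
        return True            # exactly one of them is a rest
    if a == 'R' or b == 'R':
        return False           # both rests: handled by equality in w()
    diff = abs(a - b)
    return diff <= 2 or diff % 12 == 0

def w(a, b):
    """Score function used in Section 5.1; '-' stands for the gap."""
    if a == '-' or b == '-':
        return -1
    if a == b:
        return 3
    if is_close(a, b):
        return 0
    return -2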
We scored each frequent pattern by summing the alignment scores between
the pattern and its high-scored occurrences, which are greedily selected
so that the ranges of the selected occurrences do not overlap¹.
The maximum contiguous gap length was set to the length of one quarter note
throughout our experiments. The continuity of one rest does not seem important,
so we cut each rest on each beat.

5.2 Running Time Comparison


First, we measured running time of EnumSubstrFLOO with packing alignment,
DTW and general alignment for melody tracks of two relatively small-sized MIDI
files: the 2nd track² of Bach-Menuet.mid³ (“Menuet G dur”, BWV Anh. 114) and
the 1st track of k140-4.mid⁴ (“Wachet auf, ruft uns die Stimme”, BWV 140).
DTW and general alignment were applied after uniform partitioning whose unit
length was set to the greatest common divisor of the lengths of all notes in
a MIDI file. Note that EnumSubstrFLOO runs in O(n³) time and O(n³) space
when DTW or general alignment is used, where n is the length of a target string,
namely, the number of unit notes. The minimum support was set to 5 and the
minimum length was set to 10 except for the case of k140-4.mid with general
alignment; for the case, the minimum support was set to 5 but the minimum
length was set to 40 because of memory shortage. The computer used in our
experiments is a DELL Precision T7500 (CPU: Intel(R) Xeon(R) E5520 2.27 GHz,
memory:2GB). The result is shown below.

MIDI file        Bach-Menuet.mid          k140-4.mid
Method           PA     DTW    GA         PA     DTW    GA
#note            387        4632          829        7274
#pat             355    323    1427       4539   552    0
time (sec)       1.08   28.8   60.8       13.1   60.8   1840
(PA: Packing Alignment, GA: General Alignment)
¹ Two instances were not regarded as overlapped ones when all of the overlapped part is composed of rests.
² The sequence of the highest-pitch notes is extracted as the target note sequence even if the track contains overlapping notes.
³ je1emu.jpn.org/midiClassic/Bach-Menuet.mid
⁴ www.cyborg.ne.jp/kokoyo/bach/dl/k140-4.lzh

For both MIDI files, the number of notes (#note) becomes 9-12 times larger
after uniform partitioning. As a result, EnumSubstrFLOO using DTW or
general alignment is slower than that using packing alignment. The reason
why DTW is faster than general alignment is that pruning of the pattern search
space works well for DTW, which means that the DTW alignment score easily
becomes negative. Note that the best alignment score can become larger
by using gaps with our score function, but DTW does not use gaps.
The following are the highest-scored pattern for packing alignment and the
longest patterns for the other methods extracted from Bach-Menuet.mid, shown
as score excerpts labeled PA, DTW, and GA.
The pattern extracted by packing alignment looks the most appropriate as a
typical melody sequence of the Menuet.

5.3 Musical Variation Extraction Experiment


We conducted an experiment on musical variation extraction from the three fa-
mous variations shown in Table 1. For each MIDI file, we used one note sequence
in its melody track⁵. We checked whether whole themes or whole variations can
be extracted as frequent approximate patterns and their occurrences.
The minimum support was set to 5 for all MIDI files. The minimum length
was set to 40 for mozart k265 and mozart-k331-mov1⁶, and to 20 for be-pv-19.
Note that the theme lengths of mozart k265, mozart-k331-mov1 and be-pv-19
are 96(> 40), 108(> 40), 32(> 20) quarter notes, respectively.
The PA-row of each chart in Fig. 3 shows the highest-scored pattern and its
occurrences from each MIDI file extracted by EnumSubstrFLOO using packing
alignment. The charts also contain the other three rows (Ex-row, DTW-row
and GA-row) for the comparison described later.
First, let us review the results shown in the PA-rows. The extracted patterns
are whole two consecutive variations for mozart-k265 and nearly whole themes
for the other two MIDI files. Here, we say that A is nearly whole B if the length
of the symmetric difference of A and B is at most 10% of B’s length. Except 4
among 20 occurrences of the three patterns, all the occurrences have appropriate
boundaries, namely, they are nearly whole themes, nearly whole variations and
whole two consecutive variations. The extracted patterns and occurrences are
considered to owe their accurate boundaries to their local optimality. Note that
variations are extracted correctly even if the musical time changes: variation 12
(2/4 → 3/4) in mozart k265, variation 6 (6/8 → 4/4) in mozart-k331-mov1, and
variation 6 (2/4 → 3/4) in be-pv-19.

⁵ The highest-pitch note is selected if the track contains overlapping notes.
⁶ It was set to 10 for mozart-k331-mov1 in the case of DTW and general alignment because nothing was frequent for 40.

Table 1. MIDI files used in our experiment on variation extraction

mozart k265: 12 Variations on “Ah vous dirais-je, Maman” K.265
  Track 4 [Mozart], tirolmusic.blogspot.com/2007/11/12-variations-on-ah-vous-dirais-je.html
  Length 12m16s, Form AABABA, #Event 3423, Musical Time 2/4[1,589), 3/4[589,648)
mozart-k331-mov1: Piano Sonata No. 11 in A major, K 331, Andante grazioso
  Track 1 [Mozart], www2s.sni.ne.jp/watanabe/classical.html
  Length 12m20s, Form AA’AA’BA”BA”, #Event 3518, Musical Time 6/8[1,218), 4/4[218,262)
be-pv-19: 6 Variations in D on an Original Theme (Op.76)
  Track 1 [Beethoven], www.classicalmidiconnection.com/cmc/beethoven.html
  Length 3m34s, Form AA’BA, #Event 1458, Musical Time 2/4[1,50), 6/4[50,66), 2/4[66,98), 3/4[98,156), 2/4[156,181)

(Charts for (a) mozart-k265, (b) mozart-k331-mov1 and (c) be-pv-19: each chart contains the rows PA, Ex, DTW and GA plotted over measure numbers, with the theme and variations v1, v2, ... marked along the horizontal axis.)
Fig. 3. Result of musical variation extraction. The horizontal axes refer to measure numbers in each musical piece. The vertical broken lines show the starting and ending positions of themes and variations. The extracted patterns are shown by thick lines, and the other, thin lines are their occurrences. Each thick line also represents an occurrence, except those in the DTW-rows of (a) and (b).
Let us compare these results with those obtained by the other methods. In each Ex-row, a longest frequent exact pattern and its occurrences are shown. For all three MIDI files, the longest frequent exact patterns are very short (2-6 measures) and their occurrences are clustered in narrow ranges. This result indicates the importance of using approximate patterns in this extraction task.
In the DTW-rows and GA-rows, a longest approximate pattern extracted by EnumSubstrFLOO using DTW and general alignment, respectively, and their occurrences are shown. Note that we applied DTW and general alignment by ignoring note length. With DTW, none of the extracted patterns and occurrences are nearly whole themes or nearly whole variations; the same holds for general alignment, except for three nearly whole variations in be-pv-19. These results indicate the importance of taking note length into consideration.
6 Concluding Remarks
By explicitly treating event length, we defined the problem of packing alignment
for sequences of various length events as an optimization problem, which can be
solved efficiently. Direct applicability to such sequences has not only the merit
of time and space efficiency but also the merit of non-decomposability. By virtue
of these merits, we could extract appropriate frequent approximate patterns and
their occurrences in our experiments. We would like to apply packing alignment
to other applications in the future.
Acknowledgements
This work was partially supported by JSPS KAKENHI 21500128.
Multiple Distribution Data Description Learning
Algorithm for Novelty Detection

Trung Le, Dat Tran, Wanli Ma, and Dharmendra Sharma

Faculty of Information Sciences and Engineering,
University of Canberra, ACT 2601, Australia
{trung.le,dat.tran,wanli.ma,dharmendra.sharma}@canberra.edu.au

Abstract. Current data description learning methods for novelty detection such as support vector data description and small sphere with
large margin construct a spherically shaped boundary around a normal
data set to separate this set from abnormal data. The volume of this
sphere is minimized to reduce the chance of accepting abnormal data.
However those learning methods do not guarantee that the single spher-
ically shaped boundary can best describe the normal data set if there
exist some distinctive data distributions in this set. We propose in this
paper a new data description learning method that constructs a set of
spherically shaped boundaries to provide a better data description to the
normal data set. An optimisation problem is proposed and solving this
problem results in an iterative learning algorithm to determine the set
of spherically shaped boundaries. We prove that the classification error
will be reduced after each iteration in our learning method. Experimen-
tal results on 28 well-known data sets show that the proposed method
provides lower classification error rates.

Keywords: Novelty detection, one-class classification, support vector data description, spherically shaped boundary.

1 Introduction

Novelty detection (ND) or one-class classification involves learning a data description of normal data to build a model that can detect any divergence from normality [9]. Data description can be used for outlier detection to detect abnormal
mality [9]. Data description can be used for outlier detection to detect abnormal
samples from a data set. Data description is also used for a classification problem
where one class is well sampled while other classes are severely undersampled.
In real-world applications, collecting normal data is cheap and easy, while abnormal data is expensive to collect and is not available in several situations [14]. For instance, in machine fault detection, data under normal operation is easy to obtain, whereas obtaining faulty data may require the machine to be damaged completely. Therefore, one-class classification is more difficult than conventional two-class classification because the decision boundary of one-class classification is mainly constructed from samples of only the normal class and


hence it is hard to decide how strict the decision boundary should be. ND is widely applied to many application domains such as network intrusion, currency validation, user verification in computer systems, medical diagnosis [3], and machine fault detection [16].
There are two main approaches to solving the data description problem: the density estimation approach [1][2][12] and the kernel-based approach [13][14][20]. In the density estimation approach, the task of data description is solved by estimating a probability density of a data set [11]. This approach requires a large number of training samples for estimation; in practice the training data is often insufficient and hence does not represent the complete density distribution. The estimation will mainly focus on modeling the high-density areas and can result in a bad data description [14]. The kernel-based approach aims at determining the boundaries of the training set rather than at estimating the probability density.
The training data is mapped from the input space into a higher dimensional
feature space via a kernel function. Support Vector Machine (SVM) is one of
the well-known kernel-based methods which constructs an optimal hyperplane
between two classes by focusing on the training samples close to the edge of
the class descriptors [17]. These training samples are called support vectors. In
One-Class Support Vector Machine (OCSVM), a hyperplane is determined to
separate the normal data such that the margin between the hyperplane and
outliers is maximized [13]. Support Vector Data Description (SVDD) is a new
SVM learning method for one-class classification [14]. A hyperspherically shaped
boundary around the normal data set is constructed to separate this set from
abnormal data. The volume of this data description is minimized to reduce the
chance of accepting abnormal data. SVDD has been proven as one of the best
methods for one-class classification problems [19].
Some extensions to SVDD have been proposed to improve the margins of the
hyperspherically shaped boundary. The first extension is Small Sphere and Large
Margin (SSLM) [20], which proposes to surround the normal data within this optimal hypersphere such that the margin, i.e. the distance from outliers to the hypersphere, is maximized. This SSLM approach is helpful for parameter selection and provides
very good detection results on a number of real data sets. We have recently
proposed a further extension to SSLM which is called Small Sphere and Two
Large Margins (SS2LM) [7]. This SS2LM aims at maximising the margin between
the surface of the hypersphere and abnormal data and the margin between that
surface and the normal data while the volume of this data description is being
minimised.
Other extensions to SVDD regarding data distribution have also been proposed. The first extension applies SVDD to multi-class classification problems [5]: several class-specific hyperspheres are constructed, each of which encloses all data samples from one class but excludes all data samples from the other classes. The second extension is for one-class classification and proposes to use a number of hyperspheres to describe the normal data set [19]. Normal data samples may have some distinctive distributions, so they will locate in different regions in the feature space, and hence if the single hypersphere in SVDD is used to enclose all normal data,

it will also enclose abnormal data samples, resulting in a high false positive error rate. However, this work was not presented in detail; the proposed method is heuristic, and no proof is provided to show that the multi-sphere approach can provide a better data description.
We propose in this paper a new and more detailed multi-hypersphere ap-
proach to SVDD. A set of hyperspheres is proposed to describe the normal
data set assuming that normal data samples have distinctive data distributions.
We formulate the optimisation problem for multi-sphere SVDD and prove how
SVDD parameters are obtained through solving this problem. An iterative al-
gorithm is also proposed for building data descriptors, and we also prove that
the classification error will be reduced after each iteration. Experimental re-
sults on 28 well-known data sets show that the proposed method provides lower
classification error rates compared with the standard single-sphere SVDD.

2 Single Hypersphere Approach: SVDD

Let xi , i = 1, . . . , p be normal data points with label yi = +1 and xi , i = p +
1, . . . , n be abnormal data points (outliers) with label yi = −1. SVDD [14] aims
at determining an optimal hypersphere to include all normal data points while
abnormal data points are outside this hypersphere. The optimisation problem is
as follows

\min_{R,c,\xi} \; R^2 + C_1 \sum_{i=1}^{p} \xi_i + C_2 \sum_{i=p+1}^{n} \xi_i \qquad (1)

subject to

\|\phi(x_i) - c\|^2 \le R^2 + \xi_i, \quad i = 1, \ldots, p
\|\phi(x_i) - c\|^2 \ge R^2 - \xi_i, \quad i = p+1, \ldots, n
\xi_i \ge 0, \quad i = 1, \ldots, n \qquad (2)

where R is the radius of the hypersphere, C1 and C2 are constants, ξ = [ξi]i=1,...,n is the vector of slack variables, φ(·) is a kernel function, and c is the centre of the hypersphere.
For classifying an unknown data point x, the following decision function is
used: f (x) = sign(R2 − ||φ(x) − c||2 ). The unknown data point x is normal if
f (x) = +1 or abnormal if f (x) = −1.
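To make the decision rule concrete, note that the centre c is only available as a kernel expansion over the training points obtained from the dual problem, so ||φ(x) − c||² must be evaluated through kernel values. The sketch below is our own illustration (not code from the paper); the expansion coefficients beta (the products of dual coefficients and labels over the support vectors) and the squared radius R2 are assumed to come from a separate dual solver, and an RBF kernel is assumed.

import numpy as np

def rbf_kernel(A, B, gamma):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def svdd_decision(x_new, X_sv, beta, R2, gamma):
    # f(x) = sign(R^2 - ||phi(x) - c||^2) with c = sum_i beta_i * phi(x_i)
    Kxs = rbf_kernel(x_new[None, :], X_sv, gamma)[0]
    Kss = rbf_kernel(X_sv, X_sv, gamma)
    dist2 = 1.0 - 2.0 * Kxs @ beta + beta @ Kss @ beta   # K(x, x) = 1 for RBF
    return 1 if R2 - dist2 >= 0 else -1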

3 Proposed Multiple Hypersphere Approach

3.1 Problem Formulation

Consider a set of m hyperspheres Sj(cj, Rj) with centre cj and radius Rj, j = 1, . . . , m. This hypersphere set is a good data description of the normal data set X = {x1, x2, . . . , xn} if each of the hyperspheres describes a distribution in this data set and the sum of the squared radii \sum_{j=1}^{m} R_j^2 is minimised.

Let U = [uij]p×m, uij ∈ {0, 1}, i = 1, . . . , p, j = 1, . . . , m, be the matrix whose entry uij is the membership representing the degree of belonging of data point xi to hypersphere Sj. The optimisation problem of multi-sphere SVDD can be formulated as follows

\min_{R,c,\xi} \; \sum_{j=1}^{m} R_j^2 + C_1 \sum_{i=1}^{p} \xi_i + C_2 \sum_{i=p+1}^{n} \sum_{j=1}^{m} \xi_{ij} \qquad (3)

subject to

\sum_{j=1}^{m} u_{ij} \|\phi(x_i) - c_j\|^2 \le \sum_{j=1}^{m} u_{ij} R_j^2 + \xi_i, \quad i = 1, \ldots, p
\|\phi(x_i) - c_j\|^2 \ge R_j^2 - \xi_{ij}, \quad i = p+1, \ldots, n, \; j = 1, \ldots, m
\xi_i \ge 0, \quad i = 1, \ldots, p
\xi_{ij} \ge 0, \quad i = p+1, \ldots, n, \; j = 1, \ldots, m \qquad (4)

where R = [Rj]j=1,...,m is the vector of radii, C1 and C2 are constants, ξi and ξij are slack variables, φ(·) is a kernel function, and c = [cj]j=1,...,m is the vector of centres. The mapping φ(xi0) of a normal data point xi0, i0 ∈ {1, 2, . . . , p}, has to be in one of those hyperspheres, i.e. there exists a hypersphere Sj0, j0 ∈ {1, 2, . . . , m}, such that ui0j0 = 1 and ui0j = 0 for all j ≠ j0.
Minimising the function in (3) over variables R, c and ξ subject to (4) will
determine radii and centres of hyperspheres and slack variables if the matrix U
is given. On the other hand, the matrix U will be determined if radii and centres
of hyperspheres are given. Therefore an iterative algorithm will be applied to
find the complete solution. The algorithm consists of two alternating steps: 1) calculate the radii and centres of the hyperspheres and the slack variables, and 2) calculate the membership matrix U.
We present in the next sections the iterative algorithm and prove that the clas-
sification error in the current iteration will be smaller than that in the previous
iteration.
For classifying a data point x, the following decision function is used:

f(x) = \mathrm{sign}\left( \max_{1 \le j \le m} \left( R_j^2 - \|\phi(x) - c_j\|^2 \right) \right) \qquad (5)

The unknown data point x is normal if f(x) = +1 or abnormal if f(x) = −1.
This decision function implies that the mapping of a normal data point has to be inside one of the hyperspheres and that the mapping of an abnormal data point has to be outside all of those hyperspheres. The following theorem characterises the relation between the slack variables and the classification of data points.
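As a small illustration of the multi-sphere rule in (5), the single-sphere sketch above extends directly: x is accepted if it falls inside at least one of the m hyperspheres. The per-sphere expansions (X_sv, beta, R2) are again assumed to come from the individual dual solvers, and rbf_kernel is the helper defined in the earlier sketch.

def ms_svdd_decision(x_new, spheres, gamma):
    # spheres: list of (X_sv_j, beta_j, R2_j), one triple per hypersphere S_j
    # Implements f(x) = sign(max_j (R_j^2 - ||phi(x) - c_j||^2))
    scores = []
    for X_sv, beta, R2 in spheres:
        Kxs = rbf_kernel(x_new[None, :], X_sv, gamma)[0]
        Kss = rbf_kernel(X_sv, X_sv, gamma)
        dist2 = 1.0 - 2.0 * Kxs @ beta + beta @ Kss @ beta
        scores.append(R2 - dist2)
    return 1 if max(scores) >= 0 else -1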
Theorem 1. Assume that (R, c, ξ) is a solution of the optimisation problem in (3) and xi, i ∈ {1, 2, . . . , n}, is the i-th data point.
1. xi is normal: denote Sk(ck, Rk), k ∈ {1, 2, . . . , m}, as the only hypersphere having uik = 1. If xi is misclassified then ξi = ||φ(xi) − ck||² − Rk². If xi is correctly classified then either ξi = 0 and φ(xi) ∈ Sk, or ξi = ||φ(xi) − ck||² − Rk² and φ(xi) ∉ Sk.

2. xi is abnormal: if xi is misclassified and φ(xi) ∈ Sj then ξij = Rj² − ||φ(xi) − cj||². If xi is misclassified and φ(xi) ∉ Sj then ξij = 0. If xi is correctly classified then ξij = 0.
Proof. From (4) we have ξi = max{0, ||φ(xi) − ck||² − Rk²} if xi is normal, and ξij = max{0, Rj² − ||φ(xi) − cj||²} if xi is abnormal.
1. xi is normal: if xi is misclassified then φ(xi) is outside all of the hyperspheres. It follows that ||φ(xi) − cj||² > Rj², j = 1, . . . , m, so ξi = ||φ(xi) − ck||² − Rk² for the corresponding k. If xi is correctly classified then the claim is obtained using (5).
2. xi is abnormal: the claim follows easily from ξij = max{0, Rj² − ||φ(xi) − cj||²}.

The following empirical error can be defined for a data point xi:

\mathrm{error}(i) = \begin{cases} \min_{j} \left( \|\phi(x_i) - c_j\|^2 - R_j^2 \right) & \text{if } x_i \text{ is normal and misclassified} \\ \min_{j:\, \phi(x_i) \in S_j(c_j, R_j)} \left( R_j^2 - \|\phi(x_i) - c_j\|^2 \right) & \text{if } x_i \text{ is abnormal and misclassified} \\ 0 & \text{otherwise} \end{cases} \qquad (6)

Referring to Theorem 1, it is easy to prove that \sum_{i=1}^{n} \xi_i is an upper bound of \sum_{i=1}^{n} \mathrm{error}(i).

3.2 Calculating Radii, Centres and Slack Variables

The Lagrange function for the optimisation problem in (3) subject to (4) is as follows:

L(R, c, \xi, \alpha, \beta) = \sum_{j=1}^{m} R_j^2 + C_1 \sum_{i=1}^{p} \xi_i + C_2 \sum_{i=p+1}^{n} \sum_{j=1}^{m} \xi_{ij} + \sum_{i=1}^{p} \alpha_i \left( \|\phi(x_i) - c_{s(i)}\|^2 - R_{s(i)}^2 - \xi_i \right) - \sum_{i=p+1}^{n} \sum_{j=1}^{m} \alpha_{ij} \left( \|\phi(x_i) - c_j\|^2 - R_j^2 + \xi_{ij} \right) - \sum_{i=1}^{p} \beta_i \xi_i - \sum_{i=p+1}^{n} \sum_{j=1}^{m} \beta_{ij} \xi_{ij} \qquad (7)

where s(i) ∈ {1, . . . , m} is the index of the hypersphere to which data point xi belongs, satisfying u_{is(i)} = 1 and u_{ij} = 0 for all j ≠ s(i).
Setting the derivatives of L(R, c, ξ, α, β) with respect to the primal variables to 0, we obtain

\frac{\partial L}{\partial R_j} = 0 \;\Rightarrow\; \sum_{i \in s^{-1}(j)} \alpha_i y_i + \sum_{i=p+1}^{n} \alpha_{ij} y_i = 1 \qquad (8)

\frac{\partial L}{\partial c_j} = 0 \;\Rightarrow\; c_j = \sum_{i \in s^{-1}(j)} \alpha_i y_i \phi(x_i) + \sum_{i=p+1}^{n} \alpha_{ij} y_i \phi(x_i) \qquad (9)

\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i + \beta_i = C_1, \quad i = 1, \ldots, p \qquad (10)

\frac{\partial L}{\partial \xi_{ij}} = 0 \;\Rightarrow\; \alpha_{ij} + \beta_{ij} = C_2, \quad i = p+1, \ldots, n, \; j = 1, \ldots, m \qquad (11)

\alpha_i \ge 0, \quad \|\phi(x_i) - c_{s(i)}\|^2 - R_{s(i)}^2 - \xi_i \le 0, \quad \alpha_i \left( \|\phi(x_i) - c_{s(i)}\|^2 - R_{s(i)}^2 - \xi_i \right) = 0, \quad i = 1, \ldots, p \qquad (12)

\alpha_{ij} \ge 0, \quad \|\phi(x_i) - c_j\|^2 - R_j^2 + \xi_{ij} \ge 0, \quad \alpha_{ij} \left( \|\phi(x_i) - c_j\|^2 - R_j^2 + \xi_{ij} \right) = 0, \quad i = p+1, \ldots, n, \; j = 1, \ldots, m \qquad (13)

\beta_i \ge 0, \quad \xi_i \ge 0, \quad \beta_i \xi_i = 0, \quad i = 1, \ldots, p \qquad (14)

\beta_{ij} \ge 0, \quad \xi_{ij} \ge 0, \quad \beta_{ij} \xi_{ij} = 0, \quad i = p+1, \ldots, n, \; j = 1, \ldots, m \qquad (15)

To get the dual form, we substitute (8)-(15) into the Lagrange function in (7) and obtain the following:

L = \sum_{i=1}^{p} \alpha_i \|\phi(x_i) - c_{s(i)}\|^2 - \sum_{i=p+1}^{n} \sum_{j=1}^{m} \alpha_{ij} \|\phi(x_i) - c_j\|^2
  = \sum_{i=1}^{p} \alpha_i K(x_i, x_i) - \sum_{i=p+1}^{n} \sum_{j=1}^{m} \alpha_{ij} K(x_i, x_i) - \sum_{j=1}^{m} \|c_j\|^2
  = \sum_{j=1}^{m} \left[ \sum_{i \in s^{-1}(j)} \alpha_i y_i K(x_i, x_i) + \sum_{i=p+1}^{n} \alpha_{ij} y_i K(x_i, x_i) - \left\| \sum_{i \in s^{-1}(j)} \alpha_i y_i \phi(x_i) + \sum_{i=p+1}^{n} \alpha_{ij} y_i \phi(x_i) \right\|^2 \right] \qquad (16)

(The second equality follows by expanding the squared norms and using (8) and (9) to eliminate the cross terms involving the centres c_j.)

The result in (16) shows that the optimisation problem in (3) is equivalent to m individual optimisation problems as follows:

\min \; \left\| \sum_{i \in s^{-1}(j)} \alpha_i y_i \phi(x_i) + \sum_{i=p+1}^{n} \alpha_{ij} y_i \phi(x_i) \right\|^2 - \sum_{i \in s^{-1}(j)} \alpha_i y_i K(x_i, x_i) - \sum_{i=p+1}^{n} \alpha_{ij} y_i K(x_i, x_i) \qquad (17)

subject to

\sum_{i \in s^{-1}(j)} \alpha_i y_i + \sum_{i=p+1}^{n} \alpha_{ij} y_i = 1, \quad 0 \le \alpha_i \le C_1, \; i \in s^{-1}(j), \quad 0 \le \alpha_{ij} \le C_2, \; i = p+1, \ldots, n, \quad j = 1, \ldots, m \qquad (18)

After solving all of these individual optimization problems, we can calculate the updated radii R = [Rj] and centres c = [cj], j = 1, . . . , m, using the equations in SVDD.

3.3 Calculating Membership U

We use the radii and centres of the hyperspheres to update the membership matrix U. The following algorithm is proposed:

For each normal data point xi, i = 1 to p do
  If xi is misclassified then
    Let j0 = arg min_j (||φ(xi) − cj||² − Rj²)
    Set u_{ij0} = 1 and u_{ij} = 0 for j ≠ j0
  Else
    Denote J = {j : φ(xi) ∈ Sj(cj, Rj)}
    Let j0 = arg min_{j∈J} ||φ(xi) − cj||²
    Set u_{ij0} = 1 and u_{ij} = 0 for j ≠ j0
  End If
End For
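The membership update above can be written compactly once the squared distances ||φ(xi) − cj||² are available (for example via the kernel expansion used in the earlier sketches). The following NumPy sketch is our own reading of the update rule; the arrays dist2 and R2 are assumed to be precomputed.

import numpy as np

def update_membership(dist2, R2):
    # dist2: (p, m) array, dist2[i, j] = ||phi(x_i) - c_j||^2 for the normal points
    # R2:    (m,)  array of squared radii; returns the 0/1 membership matrix U
    p, m = dist2.shape
    U = np.zeros((p, m), dtype=int)
    margin = dist2 - R2[None, :]              # <= 0 means x_i lies inside S_j
    for i in range(p):
        inside = np.where(margin[i] <= 0)[0]
        if inside.size == 0:                  # misclassified normal point
            j0 = int(np.argmin(margin[i]))
        else:                                 # correctly classified point
            j0 = int(inside[np.argmin(dist2[i, inside])])
        U[i, j0] = 1
    return U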

3.4 Iterative Learning Process

The proposed iterative learning process for multi-sphere SVDD runs two alternating steps until convergence is reached, as follows:

Initialise U by clustering the normal data set in the input space
Repeat the following
  Calculate R, c and ξ using U
  Calculate U using R and c
Until convergence is reached
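A minimal skeleton of this alternating procedure is sketched below, again as our own illustration rather than the authors' implementation. The per-sphere solver fit_svdd_sphere is a placeholder that must solve the individual problem (17)-(18) and expose a squared radius R2 and a method dist2(X) returning squared feature-space distances; k-means is used only to initialise U by clustering the normal data in the input space, and update_membership is the helper from the previous sketch.

import numpy as np
from sklearn.cluster import KMeans

def multi_sphere_svdd(X_norm, X_abn, m, fit_svdd_sphere, n_iter=50):
    # Initialise U by clustering the normal data set in the input space
    labels = KMeans(n_clusters=m, n_init=10).fit_predict(X_norm)
    U = np.eye(m, dtype=int)[labels]
    spheres = []
    for _ in range(n_iter):
        # Step 1: calculate R, c and xi (one sphere per current cluster)
        spheres = [fit_svdd_sphere(X_norm[U[:, j] == 1], X_abn) for j in range(m)]
        # Step 2: calculate U using the new radii and centres
        dist2 = np.column_stack([s.dist2(X_norm) for s in spheres])
        R2 = np.array([s.R2 for s in spheres])
        U_new = update_membership(dist2, R2)
        if np.array_equal(U_new, U):          # convergence of the membership
            break
        U = U_new
    return spheres, U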

We can prove that the classification error in the current iteration will be
smaller than that in the previous iteration through the following key theorem.

Theorem 2. Let $(R, c, \xi, U)$ and $(\bar{R}, \bar{c}, \bar{\xi}, \bar{U})$ be the solutions at the previous iteration and the current iteration, respectively. The following inequality holds:

\sum_{j=1}^{m} \bar{R}_j^2 + C_1 \sum_{i=1}^{p} \bar{\xi}_i + C_2 \sum_{i=p+1}^{n} \sum_{j=1}^{m} \bar{\xi}_{ij} \;\le\; \sum_{j=1}^{m} R_j^2 + C_1 \sum_{i=1}^{p} \xi_i + C_2 \sum_{i=p+1}^{n} \sum_{j=1}^{m} \xi_{ij} \qquad (19)

Proof. We prove that $(R, c, \xi, \bar{U})$ is a feasible solution at the current iteration.

Case 1: $x_i$ is normal and misclassified.

\sum_{j=1}^{m} u_{ij} \left( \|\phi(x_i) - c_j\|^2 - R_j^2 \right) - \sum_{j=1}^{m} \bar{u}_{ij} \left( \|\phi(x_i) - c_j\|^2 - R_j^2 \right) = u_{is(i)} \left( \|\phi(x_i) - c_{s(i)}\|^2 - R_{s(i)}^2 \right) - \min_j \left( \|\phi(x_i) - c_j\|^2 - R_j^2 \right) \ge 0 \qquad (20)

Hence

\sum_{j=1}^{m} \bar{u}_{ij} \left( \|\phi(x_i) - c_j\|^2 - R_j^2 \right) \le \sum_{j=1}^{m} u_{ij} \left( \|\phi(x_i) - c_j\|^2 - R_j^2 \right) \le \xi_i \qquad (21)

The second inequality in (21) holds because $(R, c, \xi, U)$ is a solution at the previous step.

Case 2: $x_i$ is normal and correctly classified. Denote $J = \{j : \phi(x_i) \in S_j(c_j, R_j)\}$ and $j_0 = \arg\min_{j \in J} \|\phi(x_i) - c_j\|^2$; then

\sum_{j=1}^{m} \bar{u}_{ij} \left( \|\phi(x_i) - c_j\|^2 - R_j^2 \right) = \|\phi(x_i) - c_{j_0}\|^2 - R_{j_0}^2 \le 0 \le \xi_i \qquad (22)

Case 3: $x_i$ is abnormal. It is seen that

\|\phi(x_i) - c_j\|^2 \ge R_j^2 - \xi_{ij}, \quad i = p+1, \ldots, n, \; j = 1, \ldots, m \qquad (23)

From (20)-(23), we can conclude that $(R, c, \xi, \bar{U})$ is a feasible solution at the current iteration. In addition, $(\bar{R}, \bar{c}, \bar{\xi}, \bar{U})$ is the optimal solution at the current iteration. That results in our conclusion.

4 Experimental Results
We performed our experiments on 28 well-known data sets related to machine
fault detection and bioinformatics. These data sets were originally balanced data
sets and some of them contain several classes. For each data set, we picked up
a class at a time and divided the data set of this class into two equal subsets.
One subset was used as training set and the other one with data sets of other

Table 1. Number of data points in 28 data sets. #normal: number of normal data
points, #abnormal: number of abnormal data points and d: dimension.

Data set #normal #abnormal d
Arrhythmia 237 183 278
Astroparticle 2000 1089 4
Australian 383 307 14
Breast Cancer 444 239 10
Bioinformatics 221 117 20
Biomed 67 127 5
Colon cancer 40 22 2000
DelfPump 1124 376 64
Diabetes 500 268 8
Dna 464 485 180
Duke 44 23 7129
Fourclass 307 555 2
Glass 70 76 9
Heart 303 164 13
Hepatitis 123 32 19
Ionosphere 255 126 34
Letter 594 567 16
Liver 200 145 6
Protein 4065 13701 357
Sonar 97 111 67
Spectf 254 95 44
Splice 517 483 60
SvmGuide1 2000 1089 4
SvmGuide3 296 947 22
Thyroid 3679 93 21
USPS 1194 6097 256
Vehicle 212 217 18
Wine 59 71 13

classes were used for testing. We repeated dividing a data set ten times and
calculated the average classification rates. We also compared our multi-sphere
SVDD method with SVDD and OCSVM. The classification rate acc is measured
as [6]

acc = \sqrt{acc_{+} \cdot acc_{-}} \qquad (24)
where acc+ and acc− are the classification accuracy on normal and abnormal
data, respectively.
The popular RBF kernel function K(x, x') = e^{-\gamma \|x - x'\|^2} was used in our experiments. The parameter γ was searched in {2^k : k = 2l + 1, l = −8, −7, . . . , 2}. For SVDD and multi-sphere SVDD, the trade-off parameter C1 was searched

Table 2. Classification results (in %) on 28 data sets for OCSVM, SVDD and Multi-
sphere SVDD (MS-SVDD).

Data set OCSVM SVDD MS-SVDD
Arrhythmia 63.16 70.13 70.13
Astroparticle 89.66 90.41 93.23
Australian 77.19 80.00 81.80
B. Cancer 95.25 98.64 98.64
Bioinformatics 68.34 68.10 72.00
Biomed 74.98 63.83 74.76
Colon cancer 69.08 67.42 67.42
DelfPump 63.20 70.65 75.27
Diabetes 68.83 72.30 78.72
Dna 76.08 73.70 83.01
Duke cancer 62.55 65.94 65.94
FourClass 93.26 98.48 98.76
Glass 80.60 79.21 79.21
Heart 73.40 77.60 79.45
Hepatitis 76.82 80.17 81.90
Ionosphere 90.90 88.73 92.12
Letter 91.42 95.86 98.03
Liver 73.80 62.45 74.12
Protein 63.65 70.68 71.11
Sonar 65.97 72.91 72.91
Spectf 77.10 70.71 77.36
Splice 64.43 70.51 70.51
SVMGuide1 89.56 87.92 93.05
SvmGuide3 63.14 70.63 70.63
Thyroid 87.88 87.63 91.44
USPS 92.85 92.83 96.23
Vehicle 64.50 70.38 75.04
Wine 88.30 98.31 98.31

over the grid {2^k : k = 2l + 1, l = −8, −7, . . . , 2} and C2 was searched such that the ratio C2/C1 belonged to

\left\{ \frac{1}{4} \cdot \frac{p}{n-p}, \;\; \frac{1}{2} \cdot \frac{p}{n-p}, \;\; \frac{p}{n-p}, \;\; 2 \cdot \frac{p}{n-p}, \;\; 4 \cdot \frac{p}{n-p} \right\} \qquad (25)

For OCSVM, the parameter ν was searched in {0.1k : k = 1, . . . , 9}. For multi-sphere SVDD, the number of hyperspheres was changed from 1 to 10, and 50 iterations were applied to each training.
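For reference, the search grids described above can be generated as follows. This is a hypothetical helper (the function name is ours); the grid values are taken from the text and Eq. (25).

import itertools

def parameter_grid(p, n):
    # gamma and C1 over {2^k : k = 2l + 1, l = -8, ..., 2}; C2 = C1 * ratio, Eq. (25)
    exps = [2 * l + 1 for l in range(-8, 3)]
    gammas = [2.0 ** k for k in exps]
    C1s = [2.0 ** k for k in exps]
    base = p / (n - p)
    ratios = [0.25 * base, 0.5 * base, base, 2 * base, 4 * base]
    for gamma, C1, r in itertools.product(gammas, C1s, ratios):
        yield {"gamma": gamma, "C1": C1, "C2": C1 * r}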
Table 2 presents the classification results for OCSVM, SVDD, and multi-sphere SVDD (MS-SVDD). The results over the 28 data sets show that MS-SVDD always performs at least as well as SVDD. The reason is that SVDD can be regarded as a special case of MS-SVDD in which the number of hyperspheres is 1. MS-SVDD provides the highest accuracies for all data sets except the Colon cancer and Biomed data sets. For some cases, MS-SVDD obtains the same result as SVDD. This could be

explained by those data sets having only one distribution. Our new model seems to attain the major improvement on the larger data sets. This is quite natural, since large data sets are more likely to have several distinct distributions that can be described by different hyperspheres.

5 Conclusion
We have proposed a new multiple-hypersphere approach to solving the one-class classification problem using support vector data description. A data set is described by a set of hyperspheres. This is an incremental learning process, and we can prove theoretically that the error obtained in the current iteration is no larger than that in the previous iteration. We have compared our proposed method with support vector data description and one-class support vector machine. Experimental results have shown that our proposed method provides better performance than those two methods over 28 well-known data sets.

References
1. Bishop, C.M.: Novelty detection and neural network validation. In: IEEE Proceed-
ings of Vision, Image and Signal Processing, pp. 217–222 (1994)
2. Barnett, V., Lewis, T.: Outliers in statistical data, 3rd edn. Wiley, Chichester
(1978)
3. Campbell, C., Bennet, K.P.: A linear programming approach to novelty detection.
Advances in Neural Information Processing Systems 14 (2001)
4. Chang, C.-C., Lin, C.-J.: LIBSVM: A Library for Support Vector Machines,
http://www.csie.ntu.edu.tw/~cjlin/libsvm
5. Hao, P.Y., Liu, Y.H.: A New Multi-class Support Vector Machine with Multi-
sphere in the Feature Space. In: Okuno, H.G., Ali, M. (eds.) IEA/AIE 2007. LNCS
(LNAI), vol. 4570, pp. 756–765. Springer, Heidelberg (2007)
6. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training set: One-sided
selection. In: Proc. 14th International Conference on Machine Learning, pp. 179–
186 (1997)
7. Le, T., Tran, D., Ma, W., Sharma, D.: An Optimal Sphere and Two Large Margins
Approach for Novelty Detection. In: Proc. IEEE World Congress on Computational
Intelligence, WCCI (accepted 2010)
8. Lin, Y., Lee, Y., Wahba, G.: Support vector machine for classification in nonstan-
dard situations. Machine Learning 15, 1115–1148 (2002)
9. Moya, M.M., Koch, M.W., Hostetler, L.D.: One-class classifier networks for target
recognition applications. In: Proceedings of World Congress on Neural Networks,
pp. 797–801 (1991)
10. Mu, T., Nandi, A.K.: Multiclass Classification Based on Extended Support Vector
Data Description. IEEE Transactions on Systems, Man and Cybernetics Part B:
Cybernetics 39(5), 1206–1217 (2009)
11. Parra, L., Deco, G., Miesbach, S.: Statistical independence and novelty detec-
tion with information preserving nonlinear maps. Neural Computation 8, 260–269
(1996)
12. Roberts, S., Tarassenko, L.: A Probabilistic Resource Allocation Network for Nov-
elty Detection. Neural Computation 6, 270–284 (1994)

13. Schölkopf, B., Smola, A.J.: Learning with Kernels. The MIT Press, Cambridge (2002)
14. Tax, D.M.J., Duin, R.P.W.: Support vector data description. Machine Learning 54,
45–56 (2004)
15. Tax, D.M.J.: Datasets (2009),
http://ict.ewi.tudelft.nl/~davidt/occ/index.html
16. Towel, G.G.: Local expert autoassociator for anomaly detection. In: Proc. 17th
International Conference on Machine Learning, pp. 1023–1030. Morgan Kaufmann
Publishers Inc., San Francisco (2000)
17. Vapnik, V.: The nature of statistical learning theory. Springer, Heidelberg (1995)
18. Vert, J., Vert, J.P.: Consistency and convergence rates of one class svm and related
algorithm. Journal of Machine Learning Research 7, 817–854 (2006)
19. Xiao, Y., Liu, B., Cao, L., Wu, X., Zhang, C., Hao, Z., Yang, F., Cao, J.:
Multi-sphere Support Vector Data Description for Outliers Detection on Multi-
Distribution Data. In: Proc. IEEE International Conference on Data Mining Work-
shops, pp. 82–88 (2009)
20. Yu, M., Ye, J.: A Small Sphere and Large Margin Approach for Novelty Detection
Using Training Data with Outliers. IEEE Transaction on Pattern Analysis and
Machine Intelligence 31, 2088–2092 (2009)
RADAR: Rare Category Detection via
Computation of Boundary Degree

Hao Huang, Qinming He, Jiangfeng He, and Lianhang Ma

College of Computer Science and Technology,
Zhejiang University, Hangzhou, 310027, China
{howardhuang,hqm,jerson_hjf2003,badmartin}@zju.edu.cn

Abstract. Rare category detection is an open challenge for active learning. It can help select anomalies and then query their class labels with
human experts. Compared with traditional anomaly detection, this task
does not focus on finding individual and isolated instances. Instead, it
selects interesting and useful anomalies from small compact clusters. Fur-
thermore, the goal of rare category detection is to request as few queries
as possible to find at least one representative data point from each rare
class. Previous research works can be divided into three major groups,
model-based, density-based and clustering-based methods. Performance
of these approaches is affected by the local densities of the rare classes.
In this paper, we develop a density insensitive method for rare category
detection called RADAR. It makes use of reverse k-nearest neighbors
to measure the boundary degree of each data point, and then selects
examples with high boundary degree for the class-label querying. Exper-
imental results on both synthetic and real-world data sets demonstrate
the effectiveness of our algorithm.

Keywords: active learning, anomaly detection, rare category detection.

1 Introduction
Rare category detection is an interesting task derived from anomaly detection. This task was first proposed by Pelleg et al. [2] to help the user select useful and interesting anomalies. Compared with traditional anomaly detection, it aims to find representative data points of the compact rare classes, which differ from the individual and isolated instances in the low-density regions. Furthermore, a human expert is required to label the selected data point under a known class or a previously undiscovered class. A good rare category detection algorithm should discover at least one example from each class with the fewest label requests.
Rare category detection has many applications in the real world. In the Sloan Digital Sky Survey [2], it helps astronomers find useful anomalies in massive sky-survey images, which may lead to new astronomical discoveries. In financial fraud detection [3], although most financial transactions are legitimate, there are a few fraudulent ones; compared with checking the transactions one by one, using rare category detection is much more efficient for detecting instances of the fraud patterns. In


intrusion detection [4], the authors adopted an active learning framework to select ”interesting traffic” from huge-volume traffic data sets, so that engineers can find the meaningful malicious network activities. In visual analytics [5], by locating the attractive changes in massive remote sensing imagery, geographers can determine which changes in a particular geographic area are significant.
Up until now, several approaches have been proposed for rare category detec-
tion. The main techniques can be categorized into model-based [2], density-based
[7] [8] [9], and clustering-based [10] methods. The model-based methods assume
a mixture model to fit the data, and select the strangest records in the mix-
ture components for class labeling. However, this assumption has limited their
applicable scope. For example, they require that the majority classes and the
rare classes be separable or work best in the separable case [7]. The densities-
based methods employ essentially a local-density-differential-sampling strategy,
which selects the points from the regions where the local densities fall the most.
This kind of approaches can discover examples of the rare classes rapidly, de-
spite non-separability with the majority classes. But when the local densities of
some rare classes are not dramatically higher than that of the majority classes,
their performance is not as good as that in the high density-differential case.
The clustering-based methods first perform a hierarchical mean shift clustering.
Then, select the clusters which are compact and isolated and query the cluster
modes. Intuitively, if each rare class has a high local density and is isolated, its
points will easily converge at the mode of density by using mean shift. But in
real-world data sets, it is actually not the case. First, the rare classes are often
hidden in the majority classes. Second, if the local densities of the rare classes
are not high enough, their points may converge to the other clusters. In a word,
although the density-based and clustering-based methods work reasonably well
compared with model-based methods, their performance is still affected by the
local densities of the rare classes.
In order to avoid the effect of the local densities of the rare classes, we propose
a density insensitive approach called RADAR. To the best of our knowledge, RADAR is the first sophisticated density insensitive method for rare category detection. In our approach, we use the change in the number of RkNN to estimate the boundary degree of each data point. A point with a higher boundary degree has a higher probability of being a boundary point of a rare class. We then sort the data points by their boundary degrees and query their class labels with human experts.
The key contribution of our work is twofold:
(1) We propose a density insensitive method for rare category detection.
(2) Our approach has a high efficiency in finding new classes and effectively reduces the number of queries to human experts.
The rest of the paper is organized as follows. Section 2 formalizes the problem
and defines its scope. Section 3 explains the working principle and working steps
of our approach. In Section 4, we compare RADAR with existing approaches on
both synthetic data sets and real data sets. Section 5 is the conclusion of this
paper.

2 Problem Formalization
Following the definition of He et al. [7], we are given a set of unlabeled examples S = {x1, x2, ..., xn}, xi ∈ R^d, which come from m distinct categories, i.e. yi ∈ {1, 2, ..., m}. Our goal is to find at least one example from each category with as few label requests as possible. For convenience, assume that there is only one majority class, which corresponds to yi = 1, and all the other categories are minority classes with priors pc, c = 2, ..., m. Let p1 denote the prior of the majority category. Notice that pi, i ≠ 1, is much smaller than p1.
Our rare category detection strategy is to select the points with the highest boundary degree for labeling. To understand our approach clearly, we introduce the following definitions, which are used in the rest of the paper.
Definition 1. (Reverse k-nearest neighbor) The reverse k-nearest neighbors (RkNN) of a point are defined as [6]: given a data set DB, a point p, a positive integer k and a distance metric M, the reverse k-nearest neighbors of p, i.e. RkNN_p(k), are the set of points p_i such that p_i ∈ DB and p ∈ kNN_{p_i}(k), where kNN_{p_i}(k) are the k-nearest neighbors of point p_i.

Definition 2. (Significant point) A point is a significant point if the number of its RkNN is above a certain threshold τ:

\tau(k, w) = \mathrm{mean}(k) - w \cdot \mathrm{std}(k) \qquad (1)

\mathrm{mean}(k) = \frac{1}{n} \sum_{q \in S} |RkNN_q(k)| \qquad (2)

\mathrm{std}(k) = \sqrt{ \frac{1}{n} \sum_{q \in S} \left( |RkNN_q(k)| - \frac{1}{n} \sum_{p \in S} |RkNN_p(k)| \right)^2 } \qquad (3)

Definition 3. (Coarctation index) The coarctation index CI of a point p is defined as:

CI(p, k) = \sum_{q \in \{p\} \cup kNN_p(k)} |RkNN_q(k)| \qquad (4)

Definition 4. (Uneven index) The uneven index UI of a point p is defined as:

UI(p, k) = \sqrt{ \sum_{q \in \{p\} \cup kNN_p(k)} \left( |RkNN_q(k)| - \frac{CI(p, k)}{k+1} \right)^2 } \qquad (5)

Definition 5. (Boundary degree) The boundary degree BD of a point p is defined as:

BD(p, k) = \frac{UI(p, k)}{CI(p, k)} \qquad (6)
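To make Definitions 1-5 concrete, the following brute-force NumPy sketch (our own illustration, O(n^2) and intended only for small data sets) computes the RkNN counts, the threshold τ, and CI, UI and BD for every point.

import numpy as np

def boundary_degree(X, k, w=1.0):
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                  # a point is not its own neighbour
    knn = np.argsort(D, axis=1)[:, :k]           # kNN index lists
    rknn_count = np.zeros(n, dtype=int)          # |RkNN_q(k)|
    for i in range(n):
        for j in knn[i]:
            rknn_count[j] += 1
    tau = rknn_count.mean() - w * rknn_count.std()                # Eqs. (1)-(3)
    CI = np.empty(n)
    UI = np.empty(n)
    for i in range(n):
        group = np.concatenate(([i], knn[i]))    # {p} U kNN_p(k), k + 1 points
        counts = rknn_count[group]
        CI[i] = counts.sum()                                       # Eq. (4)
        UI[i] = np.sqrt(((counts - CI[i] / (k + 1)) ** 2).sum())   # Eq. (5)
    BD = np.divide(UI, CI, out=np.full(n, -np.inf), where=CI > 0)  # Eq. (6)
    significant = rknn_count >= tau              # Definition 2
    return BD, significant, rknn_count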

3 RADAR Algorithm
3.1 Working Principle
In this subsection, we explain why we have adopted an RkNN-based measurement for the boundary degree, and illustrate the reason for adopting the concept of a significant point.

RkNN-based boundary point detection. RkNN has some unique properties [6]: (1) The cardinality of a point's reverse k-nearest neighbors varies with the data distribution. (2) The RkNN of the boundary points are fewer than those of the inner points.
The first property means that RkNN only considers the relative position be-
tween data points, and thus has nothing to do with the Euclidean distance or the
local density. There is a simple example: two clusters have the same data distri-
bution but differ in absolute distances between data points. Obviously, although
the two clusters have different densities, the kNN and the RkNN of correspond-
ing data points in two clusters remain the same. According to Definition 3, 4
and 5, CI, U I and BD of corresponding data points in two clusters are equal
too. In other words, CI, U I and BD are not designed to evaluate the local den-
sities of data points. Instead, they are used to find the local regions where data
distribution changes.
The second property is the reason why the boundary degree can be used
to detect the change in the data distribution. Generally, the kNN of boundary
points include partial inner points of the cluster, several outer points and some
other boundary points nearby. These three types of data points are very different
in the number of RkNN. By contrast, usually an inner point’s kNN are still inner
points. According to Definition 5, U I indicates the variation of the number of
RkNN between a data point and its kNN. Thus, the boundary points have higher
U I than the inner points. On the other hand, CI of a data point represents the
sum of the number of RkNN. When the number of kNN is fixed, a point with
more inner nearest neighbors has a higher CI. In other words, CI of inner points
is higher than that of the boundary points. Therefore, the boundary points have
higher U I, lower CI, and thus higher BD. By querying the class labels for the
data points with high BD, we can discover examples of rare classes hidden in
the majority class. Just like using radar to scan the data set, we do not consider
the situation in which the local data distribution is even. However, if there are some changes in the local data distribution, we will get a feedback signal and locate the target.

Significant point. Before discussing the significant points, which have more than τ RkNN, we begin with an example illustrated in Fig. 1, which comes from the literature [6]. When k = 2, Table 1 shows the kNN and the RkNN of each point in Fig. 1. The cardinality of each point's RkNN is as follows: p2, p3, p5 and p7 have 3 RkNN; p6 has 2 RkNN; p1 and p4 have 1 RkNN; p8 has none. Notice that p8's nearest neighbors are in a relatively compact cluster consisting of p5, p6, p7.
However, points in this cluster are each other’s kNN. Since the capacity of each

Fig. 1. Example of RkNN

Table 1. The kNN and RkNN of each point

Point   kNN       RkNN
p1      p2, p3    p2
p2      p1, p3    p1, p3, p4
p3      p2, p4    p1, p2, p4
p4      p2, p3    p3
p5      p6, p7    p6, p7, p8
p6      p5, p7    p5, p7
p7      p5, p6    p5, p6, p8
p8      p5, p7    (none)

point’s kNN list is limited, p8 is not in the kNN lists of its nearest neighbors and
thus has no RkNN. According to Fig. 1, it is hard to say that p8 is a candidate
of minority-class points. In other words, if the RkNN of a point is extremely few,
it means this point is relatively far from the other points. It is not significant to
query its class label because of the low probability of this point belonging to a compact cluster. Therefore, in our approach, we query a point only if it is a significant point.

3.2 Algorithm

The RADAR algorithm is presented in Algorithm 1. It works as follows. Firstly, we estimate the point number ki of each minority category i. Then, we find the ki-nearest neighbors of each point and calculate the number of its RkNN. Furthermore, we calculate ri for each minority class; ri is the global minimum distance between each point and its ki-th nearest neighbor. It will be used for updating the querying-duty-exemption list EL in Step 13. In the outer loop of Step 9, we first choose the smallest undiscovered class i and set k′ to be the corresponding ki. Then, in Step 11, we calculate each point's boundary degree (BD) and determine which points are significant points. By setting the BD of the non-significant points to negative infinity, we can prevent them from the

Algorithm 1. Rare Category Detection via Computation of Boundary Degree (RADAR)
Input: Unlabeled data set S, p2, ..., pm, w
Output: The set I of selected examples and the set L of their labels.
1: for i = 2 to m do
2:   Let ki = |S| * pi.
3:   For all xj ∈ S, k_dist_{xj}(ki) is the distance between xj and its ki-th nearest neighbor. Set kNN_{xj}(ki) = {x | x ∈ S, ||x − xj|| ≤ k_dist_{xj}(ki), x ≠ xj}.
4:   For all xj ∈ S, |kNN_x(ki)|_{xj} is the number of occurrences of xj in kNN_x(ki). Set |RkNN_{xj}(ki)| = Σ_{x∈S} |kNN_x(ki)|_{xj}.
5:   Let ri = min_{x∈S}(k_dist_x(ki)).
6: end for
7: Set r1 = min_{i=2,...,m} ri.
8: Build an empty querying-duty-exemption point list EL.
9: while not all the categories have been discovered do
10:   Let k′ = min{ki | 2 ≤ i ≤ m, and category i has not been discovered}.
11:   For all x ∈ S, calculate BD(x, k′); if |RkNN_x(k′)| < τ(k′, w), set BD(x, k′) = −∞.
12:   for t = 1 to n do
13:     For each xj that has been selected and labeled yj, let EL = EL ∪ {x | x ∈ S, ||x − xj|| ≤ r_{yj}}.
14:     Query x = arg max_{x∈S\EL} BD(x, k′).
15:     if x belongs to a new category, break.
16:   end for
17: end while

class-label querying and thus save the querying budget. In addition, setting the parameter w to 1 is a suitable experimental choice. In the inner loop of Step 12, we query the point which has the maximum boundary degree with the human expert. When we find an example from a previously undiscovered class, we quit the inner loop. In order to reduce the number of queries caused by repeatedly selecting examples from the same discovered category, we employ a discreet querying-duty-exemption strategy: (1) in Step 8, we build an empty point list EL to record the points which do not need to be queried; (2) in Step 13, if a point xj from class yj is labeled, the points falling inside a hyper-ball B of radius r_{yj} centered at xj will be added into EL.
A good exemption strategy can help us reduce the querying cost. But if the exemption strategy is too greedy, more points near the labeled points will be added into EL. Then the risk of preventing some minority classes from being queried will be higher, especially when the minority classes are near to each other. In order to avoid such a case, we should ensure that the number of querying-duty-exemption points does not become too large. In our discreet exemption strategy, when we label a point under a minority class i, the number of points in the hyper-ball B will not be more than ki, i.e. |B| ≤ ki. The reason is that the radius ri is the global
minimum distance between each point and its ki-th nearest neighbor. When we label a point under the majority class, we do the querying-duty exemption more carefully, because this point is usually close to a rare category's boundary. We do not set the corresponding radius of B to min_{x∈S}(k_dist_x(k1)). Instead, for the sake of discreetness, we set r1 = min_{i=2,...,m} ri so that the nearby rare-category points can keep their querying duties completely or partially.
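Putting the pieces together, a compact version of Algorithm 1 could look like the sketch below. It is our own illustration and reuses the boundary_degree helper from the earlier sketch; oracle(i) stands in for the human expert and is assumed to return the class label of X[i] (labels 1, ..., m, with 1 the majority class), and the rounding of ki = |S| * pi is also our assumption.

import numpy as np

def radar(X, priors, oracle, w=1.0):
    # priors = [p_2, ..., p_m]; returns the queried indices and their labels
    n, m = X.shape[0], len(priors) + 1
    k_i = {c: max(1, int(round(n * p))) for c, p in zip(range(2, m + 1), priors)}
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    kdist = {c: np.sort(D, axis=1)[:, k - 1] for c, k in k_i.items()}
    r = {c: kdist[c].min() for c in k_i}          # Step 5
    r[1] = min(r.values())                        # Step 7
    exempt = np.zeros(n, dtype=bool)              # exemption list EL (Step 8)
    labels, discovered = {}, set()
    while len(discovered) < m:
        k = min(k_i[c] for c in k_i if c not in discovered)      # Step 10
        BD, significant, _ = boundary_degree(X, k, w)
        BD = np.where(significant, BD, -np.inf)                  # Step 11
        for _ in range(n):                                       # Step 12
            cand = np.where(~exempt)[0]
            if cand.size == 0:
                return labels
            q = int(cand[np.argmax(BD[cand])])                   # Step 14
            y = oracle(q)
            labels[q] = y
            exempt[q] = True
            exempt |= D[q] <= r[y]                               # Step 13
            if y not in discovered:                              # Step 15
                discovered.add(y)
                break
    return labels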

4 Performance Evaluation
In this section, we compare RADAR with NNDM (the density-based method proposed in [7]), HMS (the clustering-based method proposed in [10]) and random sampling (RS) on both synthetic and real data sets. For RS, we run the experiments 50 times and take the average numbers of queries as the results.

4.1 Synthetic Data Sets

In this subsection, we compare RADAR with NNDM, HMS and RS on 3 syn-
thetic data sets.
In Fig. 2(a) and Fig. 2(b), the first two synthetic data sets contain the same
majority class, which has 1000 examples (green points) with Gaussian distribu-
tion. Each minority class (red points) in Fig. 2(a) has 20 examples, which fall
inside a hyper-ball of radius 5. In Fig. 2(b), we double the distances between the
20 examples of each minority class (red points). Then, each minority class in the
second data set falls inside a hyper-ball of radius 10. Obviously, the densities of
minority classes in Fig. 2(a) are about 4 times higher than that in Fig. 2(b).
The corresponding comparison results are illustrated in Fig. 3(a) and Fig. 3(b), respectively. The curves in Fig. 3 show the number of discovered classes as a function of the number of selected examples, which is equal to the number of queries to the user. To discover all the classes in the first two data sets, RS needs

(a) High-density minority classes (b) Low-density minority classes

Fig. 2. Synthetic data sets

5 5
Number of Classes Discovered

Number of Classes Discovered


4 4

3 3

RS
2 2 RS
HMS
HMS
NNDM NNDM
1 1
RADAR RADAR

0 0
0 20 40 60 80 100 0 20 40 60 80 100
Number of Selected Examples Number of Selected Examples

(a) Results of the high-density case (b) Results of the low-density case

Fig. 3. Comparison results on synthetic data sets

101 and 100 queries respectively; HMS needs 62 and 89 queries respectively; NNDM needs 10 and 31 queries respectively; RADAR needs 8 and 10 queries respectively. From these results we can see that the performance of NNDM and HMS is dramatically affected by the local densities of the rare classes. By contrast, RADAR and RS are much less sensitive to these local densities. Furthermore, our approach is much more sophisticated than the straightforward RS method, and has a high efficiency in finding new classes.
The third synthetic data set in Fig. 4(a) is a multi-density data set. The
majority class has 1000 examples (green points) with Gaussian distribution.
Each minority class (red points) has 20 examples and a different density from
each other. The comparison results are shown in Fig. 4(b). From this figure, we
can learn that the performance of RADAR is better than NNDM, HMS and RS
in this multi-density data set. To find all the classes, RS needs 103 queries; HMS
needs 343 queries; NNDM needs 55 queries; RADAR needs 17 queries.

(a) Data set (b) Results

Fig. 4. Experiment on synthetic multi-density data set. The result chart plots the number of classes discovered against the number of selected examples for RS, HMS, NNDM and RADAR.

4.2 Real Data Sets

In this section, we compare RADAR with NNDM, HMS, and RS on 4 real data
sets from the UCI data repository [1]: the Abalone, Statlog, Wine Quality and
Yeast data sets. The properties of these data sets are summarized in Table 2.
In addition, the Statlog is sub-sampled because the original Image Segmenta-
tion (Statlog) data set contains almost the same number of examples for each
class. The sub-sampling can create an imbalanced data set which suits the rare
category detection scenario. With the sub-sampling, the largest class in Statlog
contains 256 examples; the examples of the next class are half as many as that

Table 2. Properties of the real data sets

Data Set Records Dimension Classes Largest Class Smallest Class
Abalone 4177 7 20 16.5% 0.34%
Statlog 512 19 7 50% 1.5%
Wine Quality 4898 11 6 44.88% 0.41%
Yeast 1484 8 10 31.68% 0.33%

(a) Abalone (b) Statlog (c) Wine Quality (d) Yeast

Fig. 5. Local densities of the minority classes in real data sets. Each panel plots the local-density value of each minority class in the corresponding data set.

Table 3. Number of queries needed to find out all classes for each algorithm

Data Set RS Algorithm HMS Algorithm NNDM Algorithm RADAR Algorithm
Abalone 462 539 146 131
Statlog 94 33 63 28
Wine Quality 197 51 - 20
Yeast 261 91 124 112

(a) Abalone (b) Statlog (c) Wine Quality (d) Yeast

Fig. 6. Comparison results on real data sets. Each chart plots the number of classes discovered against the number of selected examples for RS, HMS, NNDM and RADAR.

of the former one; the smallest classes all have 8 examples. The results are sum-
marized in Table 3. The mark ’-’ indicates that the algorithm cannot find out
all classes in the data set.
These real data sets are multi-density data sets. To estimate the local density of each minority class, we adopt a measurement of the local density of a data point. We first calculate the average distance between a data point and its k-nearest neighbors. Next, we multiply the reciprocal of this average distance by the global maximum distance between the points. The product is roughly in proportion to the local density of the data point. Finally, we calculate the average value of the products for each minority class and take this value as the
local-density metric. For the sake of generalization and convenience, we set
k = min{ki | 2 ≤ i ≤ m}. Fig. 5 shows the local-density values of the mi-
nority classes in real each data set. The standard deviation of these local-density
values is 55.32 in Abalone, 20.47 in Statlog, 28.03 in Wine Quality and 3.99 in
Yeast. Therefore, the Abalone data set is more ”extreme” on the change of local
densities than the other data sets. By contrast, the Yeast data set is the most
”moderate” one.
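The local-density measurement used for Fig. 5 can be written down directly. The sketch below is our own reading of the description above; the function name is ours, and labels is assumed to use 1 for the majority class, as in Section 2.

import numpy as np

def minority_class_densities(X, labels, k):
    # local density of a point ~ (global max pairwise distance) /
    #                            (average distance to its k nearest neighbours)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    gmax = D.max()
    np.fill_diagonal(D, np.inf)
    avg_knn = np.sort(D, axis=1)[:, :k].mean(axis=1)
    density = gmax / avg_knn
    return {c: float(density[labels == c].mean())
            for c in np.unique(labels) if c != 1}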
Fig. 6 illustrates the comparison results on the 4 real data sets in detail. From this figure, we can learn that RADAR effectively reduces the number of queries to human experts on each data set. It takes the fewest queries to discover all classes in Abalone, Statlog and Wine Quality. In Yeast, which is the most ”moderate” data set, the HMS method has the highest efficiency in finding new classes. But in Abalone, which is the most ”extreme” data set, HMS is not as good as RADAR, NNDM or even the RS method. Furthermore, in Wine Quality, the NNDM method falls short in performance and finds only 5 classes.
In summary, RADAR has a high efficiency on finding new classes, and is more
suitable for processing multi-density data because of its stability.

5 Conclusion

We have proposed a novel approach (RADAR) for rare category detection. Compared with existing algorithms, RADAR is a density insensitive method based on reverse k-nearest neighbors (RkNN). In this paper, the boundary degree of each point is measured by the variation of its RkNN. Data points with high boundary degrees are selected for class-label querying. Experimental results on both synthetic and real-world data sets demonstrate that the number of queries is dramatically decreased by using our approach. Moreover, RADAR has another attractive property: it is more suitable for handling multi-density data sets. Future work involves adopting a technique for parameter automation to set the parameter w and adapting our approach to the prior-free case.

References

1. Asuncion, A., Newman, D.J.: UCI machine learning repository (2007)
2. Pelleg, D., Moore, A.W.: Active learning for anomaly and rare-category detection.
In: Proc. NIPS 2004, pp. 1073–1080. MIT Press, Boston (2004)
3. Bay, S., Kumaraswamy, K., Anderle, M., Kumar, R., Steier, D.: Large scale detec-
tion of irregularities in accounting data. In: ICDM 2006, pp. 75–86 (2006)
4. Stokes, J.W., Platt, J.C., Kravis, J., Shilman, M.: ALADIN: active learning of
anomalies to detect intrusions. Technical report, Microsoft Research (2008)
5. Porter, R., Hush, D., Harvey, N., Theiler, J.: Toward interactive search in remote
sensing imagery. In: Proc. SPIE, Vol. 7709, pp. 77090V–77090V–10 (2010)
6. Xia, C., Hsu, W., Lee, M.L., Ooi, B.C.: BORDER: efficient computation of bound-
ary points. IEEE Trans. on Knowledge and Data Engineering 18(3), 289–303 (2006)

7. He, J., Carbonell, J.: Nearest-neighbor-based active learning for rare category de-
tection. In: Proc. NIPS 2007, pp. 633–640. MIT Press, Boston (2007)
8. He, J., Liu, Y., Lawrence, R.: Graph-based rare category detection. In: Proc. ICDM
2008, pp. 833–838 (2008)
9. He, J., Carbonell, J.: Prior-free rare category detection. In: Proc. SDM 2009, pp.
155–163 (2009)
10. Vatturi, P., Wong, W.: Category detection using hierarchical mean shift. In: Proc.
KDD 2009, pp. 847–856 (2009)
RKOF: Robust Kernel-Based Local Outlier
Detection

Jun Gao1, Weiming Hu1, Zhongfei (Mark) Zhang2, Xiaoqin Zhang3, and Ou Wu1
1
National Laboratory of Pattern Recognition, Institute of Automation,
Chinese Academy of Sciences, Beijing, China
{jgao,wmhu,wuou}@nlpr.ia.ac.cn
2
Dept. of Computer Science, State Univ. of New York at Binghamton,
Binghamton, NY 13902, USA
zhongfei@cs.binghamton.edu
3
College of Mathematics & Information Science, Wenzhou University,
Zhejiang, China
xqzhang@wzu.edu.cn

Abstract. Outlier detection is an important and attractive problem in knowledge discovery in large data sets. The majority of the recent work in outlier detection follows the framework of Local Outlier Factor (LOF), which is based on the density estimate theory. However, LOF
has two disadvantages that restrict its performance in outlier detection.
First, the local density estimate of LOF is not accurate enough to detect
outliers in the complex and large databases. Second, the performance of
LOF depends on the parameter k that determines the scale of the local
neighborhood. Our approach adopts the variable kernel density estimate
to address the first disadvantage and the weighted neighborhood density
estimate to improve the robustness to the variations of the parameter
k, while keeping the same framework with LOF. Besides, we propose a
novel kernel function named the Volcano kernel, which is more suitable
for outlier detection. Experiments on several synthetic and real data
sets demonstrate that our approach not only substantially increases the
detection performance, but also is relatively scalable in large data sets
in comparison to the state-of-the-art outlier detection methods.

Keywords: Outlier detection, Kernel methods, Local density estimate.

1 Introduction

Compared with the other knowledge discovery problems, outlier detection is ar-
guably more valuable and effective in finding rare events and exceptional cases
from the data in many applications such as stock market analysis, intrusion
detection, and medical diagnostics. In general, there are two definitions of the

This work is supported in part by the NSFC (Grant No. 60825204, 60935002 and
60903147) and the US NSF (Grant No. IIS-0812114 and CCF-1017828).


outlier detection: Regression outliers and Hawkins outliers. A Regression outlier is an observation that does not match the predefined metric model of the data of interest [1]. A Hawkins outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism [2]. Compared with Regression outlier detection, Hawkins outlier detection is more challenging because the generative mechanism of the normal data is unknown. In this paper, we focus on unsupervised methods for Hawkins outlier detection. In the rest of this paper, outlier detection refers particularly to Hawkins outlier detection.
Over the past several decades, research on outlier detection has moved from global computation to local analysis, and the description of outliers has shifted from binary interpretations to probabilistic representations. Breunig et al. propose the density-estimation-based Local Outlier Factor (LOF) [4]. This work is so influential that there is now a rich body of literature on local density-based outlier detection. On the one hand, many local density-based methods have been proposed to compute outlier factors, such as the local correlation integral [5], the connectivity-based outlier factor [8], the spatial local outlier measure [9], and the local peculiarity factor [7]. On the other hand, many efforts have been devoted to combining machine learning methods with LOF to accommodate large and high dimensional data [10,14].
Although LOF is widely used in the literature, two major disadvantages restrict its applications. First, since LOF is based on local density estimation theory, it is obvious that the more accurate the density estimate, the better the detection performance. The local reachability density used in LOF is the reciprocal of the average of the reach-distances between the given object and its neighbors. This density estimate is an extension of the nearest neighbor density estimate, which is defined as

$$f(p) = \frac{k}{2n} \cdot \frac{1}{d_k(p)} \qquad (1)$$

Fig. 1. (a) Eruption lengths of 107 eruptions of the Old Faithful geyser. (b) The density of the Old Faithful data based on the nearest neighbor density estimate, redrawn from [3].

where n is the total number of objects, and d_k(p) is the distance between object p and its k-th nearest neighbor. As shown in Fig. 1, the heavy tails of the density function and the discontinuities in its derivative reduce the accuracy of the density estimate. This indicates that, with the LOF method, an outlier may not deviate substantially from the normal objects in complex and large databases. Second, like all other local density-based outlier detection methods, the performance of LOF depends on the parameter k, which is defined as the least number of nearest neighbors in the neighborhood of an object [4]. However, in LOF the value of k is determined based on the average density estimate of the neighborhood, which is statistically vulnerable to the presence of outliers. Hence, it is hard to determine an appropriate value of this parameter that ensures acceptable performance in complex and large databases.
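As a concrete illustration, Equation (1) can be computed in a few lines; the sketch below is ours (NumPy only) and assumes the query point p is one of the rows of a small data matrix D of shape (n, d), with Euclidean distance:

import numpy as np

def nn_density(D, p, k):
    # Nearest-neighbor density estimate f(p) = k / (2 n d_k(p)) from Equation (1).
    dists = np.sort(np.linalg.norm(D - p, axis=1))
    d_k = dists[k]                  # dists[0] == 0 is p itself, so index k is the k-th neighbor
    return k / (2 * len(D) * d_k)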
In order to address these two disadvantages of LOF, we propose a Robust Kernel-based Outlier Factor (RKOF) in this paper. Specifically, the main contributions of our work are as follows:
– We propose a kernel-based outlier detection method that brings the variable kernel density estimate into the computation of outlier factors, in order to achieve a more accurate density estimate. In addition, we propose a new kernel function named the Volcano kernel, which requires a smaller value of the parameter k for outlier detection than other kernels, resulting in less detection time.
– We propose a weighted density estimate of the neighborhood of a given object to make the choice of the parameter k more robust. Furthermore, we demonstrate that this weighted density estimate is superior, for robust outlier detection, to the average density estimate used in LOF.
– We keep the same local density-based outlier detection framework as LOF. As a result, RKOF can be used directly in the extensions of LOF, such as Feature Bagging [10], Top-n outlier detection [14], and Local Kernel Regression [15], and improve the detection performance of these extensions.
The remainder of this paper is organized as follows. Section 2 introduces our
RKOF method with a novel kernel function, named the Volcano kernel, and an-
alyzes the special property of the Volcano kernel. Section 3 shows the robustness
and computational complexity of RKOF. Section 4 reports the experimental
results. Finally, Section 5 concludes the paper.

2 Main Framework
A density-based outlier is detected by comparing its density estimate with the density estimate of its neighborhood [4]. Hence, we first introduce the notions of the local kernel density estimate of an object p and the weighted density estimate of p's neighborhood. Then, we introduce the notion of the robust kernel-based outlier factor of p, which is used to detect outliers. We also analyze the influence of different kernels on the performance of our method, and propose a novel kernel function named the Volcano kernel together with its special property for outlier detection.

To make this paper self-contained, we first introduce the notions of the k-distance of an object p and the k-distance neighborhood of p, which are defined in LOF.

Definition 1. Given a data set D, an object p, and any positive integer k, the k-distance(p) is defined as the distance d(p, o) between p and an object o ∈ D such that:

– for at least k objects o′ ∈ D\{p}, it holds that d(p, o′) ≤ d(p, o);
– for at most k − 1 objects o′ ∈ D\{p}, it holds that d(p, o′) < d(p, o).

Definition 2. Given a data set D, an object p, and any positive integer k, the
k-distance neighborhood of p, named Nk (p), contains every object whose distance
from p is not greater than the k-distance(p), i.e., Nk (p) = {q ∈ D\{p}|d(p, q) ≤
k-distance(p)}, where any such object q is called a k-distance neighbor of p.
|Nk (p)| is the number of the k-distance neighbors of p.
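A minimal sketch of Definitions 1 and 2, assuming a NumPy data matrix D of shape (n, d) and Euclidean distance, is given below; the helper name is ours and purely illustrative:

import numpy as np

def k_distance_and_neighbors(D, p_idx, k):
    # Return k-distance(p) and the indices of the k-distance neighborhood N_k(p).
    dists = np.linalg.norm(D - D[p_idx], axis=1)
    dists[p_idx] = np.inf                     # exclude p itself
    k_dist = np.sort(dists)[k - 1]            # distance to the k-th nearest neighbor
    neighbors = np.where(dists <= k_dist)[0]  # may contain more than k objects when there are ties
    return k_dist, neighbors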

2.1 Robust Kernel-Based Outlier Factor (RKOF)

Let p = [x1, x2, x3, . . . , xd] be an object in the data set D, where d is the number of attributes and |D| is the number of objects in D.

Definition 3. (Local kernel density estimate of object p) The local kernel density estimate of p is defined as

$$kde(p) = \frac{\sum_{o \in N_k(p)} h^{-\gamma} \lambda_o^{-\gamma} \, K\!\left(h^{-1} \lambda_o^{-1} (p - o)\right)}{|N_k(p)|}$$

$$\lambda_o = \{f(o)/g\}^{-\alpha}, \qquad \log g = |D|^{-1} \sum_{q \in D} \log f(q)$$

where h is the smoothing parameter, γ is the sensitivity parameter, K(x) is the multivariate kernel function, and λ_o is the local bandwidth factor. f(x) is a pilot density estimate that satisfies f(x) > 0 for all objects, α is a sensitivity parameter that satisfies 0 ≤ α ≤ 1, and g is the geometric mean of f(x).

kde(p) is an extension of the variable kernel density estimate [3]. It not only retains the adaptive kernel window width, which is allowed to vary from one object to another, but is also computed locally within the k-distance neighborhood of object p. The parameter γ equals the dimension d in the original variable kernel density estimate [3]. For the local kernel density estimate, the larger γ, the more sensitive kde(p). However, high sensitivity of kde(p) is not always a merit for local outlier detection in high dimensional data. For example, if λ_o is always very small for all objects in a sparse, high dimensional data set, (λ_o)^{-γ} tends to infinity, which deprives kde(p) of the capacity to discriminate between outliers and normal data. We therefore give γ a default value of 2 to obtain a balance between sensitivity and robustness.

In this paper, we compute the pilot density f(x) by the approximate nearest neighbor density estimate derived from Equation (1):

$$f(o) = \frac{1}{k\text{-distance}(o)} \qquad (2)$$

Substituting Equation (2) into kde(p) in Definition 3, we obtain Equation (3), where the default values of C and α are 1. In the following experiments, we estimate the local kernel density of object p as

$$kde(p) = \frac{\sum_{o \in N_k(p)} \frac{1}{\left(C \cdot k\text{-distance}(o)^{\alpha}\right)^{2}} \, K\!\left(\frac{p - o}{C \cdot k\text{-distance}(o)^{\alpha}}\right)}{|N_k(p)|}, \qquad C = h \cdot g^{\alpha} \qquad (3)$$
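With the default values C = 1, α = 1 and γ = 2, Equation (3) can be sketched as follows; the function reuses the k_distance_and_neighbors helper from the earlier snippet and accepts any kernel K taking a vector argument:

def kde(D, p_idx, k, kernel, C=1.0, alpha=1.0):
    # Local kernel density estimate of Equation (3) with gamma = 2.
    _, neighbors = k_distance_and_neighbors(D, p_idx, k)
    total = 0.0
    for o in neighbors:
        k_dist_o, _ = k_distance_and_neighbors(D, o, k)  # pilot density f(o) = 1 / k-distance(o)
        bw = C * k_dist_o ** alpha                       # adaptive bandwidth for neighbor o
        total += kernel((D[p_idx] - D[o]) / bw) / bw ** 2
    return total / len(neighbors)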

Definition 4. (Weighted density estimate of object p's neighborhood) The weighted density estimate of p's neighborhood is defined as

$$wde(p) = \frac{\sum_{o \in N_k(p)} \omega_o \cdot kde(o)}{\sum_{o \in N_k(p)} \omega_o}, \qquad \omega_o = \exp\!\left(-\frac{\left(\frac{k\text{-distance}(o)}{\min_k} - 1\right)^{2}}{2\sigma^{2}}\right)$$

where ω_o is the weight of object o in the k-distance neighborhood of object p, σ is the variance with default value 1, and min_k = min_{o ∈ N_k(p)} (k-distance(o)).

In the majority of local density-based methods, the outlier factor is computed as the ratio of the neighborhood's density estimate to the given object's density estimate. The neighborhood's density is generally measured by the average of all the neighbors' local densities. With this estimate, the detection performance is sensitive to the parameter k: the larger the value of k, the larger the scale of the neighborhood, and only when k is large enough that the majority of the neighborhood consists of normal objects do outliers have a chance to be detected. In the weighted neighborhood density estimate, the weight of a neighbor is a monotonically decreasing function of its k-distance, and the neighbor with the smallest k-distance has the largest weight, 1. Compared with the average neighborhood density estimate, the weighted neighborhood density estimate allows outliers to be detected accurately even if the number of outliers in the neighborhood equals the number of normal objects. This means that the interval of acceptable k values for the weighted neighborhood density estimate is much larger than for the average neighborhood density estimate, and our method is therefore more robust to variations of the parameter k.

Definition 5. (Robust kernel-based outlier factor of object p) The robust kernel-based outlier factor of p is defined as

$$RKOF(p) = \frac{wde(p)}{kde(p)}$$

where wde(p) is the density estimate of the k-distance neighborhood of p, and kde(p) is the local density estimate of p.

RKOF is computed by dividing the weighted density estimate of the neighborhood of the given object by its local kernel density estimate. The larger the RKOF value, the more likely the given object is an outlier: the smaller the object p's local kernel density and the larger the weighted density of its neighborhood, the larger its outlier factor.
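Putting Definitions 4 and 5 together, the outlier factor can be sketched as below; this is an unoptimized illustration (σ defaults to 1 as in the paper) rather than the indexed implementation analyzed in Section 3. Objects would then be ranked by their RKOF values and the top-n reported as outliers.

import numpy as np

def rkof(D, p_idx, k, kernel, sigma=1.0):
    # Robust kernel-based outlier factor RKOF(p) = wde(p) / kde(p).
    _, neighbors = k_distance_and_neighbors(D, p_idx, k)
    k_dists = np.array([k_distance_and_neighbors(D, o, k)[0] for o in neighbors])
    min_k = k_dists.min()
    weights = np.exp(-((k_dists / min_k - 1.0) ** 2) / (2.0 * sigma ** 2))
    neighbor_kdes = np.array([kde(D, o, k, kernel) for o in neighbors])
    wde_p = np.sum(weights * neighbor_kdes) / np.sum(weights)  # weighted neighborhood density
    return wde_p / kde(D, p_idx, k, kernel)                    # larger values indicate outliers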

2.2 Choice of Kernel Functions

In the LOF method, the outlier factors of most objects inside a cluster are approximately equal to 1, while the outlier factors of most outliers isolated from a cluster are much larger than 1 [4]. This property makes it easy to distinguish between outliers and normal objects.
The multivariate Gaussian and Epanechnikov kernel functions are commonly used in kernel density estimation and are defined as follows:

$$K(x) = (2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\|x\|^{2}\right) \qquad (4)$$

$$K(x) = \begin{cases} (3/4)^{d}\,(1 - \|x\|^{2}), & \text{if } \|x\| \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

where ‖x‖ denotes the norm of the vector x and is used to compute distances between objects.
Our RKOF method with the Gaussian kernel cannot ensure that the outlier factors of normal objects in a cluster are approximately equal to 1, so a threshold value for the outlier factors would additionally have to be determined. The Epanechnikov kernel equals zero when ‖x‖ is larger than 1; hence, for most outliers and normal objects lying on the border of clusters, the outlier factors become infinite.
In order to achieve the same property with LOF, we define a novel kernel
function called the Volcano kernel as follows:

Definition 6. The Volcano kernel is defined as



β, if x ≤ 1
K(x) =
βg(x), otherwise

where β assures that K(x) integrates to one, and g(x) is a monotonically de-
creasing function, lying in a close interval [0, 1] and equal to zero at the infinity.
Unless otherwise specified, we use g(x) = e−|x|+1 as the default function for our
experiments.
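Definition 6 translates directly into code. In the sketch below the normalizing constant β is simply set to 1; since RKOF is a ratio of two density estimates built from the same kernel, this constant cancels and does not affect the ranking.

import numpy as np

def volcano_kernel(x, beta=1.0):
    # Volcano kernel with the default g(x) = exp(-|x| + 1).
    r = np.linalg.norm(np.atleast_1d(x))
    return beta if r <= 1.0 else beta * np.exp(-r + 1.0)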

Fig. 2 shows the curve of the Volcano kernel for univariate data. When ‖x‖ is not larger than 1, the kernel value equals the constant β, which makes the outlier factors of objects deep inside a cluster approximately equal to 1. When ‖x‖ is larger than 1, the kernel value is a monotonically decreasing

Fig. 2. The curve of the Volcano kernel for the univariate data

function of ‖x‖ and is less than 1. This not only keeps the outlier factors continuous and finite, but also makes the outlier factors of outliers much larger than 1. Hence, the RKOF method with the Volcano kernel can capture outliers more easily and rank all objects according to their RKOF values.

3 Robustness and Computational Complexity of RKOF

In this section, we first analyze the robustness of RKOF to the parameter k, and then analyze the computational complexity of RKOF in detail.
Compared with the average neighborhood density estimate used in LOF, the
weighted neighborhood density estimate defined in Definition 4 is more robust to
the parameter k and it substantially helps improve the detection performance.
As shown in Theorem 1, if the weighted neighborhood density estimate replaces
the average neighborhood density estimate in the computation of outlier factors,
any local density-based outlier detection method following the framework of LOF
can be less sensitive to the parameter k.
Theorem 1. Let N_k(p) be the neighborhood of object p, and let p be an outlier in a data set D. Let r be the proportion of outliers in N_k(p). Suppose that these outliers have the same local density estimate (DE) α and k-distance α′ as p. Also suppose that the normal data in N_k(p) have the same local density estimate β and k-distance β′, with α < β and α′ > β′. The outlier factor (OF) can be computed by any local density-based outlier detection method that follows the framework of LOF. Then, for the average density estimate, it holds that

$$OF(p) = r + (1 - r)\rho$$

For the weighted density estimate, it holds that

$$OF(p) = \frac{(\rho - \omega) r - \rho}{(1 - \omega) r - 1}$$

where ρ = β/α and ω is the weight of an outlier in N_k(p).
Proof: For the average density estimate:

$$OF(p) = \frac{\sum_{o \in N_k(p)} DE(o)}{|N_k(p)|\, DE(p)} = \frac{r |N_k(p)| \alpha + (1 - r) |N_k(p)| \beta}{|N_k(p)| \alpha} = r + (1 - r)\rho$$

Fig. 3. The curves of OF (p) for the average and the weighted density estimates

For the weighted density estimate: let ω_{o_i} and ω_{o_j} be the weights of an outlier and a normal object, respectively. According to Definition 4, ω_{o_i} < 1 and ω_{o_j} = 1 because α′ > β′. Replacing ω_{o_i} with ω, we have

$$OF(p) = \frac{\sum_{o \in N_k(p)} \omega_o \cdot DE(o)}{DE(p) \sum_{o \in N_k(p)} \omega_o} = \frac{r |N_k(p)| \omega \alpha + (1 - r) |N_k(p)| \beta}{\alpha \left( r |N_k(p)| \omega + (1 - r) |N_k(p)| \right)} = \frac{r\omega + (1 - r)\rho}{r\omega + (1 - r)} = \frac{(\rho - \omega) r - \rho}{(1 - \omega) r - 1}$$

According to Theorem 1, OF(p) is a function of the parameter r when ρ is fixed, and r is determined by the parameter k. As shown in Fig. 3, for the average neighborhood density estimate, OF(p) is a monotonically decreasing linear function of r. For the weighted neighborhood density estimate, OF(p) is a nonlinear curve in r. When r ∈ [0, 1], OF(p) of the average method is always much less than that of the weighted method. Fig. 3 also shows that OF(p) of the weighted method is larger than τ% of the maximum of OF(p) when r ∈ [0, τ], where τ depends on ρ and the weights of the outliers in the neighborhood of p. More importantly, OF(p) is approximately constant on the interval [0, τ]. This property indicates that the weighted method makes local outlier detection more robust to variations of the parameter k.
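A quick numerical check of Theorem 1, with illustrative values ρ = 4 and ω = 0.2 (these values are ours, not from the paper), reproduces the behavior plotted in Fig. 3: the averaged factor decays linearly in r, whereas the weighted factor stays close to its maximum over a wide range of r.

import numpy as np

rho, omega = 4.0, 0.2            # illustrative values of rho = beta/alpha and the outlier weight
r = np.linspace(0.0, 0.9, 10)    # proportion of outliers in the neighborhood
of_average = r + (1 - r) * rho
of_weighted = ((rho - omega) * r - rho) / ((1 - omega) * r - 1)
print(np.round(of_average, 2))   # drops steadily towards 1 as r grows
print(np.round(of_weighted, 2))  # stays near rho for small and moderate r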
Since RKOF shares the same framework as LOF, it has the same computational complexity as LOF. To compute the RKOF values for a given k, the RKOF algorithm takes two steps. In the first step, the k-distance neighbors of each object, together with their distances to that object, are found in the data set D of n objects. The computational complexity of this step is O(n log n) when using index structures for k-nn queries, as in LOF [4]. In the second step, the kde(p), wde(p), and RKOF(p) values are computed by scanning the whole data set. Since both kde(p) and wde(p) are computed within the k-distance neighborhood of the given object, the computational complexity of this step is O(nk). Hence, the total

computational complexity of the RKOF algorithm is O(n log n + nk). Clearly, the larger k is, the longer the running time.

4 Experiments
In this section, we evaluate the outlier detection capability of RKOF based on
different kernel functions and compare RKOF with the state-of-the-art outlier
detection methods on several synthetic and real data sets.

Fig. 4. The distributions of the synthetic data sets: (a) Synthetic-1; (b) Synthetic-2.

4.1 Synthetic Data

As shown in Fig. 4, the Synthetic-1 data set consists of 1500 normal objects and 16 outliers with two attributes. The normal objects are distributed in three Gaussian clusters of equal variance, with 500 normal objects in each cluster. Fifteen outliers lie in a dense Gaussian cluster, and the remaining outlier is isolated from the others. The Synthetic-2 data set consists of 500 normal objects distributed uniformly in an annular region, 500 normal objects in a Gaussian cluster, and 20 outliers in two Gaussian clusters.
Table 1 shows the outlier detection results of LOF and RKOF on the Synthetic-1 data set, where σ is the weight parameter in RKOF. The top-16 objects are the sixteen objects with the largest outlier factors in the data set. Obviously, if all top-16 objects are outliers, the

Table 1. Outlier detection for the Synthetic-1 data set: number of outliers in the top-16 objects (coverage)

k    | LOF          | RKOF(σ = 0.1) | RKOF(σ = 1)
26   | 1 (6.25%)    | 15 (93.75%)   | 15 (93.75%)
27   | 2 (12.5%)    | 16 (100%)     | 15 (93.75%)
30   | 4 (25%)      | 16 (100%)     | 15 (93.75%)
31   | 5 (31.25%)   | 16 (100%)     | 16 (100%)
59   | 15 (93.75%)  | 16 (100%)     | 16 (100%)
60   | 16 (100%)    | 16 (100%)     | 16 (100%)
70   | 16 (100%)    | 16 (100%)     | 16 (100%)

Fig. 5. The best performances of RKOF (a: k = 14) and LOF (b: k = 20) on the Synthetic-2 data (top-20).

detection rate is 100% and the false alarm rate is zero. Coverage is the ratio of the number of detected outliers to the 16 total outliers. RKOF(σ = 0.1) identifies all the outliers when k ≥ 27, and RKOF(σ = 1) detects all the outliers when k ≥ 31. Clearly, the parameter σ directly relates to the sensitivity of RKOF in outlier detection. LOF is unable to identify all the outliers until k = 60. Table 1 indicates that the range of usable k values for RKOF is larger than that of LOF, which means that RKOF is less sensitive to the parameter k.
As shown in Fig. 5, RKOF with k = 14 captures all the outliers among the top-20 objects. LOF obtains its best performance with k = 20, with a detection rate of 85%. In contrast to RKOF, LOF cannot detect all the outliers for any value of k. Evidently, the annular cluster and the Gaussian cluster pose an obstacle to the choice of k. This result indicates that RKOF is better suited to complex data sets than LOF.

4.2 Real Data

We compare RKOF with several state-of-the-art methods, including LOF [4],
LDF [6], LPF [7], Feature Bagging [10], Active Learning [11], Bagging [12], and
Boosting [13], on the real data sets. The performance of RKOF with the Gaus-
sian, Epanechnikov, and Volcano kernels is also compared. In the real data sets,
the features of the original data include discrete features and continuous fea-
tures. All the data are processed using the standard text processing techniques
following the original steps of the methods [7,11,10].
These real data sets consist of the KDD Cup 1999, the Mammography data
set, the Ann-thyroid data set, and the Shuttle data set, all of which can be
downloaded from the UCI database except the Mammography data set1 . The
KDD Cup 1999 is a general data set condensed for the intrusion detection re-
search. 60593 normal records and 228 U2R attack records labeled as outliers
are combined as the KDD outlier data set. All the records are described by 34
continuous features and 7 categorical features. The Mammography data set in-
cludes 10923 records labeled 1 as normal data and another 260 records labeled
2 as outliers; all the records consist of 6 continuous features. The Ann-thyroid
data set consists of 73 records labeled 1 as outliers and 3178 records labeled 3
1 We thank Professor Nitesh V. Chawla for providing this data set.

Table 2. The AUC values and the running time (in parentheses) for RKOF and the comparison methods on the real data sets, using the k-d tree method [17]. Since LPF has higher complexity and is unable to complete the data sets in a reasonable time, the accurate running time for LPF is not given in this table.

Method               | KDD             | Mammography   | Ann-thyroid   | Shuttle (average)
RKOF (Volcano)       | 0.962 (1918.1s) | 0.871 (15.8s) | 0.970 (4.9s)  | 0.990 (36.4s)
RKOF (Gaussian)      | 0.961 (2095.2s) | 0.870 (19.8s) | 0.970 (5.2s)  | 0.990 (36.9s)
RKOF (Epanechnikov)  | 0.944 (2363.7s) | 0.855 (48.2s) | 0.965 (13.2s) | 0.993 (36.7s)
LOF                  | 0.610 (2160.1s) | 0.640 (28.8s) | 0.869 (5.9s)  | 0.852 (42.0s)
LDF                  | 0.941 (2214.9s) | 0.824 (36.4s) | 0.943 (7.2s)  | 0.962 (37.1s)
LPF                  | 0.98 (2363.7s)  | 0.87 (48.2s)  | 0.97 (13.2s)  | 0.992 (42.0s)
Bagging              | 0.61 (±0.25)    | 0.74 (±0.07)  | 0.98 (±0.01)  | 0.985 (±0.031)
Boosting             | 0.51 (±0.004)   | 0.56 (±0.02)  | 0.64          | 0.784 (±0.13)
Feature Bagging      | 0.74 (±0.1)     | 0.80 (±0.1)   | 0.869         | 0.839
Active Learning      | 0.94 (±0.04)    | 0.81 (±0.03)  | 0.97 (±0.01)  | 0.999 (±0.0006)

as normal data. There are 21 attributes where 15 attributes are binary and 6
attributes are continuous. The Shuttle data set consists of 11478 records with
label 1, 13 records with label 2, 39 records with label 3, 809 records with label
5, 4 records with label 6, and 2 records with label 7. We divide this data set into
5 subsets: label 2, 3, 5, 6, 7 records vs label 1 records, where the label 1 records
are normal, and others are outliers.
All the comparing outlier detection methods are evaluated using the ROC
curves and the AUC values. The ROC curve represents the trade-off between
the detection rate as y-axis and the false alarm rate as x-axis. The AUC value
is the area under the ROC curve. Clearly, the larger the AUC value, the better the outlier detection method.
The AUC values for RKOF with different kernels and all other comparing
methods are given in Table 2. Also shown in Table 2 are the running time data
for RKOF with different kernels as well as those of the other three local density-
based methods; since the AUC values for other comparing methods are directly
obtained from their publications in the literature, the running time data for
these methods are not available and thus are not included in this table.
From Table 2, we see that the RKOF variants using different kernels achieve similar AUC values on all the data sets, especially the Volcano and Gaussian kernels. The k values giving the best detection performance for the three kernels on all the data sets are shown in Fig. 6(a). Clearly, the k values for the Volcano kernel are always smaller than those of the other kernels, and the k values for the Epanechnikov kernel are the largest of the three. This experiment supports one of the contributions of this work: the proposed Volcano kernel requires the least computation time among the kernels considered.
Fig. 6. (a) The k values with the best performance for different kernels in RKOF. (b) ROC curves for RKOF based on the Volcano kernel on the KDD and Mammography data sets.
Fig. 7. AUC values of RKOF based on the Volcano kernel with different k values for the KDD and Mammography data sets.

This indicates that the choice of kernel in RKOF does not significantly influence the detection performance, but it dramatically changes the minimal k value giving acceptable performance, and consequently the running time.
Fig. 6(b) shows the ROC curves of RKOF based on the Volcano kernel for
the KDD data set (k = 320) and the Mammography data set (k = 110). Fig.
7 shows the AUC values of RKOF based on the Volcano kernel with different
k values for the KDD and Mammography data sets. The AUC values for the
KDD data set are larger than 0.941, when k varies from 280 to 700; the AUC
values for the Mammography data set are larger than 0.824, when k changes
from 40 to 460. Clearly, the detection performance of RKOF for any k in these intervals is better than that of the other comparison methods except LPF. For
the Mammography data set, RKOF is more effective than the other comparing
methods with k = 110, compared with k = 11183 for LPF. For the KDD data
set, RKOF achieves the second best performance with k = 320. The best AUC
value is achieved by LPF, but this AUC value is obtained when k = 13000. The
complexity of RKOF is O(n log n + nk), compared with O(nd log n + ndk) for
LPF, where d is the dimensionality of the data. It is clear that under the same
circumstances LPF takes much longer time than RKOF while the AUC value
of RKOF is very close to this best value. For the Ann-thyroid data set, RKOF

achieves the acceptable performance that is very close to the best performance.
The AUC value of the Shuttle data set is the average AUC of all the five subsets,
where the AUC values of the subsets with the label 5, label 6, and label 7 are
all approximately equal to 1. RKOF also obtains the acceptable performance
that is very close to the best performance for the Shuttle data set. Overall, while
there is no winner for all the cases, RKOF always achieves the best performance
or is close to the best performance in all the data sets with the least running
time. In particular, RKOF achieves the best performance or is close to the best
performance for the KDD and the Mammography data sets with much less
running time, which are the two large data sets of all the four data sets. This
demonstrates the high scalability of the RKOF method in outlier detection. Specifically, in all cases RKOF has less running time than LOF, LDF, and LPF. Although running time data for the other comparison methods are not available, the theoretical complexity analysis makes clear that they would all take longer than RKOF.

5 Conclusions
We have studied the local outlier detection problem in this paper. We have proposed the RKOF method based on the variable kernel density estimate and the weighted density estimate of the neighborhood of an object, which addresses the existing disadvantages of LOF and other density-based methods. We have also proposed a novel kernel function named the Volcano kernel, which is more suitable for outlier detection. Theoretical analysis and empirical evaluations on synthetic and real data sets demonstrate that RKOF is more robust and effective for outlier detection while requiring less computation time.

References
1. Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection. John Wiley and Sons, New York (1987)
2. Hawkins, D.: Identification of Outliers. Chapman and Hall, London (1980)
3. Silverman, B.: Density Estimation for Statistics and Data Analysis. Chapman and
Hall, London (1986)
4. Breunig, M.M., Kriegel, H.-P., Ng, R.T., Sander, J.: Lof: Identifying density-based
local outliers. In: SIGMOD, pp. 93–104 (2000)
5. Papadimitriou, S., Kitagawa, H., Gibbons, P.: Loci: Fast outlier detection using
the local correlation integral. In: ICDE, pp. 315–326 (2003)
6. Latecki, L.J., Lazarevic, A., Pokrajac, D.: Outlier Detection with Kernel Density
Functions. In: Perner, P. (ed.) MLDM 2007. LNCS (LNAI), vol. 4571, pp. 61–75.
Springer, Heidelberg (2007)
7. Yang, J., Zhong, N., Yao, Y., Wang, J.: Local peculiarity factor and its application
in outlier detection. In: KDD, pp. 776–784 (2008)
8. Tang, J., Chen, Z., Fu, A.W.-c., Cheung, D.W.: Enhancing effectiveness of outlier
detections for low density patterns. In: Chen, M.-S., Yu, P.S., Liu, B. (eds.) PAKDD
2002. LNCS (LNAI), vol. 2336, pp. 535–548. Springer, Heidelberg (2002)

9. Sun, P., Chawla, S.: On local spatial outliers. In: KDD, pp. 209–216 (2004)
10. Lazarevic, A., Kumar, V.: Feature bagging for outlier detection. In: KDD, pp.
157–166 (2005)
11. Abe, N., Zadrozny, B., Langford, J.: Outlier detection by active learning. In: KDD,
pp. 504–509 (2006)
12. Breiman, L.: Bagging predictors. J. Machine Learning 24(2), 123–140 (1996)
13. Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and
an application to boosting. J. Comput. Syst. Sci. 55(1), 113–139 (1997)
14. Jin, W., Tung, A., Ha, J.: Mining top-n local outliers in large databases. In: KDD,
pp. 293–298 (2001)
15. Gao, J., Hu, W., Li, W., Zhang, Z.M., Wu, O.: Local Outlier Detection Based on
Kernel Regression. In: ICPR, pp. 585–588 (2010)
16. Barnett, V., Lewis, T.: Outliers in Statistical Data. John Wiley, New York (1994)
17. Bentley, J.L.: Multidimensional binary search trees used for associative searching.
J. Communications of the ACM 18(9), 509–517 (1975)
Chinese Categorization and Novelty Mining

Flora S. Tsai and Yi Zhang

Nanyang Technological University,
School of Electrical & Electronic Engineering,
Singapore
fst1@columbia.edu

Abstract. The categorization and novelty mining of chronologically ordered documents is an important data mining problem. This paper focuses on the entire process of Chinese novelty mining, from preprocessing and categorization to the actual detection of novel information, which has rarely been studied. First, preprocessing techniques for detecting novel Chinese text are discussed and compared. Next, we investigate categorization and novelty mining performance for English and Chinese sentences and also discuss novelty mining performance based on the retrieval results. Moreover, we propose new novelty mining evaluation measures, Novelty-Precision, Novelty-Recall, Novelty-F Score, and Sensitivity, the last of which measures how sensitive the novelty mining system is to incorrectly classified sentences. The results indicate that Chinese novelty mining at the sentence level is similar to English if the sentences are perfectly categorized. Using our new evaluation measures, we can more fairly assess how the performance of novelty mining is influenced by the retrieval results.

Keywords: novelty mining, novelty detection, categorization, retrieval, Chinese, preprocessing, information retrieval.

1 Introduction
The overabundance of information leads to the proliferation of useless and redun-
dant content. Novelty mining (NM) is able to detect useful and novel information
from a chronologically ordered list of relevant documents or sentences. Although
techniques such as dimensionality reduction [15,18], probabilistic models [16,17],
and classification [12] can be used to reduce the data size, novelty mining tech-
niques are preferred since they allow users to quickly get useful information by
filtering away the redundant content.
The process of novelty mining consists of three main steps, (i) preprocessing,
(ii) categorization, and (iii) novelty detection. This paper focuses on all three
steps of novelty mining, which has rarely been explored. In the first step, text
sentences are preprocessed by removing stop words, stemming words to their
root form, and tagging the Parts-of-Speech (POS). In the second step, each in-
coming sentence is classified into its relevant topic bin. In the final step, novelty


detection searches through the time sequence of sentences and retrieves only
those with “novel” information. This paper examines the link between catego-
rization and novelty mining. In this task, we need to identify all novel Chinese
text given groups of relevant sentences. Moreover, we also discuss the sentence
categorization and novelty mining performance based on the retrieval results.
The main contributions of this work are: an investigation of preprocessing techniques for detecting novel Chinese text; a discussion of the POS filtering rule for selecting words to represent a sentence; several experiments comparing novelty mining performance between Chinese and English; the finding that novelty mining performance on Chinese can be as good as on English if the preprocessing precision on Chinese text is increased; the application of a mixed novelty metric that effectively improves Chinese novelty mining performance; and a set of new novelty mining evaluation measures, Novelty-Precision, Novelty-Recall, Novelty-F Score, and Sensitivity, which help users objectively evaluate novelty mining results.
The rest of this paper is organized as follows. The first section gives a brief
overview of related work on detecting novel documents and sentences. The next
section introduces the details of preprocessing steps for English and Chinese.
Next, we describe the categorization algorithm and the mixed metric technique,
which is applied in Chinese novelty mining. Traditional evaluation measures are
described and new novelty evaluation measures for novelty mining are then pro-
posed. Next, the experimental results are reported on the effect of preprocessing
rules on Chinese novelty mining, Chinese novelty mining using mixed metric,
categorization in English and Chinese, and novelty mining based on categoriza-
tion using the old and newly proposed evaluation measures. The final section
summarizes the research contributions and findings.

2 Related Work
Traditional sentence categorization methods use queries from topic information
to evaluate similarity between an incoming sentence and the topic [1]. Then,
each sentence is placed into its category according to the similarity. However,
using queries from the topic information cannot guarantee satisfactory results
since these queries can only provide some limited information. Later works have
emphasized on how to expand the query so as to optimize the retrieval results
[2]. The initial query, which is usually short, can be expanded based on the
explicit user feedback or implicit pseudo-feedback in the target collections and
the external resources, such as Wikipedia, search engines, etc. [2] Moreover, ma-
chine learning algorithms have been applied to sentence categorization that first
transform sentences, which typically are strings of characters, into a representa-
tion suitable for the learning algorithms. Then, different classifiers are chosen to
categorize the sentences to their relevant topic.
Initial studies of novelty mining focused on the detection of novel documents.
A document which is very similar to any of its history documents is regarded as
a redundant document. To serve users better, novel information at the sentence
level can be further highlighted. Therefore, later studies focused on detecting

novel sentences, such as those reported in TREC Novelty Tracks [11], which
compared various novelty metrics [19,21], and integrated different natural lan-
guage techniques [7,14,20,22].
Studies for novelty mining have been conducted on the English and Malay
languages [4,6,8,24]. Novelty mining studies on the Chinese language have been
performed on topic detection and tracking, which identifies and collects relevant
stories on certain topics from a stream of information [25]. However, to the best
of our knowledge, few studies have been reported on the entire process of Chinese
novelty mining, from preprocessing and categorization to the actual detection of
novel information, which is the focus of this paper.

3 Preprocessing for English and Chinese

3.1 English

English preprocessing first removes all stop words, such as conjunctions, prepo-
sitions, and articles. After removing stop words, word stemming is performed,
which reduces the inflected words to their primitive root forms.

3.2 Chinese
Chinese preprocessing first needs to perform lexical analysis since there is no obvious boundary between Chinese words. Chinese word segmentation is a very challenging problem because of the difficulties in defining what constitutes a Chinese word [3]. Furthermore, there are no white spaces between Chinese words or expressions, and there are many ambiguities in the Chinese language; for example, the same character string can be segmented either as 'mainboard/and/server' or, incorrectly, as 'mainboard/kimono/task/utensil'. This ambiguity poses a great challenge for Chinese word segmentation. Moreover, since there are no obvious derived words in Chinese, word stemming cannot be performed.
To reduce the noise from Chinese word segmentation and obtain a better word
list for a sentence, we first apply word segmentation on the Chinese text and
then utilize Part-of-Speech (POS) tagging to select the meaningful candidate
words. We used ICTCLAS for word segmentation and POS tagging because it
achieves higher precision than other Chinese POS tagging software [23].
Two different rules were used to select the candidate words of a sentence.

– Rule1: Non-meaningful words are removed, such as pronouns ('r' in the Chinese POS tagging criteria [9]), auxiliary words ('u'), tone words ('y'), conjunctions ('c'), prepositions ('p'), and punctuation ('w').
– Rule2: Fewer types of words are selected to represent a sentence: nouns (including 'n' for common nouns, 'nr' for person names, 'ns' for location names, 'nt' for organization names, and 'nz' for other proper nouns), verbs ('v'), adjectives ('a'), and adverbs ('d').

As an example, consider a Chinese sentence meaning "There is a picture on the wall". After POS filtering with Rule1, six words remain, tagged 'n', 'v', 'v', 'm' (measure word), 'q' (quantifier), and 'n'. After POS filtering with Rule2, only the four words tagged 'n', 'v', 'v', and 'n' remain. Rule2 therefore removes more of the unimportant words.
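As a concrete illustration of Rule2, the filter below keeps only nouns, verbs, adjectives, and adverbs from (word, tag) pairs such as those produced by a segmenter and POS tagger like ICTCLAS; the helper and its tag handling are our own sketch, not the authors' code.

# Tags follow the PKU criteria cited above: 'n*' nouns (n, nr, ns, nt, nz),
# 'v' verbs, 'a' adjectives, 'd' adverbs.
KEEP_PREFIXES = ("n", "v", "a", "d")

def pos_filter_rule2(tagged_sentence):
    # Keep only the Rule2 word classes from a list of (word, tag) pairs.
    return [word for word, tag in tagged_sentence if tag.startswith(KEEP_PREFIXES)]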

4 Categorization
From the output of the preprocessing steps on English and Chinese languages,
we obtain bags of English and Chinese words. The corresponding term sen-
tence matrix (TSM) can be constructed by counting the term frequency (TF) of
each word. Therefore, each sentence can be conveniently represented by a vector
where the TF value of each word is considered as one feature. Retrieving rel-
evant sentences is traditionally based on computing the similarity between the
representations of the topic and the sentences. The famous Rocchio algorithm
[10] is adopted to categorize the sentences to their topics.
The Rocchio algorithm is popular for two reasons. First, it is computationally efficient for online learning. Second, compared with many other algorithms, it works well empirically, especially at the early stage of adaptive filtering, where the number of training examples is very small.
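A minimal sketch of this categorization step is given below, assuming tokenized sentences and an initial topic query built from the title and description; the weighting constants and function names are illustrative rather than the exact configuration used in the experiments. A sentence is assigned to a topic whenever its score exceeds the categorization threshold.

from collections import Counter
import numpy as np

def tf_vector(tokens, vocab):
    # Term-frequency vector of a tokenized sentence over a fixed vocabulary.
    counts = Counter(tokens)
    return np.array([counts[w] for w in vocab], dtype=float)

def rocchio_relevance(sentence_vec, query_vec, positive_centroid, a=1.0, b=0.75):
    # Cosine similarity between a sentence and a Rocchio-style topic prototype.
    prototype = a * query_vec + b * positive_centroid
    denom = np.linalg.norm(sentence_vec) * np.linalg.norm(prototype)
    return float(sentence_vec @ prototype) / denom if denom else 0.0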

5 Novelty Mining
From the output of preprocessing, a bag of words is obtained, from which the cor-
responding term-sentence matrix (TSM) can be constructed by counting the term
frequency (TF) of each word. The novelty mining system compares the incoming
sentence to its history sentences in this vector space. Since the novelty mining
process is the same for English and Chinese, a novelty mining system designed for
English can also be applied to Chinese.
The novelty of a sentence can be quantitatively measured by a novelty metric
and represented by a novelty score N . The final decision on whether a sentence is
novel depends on whether the novelty score falls above a threshold. The sentence
that is predicted as “novel” will be placed into the history list of sentences.
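The loop just described can be sketched as follows; novelty_score stands for any of the metrics discussed in Section 5.1 and is assumed to return 1 when the history is empty.

def novelty_mining(sentence_vecs, novelty_score, threshold):
    # Sequentially flag novel sentences and grow the history list.
    history, novel_ids = [], []
    for t, vec in enumerate(sentence_vecs):
        score = novelty_score(vec, history)   # novelty score N in [0, 1]
        if score >= threshold:
            novel_ids.append(t)
            history.append(vec)               # only sentences predicted as novel enter the history
    return novel_ids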

5.1 Mixed Metric on Chinese Novelty Mining

The effect of sentence ordering divides novelty metrics into two types, symmetric and asymmetric [21], as summarized in Table 1. In order to leverage the strengths of both, we utilize a technique that measures novelty with a mixture of the two types of metrics [13]. The goal of the mixed metric is to integrate the merits of both types and hence generalize better across different topics. Two major issues in constructing a mixed metric are (i) solving the scaling problem so that the different component metrics are comparable and consistent, and (ii) the combining strategy, which defines how the outputs of the component metrics are fused. By

Table 1. Symmetric vs. Asymmetric Metrics

Symmetric metric: a metric M yields the same result regardless of the ordering of two sentences, i.e. M(si, sj) = M(sj, si). Typical metrics in NM: cosine similarity, Jaccard similarity.
Asymmetric metric: a metric M yields different results depending on the ordering of two sentences, i.e. M(si, sj) ≠ M(sj, si). Typical metrics in NM: new word count, overlap.

normalizing the metrics, novelty scores from all novelty metrics range from 0
(i.e. redundant) to 1 (i.e. totally novel). Therefore, the metrics are both com-
parable and consistent because they have the same range of values. For the
combining strategy, we adopt a new technique of measuring the novelty score
N (st ) of the current sentence st , by combining two types of metrics, as shown in
Equation (1):

$$N(s_t) = \alpha N_{sym}(s_t) + (1 - \alpha) N_{asym}(s_t) \qquad (1)$$

where Nsym is the novelty score using the symmetric metric, Nasym is the novelty
score using the asymmetric metric, and α is the combining parameter ranging
from 0 to 1. The larger the value of α, the heavier the weight for the symmetric
metrics.
The new word count novelty metric is a popular asymmetric metric, which
was proposed for sentence-level novelty mining [1]. The idea of the new word
count novelty metric is to assign the incoming sentence the count of the new
words that have not appeared in its history sentences, as defined in Equation
(2):

$$\mathrm{newWord}(s_t) = |W(s_t)| - \left|W(s_t) \cap \bigcup_{i=1}^{t-1} W(s_i)\right| \qquad (2)$$

where W(s_i) is the set of words in sentence s_i. The values of the new word
count novelty metric for an incoming sentence are non-negative integers such as
0, 1, 2, etc. To normalize the values of the novelty scores into the range of 0 to
1, the new word count novelty metric can be normalized by the total number of
words in the incoming sentence s_t, as below:

$$N_{\mathrm{newWord}}(s_t) = 1 - \frac{\left|W(s_t) \cap \bigcup_{i=1}^{t-1} W(s_i)\right|}{|W(s_t)|} \qquad (3)$$

where the denominator |W(s_t)| is the word count of s_t. This normalized metric, N_newWord, ranges from 0 (i.e., no new words) to 1 (i.e., 100% new words).
In the following experiments using the mixed metric, α is set to 0.75. We chose the cosine metric as the symmetric metric, the new word count defined in Equation (2) as the asymmetric metric, and TF as the term weighting function.
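Under this configuration, the mixed metric of Equations (1)–(3) can be sketched as below; the symmetric score is taken as 1 minus the maximum cosine similarity to the history (a common choice, assumed here rather than stated explicitly above), and non-empty sentences and a non-empty history are assumed.

import numpy as np

def mixed_novelty(tf_vec, words, history_vecs, history_word_sets, alpha=0.75):
    # N(s_t) = alpha * N_sym(s_t) + (1 - alpha) * N_asym(s_t), Equation (1).
    sims = [float(tf_vec @ h) / (np.linalg.norm(tf_vec) * np.linalg.norm(h))
            for h in history_vecs]
    n_sym = 1.0 - max(sims)                                   # symmetric, cosine-based novelty
    seen = set().union(*history_word_sets)
    n_asym = 1.0 - len(set(words) & seen) / len(set(words))   # asymmetric part, Equation (3)
    return alpha * n_sym + (1.0 - alpha) * n_asym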

5.2 Evaluation Measures

Precision (P), recall (R), F Score (F), and precision-recall (PR) curves are used to evaluate novelty mining performance [1]. The larger the area under the PR curve, the better the algorithm. We draw the standard F Score contours [11], which indicate the F Score values obtained when setting precision and recall from 0 to 1 in steps of 0.1; these contours facilitate comparing F Scores along the PR curves.
Based on the P and R of all topics, the average P and average R are obtained by taking the arithmetic mean of these scores over all topics. The average F Score (F) is then obtained as the harmonic mean of the average P and average R.

5.3 Novelty Evaluation Measures

Although Precision, Recall, and F Score measure novelty mining performance well when sentences are correctly categorized, they cannot objectively measure novelty mining performance when there are categorization errors. In order to measure novelty mining performance objectively, we propose a set of new evaluation measures called Novelty Precision (N-Precision), Novelty Recall (N-Recall), and Novelty F Score (N-F Score). They are calculated only on the sentences correctly categorized by our novelty mining system instead of all task-relevant sentences; the incorrectly categorized sentences are removed before the novelty mining evaluation.

$$\text{N-precision} = \frac{NN^{+}}{NN^{+} + NR^{+}} \qquad (4)$$

$$\text{N-recall} = \frac{NN^{+}}{NN^{+} + NN^{-}} \qquad (5)$$

$$\text{N-F} = \frac{2 \times \text{N-precision} \times \text{N-recall}}{\text{N-precision} + \text{N-recall}} \qquad (6)$$
where NR+, NR−, NN+, and NN− correspond to the numbers of sentences that fall into each category (see Table 2).
N-precision, N-recall, and N-F Score do not consider the novelty mining performance on sentences that are wrongly categorized into a topic. In order to measure this aspect, we introduce a further measure called Sensitivity (defined in Equation (7)), which indicates how sensitive the novelty mining system is to irrelevant sentences.

Table 2. Categories for novelty mining evaluation based on categorization

               | Correctly categorized and non-novel | Correctly categorized and novel
Delivered      | NR+                                 | NN+
Not delivered  | NR−                                 | NN−

If Sensitivity is high, most wrongly categorized (irrelevant) sentences are predicted as novel, which produces noise that prevents readers from finding the truly novel information.

$$\text{Sensitivity} = \frac{N}{IC} \qquad (7)$$

where IC is the number of sentences that are incorrectly categorized by the novelty mining system, and N is the number of those wrongly categorized sentences that are predicted as novel by our system.
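The proposed measures translate directly into code; the sketch below assumes the counts of Table 2 and of Equation (7) have already been derived from the gold standard and are non-degenerate.

def novelty_evaluation(nn_plus, nr_plus, nn_minus, novel_irrelevant, incorrectly_categorized):
    # Equations (4)-(7): N-precision, N-recall, N-F Score, and Sensitivity.
    n_precision = nn_plus / (nn_plus + nr_plus)
    n_recall = nn_plus / (nn_plus + nn_minus)
    n_f = 2 * n_precision * n_recall / (n_precision + n_recall)
    sensitivity = novel_irrelevant / incorrectly_categorized   # N / IC
    return n_precision, n_recall, n_f, sensitivity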

6 Experiments and Results

6.1 Dataset

The public dataset from the TREC 2004 Novelty Track [11] is selected as our experimental dataset for sentence-level novelty mining. The TREC 2004 Novelty Track data is developed from the AQUAINT collection, and both relevant and novel sentences are selected by TREC's assessors.
Since the dataset is originally in English, we first translated the data into Chinese. During this process, we investigated machine translation versus manually corrected translation. After comparing the novelty mining results on the machine-translated sentences and the manually corrected sentences, we found only a slight difference (<2%) in precision and F Score, which indicated that machine translation was sufficient for Chinese novelty mining.

6.2 Effect of Preprocessing Rules on Chinese Novelty Mining

In the first experimental study, the focus was on novelty mining rather than relevant sentence categorization. Therefore, our first experiments started with all given relevant Chinese text, from which the novel text should be identified. For the Chinese dataset, we first segmented the sentences into words and then performed POS filtering to acquire the candidate words for the vector space. Based on the vectors of the Chinese text, we calculated the similarities between sentences and predicted the novelty of each sentence in the Chinese and English datasets. An incoming Chinese/English sentence is compared with all of the 1000 system-delivered novel sentences. We used threshold values between 0.05 and 0.95 with an equal step of 0.10, and evaluated the Chinese/English novel text detection performance over this series of novelty score thresholds.
We adopted the two different rules for selecting the candidate words that represent a sentence and investigated the influence of POS filtering on detecting novel Chinese text. Rule1 removes only some non-meaningful words (pronouns, auxiliary words, tone words, conjunctions, prepositions, and punctuation), whereas Rule2 keeps only a few categories of words to represent a sentence.

Based on our experiments, we learn that the Chinese novelty mining per-
formance is better when choosing the stricter rule (Rule2). Thus, POS filter-
ing is necessary for Chinese because just removing some non-meaningful words
(like stop words) may not be sufficient. POS filtering removes the less mean-
ingful words so that each vector can be better represented. Rule2, which keeps
only nouns, verbs, adjectives and adverbs, produces better results for novelty
mining. Therefore, the remaining experiments used Rule2 for preprocessing the
Chinese text.

6.3 Chinese Novelty Mining Using the Mixed Metric

In this section, we compare Chinese novelty mining performance after applying the mixed metric at the sentence level. Novelty mining using cosine similarity with TF term weighting is compared with results using the mixed metric, which blends the cosine similarity with the new word count. Setting the novelty threshold from 0.05 to 0.95 with a step of 0.1, we can draw the PR curves for sentence-level novelty mining. Figure 1 shows the novelty mining results using the mixed metric at the sentence level.
From Figure 1, we learn that Chinese novelty mining performance improves after applying the mixed metric because it effectively utilizes the strengths of both the symmetric and the asymmetric metric.

Fig. 1. PR curves for sentence-level novelty mining on Chinese using the mixed metric on TREC 2004. The grey dashed lines show contours at intervals of 0.1 points of F.

6.4 Categorization in English and Chinese

We also conducted experiments comparing categorization performance on Chinese with that on English for the TREC 2004 Novelty Track. The topic information in the TREC 2004 Novelty Track data comprises a title, a description, and a narrative; we used the topic title and topic description to construct the initial query. Each sentence is compared with the query of each topic, and when the relevance score of a sentence is greater than the categorization threshold, it is categorized as relevant to that topic.
From our experiments, we observe that categorization performance using Rocchio on Chinese is lower than on English. This may be due to the higher preprocessing error rate affecting Chinese categorization. Li [5] also noted that Chinese text categorization results for small categories were much worse than those for English. Reducing the noise in the feature vectors of the Chinese text, which requires better preprocessing of Chinese text, may lead to better results.

6.5 Novelty Mining Based on Categorization

Based on the categorization results, we performed sentence-level novelty mining using our new evaluation measures of Novelty-Precision, Novelty-Recall, and Novelty-F Score. We chose the categorization results obtained with categorization threshold θc = 0.3 and then compared the novelty mining performance on Chinese and English, using the mixed metric and the TF·ISF term weighting function.
Table 3 and Table 4 show the novelty mining performance using the old evaluation measures and our newly proposed measures on English and Chinese, respectively. From these tables, we learn that, on both English and Chinese, N-Precision is close to the precision obtained when the sentences are perfectly categorized. In addition, Sensitivity explains why there is a large difference between the two groups of results: most wrongly categorized sentences are labeled as novel by our system. We note that our novelty mining system is sensitive to irrelevant sentences (Sensitivity ≥ 70%) and is more sensitive to irrelevant sentences on Chinese than on English, which is consistent with the results.
Figure 2 shows the sentence-level novelty mining N-PRF curves on Chinese and English based on the categorization results. The novelty score thresholds were chosen between 0 and 1 with a step of 0.10.
From Figure 2, it can be seen that the novelty mining performance based on the categorization results differs between Chinese and English, and Chinese does not achieve the same performance as English. In both languages, the novelty mining performance on the

Table 3. Comparison of Novelty Mining Performance on English

                                            | Precision   | Recall   | F Score
Original NM on TREC perfect categorization  | 0.479       | 0.911    | 0.615

                                            | N-Precision | N-Recall | N-F Score | Sensitivity
NM on categorization using new evaluation   | 0.4655      | 0.7491   | 0.5469    | 0.7002

Table 4. Comparison of Novelty Mining Performance on Chinese

                                            | Precision   | Recall   | F Score
Original NM on TREC perfect categorization  | 0.467       | 0.889    | 0.599

                                            | N-Precision | N-Recall | N-F Score | Sensitivity
NM on categorization using new evaluation   | 0.3551      | 0.6545   | 0.4414    | 0.8493

Fig. 2. Sentence-level novelty mining on categorization results (mixed metric, cos_tfisf + nwc, α = 0.75): comparison between Chinese and English using PRF curves.

categorization results is degraded because not all the relevant sentences are correctly categorized. The assessors judge the novelty of each sentence only on the correctly identified relevant sentences; therefore, if the categorization of a sentence is incorrect, the subsequent novelty mining performance is adversely affected.

7 Conclusion

This paper studied the entire process of preprocessing, categorization and novelty
mining for detecting novel Chinese text, which were insufficiently addressed in
previous studies. We described the Chinese preprocessing steps when choosing
different Part-of-Speech (POS) filtering rules. We compared the novelty mining
performance between Chinese and English and found that the novelty mining
performance on Chinese can be as good as that on English by increasing the
preprocessing precision on Chinese text.

Then we applied a mixed novelty metric that effectively improved the Chinese
novelty mining performance at the sentence level. Next, we compared the per-
formance of categorization in English and Chinese, and found that Chinese cat-
egorization was influenced by the noise in preprocessing. Finally, we discuss the
categorization and the novelty mining performance based on retrieval results. In
order to objectively evaluate the novelty mining performance, we proposed a set
of new novelty mining evaluation measures, Novelty-Precision, Novelty-Recall,
Novelty-F Score, and Sensitivity. The new evaluation measures can more fairly
assess how the performance of novelty mining is influenced by the categorization
results.

Finding Rare Classes: Adapting Generative and
Discriminative Models in Active Learning

Timothy M. Hospedales, Shaogang Gong, and Tao Xiang

Queen Mary University of London, UK, E1 4NS


{tmh,sgg,txiang}@eecs.qmul.ac.uk

Abstract. Discovering rare categories and classifying new instances of them is an important data mining issue in many fields, but fully super-
vised learning of a rare class classifier is prohibitively costly. There has
therefore been increasing interest both in active discovery: to identify
new classes quickly, and active learning: to train classifiers with mini-
mal supervision. Very few studies have attempted to jointly solve these
two inter-related tasks which occur together in practice. Optimizing both
rare class discovery and classification simultaneously with active learn-
ing is challenging because discovery and classification have conflicting
requirements in query criteria. In this paper we address these issues with
two contributions: a unified active learning model to jointly discover new
categories and learn to classify them; and a classifier combination algo-
rithm that switches generative and discriminative classifiers as learning
progresses. Extensive evaluation on several standard datasets demon-
strates the superiority of our approach over existing methods.

1 Introduction
Many real life problems are characterized by data distributed between vast yet
uninteresting background classes, and small rare classes of interesting instances
which should be identified. In astronomy, the vast majority of sky survey image
content is due to well understood phenomena, and only 0.001% of data is of inter-
est for astronomers to study [12]. In financial transaction monitoring, most are
ordinary but a few unusual ones indicate fraud and regulators would like to find
future instances. Computer network intrusion detection exhibits vast amounts
of normal user traffic, and a very few examples of malicious attacks [16]. Finally,
in computer vision based security surveillance of public spaces, observed activi-
ties are almost always people going about everyday behaviours; only very rarely
is there a dangerous or malicious activity of interest [19]. All of these classifi-
cation problems share two interesting properties: highly unbalanced frequencies
– the vast majority of data occurs in one or more background classes, while
the instances of interest for classification are much rarer; and unbalanced prior
supervision – the majority classes are typically known a priori, while the rare
classes are not. Classifying rare event instances rather than merely detecting any
rare event is crucial because different classes may warrant different responses,
for example due to different severity levels. In order to discover and learn to


classify the interesting rare classes, exhaustive labeling of a large dataset would
be required to ensure sufficient rare class coverage. However this is prohibitively
expensive when generating each label requires significant time of a human ex-
pert. Active learning strategies might be used to discover or train a classifier
with minimal label cost, but this is complicated by the dependence of classifier
learning on discovery: one needs examples of each class to train a classifier.
The problem of joint discovery and classification has received little atten-
tion despite its importance and broad relevance. The only existing attempt to
address this is based on simply applying schemes for discovery and classifier
learning sequentially or in fixed iteration [16]. Methods which treat discovery
and classification independently perform poorly due to making inefficient use
of data, (e.g., spending time on classifier learning is useless if the right classes
have not been discovered and vice-versa). Achieving the optimal balance is crit-
ical, but non-trivial given the conflict between discovery and learning criteria.
To address this, we build a generative-discriminative model pair [11,4] for com-
puting discovery and learning query criteria, and adaptively balance their use
based on joint discovery and classification performance. Depending on the ac-
tual supervision cost and sparsity of rare class examples, the quantity of labeled
data varies. Given the nature of data dependence in generative and discrimi-
native models [11], the ideal classifier also varies. As a second contribution, we
therefore address robustness to label quantity and introduce a classifier switch-
ing algorithm to optimize performance as data is accumulated. The result is a
framework which significantly and consistently outperforms existing methods at
the important task of discovery and classification of rare classes.

Related Work. A common unsupervised approach to rare class detection is outlier detection: building an unconditional model of the data and flagging unlikely
instances. This has a few serious limitations: it does not classify; it fails with
non-separable data, where interesting classes are embedded in the majority dis-
tribution; and it does not exploit any supervision about flagged outliers, limiting
its accuracy – especially in distinguishing rare classes from noise.
Iterative active learning approaches are often used to learn a classifier with
minimal supervision [14]. Much of the active learning literature is concerned
with the relative merits of different query criteria. For example, querying points
that: are most uncertain [14]; reduce the version space [17]; or reduce direct
approximations of the generalization error [13]. Different criteria may be suited to
different datasets, e.g., uncertainty criteria are good for refining decision boundaries,
but can be fatal if the classes are non-separable (the most uncertain points may
be hopeless) or highly multi-modal. This has led to attempts to select dataset
specific criteria online [2]. All these approaches rely on classifiers, and do not
generally apply to scenarios in which the target classes are themselves unknown.
Recently, active learning has been applied to discovering rare classes using
e.g., likelihood [12] or gradient [9] criteria. Solving discovery and classification
problems together with active learning is challenging because for a single dataset,
good discovery and classification criteria are often completely different. Consider
the toy scenarios in Figure 1. Here the color indicates the true class, and the

symbol indicates the estimated class based on two initial labeled points (large
symbols). The black line indicates the initial decision boundary. In Figure 1(a) all
classes are known but the decision boundary needs refining. Likelihood sampling
(most unlikely point under the learned model) inefficiently builds a model of the
whole space (choosing first the points labeled L), while uncertainty sampling
selects points closest to the boundary (U symbols), leading to efficient refine-
ment. In Figure 1(b) only two classes are known. Uncertainty inefficiently queries
around the known decision boundary (choosing first the points U) without dis-
covering the new classes above. In contrast, these are the first places queried
by likelihood sampling (L symbols). Evidently, single-criterion approaches are
insufficient. Moreover, multiple criteria may be necessary for a single dataset at
different stages of learning, e.g., likelihood to detect new classes and uncertainty
to learn to classify them. A simple but inefficient approach [16] is to simply iterate
over criteria in fixed proportion. In contrast, our innovation is to adapt criteria
online so as to select the right strategy at each stage of learning, which can dra-
matically increase efficiency. Typically, “exploration” is automatically preferred
while there are easily discoverable classes, and “exploitation” to refine decision
boundaries when most classes have been discovered. This ultimately results in
better rare class detection performance than single objective, or non-adaptive
methods [16].

Fig. 1. Sample Problems. (a) All classes known, decision boundary needs refinement; (b) only two classes known. L and U mark the points queried first by likelihood and uncertainty sampling, respectively.

Finally, there is the issue of what base classifier to use in the active learning
algorithm of choice. One can categorize classifiers into two broad categories: gen-
erative and discriminative. Discriminative models directly learn p(y|x) for class
y and data x. Generative models learn p(x, y) and compute p(y|x) via Bayes
rule. The importance of this for active learning is that for a given generative-
discriminative pair (in the sense of equivalent parametric form – such as naive
Bayes & logistic regression), generative classifiers typically perform better with
few training examples, while discriminative models are better asymptotically
[11]. The ideal classifier is therefore likely to be completely different early and
late in the active learning process. An automatic way to select the right classi-
fier online as more labels are obtained is therefore key. Existing active learning
work focuses on single generative [13] or discriminative [17] classifiers. We intro-
duce a novel algorithm to switch classifiers online as the active learning process
progresses in order to get the best of both worlds.

2 Adaptive Active Learning


2.1 Active Learning
In this paper we deal with pool-based uncertainty sampling and likelihood sam-
pling because of their computational efficiency and clearly complementary na-
ture. Our method can nevertheless be easily generalized to other criteria. We
consider a classification problem starting with many unlabeled instances U =
(x1 , .., xn ) and a small set of labeled instances L = ((x1 , y1 ), .., (xm , ym )). L
does not include the full set of possible labels Y in advance. We wish to learn
the posterior conditional distribution p(y|x) so as to accurately classify the data
in U. Active learning proceeds by iteratively: i) training a classifier C on L; ii) us-
ing query function Q(C, L, U) → i∗ to select unlabeled instances i∗ to be labeled
and iii) removing xi∗ from U and adding (xi∗ , yi∗ ) to L.

Query Criteria. Perhaps the most commonly applied query criteria are uncer-
tainty sampling and variants [14]. The intuition is that if the current classification
of a point is highly uncertain, it should be informative to label. Uncertainty is
typically quantified by posterior entropy, which for binary classification reduces
to selecting the point whose posterior is closest to p(y|x) = 0.5. The posterior
p(y|x) of every point in U is evaluated and the uncertain points queried,

$p_u(i) \propto \exp\!\Big(-\beta\sum_{y_i} p(y_i\,|\,x_i)\,\log p(y_i\,|\,x_i)\Big)$  (1)

Rather than selecting a single maximum, a normalized degree of preference pu(i)
for every point i can be expressed by putting the entropy into a
Gibbs function (1). For non-probabilistic SVM classifiers, an approximation to
p(y|x) can be derived from the distance to the margin from each point [14].
A complementary query criterion is that of low likelihood p(x|y). Such points
are badly explained by the current model, and should therefore be informative
to label [12]. This may involve marginalizing over or selecting the most likely class,

$p_l(i) \propto \exp\!\big(-\beta\,\max_{y_i} p(x_i\,|\,y_i)\big)$  (2)

The uncertainty measure in (1) is in spirit discriminative (in focusing on decision boundaries), although p(y|x) can obviously be realized by a generative
classifier. In contrast, the likelihood measure in (2) is intrinsically generative,
in that it requires a density model of each class y, rather than just the deci-
sion boundary. The uncertainty measure is generally unsuitable for finding new
classes, as it focuses on known decision boundaries, and the likelihood measure
is good at finding new classes, while being poorer at refining decision boundaries
between known classes (Figure 1). Note that the likelihood measure can still be
useful to improve known-class classification if the classes are multi-modal – it
will explore different modes. Our adaptation method will allow it to be used in
both ways. Next, we discuss specific parametric forms for our models.
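For illustration only (this sketch is not from the original work), the two preferences in (1) and (2) could be computed from arrays of posteriors p(y|x) and class-conditional likelihoods p(x|y); the function names, the array layout and the max-subtraction stability trick are assumptions of this sketch.

import numpy as np

def gibbs_normalise(scores, beta=100.0):
    # Turn raw scores into a normalised preference distribution exp(beta*score)/Z,
    # with the usual max-subtraction for numerical stability.
    z = beta * np.asarray(scores, dtype=float)
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def uncertainty_preference(posteriors, beta=100.0):
    # Eq. (1): prefer points whose posterior p(y|x) has high entropy.
    # posteriors: array of shape (n_points, n_classes), rows summing to 1.
    p = np.clip(posteriors, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return gibbs_normalise(entropy, beta)

def likelihood_preference(likelihoods, beta=100.0):
    # Eq. (2): prefer points with low likelihood max_y p(x|y) under every known class.
    return gibbs_normalise(-likelihoods.max(axis=1), beta)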

2.2 Generative-Discriminative Model Pairs


We use a Gaussian mixture model (GMM) for the generative model and a sup-
port vector machine (SVM) for the discriminative model. These were chosen
because they may both be incrementally trained (for active learning efficiency),
and they are a complementary generative-discriminative pair in that (assuming
a radial basis SVM kernel) they have equivalent classes of decision boundaries
[4], but are optimized with very different criteria during learning.

Incremental GMM Estimation. For online GMM learning, we use the incremental agglomerative algorithm from [15]. To summarize the procedure, for the first
n = 1..N training points observed with the same label y, {x_n, y}_{n=1}^{N}, we incrementally build a model p(x|y) for y using kernel density estimation with Gaussian
kernels N(x_n, Σ) and weight ω_n = 1/n; d is the dimension of the data x:

$p(x\,|\,y) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\sum_{n=1}^{N}\omega_n \exp\!\Big(-\frac{1}{2}(x-x_n)^{T}\Sigma^{-1}(x-x_n)\Big)$  (3)

To bound the complexity, once some maximal number of Gaussians Nmax is reached, two existing Gaussians i and j are merged by moment matching [7]:

$\omega_{(i+j)} = \omega_i + \omega_j, \qquad \mu_{(i+j)} = \frac{\omega_i}{\omega_{(i+j)}}\mu_i + \frac{\omega_j}{\omega_{(i+j)}}\mu_j$  (4)

$\Sigma_{(i+j)} = \frac{\omega_i}{\omega_{(i+j)}}\Big(\Sigma_i + (\mu_i-\mu_{(i+j)})(\mu_i-\mu_{(i+j)})^{T}\Big) + \frac{\omega_j}{\omega_{(i+j)}}\Big(\Sigma_j + (\mu_j-\mu_{(i+j)})(\mu_j-\mu_{(i+j)})^{T}\Big)$  (5)
The components to merge are chosen by selecting the pair of Gaussian kernels
(G_i, G_j) whose replacement G_{(i+j)} is most similar to them in terms of the
Kullback-Leibler divergence. Specifically, we minimize the cost C_{ij},

$C_{ij} = \omega_i\,\mathrm{KL}(G_i\,\|\,G_{(i+j)}) + \omega_j\,\mathrm{KL}(G_j\,\|\,G_{(i+j)})$  (6)
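A minimal sketch, assuming full-rank covariances and the standard closed-form Gaussian KL divergence, of how the moment-matching merge (4)-(5) and the merge cost (6) could be evaluated; it is an illustration, not the authors' implementation.

import numpy as np

def merge_gaussians(w_i, mu_i, S_i, w_j, mu_j, S_j):
    # Moment-matching merge of two weighted Gaussian kernels, Eqs. (4)-(5).
    w = w_i + w_j
    mu = (w_i * mu_i + w_j * mu_j) / w
    d_i, d_j = mu_i - mu, mu_j - mu
    S = (w_i * (S_i + np.outer(d_i, d_i)) + w_j * (S_j + np.outer(d_j, d_j))) / w
    return w, mu, S

def kl_gauss(mu0, S0, mu1, S1):
    # KL( N(mu0,S0) || N(mu1,S1) ) in closed form; assumes full-rank covariances.
    d = mu0.shape[0]
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def merge_cost(w_i, mu_i, S_i, w_j, mu_j, S_j):
    # Eq. (6): weighted KL of each component to its moment-matched replacement.
    _, mu, S = merge_gaussians(w_i, mu_i, S_i, w_j, mu_j, S_j)
    return w_i * kl_gauss(mu_i, S_i, mu, S) + w_j * kl_gauss(mu_j, S_j, mu, S)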
Importantly for iterative active learning online, merging Gaussians and updating
the cost matrix requires constant O(Nmax ) computation every iteration once the
initial cost matrix is built. In contrast, learning a GMM with latent variables
requires multiple expensive O(n) expectation-maximization iterations [12]. The
initial covariance Σ is assumed uniform diagonal, Σ = Iσ², and is estimated a
priori by leave-one-out cross-validation on the (large) unlabeled set U:

$\hat{\sigma} = \arg\max_{\sigma}\Big(\sum_{n\in U}\ \sum_{x\neq x_n}\sigma^{-\frac{d}{2}}\exp\!\big(-\tfrac{1}{2\sigma^{2}}(x-x_n)^{2}\big)\Big)$  (7)

Given the learned models p(x|y), we can classify ŷ ← fgmm (x), where


$f_{gmm}(x) = \arg\max_y p(y\,|\,x), \qquad p(y\,|\,x) \propto \sum_i w_i\,\mathcal{N}(x;\mu_{i,y},\Sigma_{i,y})\,p(y)$  (8)

SVM. We use a standard SVM approach with RBF kernels, treating multi-class
classification as a set of 1-vs-1 decisions, for which the decision rule [4] is given
(by an equivalent form to (8)) as

$f_{svm}(x) = \arg\max_y\Big(\sum_{v_i\in SV_y}\alpha_{ki}\,\mathcal{N}(x;v_i)+\alpha_{k0}\Big)$  (9)

and p(y|x) can be computed based on the binary posterior estimates [18].
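As a possible concrete stand-in (not the setup used in the paper), an RBF-kernel SVM with 1-vs-1 decisions and pairwise-coupled probability estimates in the spirit of [18] is available in scikit-learn; the toy data and hyperparameter values below are arbitrary assumptions.

import numpy as np
from sklearn.svm import SVC

# toy two-class data standing in for the labeled set L
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(2.0, 0.3, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

# RBF-kernel SVM; multi-class problems are handled internally as 1-vs-1 decisions,
# and probability=True turns on pairwise-coupled posterior estimates
clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / X.shape[1], probability=True)
clf.fit(X, y)

posteriors = clf.predict_proba(X)   # p(y|x), usable for the uncertainty criterion (1)
labels = clf.predict(X)             # hard decisions, analogous to (9)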

2.3 Combining Active Query Criteria

Given the generative GMM and discriminative SVM models defined in Sec-
tion 2.2, and their respective likelihood and uncertainty query criteria defined
in Section 2.1, our first concern is how to adaptively combine the query criteria
online for discovery and classification. Our algorithm involves probabilistically
selecting a query criteria Qk according to some weights w (k ∼ Multi(w)) and
then sampling the query point from the distribution i∗ ∼ pk (i) ((1) or (2))1 .
The weights w will be adapted based on the discovery and classification perfor-
mance φ of our active learner at each iteration. In an active learning context,
[2] shows that because labels are few and biased, cross-validation is a poor way
to assess classification performance, and suggest the unsupervised measure of
binary classification entropy (CE) on the unlabeled set U instead. This is espe-
cially the case in the rare class context where there is often only one example of
a given class, so cross-validation is not well defined. To overcome this problem,
we generalize CE to multi-class entropy (MCE) of the classifier f (x) and take it
as our indication of classification performance,

$H = -\sum_{y=1}^{n_y} \frac{\sum_i I(f(x_i)=y)}{|U|}\,\log_{n_y}\!\frac{\sum_i I(f(x_i)=y)}{|U|}$  (10)

Here I is the indicator function that returns 1 if its argument is true, and ny
is the number of classes observed so far. Importantly, we explicitly reward the
discovery of new classes to jointly optimize classification and discovery. We define
overall active learning performance φt(i) upon querying point i at time t as,

$\phi_t(i) = \alpha\,I(y_i\notin L) + (1-\alpha)\big((e^{H_t}-e^{H_{t-1}})-(1-e)\big)/(2e-2)$  (11)
1
We choose this method because each criterion has very different “reasons” for its
preference. An alternative is querying a product or mean [2,5,3] of the criteria. That
risks querying a merely moderately unlikely and uncertain point – neither outlying
nor on a decision boundary – which is useless for either classification or discovery.

The first right hand term above rewards discovery of a new class, and the second
term rewards an increase in MCE (as an estimate of classification accuracy) after
labeling point i at time t. The constants (1 − e) and (2e − 2) ensure the second
term lies between 0 and 1. The parameter α is the mixing prior for discovery
vs. classification. Given this performance measure, we define an update for the
future weight wt+1 of each active criterion k,

$w_{t+1,k} \propto \lambda\,w_{t,k} + (1-\lambda)\,\phi_t(i)\,\frac{p_k(i)}{p(i)} + \epsilon$  (12)

Here we define an exponential decay (first term) of the weight in favor of
(second term) the current performance φ weighted by how strongly criterion
k recommended the chosen point i, compared to the joint recommendation
p(i) = Σ_k p_k(i). λ is the forgetting factor. The third term, ε, encourages exploration
by diffusing the weights so every criterion is tried occasionally. In summary, this
approach adaptively selects more frequently those criteria that have been suc-
cessful at discovering new classes and/or increasing MCE, thereby optimizing
both discovery and classification accuracy.
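The following sketch (an illustration under assumed variable names, not the original code) shows one way the multi-class entropy (10), the reward (11) and the weight update (12) could be computed; the final renormalisation of the weights is an added assumption, since (12) only specifies proportionality.

import numpy as np

def multiclass_entropy(pred_labels, n_classes):
    # Eq. (10): entropy (log base n_y) of the classifier's predicted-label
    # distribution on the unlabeled pool U; labels are assumed to be 0..n_classes-1.
    if n_classes < 2:
        return 0.0
    p = np.bincount(pred_labels, minlength=n_classes) / len(pred_labels)
    p = p[p > 0]
    return float(-(p * np.log(p) / np.log(n_classes)).sum())

def reward(new_class_found, H_t, H_prev, alpha=0.5):
    # Eq. (11): reward discovery of a new class and/or an increase in MCE.
    e = np.e
    return (alpha * float(new_class_found)
            + (1.0 - alpha) * ((np.exp(H_t) - np.exp(H_prev)) - (1.0 - e)) / (2.0 * e - 2.0))

def update_weights(w, phi, p_at_i, lam=0.6, eps=0.01):
    # Eq. (12): each criterion's weight decays towards the reward phi, credited in
    # proportion to how strongly that criterion recommended the queried point i;
    # p_at_i[k] = p_k(i), and p(i) = sum_k p_k(i).
    p_at_i = np.asarray(p_at_i, dtype=float)
    w = lam * np.asarray(w, dtype=float) + (1.0 - lam) * phi * (p_at_i / p_at_i.sum()) + eps
    return w / w.sum()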

2.4 Adaptive Selection of Classifiers

As discussed in Section 1, although we broadly expect the generative GMM classifier to have better initial performance, and the discriminative SVM classifier
to have better asymptotic performance, the ideal classifier will vary with dataset
and active learning iteration. The remaining question is how to combine these
classifiers [10] online for best performance given any specific supervision budget.
Cross-validation to determine reliability is infeasible because of lack of data;
however we can again resort to the MCE over the training set U (10). In our ex-
perience, MCE is indeed indicative of generalization performance, but relatively
crudely and non-linearly so. This makes approaches based on MCE weighted
posterior fusion unreliable. We therefore choose a simpler but more reliable ap-
proach which switches the final classifier at the end of each iteration to the one
with higher MCE, aiming to perform as well as the better classifier for any label
budget. Additionally, the process of multi-class posterior estimation for SVMs
[18] requires cross-validation and is inaccurate with limited data. To compute
the uncertainty criterion (1) at each iteration, we therefore use posterior of the
classifier determined to be more reliable by MCE. This ensures that uncertainty
sampling is as accurate as possible in both low and high data contexts.

Summary. Algorithm 1 summarizes our approach. There are four parameters: the
Gibbs parameter β, the discovery vs. classification prior α, the forgetting rate λ and
the exploration rate ε. None of these were tuned; we set them all crudely to intuitive
values for all experiments: β = 100, α = 0.5, λ = 0.6 and ε = 0.01. The GMM
and SVM classifiers both have regularization hyperparameters Nmax and (C, γ).
These were not optimized, but set at standard values Nmax = 32, C = 1, γ = 1/d.

Algorithm 1. Integrated Active Learning for Discovery and Classification

Active Learning
Input: Labeled L and unlabeled U data. Classifiers C, query criteria Qk, weights w.
1. Build unconditional GMM from L ∪ U (3)-(5)
2. Estimate σ by cross-validation (7)
3. Train initial GMM fgmm and SVM fsvm classifiers on L using σ

Repeat as training budget allows:
1. Compute query criteria pu(i) (1) and pl(i) (2)
2. Sample query criteria to use k ∼ Multi(w)
3. Query point i∗ ∼ pk(i), add (xi∗, yi∗) to L
4. Update classifiers fgmm and fsvm with point i∗ (8) and (9)
5. Compute multi-class classification entropies Hgmm and Hsvm (10)
6. Update query criteria weights w (11) and (12)
7. If Hgmm > Hsvm: select classifier fgmm(x), Else: select fsvm(x)

Testing
Input: Testing samples U∗, selected classifier c.
1. Classify x ∈ U∗ with fc(x) ((8) or (9))

3 Experiments
Evaluation Procedure. We tested our method on 7 rare class datasets from the
UCI repository [1] and on the CASIA gait dataset [20], for which we addressed
the image viewpoint recognition problem. We unbalanced the CASIA dataset by
sampling training classes in geometric proportion. In each case we labeled one
point from the largest class and the goal was to discover and learn to classify
the remaining classes. Table 1 summarizes the properties of each dataset. Per-
formance was evaluated at each iteration by: i) the number of distinct classes
discovered and ii) the average classification accuracy over all classes. This accu-
racy measure weights the ability to classify rare classes equally with the majority
class despite the fewer rare class points. Moreover, it means that undiscovered
rare classes automatically penalize accuracy. Accuracy was evaluated by 2-fold
cross-validation, averaged over 25 runs from random initial conditions.

Comparative Evaluations. We compared the following methods: S/R: A baseline SVM classifier making random queries. G/G: GMM classification with GMM
likelihood criterion (2). S/S: SVM classifier with SVM uncertainty criterion (1).
S/GSmix: SVM classifier alternating GMM likelihood and SVM uncertainty
queries (corresponding to [16]). S/GSonline: SVM classifier fusing GMM like-
lihood & SVM uncertainty criteria by the method in [2]. S/GSadapt: SVM
classification with our adaptive fusion of GMM likelihood & SVM uncertainty

criteria (10)-(12). GSsw/GSadapt: Our full model including online switching
of GMM and SVM classifiers, as detailed in Algorithm 1.

Table 1. Dataset properties. Number of items N, classes Nc, dimensions d. Smallest and largest class proportions S/L.

Data        N      d   Nc  S%    L%
Ecoli       336    7   8   1.5%  42%
PageBlock   5473   10  5   .5%   90%
Glass       214    10  6   4%    36%
Covertype   10000  10  7   3.6%  25%
Shuttle     10000  9   7   .01%  78%
Thyroid     3772   22  3   2.5%  92%
KDD99       50000  23  15  .04%  51%
Gait view   2353   25  9   3%    49%

Table 2. Classification performance summary in terms of area under classification curve.

Data  G/G  S/GSmix  S/GSad  GSsw/GSad
EC    59   60       60      62
PB    53   57       58      59
GL    63   55       57      64
CT    41   39       43      46
SH    40   39       42      43
TH    50   55       56      59
KD    41   23       54      59
GA    38   31       49      57

Shuttle. (Figure 2(a)). Our methods S/GSadapt (cyan) and GSsw/GSadapt (red) exploit likelihood sampling early for fast discovery, and hence early classi-
fication accuracy. (We also outperform the gradient and EM based active discov-
ery models in [9] and [12].) Our adaptive models switch to uncertainty sampling
later on, and hence achieve higher asymptotic accuracy than the pure likelihood
based G/G method. Figure 2(c) illustrates this process via the query criteria
weighting (12) for a typical run. The likelihood criterion discovers a new class
early, leading to higher weight (11) and rapid discovery of the remaining classes.
After 50 iterations, with no new classes to discover, uncertainty criteria obtains
greater reward (11) and dominates, efficiently refining classification performance.

Thyroid. (Figure 2(b)). Our GSsw/GSadapt model (red) is the best overall
classifier: it matches the initially superior performance of the G/G likelihood-
based model (green), but later achieves the asymptotic performance of the SVM
classifier based models. This is because of our classifier switching innovation (Sec-
tion 2.4). Figure 2(d) illustrates switching via the average (training) classifica-
tion entropy and (testing) accuracy of the classifiers composing GSsw/GSadapt.
The GMM classifier entropy (black dots) is higher than the SVM entropy (blue
dots) for the first 25 iterations. This is approximately the period over which the
GMM classifier (black line) has better performance than the SVM classifier (blue
line), so switching classifier on entropy allows the pair (green dashes) to always
perform as well as the best individual classifier for each iteration.

Glass. (Figure 2(e)). GSsw/GSadapt again performs best by switching to match the good initial performance of the GMM classifier and asymptotic performance
of the SVM. Note the dramatic improvement over the SVM models in the first
50 iterations. Pageblocks (Figure 2(f)). The SVM-based models outperform
G/G at most iterations. Our GSsw/GSadapt correctly selects the SVM classifier

Fig. 2. (a) Shuttle and (b) Thyroid dataset performance. (c) Shuttle criteria adaptation, (d) Thyroid entropy-based classifier switching. (e) Glass, (f) Pageblocks and (g) Gait view dataset performance. (Each performance panel plots classes discovered and average accuracy against the number of labeled points for the compared methods.)

throughout. Gait view (Figure 2(g)). The majority class contains outliers, so
the likelihood criterion is unusually weak at discovery. Additionally, for this data SVM
performance is generally poor, especially in early iterations. GSsw/GSadapt
adapts impressively to this dataset in two ways enabled by our contributions:
exploiting uncertainty sampling criteria extensively and switching to predicting
using the GMM classifier.
In summary the G/G method (likelihood criterion) was usually the most ef-
ficient at discovering classes as expected. However, it was usually asymptot-
ically weaker at classifying new instances. This is because generative model
mis-specification tends to cost more with increasing amounts of data [11]. S/S
(uncertainty criterion) was generally poor at discovery (and hence classification).
Alternating between likelihood and uncertainty sampling, S/GSmix (correspond-
ing to [16]) did a fair job of both discovery and classification, but under-performed
our adaptive models due to its inflexibility. S/GSonline (corresponding to [2])
was better than random or S/S, but was not the quickest learner. Our first model
S/GSadapt, which solely adapted the multiple active query criteria, was com-
petitive at discovery, but sometimes not the best at classification in early phases
with very little data – due to exclusively using the discriminative SVM classifier.
Finally, by exploiting generative-discriminative classifier switching, our complete
GSsw/GSadapt model was generally the best classifier over all stages of learn-
ing. Table 2 quantitatively summarizes the performance of the most competitive
models for all datasets in terms of area under the classification curve.

4 Conclusion
Summary. We highlighted active classifier learning with a priori unknown rare
classes as an under-studied but broadly relevant and important problem. To
solve joint rare class discovery and classification, we proposed a new framework
to adapt both active query criteria and classifier. To adaptively switch gen-
erative and discriminative classifiers online we introduced MCE; and to adapt
query criteria we exploited a joint reward signal of new class discovery and MCE.
In adapting to each dataset and online as data is obtained, our model signifi-
cantly outperformed contemporary alternatives on eight standard datasets. Our
approach will be of great practical value for many problems.

Discussion. A related area of research to our present work is that of learning from imbalanced data [8] which aims to learn classifiers for classes with imbal-
anced distributions, while avoiding the pitfall of simply classifying everything
as the majority class. One strategy to achieve this is uncertainty based active
learning [6], which works because the distribution around the class boundaries is
less imbalanced than the whole dataset. Our task is also an imbalanced learning
problem, but more general in that the rare classes must also be discovered. We
succeed in learning from imbalanced distributions via our use of uncertainty sam-
pling, so in that sense our method generalizes [6]. Although our approach lacks
the theoretical bounds of the fusion method in [2], we find it more compelling for
various reasons: it addresses a very practical and previously unaddressed prob-
lem of learning to discover new classes and find new instances of them by jointly
optimizing searching for new classes and refining their decision boundaries. It
adapts based on the current state of the learning process, i.e., early on, class find-
ing via likelihood may be more appropriate, and later on boundary refinement
via uncertainty. In contrast [2] solely optimizes classification accuracy and is
not directly applicable to discovery. [5] and [3] address the fusion of uncertainty
and density (to avoid outliers) criteria for classifier learning (not discovery). [5]
adapts between density weighted and unweighted uncertainty sampling based on
their expected future error. This is different to our situation because there is no
meaningful notion of future error when an unknown number of classes remain
to be discovered. [3] samples from a weighted sum of density and uncertainty
criteria. This is less powerful than our approach because it does not adapt online
based on the performance of each criteria. Importantly, both [5] and [3] prefer
high density points; while for rare class discovery we require the opposite – low
likelihood. In comparison to other active rare class discovery work, our frame-
work generalizes [12], (which exclusively uses generative models and likelihood
criteria) to using more criteria and adapting between them. [9] focuses on a dif-
ferent active discovery intuition, using local gradient to discover non-separable
rare classes. We derived an analogous query criterion based on GMM local gra-
dient. It was generally weaker than likelihood-based discovery (and was hence
adapted downward in our framework) for our datasets, so we do not report on it
here. Unlike our work here, [5,12,9] all also rely on the very strong assumption
that the user at least specifies the number of classes in advance. Finally, the

only other work of which we are aware which addresses both discovery and clas-
sification is [16]. This uses a fixed classifier and non-adaptively iterates between
discovery and uncertainty criteria (corresponding to our S/GSmix condition). In
contrast, our results have shown that our switching classifier and adaptive query
criteria provide compelling benefit for discovery and classification.

Future Work. There are various interesting questions for future research, including how to create tighter coupling between the generative and discriminative
components [4], and how to generalize our ideas to stream-based active learning, which
is a more natural setting for some practical problems.

Acknowledgment. This research was funded by the EU FP7 project SAMURAI with grant no. 217899.

References
1. Asuncion, A., Newman, D.: UCI machine learning repository (2007),
http://www.ics.uci.edu/ml/
2. Baram, Y., El-Yaniv, R., Luz, K.: Online choice of active learning algorithms.
Journal of Machine Learning Research 5, 255–291 (2004)
3. Cebron, N., Berthold, M.R.: Active learning for object classification: from explo-
ration to exploitation. Data Min. Knowl. Discov. 18(2), 283–299 (2009)
4. Deselaers, T., Heigold, G., Ney, H.: SVMs, gaussian mixtures, and their genera-
tive/discriminative fusion. In: ICPR (2008)
5. Donmez, P., Carbonell, J.G., Bennett, P.N.: Dual strategy active learning. In: Kok,
J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron,
A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 116–127. Springer, Heidelberg
(2007)
6. Ertekin, S., Huang, J., Bottou, L., Giles, L.: Learning on the border: active learning
in imbalanced data classification. In: CIKM (2007)
7. Goldberger, J., Roweis, S.: Hierarchical clustering of a mixture model. In: NIPS
(2004)
8. He, H., Garcia, E.: Learning from imbalanced data. IEEE Transactions on Knowl-
edge and Data Engineering 21(9), 1263–1284 (2009)
9. He, J., Carbonell, J.: Nearest-neighbor-based active learning for rare category de-
tection. In: NIPS (2007)
10. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On combining classifiers. IEEE
Transactions on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998)
11. Ng, A., Jordan, M.: On discriminative vs. generative classifiers: A comparison of
logistic regression and naive bayes. In: NIPS (2001)
12. Pelleg, D., Moore, A.: Active learning for anomaly and rare-category detection. In:
NIPS (2004)
13. Roy, N., McCallum, A.: Toward optimal active learning through sampling estima-
tion of error reduction. In: ICML, pp. 441–448 (2001)
14. Settles, B.: Active learning literature survey. Tech. Rep. 1648, University of
Wisconsin–Madison (2009)
15. Sillito, R., Fisher, R.: Incremental one-class learning with bounded computational
complexity. In: ICANN (2007)

16. Stokes, J.W., Platt, J.C., Kravis, J., Shilman, M.: Aladin: Active learning of anoma-
lies to detect intrusions. Tech. Rep. 2008-24, MSR (2008)
17. Tong, S., Koller, D.: Support vector machine active learning with applications to
text classification. In: ICML (2000)
18. Wu, T.F., Lin, C.J., Weng, R.C.: Probability estimates for multi-class classification
by pairwise coupling. Journal of Machine Learning Research 5, 975–1005 (2004)
19. Xiang, T., Gong, S.: Video behavior profiling for anomaly detection. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 30(5), 893–908 (2008)
20. Yu, S., Tan, D., Tan, T.: A framework for evaluating the effect of view angle,
clothing and carrying condition on gait recognition. In: ICPR (2006)
Margin-Based Over-Sampling Method for
Learning from Imbalanced Datasets

Xiannian Fan, Ke Tang, and Thomas Weise

Nature Inspired Computational and Applications Laboratory,
School of Computer Science and Technology,
University of Science and Technology of China,
Hefei, China, 230027
xnfan@mail.ustc.edu.cn,{ketang,tweise}@ustc.edu.cn

Abstract. Learning from imbalanced datasets has drawn more and more attention from both theoretical and practical perspectives. Over-sampling is a popular and simple method for imbalanced learning. In this paper, we show that there is an inherent potential risk associated with over-sampling algorithms in terms of the large margin principle. We then propose a new synthetic over-sampling method, named Margin-guided Synthetic Over-sampling (MSYN), to reduce this risk. MSYN improves learning with respect to the data distributions guided by a margin-based rule. An empirical study verifies the efficacy of MSYN.

Keywords: imbalance learning, over-sampling, over-fitting, large margin theory, generalization.

1 Introduction

Learning from imbalanced datasets has received increasing attention in recent
years. A dataset is imbalanced if its class distributions are skewed. The class
imbalance problem is of crucial importance since it is encountered in a large
number of real-world applications, such as fraud detection [1], the detection of
oil spills in satellite radar images [2], and text classification [3]. In these scenarios,
we are usually more interested in the minority class than in the majority class.
Traditional data mining algorithms perform poorly on such data because they
give equal attention to the minority class and the majority class.
One way to address the imbalanced learning problem is to develop "imbalanced
data oriented" algorithms that can perform well on imbalanced datasets. For
example, Wu et al. proposed a class boundary alignment algorithm which modifies
the class boundary by changing the kernel function of SVMs [4]. Ensemble
methods were used to improve performance on imbalance datasets [5]. In 2010,
Liu et al. proposed the Class Confidence Proportion Decision Tree (CCPDT)
[6]. Furthermore, there are other effective methods such as cost-based learning
[7] and one class learning [8].

Corresponding author.


Another important way to improve the results of learning from imbalanced
data is to modify the class distributions in the training data by over-sampling
the minority class or under-sampling the majority class [9]. The simplest sampling
methods are Random Over-Sampling (ROS) and Random Under-Sampling
(RUS). The former increases the number of minority class instances by duplicating
instances of the minority class, while the latter randomly removes some
instances of the majority class. Sampling with replacement has been shown to
be ineffective at significantly improving the recognition of the minority class [9,10].
Chawla et al. interpreted this phenomenon in terms of decision regions in feature
space and proposed the Synthetic Minority Over-Sampling Technique (SMOTE)
[11]. There are also many other synthetic over-sampling techniques, such as
Borderline-SMOTE [12] and ADASYN [13]. To summarize, under-sampling methods
can discard useful information from the datasets; over-sampling methods may
make the decision regions of the learner smaller and more specific, and thus may
cause the learner to over-fit.
In this paper, we analyze the performance of over-sampling techniques from
the perspective of the large margin principle and find that the over-sampling
methods are inherently risky from this perspective. Aiming to reduce this risk,
we propose a new synthetic over-sampling method, called Margin-guided Syn-
thetic Over-Sampling (MSYN). Our work is largely inspired by the previous
works in feature selection using the large margin principle [14] [15] and prob-
lems of over-sampling for imbalance learning [16]. The empirical study revealed
the effectiveness of our proposed method.
The rest of this paper is organized as follows. Section 2 reviews the related
works. Section 3 presents the margin-based analysis for over-sampling. Then in
Section 4 we propose the new synthetic over-sampling algorithm. In Section 5, we
test the performance of the algorithms on various machine learning benchmarks
datasets. Finally, the conclusion and future work are given in Section 6.

2 Related Works
We use A to denote a dataset of n instances A = {a1 , ..., an }, where ai is a real-
valued vector of dimension m. Let AP ⊂ A denote the minority class instances,
AN ⊂ A denote the majority class instances.
Over-sampling techniques augment the minority class to balance between
the numbers of the majority and minority class instances. The simplest over-
sampling method is ROS. However, it may make the decision regions of the ma-
jority smaller and more specific, and thus can cause the learner to over-fit [16].
Chawla et al. over-sampled the minority class with their SMOTE method,
which generates new synthetic instances along the line between the minority in-
stances and their selected nearest neighbors [11]. Specifically, for the subset AP ,
they consider the k nearest neighbors of each instance ai ∈ AP. For a specified
integer k, the k nearest neighbors are defined as the k elements of AP whose
Euclidean distance to ai is smallest. To create a synthetic instance, one of the
k nearest neighbors ann is randomly selected, the difference vector (ann − ai) is
multiplied by a random number in [0, 1], and the result is added to ai. Take a
two-dimensional problem for example:

anew = ai + (ann − ai ) × δ

where ai ∈ AP is the minority instance under consideration, ann is one of the k nearest neighbors from the minority class, and δ ∈ [0, 1]. This leads to generating
a random instance along the line segment between two specific instances and
thus effectively forces the decision region of the minority class to become more
general [11]. The advantage of SMOTE is that it makes the decision regions
larger and less specific [16].
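A rough sketch of SMOTE's interpolation step, for illustration only; unlike SMOTE proper, which spreads synthetic instances evenly over the minority points, this simplified version picks the seed instance at random, and the parameters k and seed are arbitrary assumptions.

import numpy as np

def smote_candidates(X_min, n_new, k=5, seed=0):
    # SMOTE-style interpolation: a_new = a_i + (a_nn - a_i) * delta, where a_nn is
    # one of a_i's k nearest minority neighbours and delta ~ Uniform[0, 1].
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude each point from its own neighbours
    knn = np.argsort(d, axis=1)[:, :k]   # indices of the k nearest minority neighbours
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))     # seed instance picked at random (simplification)
        nn = X_min[rng.choice(knn[i])]   # one of its k nearest minority neighbours
        synthetic.append(X_min[i] + (nn - X_min[i]) * rng.random())
    return np.array(synthetic)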
Borderline-SMOTE focuses the instances on the borderline of each class and
the ones nearby. The consideration behind it is: the instances on the borderline
(or nearby) are more likely to be misclassified than the ones far from the border-
line, and thus more important for classification. Therefore, Borderline-SMOTE
only generates synthetic instances for those minority instances closer to the
border while SMOTE generates synthetic instances for each minority instance.
ADASYN uses a density distribution as a criterion to automatically decide the
number of synthetic instances that need to be generated for each minority in-
stance. The density distribution is a measurement of the distribution of the
weights for different minority class instances according to their level of difficulty
in learning. The consideration is similar to the idea of AdaBoost [17]: one should
pay more attention to the difficult instances. In summary, either Borderline-
SMOTE or ADASYN improves the performance of over-sampling techniques by
paying more attention to some specific instances. Neither, however, touches
the essential problem of over-sampling techniques that causes over-fitting.
Different from the previous work, we resort to margins to analyze the problem
of over-sampling, since margins offer a theoretic tool to analyze the generalization
ability. Margins play an indispensable role in machine learning research. Roughly
speaking, margins measure the level of confidence a classifier has with respect to
its decision. There are two natural ways of defining the margin with respect to a
classifier [14]. One approach is to define the margin as the distance between an
instance and the decision boundary induced by the classification rule. Support
Vector Machines are based on this definition of margin, which we refer to as
sample margin. An alternative definition of the margin can be the Hypothesis
Margin; in this definition the margin is the distance that the classifier can travel
without changing the way it labels any of the sample points [14].

3 Large Margin Principle Analysis for Over-Sampling


For prototype-based problems (e.g., the nearest neighbor classifier), the classifier
is defined by a set of training points (prototypes) and the decision boundary
is the Voronoi tessellation [18]. The sample margin in this case is the distance
between the instance and the Voronoi tessellation. Therefore it measures the
sensitivity to small changes of the instance position. The hypothesis margin R
for this case is the maximal distance such that the following condition holds:

Fig. 1. Two types of margins in terms of the Nearest Neighbor rule. The toy problem involves class A and class B. Margins of a new instance (the blue circle), which belongs to class A, are shown. The sample margin (left) is the distance between the new instance and the decision boundary (the Voronoi tessellation). The hypothesis margin (right) is the largest distance the sample points can travel without altering the label of the new instance. In this case it is half the difference between the distance to the

if we draw a sphere with radius R around each prototype, any change of the
location of prototypes inside their sphere will not change the assigned labels.
Therefore, the hypothesis margin measures the stability to small changes in the
prototype locations. See Figure 1 for an illustration.
Throughout this paper we will focus on margins for the Nearest Neighbor
rule (NN). For this special case, the following results have been proved [14]:
1. The hypothesis-margin lower bounds the sample-margin
2. It is easy to compute the hypothesis-margin of an instance x with respect
to a set of instances A by the following formula:

$\theta_A(x) = \tfrac{1}{2}\big(\|x-\mathrm{nearestmiss}_A(x)\| - \|x-\mathrm{nearesthit}_A(x)\|\big)$  (1)
where nearesthitA (x) and nearestmissA (x) denote the nearest instance to x in
dataset A with the same and different label, respectively.
For the NN rule, the hypothesis margin is thus easy to calculate, and if a set of
prototypes has a large hypothesis margin then it also has a large sample margin [14].
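For illustration (this is not code from the paper), the hypothesis margin of Eq. (1) under the NN rule can be computed directly from nearest-hit and nearest-miss distances; the function name and argument layout are assumptions of this sketch.

import numpy as np

def hypothesis_margin(x, y, X, Y):
    # Eq. (1): theta_A(x) = 0.5 * (||x - nearestmiss_A(x)|| - ||x - nearesthit_A(x)||).
    # X, Y hold the instances and labels of the set A; x is assumed not to be in A.
    d = np.linalg.norm(X - x, axis=1)
    nearest_hit = d[Y == y].min()    # closest instance with the same label
    nearest_miss = d[Y != y].min()   # closest instance with a different label
    return 0.5 * (nearest_miss - nearest_hit)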
Now we consider the over-sampling problem using the large margin principle.
When adding a new minority class instance x, we consider the difference of the
overall margins for the minority class:

$\Delta_P(x) = \sum_{a\in A_P}\big(\theta_{A\setminus a\,\cup\{x\}}(a) - \theta_{A\setminus a}(a)\big)$  (2)

where A\a denotes the dataset excluding a from the dataset A, and A\a ∪ {x}
denotes the union of A\a and {x}.

For any a ∈ AP, ||a − nearestmissA\a∪{x}(a)|| = ||a − nearestmissA\a(a)||
and ||a − nearesthitA\a∪{x}(a)|| ≤ ||a − nearesthitA\a(a)||. From Eq. (1), it
follows that ΔP(x) ≥ 0. We call ΔP(x) the margin gain for the minority class.
Further, the difference of the overall margins for the majority class is:

$\Delta_N(x) = \sum_{a\in A_N}\big(\theta_{A\setminus a\,\cup\{x\}}(a) - \theta_{A\setminus a}(a)\big)$  (3)

For any a ∈ AN, ||a − nearestmissA\a∪{x}(a)|| ≤ ||a − nearestmissA\a(a)|| and
||a − nearesthitA\a∪{x}(a)|| = ||a − nearesthitA\a(a)||. From Eq. (1), it follows
that ΔN(x) ≤ 0. We call −ΔN(x) the margin loss for the majority class.
In summary, it is shown that the over-sampling methods are inherently risky
from the perspective of the large margin principle. The over-sampling methods,
such as SMOTE, will enlarge the nearest-neighbor based margins for the minority
class while may decrease the nearest neighbor based margins for the majority
class. Hence, over-sampling will not only bias towards the minority class but
also may be detrimental to the majority class. We cannot completely eliminate
these effects when adopting over-sampling for imbalanced learning, but we can
seek methods that balance the two quantities.
In the simplest way, one can maximize the margins for the minority class and
ignore the margins loss for the majority class, i.e., the following formula:

f1 = −ΔP (x) (4)

Alternatively, one may also minimize the margins loss for the majority class,
which is
f2 = −ΔN (x) (5)
One intuitive method is to seek a good balance between maximizing the margins
gain for the minority class and minimizing the margins loss for the majority
class. This can be conducted by minimizing Eq. (6):

$f_3(x) = \frac{-\Delta_N(x)}{\Delta_P(x)+\varepsilon},\quad \varepsilon>0$  (6)

where ε is a positive constant that ensures the denominator of Eq. (6) is non-zero.
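A brute-force sketch (an illustration under the definitions above, not the authors' code) of the margin gain ΔP(x), the margin loss −ΔN(x) and the selection criterion f3(x) for a candidate synthetic point x; the function and variable names are assumptions.

import numpy as np

def margin_score(x, X_min, X_maj, eps=1e-6):
    # f3(x) = -Delta_N(x) / (Delta_P(x) + eps), Eq. (6); smaller is better.
    d_to_min = np.linalg.norm(X_min - x, axis=1)   # candidate-to-minority distances
    d_to_maj = np.linalg.norm(X_maj - x, axis=1)   # candidate-to-majority distances

    # current nearest-hit distance of every minority point (within the minority class)
    dmm = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dmm, np.inf)
    hit = dmm.min(axis=1)

    # current nearest-miss distance of every majority point (to the minority class)
    miss = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=-1).min(axis=1)

    # adding the minority point x can only shrink these distances, so by Eq. (1):
    gain = 0.5 * np.maximum(hit - d_to_min, 0.0).sum()    # Delta_P(x)  >= 0
    loss = 0.5 * np.maximum(miss - d_to_maj, 0.0).sum()   # -Delta_N(x) >= 0
    return loss / (gain + eps)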

4 The Margin-Guided Synthetic Over-Sampling Algorithm

In this section we apply the above analysis to over-sampling techniques. Without
loss of generality, our algorithm is designed on the basis of SMOTE. The general
idea behind it, however, can also be applied to any other over-sampling technique.
Based on the analysis in the previous section, Eq. (6) is employed to decide
whether a new synthetic instance is good enough to be added into the training

Algorithm 1. MSYN
Input: Training set X with n instances (ai, yi), i = 1, ..., n, where ai is an instance in the m-dimensional feature space and yi ∈ Y = {1, −1} is the class label associated with ai. Define mP and mN as the number of minority class instances and the number of majority class instances, respectively; therefore mP < mN. BIN is the set of synthetic instances, initialized as empty.
Parameter: Pressure.
1 Calculate the number of synthetic instances that need to be generated for the minority class: G = (mN − mP) ∗ Pressure;
2 Calculate the number of synthetic instances that need to be generated for each minority example ai: gi = G / mP;
3 for each minority class instance ai do
4   for j ← 1 to gi do
5     Randomly choose one minority instance, azi, from the k nearest neighbors of the instance ai;
6     Generate the synthetic instance as using the technique of SMOTE;
7     Add as to BIN
8 Sort the synthetic instances in BIN according to their values of Eq. (6);
9 Return the (mN − mP) instances that have the minimum values of Eq. (6).

dataset. Our new Margin-guided Synthetic Over-Sampling algorithm, MSYN for
short, is given in Algorithm 1. The major focus of MSYN is to use a margin-based
guideline to select the synthetic instances. Pressure ∈ N, a natural number, is a
parameter controlling the selection pressure. In order to obtain (mN − mP) new
synthetic instances, we first create (mN − mP) ∗ Pressure candidate instances and
then select only the best (mN − mP) of them according to the values of Eq. (6),
discarding the rest. This selection process implicitly decides whether an original
minority instance is used to create synthetic instances, as well as how many
synthetic instances are generated from it, which is different from SMOTE since
SMOTE generates the same number of synthetic instances for each original
minority instance. Moreover, it is easy to see that the computational complexity of
MSYN is O(n²), which is dominated by calculating the distance matrix.
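Putting the pieces together, the over-generate-and-select flow of Algorithm 1 could be driven as sketched below; `generate` and `score` are assumed callables (for instance, routines like the smote_candidates and margin_score sketches above, whose names are illustrative and not from the paper).

import numpy as np

def msyn_select(X_min, X_maj, generate, score, pressure=10):
    # Over-generate synthetic minority candidates, score each with the margin
    # criterion of Eq. (6), and keep only the (m_N - m_P) lowest-scoring ones.
    # generate(X_min, n) must return n candidate points;
    # score(x, X_min, X_maj) must return the Eq. (6) value for candidate x.
    n_needed = len(X_maj) - len(X_min)            # m_N - m_P
    candidates = generate(X_min, n_needed * pressure)
    scores = np.array([score(x, X_min, X_maj) for x in candidates])
    keep = np.argsort(scores)[:n_needed]          # minimise Eq. (6)
    return candidates[keep]

The selected synthetic instances would then be appended to the minority class before training the classifier, e.g. np.vstack([X_min, msyn_select(X_min, X_maj, smote_candidates, margin_score)]).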

5 Experiment Study

Weka's C4.5 implementation [19] is employed in our experiments. We compare our proposed MSYN with SMOTE [11], ADASYN [13], Borderline-SMOTE
[12] and ROS. All experiments were carried out using 10 runs of 10-fold cross-
validation. For MSYN, the parameter P ressure is set to 10 and the ε can be
any random positive real number; for other methods, the parameters are set as
recommended in the corresponding paper.

Fig. 2. The distribution of the dataset Concentric with noise (Feature 1 vs. Feature 2; minority class, majority class and the true boundary are shown).

To evaluate the performance of our approach, experiments on both artificial and real datasets have been performed. The former is used to show the behavior
of the MSYN on known data distributions while the latter is used to verify the
utility of our method when dealing with real-world problems.

5.1 Synthetic Datasets


This part of our experiments focuses on synthetic data to analyze the character-
istics of the proposed MSYN. We used the dataset Concentric from the ELENA
project [20]. The Concentric dataset is a two-dimensional problem with two
classes drawn from uniform concentric circular distributions. The instances of
the minority class are uniformly distributed within a circle of radius 0.3 centered
on (0.5, 0.5). The points of the majority class are uniformly distributed within a
ring centered on (0.5, 0.5) with internal and external radii of 0.3 and 0.5, respectively.
In order to investigate the problem of over-fitting to noise, we modify the
dataset by randomly flipping the labels of 1% of the instances, as shown in Figure 2.
In order to show the performance of the various synthetic over-sampling tech-
niques, we sketch them in Figure 3. The new synthetic instances created by each
over-sampling method, the original majority instances and the corresponding
C4.5 decision boundary are drawn. From Figure 3, we can see that MSYN shows
good performance in the presence of noise while SMOTE and ADASYN suffer
greatly from over-fitting the noise. MSYN generates no noise instances. This can
be attributed to the fact that the margin-based Eq. (6) contains the information
of the neighboring instances, and this information helps to decrease the influence
of noise. Both SMOTE and ADASYN generate a large number of noise instances
and their decision boundary is greatly influenced. Borderline-SMOTE generates
a small number of noise instances and its decision boundary is slightly influenced.

Fig. 3. The synthetic instances and the corresponding C4.5 decision boundary after processing by SMOTE, MSYN, Borderline-SMOTE and ADASYN, respectively (legend: synthetic minority instances, majority instances, C4.5 boundary, true boundary).

Furthermore, Borderline-SMOTE pays little attention to interior instances and creates only a few synthetic instances.

5.2 Real World Problems

We test the algorithms on ten datasets from the UCI Machine Learning Repos-
itory [21]. Information about these datasets is summarized in Table 1, where
num is the size of the dataset, attr is the number of features, and min% is the
ratio of the number of minority class instances to num.
Instead of using the overall classification accuracy, we adopt metrics related
to Receiver Operating Characteristics (ROC) curve [22] to evaluate the compared
algorithms, because traditional overall classification accuracy may not be able to
provide a comprehensive assessment of the observed learning algorithms in case
of class imbalanced datasets [3]. Specifically, we use the AUC [22] and F-Measure
[23] to evaluate the performance. We apply the Wilcoxon signed rank test with
a 95% confidence level on each dataset to see whether the difference between the
compared algorithms is statistically significant.
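As an illustration of this comparison step (not the authors' code), the per-run AUC scores of two methods on one dataset could be compared with a Wilcoxon signed rank test as sketched below; the score arrays are hypothetical.

from scipy.stats import wilcoxon

# Hypothetical AUC scores of two methods over the 10 runs of 10-fold CV on one dataset.
auc_msyn  = [0.75, 0.76, 0.74, 0.77, 0.75, 0.76, 0.74, 0.75, 0.76, 0.75]
auc_smote = [0.74, 0.74, 0.73, 0.75, 0.74, 0.74, 0.73, 0.74, 0.74, 0.73]

stat, p_value = wilcoxon(auc_msyn, auc_smote)
# With a 95% confidence level, the difference is significant when p_value < 0.05.
print(f"W={stat:.1f}, p={p_value:.4f}, significant={p_value < 0.05}")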
Table 2 and Table 3 show the AUC and F-Measure for the datasets, respec-
tively. The results of Table 2 reveal that MSYN wins against SMOTE on nine
out of ten datasets, beats ADASYN on seven out of ten datasets, outperforms

Table 1. Summary of the DataSets

Datasets num attr min%


Abalone 4177 8 9.36%
Contraceptive 1473 9 22.61%
Heart 270 9 29.28%
Hypothyroid 3163 8 34.90%
Ionosphere 351 34 35.90%
Parkinsons 195 22 24.24%
Pima 768 8 34.90%
Spect 367 19 20.65%
Tic-tac-toe 958 9 34.66%
Transfusion 748 4 31.23%

Table 2. Results in terms of AUC for the experiments performed on real datasets. For SMOTE, ADASYN, ROS and Borderline-SMOTE, if the value is underlined, MSYN performs better than that method; if the value is starred, MSYN performs worse than that method; if the value is in normal style, the corresponding method does not perform significantly differently from MSYN according to the Wilcoxon signed rank test. The row W/D/L Sig. shows the number of wins, draws and losses of MSYN from the statistical point of view.

Dataset MSYN SMOTE ADASYN ROS Borderline-SMOTE


Abalone 0.7504 0.7402 0.7352 0.6708 0.7967*
Contraceptive 0.6660 0.6587 0.6612 0.6055 0.6775*
Heart 0.7909 0.7862 0.7824 0.7608 0.7796
Hypothyroid 0.9737 0.9652 0.9655 0.9574 0.9653
Ionosphere 0.8903 0.8731 0.8773 0.8970* 0.8715
Parkinsons 0.8248 0.8101 0.8298* 0.7798 0.8157
Pima 0.7517 0.7427 0.7550 0.7236 0.7288
Spect 0.7403 0.7108 0.7157 0.6889 0.7436
Tic-tac-toe 0.9497 0.9406 0.9391 0.9396 0.9456
Transfusion 0.7140 0.6870 0.6897 0.6695 0.6991
W/D/L Sig. N/A 9/1/0 7/2/1 9/0/1 6/2/2

ROS on nine out of ten datasets, and wins against Borderline-SMOTE on six out
of ten datasets. The results of Table 3 show that MSYN wins against SMOTE on
seven out of ten datasets, beats ADASYN on six out of ten datasets, beats ROS
on six out of ten datasets, and wins against Borderline-SMOTE on six out of ten
datasets. The comparisons reveal that MSYN outperforms the other methods in
terms of both AUC and F-measure.

Table 3. Results in terms of F-measure for the experiments performed on real datasets. For SMOTE, ADASYN, ROS and Borderline-SMOTE, if the value is underlined, MSYN performs better than that method; if the value is starred, MSYN performs worse than that method; if the value is in normal style, the corresponding method does not perform significantly differently from MSYN according to the Wilcoxon signed rank test. The row W/D/L Sig. shows the number of wins, draws and losses of MSYN from the statistical point of view.

Dataset MSYN SMOTE ADASYN ROS Borderline-SMOTE


Abalone 0.2507 0.3266* 0.3289* 0.3479* 0.3154*
Contraceptive 0.3745 0.4034* 0.4118* 0.4133* 0.4142*
Heart 0.7373 0.7305 0.7318 0.7151 0.7223
Hypothyroid 0.8875 0.8412 0.8413 0.8771 0.9054*
Ionosphere 0.8559 0.8365 0.8338 0.8668* 0.8226
Parkinsons 0.7308 0.6513 0.6832 0.6519 0.6719
Pima 0.6452 0.6435 0.6499 0.6298 0.6310
Spect 0.4660 0.4367 0.4206 0.4644 0.4524
Tic-tac-toe 0.8619 0.8465 0.8437 0.8556 0.8604
Transfusion 0.4723 0.4601 0.4507 0.4596 0.4664
W/D/L Sig. N/A 7/1/2 6/2/2 6/1/3 6/1/3

6 Conclusion and Future Work


This paper gives an analysis of over-sample techniques from the viewpoint of the
large margin principle. It is shown that over-sampling techniques will not only
bias towards the minority class but may also bring detrimental effects to the clas-
sification of the majority class. This inherent dilemma of over-sampling cannot be
entirely eliminated, but only reduced. We propose a new synthetic over-sampling
method to strike a balance between the two contradictory objectives. We eval-
uate our new method on a wide variety of imbalanced datasets using different
performance measures and compare it to the established over-sampling methods.
The results support our analysis and indicate that the proposed method, MSYN,
is indeed superior.
As a new sampling method, MSYN can be further extended along several di-
rections. First of all, in this work we investigated the performance of MSYN only with C4.5. Based on the nearest neighbor margin, MSYN has a bias towards 1-NN. Some strategies,
however, can be adopted to approximate the hypothesis margin for the other clas-
sification rules. For example, we can use the confidence of the classifiers’ output to
approximate the hypothesis margin. Thus we expect MSYN can be extended to
work well with other learning algorithms, such as k-NN, RIPPER [28]. But solid
empirical study is required to justify this expectation. Besides, ensemble learning
algorithms can improve the accuracy and robustness of the learning procedure [25].
It is thus worthwhile to integrate MSYN with ensemble learning algorithms. Such an
investigation can be conducted following the methodology employed in the work
of SMOTEBoost [5], DataBoost-IM [26], BalanceCascade [27], etc.

Secondly, MSYN can be generalized to multiple-class imbalance learning as


well. For each minority class i, a straightforward idea is to extend Eq. (6) to:

f_i(x) = \frac{\sum_{j \neq i} -\Delta_{i,j}(x)}{\Delta_i(x) + \varepsilon}, \qquad \varepsilon > 0 \qquad (7)
where Δ_i(x) denotes the margin gain of minority class i from adding a new minority instance x (x belongs to class i), and −Δ_{i,j}(x) denotes the margin loss for class j from adding a new minority instance x (x belongs to class i). We then create synthetic instances for each minority class until its size equals that of the majority class, i.e., the class with the maximum number of instances. However, this idea is by no means the only one. Extending a technique from binary to multi-class problems is usually non-trivial, and more in-depth investigation is necessary to seek the best strategy.
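A minimal sketch of how Eq. (7) could be scored in a multi-class setting is given below (Python); margin_gain and margin_loss are hypothetical callables standing in for the Δ terms of Eq. (6), which are not reproduced here.

def multiclass_score(x, i, classes, margin_gain, margin_loss, eps=1e-6):
    """Score a candidate synthetic instance x for minority class i via Eq. (7).

    margin_gain(x, i)    -> Delta_i(x): margin gain of class i from adding x.
    margin_loss(x, i, j) -> -Delta_{i,j}(x): margin loss of class j from adding x.
    Both are hypothetical hooks; the paper's Eq. (6) defines the binary case.
    """
    loss_sum = sum(margin_loss(x, i, j) for j in classes if j != i)
    return loss_sum / (margin_gain(x, i) + eps)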

References
1. Chan, P.K., Stolfo, S.J.: Toward scalable learning with non-uniform class and cost
distributions: a case study in credit card fraud detection. In: Proceedings of the
Fourth International Conference on Knowledge Discovery and Data Mining, pp.
164–168 (2001)
2. Kubat, M., Holte, R.C., Matwin, S.: Machine Learning for the Detection of Oil
Spills in Satellite Radar Images. Machine Learning 30(2), 195–215 (1998)
3. Weiss, G.M.: Mining with Rarity: A Unifying Framework. SIGKDD Explorations 6(1), 7–19 (2004)
4. Wu, G., Chang, E.Y.: Class-Boundary Alignment for Imbalanced Dataset Learning.
In: Workshop on Learning from Imbalanced Datasets II, ICML, Washington DC
(2003)
5. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.W.: Smoteboost: Improving
Prediction of the Minority Class in Boosting. In: Lavrač, N., Gamberger, D., Todor-
ovski, L., Blockeel, H. (eds.) PKDD 2003. LNCS (LNAI), vol. 2838, pp. 107–119.
Springer, Heidelberg (2003)
6. Liu, W., Chawla, S., Cieslak, D.A., Chawla, N.V.: A Robust Decision Tree Algo-
rithm for Imbalanced Data Sets. In: SIAM International Conf. on Data Mining
(2010)
7. Zhou, Z.H., Liu, X.Y.: Training cost-sensitive neural networks with methods ad-
dressing the class imbalance problem. IEEE Transactions on Knowledge and Data
Engineering, 63–77 (2006)
8. Raskutti, B., Kowalczyk, A.: Extreme re-balancing for SVMs: a case study.
SIGKDD Explorations 6(1), 60–69 (2004)
9. Japkowicz, N.: The Class Imbalance Problem: Significance and Strategies. In: Pro-
ceeding of the 2000 International Conf. on Artificial Intelligence (ICAI 2000): Spe-
cial Track on Inductive Learning, Las Vegas, Nevada (2000)
10. Ling, C., Li, C.: Data Mining for Direct Marketing Problems and Solutions. In:
Proceeding of the Fourth International Conf. on Knowledge Discovery and Data
Mining, KDD 1998, New York, NY (1998)
11. Chawla, N.V., Hall, L.O., Bowyer, K.W., Kegelmeyer, W.P.: SMOTE: Synthetic
Minority Oversampling Technique. Journal of Artificial Intelligence Research 16,
321–357 (2002)

12. Han, H., Wang, W.Y., Mao, B.H.: Borderline-SMOTE: A New Over-Sampling
Method in Imbalanced Data Sets Learning. Advances in Intelligent Computing,
878–887 (2005)
13. He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: Adaptive Synthetic Sampling
Approach for Imbalanced Learning. In: Proceeding of International Conf. Neural
Networks, pp. 1322–1328 (2008)
14. Crammer, K., Gilad-Bachrach, R., Navot, A., Tishby, N.: Margin analysis of the
LVQ algorithm. Advances in Neural Information Processing Systems, 479–486
(2003)
15. Gilad-Bachrach, R., Navot, A., Tishby, N.: Margin based feature selection-theory
and algorithms. In: Proceeding of the Twenty-First International Conference on
Machine Learning (2004)
16. He, H., Garcia, E.A.: Learning from Imbalance Data. IEEE Transaction on Knowl-
edge and Data Engineering 21(9), 1263–1284 (2009)
17. Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of Computer and System Sciences 55(1),
119–139 (1997)
18. Bowyer, A.: Computing Dirichlet tessellations. The Computer Journal 24(2) (1981)
19. Witten, I.H., Frank, E.: Data mining: practical machine learning tools and tech-
niques with Java implementations. ACM SIGMOD Record 31(1), 76–77 (2002)
20. UCL machine learning group, http://www.dice.ucl.ac.be/mlg/?page=Elena
21. Asuncion, A., Newman, D.: UCI machine learning repository (2007)
22. Bradley, A.: The use of the area under the ROC curve in the evaluation of machine
learning algorithms. Pattern Recognition 30(7), 1145–1159 (1997)
23. Van Rijsbergen, C.J.: Information Retrieval. Butterworths, London (1979)
24. Wang, B.X., Japkowicz, N.: Imbalanced Data Set Learning with Synthetic Samples.
In: Proc. IRIS Machine Learning Workshop (2004)
25. Dietterich, T.G.: Ensemble methods in machine learning. In: Kittler, J., Roli, F.
(eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000)
26. Guo, H., Viktor, H.L.: Learning from Imbalanced Data Sets with Boosting and
Data Generation: the DataBoost-IM Approach. SIGKDD Explorations: Special
issue on Learning from Imbalanced Datasets 6(1), 30–39 (2004)
27. Liu, X.Y., Wu, J., Zhou, Z.H.: Exploratory undersampling for class-imbalance
learning. IEEE Transactions on Systems, Man and Cybernetics - Part B: Cyber-
netics 39(2), 539–550 (2009)
28. Cohen, W.: Fast Effective Rule Induction. In: Proceeding of 12th International
Conf. on Machine Learning, Lake Tahoe, CA, pp. 115–123. Morgan Kaufmann,
San Francisco (1995)
Improving k Nearest Neighbor with Exemplar
Generalization for Imbalanced Classification

Yuxuan Li and Xiuzhen Zhang

School of Computer Science and Information Technology,


RMIT University, Melbourne, Australia
{li.yuxuan,xiuzhen.zhang}@rmit.edu.au

Abstract. A k nearest neighbor (kNN) classifier classifies a query in-


stance to the most frequent class of its k nearest neighbors in the training
instance space. For imbalanced class distribution, a query instance is of-
ten overwhelmed by majority class instances in its neighborhood and
likely to be classified to the majority class. We propose to identify exem-
plar minority class training instances and generalize them to Gaussian
balls as concepts for the minority class. Our k Exemplar-based Nearest
Neighbor (kENN) classifier is therefore more sensitive to the minority
class. Extensive experiments show that kENN significantly improves the
performance of kNN and also outperforms popular re-sampling and cost-
sensitive learning strategies for imbalanced classification.

Keywords: Cost-sensitive learning, imbalanced learning, kNN,


re-sampling.

1 Introduction

Skewed class distribution is common for many real-world classification problems,


including for example, detection of software defects in software development
projects [14], identification of oil spills in satellite radar images [11], and detection
of fraudulent calls [9]. Typically the intention of classification learning is to
achieve accurate classification for each class (especially the rare class) rather than
an overall accuracy without distinguishing classes. In this paper our discussions
will be focused on the two-class imbalanced classification problem, where the
minority class is the positive and the majority class is the negative.
Imbalanced class distribution has been reported to impede the performance
of many concept learning systems [20]. Many systems adopt the maximum generality bias [10] to induce a classification model, where the concept¹ with the least number of conditions (and therefore the most general concept) is chosen to describe a cluster of training instances. In the presence of class imbalance this
induction bias tends to over-generalize concepts for the majority class and miss
concepts for the minority class. In formulating decision trees [17] for example,
¹ The term concept is used in its general sense. Strictly in the context of classification learning it is a subconcept of the complete concept for some class.


induction may stop at a node where class for the node is decided by the majority
of instances under the node and instances of the minority class are ignored.
In contrast to most concept learning systems, k nearest neighbor (kNN) clas-
sification [6,1,2] or instance-based learning, does not formulate a generalized
conceptual model from the training instances at the training stage. Rather at
the classification stage, a simple and intuitive rule is used to make decisions:
instances close in the input space are likely to belong to the same class. Typi-
cally a kNN classifier classifies a query instance to the class that appears most
frequently among its k nearest neighbors. k is a parameter for tuning the clas-
sification performance and is typically set to three to seven.
Although instance-based learning has been advocated for imbalanced learn-
ing [10,19,3], to the best of our knowledge, a large-scale study of applying kNN
classification to imbalanced learning has not been reported in literature. Most
research efforts in this area have been on trying to improve its classification ef-
ficiency [1,2,21]. Various strategies have been proposed to avoid an exhaustive
search for all training instances and to achieve accurate classification.
In the presence of class imbalance kNN classification also faces challenges to
correctly detect the positive instances. For a query instance, if its neighborhood
is overwhelmed by negative instances, positive instances are still likely to be
ignored in the decision process. Our main idea to mitigate the decision errors is
to introduce a training stage that generalizes positive instances from a point to a Gaussian ball in the instance space. Rather than generalizing every positive instance, which may introduce false positives, we propose an algorithm to identify exemplar positive instances, called pivot positive instances (cf. Section 3), and use them to reliably derive the positive class boundary.
Experiments on 12 real-world imbalanced datasets show that our classifier,
k Exemplar-based Nearest Neighbor (kENN), is effective and significantly im-
proves the performance of kNN for imbalanced learning. kENN also outperforms
the current re-sampling and cost-sensitive learning strategies, namely SMOTE [5]
and MetaCost [7], for imbalanced classification.

1.1 Related Work


kNN has been advocated for learning with minority instances [10,19,3] because
of its high specificity bias of keeping all minority instances. In [10], the problem
of small disjuncts (a small cluster of training instances) was first studied and
the maximum specificity bias was shown to reduce errors for small disjuncts.
In [19] kNN was used to improve learning from small disjuncts encountered in
the C4.5 decision tree system [17]. In [3] kNN was employed to learn from small
disjuncts directly, however learning results were not provided to demonstrate
its performance. In contrast to these previous work, we propose to generalize
exemplar positive instances to form concepts for the minority class, and then
apply kNN directly for imbalanced classification.
With the assumption that learning algorithms perform best when classes
are evenly distributed, re-sampling training data for even class distribution has
been proposed to tackle imbalanced learning. Kubat and Matwin [12] tried to

[Figure: scatter of positive (+) and negative (-) instances with four query instances (*); solid lines mark the true boundaries of the positive subconcepts P1, P2 and P3, dashed lines mark the decision boundaries learned by a classification model.]

Fig. 1. An artificial imbalance classification problem

under-sample the majority class, while Ling and Li [13] combined over-sampling
of the minority class with under-sampling of the majority class. Especially
Chawla and Bowyer [5] proposed Synthetic Minority Over-sampling TEchnique
(SMOTE) to over-sample the minority class by creating synthetic samples. It
was shown that SMOTE over-sampling of the minority class in combination
with under-sampling the majority class often could achieve effective imbalanced
learning.
Another popular strategy tackling the imbalanced distribution problem is
cost-sensitive learning [8]. Domingos [7] proposed a re-costing method called
MetaCost, which can be applied to general classifiers. The approach made error-
based classifiers cost-sensitive. His experimental results showed that MetaCost
reduced costs compared to cost-blind classifier using C4.5Rules as baseline.
Our experiments (c.f. Section 5) show that SMOTE in combination with
under-sampling of majority class as well as MetaCost significantly improves the
performance of C4.5 for imbalanced learning. However these strategies some-
how do not statistically significantly improve the performance of kNN for class
imbalance. This may be partly explained by that kNN makes classification de-
cision by examining the local neighborhood of query instances where the global
re-sampling and cost-adjustment strategies may not have pronounced effect.

2 Main Ideas
Fig. 1 shows an artificial two-class imbalance problem, where positive instances
are denoted as “+” and negative instances are denoted as “-”. True class bound-
aries are represented as solid lines while the decision boundaries by some classi-
fication model are represented as dashed lines. Four query instances that indeed
belong to the positive class are represented as stars (*). Three subconcepts as-
sociated with the positive class are the three regions formed by the solid lines,
denoted as P1, P2 and P3 respectively. Subconcept P1 covers a large portion of
instances in the positive instance space whereas P2 and P3 correspond to small

[Figure: two Voronoi diagrams of the subspace of subconcept P3, (a) Standard 1NN and (b) Exemplar 1NN, with positive (+), negative (-) and query (*) instances and the positive class boundary drawn in bold.]

Fig. 2. The Voronoi diagram for the subspace of subconcept P3 of Fig. 1

disjuncts of positive instances. Note that the lack of data for the subconcepts P2 and P3 causes the classification model to learn inappropriate decision boundaries for P2 and P3. As a result, two query instances (denoted by *) that are indeed positive as defined by P2 fall outside the positive decision boundary of the classifier, and similarly for another query instance defined as positive by P3.
Given the problem in Fig. 1, we illustrate the challenge faced by a stan-
dard kNN classifier using the subspace of instances at the lower right corner.
Figure 2(a) shows the Voronoi diagram for subconcept P3 in the subspace, where
the positive class boundary decided by standard 1NN is represented as the poly-
gon in bold line. The 1NN induction strategy where the class of an instance is
decided by the class of its nearest neighbor results in a class boundary much
smaller than the true class boundary (circle). As a result the query instance (de-
noted by *), which indeed is a positive instance inside the true positive boundary,
is predicted as negative by standard 1NN. Obviously to achieve more accurate
prediction, the decision boundary for the positive class should be expanded so
that it is closer to the true class boundary.
A naive approach to expanding the decision boundary for the positive class is
to generalize every positive instance in the training instance space from a point
to a Gaussian ball. However this aggressive approach to expanding the positive
boundary can most definitely introduce false positives. We need a strategy to
selectively expand some positive points in the training instance space so that
the decision boundary closely approximates the real class boundary while not
introducing too many false positives.
Our main idea of expanding the decision boundary for the positive class while
minimizing false positives is based on exemplar positive instances. Exemplar
positive instances should be the positive instances that can be generalized to
reliably classify more positive instances in independent tests. Intuitively these
instances should include the strong positive instances at or close to the center
of a disjunct of positive instances in the training instance space. Weak positive
instances close to the class boundaries should be excluded.
Fig. 2(b) shows the Voronoi diagram after the three positive instances at the
center of the disjunct of positive instances have been used to expand the bound-
ary for the positive class. Obviously the decision boundary after adjustment is

much closer to the real class boundary. As a result, the query instance (repre-
sented by *) is now enclosed by the boundary decided by the classifier and is
correctly predicted as positive.

3 Pivot Positive Instances


Ideally exemplar positive instances can be reliably generalized to form the sub-
concept for a disjunct of positive instances with low false positive errors in the
space of training instances. We call these exemplar instances pivot positive in-
stances (PPIs) and define them using their neighborhood.
Definition 1. The Gaussian ball B(x, r) centered at an instance x in the train-
ing instance space Rn (n is the number of features defining the space) is the set
of instances within distance r of x: {y ∈ Rn | distance(x, y)≤ r}.
Each Gaussian ball defines a positive subconcept and only those positive in-
stances that can form sufficiently accurate positive subconcepts are pivot positive
instances, as defined below.
Definition 2. Given a training instance space Rn , and a positive instance x ∈
Rn , let the distance between x and its nearest positive neighbor be e. For a false
positive error rate (FP rate) threshold δ, x is a pivot positive instance (PPI) if
the subconcept for Gaussian ball B(x, e) has an FP rate ≤ δ.
For simplicity the FP rate for the Gaussian ball centered at a positive instance
is called the FP rate for the positive instance. To explain the concept of PPI,
let us for now assume that the false positive rate for a positive instance is its
observed false positive ratio in the training instance space.
Example 1. Consider the positive instances in the subspace highlighted in Fig. 1.
Given a false positive rate threshold of 30%, the three positive instances at the
center have zero false positives in its Gaussian ball (shown in enlarged form in
Fig. 2(b)) and therefore are PPIs. The other two positive instances however are
not PPIs, as they have observed false positive ratio of respectively 50% (2 out
of 4) and 33.3% (1 out of 3).
The observed false positive ratio for a positive instance in the training instance
space is not accurate description of its performance in the presence of indepen-
dently chosen test instances. We estimate the false positive rate by re-adjusting
the observed false positive ratio using pessimistic estimate. A similar approach
has been used to estimate the error rate for decision tree nodes in C4.5 [17,22].
Assume that the number of false positives in a Gaussian ball of N instances
follows the binomial distribution B(N, p), where p is the real probability of false
positives in the Gaussian ball. For a given confidence level c where its corre-
sponding z can be computed, p can be estimated from N and the observed false
positive ratio f as follows [22]:

\frac{f + \frac{z^2}{2N} + z\sqrt{\frac{f(1-f)}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}} \qquad (1)
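A minimal sketch of this pessimistic estimate (Eq. (1)) in Python is shown below (not the authors' implementation); mapping the confidence level c to z via the one-sided normal quantile is an assumption, and the no-error case follows the footnote formula 1 − c**(1/N).

from math import sqrt
from scipy.stats import norm

def pessimistic_rate(f, n, c=0.10):
    """Pessimistic upper estimate of a false-positive rate (Eq. (1)).

    f : observed false-positive ratio in the Gaussian ball
    n : number of instances the ratio was observed on
    c : confidence level; z = one-sided normal quantile for c (an assumption).
    """
    z = norm.ppf(1.0 - c)
    if f == 0.0:
        # No observed errors: use the footnote formula 1 - c**(1/n).
        return 1.0 - c ** (1.0 / n)
    num = f + z**2 / (2 * n) + z * sqrt(f * (1 - f) / n + z**2 / (4 * n**2))
    return num / (1 + z**2 / n)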

Algorithm 1. Compute the set of positive pivot points


Input: a) Training set T (|T | is number of instances in T ); b) confidence level c.
Output: The set of pivot positive instances P (with radius r for each Gaussian ball)
1: δ ← FP rate threshold by Equation (1) from c, |T |, and prior negative frequency
2: P ← φ
3: for each positive instance x ∈ T do
4: G ← neighbors of x in increasing order of distance to x
5: for k = 1 to |G| do
6: if G[k] is a positive instance then
7: break {;; G[k] is the nearest positive neighbor of x}
8: end if
9: end for
10: r ← distance(x, G[k])
11: f ← (k − 1)/(k + 1) {;; Gaussian ball B(x, r) has k + 1 instances and (k + 1 − 2) FPs}
12: p ← the FP rate by Equation (1) from c, k and f
13: if p ≤ δ then
14: P ← P ∪ {x} {;; x is a pivot positive instance, and P is the output}
15: end if
16: end for
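A compact, self-contained sketch of Algorithm 1 in Python/NumPy follows (illustrative, not the authors' WEKA code); it assumes Euclidean distance, at least two positive training instances, and the same confidence-to-z mapping assumed above.

import numpy as np
from scipy.stats import norm

def pessimistic(f, n, c):
    # Pessimistic FP rate estimate (Eq. (1)); f = 0 falls back to 1 - c**(1/n).
    z = norm.ppf(1.0 - c)
    if f == 0.0:
        return 1.0 - c ** (1.0 / n)
    num = f + z**2 / (2 * n) + z * np.sqrt(f * (1 - f) / n + z**2 / (4 * n**2))
    return num / (1 + z**2 / n)

def pivot_positive_instances(X, y, c=0.10, positive=1):
    """Return (index, radius) pairs of pivot positive instances (Algorithm 1)."""
    delta = pessimistic(np.mean(y != positive), len(y), c)   # FP rate threshold
    ppis = []
    for i in np.where(y == positive)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        order = np.argsort(d)[1:]                            # neighbours by distance, excluding x itself
        # k: 1-based position of the nearest positive neighbour of x.
        k = next(j + 1 for j, idx in enumerate(order) if y[idx] == positive)
        r = d[order[k - 1]]                                  # Gaussian-ball radius
        f = (k - 1) / (k + 1)                                # observed FP ratio in the ball
        if pessimistic(f, k, c) <= delta:
            ppis.append((i, r))
    return ppis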

The Gaussian ball for a positive instance always has two positive instances: the reference positive instance and its nearest positive neighbor. The confidence level is a parameter tuning the performance of PPIs. A high confidence level means the estimated false positive error rate stays close to the observed false positive ratio, and thus few false positives are tolerated in identifying PPIs. On very imbalanced data we need to tolerate a large number of false positives to aggressively identify PPIs and achieve high sensitivity for the positives. Our experiments (Section 5.3) confirm this hypothesis. The default confidence level is set to 10%.
We set the FP rate threshold for identifying PPIs based on the imbalance
level of training data. The threshold for PPIs are dynamically determined by
the prior negative class frequency. If the false positive rate for a positive in-
stance estimated using Equation (1) is not greater than the threshold estimated
from the prior negative class frequency, the positive instance is a PPI. Under
this setting, a relatively larger number of FP errors are allowed in Gaussian balls
for imbalanced data while less errors are allowed for balanced data. Especially
on very balanced data the PPI mechanism will be turned off and kENN reverts
to standard kNN. For example on a balanced dataset of 50 positive instances
and 50 negative instances, at a confidence level of 10%, the FP rate thresh-
old for PPIs is 56.8% (estimated from the 50% negative class frequency using
Equation (1)). A Gaussian ball without any observed FP errors (and containing
2 positive instances only) has an estimated FP rate of 68.4%². As a result no
PPIs are identified at the training stage and standard kNN classification will be
applied.

² Following standard statistics, when there are not any observed errors, for N instances at confidence level c the estimated error rate is 1 − \sqrt[N]{c}.

4 k Exemplar-Based Nearest Neighbor Classification


We now describe the algorithm to identify pivot positive instances at the training
stage, and how to make use of the pivot positive instances for classification for
nearest neighbor classification.
The complete process of computing pivot positive instances from a given set
of training instances is illustrated in Algorithm 1. Input to the algorithm are
the training instances and a confidence level c. Output of the algorithm are the
pivot positive instances with their radius distance for generalization to Gaussian
balls. In the algorithm first FP rate threshold δ is computed using Equation (1)
from confidence level c, number of training instances |T | and the prior negative
class frequency (line 1). The neighbors of each positive instance x are sorted in
increasing order of their distance to x (line 4). The loop of lines 3 to 16 computes
the PPIs and accumulating them in P to be output (line 14). Inside the loop, the
FP rate p for each positive instance x is computed using Equation (1) from the
observed FP ratio f (line 12) and if p ≤ δ, x is identified as a PPI and kept in P
(line 14). The main computation of Algorithm 1 lies in the process computing
up to n − 1 nearest neighbors for each positive instance x and sorting according
to their distance to x, where n is the size of the training set (line 4). Algorithm 1
thus has a complexity of O(p∗n log n), where p and n are respectively the number
of positive and all instances in the training set. Note that p << n for imbalanced
datasets, and so the algorithm has reasonable time efficiency.
At the classification stage, to implement the concept of Gaussian balls for pivot
positive instances, the distance of a query instance to its k nearest neighbors is
adjusted for all PPIs. Specifically, for a query instance t and a training instance x, the adjusted distance between t and x is defined as:

\mathrm{adjusted\_distance}(t, x) =
\begin{cases}
\mathrm{distance}(t, x) - x.\mathrm{radius} & \text{if } x \text{ is a PPI} \\
\mathrm{distance}(t, x) & \text{otherwise}
\end{cases}
\qquad (2)

where distance(t, x) is the distance between t and x using some metric of stan-
dard kNN. With the above equation, the distance between a query instance and
a PPI in the training instance space is reduced by the radius of the PPI. As a re-
sult the adjusted distance is conceptually equivalent to the distance of the query
instance to the edge of the Gaussian ball centered at the PPI. The adjusted
distance function as defined in Equation (2) can be used in kNN classification
in the presence of class imbalance, and we call the classifier k Exemplar-based
Nearest Neighbor (kENN).
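The following sketch (Python, illustrative rather than the authors' WEKA implementation) shows how the adjusted distance of Eq. (2) could drive a k-nearest-neighbor vote; ppi_radius maps PPI indices to their radii, for example as produced by the PPI sketch above.

import numpy as np
from collections import Counter

def kenn_predict(query, X_train, y_train, ppi_radius, k=3):
    """Classify `query` with kNN over distances adjusted by Eq. (2).

    ppi_radius: dict mapping the index of each pivot positive instance
    to its Gaussian-ball radius; non-PPI training instances are absent.
    """
    d = np.linalg.norm(X_train - query, axis=1)
    for i, r in ppi_radius.items():
        d[i] -= r                      # shrink distance to the edge of the Gaussian ball
    nearest = np.argsort(d)[:k]
    votes = Counter(y_train[j] for j in nearest)
    return votes.most_common(1)[0][0]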

5 Experiments

We conducted experiments to evaluate the performance of kENN. kENN was


compared against kNN and the naive approach of generalizing positive sub-
concepts for kNN (Section 2). kENN was also compared against two popular
imbalanced learning strategies SMOTE re-sampling and MetaCost cost-sensitive

Table 1. The experimental datasets, ordered in decreasing level of imbalance

Dataset size #attr (num, symb) classes (pos, neg) minority (%)
Oil 937 47(47, 0) (true, false) 4.38%
Hypo-thyroid 3163 25 (7, 18) (true, false) 4.77%
PC1 1109 21 (21,0) (true, false) 6.94%
Glass 214 9 (9,0) (3, other) 7.94%
Satimage 6435 36 (36,0) (4, other) 9.73%
CM1 498 21 (21,0) (true, false) 9.84%
New-thyroid 215 5 (5,0) (3, other) 13.95%
KC1 2109 21 (21,0) (true, false) 15.46%
SPECT F 267 44 (44,0) (0, 1) 20.60%
Hepatitis 155 19 (6,13) (1, 2) 20.65%
Vehicle 846 18 (18,0) (van, other) 23.52%
German 1000 20 (7,13) (2, 1) 30.00%

learning, using kNN (IBk in WEKA) and C4.5 (J48 in WEKA) as the base classi-
fiers. All classifiers were developed based on the WEKA data mining toolkit [22],
and are available at http://www.cs.rmit.edu.au/∼zhang/ENN. For both kNN
and kENN k was set to 3 by default, and the confidence level of kENN was set
to 10%. To increase the sensitivity of C4.5 to the minority class, C4.5 was set
with the -M1 option that minimum one instance was allowed for a leaf node
and without pruning. SMOTE oversampling combined with undersampling was
applied to 3NN and C4.5, denoted as 3NNSmt+ and C4.5Smt+ respectively.
SpreadSubsample was used to undersample the majority class for uniform dis-
tribution (M=1.0), and then SMOTE was applied to generate additional 3 times
more instances for the minority class. MetaCost was used for cost-sensitive learn-
ing with 3NN and C4.5 (denoted as 3NNMeta and C4.5Meta) and the cost of
each class was set to the inverse of class ratio.
Table 1 summarizes the 12 real-world imbalanced datasets from various do-
mains used in our experiments, from highly imbalanced (the minority 4.35%) to
moderately imbalanced (the minority 30.00%). The Oil dataset was provided by
Robert Holte [11], and the task is to detect the oil spill (4.3%) from satellite
images. The CM1, KC1 and PC1 datasets were obtained from the NASA IV&V
Facility Metrics Data Program (MDP) repository (http://mdp.ivv.nasa.gov/
index.html). The task is to predict software defects (around 10% on average)
in software modules. The remaining datasets were compiled from the UCI Ma-
chine Learning Repository (http://archive.ics.uci.edu/ml). In addition to
the natural 2-class domains, like thyroid diseases diagnoses and Hepatitis, we
also constructed four imbalanced datasets by choosing one class as the positive
and the remaining classes combined as the negative.
The Receiver Operating Characteristic (ROC) curve [18] is becoming widely
used to evaluate imbalanced classification. Given a confusion matrix of four types
of decisions True Positive (TP), False Positive (FP), True Negative (TN) and
False Negative (FN), ROC curves depict tradeoffs between TP rate = TP/(TP + FN) and FP rate = FP/(FP + TN). Good classifiers can achieve a high TP rate at a low

Table 2. The AUC for kENN, in comparison with other systems. The best result for
each dataset is in bold. AUCs with difference <0.005 are considered equivalent.

Dataset 3ENN Naive 3NN 3NNSmt+ 3NNMeta C4.5 C4.5Smt+ C4.5Meta


Oil 0.811 0.788 0.796 0.797 0.772 0.685 0.771 0.764
Hypo-thyroid 0.846 0.831 0.849 0.901 0.846 0.924 0.948 0.937
PC1 0.806 0.786 0.756 0.755 0.796 0.789 0.728 0.76
Glass 0.749 0.623 0.645 0.707 0.659 0.696 0.69 0.754
Satimage 0.925 0.839 0.918 0.902 0.928 0.767 0.796 0.765
CM1 0.681 0.606 0.637 0.666 0.625 0.607 0.666 0.668
New-thyroid 0.99 0.945 0.939 0.972 0.962 0.927 0.935 0.931
KC1 0.794 0.732 0.759 0.756 0.779 0.64 0.709 0.695
SPECT F 0.767 0.728 0.72 0.725 0.735 0.626 0.724 0.643
Hepatitis 0.783 0.71 0.758 0.772 0.744 0.753 0.713 0.745
Vehicle 0.952 0.945 0.969 0.942 0.956 0.921 0.926 0.929
German 0.714 0.677 0.69 0.686 0.705 0.608 0.649 0.606
Average 0.818 0.768 0.786 0.798 0.792 0.745 0.771 0.766

FP rate. Area Under the ROC Curve (AUC) measures the overall classification
performance [4], and a perfect classifier has an AUC of 1.0. All results reported
next were obtained from 10-fold cross validation and two-tailed paired t-tests at
95% confidence level were used to test statistical significance.
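As a worked illustration of these measures (computed here with scikit-learn rather than the WEKA modules used in the paper), the ROC points and the AUC can be derived from positive-class probability scores as follows; the arrays are hypothetical.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels (1 = positive/minority) and predicted positive-class probabilities.
y_true  = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_score = np.array([0.9, 0.4, 0.2, 0.6, 0.3, 0.1, 0.7, 0.8, 0.2, 0.1])

# Each threshold on y_score yields one (FP rate, TP rate) point of the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC =", roc_auc_score(y_true, y_score))   # 1.0 would be a perfect classifier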
The ROC convex hull method provides visual performance analysis of classifi-
cation algorithms at different levels of sensitivity [15,16]. In the ROC space, each
point of the ROC curve for a classification algorithm corresponds to a classifier.
If a point falls on the convex hull of all ROC curves the corresponding classifier
is potentially an optimal classifier; otherwise the classifier is not optimal. Given
a classification algorithm, the higher fraction of its ROC curve points lie on the
convex hull the more chance the algorithm produce optimal classifiers.
For all results reported next, data points for the ROC curves were generated
using the ThresholdCurve module of WEKA, which correspond to the number
of TPs and FPs that result from setting various thresholds on the probability of
the positive class. The AUC value for ROC curves were obtained using the Mann
Whitney statistic in WEKA. The convex hulls for the ROC curves were computed using the ROCCH package³.

5.1 Performance Evaluation Using AUC


Table 2 shows the AUC results for all models. It can be seen that 3ENN is a
very competitive model. Compared with the remaining models, 3ENN has the
highest average AUC of 0.818 and wins on 9 datasets. In comparison the average
AUC for the Naive method is just 0.768. 3ENN significantly outperforms all of
3NN (p = 0.005), 3NNSmt+ (p = 0.029), 3NNMeta (p = 0.008), C4.5Smt+
(p = 0.014) and C4.5Meta (p = 0.021). This result confirms that our exemplar-
based positive concept generalization strategy is very effective for improving the
³ Available at http://home.comcast.net/~tom.fawcett/public_html/ROCCH/
[Figure: ROC curves of 3ENN, C4.5Smt+, 3NNSmt+ and 3NNMeta with their convex hull, panel (a) New-thyroid and panel (b) German.]

Fig. 3. ROC curves with convex hull on two datasets. The x-axis is the FP rate and
the y-axis is the TP rate. Points on the convex hull are highlighted with a large circle.

performance of kNN for imbalanced classification, and furthermore the strategy


is more effective than re-sampling and cost-sensitive learning strategies.
It should be noted that C4.5Smt+ and C4.5Meta both demonstrate
improvement over C4.5. This shows that re-sampling and cost-sensitive learning
strategies are effective for improving the performance of C4.5 for imbalanced
classification, which is consistent with previous findings [5,20]. On the other
hand however, 3NNSmt+ and 3NNMeta do not show significant improvement
over 3NN. That these strategies are less effective on kNN for class imbalance may be attributed to the fact that kNN adopts a maximal specificity induction strategy. For example, the re-sampling strategy ensures overall class balance, but this does not necessarily mean that the minority class is well represented in the neighborhood of individual query instances. Since kNN does not form concepts from the overall even sample distribution after re-sampling, it may still miss some positive query instances due to the under-representation of the positive class in their neighborhood.

5.2 The ROC Convex Hull Analysis


Table 2 has shown that C4.5Smt+ outperforms C4.5Meta, and so for readability
we only compare the ROC curves of 3ENN against that of 3NNSmt+, 3NNMeta
and C4.5Smt+. Fig. 3 shows the ROC curves of the four models on the New-
thyroid and German datasets. From Table 2, 3ENN and 3NNSmt+ have the
best AUC results of 0.99 and 0.972 on New-thyroid, which has a relatively high
level of imbalance of 13.95%. But as shown in Fig. 3(a), the ROC curves of the
four models show very different trends. Notably more points of 3ENN lie on the
convex hull at low FP rates (<10%). Conversely more points of 3NNSmt+ lie on
the convex hull at high FP rates (>50%). It is desirable in many applications to
achieve accurate prediction at low false positive rate and so 3ENN is obviously
a good choice for this purpose. German has a moderate imbalance level of 30%.

[Figure: AUC (roughly 0.65-0.85, y-axis) versus confidence level in percent (0-50, x-axis) for the Oil, Glass, KC1 and German datasets.]

Fig. 4. The AUC of 3ENN with varying confidence level

ROC curves of the four models demonstrate similar trends on German, as shown
in Fig. 3(b). Still at low FP rates, more points from 3ENN lie on the ROC convex
hull, which again shows that 3ENN is a strong model.

5.3 The Impact of Confidence Level on kENN


As discussed in Section 3 confidence level affects the decision in kENN of whether
to generalize a positive instance to a Gaussian ball. We applied 3ENN to two
highly imbalanced datasets and two moderately imbalanced datasets with con-
fidence level from 1% to 50%. The AUC results are shown in Fig. 4. For the two
datasets with high imbalance (Oil 4.38% and Glass 7.94%) AUC is negatively
correlated with confidence level. For example on Oil when the confidence level
increases from 1% to 50% the AUC decreases from 0.813 to 0.801. However for
the two datasets with moderate imbalance (KC1 15.46% and German 30.00%)
AUC is positively correlated with confidence level. On German, when the confidence level increases from 1% to 50%, AUC increases from 0.69 to 0.718. The opposite behavior of AUC in relation to confidence level may be explained as follows: on
highly imbalanced data, to predict more positive instances, it is desirable to tol-
erate more false positives in forming Gaussian balls, which is achieved by setting
a low confidence level. Such an aggressive strategy increases the sensitivity of
kENN to positive instances. On less imbalanced datasets where there are rela-
tively sufficient positive instances, a high confidence level is desired to ensure a
low level of false positives in positive Gaussian balls.

6 Conclusions
With kNN classification, the class of a query instance is decided by the majority
class of its k nearest neighbors. In the presence of class imbalance, a query in-
stance is often classified as belonging to the majority class and as a result many
positive (minority class) instances are misclassified. In this paper, we have pro-
posed a training stage where exemplar positive training instances are identified

and generalized into Gaussian balls as concepts for the minority class. When
classifying a query instance using its k nearest neighbors, the positive concepts
formulated at the training stage ensure that classification is more sensitive to the
minority class. Extensive experiments have shown that our strategy significantly
improves the performance of kNN and also outperforms popular re-sampling and
cost-sensitive learning strategies for imbalanced learning.

References
1. Aha, D.W. (ed.): Lazy learning. Kluwer Academic Publishers, Dordrecht (1997)
2. Aha, D.W., et al.: Instance-based learning algorithms. Machine Learning 6 (1991)
3. Bosch, A., et al.: When small disjuncts abound, try lazy learning: A case study.
In: BDCML (1997)
4. Bradley, A.P.: The use of the area under the ROC curve in the evaluation of
machine learning algorithms. Pattern Recognition 30 (1997)
5. Chawla, N.V., et al.: SMOTE: Synthetic minority over-sampling technique. Journal
of Artificial Intelligence Research 16 (2002)
6. Cover, T., Hart, P.: Nearest neighbor pattern classification. Institute of Electrical
and Electronics Engineers Transactions on Information Theory 13 (1967)
7. Domingos, P.: Metacost: A general method for making classifiers cost-sensitive. In:
KDD 1999 (1999)
8. Elkan, C.: The foundations of cost-sensitive learning. In: IJCAI (2001)
9. Fawcett, T., Provost, F.J.: Adaptive fraud detection. Data Mining and Knowledge
Discovery 1(3) (1997)
10. Holte, R.C., et al.: Concept learning and the problem of small disjuncts. In: IJCAI
1989 (1989)
11. Kubat, M., et al.: Machine learning for the detection of oil spills in satellite radar
images. Machine Learning 30(2-3) (1998)
12. Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: One-sided
selection. In: ICML 1997 (1997)
13. Ling, C., et al.: Data mining for direct marketing: Problems and solutions. In: KDD
1998 (1998)
14. Menzies, T., et al.: Data mining static code attributes to learn defect predictors.
IEEE Transactions on Software Engineering 33 (2007)
15. Provost, F., et al.: The case against accuracy estimation for comparing induction
algorithms. In: ICML 1998 (1998)
16. Provost, F.J., Fawcett, T.: Robust classification for imprecise environments. Ma-
chine Learning 42(3) (2001)
17. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Fran-
cisco (1993)
18. Swets, J.: Measuring the accuracy of diagnostic systems. Science 240(4857) (1988)
19. Ting, K.: The problem of small disjuncts: its remedy in decision trees. In: Canadian
Conference on Artificial Intelligence (1994)
20. Weiss, G.M.: Mining with rarity: a unifying framework. SIGKDD Explorations 6(1)
(2004)
21. Wilson, D.R., Martinez, T.R.: Reduction techniques for instance-based learning
algorithms. Machine Learning (2000)
22. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Tech-
niques. Morgan Kaufmann, San Francisco (2005)
Sample Subset Optimization for Classifying
Imbalanced Biological Data

Pengyi Yang1,2,3, Zili Zhang4,5,*, Bing B. Zhou1,3, and Albert Y. Zomaya1,3


1 School of Information Technologies, University of Sydney, NSW 2006, Australia
2 NICTA, Australian Technology Park, Eveleigh, NSW 2015, Australia
3 Centre for Distributed and High Performance Computing, University of Sydney, NSW 2006, Australia
4 Faculty of Computer and Information Science, Southwest University, CQ 400715, China
5 School of Information Technology, Deakin University, VIC 3217, Australia
yangpy@it.usyd.edu.au, zhangzl@swu.edu.cn

Abstract. Data in many biological problems are often compounded by


imbalanced class distribution. That is, the positive examples may be largely outnumbered by the negative examples. Many classification algorithms
such as support vector machine (SVM) are sensitive to data with im-
balanced class distribution, and result in a suboptimal classification. It
is desirable to compensate the imbalance effect in model training for
more accurate classification. In this study, we propose a sample subset
optimization technique for classifying biological data with moderate and
extremely high imbalanced class distributions. By using this optimization
technique with an ensemble of SVMs, we build multiple roughly balanced
SVM base classifiers, each trained on an optimized sample subset. The
experimental results demonstrate that the ensemble of SVMs created by
our sample subset optimization technique can achieve higher area under
the ROC curve (AUC) value than popular sampling approaches such as
random over-/under-sampling; SMOTE sampling, and those in widely
used ensemble approaches such as bagging and boosting.

1 Introduction
Modern molecular biology is rapidly advanced by the increasing use of computa-
tional techniques. For tasks such as RNA gene prediction [1], promoter recogni-
tion [2], splice site identification [3], and the classification of protein localization
sites [4], it is often necessary to address the problem of imbalanced class distri-
bution because the datasets extracted from those biological systems are likely to
contain a large number of negative examples (referred to as majority class) and
a small number of positive examples (referred to as minority class). Many pop-
ular classification algorithms such as support vector machine (SVM) have been
applied to a large variety of bioinformatics problems including those mentioned
above (e.g. refs. [1,3,4]). However, most of these algorithms are sensitive to the

* Corresponding author.


imbalanced class distribution and may not perform well if directly applied to imbalanced data [5,6].
Sampling is a popular approach to addressing the imbalanced class distri-
bution [7]. Simple methods such as random under-sampling and random over-
sampling are routinely applied in many bioinformatics studies [8]. With random
under-sampling, the size of the majority class is reduced to compensate the im-
balance, whereas with random over-sampling, the size of the minority class is
increased to compensate the imbalance. Although they are straightforward and
computationally efficient, these two methods are prone to either increased noise
and duplicated samples or informative sample removal [9]. A more sophisticated
approach known as SMOTE is to synthesize “new” samples using original sam-
ples in the dataset [10]. However, many bioinformatics problems often present
several thousands of samples with a highly imbalanced class distribution. Ap-
plying SMOTE will introduce a large number of synthetic samples which may
increase the data noise substantially. Alternatively, a cost-metric can be speci-
fied to force the classifier to pay more attention to the minority class [11]. This requires choosing a correct cost-metric, which is often unknown a priori.
Several recent studies found that ensemble learning could improve the per-
formance of a single classifier in imbalanced data classification [6,12]. In this
study, we explore along this direction. In particular, we introduce a sample sub-
set optimization technique for ‘intelligent under-sampling’ in imbalanced data
classification. Using this technique, we designed an ensemble of SVMs specifi-
cally for learning from imbalanced biological datasets. This system has several
advantages over the conventional ones:
– It creates each base classifier using a roughly balanced training subset with
a built-in intelligent under-sampling. This is important in learning from im-
balanced data because it reduces the risk of bias towards one class while
neglecting the other one.
– The system embraces an ensemble framework in which multiple roughly bal-
anced training subsets are created to train an ensemble of classifiers. Thus,
it reduces the risk of removing informative samples from the majority class,
which may occur when a simple under-sampling technique is applied.
– As opposed to random sampling, the sample subset optimization technique
is applied to identify optimal sample subsets. This may improve the quality
of the base classifiers and result in a more accurate ensemble.
– The aforementioned biological problems often present several thousands of
training samples. The proposed technique is essentially an under-sampling
approach. It can avoid the introduction of data noise and the generated data
subsets may be more efficient for classifier training.
The rest of the paper discusses the details of the proposed sample subset op-
timization technique and the associated ensemble learning system. Section 2
presents the ensemble learning system. Section 3 describes the main idea of
sample subset optimization. The base classifier and fitness function of the en-
semble system are described in Section 4. Comparisons with typical sampling
and ensemble methods are given in Section 5. Section 6 concludes the paper.

2 Ensemble System
Ensemble learning is an effective approach for improving the prediction accuracy
of a single classification algorithm. Such an improvement is commonly achieved
by using multiple classifiers (known as the base classifiers) each trained on a
subset of samples created by random sampling such as those used in bagging
[13], or cost-sensitive sampling such as those used in boosting [14]. The base
classifiers are typically combined using an integration function such as averaging
[15] or majority voting [16].
We propose an ensemble learning system specifically designed for imbalanced
biological data classification. The schematic representation of the proposed sys-
tem is shown in Figure 1. It has three main components – sample subset opti-
mization, base classifier, and fitness function. The key of this ensemble system
is the application of the sample subset optimization techniques (to be described
in Section 3).
Suppose that a highly imbalanced dataset contains n samples from the majority class and m samples from the minority class, where n ≫ m. The system creates each sample subset by including all m minority samples and selecting a subset of samples from the n majority samples according to an internal optimization procedure. This procedure is conducted to generate multiple optimized sample subsets, each being a roughly balanced subset containing the m minority samples and n_i carefully selected majority samples, where n_i ≪ n (i = 1...L) and L is the total number of optimized sample subsets. Using these optimized sample subsets, we obtain a group of base classifiers c_i (i = 1...L), each trained on its corresponding sample subset {m + n_i}. The base classifiers are then combined using majority voting to form an ensemble of classifiers.
Algorithm 1 summarizes the procedure. A line starting with “//” in the al-
gorithm is a comment for its adjacent next line.

[Figure: schematic of the system. The training set (m minority, n majority samples) yields L optimized, roughly balanced training subsets (m combined with n_1, ..., n_L selected majority samples); these train base classifiers c_1, ..., c_L, whose majority vote produces the prediction on the test set (m', n'), evaluated by the AUC value.]

Fig. 1. A schematic representation of the proposed ensemble system



Algorithm 1. sampleSubsetOptimization
Input: Imbalanced dataset D^I
Output: Roughly balanced dataset D^B
1: cvSize = 2;
2: cvSets = crossValidate(D^I, cvSize);
3: for i = 1 to cvSize do
4: // obtain the internal training samples
5: D_i^T = getTrain(cvSets, i);
6: // obtain the internal test samples
7: D_i^t = getTest(cvSets, i);
8: // obtain samples of the minority class
9: D_i^minor = getMinoritySample(D_i^T);
10: // obtain samples of the majority class
11: D_i^major = getMajoritySample(D_i^T);
12: // select a subset of samples from the majority class
13: D_i^major' = optimizeMajoritySample(D_i^major, D_i^minor, D_i^t);
14: D^B = D^B ∪ (D_i^minor ∪ D_i^major');
15: end for
16: return D^B;
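To make the overall flow of Figure 1 concrete, the following self-contained sketch (Python with scikit-learn, not the authors' implementation) trains an ensemble of linear SVMs on roughly balanced subsets and combines them by majority voting; a random under-sample of the majority class stands in here for the PSO-based subset optimization of Section 3, so that choice is an assumption.

import numpy as np
from sklearn.svm import LinearSVC

def build_ensemble(X, y, n_subsets=10, positive=1, seed=0):
    """Train one linear SVM per roughly balanced subset (all minority samples
    plus a selected subset of majority samples)."""
    rng = np.random.default_rng(seed)
    minority = np.where(y == positive)[0]
    majority = np.where(y != positive)[0]
    models = []
    for _ in range(n_subsets):
        # Placeholder for sample subset optimization: a random, roughly balanced pick.
        chosen = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, chosen])
        models.append(LinearSVC().fit(X[idx], y[idx]))
    return models

def ensemble_predict(models, X):
    # Majority vote over the base classifiers; labels are assumed to be 0/1 integers.
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)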

3 Sample Subset Optimization


The key function in Algorithm 1 is the optimization procedure applied to select
a subset of samples from the majority class (Algorithm 1, line 13). The principal
idea of the sample subset optimization procedure is to apply a cross validation
procedure to form a subset in which each sample is selected according to the
internal classification accuracy. In this section, we describe its formulation using
a particle swarm optimization (PSO) algorithm [17], and analyze its behavior
using a synthetic dataset. The base classifier and the fitness function used for
optimization are discussed in Section 4.

3.1 Formulation of Sample Subset Optimization


We formulate the sample subset optimization using a particle swarm optimiza-
tion algorithm. In particular, for each sample from the majority class a dimension
in the particle space is assigned. That is, for n majority samples, the particle is
coded as an indicator function set p = {I_{x_1}, I_{x_2}, ..., I_{x_n}}. For each dimension, an indicator function I_{x_j} takes the value "1" when the corresponding jth sample x_j is included in training a classifier. Similarly, a "0" denotes that the corresponding sample is excluded from training. By optimizing a population of L particles p_i (i = 1...L), the velocity v_{i,j}(t) of the ith particle and the position s_{i,j}(t) of this particle in the jth dimension of the solution space are updated in each iteration t as follows:

v_{i,j}(t+1) = w \cdot v_{i,j}(t) + c_1 r_1 \cdot (pbest_{i,j} - s_{i,j}(t)) + c_2 r_2 \cdot (gbest_{i,j} - s_{i,j}(t)) \qquad (1)


s_{i,j}(t+1) =
\begin{cases}
0 & \text{if } random() \ge S(v_{i,j}(t+1)) \\
1 & \text{if } random() < S(v_{i,j}(t+1))
\end{cases}
\qquad (2)

S(v_{i,j}(t+1)) = \frac{1}{1 + e^{-v_{i,j}(t+1)}} \qquad (3)
where pbest_{i,j} and gbest_{i,j} are the previous best position and the best position found by informants, respectively; c_1, r_1, c_2, and r_2 are the learning rates and social coefficients; and random() is a random number generator with a uniform distribution on [0, 1].
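A minimal sketch of the binary particle update of Eqs. (1)-(3) in Python/NumPy is shown below (illustrative only); the inertia weight and coefficient values are assumptions, not taken from the paper.

import numpy as np

def update_particle(v, s, pbest, gbest, w=0.7, c1=2.0, c2=2.0, rng=None):
    """One binary PSO step: Eq. (1) updates the velocity, Eqs. (2)-(3) resample
    the 0/1 position through a sigmoid of the velocity."""
    rng = rng or np.random.default_rng()
    r1, r2 = rng.random(v.shape), rng.random(v.shape)
    v_new = w * v + c1 * r1 * (pbest - s) + c2 * r2 * (gbest - s)     # Eq. (1)
    prob = 1.0 / (1.0 + np.exp(-v_new))                               # Eq. (3)
    s_new = (rng.random(v.shape) < prob).astype(int)                  # Eq. (2)
    return v_new, s_new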
Representing this optimization procedure in pseudocode, we obtain Algorithm
2. Note that the PSO algorithm produces multiple optimized sample subsets in
parallel. Therefore, by specifying the popSize parameter, we can obtain any
number of optimized sample subsets with a single execution of the algorithm.

Algorithm 2. optimizeMajoritySamples
Input: Majority samples D^major, Minority samples D^minor, Internal test samples D^t
Output: Optimized sample subsets D_{p_i}^major (i = 1...L)
1: popSize = L;
2: initiateParticles(D^major, popSize);
3: for t = 1 to termination do
4: // go through each particle in the population
5: for i = 1 to popSize do
6: // extract the samples according to the indicator function set
7: D_{p_i}^major = extractSelectedSamples(p_i, D^major);
8: D_{p_i}^train = D_{p_i}^major ∪ D^minor;
9: // train a classifier using selected majority samples and all minority samples
10: h_i = trainClassifier(D_{p_i}^train);
11: // calculate the fitness of the trained classifier using internal test samples
12: fitness = calculateFitness(h_i, D^t);
13: // update velocity (Eq. (1)) and position (Eq. (2)) according to fitness value
14: v_{i,j}(t) = updateVelocity(v_{i,j}(t), fitness);
15: s_{i,j}(t) = updatePosition(s_{i,j}(t), fitness);
16: end for
17: end for
18: return D_{p_i}^major (i = 1...L)

3.2 Analysis of Behavior


We analyze the behavior of sample subset optimization using an imbalanced synthetic dataset. Each sample has two features, and both features are generated from the same distribution. Specifically, 20 samples of the majority class are generated from a normal distribution N(5, 1) and 10 samples of the minority class are generated from a normal distribution N(7, 1). In addition, 5 "outlier" samples are introduced to the dataset. They are labeled as majority class, but are generated from the normal distribution of the minority class. The class ratio of the data is 25:10.
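A minimal sketch (Python/NumPy, not the authors' code) that reproduces this synthetic setup could look as follows.

import numpy as np

rng = np.random.default_rng(0)
# 20 majority samples ~ N(5, 1), 10 minority samples ~ N(7, 1), each with two features.
X_major = rng.normal(5.0, 1.0, size=(20, 2))
X_minor = rng.normal(7.0, 1.0, size=(10, 2))
# 5 outliers drawn from the minority distribution but labeled as majority.
X_outlier = rng.normal(7.0, 1.0, size=(5, 2))

X = np.vstack([X_major, X_outlier, X_minor])
y = np.array([0] * 25 + [1] * 10)   # class ratio 25:10 (0 = majority, 1 = minority)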

[Figure: two scatter plots of Feature 1 vs. Feature 2 showing majority samples, minority samples and the linear SVM border: (a) the original dataset and (b) the dataset after optimization.]

Fig. 2. The green lines are the classification boundary created using a linear SVM with
(a) the original dataset and (b) the dataset after optimization

Figure 2(a) shows the original dataset and the resulting classification bound-
ary of a linear SVM, and Figure 2(b) shows a dataset after applying sample
subset optimization and the resulting classification boundary of a linear SVM.
Note that this is one of the optimized datasets, which is used to train one base
classifier. Our ensemble is the aggregation of multiple base classifiers trained on
multiple optimized datasets. It is evident that the class ratio is more balanced
after optimization (from 25:10 to 15:10). In addition, 3 of the 5 outlier samples
are removed, and 7 redundant majority samples which have limited effect on
the decision boundary of the linear SVM classifier are removed to correct the
imbalanced class distribution.

4 Base Classifier and Fitness Function


We select SVM as the base classifier for building the ensemble system. SVM is
routinely applied to many challenging bioinformatics problems. The design of
the fitness function is another important facet for sample subset optimization.
It determines the quality of the base classifiers, and thus the performance of the
ensemble. The following subsections describe these two components in detail.

4.1 Base Classifier of Support Vector Machine


SVM is a popular classification algorithm which has been widely used in many
bioinformatics problems. Among different kernel choices, linear SVM with a soft
margin is robust for large scale and high-dimensional dataset classification [18].
Let us denote each sample in the dataset as a vector xi (i = 1...M ) where M
is the total number of samples, and yi is the class label of sample xi . Each
component in xi is a feature xij (j = 1...N ) interpreted as the jth feature of the
ith sample, where N is the dimension of the feature space. In our case, features
could be GC-content, dinucleotide values, or other biological markers used to
characterize each sample.

A linear SVM with a soft margin is trained by solving the following optimization problem:

$$\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{M} \xi_i$$

$$\text{subject to: } y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$


where w is the weight vector, ξi are slack variables, and b is the bias. The constant
C determines the trade-off between maximizing the margin and minimizing the
amount of slack.
In this study, we utilize the implementation proposed by Hsieh et al. [19]. This
is a fast, large-scale linear SVM implementation, which is especially suited as a
base classifier for ensemble learning due to its computational efficiency.
Notice that classifiers are trained both for sample subset optimization and for
composing the ensemble. However, these two procedures are independent of each
other, and therefore the classifiers trained for sample subset optimization are
not the classifiers used in the ensemble. The purpose of the classifiers trained in
the sample subset optimization procedure is to provide fitness feedback on the
selected samples, whereas the classifiers used for composing the ensemble are trained
on the optimized sample subsets and serve as the base classifiers of the
ensemble. To maximize the specificity of the feedback, the same classification
algorithm, that is, a linear SVM, is used in both procedures.
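As an aside, the dual coordinate descent solver of Hsieh et al. [19] is the method implemented in LIBLINEAR, which scikit-learn's LinearSVC wraps, so a base classifier along these lines could be instantiated as in the sketch below (the value of C is a placeholder of ours).

```python
# Sketch: a linear soft-margin SVM base classifier via LIBLINEAR's dual solver.
from sklearn.svm import LinearSVC

def train_base_classifier(X_train, y_train, C=1.0):
    # C trades off margin maximization against the total slack in the objective above
    return LinearSVC(C=C, loss="hinge", dual=True).fit(X_train, y_train)
```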

4.2 Fitness Function

For building a classifier, a subset of samples from the majority class is selected
according to an indicator function set pi (see Section 3.1), and combined with
the samples from the minority class to form a training set D_train^{p_i}. The goodness
of an indicator function set can be assessed by the performance of the classifier
trained with the samples specified by it. For imbalanced data, one effective way
to evaluate the performance of the classifier is to use area under the ROC curve
metric [20]. Hence, we devise AU C(hi (Dptraini
, Dtest )) as a component of fitness
pi
function, where Dtrain denotes the training set generated using pi and Dtest de-
notes the test data. Function AU C() calculates the AUC value of a classification
model hi (Da , Db ) which is trained on Da and evaluated on Db .
Moreover, the size of the subset is also important because a small training set
is likely to result in a poorly trained model with poor generalization. Therefore,
the fitness function can be constructed by combining the two components:

$$fitness(p_i) = w_1 \cdot AUC(h_i(D_{train}^{p_i}, D_{test})) + w_2 \cdot Size(p_i) \qquad (4)$$
where Size() determines the size of a subset (specified by pi ). Coefficients w1 and
w2 are empirical constants which can be adjusted to alter the relative importance
of each fitness component. The default values are w1 = 0.8 and w2 = 0.2 as they
work well in a range of datasets.
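A minimal sketch of Eq. (4) follows. The paper does not spell out how Size() is normalized; here we assume it is the selected fraction of majority samples so that both terms lie in [0, 1], and the helper names are ours.

```python
# Sketch of the fitness function in Eq. (4); defaults follow w1 = 0.8, w2 = 0.2.
import numpy as np
from sklearn.metrics import roc_auc_score

def fitness(clf, indicator, X_test, y_test, w1=0.8, w2=0.2):
    auc = roc_auc_score(y_test, clf.decision_function(X_test))
    size = np.asarray(indicator).sum() / len(indicator)   # assumed normalization of Size(p_i)
    return w1 * auc + w2 * size
```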

5 Experimental Results
In this section, we first describe four imbalanced biological datasets used in our
experiment. They are generated from several important and diverse biological
problems and represent different degrees of imbalanced class distribution. Next
we present the performance results of our ensemble algorithm compared with six
other algorithms using those datasets.

5.1 Datasets
We evaluated different algorithms using datasets generated for identification of
miRNA, classification of protein localization sites, and prediction of promoter
(drosophila and human). Specifically, the miRNA identification dataset contains
691 positive samples and 9248 negative samples, which is described by 21 fea-
tures [21]. The protein localization dataset is generated from the study discussed
in [22]. We attempted to differentiate membrane proteins (258) from the rest
(1226). The human promoter dataset contains 471 promoter sequences and 5131
coding sequences (CDS) and intron sequences. Compared to the human pro-
moter dataset, the drosophila promoter dataset has a relatively balanced class
distribution with 1936 promoter sequences and 2722 CDS and intron sequences.
We calculated the 16 dinucleotide features according to [23].
The datasets are summarized and organized according to class ratio in
Table 1.

Table 1. Summary of biological datasets used for evaluation

Dataset (short name) # Sample # Features Minority vs. Majority


drosophila promoter (DroProm) 6594 16 0.4156 (≈ 1:2.5)
protein localization (ProtLoc) 1484 8 0.2104 (≈ 1:5)
human promoter (HuProm) 5602 16 0.0918 (≈ 1:10)
miRNA identification (miRNA) 9939 21 0.0747 (≈ 1:13)

5.2 Performance Comparison


The performance of the single classifier of SVM was used as the baseline for all
datasets. We compared the single classifier approaches including random under-
sampling with SVM (RUS-SVM), random over-sampling with SVM (ROS-SVM),
SMOTE sampling with SVM (SMOTE-SVM), and the ensemble approaches
including boosting with base classifiers of SVM (Boost-SVMs), bagging with base
classifiers of SVM (Bag-SVMs), and our sample subset optimization technique
with SVM (SSO-SVMs).
For the ensemble methods, we tested the ensemble size from 10 to 100 with a
step of 10. A 5-fold cross-validation procedure was applied to partition datasets
for training and testing, and each algorithm was tested on the same parti-
tion to reduce evaluation variance. Among the six tested algorithms, four of
them employed the randomization procedure. They are RUS-SVM, ROS-SVM,
Bag-SVMs, and SSO-SVMs (note that the Boost-SVMs algorithm uses the

[Figure 3: AUC versus number of base classifiers for each algorithm (SSO-SVMs, Bag-SVMs, Boost-SVMs, Single-SVM, ROS-SVM, RUS-SVM, SMOTE-SVM). Panels: (a) drosophila promoter, (b) protein localization, (c) human promoter, (d) miRNA identification.]

Fig. 3. The comparison of different algorithms for data classification. The x-axis de-
notes the ensemble sizes and the y-axis denotes the AUC value. For those algorithms
that use a single classifier, the same AUC value is plotted on different ensemble sizes
for the purpose of comparison.

reweighting implementation and is deterministic). For those with the randomiza-


tion procedure, we repeated the test 10 times, each time with a different random
seed.
Figure 3 shows the results comparison. It can be seen that in most cases en-
semble approaches give higher AUC values than the single classifier approaches.
For single classifier approaches, random under-sampling, random over-sampling,
and SMOTE sampling do improve the classification results when the analyzed
dataset has a highly imbalanced class distribution such as the cases in Figure
3(b)(c)(d). However, the improvements become less significant when the imbal-
ance is moderate (drosophila promoter dataset in Figure 3(a)). SMOTE sampling
performs better than random under-sampling and over-sampling approaches in
the case of protein localization (Figure 3(b)). However, the performance gain
is marginal in the other three datasets (Figure 3(a)(c)(d)). We do not observe
a significant performance difference between random under-sampling and

Table 2. The comparison of different algorithms for data classification according to
AUC value. The values for the ensemble approaches are averaged across different ensemble
sizes.

Algorithm DroProm ProtLoc HuProm miRNA


Single-SVM 0.6584 0.8296 0.5740 0.7542
RUS-SVM 0.6584 0.8850 0.6016 0.7644
ROS-SVM 0.6555 0.8866 0.5986 0.8114
SMOTE-SVM 0.6400 0.8976 0.5961 0.7924
Boost-SVMs 0.7756 0.8852 0.6644 0.8891
Bag-SVMs 0.8507 0.8671 0.7264 0.9198
SSO-SVMs 0.8520 0.9098 0.7718 0.9419

Table 3. P-values from a one-tailed Student's t-test comparing the performance differences

Algorithm DroProm ProtLoc HuProm miRNA


SSO-SVMs vs. Single-SVM 2 × 10−15 4 × 10−18 1 × 10−11 1 × 10−14
SSO-SVMs vs. RUS-SVM 2 × 10−15 1 × 10−13 4 × 10−11 2 × 10−14
SSO-SVMs vs. ROS-SVM 2 × 10−15 2 × 10−13 4 × 10−11 3 × 10−13
SSO-SVMs vs. SMOTE-SVM 8 × 10−16 8 × 10−11 3 × 10−11 9 × 10−14
SSO-SVMs vs. Boost-SVMs 2 × 10−8 8 × 10−7 7 × 10−6 2 × 10−5
SSO-SVMs vs. Bag-SVMs 6 × 10−4 7 × 10−11 1 × 10−6 2 × 10−3

random over-sampling, except in the case of miRNA identification (Figure 3(d))


where random over-sampling is relatively better than random under-sampling.
For the ensemble approaches, Boost-SVMs surprisingly performs worse than the
other two approaches in most cases, and its performance fluctuates across dif-
ferent ensemble sizes. This may be caused by its training process: the
boosting algorithm assigns increasingly more classification weight to the most
“difficult” samples in each iteration. However, those “difficult” samples could be
outliers and cause a deleterious effect when the classifiers pay too much at-
tention to classifying them while ignoring other, more representative samples.
In this regard, Bag-SVMs and SSO-SVMs appear to be the better approaches.
However, SSO-SVMs almost always performs the best and generates a much
smaller performance variance when different random seeds are used.
It is likely that SSO-SVMs captures the most representative samples from
the training set, which gives better generalization on unseen data.
We also observe that the improvement is more significant when the dataset has
a highly imbalanced class distribution (Figure 3(b)(c)(d)).
Table 2 shows the AUC values of both single classifier and ensemble ap-
proaches. For the ensemble approaches, the AUC value is the average of those
given by the ensemble sizes from 10 to 100. The proposed SSO-SVMs performs
the best in all four tested datasets. Compared with the baseline of a single SVM,
these results represent improvements of 10%-20%. To confirm that the
improvements are statistically significant, we applied a one-tailed Student's t-test
and compared SSO-SVMs with the other six methods. Table 3 shows the p-
values of the comparisons. In all four datasets, the performance of SSO-SVMs is

significantly better than the other six methods, with a p-value smaller than 0.05.
Therefore, we confirmed the effectiveness of the proposed ensemble approach.

6 Conclusion
In this paper we introduced a sample subset optimization technique for sampling
optimal sample subsets from training data. We integrated this technique in an
ensemble learning framework and created an ensemble of SVMs specifically for
imbalanced biological data classification. The proposed algorithm was applied to
several bioinformatics tasks with moderate and highly imbalanced class distribu-
tions. According to our experimental results, (1) the approaches based on data
sampling for a single SVM are generally less effective compared to the ensemble
approaches; (2) the proposed sample subset optimization technique appears to
be very effective and the ensemble optimized by this technique produced the
best classification results in terms of AUC value for all evaluation datasets.

References
1. Meyer, I.M.: A practical guide to the art of RNA gene prediction. Briefings in
Bioinformatics 8(6), 396–414 (2007)
2. Zeng, J., Zhu, S., Yan, H.: Towards accurate human promoter recognition: a re-
view of currently used sequence features and classification methods. Briefings in
Bioinformatics 10(5), 498–508 (2009)
3. Sonnenburg, S., Schweikert, G., Philips, P., Behr, J., Rätsch, G.: Accurate splice
site prediction using support vector machines. BMC Bioinformatics 8(suppl. 10),
7 (2007)
4. Hua, S., Sun, Z.: Support vector machine approach for protein subcellular local-
ization prediction. Bioinformatics 17(8), 721–728 (2001)
5. Akbani, R., Kwek, S., Japkowicz, N.: Applying support vector machines to imbal-
anced datasets. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.)
ECML 2004. LNCS (LNAI), vol. 3201, pp. 39–50. Springer, Heidelberg (2004)
6. Liu, Y., An, A., Huang, X.: Boosting prediction accuracy on imbalanced datasets
with SVM ensembles. In: Ng, W.-K., Kitsuregawa, M., Li, J., Chang, K. (eds.)
PAKDD 2006. LNCS (LNAI), vol. 3918, pp. 107–118. Springer, Heidelberg (2006)
7. Japkowicz, N., Stephen, S.: The class imbalance problem: A systematic study.
Intelligent Data Analysis 6(5), 429–449 (2002)
8. Batuwita, R., Palade, V.: A New Performance Measure for Class Imbalance Learn-
ing. Application to Bioinformatics Problems. In: 2009 International Conference on
Machine Learning and Applications, pp. 545–550. IEEE, Los Alamitos (2009)
9. Chawla, N., Japkowicz, N., Kotcz, A.: Editorial: special issue on learning from
imbalanced data sets. ACM SIGKDD Explorations Newsletter 6, 1–6 (2004)
10. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE: synthetic minority
over-sampling technique. Journal of Artificial Intelligence Research 16(1), 321–357
(2002)
11. Weiss, G.M.: Mining with rarity: a unifying framework. ACM SIGKDD Explo-
rations Newsletter 6(1), 7–19 (2004)
12. Hido, S., Kashima, H., Takahashi, Y.: Roughly balanced bagging for imbalanced
data. Statistical Analysis and Data Mining 2(5-6), 412–426 (2009)

13. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996)
14. Schapire, R.E., Freund, Y., Bartlett, P., Lee, W.S.: Boosting the margin: A new
explanation for the effectiveness of voting methods. The Annals of Statistics 26(5),
1651–1686 (1998)
15. Tax, D., Van Breukelen, M., Duin, R.: Combining multiple classifiers by averaging
or by multiplying? Pattern Recognition 33(9), 1475–1485 (2000)
16. Lam, L., Suen, S.Y.: Application of majority voting to pattern recognition: an
analysis of its behavior and performance. IEEE Transactions on Systems, Man,
and Cybernetics, Part A: Systems and Humans 27(5), 553–568 (1997)
17. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization. Swarm Intelli-
gence 1(1), 33–57 (2007)
18. Ben-Hur, A., Ong, C.S., Sonnenburg, S., Schölkopf, B., Rätsch, G.: Support vec-
tor machines and kernels for computational biology. PLoS Computational Biol-
ogy 4(10) (2008)
19. Hsieh, C., Chang, K., Lin, C., Keerthi, S., Sundararajan, S.: A dual coordinate de-
scent method for large-scale linear SVM. In: Proceedings of the 25th International
Conference on Machine Learning, pp. 408–415. ACM, New York (2008)
20. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8),
861–874 (2006)
21. Batuwita, R., Palade, V.: microPred: effective classification of pre-miRNAs for
human miRNA gene prediction. Bioinformatics 25(8), 989–995 (2009)
22. Horton, P., Nakai, K.: A probabilistic classification system for predicting the cellu-
lar localization sites of proteins. In: Proceedings of the Fourth International Con-
ference on Intelligent Systems for Molecular Biology, pp. 109–115. AAAI Press,
Menlo Park (1996)
23. Rani, T.S., Bhavani, S.D., Bapi, R.S.: Analysis of E. coli promoter recognition
problem in dinucleotide feature space. Bioinformatics 23(5), 582–588 (2007)
Class Confidence Weighted k NN Algorithms for
Imbalanced Data Sets

Wei Liu and Sanjay Chawla

School of Information Technologies, University of Sydney


{wei.liu,sanjay.chawla}@sydney.edu.au

Abstract. In this paper, a novel k -nearest neighbors (k NN) weighting


strategy is proposed for handling the problem of class imbalance. When
dealing with highly imbalanced data, a salient drawback of existing kNN
algorithms is that the class with more frequent samples tends to dominate
the neighborhood of a test instance in spite of distance measurements,
which leads to suboptimal classification performance on the minority
class. To solve this problem, we propose CCW (class confidence weights)
that uses the probability of attribute values given class labels to weight
prototypes in kNN. The main advantage of CCW is that it is able to
correct the inherent bias to majority class in existing k NN algorithms
on any distance measurement. Theoretical analysis and comprehensive
experiments confirm our claims.

1 Introduction

A data set is “imbalanced” if its dependent variable is categorical and the number
of instances in one class is different from those in the other class. Learning
from imbalanced data sets has been identified as one of the 10 most challenging
problems in data mining research [1].
In the literature of solving class imbalance problems, data-oriented meth-
ods use sampling techniques to over-sample instances in the minor class or
under-sample those in the major class, so that the resulting data is balanced.
A typical example is the SMOTE method [2] which increases the number of
minor class instances by creating synthetic samples. It has been recently pro-
posed that using different weight degrees on the synthetic samples (so-called
safe-level-SMOTE [3]) produces better accuracy than SMOTE. The focus of
algorithm-oriented methods has been on extensions and modifications of ex-
isting classification algorithms so that they can be more effective in dealing
with imbalanced data. For example, modifications of decision tree algorithms
have been proposed to improve the standard C4.5, such as HDDT [4] and
CCPDT [5].
K NN algorithms have been identified as one of the top ten most influential
data mining algorithms [6] for their ability of producing simple but powerful

The first author of this paper acknowledges the financial support of the Capital
Markets CRC.


classifiers. The k neighbors that are the closest to a test instance are conven-
tionally called prototypes. In this paper we use the concepts of “prototypes” and
“instances” interchangeably.
There are several advanced k NN methods proposed in the recent literature.
Weinberger et al. [7] learned Mahanalobis distance matrices for k NN classifica-
tion by using semidefinite programming, a method which they call large margin
nearest neighbor (LMNN) classification. Experimental results of LMNN show
large improvements over conventional k NN and SVM. Min et al. [8] have pro-
posed DNet which uses a non-linear feature mapping method pre-trained with
Restricted Boltzmann Machines to achieve the goal of large-margin k NN classi-
fication. Recently, a new method WDk NN was introduced in [9] which discovers
optimal weights for each instance in training phase which are taken into ac-
count during test phases. This method is demonstrated superior to other k NN
algorithm including LPD [10], PW [11], A-NN [12] and WDNN [13].
In this paper, the model we propose is an algorithm-oriented method and
we preserve all original information/distribution of the training data sets. More
specifically, the contributions of this paper are as follows:
1. We express the mechanism of traditional k NN algorithms as equivalent to
using only local prior probabilities to predict instances’ labels, from which
perspective we illustrate why many existing k NN algorithms have undesir-
able performance on imbalanced data sets;
2. We propose CCW (class confidence weights), the confidence (likelihood) of a
prototype’s attribute values given its class label, which transforms prior
probabilities to posterior probabilities. We demonstrate that this trans-
formation makes the k NN classification rule analogous to using a likelihood
ratio test in the neighborhood;
3. We propose two methods, mixture modeling and Bayesian networks, to effi-
ciently estimate the value of CCW;
The rest of the paper is structured as follows. In Section 2 we review existing
k NN algorithms and explain why they are flawed in learning from imbalanced
data. We define CCW weighting strategy and justify its effectiveness in Section
3. CCW is estimated in Section 4. Section 5 reports experiments and Section 6
concludes the paper.

2 Existing k NN Classifiers
Given labeled training data (xi , yi ) (i = 1,...,n), where xi ∈ Rd are feature
vectors, d is the number of features and yi ∈ {c1 , c2 } are binary class labels,
k NN algorithm finds a group of k prototypes from the training set that are
the closest to a test instance xt by a certain distance measure (e.g. Euclidean
distances), and estimates the test instance’s label according to the predominance
of a class in this neighborhood. When there is no weighting (NW) strategy, this
majority voting mechanism can be expressed as:

$$\text{NW:}\quad y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \qquad (1)$$

where yt is a predicted label, I(·) is an indicator function that returns 1 if


its condition is true and 0 otherwise, and φ(xt ) denotes the set of k training
instances (prototypes) closest to xt . When the k neighbors vary widely in their
distances and closer neighbors are more reliable, the neighbors are weighted by
the multiplicative-inverse (MI) or the additive-inverse (AI) of their distances:
$$\text{MI:}\quad y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \cdot \frac{1}{dist(x_t, x_i)} \qquad (2)$$

$$\text{AI:}\quad y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \cdot \left(1 - \frac{dist(x_t, x_i)}{dist_{max}}\right) \qquad (3)$$

where dist(x_t, x_i) represents the distance between the test point x_t and a pro-
totype x_i, and dist_max is the maximum possible distance between two training
instances in the feature space, which normalizes dist(x_t, x_i)/dist_max to the range [0,1].
While MI and AI solve the problem of large distance variance among k neigh-
bors, their effects become insignificant if the neighborhood of a test point is
considerably dense, and one of the classes (or both) is over-represented by
its samples – since in this scenario all of the k neighbors are close to the test
point and the difference among their distances is not discriminative [9].
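For reference, the three voting rules of Eqs. (1)-(3) can be sketched as follows; the function and variable names are ours, the inputs are assumed to be NumPy arrays over the k nearest prototypes, and the small constant guarding against zero distances is our own addition.

```python
# A minimal sketch of the voting rules in Eqs. (1)-(3).
import numpy as np

def knn_vote(dists, labels, scheme="NW", dist_max=None):
    """dists, labels: distances and class labels of the k nearest prototypes of x_t."""
    if scheme == "NW":
        weights = np.ones_like(dists)                 # unweighted majority vote, Eq. (1)
    elif scheme == "MI":
        weights = 1.0 / (dists + 1e-12)               # multiplicative inverse, Eq. (2)
    else:  # "AI"
        weights = 1.0 - dists / dist_max              # additive inverse, Eq. (3)
    votes = {c: weights[labels == c].sum() for c in np.unique(labels)}
    return max(votes, key=votes.get)
```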

2.1 Handling Imbalanced Data


Given the definition of the conventional k NN algorithm, we now explain its
drawback in dealing with imbalanced data sets. The majority voting in Eq. 1
can be rewritten as the following equivalent maximization problem:

yt = arg max I(yi = c)
c∈{c1 ,c2 }
xi ∈φ(xt )
 
⇒ max { I(yi = c1 ), I(yi = c2 ) }
xi ∈φ(xt ) xi ∈φ(xt ) (4)
 
xi ∈φ(xt ) I(yi = c1 ) xi ∈φ(xt ) I(yi = c2 )
= max { , }
k k
= max { pt (c1 ), pt (c2 ) }

where pt (c1 ) and pt (c2 ) represent the proportion of class c1 and c2 appearing
in φ(xt ) – the k -neighborhood of xt . If we integrate this k NN classification rule
into Bayes’s theorem, treat φ(xt ) as the sample space and treat pt (c1 ) and pt (c2 )
as priors 1 of two classes in this sample space, Eq. 4 intuitively illustrates that
the classification mechanism of k NN is based on finding the class label that has
a higher prior value.
This suggests that traditional k NN uses only the prior information to estimate
class labels, which has suboptimal classification performance on the minority
class when the data set is highly imbalanced. Suppose c1 is the dominating
class label, it is expected that the inequality pt (c1 )  pt (c2 ) holds true in most
1 We note that p_t(c_1) and p_t(c_2) are conditioned (on x_t) in the sample space of the
overall training data, but unconditioned in the sample space of φ(x_t).

[Figure 1: scatter plots of the synthetic data. Panels: (a) balanced data, full view; (b) balanced data, regional view; (c) imbalanced data, full view; (d) imbalanced data, regional view.]

Fig. 1. Performance of conventional k NN (k = 5) on synthetic data. When data is


balanced, all misclassifications of circular points are made on the upper left side of an
optimal linear classification boundary; but when data is imbalanced, misclassifications
of circular points appear on both sides of the boundary.

regions of the feature space. Especially in the overlap regions of two class labels,
k NN always tends to be biased towards c1 . Moreover, because the dominating
class is likely to be over-represented in the overlap regions, “distance weighting”
strategies such as WI and AI are ineffective in correcting this bias.
Figure 1 shows an example where k NN is performed by using Euclidean dis-
tance measure for k = 5. Samples of positive and negative classes are generated
from Gaussian distributions with means [μ_1^pos, μ_2^pos] = [6, 3] and [μ_1^neg, μ_2^neg] =
[3, 6] respectively and a common standard deviation I (the identity matrix).
The (blue) triangles are samples of the negative/majority class, the (red) un-
filled circles are those of the positive/minority class, and the (green) filled circles
indicate the positive samples incorrectly classified by the conventional k NN al-
gorithm. The straight line in the middle of two clusters suggests a classification
boundary built by an ideal linear classifier. Figure 1(a) and 1(c) give global
overall views of k NN classifications, while Figure 1(b) and 1(d) are their corre-
sponding “zoom-in” subspaces that focus on a particular misclassified positive
sample. Imbalanced data is sampled under the class ratio of Pos:Neg = 1:10.

As we can see from Figure 1(a) and 1(b), when data is balanced all of the
misclassified positive samples are on the upper left side of the classification
boundary, and are always surrounded by only negative samples. But when data is
imbalanced (Figure 1(c) and 1(d)), misclassifications of positives appear on both
sides of the boundary. This is because the negative class is over-represented and
dominates much larger regions than the positive class. The incorrectly classified
positive point in Figure 1(d) is surrounded by 4 negative and 1 positive neighbors,
with a negative neighbor being the closest prototype to the test point. In this
scenario, distance weighting strategies (e.g. MI and AI) cannot help to
correct the bias towards the negative class. In the next section, we introduce CCW and
explain how it can solve such problems and correct the bias.

3 CCW Weighted k NN
To improve the existing k NN rule, we introduce CCW to capture the probability
(confidence) of attributes values given a class label. We define CCW on a training
instance i as follows:
$$w_i^{CCW} = p(x_i | y_i), \qquad (5)$$

where x_i and y_i represent the attribute vector and the class label of instance i.
Then the resulting classification rule integrated with CCW is:

$$\text{CCW:}\quad y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \cdot w_i^{CCW}, \qquad (6)$$

and by applying it to the distance weighting schemes MI and AI we obtain:

$$\text{CCW}_{MI}:\quad y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \cdot \frac{1}{dist(x_t, x_i)} \cdot p(x_i | y_i) \qquad (7)$$

$$\text{CCW}_{AI}:\quad y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \cdot \left(1 - \frac{dist(x_t, x_i)}{dist_{max}}\right) \cdot p(x_i | y_i) \qquad (8)$$
With the integration of CCW, the maximization problem in Eq. 4 becomes:

$$y_t = \arg\max_{c \in \{c_1, c_2\}} \sum_{x_i \in \phi(x_t)} I(y_i = c) \cdot p(x_i | y_i) \;\Rightarrow\; \max\Big\{ \sum_{x_i \in \phi(x_t)} \frac{I(y_i = c_1)}{k}\, p(x_i | y_i = c_1),\; \sum_{x_i \in \phi(x_t)} \frac{I(y_i = c_2)}{k}\, p(x_i | y_i = c_2) \Big\}$$

$$= \max\{\, p_t(c_1)\, p(x_i | y_i = c_1)_{x_i \in \phi(x_t)},\; p_t(c_2)\, p(x_i | y_i = c_2)_{x_i \in \phi(x_t)} \,\} = \max\{\, p_t(x_i, c_1)_{x_i \in \phi(x_t)},\; p_t(x_i, c_2)_{x_i \in \phi(x_t)} \,\} = \max\{\, p_t(c_1 | x_i)_{x_i \in \phi(x_t)},\; p_t(c_2 | x_i)_{x_i \in \phi(x_t)} \,\} \qquad (9)$$
where pt (c|xi )xi ∈φ(xt ) represents the probability of xt belonging to class c given
the attribute values of all prototypes in φ(xt ). Comparisons between Eq. 4 and
Eq. 9 demonstrate that the use of CCW changes the bases of k NN rule from using
priors to posteriors: while conventional k NN directly uses the proba-
bilities (proportions) of class labels among the k prototypes, we use
conditional probabilities of classes given the values of the k prototypes’

feature vectors. The change from priors to posteriors is easy to understand


since CCW behaves just like the notion of likelihood in Bayes’ theorem.
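A sketch of the resulting CCW-weighted rules CCW_MI and CCW_AI (Eqs. (7)-(8)) is given below; it assumes the CCW estimates p(x_i | y_i) of the k prototypes have already been computed (Section 4), the inputs are NumPy arrays, and the names are ours.

```python
# Sketch of the CCW-weighted voting rules in Eqs. (7)-(8).
import numpy as np

def ccw_knn_vote(dists, labels, ccw, scheme="MI", dist_max=None):
    """ccw[i] holds an estimate of p(x_i | y_i) for the i-th nearest prototype."""
    if scheme == "MI":
        weights = ccw / (dists + 1e-12)               # Eq. (7)
    else:  # "AI"
        weights = ccw * (1.0 - dists / dist_max)      # Eq. (8)
    votes = {c: weights[labels == c].sum() for c in np.unique(labels)}
    return max(votes, key=votes.get)
```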

3.1 Justification of CCW


Since CCW is equivalent to the notion of likelihood in Bayes’ theorem, in this
subsection we demonstrate how the rationale of using CCW-based k NN rule can
be interpreted by likelihood ratio tests.
We assume c1 is the majority class and define the null hypothesis (H0 ) as “xt
belonging to c1 ”, and the alternative hypothesis (H1 ) as “xt belonging to c2 ”.
Assume among φ(xt ), the first j neighbors are from c1 and the other k − j ones
are from c2 . We obtain the likelihood of H0 (L0 ) and H1 (L1 ) from:


$$L_0 = \prod_{i=1}^{j} p(x_i | y_i = c_1)_{x_i \in \phi(x_t)}, \qquad L_1 = \prod_{i=j+1}^{k} p(x_i | y_i = c_2)_{x_i \in \phi(x_t)}$$

Then the likelihood ratio test statistic can be written as:

$$\Lambda = \frac{L_0}{L_1} = \frac{\prod_{i=1}^{j} p(x_i | y_i = c_1)_{x_i \in \phi(x_t)}}{\prod_{i=j+1}^{k} p(x_i | y_i = c_2)_{x_i \in \phi(x_t)}} \qquad (10)$$

Note that the numerator and the denominator in the fraction of Eq. 10 corre-
spond to the two terms of the maximization problem in Eq. 9. It is essential
to ensure the majority class does not have higher priority than the minority in
imbalanced data, so we choose “Λ = 1” as the rejection threshold. Then the
mechanism of using Eq. 9 as the k NN classification rule is equivalent to “predict
xt to be c2 when Λ ≤ 1” (reject H0 ), and “predict xt to be c1 when Λ > 1” (do
not reject H0 ).
Example 1. We reuse the example in Figure 1. The size of triangles/circles is
proportional to their CCW weights: the larger the size of a triangle/circle, the
greater the weight of that instance, and the smaller the size, the lower the weight. In
Figure 1(d), the misclassified positive instance has four negative-class neighbors
with CCW weights 0.0245, 0.0173, 0.0171 and 0.0139, and has one positive-class
neighbor of weight 0.1691. Then the total negative-class weight is 0.0728 and the
total positive-class weight is 0.1691, and the CCW ratio is 0.0728/0.1691 < 1, which gives
a label prediction to the positive (minority) class. So even though the closest
prototype to the test instance comes from the wrong class which also dominates
the test instance’s neighborhood, a CCW weighted k NN can still correctly classify
this actual positive test instances.

4 Estimations of CCW Weights


In this section we briefly introduce how we employ mixture modeling and
Bayesian networks to estimate CCW weights.

4.1 Mixture Models


In the formulation of mixture models, the training data is assumed to follow a q-
component finite mixture distribution with probability density function (pdf):

$$p(x|\theta) = \sum_{m=1}^{q} \alpha_m\, p(x|\theta_m) \qquad (11)$$

where x is a sample of training data whose pdf is required, α_m represents the mix-
ing probabilities, θ_m defines the mth component, and θ ≡ {θ_1, ..., θ_q, α_1, ..., α_q}
is the complete set of parameters specifying the mixture model. Given training
data Ω, the log-likelihood of a q-component mixture distribution is:

$$\log p(\Omega|\theta) = \log \prod_{i=1}^{n} p(x_i|\theta) = \sum_{i=1}^{n} \log \sum_{m=1}^{q} \alpha_m\, p(x_i|\theta_m).$$

The maximum likelihood (ML) estimate θ_ML = arg max_θ log p(Ω|θ) cannot be found analytically.
We therefore use the expectation-maximization (EM) algorithm to solve the ML problem and then apply the esti-
mated θ to Eq. 11 to find the pdf of all instances in the training data set as their
corresponding CCW weights.
Example 2. We reuse the example in Figure 1, but now we assume the underlying
distribution parameters (i.e. the mean vectors and covariance matrices) that generate the
two classes of data are unknown. We apply training samples into ML estimation,
solve for θ by EM algorithm, and then use Eq. 11 to estimate the pdf of training
instances which are used as their CCW weights. The estimated weights (and their
effects) of the neighbors of the originally misclassified positive sample in Figure
1(d) are shown in Example 1.
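One way to realize this estimate in practice is sketched below: a Gaussian mixture is fitted to each class with EM and its density is evaluated at every training point, which is consistent with w_i = p(x_i | y_i). The per-class fitting, the number of components q, and the use of scikit-learn's GaussianMixture are our assumptions.

```python
# Sketch: mixture-model estimation of the CCW weights w_i = p(x_i | y_i).
import numpy as np
from sklearn.mixture import GaussianMixture

def ccw_weights_mixture(X, y, q=2, seed=0):
    weights = np.empty(len(X))
    for c in np.unique(y):
        idx = (y == c)
        gm = GaussianMixture(n_components=q, random_state=seed).fit(X[idx])  # EM fit
        weights[idx] = np.exp(gm.score_samples(X[idx]))                       # pdf values of Eq. (11)
    return weights
```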

4.2 Bayesian Networks


While mixture modeling deals with numerical features, Bayesian networks can
be used to estimate CCW when feature values are categorical. The task of learning
a Bayesian network is to (i) build a directed acyclic graph (DAG) over Ω, and
(ii) learn a set of (conditional) probability tables {p(ω|pa(ω)), ω ∈ Ω} where
pa(ω) represents the set of parents of ω in the DAG. From these conditional
distributions one can recover the joint probability distribution over Ω by using
$p(\Omega) = \prod_{i=1}^{d+1} p(\omega_i | pa(\omega_i))$.
In brief, we learn and build the structure of the DAG by employing the K2 al-
gorithm [14], which in the worst case has an overall time complexity of O(n^2),
one “n” for the number of features and another “n” for the number of train-
ing instances. Then we estimate the conditional probability tables directly from
training data. After obtaining the joint distribution p(Ω), the CCW weight of a
training instance i can be easily obtained from $w_i^{CCW} = p(x_i | y_i) \propto \frac{p(\Omega)}{p(y_i)}$, where
p(y_i) is the proportion of class y_i in the entire training data.
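The following sketch illustrates this computation for categorical data under the assumption that the DAG has already been learned (e.g. with K2) and is supplied as a parent map; the conditional probability tables are estimated by raw frequency counts, and all names are illustrative.

```python
# Sketch: computing w_i = p(x_i | y_i) ∝ p(Ω)/p(y_i) for categorical records,
# given a parent map describing the learned DAG.
from collections import Counter

def ccw_weights_bayes_net(records, class_attr, parents):
    """records: list of dicts (attribute -> value), including the class attribute.
    parents: dict mapping every attribute (class included) to its list of parents."""
    def cond_prob(r, attr):
        pa = tuple(r[p] for p in parents[attr])
        match_pa = [s for s in records if tuple(s[p] for p in parents[attr]) == pa]
        match_both = [s for s in match_pa if s[attr] == r[attr]]
        return len(match_both) / len(match_pa)
    class_freq = Counter(r[class_attr] for r in records)
    weights = []
    for r in records:
        p_joint = 1.0
        for attr in parents:                          # p(Ω) = Π p(ω_i | pa(ω_i))
            p_joint *= cond_prob(r, attr)
        weights.append(p_joint * len(records) / class_freq[r[class_attr]])  # ∝ p(Ω)/p(y_i)
    return weights
```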

5 Experiments and Analysis


In this section, we analyze and compare the performance of CCW-based k NN
against existing k NN algorithms, other algorithm-oriented state of the art

Table 1. Details of imbalanced data sets and comparisons of kNN algorithms on


weighting strategies for k = 1

Name #Inst #Att MinClass CovVar | Area Under Precision-Recall Curve: NW, MI, CCWMI, AI, CCWAI, WDkNN
KDDCup'09 (footnote 7):
Appetency 50000 278 1.76% 4653.2 .022(4) .021(5) .028(2) .021(5) .035(1) .023(3)
Churn 50000 278 7.16% 3669.5 .077(3) .069(5) .077(2) .069(5) .093(1) .074(4)
Upselling 50000 278 8.12% 3506.9 .116(6) .124(4) .169(2) .124(4) .169(1) .166(3)
Agnostic-vs-Prior (footnote 8):
Ada.agnostic 4562 48 24.81% 1157.5 .441(6) .442(4) .520(2) .442(4) .609(1) .518(3)
Ada.prior 4562 15 24.81% 1157.5 .443(4) .433(5) .518(3) .433(5) .606(1) .552(2)
Sylva.agnostic 14395 213 6.15% 11069.1 .672(6) .745(4) .790(2) .745(4) .797(1) .774(3)
Sylva.prior 14395 108 6.15% 11069.1 .853(6) .906(4) .941(2) .906(4) .945(1) .907(3)
StatLib (footnote 9):
BrazilTourism 412 9 3.88% 350.4 .064(6) .111(4) .132(2) .111(4) .187(1) .123(3)
Marketing 364 33 8.52% 250.5 .106(6) .118(4) .152(1) .118(4) .152(2) .128(3)
Backache 180 33 13.89% 93.8 .196(6) .254(4) .318(2) .254(4) .319(1) .307(3)
BioMed 209 9 35.89% 16.6 .776(6) .831(4) .874(2) .831(4) .887(1) .872(3)
Schizo 340 15 47.94% 0.5 .562(4) .534(5) .578(3) .534(5) .599(1) .586(2)
Text Mining [15]:
Fbis 2463 2001 1.54% 2313.3 .082(6) .107(4) .119(2) .107(4) .117(3) .124(1)
Re0 1504 2887 0.73% 1460.3 .423(6) .503(5) .561(2) .503(4) .563(1) .559(3)
Re1 1657 3759 0.78% 1605.4 .360(1) .315(5) .346(2) .315(5) .346(2) .335(4)
Tr12 313 5805 9.27% 207.7 .450(6) .491(4) .498(1) .491(3) .490(5) .497(2)
Tr23 204 5833 5.39% 162.3 .098(6) .122(4) .136(1) .122(4) .128(3) .134(2)
UCI [16]:
Arrhythmia 452 263 2.88% 401.5 .083(6) .114(4) .145(2) .114(4) .136(3) .159(1)
Balance 625 5 7.84% 444.3 .064(1) .063(4) .063(4) .064(2) .064(3) .061(6)
Cleveland 303 14 45.54% 2.4 .714(6) .754(4) .831(2) .754(4) .846(1) .760(3)
Cmc 1473 10 22.61% 442.1 .299(6) .303(5) .318(2) .305(4) .357(1) .315(3)
Credit 690 16 44.49% 8.3 .746(6) .751(4) .846(2) .751(4) .867(1) .791(3)
Ecoli 336 8 5.95% 260.7 .681(4) .669(5) .743(2) .669(5) .78(1) .707(3)
German 1000 21 30.0% 160.0 .407(6) .427(4) .503(2) .427(4) .509(1) .492(3)
Heart 270 14 44.44% 3.3 .696(6) .758(4) .818(2) .758(4) .826(1) .790(3)
Hepatitis 155 20 20.65% 53.4 .397(6) .430(4) .555(2) .430(4) .569(1) .531(3)
Hungarian 294 13 36.05% 22.8 .640(6) .659(4) .781(2) .659(4) .815(1) .681(3)
Ionosphere 351 34 35.9% 27.9 .785(6) .874(5) .903(2) .884(3) .911(1) .882(4)
Ipums 7019 60 0.81% 6792.8 .056(6) .062(4) .087(1) .062(5) .087(2) .078(3)
Pima 768 9 34.9% 70.1 .505(6) .508(4) .587(2) .508(4) .618(1) .533(3)
Primary 339 18 4.13% 285.3 .168(6) .222(4) .265(1) .217(5) .224(3) .246(2)
Average Rank 5.18 4.18 1.93 4.03 1.53 2.84
Friedman Tests  7E-7  8E-6 Base  2E-5 –  4E-5
Friedman Tests  3E-6  2E-6 –  9E-6 Base  2E-4

approaches (i.e. WDkNN, LMNN, DNet, CCPDT and HDDT; see footnotes 2-6) and data-
oriented methods (i.e. safe-level-SMOTE). We note that since WDkNN has been
demonstrated (in [9]) to be better than LPD, PW, A-NN and WDNN, in our exper-
iments we include only WDkNN among them. CCPDT and
HDDT are pruned by Fisher’s exact test (as recommended in [5]). All experi-
ments are carried out using 5×2-fold cross-validation, and the final results are
the average of the repeated runs.

2 We implement CCW-based kNNs and WDkNN inside the Weka environment [17].
3 The code is obtained from www.cse.wustl.edu/~kilian/Downloads/LMNN.html
4 The code is obtained from www.cs.toronto.edu/~cuty/DNetkNN_code.zip
5 The code is obtained from www.cs.usyd.edu.au/~weiliu/CCPDT_src.zip
6 The code is obtained from www.nd.edu/~dial/software/hddt.tar.gz

Table 2. Performance of k NN weighting strategies when k = 11


Datasets | Area Under Precision-Recall Curve: MI, CCWMI, AI, CCWAI, SMOTE, WDkNN, LMNN, DNet, CCPDT, HDDT
Appetency .033(8) .037(4) .036(6) .043(1) .040(3) .036(5) .035(7) .042(2) .024(10) .025(9)
Churn .101(7) .113(2) .101(6) .115(1) .108(4) .100(8) .107(5) .111(3) .092(10) .099(9)
Upselling .219(8) .243(5) .218(9) .241(6) .288(3) .212(10) .231(7) .264(4) .443(1) .437(2)
Ada.agnostic .641(9) .654(5) .646(8) .652(6) .689(3) .636(10) .648(7) .670(4) .723(1) .691(2)
Ada.prior .645(8) .669(2) .654(7) .668(3) .661(5) .639(9) .657(6) .664(4) .682(1) .605(10)
Sylva.agnostic .930(2) .926(8) .930(3) .925(9) .928(6) .922(10) .928(4) .926(7) .934(1) .928(5)
Sylva.prior .965(4) .965(2) .965(6) .965(4) .904(10) .974(1) .965(3) .935(9) .946(8) .954(7)
BrazilTourism .176(9) .242(1) .232(5) .241(2) .233(4) .184(8) .209(6) .237(3) .152(10) .199(7)
Marketing .112(10) .157(2) .113(9) .161(1) .124(8) .150(3) .134(5) .142(4) .130(6) .125(7)
Backache .311(7) .325(3) .307(8) .328(2) .317(6) .330(1) .318(5) .322(4) .227(9) .154(10)
BioMed .884(5) .885(3) .858(7) .844(8) .910(2) .911(1) .884(4) .877(6) .780(10) .812(9)
Schizo .632(6) .632(4) .626(7) .617(8) .561(10) .663(3) .632(5) .589(9) .807(2) .846(1)
Fbis .134(10) .145(5) .135(9) .141(6) .341(3) .136(8) .140(7) .241(4) .363(2) .384(1)
Re0 .715(3) .717(1) .705(5) .709(4) .695(7) .683(8) .716(2) .702(6) .573(9) .540(10)
Re1 .423(7) .484(1) .434(6) .475(4) .479(2) .343(8) .454(5) .477(3) .274(9) .274(9)
Tr12 .628(6) .631(4) .624(7) .601(8) .585(10) .735(3) .629(5) .593(9) .946(1) .946(1)
Tr23 .127(8) .156(3) .123(10) .156(3) .124(9) .128(7) .141(5) .140(6) .619(2) .699(1)
Arrhythmia .160(7) .214(4) .167(6) .229(3) .083(10) .134(9) .187(5) .156(8) .346(2) .385(1)
Balance .127(7) .130(5) .145(2) .149(1) .135(4) .091(9) .129(6) .142(3) .092(8) .089(10)
Cleveland .889(8) .897(2) .890(6) .897(1) .889(7) .895(3) .893(5) .893(4) .806(10) .846(9)
Cmc .346(9) .383(2) .357(7) .384(1) .358(6) .341(10) .365(5) .371(4) .356(8) .380(3)
Credit .888(7) .895(2) .887(8) .894(3) .891(5) .903(1) .891(6) .893(4) .871(9) .868(10)
Ecoli .943(3) .948(1) .938(5) .941(4) .926(7) .920(8) .945(2) .933(6) .566(10) .584(9)
German .535(7) .541(2) .533(8) .537(4) .536(6) .561(1) .538(3) .537(5) .493(9) .464(10)
Heart .873(7) .876(4) .873(8) .876(5) .878(2) .883(1) .875(6) .877(3) .828(9) .784(10)
Hepatitis .628(6) .646(1) .630(5) .645(2) .625(8) .626(7) .637(3) .635(4) .458(9) .413(10)
Hungarian .825(5) .832(1) .823(7) .831(2) .819(8) .826(4) .829(3) .825(6) .815(9) .767(10)
Ionosphere .919(4) .919(2) .916(7) .918(5) .916(7) .956(1) .919(3) .917(6) .894(9) .891(10)
Ipums .123(8) .138(4) .123(7) .140(2) .136(5) .170(1) .130(6) .138(3) .037(9) .020(10)
Pima .645(7) .667(1) .644(8) .665(2) .657(4) .655(6) .656(5) .661(3) .587(10) .613(9)
Primary .308(5) .314(2) .271(8) .279(7) .310(4) .347(1) .311(3) .294(6) .170(10) .183(9)
Average Rank 6.5 2.78 6.59 3.71 5.59 5.18 4.68 4.78 6.68 6.9
Friedman 2E-7 Base 1E-6 – 0.1060 0.002 2E-7 0.007 0.019 0.007
Friedman 0.011 – 4E-5 Base 0.1060 0.007 0.007 0.048 0.019 0.007

We select 31 data sets from KDDCup’097 , agnostic vs. prior competition8 ,


StatLib9 , text mining [15], and UCI repository [16]. For multiple-label data sets,
we keep the smallest label as the positive class, and combine all the other labels
as the negative class. Details of the data sets are shown in Table 1. Besides the
proportion of the minor class in a data set, we also present the coefficient of
variation (CovVar) [18] to measure imbalance. CovVar is defined as the ratio of
the standard deviation and the mean of the class counts in data sets.
The metric of AUC-PR (area under the precision-recall curve) has been reported
in [19] to be better than AUC-ROC (area under the ROC curve) on imbalanced data. A
curve dominates in ROC space if and only if it dominates in PR space, and clas-
sifiers that are superior in terms of AUC-PR are necessarily superior in
terms of AUC-ROC, but not vice versa [19]. Hence we use the more informative
metric of AUC-PR for classifier comparisons.

7 http://www.kddcup-orange.com/data.php
8 http://www.agnostic.inf.ethz.ch
9 http://lib.stat.cmu.edu/

[Figure 2: area under the PR curve for MI versus CCW-weighted MI across the data sets. Panels: (a) Manhattan (k=1), (b) Euclidean (k=1), (c) Chebyshev (k=1), (d) Manhattan (k=11), (e) Euclidean (k=11), (f) Chebyshev (k=11).]

Fig. 2. Classification improvements from CCW on Manhattan distance (ℓ1 norm), Euclidean distance (ℓ2 norm) and Chebyshev distance (ℓ∞ norm)

5.1 Comparisons among NN Algorithms


In this experiment we compare CCW with existing kNN algorithms using Euclidean
distance with k = 1. When k = 1, all kNN variants that use the same distance
measure make exactly the same prediction on a test instance. However, the CCW
weights generate different probabilities of being positive/negative for each
test instance, and hence produce different AUC-PR values.
While there are various ways to compare classifiers across multiple data sets,
we adopt the strategy proposed by [20] that evaluates classifiers by ranks. In
Table 1 the k NN classifiers in comparison are ranked on each data set by the
value of their AUC-PR, with a rank of 1 being the best. We perform Friedman
tests on the sequences of ranks between different classifiers. In Friedman tests,
p-values lower than 0.05 reject, with 95% confidence, the hypothesis that
the ranks of the classifiers in comparison are not statistically different. Numbers in
parentheses of Table 1 are the ranks of classifiers on each data set, and a  sign
in Friedman tests suggests classifiers in comparison are significantly different.
As we can see, both CCWMI and CCWAI (the “Base” classifiers) are significantly
better than existing methods of NW, MI, AI and WDk NN.

5.2 Comparisons among k NN Algorithms

In this experiment, we compare k NN algorithms on k > 1. Without losing gen-


erality, we set a common value of k = 11 for all kNN classifiers. As shown in Ta-
ble 2, both CCWMI and CCWAI significantly outperform MI, AI, WDkNN, LMNN,
DNet, CCPDT and HDDT.
In the comparison with over-sampling techniques, we focus on MI equipped
with safe-level-SMOTE [3] method, shown as “SMOTE” in Table 2. The results
we obtained from the CCW classifiers are comparable to (better than, but not
significantly so, at 95% confidence) the over-sampling technique. This observation sug-
gests that by using CCW one can obtain results comparable to the cutting-edge
sampling technique, so the extra computational cost of data sampling before
training can be saved.

5.3 Effects of Distance Metrics

While in all previous experiments k NN classifiers are performed under Euclidean


distance (ℓ2 norm), in this subsection we provide empirical results that demon-
strate the superiority of CCW methods on other distance metrics such as Manhat-
tan distance (ℓ1 norm) and Chebyshev distance (ℓ∞ norm). Due to page limits,
here we only present the comparisons of “CCWMI vs. MI”. As we can see from
Figure 2, CCWMI can improve MI on all three distance metrics.

6 Conclusions and Future Work


The main focus of this paper is on improving existing kNN algorithms and making
them robust to imbalanced data sets. We have shown that conventional kNN
algorithms are akin to using only the prior probabilities of the neighborhood of a
test instance to estimate its class label, which leads to suboptimal performance
when dealing with imbalanced data sets.
We have proposed CCW, the likelihood of attribute values given a class label, to
weight prototypes before they are taken into account. The use of CCW transforms the
original k NN rule of using prior probabilities to their corresponding posteriors.
We have shown that this transformation has the ability of correcting the inherent
bias towards majority class in existing k NN algorithms.
We have applied two methods (mixture modeling and Bayesian networks)
to estimate training instances’ CCW weights, and their effectiveness is confirmed
by synthetic examples and comprehensive experiments. When learning Bayesian
networks, we construct network structures by applying the K2 algorithm which
has an overall time complexity of O(n2 ).
In future our plan is to extend the idea of CCW to multiple-label classification
problems. We also plan to explore the use of CCW on other supervised learning
algorithms such as support vector machines etc.

References
1. Yang, Q., Wu, X.: 10 challenging problems in data mining research. International
Journal of Information Technology and Decision Making 5(4), 597–604 (2006)
2. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.: SMOTE. Journal of Artificial
Intelligence Research 16(1), 321–357 (2002)
3. Bunkhumpornpat, C., Sinapiromsaran, K., Lursinsap, C.: Safe-Level-SMOTE. In:
Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.-B. (eds.) PAKDD 2009.
LNCS, vol. 5476, pp. 475–482. Springer, Heidelberg (2009)
4. Cieslak, D., Chawla, N.: Learning Decision Trees for Unbalanced Data. In: Daele-
mans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part I. LNCS
(LNAI), vol. 5211, pp. 241–256. Springer, Heidelberg (2008)
5. Liu, W., Chawla, S., Cieslak, D., Chawla, N.: A Robust Decision Tree Algorithms
for Imbalanced Data Sets. In: Proceedings of the Tenth SIAM International Con-
ference on Data Mining, pp. 766–777 (2010)
6. Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan,
G., Ng, A., Liu, B., Yu, P., et al.: Top 10 algorithms in data mining. Knowledge
and Information Systems 14(1), 1–37 (2008)
7. Weinberger, K., Saul, L.: Distance metric learning for large margin nearest neigh-
bour classification. The Journal of Machine Learning Research 10, 207–244 (2009)
8. Min, R., Stanley, D.A., Yuan, Z., Bonner, A., Zhang, Z.: A deep non-linear feature
mapping for large-margin knn classification. In: Proceedings of the 2009 Ninth
IEEE International Conference on Data Mining, pp. 357–366 (2009)
9. Yang, T., Cao, L., Zhang, C.: A Novel Prototype Reduction Method for the K-Nearest
Neighbor Algrithms with K ≥ 1. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V.
(eds.) PAKDD 2010. LNCS, vol. 6119, pp. 89–100. Springer, Heidelberg (2010)
10. Paredes, R., Vidal, E.: Learning prototypes and distances. Pattern Recogni-
tion 39(2), 180–188 (2006)
11. Paredes, R., Vidal, E.: Learning weighted metrics to minimize nearest-neighbor
classification error. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 1100–1110 (2006)
12. Wang, J., Neskovic, P., Cooper, L.: Improving nearest neighbor rule with a simple
adaptive distance measure. Pattern Recognition Letters 28(2), 207–213 (2007)
13. Jahromi, M.Z., Parvinnia, E., John, R.: A method of learning weighted similar-
ity function to improve the performance of nearest neighbor. Information Sci-
ences 179(17), 2964–2973 (2009)
14. Cooper, G., Herskovits, E.: A Bayesian method for the induction of probablistic
networks from data. Machine Learning 9(4), 309–347 (1992)
15. Han, E., Karypis, G.: Centroid-based document classification. In: Zighed, D.A.,
Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI), vol. 1910, pp.
116–123. Springer, Heidelberg (2000)
16. Asuncion, A., Newman, D.: UCI Machine Learning Repository (2007)
17. Witten, I., Frank, E.: Data mining: practical machine learning tools and techniques
with Java implementations. ACM SIGMOD Record 31(1), 76–77 (2002)
18. Hendricks, W., Robey, K.: The sampling distribution of the coefficient of variation.
The Annals of Mathematical Statistics 7(3), 129–132 (1936)
19. Davis, J., Goadrich, M.: The relationship between precision-recall and roc curves.
In: Proceedings of the 23rd International Conference on Machine Learning, pp.
233–240 (2006)
20. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. The Jour-
nal of Machine Learning Research 7, 1–30 (2006)
Multi-agent Based Classification Using
Argumentation from Experience

Maya Wardeh , Frans Coenen, Trevor Bench-Capon, and Adam Wyner

Department of Computer Science, The University of Liverpool,


Liverpool L69 3BX, UK
{maya.wardeh,coenen,tbc,A.Z.Wyner}@liverpool.ac.uk

Abstract. An approach to multi-agent classification, using an Argu-


mentation from Experience paradigm is described, whereby individual
agents argue for a given example to be classified with a particular la-
bel according to their local data. Arguments are expressed in the form
of classification rules which are generated dynamically. The advocated
argumentation process has been implemented in the PISA multi-agent
framework, which is also described. Experiments indicate that the op-
eration of PISA is comparable with other classification approaches and
that it can be utilised for Ordinal Classification and Imbalanced Class
problems.

Keywords: Classification, Argumentation, Multi-agent (Data Mining),


Classification Association Rules.

1 Introduction
Argumentation is concerned with the dialogical reasoning processes required to
arrive at a conclusion given two or more alternative viewpoints. The process of
multi-agent argumentation is conceptualised as a discussion, about some issue
that requires a solution, between a set of software agents with different points of
view; where each agent attempts to persuade the others that its point of view,
and the consequent solution, is the correct one. In this paper we propose apply-
ing argumentation to facilitate classification. In particular, it is argued that one
model of argumentation, Arguing from Experience ([24,23]), is well suited to the
classification tasks. Arguing from Experience provides a computational model of
argument based on inductive reasoning from past experience. The arguments are
constructed dynamically using Classification Association Rule Mining (CARM)
techniques. The setting is a “debate” about how to classify examples; the gen-
erated Classification Association Rules (CARs) provide reasons for and against
particular classifications.
The proposed model allows a number of agents to draw directly from past
examples to find reasons for coming to a decision about the classification of an
unseen instance. Agents formulate their arguments in the form of CARs gen-
erated from datasets of past examples. Each agent’s dataset is considered to

Corresponding author.


encapsulate that agent’s experience. The exchange of arguments between agents


represents a dialogue which continues until an agent poses an argument for a par-
ticular classification that no other agent can refute. The model has been realised
in the form of argumentation framework called PISA: Pooling Information from
Several Agents. The promoted argumentation-based approach is thus a multi-
agent classification technique [5] that offers a number of practical advantages:
(i) dynamic generation of classification rules in a just in time manner accord-
ing to the requirements of each agent, (ii) easy-to-understand explanations, in
the form of dialogues, concerning a particular classification, and (iii) applica-
tion to ordinal classification and imbalanced class problems as well as standard
classification. The approach also provides for a natural representation of agent
“experience” as a set of records, and the arguments as CARs. At the same time
the advocated approach also preserves the privacy of the information each agent
knows, therefore it can be used with sensitive data.
The rest of this paper is organised as follows. Section 2 provides an overview
of the PISA Framework. Section 3 details the nature of the Classification Asso-
ciation Rules (CARs) used in PISA. Section 4 provides details and an empirical
analysis of three different applications of PISA to classification problems: (i)
standard classification, (ii) ordinal classification and (iii) the imbalanced class
problem. Finally, we conclude with a summary of the main findings and some
suggestions for further work.

2 Argumentation-Based Multi Agent Classification: The


PISA Framework
The intuition behind PISA is to provide a method whereby agents argue about
a classification task. In effect each agent can be viewed as a dynamic classi-
fier. The overall process thus leads to a reasoned consensus obtained through
argumentation, rather than some other mechanism such as voting (e.g. [2]). It
is suggested that this dialogue process increases the acceptability of the out-
come to all parties. In this respect PISA can be said to be an ensemble-like
method. Both theoretical and empirical research (e.g. [20]) has demonstrated
that a good ensemble is one comprising individual classifiers that are relatively
accurate but make their errors on different parts of the input training set. Two
of the most-popular ensemble methods are: (i) Bagging [3] and (ii) Boosting
[14]. Both techniques rely on varying the data to obtain different training sets
for each of the classifiers in the ensemble.
PISA is viewed as a bagging-like multi-agent ensemble, whereby the dataset is
equally divided amongst a number of participants corresponding to the number
of class values in the dataset. Each participant applies the same set of algorithms
to mine CARs supporting their advocated class. To this end, each participant
can be said to correspond to a single classifier. The argumentation process by
which each participant advances moves to support its proposals corresponds to
voting methods by which ensemble techniques assign class labels to input cases.
But rather than simple voting, PISA applies an argumentation debate (dia-
logue). PISA also differs from Boosting techniques in that it does not generate a
sequence of classifiers; instead the desired classification is achieved through the
collaborative operation of several classifiers. Furthermore, PISA classifies unseen
records by (dynamically) producing a limited number of CARs sufficient to reach
a decision without the need to produce the full set of CARs.
The PISA framework comprises three key elements:

1. Participant Agents. A number of Participant Agents, at least one for each class in the discussion domain such that each advocates one possible
classification.
2. Chairperson. A neutral mediator agent which administers a variety of tasks
aimed at facilitating PISA dialogues.
3. Set of CARs. The joint set of CARs exchanged in the course of one PISA
dialogue. This set is represented by a central argument structure, called
the Argumentation Tree, maintained by the Chairperson. Participant Agents
have access to this tree and may use it to influence their choice of move. The
agents consult the tree at the beginning of each round and decide which CAR
they are going to attack. Full details of this data structure can be found in
[24]. Once a dialogue has terminated, the status of the argumentation tree
will indicate the ‘winning’ classification. Note that the dialogues produced
by PISA also explain the resulting classifications. This feature is seen as an
essential advantage offered by PISA.

Each Participant Agent has its own distinct (tabular) local dataset relating to a
classification problem (domain). These agents produce reasons for and against
classifications by mining CARs from their datasets using a number of CARM
algorithms (Section 3). The antecedent of every CAR represents a set of reasons
for believing the consequent. In other words given a CAR, P → c, this should
be read as: P are reasons to believe that the case should classify as c. CARs
are mined dynamically as required. The dynamic mining provides for four dif-
ferent types of move, each encapsulated by a distinct category of CAR. Each
Participant Agent can employ any one of the following types of move to gen-
erate arguments: (i) Proposing moves, (ii) Attacking moves, and (iii) Refining
moves. The different moves available are discussed further below. Note that each
of these moves has a set of legal next moves (see Table 1).
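To make the reading of a CAR concrete, the following is a minimal self-contained sketch (our own illustration, not part of the PISA implementation; the record layout and function name are assumptions) of how the support and confidence of a CAR P → c could be computed from a tabular dataset of past examples.

```python
# Illustrative sketch only: computing support and confidence of a CAR "P -> c"
# over a list of records, where each record is a (set_of_attributes, class) pair.
# The data layout and names below are assumptions made for this example.

def car_support_confidence(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    covered = [cls for attrs, cls in records if antecedent <= attrs]
    if not records or not covered:
        return 0.0, 0.0
    hits = sum(1 for c in covered if c == consequent)
    return hits / len(records), hits / len(covered)

# Example: two attributes support class "pos" with confidence 2/3.
data = [({"a1", "a2"}, "pos"), ({"a1", "a2"}, "pos"),
        ({"a1", "a2"}, "neg"), ({"a3"}, "neg")]
print(car_support_confidence(data, {"a1", "a2"}, "pos"))  # -> (0.5, 0.666...)
```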

Proposing Moves. There is only one kind of proposing move:


1. Propose Rule: Allows a new CAR, with a confidence higher than a given
threshold, to be proposed. All PISA dialogues commence with a Propose
Rule move.
Attacking Moves. Moves intended to show that a CAR proposed by some other
agent should not be considered decisive with respect to the current instance.
Two sub-types are available: (i) Distinguish and (ii) Counter Rule, as follows:
2. Distinguish: Allows an agent to add new attributes (premises) to a previ-
ously proposed CAR so that the confidence of the new rule is lower than the
confidence threshold, thus rendering the original classification inadmissible.
360 M. Wardeh et al.

Table 1. Legal next moves in PISA

Move Label Next Move Move Label Next Move


1 Propose Rule 2, 3 3 Counter Rule 2, 1
2 Distinguish 4, 1 4 Increase Conf 2, 3

3. Counter Rule: Similar to Propose Rule but used to cite a classification other than that advocated by the initial Propose Rule move.
Refining Moves. Moves that enable a CAR to be refined to meet a counter attack.
For the purposes of using PISA as a classifier, one refining move is implemented:
4. Increase Confidence: Allows the addition of new attribute(s) to the
premise associated with a previously proposed CAR so as to increase the
confidence of the rule, thus increasing the confidence that the case should
be classified as indicated.

3 PISA Dynamic CAR Mining


Having introduced, in the foregoing, the legal moves in PISA dialogues, the real-
isation of these moves is described in this section. The idea is to mine CARs ac-
cording to: (i) a desired minimum confidence, (ii) a specified consequent and (iii)
a set of candidate attributes for the antecedent (a subset of the attributes rep-
resented by the case under discussion). Standard CARM techniques (e.g.[7,16])
tend to generate the complete set of CARs represented in the input data. PISA
on the other hand utilises a just in time approach to CARM, directed at gen-
erating particular subsets of CARs, and applied such that each agent mines
appropriate CARs as needed. The mining process supports two different forms
of dynamic ARM request:

1. Find a subset of rules that conform to a given set of constraints.


2. Distinguish a given rule by adding additional attributes.

In order to realise the above, each Participant Agent utilises a T-tree [6] to
summarise its local dataset. A T-tree is a reverse set enumeration tree structure
where nodes are organised using reverse lexicographic ordering, which in turn
enables direct indexing according to attribute number; therefore computational
efficiency gains are achieved. A further advantage, with respect to PISA, is that
the reverse ordering dictates that each sub-tree is rooted at a particular class
attribute, and so all the attribute sets pertaining to a given class are contained
in a single T-tree branch. This means that any one of the identified dynamic
CARM requests need be directed at only one branch of the tree. This reduces
the overall processing cost compared to other prefix tree structures (such as
FP-Trees [16]). To further enhance the dynamic generation of CARs a set of
algorithms that work directly on T-trees were developed. These algorithms were
able to mine CARs satisfying different values of support threshold. At the start
of the dialogue each player has an empty T-tree and slowly builds a partial
Multi-agent Based Classification Using Argumentation from Experience 361

T-tree from their data set, as required, containing only the nodes representing
attributes from the case under discussion plus the class attribute. Note that no
node pruning, according to some user-specified threshold, takes place, except
for nodes that have zero support. Two dynamic CAR retrieval algorithms were
developed: (i) Algorithm A which finds a rule that conforms to a given set of
constraints, and (ii) Algorithm B which distinguishes a given rule by adding
additional attributes. Further details of these algorithms can be found in [25].
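To illustrate the kind of request served by Algorithm A, the following brute-force sketch (our own code; it does not use the T-tree structures of [25]) searches the attributes of the case under discussion for an antecedent whose rule meets a confidence threshold for a fixed consequent.

```python
# Illustrative brute-force sketch of dynamic CAR request (1): find a CAR whose
# antecedent is drawn from the attributes of the case under discussion, whose
# consequent is the agent's advocated class, and whose confidence meets a
# threshold. This is NOT the T-tree based Algorithm A of the paper; it simply
# enumerates antecedent subsets to illustrate the kind of request being served.
from itertools import combinations

def find_car(records, case_attributes, consequent, min_confidence):
    """records: list of (set_of_attributes, class); returns (antecedent, confidence) or None."""
    attrs = sorted(case_attributes)
    for size in range(1, len(attrs) + 1):          # prefer short antecedents
        for candidate in combinations(attrs, size):
            antecedent = set(candidate)
            covered = [cls for a, cls in records if antecedent <= a]
            if not covered:
                continue
            confidence = sum(1 for c in covered if c == consequent) / len(covered)
            if confidence >= min_confidence:
                return antecedent, confidence
    return None

data = [({"a1", "a2"}, "c1"), ({"a1"}, "c1"), ({"a2"}, "c2"), ({"a3"}, "c2")]
print(find_car(data, {"a1", "a2"}, "c1", 0.5))  # e.g. ({'a1'}, 1.0)
```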

4 Applications of PISA
Arguing from Experience enables PISA agents to undertake a number of different
tasks, mainly:
1. Multi-agent Classification: Follows the hypothesis that the described oper-
ation of PISA produces results at least comparable to those obtained using traditional classification paradigms.
2. Ordinal Classification: Follows the hypothesis that PISA can be successfully
applied to datasets with ordered-classes, using a simple agreement strategy.
3. Classifying imbalanced data using dynamic coalitions: Follows the hypothesis
that dynamic coalitions between a number of participant agents, representing
rare classes, improve the performance of PISA with imbalanced multi-class
datasets.
In this section the above applications of PISA are empirically evaluated. For the
evaluation we used a number of real-world datasets drawn from the UCI reposi-
tory [4]. Where appropriate continuous values were discretised into ranges. The
chosen datasets (Table 2) display a variety of characteristics with respect to
number of records (R), number of classes (C) and number of attributes (A). Im-
portantly, they include a diverse number of class labels, distributed in a different
manner in each dataset (balanced and unbalanced), thus providing the desired
variation in the experience assigned to individual PISA participants.

4.1 Application 1: PISA-Based Classification


The first application of PISA is in the context of multi-agent classification based
on argumentation. In order to provide an empirical assessment of this application
we ran a series of experiments designed to evaluate the hypothesis that PISA pro-
duces results at least comparable to those obtained using traditional classification paradigms, in particular ensemble classification methods. The results presented
throughout this sub-section, unless otherwise noted, were obtained using Ten-
fold Cross Validation (TCV). For the purposes of running PISA, each training
dataset was equally divided among a number of Participant Agents correspond-
ing to the number of classes in the dataset. Then a number of PISA dialogues
were executed to classify the cases in the test sets (for each evaluation the confidence threshold used by each participant was 50% and the support threshold was 1%). In order to fully assess its operation, PISA was compared against a range of classification paradigms:
Table 2. Summary of data sets. Columns indicate: domain name, number of records,
number of classes, number of attributes and class distribution (approximately balanced
or not).
Name R C A Bal Name R C A Bal
Hepatitis 155 2 19(56) no Ionosphere 351 2 34(157) no
HorseColic 368 2 27(85) no Congressional Vot- 435 2 17(34) yes
ing
Cylinder Bands 540 2 39(124) yes Breast 699 2 11(20) yes
Pima (Diabetes) 768 2 9(38) yes Tic-Tac-Toe 958 2 9(29) no
Mushrooms 8124 2 23(90) yes Adult 48842 2 14(97) no
Iris 150 3 4(19) yes Waveform 5000 3 22(101) yes
Wine 178 3 13(68) yes Connect4 67557 3 42(120) no
Lymphography 148 4 18(59) no Car Evaluation 1728 4 7(25) no
Heart 303 5 22(52) no Nursery 12960 5 9(32) no
Dematology 366 6 49(49) no Annealing 898 6 38(73) no
Zoo 101 7 17(42) no Automobile (Auto) 205 7 26(137) no
Glass 214 7 10(48) no Page Blocks 5473 7 11(46) no
Ecoli 336 8 8(34) no Solar Flare 1389 9 10(39) no
Led7 3200 10 8(24) yes Pen Digits 10992 10 17(89) yes
Chess 28056 18 6(58) no

Table 3. Summary of the Ensemble Methods used. The implementation of these meth-
ods was obtained from [15]. (S=Support, RDT=Random Decision Trees)

Ensemble Technique Base Ensemble Technique Base


Bagging-C4.5 Bagging[3] C4.5 Bagging-RDT Bagging[3] RDT
(S=1%) (S=1%)
ADABoost-C4.5 ADABoost.M1 C4.5 ADABoost-RDT ADABoost.M1 RDT
. [14] (S=1%) [14] (S=1%)
MutliBoostAB- MultiBoosting C4.5 MultiBoostAB- MultiBoosting RDT
C4.5 [26] (S=1%) RDT [26] (S=1%)
DECORATE [19] C4.5
(S=1%)

1. Decision trees: Both C4.5, as implemented in [15], and the Random Decision
Tree (RDT)[8], were used.
2. CARM : The TFPC (Total From Partial Classification) algorithm [7] was
adopted because this algorithm utilises similar data structures [6] as PISA.
3. Ensemble classifiers: Table 3 summarises the techniques used. We chose to
apply Boosting and Bagging, combined with decision trees, because previous
work demonstrated that such combination is very effective (e.g. [2,20]).

For each of the included methods (and PISA) three values were calculated
for each dataset: (i) classification error rate, (ii) Balanced Error Rate (BER)
using a confusion matrix obtained from each TCV; and (iii) execution time.
Balanced Error Rates (BER) were calculated, for each dataset, as
$BER = \frac{1}{C}\sum_{i=1}^{C}\frac{F_{c_i}}{F_{c_i}+T_{c_i}}$
where $C$ is the number of classes in the dataset, $T_{c_i}$ is the number of cases which are correctly classified as class $c_i$, and $F_{c_i}$ is the number of cases which should have been classified as $c_i$ but were classified under a different class label.
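As a small illustration (our own code, not the evaluation scripts used for PISA), BER can be computed from a confusion matrix as follows; the matrix layout (rows = true classes, columns = predicted classes) is an assumption.

```python
# Illustrative sketch: Balanced Error Rate from a confusion matrix where
# rows are true classes and columns are predicted classes (our assumption).
def balanced_error_rate(confusion):
    C = len(confusion)
    per_class_error = []
    for i, row in enumerate(confusion):
        total = sum(row)                      # T_ci + F_ci for class ci
        correct = row[i]                      # T_ci
        per_class_error.append((total - correct) / total if total else 0.0)
    return sum(per_class_error) / C

# Example: a 3-class confusion matrix.
cm = [[40, 5, 5],
      [10, 80, 10],
      [ 2,  3, 45]]
print(round(balanced_error_rate(cm), 4))  # average of 0.2, 0.2 and 0.1 -> 0.1667
```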
Table 4. Test set error rate (%). Values in bold are the lowest in a given dataset.
Ensembles Decision Trees
Dataset PISA Bagging ADABoost.M1 MultiBoost TFPC
Decorate
C4.5 RDT C4.5 RDT C4.5 RDT C4.5 RDT
Hepatitis 13.33 18.06 14.84 15.48 21.29 13.55 18.71 16.13 16.13 23.23 18.00
Ionosphere 3.33 7.69 6.84 7.12 10.83 6.27 10.83 7.41 8.55 2.57 14.29
HorseColic 2.78 3.89 22.78
Congress 1.78 3.01 2.31 2.08 3.01 2.08 3.01 2.77 4.16 0.00 9.30
CylBands 15.00 42.22 27.04 42.22 34.81 42.22 34.81 39.81 42.22 36.48 30.37
Breast 3.91 5.01 4.86 4.86 4.86 4.86 4.86 5.43 4.86 5.07 10.00
Pima 14.47 27.21 25.26 25.26 23.83 25.13 24.87 25.66 26.69 16.18 25.92
TicTacToe 2.84 7.20 5.43 2.19 20.35 2.19 20.35 5.85 15.45 20.77 33.68
Mushrooms 0.41 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.06 1.05
Adult 14.49 13.09 19.19
Iris 2.67 4.67 5.33 6.00 7.33 6.00 7.33 4.67 4.00 8.00 6.00
Waveform 2.16 17.97 11.98 21.48 21.48 13.62 11.96 21.48 21.48 2.42 33.32
Wine 1.18 0.00 25.29
Connect4 5.01 4.31 34.17
Lympho 6.23 18.92 19.59 14.86 29.73 15.54 29.73 19.59 22.97 25.00 24.29
Car Eval 4.11 4.51 1.24 2.43 6.25 2.60 6.25 4.28 5.09 5.90 30.00
Heart 5.05 20.07 19.73 22.79 21.09 19.05 19.73 20.41 19.05 4.67 46.67
Page Bloc 2.24 6.93 6.93 7.02 6.93 7.02 6.93 6.93 7.02 6.94 9.95
Nursery 6.37 2.08 3.09 0.38 3.09 0.35 3.09 1.91 2.62 3.72 22.25
Dematology 4.96 4.10 3.55 3.83 15.30 3.28 15.30 1.64 6.01 5.28 25.00
Annealing 9.55 1.22 0.67 0.45 1.78 0.56 1.78 1.34 1.56 1.67 11.80
Zoo 9.90 7.92 4.95 3.96 19.80 3.96 19.80 6.93 7.92 0.00 8.00
Auto 12.00 15.12 15.61 14.15 21.46 15.61 21.46 16.10 18.05 17.00 29.00
Glass 14.69 27.10 21.50 22.43 29.91 25.23 29.91 29.91 33.18 29.91 33.81
Ecoli 5.17 13.99 15.18 16.37 24.70 14.88 24.70 13.10 15.77 8.79 37.27
Flare 6.09 2.48 3.41 3.41 3.41 3.41 3.10 3.10 2.48 8.03 14.74
Led7 12.00 24.81 24.16 24.84 24.28 24.91 24.34 24.75 24.84 24.25 31.03
Pen Digit 2.75 4.47 1.35 1.58 2.51 5.07 1.87 2.51 5.65 1.08 18.24
Chess 9.13 18.58 15.73

These three values then provided the criteria for assessing and comparing the
classification paradigms.
The results are presented in Table 4. From the table it can be seen that
PISA performs consistently well, outperforming the other association rule classifier and giving comparable results to the decision tree methods. Additionally,
PISA produced results comparable to those produced by the ensemble methods.
Moreover, PISA scored an average overall accuracy of 93.60%, higher than that
obtained from any of the other methods tested (e.g. Bagging-RDT (89.48%) and
RDT (90.24%)); these accuracies were calculated from Table 4.
Table 5 shows the BER for each of the given datasets. From the table it can
be seen that PISA produced reasonably good results overall, producing the best
result in 14 out of the 39 datasets tested.
Table 6 gives the execution times (in milliseconds) for each of the methods.
Note that PISA is not the fastest method. However, the recorded performance
is by no means the worst (for instance Decorate runs slower than PISA with
respect to the majority of the datasets). Additionally, PISA seems to run faster
than Bagging and ADABoost with some datasets.

4.2 Application 2: PISA-Based Ordinal Classification


Having established PISA as a classification paradigm, we now explore the appli-
cation of PISA to ordinal classification. In this form of multi-class classification the
set of class labels is finite and ordered, whereas traditional classification paradigms commonly assume that the class values are unordered. For many practical applica-
tions class labels do exhibit some form of order (e.g. the weather can be cold, mild,
Table 5. Test set BER (%). Values in bold are the lowest in a given dataset.
Ensembles Decision Trees
Dataset PISA Bagging ADABoost.M1 MultiBoost TFPC
Decorate
C4.5 RDT C4.5 RDT C4.5 RDT C4.5 RDT
Hepatitis 12.00 27.41 20.63 23.37 33.69 19.89 25.05 24.60 23.38 38.19 36.44
Ionosphere 4.58 7.08 6.63 6.42 11.43 5.31 11.43 7.08 8.17 2.19 13.41
HorseColic 2.80 3.71 28.63
Congress 2.35 3.43 2.66 2.27 3.19 2.27 2.27 3.05 4.69 0.00 9.71
CylBands 14.50 46.10 24.48 46.10 35.63 46.10 35.63 40.14 46.10 34.56 32.78
Breast 4.75 6.03 6.20 6.20 6.20 6.07 6.20 6.71 6.20 4.71 12.89
Pima 13.94 28.88 26.86 26.86 25.18 26.72 26.12 27.16 28.34 24.47 33.67
TicTacToe 2.14 6.71 5.35 2.25 22.46 2.25 22.46 5.25 16.98 22.94 47.44
Mushrooms 0.59 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.06 1.04
Adult 8.80 17.75 39.89
Iris 2.90 4.61 5.29 5.93 7.32 5.93 7.32 4.69 3.96 7.96 6.07
Waveform 3.93 18.00 11.99 21.51 21.51 13.64 11.97 21.51 21.51 2.39 33.35
Wine 1.42 0.00 24.05
Connect4 11.90 5.33 66.67
Lympho 15.95 30.12 9.74 25.90 43.43 38.66 43.43 39.35 35.97 47.11 16.09
Car Eval 8.21 11.25 6.77 4.79 10.43 5.29 10.43 10.24 16.55 10.67 75.00
Heart 8.25 9.16 8.97 9.93 7.98 7.85 8.97 9.43 7.97 9.82 48.02
Page Bloc 9.45 21.46 22.85 27.89 21.46 27.87 21.46 22.85 27.89 21.48 19.89
Nursery 5.47 4.14 2.28 1.05 5.82 0.76 5.78 4.55 5.98 5.74 40.10
Dematology 8.49 4.68 3.89 3.93 19.41 3.31 19.41 1.84 7.00 3.25 61.67
Annealing 16.13 6.76 3.92 2.57 4.31 3.25 4.31 7.16 6.83 4.43 33.51
Zoo 13.23 12.78 10.71 10.71 36.51 10.71 36.51 15.71 17.50 0.00 17.14
Auto 12.26 11.43 15.98 10.55 18.92 15.98 18.92 12.84 17.04 13.57 19.60
Glass 16.09 24.57 19.42 24.33 29.56 23.18 29.56 29.56 37.98 29.56 48.55
Ecoli 16.18 36.66 40.18 41.18 51.89 37.92 51.89 24.16 43.35 9.42 23.23
Flare 17.18 12.77 12.66 10.91 12.66 12.66 12.62 12.62 12.54 7.59 14.74
Led7 11.84 24.56 24.07 24.87 24.31 25.09 24.23 24.92 24.72 24.36 31.39
Pen Digit 3.47 4.48 1.51 1.59 2.23 4.92 1.89 2.23 5.57 3.73 18.38
Chess 9.63 16.38 24.53

warm and hot). Given ordered classes, one is not only concerned to maximise the
classification accuracy, but also to minimise the distances between the actual and
the predicted classes. The problem of ordinal classification is often solved by either
multi-class classification or regression methods. However, some new approaches,
tailored specifically for ordinal classification, have been introduced in the literature (e.g. [13,22]). PISA can be utilised for ordinal classification by means of biased agreement. Agents in PISA have the option to agree with CARs suggested by other agents, by not attacking these rules, even if a valid attack is possible. PISA agents can either agree with all the opponents or with a pre-defined set of opponents that matches the class order. For instance, in the weather scenario, agents supporting the decision that the weather is hot agree with those of the opinion that the weather is warm, and vice versa, whereas agents supporting that the weather is cold or mild agree with each other. We refer to the latter form of agreement by the term biased agreement, in which the agents are equipped with a simple list of the class labels that they could agree with (the agreement list). Here, we have two forms of this mode of agreement:
1. No Attack Biased Agreement (NA-BIA): Agents consult their agreement list before mining any rules from their local datasets and attempt only to attack/respond to CARs of the following shape: P → Q : ∀q ∈ Q, q ∉ agreement list.
2. Confidence Threshold Biased Agreement (CT-BIA): Here, if the agents fail to attack any CARs that contradict their agreement list, then they try to attack CARs (P → Q : ∃q ∈ Q, q ∈ agreement list) if and only if they fail to mine a matching CAR, with the same or higher confidence, from their own local dataset (P → Q′ : Q′ ⊇ Q); a small illustrative sketch of these agreement checks follows below.
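The following minimal sketch (our own illustration; the names and data structures are assumptions, not the PISA implementation) shows how an agent might decide, under NA-BIA and CT-BIA, whether a CAR proposed by another agent should be attacked.

```python
# Illustrative sketch of the biased-agreement checks. A "car" is assumed to be
# an (antecedent, consequents, confidence) triple; "mine_matching_car" stands
# in for the agent's own dynamic CAR mining and is a hypothetical helper.

def should_attack_na_bia(car_consequents, agreement_list):
    """NA-BIA: only attack CARs whose consequents are all outside the agreement list."""
    return all(q not in agreement_list for q in car_consequents)

def should_attack_ct_bia(car, agreement_list, mine_matching_car):
    """CT-BIA: a CAR overlapping the agreement list is attacked only if the agent
    cannot mine a matching CAR (same or higher confidence) from its own data."""
    antecedent, consequents, confidence = car
    if should_attack_na_bia(consequents, agreement_list):
        return True
    own_rule = mine_matching_car(antecedent, consequents, confidence)
    return own_rule is None  # no matching rule of same/higher confidence found

# Toy usage: an agent advocating "hot" that agrees with "warm".
agreement = {"warm"}
print(should_attack_na_bia({"cold"}, agreement))   # True  -> may attack
print(should_attack_na_bia({"warm"}, agreement))   # False -> agrees, no attack
```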
Table 6. Test set execution times (milliseconds). Values in bold are the lowest in a
given dataset.
cm Ensembles Decision Trees
Dataset PISA Bagging ADABoost.M1 MultiBoost TFPC
Decorate
C4.5 RDT C4.5 RDT C4.5 RDT C4.5 RDT
Hepatitis 115 110 40 190 70 200 60 610 40 60 213
Ionosphere 437 1130 210 1170 20 1210 20 4090 80 12 109
HorseColic 17 4.8 108
Congress 34 50 20 20 140 130 20 590 30 15 154
CylBands 83 110 130 40 20 40 20 1190 40 17 936
Breast 31 110 110 140 110 170 170 330 8.1 8 11
Pima 75 160 90 80 130 80 110 500 20 21 11
TicTacToe 71 80 70 250 30 280 10 620 20 6.1 61.4
Mushrooms 313 750 380 110 50 60 50 6400 80 117 630
Adult 3019 706 1279
Iris 42 40 50 60 50 50 10 110 10 13 2
Waveform 1243 1840 380 4400 830 1650 560 4730 200 102 862
Wine 136 106 163
Connect4 4710 3612 6054
Lympho 15 80 50 90 10 70 10 140 5 5 29
Car Eval 74 300 110 370 20 20 310 1580 80 24 17
Heart 343 250 80 480 20 430 10 620 20 5 183
Page Bloc 159 130 430 430 130 280 130 430 120 55 60
Nursery 965 1790 720 3130 60 3760 10 1449 110 139 204
Dematology 194 160 40 230 20 20 20 480 20 7 169
Annealing 750 1090 120 850 10 1170 10 3340 50 28 689
Zoo 43 40 10 20 10 30 10 110 10 5 85
Auto 210 440 70 320 10 350 10 520 20 5 43
Glass 180 260 120 340 10 430 10 1060 20 10 43
Ecoli 139 240 150 360 10 340 10 1510 10 3 4
Flare 239 30 20 60 40 20 20 140 10 27 23
Pen Digits 1345 2300 460 5810 820 2790 800 2300 290 80 1606
Led7 78 730 360 260 130 1150 480 3380 110 90 25
Chess 2412 334 226

Table 7. The application of PISA with datasets from Table 2 with ordered classes
Datasets ER BER MSE MAE
PISA CT-BIA NA-BIA PISA CT-BIA NA-BIA PISA CT-BIA NA-BIA PISA CT-BIA NA-BIA

Lympo 6.21 4.76 3.38 15.95 20.73 13.94 0.199 0.046 0.015 2.07 1.36 0.84
Car Eval 4.11 5.00 4.03 9.53 10.09 10.61 0.863 1.220 0.708 1.02 1.32 1.01
Page Bloc 2.67 3.64 3.91 13.43 10.42 10.06 1.250 5.164 4.757 0.49 0.78 0.83
Nursery 6.37 6.27 5.83 11.79 13.57 7.88 7.450 7.071 6.725 1.61 1.57 1.46
Dema 4.96 7.95 6.87 8.49 8.74 7.53 0.144 0.143 0.100 1.46 1.37 1.24
Zoo 9.90 7.92 6.86 13.23 14.67 12.17 0.223 0.230 0.232 2.26 2.26 1.96
Ecoli 6.03 5.52 4.34 16.81 6.72 6.91 0.008 0.008 0.005 8.23 7.92 4.63

To test the hypothesis that the above approach improves the performance of
PISA when applied to ordinal classification a series of TCV tests, using a number
of datasets from Table 2 which have ordered classes, were conducted. PISA was
run using the NA-BIA and CT-BIA strategies, and the results were compared
against the use of PISA without any agreement strategy. Additionally, to provide
better comparison the Mean Squared Error (MSE) and the Mean Absolute Error
(MAE) rates for the included datasets and methods were calculated. [11] notes
that little attention has been directed at the evaluation of ordinal classification
solutions, and that simple measures, such as accuracy, are not sufficient. In [11] a
number of evaluation metrics, for ordinal classification, are compared. As a result
MSE is suggested as the best metric when more (smaller) errors are preferred
to reduce the number of large errors; while MAE is a good metric if, overall,
fewer errors are preferred with more tolerance for large errors. Table 7 provides
a summary of the results of the experiments. From the table it can be seen that
the NA-BIA produces better results with datasets with ordinal classes.
4.3 Application 3: PISA-Based Solution to the Imbalanced Class Problem

Another application of PISA is using dynamic coalitions between different agents to produce better performance in the face of the imbalanced class problem. It has been observed (e.g. [17]) that class imbalance (i.e. a significant difference in class prior probabilities) may produce an important deterioration of the performance achieved by existing learning and classification systems. This situation is often found in real-world data describing an infrequent but important case (e.g. Table 2). There have been a number of proposed mechanisms for dealing with the class imbalance problem (e.g. [10,21]). [12,17] note a number of different approaches:

1. Changing class distributions: by “upsizing” the small class at random (or focused random), or by “downsizing” the large class at random (or focused random).
2. At the classifier level by either: manipulating classifiers internally, cost-
sensitive learning or one-class learning.
3. Specially designed ensemble learning methods.
4. Agent-based remedies such as that proposed in [18] where three agents, each
using a different classification paradigm, generate classifiers from a filtered
version of the training data. Individual predictions are then combined ac-
cording to a voting scheme. The intuition is that the models generated using
different learning biases are more likely to make errors in different ways.

In the following we present a refinement of the basic PISA model which en-
ables PISA to tackle the imbalance-class problem in multi-class datasets, using
Dynamic Coalitions between agents representing the rare classes. Unlike the bi-
ased agreement approach (Sub-section 4.2), coalition requires mutual agreement
among a number of participants, thus a preparation step is necessary. However,
for the purposes of this paper we assume that the agents representing the rare
classes are in coalition from the start of the dialogue, thus eliminating the need
for a preparatory step. The agents in a coalition stop attacking each other, and
only attack CARs placed by agents outside the coalition. The objective of such
coalition is to attempt to remove the agents representing dominant class(es) from
the dialogue, or at least for a pre-defined number of rounds. Once the agent in
question is removed from the dialogue, the coalition is dismantled and the agents
go on attacking each other as in a normal PISA dialogue. In the following we
provide experimental analysis of two coalition techniques:

1. Coalition (1): The coalition is dismantled if the agent supporting the dom-
inant class does not participate in the dialogue for two consecutive rounds.
2. Coalition (2): The coalition is dismantled if the agent supporting the dom-
inant class does not participate in the dialogue for two consecutive rounds,
and this agent is not allowed to take any further part in the dialogue.
Table 8. The application of PISA with imbalanced multi-class datasets from Table 2
Datasets ER BER G-Mean Time
PISA Coal(1) Coal(2) PISA Coal(1) Coal(2) PISA Coal(1) Coal(2) PISA Coal(1) Coal(2)

Connect4 5.02 4.18 3.78 11.90 9.68 8.70 87.47 89.96 91.00 4710 5376 5818
Lympo 6.21 5.02 4.03 15.95 11.90 14.64 69.31 82.60 92.81 15 65 55
Car Eval 4.11 3.73 4.22 9.53 7.24 4.47 79.42 88.40 92.52 74 163 158
Heart 5.05 4.95 4.95 8.25 2.54 3.17 84.44 87.67 89.97 343 531 612
Page Bloc 2.24 1.43 1.14 13.43 7.96 9.63 68.17 85.43 84.02 159 207 222
Derma 4.96 3.91 3.60 8.49 4.95 4.48 75.79 84.27 90.14 194 119 107
Annealing 9.55 4.24 4.01 16.13 7.72 4.24 63.57 86.20 91.52 750 980 881
Zoo 9.90 8.00 7.00 13.23 8.33 3.92 67.19 85.42 85.51 43 93 85
Auto 12.00 6.37 5.77 12.26 6.53 6.64 79.74 87.88 90.87 210 336 293
Glass 14.69 12.02 5.74 16.09 7.45 5.81 80.12 93.60 93.24 180 178 171
Ecoli 6.03 5.15 5.64 16.18 10.93 3.92 74.16 87.31 96.01 139 86 81
Flare 6.09 7.10 6.86 17.18 5.58 5.15 77.41 91.21 95.76 2393 2291 6267
Chess 9.13 8.47 6.28 9.63 5.91 5.82 76.70 91.26 92.22 2412 3305 3393

To test the hypothesis that the above approaches improve the performance of
PISA when applied to imbalanced class datasets we ran a series of TCV tests
using a number of datasets from Table 2, which have imbalanced class distribu-
tions. The results were compared against the use of PISA without any coalition
strategy. Four measures were used in this comparison: error rate, balanced error
rate, time and the geometric mean (g-mean), defined as $g\text{-}mean = \left(\prod_{i=1}^{C} p_{ii}\right)^{1/C}$, where $p_{ii}$ is the class accuracy of class $i$ and $C$ is the number of classes in the dataset. This last measure was used to quantify the classifier performance across the individual classes [1]. Table 8 provides the results of the above experiment. From the table it can be seen that both coalition techniques boost the performance of PISA, with imbalanced-class datasets, with very little additional cost in time, due to the time needed to dismantle the coalitions.
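As a small illustration (our own code), the g-mean can be computed from a confusion matrix in the same way as BER above; the matrix layout is again an assumption.

```python
# Illustrative sketch: geometric mean (g-mean) of per-class accuracies,
# computed here from a confusion matrix (rows = true classes, columns =
# predicted classes); the matrix layout is our assumption.
def g_mean(confusion):
    product, C = 1.0, len(confusion)
    for i, row in enumerate(confusion):
        total = sum(row)
        class_accuracy = row[i] / total if total else 0.0
        product *= class_accuracy
    return product ** (1.0 / C)

cm = [[45, 5],
      [10, 40]]
print(round(g_mean(cm), 4))  # sqrt(0.9 * 0.8) = 0.8485
```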

5 Conclusions
The PISA Arguing from Experience Framework has been described. PISA al-
lows a collection of agents to conduct a dialogue concerning the classification of
an example. The system progresses in a round-by-round manner. During each
round agents can elect to propose an argument advocating their own position
or attack another agent’s position. The arguments are mined and expressed in
the form of CARs, which are viewed as generalisations of the individual agent’s
experience. In the context of classification PISA provides for a “distributed”
classification mechanism that harnesses all the advantages offered by Multi-agent
Systems. The effectiveness of PISA is comparable with that of other classification
paradigms. Furthermore the PISA approach to classification can operate with
temporally evolving data. We have also demonstrated that PISA can be utilised
to produce better performance with imbalanced classes and ordinal classification
problems.

References
1. Alejo, R., Garcia, V., Sotoca, J., Mollineda, R., Sanchez, J.: Improving the Perfor-
mance of the RBF Neural Networks with Imbalanced Samples. In: Proc. 9th Intl.
Conf. on Artl. Neural Networks, pp. 162–169. Springer, Heidelberg (2007)


2. Bauer, E., Kohavi, R.: An empirical comparison of voting classification algorithms: Bagging, Boosting and variants. J. Machine Learning 36, 105–139 (1999)
3. Brieman, L.: Bagging predictors. J. Machine Learning 24, 123–140 (1996)
4. Blake, C.L., Merz, C.J.: UCI Repository of machine learning databases. University
of California (1998), http://www.ics.uci.edu/~mlearn/MLRepository.html
5. Cao, L., Gorodetsky, V., Mitkas, P.: Agent Mining: The Synergy of Agents and
Data Mining. IEEE Intelligent Systems 24(3), 64–72 (2009)
6. Coenen, F., Leng, P., Ahmed, S.: Data structure for association rule mining: T-trees
and p-trees. J. IEEE Trans. Knowl. Data Eng. 16(6), 774–778 (2004)
7. Coenen, F., Leng, P.: Obtaining Best Parameter Values for Accurate Classification.
In: Proc. ICDM 2005, pp. 597–600. IEEE, Los Alamitos (2005)
8. Coenen, F.: The LUCS-KDD Decision Tree Classifier Software Dept. of Computer
Science, The University of Liverpool, UK (2007),
http://www.csc.liv.ac.uk/~frans/KDD/Software/DecisionTrees/decisionTree.html
9. Dietterich, T.: Ensemble methods in machine learning. In: Kittler, J., Roli, F. (eds.)
MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000)
10. Elkan, C.: The Foundations of Cost-Sensitive Learning. In: Proc. IJCAI 2001, vol. 2,
pp. 973–978 (2001)
11. Gaudette, L., Japkowicz, N.: Evaluation Methods for Ordinal Classification. In:
Yong, G., Japkowicz, N. (eds.) AI 2009. LNCS, vol. 5549, pp. 207–210. Springer,
Heidelberg (2009)
12. Guo, X., Yin, Y., Dong, C., Zhou, G.: On the Class Imbalance Problem. In: Proc.
ICNC 2008, pp. 192–201. IEEE, Los Alamitos (2008)
13. Frank, E., Hall, M.: A simple approach to ordinal classification. In: Flach, P.A.,
De Raedt, L. (eds.) ECML 2001. LNCS (LNAI), vol. 2167, pp. 145–157. Springer,
Heidelberg (2001)
14. Freund, Y., Schapire, R.: Experiments with a new boosting algorithm. In: Proc.
ICML 1996, pp. 148–156 (1996)
15. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The
WEKA Data Mining Software: An Update. J. SIGKDD Explorations 11(1) (2009)
16. Han, J., Pei, J., Yiwen, Y.: Mining Frequent Patterns Without Candidate Gener-
ation. In: Proc. SIGMOD 2000, pp. 1–12. ACM Press, New York (2000)
17. Japkowicz, N., Stephen, S.: The Class Imbalance Problem: A systematic study. J.
Intelligent Data Analysis 6(5), 429–449 (2002)
18. Kotsiantis, S., Pintelas, P.: Mixture of Expert Agents for Handling Imbalanced
Data Sets. Annals of Mathematics, Computing & TeleInformatics 1, 46–55 (2003)
19. Melville, P., Mooney, R.: Constructing Diverse Classifier Ensembles Using Artificial
Training Examples. In: Proc. IJCAI 2003, pp. 505–510 (2003)
20. Opitz, D., Maclin, R.: Popular Ensemble Methods: An Empirical Study. J. Artif.
Intell. Research 11, 169–198 (1999)
21. Philippe, L., Lallich, S., Do, T., Pham, N.: A comparison of different off-centered
entropies to deal with class imbalance for decision trees. In: Washio, T., Suzuki,
E., Ting, K.M., Inokuchi, A. (eds.) PAKDD 2008. LNCS (LNAI), vol. 5012, pp.
634–643. Springer, Heidelberg (2008)
22. Blaszczynski, J., Slowinski, R., Szelag, M.: Probabilistic Rough Set Approaches to
Ordinal Classification with Monotonicity Constraints. In: Hüllermeier, E., Kruse,
R., Hoffmann, F. (eds.) IPMU 2010. LNCS, vol. 6178, pp. 99–108. Springer, Hei-
delberg (2010)
23. Wardeh, M., Bench-Capon, T., Coenen, F.: Multi-Party Argument from Experi-
ence. In: McBurney, P., Rahwan, I., Parsons, S., Maudet, N. (eds.) ArgMAS 2009.
LNCS, vol. 6057, Springer, Heidelberg (2010)
24. Wardeh, M., Bench-Capon, T., Coenen, F.: Arguments from Experience: The
PADUA Protocol. In: Proc. COMMA 2008, Toulouse, France, pp. 405–416. IOS
Press, Amsterdam (2008)
25. Wardeh, M., Bench-Capon, T., Coenen, F.: Dynamic Rule Mining for Argumenta-
tion Based Systems. In: Proc. 27th SGAI Intl. Conf. on AI (AI 2007), pp. 65–78.
Springer, London (2007)
26. Webb, G.: MultiBoosting: A Technique for Combining Boosting and Wagging. J.
Machine Learning 40(2), 159–196 (2000)
Agent-Based Subspace Clustering

Chao Luo1 , Yanchang Zhao2 , Dan Luo1 , Chengqi Zhang1 , and Wei Cao3
1
Data Sciences and Knowledge Discovery Lab
Centre for Quantum Computation and Intelligent Systems
Faculty of Engineering & IT, University of Technology, Sydney, Australia
{chaoluo,dluo,chengqi}@it.uts.edu.au
2
Data Mining Team, Centrelink, Australia
yanchang.zhao@centrelink.gov.au
3
Hefei University of Technology, China
caowei880428@163.com

Abstract. This paper presents an agent-based algorithm for discovering subspace clusters in high dimensional data. Each data object is repre-
sented by an agent, and the agents move from one local environment to
another to find optimal clusters in subspaces. Heuristic rules and objec-
tive functions are defined to guide the movements of agents, so that sim-
ilar agents(data objects) go to one group. The experimental results show
that our proposed agent-based subspace clustering algorithm performs
better than existing subspace clustering methods on both F1 measure
and Entropy. The running time of our algorithm is scalable with the size
and dimensionality of data. Furthermore, an application in stock market
surveillance demonstrates its effectiveness in real world applications.

1 Introduction
As an extension of traditional full-dimensional clustering, subspace clustering
seeks to find clusters in subspaces in high-dimensional data. Subspace clustering
approaches can provide fast search in different subspaces, so as to find clusters
hidden in subspaces of a full dimensional space.
The interpretability of the results is highly desirable in data mining appli-
cations. As a basic approach, the clustering results should be easily utilized by other methods, such as visualization techniques. In the last decade, subspace clustering has been researched widely. However, there are still some issues in this area. Some subspace clustering methods, such as CLIQUE [3], produce only overlapping clusterings, where one data point can belong to several clusters. This makes the clusters fail to provide a clear description of the data. In addition, most subspace clustering methods generate low quality clusters.
In order to obtain high quality subspace clustering, we design a model of Agent-based Clustering on Subspaces (ACS). By simulating the actions and interactions of autonomous agents with a view to assessing their effects on the system as a whole, agent-based subspace clustering can result in far more complex
and interesting clustering. The clusters obtained can provide a natural descrip-
tion of data. Agent-based subspace clustering is a powerful clustering modeling
technique that can be applied to real-business problems.

This paper is organized as follows. Section 2 gives the background and re-
lated work of subspace clustering. Section 3 presents our model of agent-based
subspace clustering. The experimental results and evaluation are given in Sec-
tion 4. An application on market manipulation is also provided in Section 4. We
conclude the research in Section 5.

2 Background and Related Works


The goal of clustering is to group a given set of data points into clusters such that
all data within a cluster are similar to each other. However, with the increase of
dimensionality, traditional clustering methods are questioned. For example, the
traditional distance measures fail to differentiate the nearest neighbor from the
farthest point in very high-dimensional spaces. Subspace clustering was introduced
in order to solve the problem and identify the similarity of data points in subspaces.
According to search strategy, there are two main approaches to subspace
clustering: bottom-up search methods and top-down search methods.
The bottom-up search methods, such as CLIQUE[3], ENCLUS [5] and DOC[9],
take advantage of the downward closure property of density to reduce the search
space by using an APRIORI style approach. Candidate high-dimensional sub-
spaces are generated from low dimensional subspaces which contain dense units.
The searching stops when no candidate subspaces are generated. The top-down
subspace clustering approaches, such as PROCLUS [1], FINDIT and COSA,
start by finding an initial approximation of the clusters with equal weight on the full set of dimensions. Then each dimension is assigned a weight for each cluster. The updated weights are then used in the next iteration to update the clusters. Most top-down methods use sampling techniques to improve their performance.
The CLIQUE algorithm [3] is one of the first subspace clustering algorithms.
The algorithm combines density and grid based clustering. In CLIQUE, grid
cells are defined by a fixed grid splitting each dimension into ξ equal-width cells. Dense units are those cells with more than τ data points. CLIQUE uses an apriori-style search technique to find dense units. A cluster in CLIQUE is defined as a connection of dense units. The hyperrectangular clusters are then defined by a Disjunctive Normal Form (DNF) expression. Clusters generated by CLIQUE may be found in overlapping subspaces, which means that each data point can belong to more than one cluster.
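The grid-and-density idea can be illustrated with the following brute-force sketch (our own code, not the original CLIQUE implementation; here τ is treated as an absolute count of points per unit, and the candidate generation omits the subset-pruning optimisations of the real algorithm).

```python
# Illustrative sketch: finding dense units in the spirit of CLIQUE. Grid cells
# are formed by splitting every dimension into `xi` equal-width intervals; a
# unit is "dense" if it holds more than `tau` points. Candidate k-dimensional
# units are built from dense (k-1)-dimensional ones, mirroring the APRIORI-style
# bottom-up search described above.
from itertools import combinations

def discretise(data, xi):
    """Map each value to an interval index in [0, xi-1], per dimension."""
    n_dims = len(data[0])
    lows  = [min(row[d] for row in data) for d in range(n_dims)]
    highs = [max(row[d] for row in data) for d in range(n_dims)]
    def cell(value, d):
        width = (highs[d] - lows[d]) / xi or 1.0
        return min(int((value - lows[d]) / width), xi - 1)
    return [tuple(cell(row[d], d) for d in range(n_dims)) for row in data]

def dense_units(data, xi, tau):
    """Return dense units as {dimension tuple: {unit cells -> support}}."""
    grid = discretise(data, xi)
    n_dims = len(data[0])
    dense = {}
    for d in range(n_dims):                       # 1-dimensional dense units
        counts = {}
        for row in grid:
            counts[(row[d],)] = counts.get((row[d],), 0) + 1
        kept = {u: c for u, c in counts.items() if c > tau}
        if kept:
            dense[(d,)] = kept
    k = 2
    while True:                                   # grow to higher dimensions
        prev_dims = [dims for dims in dense if len(dims) == k - 1]
        cand_dims = {tuple(sorted(set(a) | set(b)))
                     for a in prev_dims for b in prev_dims
                     if len(set(a) | set(b)) == k}
        new_found = False
        for dims in cand_dims:
            counts = {}
            for row in grid:
                unit = tuple(row[d] for d in dims)
                counts[unit] = counts.get(unit, 0) + 1
            kept = {u: c for u, c in counts.items() if c > tau}
            if kept:
                dense[dims] = kept
                new_found = True
        if not new_found:
            break
        k += 1
    return dense

data = [[1, 14], [2, 12], [1, 23], [25, 4], [23, 2]]
print(dense_units(data, xi=3, tau=1))
```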
In order to obtain an effective hard clustering where clusters do not overlap with each other, we utilize agent-based modeling to find subspace clusters. The agent-based clustering approach is able to provide a natural description of a data clustering. Flexibility is another advantage of agent-based clustering. There are two main categories of agent-based clustering methods: multi-agent clustering and biologically inspired clustering.
Ogston et al. presented a method of clustering within a fully decentralized
multi-agent system [8]. In their system, each object is considered as an agent and
agents try to form groups among themselves. Agents are connected in a random
network and the agents search in a peer-to-peer fashion for other similar agents.
Usually, the network is complex, which limits the use of this approach.
Biologically inspired clustering algorithms have been studied by many re-
searchers. Ant colonies, flocks of birds, swarms of bees, etc., are agent-based
insect models that have been used in many applications [10]. In these meth-
ods, the bio-inspired agents can change their environment locally. They have the
ability to accomplish the tasks that can not be achieved by a single agent.
Although many agent-based clustering have been proposed, there is no work
reported yet on subspace clustering in agent-based approach. In the next section,
we show how agent-based methods can be applied to subspace clustering.

3 Agent-Based Subspace Clustering


3.1 Problem Statement
Let S = {S1 , S2 , . . . , Sd } be a set of dimensions, and S = S1 × S2 . . . × Sd a
d-dimensional numerical or categorical space. The input consists of a set of d-
dimensional points V = {v1 , v2 , . . . , vm }, where vi = (vi1 , vi2 , . . . , vid ). vij , the jth component of vi , is drawn from dimension Sj .
Table 1 is a simple example of data. The columns are the dimensions S = {S1 , S2 , . . . , S6 }, and the rows are the data points V = {v1 , v2 , v3 , v4 , v5 }. In this example, a numerical data set is used to show the discretization process. After discretization, clustering on numerical data is similar to clustering on categorical data.
Table 1. A simple example of data points
     S1  S2  S3  S4  S5  S6
v1    1  14  23  12   4  21
v2    2  12  22  13  13   4
v3    1  23  23  11  12   2
v4   25   4  14  13  11   2
v5   23   2  12   1  23   2

Table 2. Data after discretization
     S1  S2  S3  S4  S5  S6
v1    1   2   3   2   1   3
v2    1   2   3   2   2   1
v3    1   2   3   2   2   1
v4    3   1   2   2   2   1
v5    3   1   2   1   3   1

The expected output is a hard clustering C = {C1 , C2 , . . . , Ck }. C is a partitioning of the input data set V . C1 , C2 , . . . , Ck are disjoint clusters, and Σi |Ci | = m. Let Ci .dimensions stand for the dimensions of cluster Ci , with Ci .dimensions ⊆ S, ∀Ci ∈ C.
Now, the question is what is the best clustering C? Based on the goal of hard
clustering, a larger cluster size |Ci | and a higher dimensionality |Ci .dimensions|
are preferred. However, there is a conflict between them. For example, assume
that one cluster C1 has |C1 | = 500 and |C1 .dimensions| = 2 while another cluster C2 has |C2 | = 100 and |C2 .dimensions| = 10. Which cluster is “better” or preferred? In order to balance the two preferred choices, we define a measure
to evaluate the quality of clustering with respect to both cluster size |Ci | and
dimensionality |Ci .dimensions| as follows:

$M(C) = \sum_{C_i \in C} (|C_i|)^2 \times |C_i.dimensions| \qquad (1)$
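The following minimal sketch (our own code; the cluster representation is an assumption) shows the measure of Equation (1) and the trade-off it balances.

```python
# Illustrative sketch: the clustering quality measure M(C) of Equation (1).
# A cluster is represented here (our assumption) as a dict with the objects it
# contains and the subspace dimensions it is defined on.
def clustering_quality(clusters):
    return sum(len(c["objects"]) ** 2 * len(c["dimensions"]) for c in clusters)

# The trade-off discussed above: a large low-dimensional cluster vs a smaller
# higher-dimensional one.
C1 = {"objects": list(range(500)), "dimensions": ["S1", "S2"]}
C2 = {"objects": list(range(100)), "dimensions": [f"S{i}" for i in range(1, 11)]}
print(clustering_quality([C1]))  # 500^2 * 2  = 500000
print(clustering_quality([C2]))  # 100^2 * 10 = 100000
```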
The clusters C with optimized M (C) will have a large data size and a large
dimensionality at the same time.
3.2 Agent-Based Subspace Clustering
In this section, we describe the design of our agent-based subspace clustering ap-
proach and explain how to implement the tasks of subspace clustering defined above.
Firstly, we briefly present the model of agent-based subspace clustering. In the agent-based model, there is a set of agents and each agent represents a data point. The agents move from one local environment to another local environment. We call these local environments bins. The movement of agents is guided by heuristic rules. There is a global environment to determine when agents stop moving. In this way, an optimized clustering is obtained as a whole. To sum up, the complex subspace clustering is achieved by using the simple behaviors of agents under the guidance of heuristic rules [4,6].
The key components of agent-based subspace clustering are: agents, the local environment bins and the global environment. In order to explain the details of
the agent-based subspace clustering model, we take the data set in Table 1 as a
simple example.
– Let A = {a1 , a2 , . . . , am } represent agents. For example, agents
A = {a1 , a2 , a3 , a4 , a5 } represent data in Table 1.
– Let B = {B1 , B2 , . . . , Bn } be a set of bins. The bins B are the local en-
vironment of agents. Therefore each bin Bi (Bi ∈ B) contains a number of
agents. We refer to Bi .agents as the agents contained by Bi . For each agent aj in Bi .agents, we say agent aj belongs to bin Bi . Bin Bi has a property
Bi .dimensions, which denotes the subspace under which Bi .agents are sim-
ilar to each other.
In this model, we choose CLIQUE as the method to generate bins B. The first
step of CLIQUE is to discretize the numeric values into intervals. Table 2 shows the agents after discretization with ξ = 3, which is the number of levels in each dimension. The intervals on different dimensions form units. CLIQUE firstly finds all dense units as the basic elements of clusters. Then the connected dense units are treated as final clusters. Figure 1 is an example of the result of CLIQUE with τ = 0.8 and ξ = 3 on the data in Table 1. There are two clusters, on subspaces S4 and S6 respectively.
However, this clustering is unable to satisfy the definition of hard clustering. In
our model, the groups generated by CLIQUE are treated as bins B as an input
to our model to generate higher quality clusters. The global environment is an
important component of an agent-based model. We define an objective function
as a global environment based on Equation (1). In our model, local environment
bins B are being optimized with the movement of agents A. When objective
function M (B) in Equation (2) reaches its maximal value, agents stop moving.
Bins B are fully optimized and are treated as final clusters C.

$M(B) = \sum_{B_i \in B} |B_i.dimensions| \times (|B_i.agents|)^2 \qquad (2)$
Fig. 1. An example of CLIQUE clustering

Some simple rules are defined to make sure that M (B) can be optimized by the
movements of agents A.
The movement of agents is a parallel decentralized process. Initially, each
agent ai (ai ∈ A) randomly chooses a bin Bj (Bj ∈ B ∧ ai ∈ Bj ) it belongs to as its local environment. In the next loop, agent ai randomly chooses another bin Bk (k ≠ j ∧ Bk ∈ B) as the destination of movement. The ΔM (Bj ) and ΔM (Bk )
measure the changes in Bj and Bk with respect to the global objective function
M (B). move(ai ) in Equation (5) indicates the influence of the movement on
M (B). If move(ai ) is evaluated as positive, the agent will move from its bin Bj
to the destination Bk . Otherwise, agent ai stays in Bj .

$\Delta M(B_j) = ((|B_j.agents| - 1)^2 \times |B_j.dimensions|) - (|B_j.agents|^2 \times |B_j.dimensions|) \qquad (3)$

$\Delta M(B_k) = ((|B_k.agents| + 1)^2 \times |B_k.dimensions|) - (|B_k.agents|^2 \times |B_k.dimensions|) \qquad (4)$

$move(a_i) = \Delta M(B_k) + \Delta M(B_j) \qquad (5)$
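A minimal sketch of Equations (3)-(5) (our own code; the bin representation is an assumption) shows how the gain of a single move is evaluated.

```python
# Illustrative sketch of Equations (3)-(5): the change in the global objective
# M(B) if agent a_i were to leave bin B_j and join bin B_k. Bins are represented
# as dicts with an agent count and a dimension count.
def delta_leave(bin_j):
    n, d = bin_j["n_agents"], bin_j["n_dimensions"]
    return (n - 1) ** 2 * d - n ** 2 * d           # Equation (3)

def delta_join(bin_k):
    n, d = bin_k["n_agents"], bin_k["n_dimensions"]
    return (n + 1) ** 2 * d - n ** 2 * d           # Equation (4)

def move_gain(bin_j, bin_k):
    return delta_join(bin_k) + delta_leave(bin_j)  # Equation (5)

# The agent moves only if the gain is positive.
source = {"n_agents": 2, "n_dimensions": 4}
target = {"n_agents": 3, "n_dimensions": 4}
gain = move_gain(source, target)
print(gain, "-> move" if gain > 0 else "-> stay")  # 28 - 12 = 16 -> move
```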

When bins B are generated by CLIQUE, each agent ai may be contained in multiple bins, and these are called the preferred bins of ai , denoted as ai .bins. Therefore, ai ∈ Bj , ∀Bj ∈ ai .bins. In order to improve the efficiency of movement, we only allow agent ai to move among its preferred bins ai .bins.
When the objective function M (B) reaches its maximal value, the movements stop and the final clustering C is obtained. Figure 2 is an example of the final clustering on the example data in Table 1. Two clusters are generated: one cluster contains points v1 , v2 and v3 on dimensions S1 , S2 , S3 and S4 , and another cluster contains v4 and v5 on dimensions S1 , S2 , S3 and S6 .

3.3 Algorithm
The algorithm is composed of the following three steps.
Fig. 2. Example of agent-based clustering

Step 1. Generate local bins B. In this step, we utilize the CLIQUE algorithm [3] to generate bins B. The input is the agents A and the parameters ξ and τ . Parameter ξ is used to partition each dimension into intervals of equal length; τ is the threshold used to select dense units. The output of this step is the set of local bins B.
Step 2. Agents A move among their preferred bins. Each agent ai ∈ A randomly chooses one of its preferred bins bk ∈ B as the destination of a movement. If the movement is evaluated as positive, the movement is made; otherwise, it is cancelled. This process repeats until the objective function M (B) reaches its maximum.
Step 3. If M (B) reaches the maximal value, then the clustering process stops.
Each bin in B is treated as a final cluster. The output of this step is the final
clusters C.

4 Experiments
4.1 Data and Evaluation Criteria
In the experiments, we compare our ACS algorithm with existing subspace clus-
tering algorithems which include CLIQUE, DOC, FIRES, P3C, Schism, Subclu,
MineClus, and PROCLUS [1]. All these algorithms are implemented on a Weka
subspace clustering plugin tool[7]. Table 3 shows the datasets used in our exper-
iments, which are public data sets from UCI repositery.
F1 measure and Entropy are chosen to evaluate the algorithms.

– F1 measure considers recall and precision [7]. For each cluster Ti in clustering T , there is a set of mapped found clusters mapped(Ti ). Let VTi be the objects of the cluster Ti and Vm(Ti ) the union of all objects from the clusters in mapped(Ti ). Recall and precision are formalized by:
$recall(T_i) = \frac{|V_{T_i} \cap V_{m(T_i)}|}{|V_{T_i}|} \qquad (6)$
$precision(T_i) = \frac{|V_{T_i} \cap V_{m(T_i)}|}{|V_{m(T_i)}|} \qquad (7)$
The harmonic mean of precision and recall is the F1 measure. A high F1 measure corresponds to a good cluster quality.
Algorithm 1. Agent-based Subspace Clustering


Input: Agents A = {a1 , . . . , am }, Parameters ξ, τ
Output: C = {C1 , . . . , Cm }
// Step 1
1. B = CLIQU E(V, ξ, τ )
// Step 2
2. for all bj in B do
3. for all ai in bj .agents do
4. insert bj into ai .bins
5. end for
6. end for
7. for all bj in B do
8. bj .agents ← null
9. end for
10. M (B) = 0
11. repeat
12. for all ai in A do
13. Randomly choose destination bin bk
14. if move(ai ) > 0 then
15. ai move to bk
16. else
17. ai stay at the bj
18. end if
19. end for
20. until M (B) is not increased for a certain number of continuous loops
// Step 3
21. for all bi in B do
22. Ci ← Bi
23. end for
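For concreteness, the following is a compact runnable sketch of the movement loop of Step 2 (our own illustration, not the authors' implementation; the bin/agent representations are assumptions, and the stall-based stopping rule stands in for "M(B) not increased for a certain number of continuous loops").

```python
# Compact, self-contained sketch of the agent movement loop (Algorithm 1, Step 2).
# Bins are dicts holding a set of agents and a set of dimensions; agent_bins
# maps each agent to the list of its preferred bins.
import random

def agent_movement(agent_bins, bins, max_stall=50):
    # start every agent in a random preferred bin
    location = {}
    for agent, preferred in agent_bins.items():
        location[agent] = random.choice(preferred)
        bins[location[agent]]["agents"].add(agent)
    stall = 0
    while stall < max_stall:                       # stop after a run of useless moves
        agent = random.choice(list(agent_bins))
        src, dst = location[agent], random.choice(agent_bins[agent])
        if dst == src:
            stall += 1
            continue
        n_s, d_s = len(bins[src]["agents"]), len(bins[src]["dims"])
        n_d, d_d = len(bins[dst]["agents"]), len(bins[dst]["dims"])
        gain = ((n_d + 1) ** 2 - n_d ** 2) * d_d + ((n_s - 1) ** 2 - n_s ** 2) * d_s
        if gain > 0:                               # move only if M(B) increases
            bins[src]["agents"].remove(agent)
            bins[dst]["agents"].add(agent)
            location[agent] = dst
            stall = 0
        else:
            stall += 1
    return {name: info for name, info in bins.items() if info["agents"]}

bins = {"B1": {"agents": set(), "dims": {"S1", "S2", "S3", "S4"}},
        "B2": {"agents": set(), "dims": {"S6"}}}
prefs = {"v1": ["B1", "B2"], "v2": ["B1", "B2"], "v3": ["B1", "B2"],
         "v4": ["B2"], "v5": ["B2"]}
print(agent_movement(prefs, bins))
```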

Table 3. Public Data Sets of UCI repository

Data Name Attributes Num. Data Size


Breasta 34 198
Diabetesb 8 768
Glass 10 214
Shape 18 160
Pendigits 16 7494
Liver 7 345

a
Breast Cancer Wisconsin (Prognostic).
b
Pima Indians Diabetes.

– Entropy is used to measure the homogeneity of the found clusters with respect to the true clusters [7]. Let C be the found clusters and T the true clusters. For each Cj ∈ C, the entropy of Cj is defined as:
$E(C_j) = -\sum_{i=1}^{m} p(T_i|C_j)\,\log(p(T_i|C_j)) \qquad (8)$
The overall quality of the clustering is obtained as the average over all clus-
ters Cj ∈ C weighted by the number of objects per cluster. By normalizing
with the maximal entropy log(m) for m hidden clusters and taking the in-
verse, the range is between 0 (low quality) and 1 (perfect):
$1 - \frac{\sum_{j=1}^{k} |C_j| \cdot E(C_j)}{\log(m)\,\sum_{j=1}^{k} |C_j|} \qquad (9)$
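A minimal sketch of both evaluation criteria (our own code; clusters are plain sets of object ids, and mapping each hidden cluster to its best-overlapping found cluster for F1 is an assumption made for this example):

```python
# Illustrative sketch of the F1 measure (Equations (6)-(7)) and the normalized
# entropy score (Equations (8)-(9)).
import math

def f1_measure(hidden_clusters, found_clusters):
    scores = []
    for hidden in hidden_clusters:
        mapped = max(found_clusters, key=lambda f: len(hidden & f))  # mapped(Ti)
        inter = len(hidden & mapped)
        if inter == 0:
            scores.append(0.0)
            continue
        recall = inter / len(hidden)            # Equation (6)
        precision = inter / len(mapped)         # Equation (7)
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(hidden_clusters)

def entropy_score(found_clusters, hidden_clusters):
    m = len(hidden_clusters)
    def entropy(found):                          # Equation (8)
        e = 0.0
        for hidden in hidden_clusters:
            p = len(found & hidden) / len(found)
            if p > 0:
                e -= p * math.log(p)
        return e
    total = sum(len(c) for c in found_clusters)
    weighted = sum(len(c) * entropy(c) for c in found_clusters)
    return 1.0 - weighted / (math.log(m) * total)  # Equation (9)

hidden = [{1, 2, 3, 4}, {5, 6, 7, 8}]
found = [{1, 2, 3}, {4, 5, 6, 7, 8}]
print(round(f1_measure(hidden, found), 3))     # ~0.873
print(round(entropy_score(found, hidden), 3))
```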

4.2 Experimental Results

For a fair evaluation, we show the best results of all algorithms in the massive
experiments with various parameter settings for each algorithm.
Figures 3-8 show the performance of the algorithms on data breast, diabetes,
glass, pendigits, liver and shape. From the figures, we can see that ACS performs
better than the other subspace methods on both F1 measure and Entropy. For breast, glass and shape data, ACS has the best performance on F1 measure and Entropy. In particular, ACS has a much better F1 measure than the others. For diabetes, pendigits and liver data, the performance of ACS ranks high on F1 measure and entropy. In fact, ACS performs similarly to the first-ranked algorithms in each figure.

Fig. 3. Results on Breast Data

Fig. 4. Results on diabetes data


Fig. 5. Results on glass data

Fig. 6. Results on pendigits data

Fig. 7. Results on liver data

Fig. 8. Results on shape data


Fig. 9. Scalability with dimensionality

Fig. 10. Scalability with data size

Fig. 11. Running time with various parameters

Fig. 12. Results on stock data

Figures 9 and 10 show the time consumed with respect to the dimensionality
and data size. It is obvious that the time consumed by ACS is similar to those of MineClus, CLIQUE and Schism, while STATPC, DOC, FIRES, P3C and PROCLUS consume much longer times than the first group. We can conclude that ACS is fast and scalable with the number of dimensions and the data size.
ACS has two parameters: ξ and τ . Figure 11 shows how the running time changes with these parameters. From the figure, we can see that the run time decreases with the increase of ξ and τ .

4.3 A Case Study


Our technique has been applied to stock market surveillance. In stock market,
the key surveillance function is identifying market anomalies, such as market
manipulation, to provide a fair and efficient trading platform. Market manipula-
tion refers to the trade action which aims to interfere with the demand or supply
of a given stock to make the price increase or decrease in a particular way.
A data set is composed based on the financial model proposed by Aggarwal and Wu [2]. In this model, there are three periods of time to describe a stock market manipulation: pre-manipulation, manipulation and post-manipulation. The stock price rises throughout the manipulation period and then falls in the post-manipulation period. We analyze stock market manipulation in HKEx (Hong Kong Stock Exchange). The market manipulation data are collected from the SFC (Securities and Futures Commission). The trade data are collected from the Yahoo website. The attributes of the daily trade data include: daily price, daily volume, daily volatility, daily highest price, daily lowest price. We also collect the index of Hong Kong in the same period of time. The trade days when no manipulation occurs are treated as normal trade days while the trade days when a manipulation happened are treated as abnormal days.
We test the performance of ACS and other subspace clustering algorithms on
this dataset to see their performance in reality. From Figure 12, we can see ACS
perform the best on both F1 measure and Entropy. The results show that ACS
is a potential approach to cluster data sets in the real-world applications.

5 Conclusion and Future Work


This paper presents an agent-based subspace clustering algorithm, with which
agent-based method is used to obtain an optimized subspace clustering by mov-
ing agents among local environments. The experimental results show that the
proposed technique outperforms existing subspace clustering algorithms. The
effectiveness of our technique is also validated by a case study in stock market
surveillance.
This research can be extended in two ways. One potential direction is utilizing agent-based subspace clustering to identify outliers in high dimensional data. The other future work is to research the use of agent-based methods for semi-supervised subspace clustering.

References
1. Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., Park, J.S.: Fast algorithms for
projected clustering. In: SIGMOD 1999: Proceedings of the, ACM SIGMOD inter-
national conference on Management of data, pp. 61–72. ACM, New York (1999)
2. Aggarwal, R.K., Wu, G.: Stock market manipulations. Journal of Business 79(4),
1915–1954 (2006)
3. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clus-
tering of high dimensional data for data mining applications. SIGMOD Rec. 27(2),
94–105 (1998)
4. Cao, L.: In-depth behavior understanding and use: the behavior informatics ap-
proach. Information Science 180(17), 3067–3085 (2010)
5. Cheng, C.-H., Fu, A.W., Zhang, Y.: Entropy-based subspace clustering for mining
numerical data. In: KDD 1999: Proceedings of the Fifth ACM SIGKDD Interna-
tional Conference on Knowledge Discovery and Data Mining, pp. 84–93. ACM,
New York (1999)
6. Cao, P.A.M.L., Gorodetsky, V.: Agent mining: The synergy of agents and data
mining. IEEE Intelligent Systems 24(3), 64–72 (2009)
7. Müller, E., Günnemann, S., Assent, I., Seidl, T.: Evaluating clustering in subspace
projections of high dimensional data. Proc. VLDB Endow. 2(1), 1270–1281 (2009)
8. Ogston, E., Overeinder, B., van Steen, M., Brazier, F.: A method for decentralized
clustering in large multi-agent systems. In: AAMAS 2003: Proceedings of the sec-
ond international joint conference on Autonomous agents and multiagent systems,
pp. 789–796. ACM, New York (2003)
9. Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.M.: A monte carlo algorithm
for fast projective clustering. In: SIGMOD 2002: Proceedings of the 2002 ACM
SIGMOD International Conference on Management of Data, pp. 418–427. ACM,
New York (2002)
10. Xu, X., Chen, L., He, P.: A novel ant clustering algorithm based on cellular au-
tomata. Web Intelli. and Agent Sys. 5(1), 1–14 (2007)
Evaluating Pattern Set Mining Strategies in a
Constraint Programming Framework

Tias Guns, Siegfried Nijssen, and Luc De Raedt

Katholieke Universiteit Leuven


Celestijnenlaan 200A, B-3001 Leuven, Belgium
{Tias.Guns,Siegfried.Nijssen,Luc.DeRaedt}@cs.kuleuven.be

Abstract. The pattern mining community has shifted its attention from
local pattern mining to pattern set mining. The task of pattern set min-
ing is concerned with finding a set of patterns that satisfies a set of
constraints and often also scores best w.r.t. an optimisation criteria. Fur-
thermore, while in local pattern mining the constraints are imposed at
the level of individual patterns, in pattern set mining they are also con-
cerned with the overall set of patterns. A wide variety of different pattern
set mining techniques is available in literature. The key contribution of
this paper is that it studies, compares and evaluates such search strate-
gies for pattern set mining. The investigation employs concept-learning
as a benchmark for pattern set mining and employs a constraint pro-
gramming framework in which key components of pattern set mining are
formulated and implemented. The study leads to novel insights into the
strong and weak points of different pattern set mining strategies.

1 Introduction
In the pattern mining literature, the attention has shifted from local to global
pattern mining [1,10] or from individual patterns to pattern sets [5]. Local pattern
mining is traditionally formulated as the problem of computing Th(L, ϕ, D) =
{π ∈ L | ϕ(π, D) is true}, where D is a data set, L a language of patterns, and ϕ
a constraint or predicate that has to be satisfied. Local pattern mining does not
take into account the relationships between patterns; the constraints are evalu-
ated locally, that is, on every pattern individually, and if the constraints are not
restrictive enough, too many patterns are found. On the other hand, in global
pattern mining or pattern set mining, one is interested in finding a small set of rel-
evant and non-redundant patterns. Pattern set mining can be formulated as the
problem of computing Th(L, ϕ, ψ, D) = {Π ⊆ Th(L, ϕ, D) | ψ(Π, D) is true},
where ψ expresses constraints that have to be satisfied by the overall pattern
sets. In many cases a function f is used to evaluate pattern sets and one is then
only interested in finding the best pattern set Π, i.e. arg maxΠ∈Th(L,ϕ,ψ,D) f (Π).
Within the data mining and the machine learning literature numerous ap-
proaches exist that perform pattern set mining. These approaches employ a wide
variety of search strategies. In data mining, the step-wise strategy is common,


in which first all frequent patterns are computed; they are heuristically post-
processed to find a single compressed pattern set; examples are KRIMP [16]
and CBA [12]. In machine learning, the sequential covering strategy is popular,
which repeatedly and heuristically searches for a good pattern or rule and imme-
diately adds this pattern to the current pattern- (or rule-)set; examples are FOIL
[14] and CN2 [3]. Only a small number of techniques, such as [5,7,9], search for
pattern sets exhaustively, either in a step-wise or in a sequential covering setting.
The key contribution of this paper is that we study, evaluate and compare
these common search strategies for pattern set mining. As it is infeasible to
perform a detailed comparison on all pattern set mining tasks that have been
considered in the literature, we shall focus on one prototypical task for pattern
set mining: boolean concept-learning. In this task, the aim is to most accurately
describe a concept for which positive and negative examples are given. Within
this paper we fix the optimisation measure to accuracy; our focus
is on the exploration of a wide variety of search strategies for this measure, from
greedy to complete and from step-wise to one-step approaches.
To be able to obtain a fair and detailed comparison we choose to reformulate
the different strategies within the common framework of constraint program-
ming. This choice is motivated by [4,13], who have shown that constraint pro-
gramming is a very flexible and usable approach for tackling a wide variety of
local pattern mining tasks (such as closed frequent itemset mining and discrim-
inative or correlated itemset mining), and recent work [9,7] that has lifted these
techniques to finding k-pattern sets under constraints (sets containing exactly
k patterns). In [7], a global optimization approach to mining pattern sets has
been developed and has been shown to work for concept-learning, rule-learning,
redescription mining, conceptual clustering as well as tiling. In the present work,
we employ this constraint programming framework to compare different search
strategies for pattern set mining, focusing on one mining task in more detail.
This paper is organized as follows: in Section 2, we introduce the problem of
pattern set mining and its benchmark, concept-learning; in Section 3, we formu-
late these problems in the framework of constraint programming and introduce
various search strategies for pattern set mining; in Section 4, we report on ex-
periments, and finally, in Section 5, we conclude.

2 Pattern Set Mining Task


The benchmark task on which we shall evaluate different pattern set mining
strategies is that of finding boolean concepts in the form of k-term DNF expres-
sions. This task is well-known in computational learning theory [8] and is closely
related to rule-learning systems such as FOIL [14] and CN2 [3] and data mining
systems such as CBA [12] and KRIMP [16]. It is – as we shall now show – a
pattern set mining task of the form arg maxΠ∈Th(L,ϕ,ψ,D) f (Π).
In this setting, one is given a set of positive and negative examples, where
each example corresponds to a boolean variable assignment to the items in I,
the set of possible items. Thus each example is an itemset Ix ⊆ I. Positive
examples will belong to the set of transactions T + , negatives ones to T − . The

pattern language is the set L = 2^I. Hence each pattern corresponds to an itemset
I_p ⊆ I and represents a conjunction of items. The task is then to learn a concept
description (a boolean formula) that covers all (or most) of the positive examples
and none (or only a few) of the negatives. This can be measured using the
accuracy measure, defined as:

accuracy(p, n) = (p + (N − n)) / (P + N)   (1)
where p and n are the number of positive, respectively negative, examples cov-
ered, and P and N are the total number of positive, resp. negative, examples
present in the database. Concept descriptions are pattern sets, where each pat-
tern set corresponds to a disjunction of patterns (conjunctions). Following [7,15],
we shall focus on finding pattern sets that contain exactly k patterns. Thus the
pattern sets correspond to k-term DNF formulas. An example is considered cov-
ered by the pattern set if the example is a superset of at least one of the itemsets
in the pattern set.
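
To make the measure concrete, the following minimal Python sketch evaluates Equation 1; the argument names simply mirror the symbols p, n, P and N above and are illustrative only.

def accuracy(p, n, P, N):
    """Accuracy of a pattern set that covers p positive and n negative
    examples, out of P positives and N negatives in total (Equation 1)."""
    return (p + (N - n)) / (P + N)

# Example: 40 of 50 positives covered, 5 of 50 negatives covered.
print(accuracy(p=40, n=5, P=50, N=50))   # 0.85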
Thus the task considered is an instance of the pattern set mining task
arg maxΠ∈Th(L,ϕ,ψ,D) f (Π), where f is the accuracy, D = T = T + ∪ T − , and
L = 2I ; ϕ can be instantiated to a minimum support constraint (requiring
that each pattern covers a certain number of examples), a minimum accuracy
constraint (requiring that each pattern is individually accurate), or to true, a
constraint which is always true and allows any pattern to be used which leads
to an accurate final set. ψ states that |Π| = k.
Finding a good pattern set is often a hard task; many pattern set mining
tasks, such as the task of k-term DNF learning, are NP complete [8]. Hence,
there are no straightforward algorithms for solving such tasks in general, giving
rise to a wide variety of search algorithms. The pattern set mining techniques
they employ can be categorized along two dimensions.
Two-Step vs One-Step: in the two step approach, one first mines patterns
under local constraints to compute the set Th(L, ϕ, D); afterwards, these pat-
terns are fed into another algorithm that computes arg maxΠ∈Th(L,ϕ,ψ,D)f (Π)
using post-processing. In the one step approach, this strict distinction be-
tween these two phases cannot be made.
Exact vs Approximate: exact methods provide strong guarantees for find-
ing the optimal pattern set under the given constraints, while approximate
methods employ heuristics to find good though not necessarily optimal so-
lutions.
In the next section we will consider the instantiations of these settings for the case
of concept learning. However, first we will introduce the constraint programming
framework within which we will study these instantiations.

3 Constraint Programming Framework


Throughout the remainder of this paper we shall employ the constraint program-
ming framework of [4] for representing and solving pattern set mining problems.

This framework has been shown 1) to allow for the use of a wide range of con-
straints, 2) to work for both frequent and discriminative pattern mining [13], and
3) to be extendible towards the formulation of k pattern set mining, cf. [7,9].
These other papers provide detailed descriptions of the underlying constraint
programming algorithms and technology, including an analysis of the way in
which they explore the search tree and a performance analysis. On the other
hand, in the present paper – due to space restrictions – we need to focus on
the declarative specification of the constraint programming problems; we refer
to [4,13,7] for more details on the search strategy of such systems.

3.1 Constraint Programming Notation


Following [4], we assume that we are given a domain of items I and transactions
T , and a binary matrix D. A key insight of the work of [4] is that constraint
based mining tasks can be formulated as constraint satisfaction problems over
the variables in π = (I, T ), where a pattern π is represented using the vectors
I and T , with a boolean variable Ii and Tt for every item i ∈ I and every
transaction t ∈ T . A candidate solution to the constraint satisfaction problem is
then one assignment of the variables in π which corresponds to a single itemset.
For instance, the pattern represented by π = (< 1, 0, 1 >, < 1, 1, 0, 0, 1 >) has
items 1 and 3, and covers transactions 1, 2 and 5. Following [7], a pattern set
Π of size k simply consists of k such patterns: Π = {π1 , . . . , πk }, ∀p = 1, . . . , k :
πp = (I p , T p ). We now discuss the different two-step and one-step pattern set
mining approaches.

3.2 Two-Step Pattern Set Mining


In two step pattern set mining approaches, one first searches for the set of local
patterns Th(L, ϕ, D) that satisfy a set of constraints, and then post-processes
these to find the pattern sets in Th(L, ϕ, ψ, D).

Step 1: Local Pattern Mining. Using the above notation one can formulate
many local pattern mining problems, such as frequent and discriminative pattern
mining. Indeed, consider the following constraints, introduced in [4,13]:

∀t ∈ T : T_t = 1 ↔ Σ_{i∈I} I_i (1 − D_ti) = 0.   (Coverage)

∀i ∈ I : I_i = 1 ↔ Σ_{t∈T} T_t (1 − D_ti) = 0.   (Closedness)

∀i ∈ I : I_i = 1 → Σ_{t∈T} T_t D_ti ≥ θ.   (Min. frequency)

∀i ∈ I : I_i = 1 → accuracy(Σ_{t∈T+} T_t D_ti, Σ_{t∈T−} T_t D_ti) ≥ θ.   (Min. accuracy)

In these constraints, the coverage constraint links the items to the transactions:
it states that the transaction set T must be identical to the set of all transactions
that are covered by the itemset I. The closedness constraint removes redundancy

by ensuring that an itemset has no superset with the same frequency. It is a
well-known property that every non-closed pattern has an equally frequent and
accurate closed counterpart. The minimum frequency constraint ensures that
itemset I covers at least θ transactions. It can more simply be formulated as
Σ_{t∈T} T_t ≥ θ. The above formulation is equivalent, but posted for each item
separately (observe that Σ_{t∈T} T_t D_ti counts the number of t in column i of binary
matrix D for which T_t = 1). This so-called reified formulation results in more
effective propagation; cf. [4]. Finally, to mine for all accurate patterns instead
of all frequent patterns, the minimum accuracy constraint can be used, which
ensures that itemsets have an accuracy of at least θ. The reified formulation
again results in more effective propagation [13].
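
To make these constraints concrete, the following Python sketch enumerates closed frequent itemsets by brute force on a toy binary matrix; it only illustrates what the coverage, closedness and minimum-frequency constraints express, and does not reflect the propagation-based search of the constraint programming system of [4] (all names are ours).

from itertools import combinations

def closed_frequent_itemsets(D, theta):
    """Enumerate itemsets I whose covered transaction set T satisfies the
    coverage, closedness and minimum-frequency constraints (|T| >= theta).
    D is a list of binary rows; brute force, for illustration only."""
    n_items = len(D[0])
    items = range(n_items)
    results = []
    for size in range(1, n_items + 1):
        for itemset in combinations(items, size):
            # Coverage: transactions containing every item of the itemset.
            T = [t for t, row in enumerate(D) if all(row[i] for i in itemset)]
            if len(T) < theta:
                continue                     # minimum frequency violated
            # Closedness: no item outside the set occurs in all covered rows.
            closed = all(any(D[t][i] == 0 for t in T)
                         for i in items if i not in itemset)
            if closed:
                results.append((itemset, T))
    return results

D = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
for itemset, T in closed_frequent_itemsets(D, theta=2):
    print(itemset, T)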
To emulate the two step approaches that are common in data mining [12,16,1],
we shall employ two alternatives for the first step: 1) using frequent closed
patterns, which are found with the coverage, closedness and minimum fre-
quency constraints; 2) using accurate closed patterns, found with the cov-
erage, closedness and minimum accuracy constraints. Both of these approaches
perform the first step in an exact manner. They find the set of all local patterns
adhering to the constraints.

Step 2: Post-processing the Local Patterns. Once the local patterns have
been computed, the two step approach post-processes them in order to arrive at
the pattern set. We describe the two main approaches for this.

Post-processing by Sequential Covering (Approximate). The simplest ap-
proach to the second step is to perform greedy sequential covering, in which
one iteratively selects the best local pattern from Th(L, ϕ, D) and removes all of
the positive examples that it covers. This continues until the desired number of
patterns k has been reached or all positive examples are already covered. This
type of approach is most common in data mining systems. Whereas in the first
step the set Th(L, ϕ, D) is computed exactly in these methods, the second step
is often an iterative loop in which patterns are selected greedily from this set.
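
A minimal sketch of this greedy selection is given below; it assumes the local patterns of step one are given as a mapping from pattern identifiers to the transactions they cover, and uses the accuracy of Equation 1 as the selection criterion (data structures and names are illustrative, not those of KRIMP or CBA).

def sequential_covering(patterns, positives, negatives, k):
    """Greedy post-processing: in each round pick the local pattern that is
    most accurate on the still-uncovered positives, then remove the positives
    it covers.  `patterns` maps a pattern id to the set of transactions it
    covers; illustrative sketch only."""
    P, N = len(positives), len(negatives)
    remaining = set(positives)
    negatives = set(negatives)
    chosen = []
    for _ in range(k):
        if not remaining:
            break

        def score(r):
            p = len(patterns[r] & remaining)   # positives newly covered
            n = len(patterns[r] & negatives)   # negatives covered
            return (p + (N - n)) / (P + N)     # accuracy, Equation 1

        best = max(patterns, key=score)
        chosen.append(best)
        remaining -= patterns[best]
    return chosen

patterns = {"r1": {1, 2, 3, 10}, "r2": {4, 5}, "r3": {1, 2, 4, 11}}
print(sequential_covering(patterns, positives=[1, 2, 3, 4, 5],
                          negatives=[10, 11], k=2))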

Post-processing using Complete Search (Exact). Another possibility is to per-
form a new round of pattern mining as described in [5]. In this case, each pre-
viously found pattern in P = Th(L, ϕ, D) can be seen as an item r in a new
database; each new item identifies a pattern. One is looking for the set of pat-
tern identifiers P ⊆ P with the highest accuracy. In this case, the set is not a
conjunction of items, but a disjunction of patterns, meaning that a transaction
is covered if at least one of the patterns r ∈ P covers it. This can be formulated
in constraint programming after a transformation of the data matrix D into a
matrix M where the rows correspond to the transactions in T and the columns
to the patterns in P. Moreover Mtr is 1 if and only if pattern r covers trans-
action t and 0 otherwise. The solution set is now represented using Π = (P, T ),
where P is the vector representation of the pattern set, that is, Pr = 1 iff r ∈ P .
The formulation of post-processing using complete search is now:


∀t ∈ T : T_t = 1 ↔ Σ_{r∈P} P_r M_tr ≥ 1.   (Disj. Coverage)

∀r ∈ P : P_r = 1 → accuracy(Σ_{t∈T+} L_tr, Σ_{t∈T−} L_tr) ≥ θ   (Min. Accuracy)

Σ_{r∈P} P_r = k   (Set Size)

To obtain a reified formulation of the accuracy constraint we here use L_tr =
max(T_t, M_tr) = M_tr + (1 − M_tr) T_t. The column for pattern r in this matrix
represents the transaction vector if the pattern r would be added to the set P .
The first constraint is the disjunctive coverage constraint. The second con-
straint is the minimum accuracy constraint, posted on each pattern separately
and taking the disjunctive coverage into account. Lastly, the set size constraint
limits the pattern set to size k.
This type of exact two-step approach is relatively new in data mining. Two
notable works are [11,5]. In these publications, it was proposed to post-process
a set of patterns by using a complete search over subsets of patterns. If an exact
pattern mining algorithm is used to compute the initial set of pattern in the first
step, this gives a method that is overall exact and offers strong guarantees on
the quality of the solution found.
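
The following sketch makes this exact post-processing step concrete by exhaustively scoring every size-k subset of the mined patterns on the matrix M defined above; it ignores the reified minimum-accuracy constraint and is exponential in the number of patterns, so it illustrates the formulation rather than a practical solver (names are ours).

from itertools import combinations

def best_k_pattern_set(M, labels, k):
    """Exact post-processing: search all size-k subsets of the mined patterns
    (columns of the 0/1 matrix M) and return the disjunction with the highest
    accuracy.  labels[t] is 1 for positive transactions, 0 for negative ones."""
    n_patterns = len(M[0])
    P = sum(labels)                 # number of positive transactions
    N = len(labels) - P
    best_set, best_acc = None, -1.0
    for subset in combinations(range(n_patterns), k):
        covered = [any(row[r] for r in subset) for row in M]   # disjunction
        p = sum(1 for c, y in zip(covered, labels) if c and y == 1)
        n = sum(1 for c, y in zip(covered, labels) if c and y == 0)
        acc = (p + (N - n)) / (P + N)
        if acc > best_acc:
            best_set, best_acc = subset, acc
    return best_set, best_acc

# M[t][r] = 1 iff pattern r covers transaction t.
M = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
labels = [1, 1, 1, 0]
print(best_k_pattern_set(M, labels, k=2))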

3.3 One-Step Pattern Set Mining


This type of strategy, which is common in machine learning, searches for the
pattern set Th(L, ϕ, ψ, D) directly, that is, the computation of Th(L, ϕ, ψ, D)
and Th(L, ϕ, D) is integrated or interleaved. This can remove the need to have
strong constraints with strict thresholds in ϕ. There are two approaches to this:
Iterative Sequential Covering (Approximate). In the iterative sequential
covering approach that we investigate here, a beam search is employed (with
beam width b) to heuristically find the best pattern set. At each step during the
search a local pattern mining algorithm is used to find the top-b patterns (with
the highest accuracy) and uses these to compute new candidate pattern sets on
its beam, after which it prunes all but the best b pattern sets from its beam. This
setting is similar to 2-step sequential covering, only that here, at each iteration,
the most accurate pattern is mined for directly, instead of selecting it from a set
of previously mined patterns. Mining for the most accurate pattern can be done
in a constraint programming setting by doing branch-and-bound search over the
accuracy threshold θ. In the experimental section, we shall consider different
versions of the approach, corresponding to different sizes of the beam. When
b = 1, one often talks about greedy sequential covering.
Examples of one-step greedy sequential covering methods are FOIL and CN2;
however, they use greedy algorithms to identify the local patterns instead of a
branch-and-bound pattern miner. In data mining, the use of branch-and-bound
pattern mining algorithms was recently studied for identifying top-b patterns;
see for instance [2].
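
The sketch below shows how the beam over pattern sets is maintained, under a simplifying assumption: instead of calling a branch-and-bound top-b pattern miner in every iteration, candidate patterns are drawn from a small fixed pool, which is enough to illustrate the search but is not how the actual system obtains its candidates (names are ours).

def beam_search_pattern_sets(candidates, positives, negatives, k, b):
    """Iterative sequential covering with beam width b: extend every pattern
    set on the beam with candidate patterns and keep the b best resulting
    sets, measured by the accuracy of their disjunctive coverage."""
    P, N = len(positives), len(negatives)
    pos, neg = set(positives), set(negatives)

    def acc(pattern_set):
        covered = set()
        for r in pattern_set:
            covered |= candidates[r]
        p, n = len(covered & pos), len(covered & neg)
        return (p + (N - n)) / (P + N)      # accuracy, Equation 1

    beam = [()]                             # start from the empty pattern set
    for _ in range(k):
        expansions = {tuple(sorted(s + (r,)))
                      for s in beam for r in candidates if r not in s}
        beam = sorted(expansions, key=acc, reverse=True)[:b]
    return beam[0], acc(beam[0])

candidates = {"r1": {1, 2, 10}, "r2": {3, 4}, "r3": {1, 2, 3, 11}}
print(beam_search_pattern_sets(candidates, positives=[1, 2, 3, 4],
                               negatives=[10, 11], k=2, b=2))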

Global Optimization (Exact). The last option is to specify the problem of
finding a pattern set of size k as a global optimization problem. This is possible
in a constraint programming framework, thanks to its generic handling of con-
straints, cf. [7]. The formulation, searching for k patterns πp = (I p , T p ) directly,
is as follows:

∀p ∈ {1, . . . , k} : ∀t ∈ T : T_t^p ↔ Σ_{i∈I} I_i^p (1 − D_ti) = 0,   (Coverage) (2)

∀p ∈ {1, . . . , k} : ∀i ∈ I : I_i^p ↔ Σ_{t∈T} T_t^p (1 − D_ti) = 0,   (Closed) (3)

T^1 < T^2 < . . . < T^k   (Canonical) (4)

∀t ∈ T : B_t = [ Σ_{p∈{1..k}} T_t^p ≥ 1 ],   (Disj. coverage) (5)

maximize accuracy(Σ_{t∈T+} B_t, Σ_{t∈T−} B_t).   (Accurate) (6)

Each pattern has to cover the transactions (Eq. 2) and be closed (Eq. 3). The
canonical form constraint in Eq. 4 enforces a fixed lexicographic ordering on the
itemsets, thereby avoiding to find equivalent but differently ordered pattern sets.
In Eq. 5, the variables Bt are auxiliary variables representing whether transaction
t is covered by at least one pattern, corresponding to a disjunctive coverage.
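
As a brute-force stand-in for this constraint programming model, the sketch below enumerates all size-k sets of itemsets, computes the disjunctive coverage B_t of Eq. 5 and keeps the set with the highest accuracy (Eq. 6); it omits the closedness constraint of Eq. 3 and relies on the enumeration avoiding repeated sets instead of the canonical-ordering constraint of Eq. 4, so it only illustrates the objective, not the actual solver.

from itertools import combinations

def exact_k_pattern_set(D, labels, k):
    """One-step exact search over k itemsets maximising the accuracy of their
    disjunctive coverage.  Exponential in the number of items; sketch only."""
    n_items = len(D[0])
    all_itemsets = [frozenset(s) for size in range(1, n_items + 1)
                    for s in combinations(range(n_items), size)]
    P = sum(labels)
    N = len(labels) - P
    best, best_acc = None, -1.0
    for pattern_set in combinations(all_itemsets, k):
        # B_t: transaction t is covered by at least one itemset (Eq. 5).
        B = [any(all(row[i] for i in itemset) for itemset in pattern_set)
             for row in D]
        p = sum(1 for b, y in zip(B, labels) if b and y == 1)
        n = sum(1 for b, y in zip(B, labels) if b and y == 0)
        acc = (p + (N - n)) / (P + N)
        if acc > best_acc:
            best, best_acc = pattern_set, acc
    return best, best_acc

D = [[1, 1, 0], [0, 1, 1], [1, 0, 1], [0, 0, 1]]
labels = [1, 1, 1, 0]
print(exact_k_pattern_set(D, labels, k=2))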
The one-step global optimization approaches to pattern set mining are less
common; the authors are only aware of [7,9]. One could argue that some iterative
pattern mining strategies will find pattern sets that are optimal under certain
conditions. For instance, Tree2 [2] can find a pattern set with minimal error on
supervised training data; however, it neither provides guarantees on the size of
the final pattern set nor provides guarantees under additional constraints.

4 Experiments

We now compare the different approaches to boolean concept learning that we


presented and answer the following two questions:

– Q1: Under what conditions do the different strategies perform well?


– Q2: What quality/runtime trade-offs do the strategies make?

To measure the quality of a pattern set, we evaluate its accuracy on the dataset.
This is an appropriate means of evaluation, as in the boolean concept learning
task we consider, the goal is to find a concise description of the training data,
rather than a hypothesis that generalizes to an underlying distribution.
The experiments were performed using the Gecode-based system proposed
by [4] and performed on PCs running Ubuntu 8.04 with Intel(R) Core(TM)2
Quad CPU Q9550 processors and 4GB of RAM. The datasets were taken from
the website accompanying this system1 . The datasets were derived from the UCI
1
http://dtai.cs.kuleuven.be/CP4IM/datasets/

Table 1. Data properties and number of patterns found for different constraints and
thresholds. 25M+ denotes that more than 25 million patterns were found.
Mushroom Vote Hepatitis German-credit Austr.-credit Kr-vs-kp
Transactions 8124 435 137 1000 653 3196
Items 119 48 68 112 125 73
Class distr. 52% 61% 81% 70% 55% 52%
Total patterns 221524 227032 3788342 25M+ 25M+ 25M+
Pattern poor/rich poor poor poor rich rich rich
frequency ≥ 0.7 12 1 137 132 274 23992
frequency ≥ 0.5 44 13 3351 2031 8237 369415
frequency ≥ 0.3 293 627 93397 34883 257960 25M+
frequency ≥ 0.1 3287 35771 1827264 2080153 24208803 25M+
accuracy ≥ 0.7 197 193 361 2 11009 52573
accuracy ≥ 0.6 757 1509 3459 262 492337 2261427
accuracy ≥ 0.5 11673 9848 31581 6894 25M+ 25M+
accuracy ≥ 0.4 221036 105579 221714 228975 25M+ 25M+

Machine Learning repository [6] by discretising numeric attributes into eight
equal-frequency bins. To obtain reasonably balanced class sizes we used the
majority class as the positive class. Experiments were run on many datasets,
but we here present the findings on 6 diverse datasets whose basic properties are
listed in the top 3 rows of Table 1.

4.1 Two-Step Pattern Set Mining

The result of a two-step approach obviously depends on the quality of the pat-
terns found in the first step. We start by investigating the feasibility of this first
step, and then study the two-step methods as a whole.

Step 1: Local Pattern Mining. As indicated in Section 3.2, we employ two
alternatives: using frequent closed patterns and using accurate closed patterns.
Both methods rely on a threshold to influence the number of patterns found.
Table 1 lists the number of patterns found on a number of datasets, for the
two alternatives and with different thresholds. Out of practical considerations
we stopped the mining process when more than 25 million patterns were found.
Using this cut-off, we can distinguish pattern poor data (data having less than
25 million patterns when mining unconstrained) and pattern rich data. In the
case of pattern poor data, one can mine using very low or even no thresholds. In
the case of pattern rich data, however, one has to use a more stringent threshold
in order not be overwhelmed by patterns. Unfortunately, one has to mine with
different thresholds to discover how pattern poor or rich an unseen dataset is.

Step 2: Post-processing the Local Patterns. We now investigate how the
quality of the global pattern sets is influenced by the threshold used in the first
step, and how this compares to pattern sets found by 1-step methods that do
not have such thresholds.

Fig. 1. Quality & runtime for approx. methods, pattern poor hepatitis dataset. In the
left figure, algorithms with identical outcome are grouped together.

Fig. 2. Quality & runtime for approx. methods, pattern rich australian-credit dataset.

Post-processing by Sequential Covering (Approximate). This two-step approach
picks the best local pattern from the set of patterns computed in step one. As
such, the quality of the pattern set depends on whether the right patterns are
in the pre-computed pattern set. We use our generic framework to compare
two-step sequential covering to the one-step approach.
For pattern poor data for which the set of all patterns can be calculated, such
as the mushroom, vote and hepatitis dataset, using all patterns obviously results
in the same pattern set as found by the one-step approach. Figure 1 shows the
prototypical result for such data: low thresholds lead to good pattern sets, while
higher thresholds gradually worsen the solution. For this dataset, starting from
K=3, no better pattern set can be found. The same is true for the mushroom
dataset, while in the vote dataset the sequential covering method continues to
improve for higher K. Also note that in Figure 1 a better solution is found
when using patterns with accuracy greater than 40%, compared to patterns
with accuracy greater than 50%. This implies that a better pattern set can be
found containing a local pattern that has a low accuracy on the whole data.
This indicates that using accurate local patterns does not permit putting high
thresholds in the first step. With respect to question Q2, we can observe that
using a lower threshold comes at the cost of higher runtimes. However, for pattern
poor datasets such as the one in Figure 1, these times are still manageable. The
remarkable efficiency of the one-step sequential covering method is thanks to
recent advances in mining top-k discriminative patterns [13].

Table 2. Largest K (up to 6) and time to find it for the 2-step complete search method.
- indicates that step 1 was aborted because more than 25 million patterns were found, –
indicates that step 2 did not manage to finish within the timeout of 6 hours. * indicates
that no other method found a better pattern set.

Mushroom Vote Hepatitis German-cr. Austr.-cr. Kr-vs-kp
K sec K sec K sec K sec K sec K
all – – – - - -
freq. ≥ 0.7 6 0.2 only 1 pat 6 0.03 6 2.12 6 0.59 –
freq. ≥ 0.5 6 2.2 6 0.01 2 2650 2 8163 6 14244 –
freq. ≥ 0.3 6 14 6 0.89 – – – -
freq. ≥ 0.1 2 9477 1 1015 – – – -
acc. ≥ 0.7 6 8.6 6 0.12 6 3.05 6 0.01 1 713 –
acc. ≥ 0.6 *4 6714 5 14205 2 6696 6 104 – –
acc. ≥ 0.5 – 1 391 1 3169 1 696 - -
acc. ≥ 0.4 – – – – - -

On pattern rich data such as the german-credit, australian-credit and kr-vs-
kp dataset, similar behaviour can be observed. The only difference is that one is
forced to use more stringent thresholds. Because of this, the pattern set found
by the one-step approach can usually not be found by the two-step approaches.
Figure 2 exemplifies this for the australian-credit dataset. Using a frequency
threshold of 0.1, the same pattern set as for the one-step method is found for
up to K=3, but not so for higher K. When using the highest thresholds, there is
a risk of finding significantly worse pattern sets. On the kr-vs-kp dataset, when
using high frequency thresholds significantly worse results were found as well,
while this was not the case for the accuracy threshold. With respect to Q2 we
have again observed that lower thresholds lead to higher runtimes for the two-
step approaches. Lowering the thresholds further to find even better pattern sets
would correspondingly come at the cost of even higher computation times.

Post-processing using Complete Search (Exact). When post-processing a col-
lection of patterns using complete search, the size of that collection becomes
a determining factor for the success of the method. Table 2 shows the same
datasets and threshold values as in Table 1; here the entries show the largest
K for which a pattern set could be found, up to K=6, and the time it took. A
general trend is that in case many patterns are found in step 1, e.g. more than
100 000, the method is not able to find the optimal solution. With respect to
Q1, only for the mushroom dataset the method found a better pattern set than
any other method, when using all accurate patterns with threshold 0.4. For all
other sets it found however, one of the 1-step methods found a better solution.
Hence, although this method is exact in its second step, it depends on good
patterns from its first step. Unfortunately finding those usually requires using
low threshold values with corresponding disadvantages.

Table 3. Largest K for which the optimal solution was found within 6 hours
Mushroom Vote Hepatitis German-credit Australian-credit Kr-vs-kp
K=2 K=4 K=3 K=2 K=2 K=3

Fig. 3. Quality & runtime for 1-step methods, german-credit dataset. In the left figure,
algorithms with identical outcome are grouped together.

4.2 One-Step Pattern Set Mining

In this section we compare the different one-step approaches, which need no local
pattern constraints and thresholds. We investigate how feasible the one-step
exact approach is, as well as how close the greedy sequential covering method
brings us to this optimal solution, and whether beam search can close the gap
between the two.
When comparing the two-step sequential covering approach with the one-step
approach, we already remarked that the latter is very efficient, though it might
not find the optimal solution. The one-step exact method is guaranteed to find
the optimal solution, but has a much higher computational cost. Table 3 below
shows up to which K the exact method was able to find the optimal solution
within the 6 hours time out. Comparing these results to the two-step exact
approach in Table 2, we see that pattern sets can be found without constraints,
where the two-step approach failed even with constraints.
With respect to Q1 we observed that only for the kr-vs-kp dataset did the greedy
method, and hence all beam searches with a larger beam, find the same pattern
sets as the exact method. For the mushroom and vote dataset, starting from
beam width 5, the optimal pattern set was found. For the german-credit and
australian-credit, a beam width of size 15 was necessary. The hepatitis dataset
was the only dataset for which the complete method was able to find a better
pattern set, in this case for K=3, within the timeout of 6 hours.
Figure 3 shows a representative figure, in this case for the german-credit
dataset: while the greedy method is not capable of finding the optimal pat-
tern set, larger beams successfully find the optimum. For K=6, beam sizes of 15
or 20 lead to a better pattern set than when using a lower beam size. The exact
method stands out as being the most time consuming. For beam search methods,
larger beams clearly lead to larger runtimes. The runtime only increases slightly

for increasing sizes of K because the beam search is used in a sequential covering
loop that shrinks the dataset at each iteration.

5 Conclusions

We compared several methods for finding pattern sets within a common con-
straint programming framework, where we focused on boolean concept learning
as a benchmark. We distinguished one step from two step approaches, as well
as exact from approximate ones. Each method has its strong and weak points,
but the one step approximate approaches, which iteratively mine for patterns,
provided the best trade-off between runtime and accuracy and do not depend
on a threshold; additionally, they can easily be improved using a beam search.
The exact approaches, perhaps unsurprisingly, do not scale well to larger and
pattern-rich datasets. A newly introduced approach for one-step exact pattern
set mining however has optimality guarantees and performs better than previ-
ously used two-step exact approaches. In future work our study can be extended
to consider other problem settings in pattern set mining, as well as other heuris-
tics and evaluation metrics; furthermore, even though we cast all settings in one
implementation framework in this paper, a more elaborate study could clarify
how this approach compares to the pattern set mining systems in the literature.

Acknowledgements. This work was supported by a Postdoc and project “Prin-
ciples of Patternset Mining” from the Research Foundation—Flanders, as well as
a grant from the Agency for Innovation by Science and Technology in Flanders
(IWT-Vlaanderen).

References

1. Bringmann, B., Nijssen, S., Tatti, N., Vreeken, J., Zimmermann, A.: Mining sets
of patterns. In: Tutorial at ECMLPKDD 2010 (2010)
2. Bringmann, B., Zimmermann, A.: Tree2 - decision trees for tree structured data.
In: Jorge, A., Torgo, L., Brazdil, P., Camacho, R., Gama, J. (eds.) PKDD 2005.
LNCS (LNAI), vol. 3721, pp. 46–58. Springer, Heidelberg (2005)
3. Clark, P., Niblett, T.: The CN2 induction algorithm. Machine Learning 3, 261–283
(1989)
4. De Raedt, L., Guns, T., Nijssen, S.: Constraint programming for itemset mining.
In: KDD, pp. 204–212. ACM, New York (2008)
5. De Raedt, L., Zimmermann, A.: Constraint-based pattern set mining. In: SDM.
SIAM, Philadelphia (2007)
6. Frank, A., Asuncion, A.: UCI machine learning repository (2010),
http://archive.ics.uci.edu/ml
7. Guns, T., Nijssen, S., De Raedt, L.: k-Pattern set mining under constraints. CW
Reports CW596, Department of Computer Science, K.U.Leuven (October 2010),
https://lirias.kuleuven.be/handle/123456789/278655
8. Kearns, M.J., Vazirani, U.V.: An introduction to computational learning theory.
MIT Press, Cambridge (1994)

9. Khiari, M., Boizumault, P., Crémilleux, B.: Constraint programming for mining n-
ary patterns. In: Cohen, D. (ed.) CP 2010. LNCS, vol. 6308, pp. 552–567. Springer,
Heidelberg (2010)
10. Knobbe, A., Crémilleux, B., Fürnkranz, J., Scholz, M.: From local patterns to
global models: The lego approach to data mining. In: Fürnkranz, J., Knobbe, A.
(eds.) Proceedings of LeGo 2008, an ECMLPKDD 2008 Workshop (2008)
11. Knobbe, A.J., Ho, E.K.Y.: Pattern teams. In: Fürnkranz, J., Scheffer, T.,
Spiliopoulou, M. (eds.) PKDD 2006. LNCS (LNAI), vol. 4213, pp. 577–584.
Springer, Heidelberg (2006)
12. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining.
In: KDD, pp. 80–86 (1998)
13. Nijssen, S., Guns, T., De Raedt, L.: Correlated itemset mining in ROC space: a
constraint programming approach. In: KDD, pp. 647–656. ACM, New York (2009)
14. Quinlan, J.R.: Learning logical definitions from relations. Machine Learning 5,
239–266 (1990)
15. Rückert, U., De Raedt, L.: An experimental evaluation of simplicity in rule learning.
Artif. Intell. 172(1), 19–28 (2008)
16. Siebes, A., Vreeken, J., van Leeuwen, M.: Item sets that compress. In: Ghosh,
J., Lambert, D., Skillicorn, D.B., Srivastava, J. (eds.) SDM, pp. 395–406. SIAM,
Philadelphia (2006)
Asking Generalized Queries with Minimum Cost

Jun Du and Charles X. Ling

Department of Computer Science,


The University of Western Ontario, London, Ontario, N6A 5B7, Canada
jdu42@csd.uwo.ca, cling@csd.uwo.ca

Abstract. Previous works on active learning usually only ask specific
queries. A more natural way is to ask generalized queries with don’t-
care features. As each such generalized query can often represent a
set of specific ones, the answers are usually more helpful in speeding up
the learning process. However, despite such advantages of general-
ized queries, more expertise (or effort) is usually required for the oracle
to provide accurate answers in real-world situations. Therefore, in this
paper, we make a more realistic assumption that, the more general a
query is, the higher querying cost it causes. This consequently yields a
trade-off that, asking generalized queries can speed up the leaning, but
usually with high cost; whereas, asking specific queries is much cheaper
(with low cost), but the learning process might be slowed down. To re-
solve this issue, we propose two novel active learning algorithms for two
scenarios: one to balance the predictive accuracy and the querying cost;
and the other to minimize the total cost of misclassification and query-
ing. We demonstrate that our new methods can significantly outperform
the existing active learning algorithms in both of these two scenarios.

1 Introduction
Active learning, as an effective learning paradigm to reduce the labeling cost in
supervised settings, has been intensively studied in recent years. In most tradi-
tional active learning studies, the learner usually regards the specific examples
directly as queries, and requests the corresponding labels from the oracle. For
instance, given a diabetes patient dataset, the learner usually presents the en-
tire patient example, such as [ID = 7354288, name = John, age = 65, gender =
male, weight = 230, blood−type = AB, blood−pressure = 160/90, temperature
= 98, · · · ] (with all the features), to the oracle, and requests the corresponding
label whether this patient has diabetes or not. However, in this case, many fea-
tures (such as ID, name, blood-type, and so on) might be irrelevant to diabetes
diagnosis. Not only could queries like this confuse the oracle, but each answer
returned by the oracle is also applicable to only one specific example.
In many real-world active learning applications, the oracles are often human
experts, thus they are usually capable of answering more general queries. For
instance, given the same diabetes patient dataset, the learner could ask a gen-
eralized query, such as “are men over age 60, weighted between 220 and 240
pounds, likely to have diabetes?”, where only three relevant features (gender,


age and weight) are provided. Such generalized query can often represent a set
of specific examples, thus the answer for the query is also applicable to all these
examples. For instance, the answer to the above generalized query is applicable
for all men over age 60 and weighted between 220 and 240 pounds. This allows
the active learner to improve learning more effectively and efficiently.
However, although the oracles are indeed capable of answering such general-
ized queries in many applications, the cost (effort) is often higher. For instance,
it is relatively easy (i.e., with low cost) to diagnose whether one specific patient
has diabetes or not, with all necessary information provided. However, it is of-
ten more difficult (i.e., with higher cost) to provide accurate diabetes diagnoses
(accurate probability) for all men over age 60 and weighted between 220 and
240 pounds. In real-world situation, more domain expertise is usually required
for the oracles to answer such generalized queries well, thus the cost for asking
generalized queries is often more expensive. Consequently, it yields a trade-off
in active learning: on one hand, asking generalized queries can speed up the
learning, but usually with high cost; on the other hand, asking specific queries is
much cheaper (with low cost), but the learning process might be slowed down.
In this paper, we apply a cost-sensitive framework to study generalized queries
in active learning. More specifically, we assume that the querying cost is known
to be non-uniform, and ask generalized queries in the following two scenarios:

– Scenario 1 (Balancing Acc./Cost Trade-off): We consider only query-
ing cost in this scenario. Thus, instead of tending to achieve high predictive
accuracy by asking as few as possible queries (as in traditional active learn-
ing), the learning algorithm is required to achieve high predictive accuracy
by paying as low as possible querying cost.
– Scenario 2 (Minimizing Total Cost): In addition to querying cost, we
also consider misclassification cost produced by the learning model in this
scenario.1 Thus, the learning algorithm is required to achieve minimum total
cost of querying and misclassification in the learning process.

In particular, we propose a novel method to first construct generalized queries
according to two objective functions in the above two scenarios, and then up-
date the training data and the learning model accordingly. Empirical study in a
variety of settings shows that, the proposed methods can indeed outperform the
existing active learning algorithms in simultaneously maximizing the predictive
performance and minimizing the querying cost.

2 Related Work

All of the active learning studies make assumptions. Specifically, most of the
previous works assume that the oracles can only answer specific queries, and the
costs for asking these queries are uniform. Thus, most active learning algorithms
1
In this paper, we only consider that both the querying cost and the misclassification
cost are on the same scale. Extra normalization might be required otherwise.

Table 1. Assumptions in active learning studies

                  Specific Queries    Generalized Queries
Uniform Cost      [7,11,12,2,3,9]     [4]
Non-uniform Cost  [8,6,10]            This Paper

(such as [7,11,12,2,3,9]) are designed to achieve as high as possible predictive
accuracy by asking a certain number of queries.
[4] relaxes the assumption of asking specific queries, and proposes active learn-
ing with generalized queries. However, it assumes that the oracles can answer
these generalized queries as easily as the specific ones. That is, the costs for
asking all the queries are still the same, regardless of the queries being specific
or generalized.
[8,6,10] relax the assumption of uniform cost, and study active learning in
cost-sensitive framework. However, they limit their research in specific queries,
and only consider that the costs for asking those specific ones are different.
In this paper, we study generalized queries with cost in active learning. Specif-
ically, we assume that the oracles can answer both specific and generalized
queries, but with different cost. This assumption is more flexible, more gen-
eral, and more applicable to the real-world applications. Under this assumption,
considering uniform cost for generalized queries (such as [4]) and considering
non-uniform costs for specific queries (such as [8,6,10]) can both be regarded
as special cases. Table 1 illustrates the different assumptions in active learning
studies. As far as we know, this is the first time to propose this more general
assumption and design corresponding learning algorithms for active learning.

3 Algorithm for Asking Generalized Queries


In this section, we design active learning algorithm to ask generalized queries.
Roughly speaking, the active learning process can be broken into the following
two steps in each learning iteration:

– Step 1: Based on the current training and unlabeled datasets, the learner
constructs a generalized query according to certain objective function.
– Step 2: After obtaining the answer of the generalized query, the learner
updates the training dataset, and updates the learning model accordingly.

We will discuss each step in detail in the following subsections.

3.1 Constructing Generalized Queries


In each learning iteration, constructing the generalized queries can be regarded as
searching the optimal query in the query space, according to the given objective
function. We propose two objective functions for the previous two scenarios, and
design an efficient searching strategy to reduce the computation complexity.

Balancing Acc./Cost Trade-off. In Scenario 1, we only consider querying
cost, and still use accuracy to measure the predictive performance of the learning
model, thus the learning algorithm is required to balance the trade-off between
the predictive accuracy and the querying cost. We therefore design an objective
function to choose query that yields maximum ratio of accuracy improvement
to querying cost in each iteration.
More formally, Equation 1 shows the objective function for searching query in
iteration t, where q t denotes the optimal query, Qt denotes the entire query space,
CQ (q) denotes the querying cost for the current candidate query q, ΔAcct (q)
denotes the accuracy improvement produced by q, which can also be presented
by subtracting the accuracy in iteration t − 1 (denoted by Acct−1 ) from the
accuracy in iteration t (denoted by Acct (q)).2

q^t = arg max_{q∈Q^t} ΔAcc^t(q) / C_Q(q) = arg max_{q∈Q^t} (Acc^t(q) − Acc^{t−1}) / C_Q(q)   (1)

We can see from Equation 1 that, estimating ΔAcct (q)/CQ (q) is required to
evaluate the candidate query q. As we assume that the querying cost CQ (q) is
known, we only need separately estimate the accuracies before and after asking
q (i.e., Acct−1 and Acct (q)).
Estimating Acct−1 is rather easy. We simply apply cross-validation or leave-
one-out to the current training data, and obtain the desired average accuracy.
However, estimating Acct (q) is a bit difficult. Note that, if we know the answer
of q, the training data could be updated by using exactly the same strategy
we will describe in Section 3.2 (Updataing Learning Model), and Acct (q) thus
could be easily estimated on the updated training data. However, the answer of
q is still unknown in the current stage, thus here, we apply a simple strategy to
optimistically estimate this answer, and then evaluate q accordingly.
Specifically, we first assume that the label of q is certainly 1.3 Thus, we update
the training data (using the same method as in Section 3.2), and estimate Acct (q)
accordingly. Then, we assume that the label of q is certainly 0, and again update
the training data and estimate Acct (q) in the same way. We compare these two
estimates of Acct (q), and optimistically choose the better (higher) one as the
final estimate.
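
For concreteness, a minimal sketch of this selection rule is shown below; estimate_accuracy is a placeholder for retraining on the training data augmented with the candidate query under an assumed label and estimating accuracy (e.g., by leave-one-out), and the toy estimator at the end exists only to make the snippet runnable. Neither is part of the proposed system; all names are ours.

def score_query(query, cost, train_data, estimate_accuracy, prev_accuracy):
    """Objective of Equation 1: ratio of (optimistically estimated) accuracy
    improvement to querying cost.  estimate_accuracy(train_data, query, label)
    is assumed to retrain with the query labelled `label` and return an
    accuracy estimate; here it is just a placeholder."""
    acc_if_pos = estimate_accuracy(train_data, query, 1)   # assume label 1
    acc_if_neg = estimate_accuracy(train_data, query, 0)   # assume label 0
    optimistic_acc = max(acc_if_pos, acc_if_neg)           # optimistic choice
    return (optimistic_acc - prev_accuracy) / cost

def select_query(candidates, train_data, estimate_accuracy, prev_accuracy):
    """Pick the (query, cost) pair that maximises Equation 1."""
    return max(candidates,
               key=lambda qc: score_query(qc[0], qc[1], train_data,
                                          estimate_accuracy, prev_accuracy))

# Toy illustration: a fake estimator that slightly favours more general queries.
toy = lambda data, q, label: 0.7 + 0.01 * sum(v is None for v in q)
cands = [(("male", 65, None), 1.5), (("male", None, None), 2.0)]  # (query, cost)
print(select_query(cands, None, toy, prev_accuracy=0.7))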

Minimizing Total Cost. In Scenario 2, we consider both the querying and
misclassification costs, and require the learning algorithm to achieve minimum
total cost in the learning process.
However, calculating this total cost of querying and misclassification is a bit
tricky. In real-world applications, the learning model constructed on the current
training data is often used for the future prediction, thus the “true” misclassifi-
cation cost should also be calculated according to the future predicted examples.
2
The accuracy improvement (ΔAcct (q)) can be negative, when the accuracy after
asking the query (Acct (q)) is even lower than the one before asking (Acct−1 ).
3
We only consider binary classification with labels 0 and 1 here, for better illustration.

We assume that the rough size of such “to-be-predicted” data is known in this pa-
per, due to the following reason. In reality, the quantity of such “to-be-predicted”
data directly affects the quantity of resource (effort, cost) that should be spent
in constructing the learning model. For instance, if the model would be used for
only few times and on only limited unimportant data, it might not be worth to
spend much resource on model construction; on the other hand, if the model is
expected to be extensively used on a large amount of important data, it would
be even more beneficial to improve the model performance by spending more
resource. In many such real-world situations, in order to determine how much
resource should be spent in constructing the model, it is indeed known (or could
be estimated) that how extensively the model would be used in the future (i.e.,
the rough quantity of the to-be-predicted data).
It is exactly the same case in our current scenario of generalized queries. More
specifically, if the current learning model will only “play a small role” (i.e., make
predictions on only few examples) in the future, it may not worth paying high
querying cost to construct a high-performance model. On the other hand, if a
large number of examples need to be predicted, it would be indeed worthwhile
to acquire more generalized queries (at the expense of high querying cost), such
that an accurate model with low misclassification cost could be constructed.
This indicates that, the number of “to-be-predicted” examples is crucial in
minimizing total cost. Therefore, we formalized the total cost after t iterations
(denoted by C_T^t) in Equation 2, where C_Q^i denotes the querying cost in the i-th
iteration, C_M^t denotes the misclassification cost after t iterations, which further
can be calculated as the product of the average misclassification cost⁴ after t
iterations (denoted by AvgC_M^t) and the number of future predicted examples
(denoted by n).

C_T^t = Σ_{i=1}^{t} C_Q^i + C_M^t = Σ_{i=1}^{t} C_Q^i + AvgC_M^t × n   (2)

To obtain the minimum total cost for the learning model, we greedily choose
the query that maximally reduces the total cost in each learning iteration.
More formally, Equation 3 shows the objective function for searching query in
iteration t, where all notations keep same as above.

q^t = arg max_{q∈Q^t} (C_T^{t−1} − C_T^t(q))
    = arg max_{q∈Q^t} ((AvgC_M^{t−1} − AvgC_M^t(q)) × n − C_Q(q))   (3)

In the current setting, we assume that C_Q^t and n are both known, thus we need to
estimate AvgC_M^{t−1} and AvgC_M^t(q) separately, according to Equation 3. We again
adopt a similar strategy as in the previous subsection. Specifically, AvgC_M^{t−1}
4
Average misclassification cost represents the misclassification cost averaged over the
tested examples.

could be directly estimated by cross-validation or leave-one-out on the original
training set, and AvgC_M^t(q) can be optimistically estimated by assuming the
label of q is certainly 0 and 1 respectively (see Section 3.1 for details).
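
For concreteness, the cost reduction of Equation 3 for a single candidate query can be written as the small helper below; the argument names are ours, and a query is only worth asking when the returned value is positive.

def total_cost_reduction(avg_mc_before, avg_mc_after, n_future, query_cost):
    """Equation 3 for one candidate query: the drop in expected
    misclassification cost over the n future examples minus the cost
    of asking the query itself."""
    return (avg_mc_before - avg_mc_after) * n_future - query_cost

print(total_cost_reduction(avg_mc_before=0.50, avg_mc_after=0.25,
                           n_future=200, query_cost=4.0))   # 46.0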

Searching Strategy. Given the above two objective functions for two scenarios,
the learner is required to search the query space and find the optimal one in each
iteration.
In most traditional active learning studies, each unlabeled example is directly
regarded as a candidate query. Thus, in each iteration, the query space simply
contains all the current unlabeled examples, and exhaustive search is usually ap-
plied directly. However, when asking generalized queries, each unlabeled example
can generate a set of candidate generalized queries, due to the existence of the
don’t-care features. For instance, given a specific example with d features, there
exist C(d, 1) generalized queries with one don’t-care feature, C(d, 2) generalized queries
with two don’t-care features, and so on. Thus, altogether 2^d corresponding gener-
alized queries could be constructed from each specific example. Therefore, given
an unlabeled dataset with l examples, the entire query space would be 2^d · l. This
query space is thus quite large (it grows exponentially with the feature dimension),
and it is unrealistic to exhaustively evaluate every candidate. Instead, we apply
greedy search to find the optimal query in each iteration.
Specifically, for each unlabeled example (with d features), we first construct
all the generalized queries with only one don’t-care feature (i.e., C(d, 1) = d queries),
and choose the best as the current candidate. Then, based only on this candidate,
we continue to construct all the generalized queries with two don’t-care features
(i.e., C(d − 1, 1) = d − 1 queries), and again only keep the best. The process repeats to
greedily increase the number of don’t-care features in the query, until no better
query can be generated. The last generalized query thus is regarded as the best
for the current unlabeled example. We conduct the same procedure on all the
unlabeled examples, thus we can find the optimal generalized query based on
the whole unlabeled set.
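
The greedy construction for one unlabeled example can be sketched as follows; evaluate stands in for the objective of Equation 1 or Equation 3, don’t-care features are encoded as None, and the toy objective at the end is only there to make the example runnable (names are ours).

def greedy_generalize(example, evaluate):
    """Greedy search of Section 3.1: starting from a specific example, at each
    step try making each remaining feature a don't-care (None), keep only the
    best such query, and stop when no extra don't-care feature improves the
    objective."""
    query = list(example)
    best_score = evaluate(query)
    while True:
        candidates = []
        for i, value in enumerate(query):
            if value is None:
                continue
            trial = list(query)
            trial[i] = None                    # one more don't-care feature
            candidates.append((evaluate(trial), trial))
        if not candidates:
            break
        score, best_trial = max(candidates, key=lambda c: c[0])
        if score <= best_score:
            break                              # no improvement: stop
        query, best_score = best_trial, score
    return tuple(query), best_score

# Toy objective: prefers dropping features other than the first one.
toy = lambda q: sum(1 for i, v in enumerate(q) if v is None and i > 0)
print(greedy_generalize(("male", 65, 230, "AB"), toy))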
With such a greedy search strategy, the computation complexity of searching is
thus O(d^2) with respect to the feature dimension d. This indicates an exponential
improvement over the complexity of the original exhaustive search Θ(2^d). Note
that, it is true that such local greedy search cannot guarantee finding the true
optimal generalized query in the entire query space, but the empirical study (see
Section 4) will show it still works effectively in most cases.

3.2 Updating Learning Model


After finding the optimal query in each iteration, the learner will request the
corresponding label from the oracle, and update the learning model accordingly.
However, the generalized queries often contain don’t-care features, and the labels
for such generalized queries are also likely to be uncertain. In this subsection,
we study how to update the learning model by appropriately handling such
don’t-care features and uncertain answers in the queries.

Roughly speaking, we consider the don’t-care features as missing values, and
handle the uncertain labels by taking partial examples in the learning process.
More specifically, we simply treat the generalized queries with don’t-care features
as specific ones with missing values. As many learning algorithms (such as decision
tree based algorithms, most generative models, and so on) have their own mech-
anisms to naturally handle missing values, this simple strategy can be widely ap-
plied. In terms of the uncertain labels of the queries, we handle them by taking par-
tial examples in the learning process. For instance, given a query with an uncertain
label (such as, 90% probability as 1 and 10% probability as 0), the learning algo-
rithm simply takes 0.9 part of the example as certainly 1 and 0.1 part as certainly
0. Taking partial examples into learning is often implemented by re-weighting ex-
amples, which is also applicable to many popular learning algorithms.
This simple strategy can elegantly update the learning model. However, the
strategy also has a pitfall. When updating the learning model, the current
strategy always regards one generalized query as only one specific example (with
missing values). This might significantly degrade the power of the generalized
queries. On the other hand, if one generalized query is regarded as too many
specific examples, it might also overwhelm the original training data. There-
fore, here we regard each generalized query as n identical examples (with missing
values), where n is suggested to be half of the initial training set size by the
empirical study.
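
A minimal sketch of this update step is given below; it only shows how an answered generalized query is turned into weighted training rows with missing values (encoded as None), and leaves the actual re-training to whatever base learner is used. The names and the dictionary layout are ours.

def query_to_training_rows(query, prob_positive, n_copies):
    """Turn an answered generalized query into weighted training examples:
    don't-care features stay as missing values, the uncertain label is split
    into a positive part weighted prob_positive and a negative part weighted
    1 - prob_positive, and the whole query counts as n_copies examples
    (roughly half the initial training-set size, as suggested above)."""
    rows = []
    if prob_positive > 0:
        rows.append({"features": query, "label": 1,
                     "weight": prob_positive * n_copies})
    if prob_positive < 1:
        rows.append({"features": query, "label": 0,
                     "weight": (1 - prob_positive) * n_copies})
    return rows

# Oracle answered "90% positive" for a query with two don't-care features.
print(query_to_training_rows(("male", 65, None, None), 0.9, n_copies=10))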
So far, we have proposed a novel method to construct the generalized query
and update the learning model in each active learning iteration. In particular,
we have designed two objective functions to balance the accuracy/cost trade-off
and minimize the total cost of misclassification and querying. In the following
section, we will conduct experiments on real-world datasets, to empirically study
the performance of the proposed algorithms.

4 Empirical Study
In this section, we empirically study the performance of the proposed algorithms
on 15 real-world datasets from the UCI Machine Learning Repository [1], and
compare them with the existing active learning algorithms.

4.1 Experimental Configurations


We compare the proposed algorithms with the traditional pool-based active
learning (with uncertainty sampling) [7] (denoted by “Pool”) and the active learn-
ing with generalized queries [4] (denoted by “AGQ”). “Pool” and “AGQ” repre-
sent two special cases for querying cost: “Pool” only asks specific queries (with
low querying cost), but cannot take advantage of the generalized queries to im-
prove the predictive performance; on the other hand, “AGQ” tends to ask as
general as possible queries to promptly improve the predictive performance, but
with the expense of high querying cost. We expect that the proposed algorithms
(for the two scenarios) can simultaneously maximize the predictive performance
and minimize the querying cost, thus outperforming “Pool” and “AGQ”.

Table 2. The 15 UCI datasets used in the experiments

Dataset Type of Att. No. of Att. Class Dist. Training Size


breast-cancer nom 9 196/81 1/5
breast-w num 9 458/241 1/10
colic nom/num 22 232/136 1/5
credit-a nom/num 15 307/383 1/20
credit-g nom/num 20 700/300 1/100
diabetes num 8 500/268 1/10
heart-statlog num 13 150/120 1/10
hepatitis nom/num 19 32/123 1/5
ionosphere num 33 126/225 1/20
kr-vs-kp nom 36 1669/1527 1/100
mushroom nom 22 4208/3916 1/200
sick nom 27 3541/231 1/200
sonar num 60 97/111 1/5
tic-tac-toe nom 9 332/626 1/10
vote nom 16 267/168 1/20

All of the 15 UCI datasets have binary class and no missing values. Infor-
mation on these datasets is tabulated in Table 2. Each whole dataset is first
split randomly into three disjoint subsets: the training set, the unlabeled set,
and the test set. The test set is always 25% of the whole dataset. To make sure
that active learning can possibly show improvement when the unlabeled data
are labeled and included into the training set, we choose a small training set
for each dataset such that the “maximum reduction” of the error rate5 is large
enough (greater than 10%). The training sizes of the 15 UCI datasets range from
1/200 to 1/5 of the whole datasets, also listed in Table 2. The unlabeled set is
the whole dataset taking away the test set and the training set.
In our experiments, we set the querying cost (CQ ) for any specific query as
1, and study the following three cost settings for generalized queries with r
don’t-care features (a small helper computing these settings is sketched after the list):

– CQ = 1 + 0.5 × r: This setting represents a linear growth of CQ with respect
to r. For instance, the cost of asking a generalized query with two don’t-care
features is (CQ = 1 + 0.5 × 2 = 2), which equals to the cost of asking two
specific ones.
– CQ = 1 + 0.05 × r: This setting also represents a linear growth of CQ with
respect to r. However, the cost of asking generalized queries is rather low
in this case. For instance, the cost of asking a generalized query with 20
don’t-care features equals to the cost of asking two specific ones.
– CQ = 1 + 0.5 × r^2: This setting represents a non-linear growth of CQ with
respect to r. In addition, the cost of asking generalized queries is higher in
this case. For instance, the cost of asking a generalized query with only two
don’t-care features equals to the cost of asking three specific ones.
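
For reference, the three settings can be computed by the small helper below; the setting names are ours and serve only this example.

def querying_cost(r, setting):
    """Querying cost as a function of the number of don't-care features r
    (specific queries have r = 0), for the three experimental settings."""
    if setting == "linear":
        return 1 + 0.5 * r
    if setting == "cheap-linear":
        return 1 + 0.05 * r
    if setting == "quadratic":
        return 1 + 0.5 * r ** 2
    raise ValueError(setting)

print([querying_cost(2, s) for s in ("linear", "cheap-linear", "quadratic")])
# [2.0, 1.1, 3.0]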

5
The “maximum reduction” of the error rate is the error rate on the initial training set
R alone (without any benefit of the unlabeled examples) subtracting the error rate
on R plus all the unlabeled data in U with correct labels. The “maximum reduction”
roughly reflects the upper bound on error reduction that active learning can achieve.

Note that these settings of querying cost are used here only for empirical
study; any other type of querying cost could easily be applied without changing
the algorithms.
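As an illustration, a minimal sketch of the three settings is given below; the function name and the setting labels are ours and not part of the original algorithms.

def querying_cost(r, setting="linear"):
    # r: number of don't-care features in a generalized query (r = 0 is a specific query).
    if setting == "linear":        # CQ = 1 + 0.5 * r
        return 1 + 0.5 * r
    if setting == "cheap_linear":  # CQ = 1 + 0.05 * r
        return 1 + 0.05 * r
    if setting == "quadratic":     # CQ = 1 + 0.5 * r^2
        return 1 + 0.5 * r ** 2
    raise ValueError("unknown setting: %s" % setting)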
Since we have neither true target functions nor human oracles to answer the
generalized queries for the 15 UCI datasets, we simulate the target functions by
constructing learning models on the entire datasets in the experiments. The
simulated target function regards each generalized query as a specific example
with missing values, and provides the posterior class probability as the answer
to the learner. The experiment is repeated 10 times on each dataset (i.e., each
dataset is randomly split 10 times), and the experimental results are recorded.
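A rough sketch of such a simulated oracle is shown below. It assumes that the answer to a generalized query is obtained by averaging the target model's posterior over the dataset rows that match the specified feature values (one plausible way of treating don't-care features as missing values); the class and method names are ours, not the authors'.

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

class SimulatedOracle:
    def __init__(self, X_full, y_full):
        # The simulated target function is a model built on the entire dataset.
        self.X_full = np.asarray(X_full)
        self.model = BaggingClassifier(DecisionTreeClassifier()).fit(self.X_full, y_full)

    def answer(self, query):
        # query: {feature_index: value} for the specified features only;
        # don't-care features are simply omitted from the dictionary.
        mask = np.ones(len(self.X_full), dtype=bool)
        for j, value in query.items():
            mask &= (self.X_full[:, j] == value)
        rows = self.X_full[mask] if mask.any() else self.X_full
        # Return the averaged posterior class probability as the probabilistic answer.
        return self.model.predict_proba(rows).mean(axis=0)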

4.2 Results for Balancing Acc./Cost Trade-Off


In Scenario 1, we use accuracy to measure the performance of the learning model.
Thus, we use an ensemble of bagged decision trees (implemented in Weka [5])
as the learning algorithm in the experiment; any other learning algorithm could
also be used in real-world applications.
Figure 1 demonstrates the performance of the proposed algorithm considering
only querying cost (denoted by “AGQ-QC”; see Section 3.1), compared with
“Pool” and “AGQ” on a typical UCI dataset “breast-cancer”. We can see from
the subfigures of Figure 1 that, with all the three querying cost settings, “AGQ-
QC” can always effectively increase the predictive accuracy of the learning model
with low querying cost, and outperform “Pool” and “AGQ”. More specifically,
in the case that (CQ = 1 + 0.5 × r), “AGQ-QC” significantly outperforms both
“Pool” and “AGQ” during the entire learning process. In the case that (CQ =
1 + 0.05 × r), although “AGQ-QC” still outperforms the other two algorithms,
it performs similarly to “AGQ”. As the cost of asking generalized queries is
rather low in this case, “AGQ-QC” tends to discover as many don't-care
features as possible in the queries, thus producing predictive performance similar to
“AGQ”. In the case that (CQ = 1 + 0.5 × r²), “AGQ-QC” still significantly
outperforms the other algorithms. Note that, in this case, the cost of asking
generalized queries is relatively high (i.e., grows quadratically with the number
of don't-care features), thus “AGQ” tends to discover as few don't-care features
as possible, and consequently behaves similarly to “Pool”.
Fig. 1. Comparison between “AGQ-QC”, “AGQ” and “Pool” on a typical UCI dataset
“breast-cancer”, for balancing acc./cost trade-off
Table 3. Summary of the t-test for balancing acc./cost trade-off

                 AGQ-QC
        CQ = 1 + 0.5 × r   CQ = 1 + 0.05 × r   CQ = 1 + 0.5 × r²
Pool    6/7/2              10/4/1              5/6/4
AGQ     14/0/1             6/7/2               15/0/0

To quantitatively compare the learning curves, we measure the actual values
of the accuracies at 10 equally spaced points on the x-axis. The 10 accuracies of
one curve are compared with the 10 accuracies of another using the two-tailed,
paired t-test with 95% confidence level. The t-test results on all the 15 UCI
datasets with all three querying cost settings are summarized in Table 3.
Each entry in the table, w/t/l, means that the algorithm in the corresponding
column wins on w, ties on t, and loses on l datasets, compared with the algorithm
in the corresponding row. We observe similar phenomena in Table 3:
“AGQ-QC” significantly outperforms “AGQ” when the querying cost is
relatively high (CQ = 1 + 0.5 × r² and CQ = 1 + 0.5 × r), and significantly
outperforms “Pool” when the querying cost is relatively low (CQ = 1 + 0.05 × r).
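A minimal sketch of this curve comparison is shown below, using scipy's paired two-tailed t-test; curve_a and curve_b hold the accuracies sampled at the 10 points, and the win/tie decision rule here is our simplification rather than the authors' exact procedure.

from scipy.stats import ttest_rel

def compare_curves(curve_a, curve_b, alpha=0.05):
    # Paired, two-tailed t-test on accuracies measured at 10 equally spaced points.
    t_stat, p_value = ttest_rel(curve_a, curve_b)
    if p_value >= alpha:
        return "tie"
    return "a wins" if sum(curve_a) > sum(curve_b) else "b wins"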

4.3 Results for Minimizing Total Cost

In Scenario 2, we use total cost to measure the performance of the learning
model. Thus, we use a cost-sensitive algorithm (CostSensitiveClassifier) based on
an ensemble of bagged decision trees (implemented in Weka [5]) as the learning
algorithm in the experiments. In addition, we set the false negative (FN) and
false positive (FP) costs to 2 and 10 respectively, and we set the number of
future predicted examples to 1000. Still, any other settings can be easily applied
without changing the algorithm.
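For concreteness, a hedged sketch of the total-cost quantity used in this setting might look as follows; the exact bookkeeping in the authors' implementation may differ, and the function name is ours.

def total_cost(fp_rate, fn_rate, querying_cost_spent,
               fp_cost=10.0, fn_cost=2.0, n_future=1000):
    # Total cost = expected misclassification cost on the future predicted examples
    # plus the querying cost already spent on labeling.
    expected_misclassification = n_future * (fp_rate * fp_cost + fn_rate * fn_cost)
    return expected_misclassification + querying_cost_spent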
Figure 2 demonstrates the performance of the proposed algorithm consider-
ing total cost (denoted by “AGQ-TC”), compared with “Pool” and “AGQ” on
the same UCI dataset “breast-cancer”. We can see from Figure 2 that “AGQ-
TC” effectively decreases the total cost of the learning model, and significantly
outperforms “Pool” and “AGQ” with most querying cost settings. More specif-
ically, we observe a similar pattern between “AGQ-TC” and “AGQ” as
in the previous subsection: when the querying cost is relatively low (such as
CQ = 1 + 0.05 × r), “AGQ-TC” and “AGQ” tend to perform similarly; when the
querying cost is relatively high (such as CQ = 1 + 0.5 × r²), “AGQ-TC” often
significantly outperforms “AGQ”.
The t-test results on the 15 UCI datasets are summarized in Table 4. It
clearly shows that, “AGQ-TC” performs significantly better than “AGQ” on
most (or even all) tested datasets, when the querying cost is relatively high
(CQ = 1 + 0.5 × r² and CQ = 1 + 0.5 × r). When compared with “Pool”, “AGQ-
TC” still wins (or at least ties) on a majority of tested datasets, especially when
the querying cost is relatively low (CQ = 1 + 0.05 × r). These experimental results
clearly indicate that “AGQ-TC” can indeed significantly decrease the total cost
and outperform “AGQ” and “Pool”.
Fig. 2. Comparison between “AGQ-TC”, “AGQ” and “Pool” on a typical UCI dataset
“breast-cancer”, for minimizing total cost

Table 4. Summary of the t-test for minimizing total cost

                 AGQ-TC
        CQ = 1 + 0.5 × r   CQ = 1 + 0.05 × r   CQ = 1 + 0.5 × r²
Pool    6/7/2              10/4/1              6/6/3
AGQ     15/0/0             6/6/3               15/0/0

4.4 Approximate Probabilistic Answers


In the previous experiments, we have assumed that the oracle is always capable
of providing accurate probabilistic answers for the generalized queries. However,
in real-world situations, it is more common that only “approximate probabilistic
answers” are provided (especially when the oracles are human experts). We spec-
ulate that small perturbations in the probabilistic answers will not dramatically
affect the performance of the proposed algorithms. This is because small pertur-
bations in the label probabilities only represent light noise, which could
be cancelled out in the successive updates of the training set; a robust base
learning algorithm (such as bagged decision trees) would be insensitive
to such small noise. In this subsection, we study this issue experimentally.
To simulate the approximate probabilistic answer, we first calculate the exact
accurate probabilistic answer from the target model, and then randomly alter
it with up to 20% noise. Figure 3 demonstrates the performance of the proposed
algorithms with such approximate probabilistic labels (denoted by “AGQ-QC
(appr)” and “AGQ-TC (appr)”), compared with “AGQ-QC” and “AGQ-TC”,
with the setting CQ = 1 + 0.5 × r on the typical dataset (“breast-cancer”).
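A small sketch of how such an approximate answer can be simulated is given below; whether the noise is applied as a relative or an absolute perturbation is our assumption.

import random

def approximate_answer(exact_probability, max_noise=0.2):
    # Perturb the exact probabilistic answer by up to +/-20% and clip to [0, 1].
    noisy = exact_probability * (1.0 + random.uniform(-max_noise, max_noise))
    return min(max(noisy, 0.0), 1.0)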
(a) Comparison between “AGQ-QC” and “AGQ-QC (appr)” (with up to 20% noise).
(b) Comparison between “AGQ-TC” and “AGQ-TC (appr)” (with up to 20% noise).

Fig. 3. Experimental results with approximate probabilistic answer on “breast-cancer”



We can clearly see from these figures that, when only approximate prob-
abilistic answers are provided by the oracle, the performance of the proposed
algorithms is not significantly affected. Similar experimental results are
obtained with other settings and on other datasets. This indicates that the
proposed algorithms are rather robust to such more realistic approximate
probabilistic answers, and thus can be directly deployed in real-world applications.

5 Conclusion
In this paper, we assume that the oracles are capable of answering general-
ized queries with non-uniform costs, and study active learning with generalized
queries in a cost-sensitive framework. In particular, we design two objective func-
tions to choose generalized queries in the learning process, so as to either balance
the accuracy/cost trade-off or minimize the total cost of misclassification and
querying. The empirical study verifies the superiority of the proposed methods
over the existing active learning algorithms.

References
1. Asuncion, A., Newman, D.J.: UCI machine learning repository (2007)
2. Baram, Y., El-Yaniv, R., Luz, K.: Online choice of active learning algorithms.
Journal of Machine Learning Research 5, 255–291 (2004)
3. Cohn, D.A., Ghahramani, Z., Jordan, M.I.: Active learning with statistical models.
Journal of Artificial Intelligence Research 4, 129–145 (1996)
4. Du, J., Ling, C.X.: Active learning with generalized queries. In: Proceedings of the
9th IEEE International Conference on Data Mining, pp. 120–128 (2009)
5. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The
weka data mining software: an update. SIGKDD Explorations 11(1), 10–18 (2009)
6. Kapoor, A., Horvitz, E., Basu, S.: Selective supervision: Guiding supervised learn-
ing with decision-theoretic active learning. In: Proceedings of International Joint
Conference on Artificial Intelligence (IJCAI), pp. 877–882 (2007)
7. Lewis, D.D., Catlett, J.: Heterogeneous uncertainty sampling for supervised learn-
ing. In: Proceedings of ICML 1994, 11th International Conference on Machine
Learning, pp. 148–156 (1994)
8. Margineantu, D.D.: Active cost-sensitive learning. In: Nineteenth International
Joint Conference on Artificial Intelligence (2005)
9. Roy, N., Mccallum, A.: Toward optimal active learning through sampling estima-
tion of error reduction. In: Proc. 18th International Conf. on Machine Learning,
pp. 441–448 (2001)
10. Settles, B., Craven, M., Friedland, L.: Active learning with real annotation costs.
In: Proceedings of the NIPS Workshop on Cost-Sensitive Learning (2008)
11. Seung, H.S., Opper, M., Sompolinsky, H.: Query by committee. In: Proceedings
of the Fifth Annual Workshop on Computational Learning Theory, pp. 287–294
(1992)
12. Tong, S., Koller, D.: Support vector machine active learning with applications to
text classification. Journal of Machine Learning Research 2, 45–66 (2002)
Ranking Individuals and Groups by Influence
Propagation

Pei Li¹, Jeffrey Xu Yu², Hongyan Liu³, Jun He¹, and Xiaoyong Du¹
¹ Renmin University of China, Beijing, China. {lp,hejun,duyong}@ruc.edu.cn
² The Chinese University of Hong Kong, Hong Kong, China. yu@se.cuhk.edu.hk
³ Tsinghua University, Beijing, China. hyliu@tsinghua.edu.cn

Abstract. Ranking the centrality of a node within a graph is a fun-
damental problem in network analysis. Traditional centrality measures
based on degree, betweenness, or closeness fail to capture the structural
context of a node, which is caught by eigenvector centrality (EVC) mea-
sures. As a variant of EVC, PageRank is effective to model and measure
the importance of web pages in the web graph, but it is problematic to
apply it to other link-based ranking problems. In this paper, we propose
a new influence propagation model to describe the propagation of pre-
defined importance over individual nodes and groups accompanied with
random walk paths, and we propose a new IPRank algorithm for ranking
both individuals and groups. We also allow users to define specific decay
functions that provide flexibility to measure link-based centrality on dif-
ferent kinds of networks. We conducted testing using synthetic and real
datasets, and experimental results show the effectiveness of our method.

1 Introduction
Ranking the centrality (or importance) of nodes within a graph is a fundamental
problem in network analysis. Recently, the online social networking sites, such as
Facebook and MySpace, provide users with a platform to make people connected.
Learning and mining on these large-scale social networks attract attentions of
many researchers in the literature [1]. In retrospect, Freeman [2] reviewed and
evaluated the methods about centrality measures, and categorized them into
three conceptual foundations: degree, betweenness, and closeness. Accompanied
with eigenvector centrality (EVC) proposed by Bonacich [3], these four measures
dominate the empirical usage. The first three methods measure the centrality by
simply calculating the edge degree or the mean or fraction of geodesic paths [4],
and treat every node equally. In this paper, we focus on EVC, which ranks the
centrality of a node v by considering the centrality of nodes that surround v.
In the literature, most of link analysis approaches focus on the link structures
and ignore the intrinsic characteristics of nodes over a graph. However, in many
networks, nodes also contain important information, such as the page content in
a web graph. Simply overlooking these predefined importance may facilitate the

Table 1. Notations

G(V, E) A directed and weighted graph with group feature


I(i) The set of in-coming neighbors of node i
O(i) The set of out-going neighbors of node i
w(i, j) The weight of edge (i, j)
T The transition matrix of graph G
|X| The size of set X
‖X‖ The sum of vector X or matrix X
Z Predefined importance (or initial influence) of all nodes
Zi (a) The influence received by node a on the i-th iteration/step
K The maximum iterations/steps in IP model
R The final ranking vector of nodes
Rj (i) The ranking of node i on the j-th iteration/step
GR The final ranking vector of groups

usage of link spam. We believe the intrinsic characteristics of nodes also affect
link-based ranking significantly.
The main contributions of this work are summarized below. First, we discuss
the problems with the current EVC approaches, for example, PageRank, which
ignores the intrinsic impacts of nodes on the ranking. Second, we propose a new
Influence Propagation model, called IP model, which propagates the user-defined
importance over nodes in a graph by random walking. We allow users to specify
decay functions to control how the influence propagates over nodes. It is worth
noting that most EVC approaches use only an exponential function, which is
not appropriate in many cases, as we will address later. Third, we give algorithms
to rank an individual node and all nodes in a graph efficiently. Fourth, we discuss
how to rank a group (a set of nodes) regarding the centrality using both inner
and outer structural information.
The remainder of the paper is organized as follows. Section 2 gives the mo-
tivation of our work. Section 3 discusses our new influence propagation model,
and ranking algorithms for individual nodes and groups. We conducted exten-
sive performance studies and report our findings in Section 4. The related work
is given in Section 5 and we conclude in Section 6. The notations used in this
paper are summarized in Table 1.

2 The Motivation
In this section, first, we discuss our motivation to propose a new influence model,
and explain why PageRank is not applicable in some cases. Second, we give our
intuitions on how to rank the centrality for a set of nodes.

Why Not PageRank: As a typical variant of EVC [3], PageRank [5] models the
behavior of a random surfer, who clicks some hyperlink in the current page with
probability c, and periodically jumps to a random page because “gets bored”
with probability (1 − c). Let T be a transition matrix for a directed graph.
For the p-th row and q-th column element of T, Tp,q = 0 if (p, q) ∉ E, and
Initial Importance R0          Normalized IPRank Scores (%)
[0.2, 0.2, 0.2, 0.2, 0.2]      [14.9, 10.0, 33.4, 24.3, 17.4]
[0.3, 1.0, 0.2, 0.2, 0.8]      [15.5, 14.2, 30.5, 21.2, 18.6]
[0.8, 1.0, 0.2, 0.2, 0.8]      [17.8, 13.8, 30.4, 20.5, 17.5]
PageRank scores: [0.149, 0.100, 0.334, 0.243, 0.174]

Fig. 1. (a) A simple directed network (over nodes a, b, c, d, e) in which every node has a predefined impor-
tance. (b) PageRank scores and normalized IPRank scores corresponding to different
predefined importance Z. The decay function is set to f(k) = 0.8^k.

Tp,q = w(p, q) / Σi∈O(p) w(p, i) otherwise, where w(p, q) is the weight of edge
(p, q). The matrix form of PageRank can be written below.
R = cRT + (1 − c)U (1)
Here, U corresponds to the distribution vector of web pages that a random
surfer periodically jumps to, and ‖U‖ = 1 holds. Based on Eq. (1), PageRank
scores can be iteratively computed by Rk = cRk−1T + (1 − c)U. The solution R
is a steady probability distribution with ‖R‖ = 1, decided by T and U only.
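For reference, the iterative computation of Eq. (1) can be sketched as a plain power iteration over the row vector R; the function and its defaults (e.g., c = 0.85) are our illustration, not the authors' code.

import numpy as np

def pagerank(T, U, c=0.85, tol=1e-8, max_iter=100):
    # Iterates R_k = c * R_{k-1} T + (1 - c) U, with T row-stochastic and sum(U) = 1.
    U = np.asarray(U, dtype=float)
    R = U.copy()
    for _ in range(max_iter):
        R_next = c * (R @ T) + (1 - c) * U
        if np.abs(R_next - R).sum() < tol:
            return R_next
        R = R_next
    return R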
It is important to note that the initial importance R0 of all nodes in PageRank
is ignored (refer to Eq. (1)). In other words, R0 is not propagated in PageRank.
As shown in Fig. 1, for the graph in Fig. 1(a), the PageRank scores for
a, b, c, d, and e are 0.149, 0.100, 0.334, 0.243, and 0.174, respectively, regardless
of any given initial importance R0. However, in many real applications, the initial
importance R0 plays a significant role and greatly influences the resulting R.
In addition, simply applying the PageRank to measure centrality in general
may result in unexpected results, because PageRank is originally designed to
bring order to the web graph. For example, to model the “word-of-mouth” effect
in social networks [6] where people are likely to be influenced by their friends,
the behavior of “random jumping” used in PageRank is not reasonable, since
the influence only occurs between two directly connected persons.
Motivated by these issues of propagating the initial predefined importance of
nodes and of random jumping, we claim that PageRank is not applicable to link-based
ranking in all possible cases. In this paper, we propose a more general and customizable
model for link-based ranking. We propose a new Influence Propagation (IP)
model and IPRank to rank nodes and groups, based on their structural contexts
in the graph and predefined importance.

Group Ranking: In this paper, a group is a set of nodes in a graph. We


categorize group centrality measures into two types. The first type exploits the
inner information of a group. Two simple approaches to rank a group are either
to sum or to average the centrality scores of nodes in group. However, summing
is obviously problematic because larger groups tend to obtain higher scores.
Averaging is unacceptable in some cases where a group with only one high-
score node beats another group with a large number of nodes. The second type
employs the information outside a group. [7] analyzed this problem and proposed
a measure based on the number of nodes outside a group that are connected to
(a) Group 1 (b) Group 2 (c) Group 3

Fig. 2. Three groups with the same degrees connected to outside nodes. ((a) and (b)
are altered from Fig. 4.2.1 in [7].)

members of this group. More explicitly, let C be a group, N (C) be the set of all
nodes that are not in C, but are neighbors of a member in C. [7] normalizes and
computes group degree centrality as |N(C)| / (|V| − |C|), where |V| is the number of nodes in
the graph. Clearly, this method measures group centrality from the view of nodes
outside this group. However, given two large groups A and B where |A| > |B|,
we find that |N(A)| > |N(B)| is more likely to hold while |V| − |A| < |V| − |B| holds,
making it easier for larger groups to obtain a higher degree centrality. Moreover, this
method ignores the centrality scores of nodes in groups.
In this work, we investigate how to combine the inner and outer structural
context of a specific group. Some intuitions are given below. Consider Fig. 2.
First, regarding the outer structural context, Group 2 should have a higher
score than Group 1, because Group 2 has a larger span of neighbors. This
intuition is drawn from real-world networks such as friendship networks, where
a group with more contacts outside the group has a higher ranking. Second,
regarding the inner structure of a group, both Group 2 and Group 3 have the
same outside neighbors, but the inner structure of Group 3 is more compact and
cohesive, so Group 3 should have a higher score than Group 2.

3 Ranking Nodes and Groups


In our influence propagation model, we consider every node has its own in-
fluence that needs to be propagated. This influence represents the predefined
importance of a node, such as content, status, or preference. We consider a di-
rected edge-weighted graph G(V, E), where V and E are the sets of nodes and
edges respectively. Every node in V has attributes to describe its properties, and
the attributes of a node can be used to indicate which groups the node belongs
to. We use MA (a) to denote the belonging of the node a to a group A, and call
it membership degree. Let Z be a vector to represent the predefined importance
of nodes in G based on the attributes, and every element in Z is non-negative.
The influence propagates following a random walk [8]. Like the existing work
[9,6,10,11], in our approach influence propagation is a process in which the in-
coming influence received by a node a from its in-neighbors at time t
propagates to the out-neighbors of a, weighted by the transition probability and
decayed, at the next time step (t + 1). Regarding decay, we introduce a discrete
decay function f (k) to describe the retained influence on the k-th step dur-
ing the propagation with decay, where k ∈ {1, 2, ..., K} and K is the maxi-
mum propagation steps. The most prevalent decay function used in PageRank
is f(k) = c^k where 0 < c < 1. Generally, f(k) is a non-increasing function
that satisfies f(k) < 1, and a smaller f(k) results in a smaller maximum number of steps K.
We allow users to configure f(k) into other forms, such as a linear function, to
adapt to different situations. To help assess the maximum propagation steps
K, a user needs to specify a threshold h that satisfies the following condition:

f (K) ≥ h and f (K + 1) < h (2)

which also defines the condition of convergence. According to the definition, we
show a proposition to describe the influence propagation on a random walk path
with cycles permitted, and define IPRank scores in Definition 1.

Proposition 3.1: For a random path p = ⟨v0, v1, ..., vk⟩ that starts at time 0,
the influence Z(0) propagating from v0 to vk is

Z(k) = Z(0) · f(k) · ∏i=0..k−1 Ti,i+1    (3)

Proof Sketch: Let us analyze the case of one-step propagation. For an edge
⟨vi, vj⟩, the influence Z(j) propagating from vi is (f(j)/f(i)) · Z(i) · Ti,j. Since path p
can be viewed as a sequence of one-step propagations, Eq. (3) holds. □
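A direct transcription of Eq. (3) for a single path could look like the sketch below, assuming T is a nested dict of transition probabilities, f is a Python callable, and the path is a list of node identifiers; the function name is ours.

def path_influence(path, Z0, T, f):
    # Proposition 3.1: Z(k) = Z(0) * f(k) * prod_{i=0}^{k-1} T[v_i][v_{i+1}]
    k = len(path) - 1
    transition_prob = 1.0
    for i in range(k):
        transition_prob *= T[path[i]][path[i + 1]]
    return Z0 * f(k) * transition_prob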

Definition 1. The IPRank score of a node in a graph is measured by the influ-


ence of this node and the influence propagated in from other nodes.
Like PageRank, the assumption behind IPRank is that the more influence a node
receives, the more important it is. However, our IPRank is more general than
mutual-reinforcement-based rankings. First, the initial importance Z of nodes
will be taken into consideration. Z is propagated in our method and influences
IPRank scores. Reconsider Fig. 1(a). We show that IPRank scores are different
corresponding to different Z as shown in Fig. 1(b). Second, we allow users to
specify a decay function.

3.1 IPRanking Nodes


The key to compute IPRank score R(v) of a node v is how we collect the influence
propagated in from other nodes. Note that after a propagation over k steps,
the influence will be so small that it can be ignored. Therefore, we only need
to collect random walk paths that reach the node v within k steps. A possible
method is by random walk backwards, where the random surfer walks reversely
along links starting from v and traverses nodes recursively. All nodes traversed
can be viewed as the starting points of such random paths, whose probability
can be assessed by Proposition 3.1.
Consider the node a in the graph G in Fig. 1(a) and suppose k = 1. Since we
reverse all edges and traverse b and e starting from the node a, two random walk
paths on G that reach the node a in one step are collected. We summarize the
recursive procedure IPRank-One in Algorithm 1, which computes IPRank score
of the given node a. Furthermore, supposing the average size of out-neighbors
Algorithm 1. IPRank-One(G, v, Z, T, K)
Input: graph G(V, E), node v, predefined importance Z, transition matrix T, and maximum step K
Output: IPRank score R(v)

1: initialize R(v) = Z(v);
2: PathRecursion(v, v, 1, 0);
3: return R(v);

4: Procedure PathRecursion(v, n, x, y)
5: y = y + 1;
6: for every node u in in-neighbor set of the node n in G do
7:    R(v) = R(v) + Z(u) · x · Tu,n · f(y);
8:    if y < K then
9:       PathRecursion(v, u, x · Tu,n, y);
10:   end if
11: end for

in graph G is d, Algorithm 1 needs to traverse Σi=1..k d^i nodes and thus collects
the same number of random walk paths. The time complexity of IPRank-One
is O(d^k), which is acceptable for querying IPRank scores of one or a few nodes.
But it is obviously inefficient when we need to compute IPRank scores of all
nodes in a graph. Based on our observations, random walk paths generated by
IPRank queries of different nodes contain shared segments, which can be
reused to save computational cost. For example, influence propagation along
paths ⟨a, b, a, c⟩ and ⟨a, b, a⟩ is computed in the IPRank queries for nodes c
and a respectively, but the two paths contain the same segment ⟨a, b, a⟩.
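A minimal Python sketch mirroring the recursive backward traversal of Algorithm 1 is shown below; it assumes in_neighbors maps each node to the list of its in-neighbors and T is a nested dict of transition probabilities, and it is an illustration rather than the authors' implementation.

def iprank_one(v, Z, T, in_neighbors, f, K):
    # Collect the influence propagated to v along all backward paths of length <= K.
    score = Z[v]

    def path_recursion(n, x, y):
        nonlocal score
        y += 1
        for u in in_neighbors[n]:
            score += Z[u] * x * T[u][n] * f(y)
            if y < K:
                path_recursion(u, x * T[u][n], y)

    path_recursion(v, 1.0, 0)
    return score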
We develop an algorithm to compute IPRank for all nodes in matrix form that
works as follows. We call it IPRank-All, which is motivated by our IP model,
where different nodes propagate their influence with different steps. The initial
influence of all nodes is stored in a row vector Z. In the first step, all nodes
propagate influence to their out-neighbors with decay factor f(1). Let us consider
the influence received by a node. Suppose the in-neighbor set of a node v is I(v);
the influence received by v is Z1(v) = f(1) · Σi∈I(v) Z(i) · Ti,v. Considering all nodes
such as v, we get Z1 = f(1) · ZT in matrix form. In the second step, according
to our IP model, all elements in Z1 propagate to their out-neighbors, and the
influence vector received on the second step is Z2 = f(2) · ZT². Analogously, the
influence vector received on the k-th step can be computed iteratively by

Zk = f(k) · ZT^k = (f(k)/f(k−1)) · Zk−1 T    (4)

Recalling Definition 1, the IPRank vector obtained within k steps is as follows:

Rk = Σi=1..k Zi + Z = Z + Σi=1..k f(i) · ZT^i = Z (1 + Σi=1..k f(i) · T^i)    (5)

Eq. (4) and Eq. (5) form the main computation of the IPRank-All algorithm. Letting
Xk = ZT^k, Zk can be computed iteratively by applying Zk = f(k) · Xk =
Algorithm 2. IPRank-All(G, Z, T , h)
Input: graph G(V, E), initial influence vector Z, transition matrix T ,
and threshold h
Output: IPRank scores R

1: initialize R = Z;
2: for every node v ∈ V do
3: obtain K according to Eq. (2);
4: RefineRecursion(v, Z(v), 0, K);
5: end for
6: return R;

7: Procedure RefineRecursion(v, x, y, K)
8: y = y + 1;
9: for every node u in out-neighbor set of node v do
10: R(u) = R(u) + x · Tv,u · f (y);
11: if y < K then
12: RefineRecursion(u, x · Tv,u , y, K);
13: end if
14: end for

f(k) · Xk−1 T. So the time complexity of the IPRank-All algorithm is O(KNd), where d
is the average in-degree and N is the graph size. When it is specific to the most
popular decay function f(k) = c^k, we get

Rk = Z (1 + Σi=1..k c^i · T^i) = c · Rk−1 T + Z    (6)

Hence, Rk can be computed iteratively if the decay function is exponential.
Eq. (6) implies a mutual reinforcement of importance like PageRank does. How-
ever, when f(k) is not exponential, we cannot compute Rk iteratively using
Eq. (6). In this case, the efficient way to obtain Rk is to compute all Zk itera-
tively by Eq. (4) and sum them up. The algorithm for IPRank-All is given in
Algorithm 2. While the IPRank in Eq. (4) and Eq. (5) propagates one step of
all nodes at a time, Algorithm 2 propagates all steps of one node.
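A compact numpy sketch of the matrix-form computation in Eq. (4) and Eq. (5) is shown below; it accumulates Zk until the decay drops below the threshold h of Eq. (2), assumes T is a row-stochastic numpy matrix, and is a simplification of the per-node traversal in Algorithm 2 rather than a transcription of it.

import numpy as np

def iprank_all(Z, T, f, h):
    # R_k = Z + sum_{i=1..k} f(i) * Z T^i, with X_k = Z T^k updated iteratively.
    Z = np.asarray(Z, dtype=float)
    R = Z.copy()
    X = Z.copy()
    k = 1
    while f(k) >= h:          # convergence condition of Eq. (2)
        X = X @ T             # X_k = X_{k-1} T
        R += f(k) * X         # add Z_k = f(k) * X_k
        k += 1
    return R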
Some useful propositions about IPRank computing are given below.

Proposition 3.2: The convergence rate of the IPRank scores Rk is decided by the
decay function f(k).

Proof Sketch: According to Eq. (5), ‖Rk − Rk−1‖ = f(k) · ‖Z · T^k‖. Since each row
of T is normalized to one unless all elements in this row are zero, ‖T^k‖ ≤ |V| is
tenable. Proposition 3.2 holds. □

Proposition 3.3: When f(k) = c^k, IPRank is in fact an extension of PageRank,
and also a variant of the eigenvector centrality (EVC) measure.
Proof Sketch: Letting Z = (1 − c)U in Eq. (6), we get PageRank as shown
in Eq. (1). When k → ∞, Rk = Rk−1, and therefore R = c · RT + Z. Suppose
that X is a |V|-by-|V| matrix with non-zero values only on the diagonal that
satisfies RX = Z; then R = R(T · c) + RX = R(T · c + X). Therefore, R is an
eigenvector of (T · c + X). □

3.2 IPRanking Groups


As a set of nodes, the group’s structural context consists of links from both the
outside and inside. If we view a group as a big node and apply IPRank on it, we
simply get the group centrality measured from the outside of this group, which says
“group centrality is the influence propagated in from nodes outside this group”.
Formally, if we use Z(u, v) to represent the influence Z(u) propagating from
node u to node v (no matter via how many steps), we rank a group A from the
viewpoint of the outside structure:

GRout = Σv∈A MA(v) · Σu∉A Z(u, v)    (7)

MA (v) is the membership degree. On the other hand, if nodes in the group
are more connected to each other, this group should have a higher centrality.
We do not use the simple approaches such as summing and averaging, because
they ignore the link information between individual nodes in a group. To reduce
the effect of the group size, individual nodes with a high centrality should play
a more important role, especially when they are highly connected. IP model is
also effective to help rank groups from the viewpoint of the inner structure, by
propagating the influence of these high-score individuals via links. That is,

GRin = Σv∈A MA(v) · (Z(v) + Σu∈A Z(u, v))    (8)

Finally we combine rankings from the outer and inner structural context together
to rank groups in graph G(V, E), as shown below:

GR = Σv∈A MA(v) · (Z(v) + Σu∈V Z(u, v))    (9)

Ranking groups in a graph G is an extension of our IPRank algorithms. The
basic idea of our IPRank algorithms is to collect influence propagated in from
other nodes. In brief, we show three steps to perform group ranking in a graph
G. (i) Set the centrality score R(v) of a node v as initial influence Z(v). (ii)
Propagate influence via links by our IPRank algorithm. (iii) Rank groups by
IPRank scores and the membership degree according to Eq. (9).
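In code, step (iii) can be sketched as follows. It relies on the fact that, after running IPRank, the score R(v) already aggregates Z(v) and the influence propagated in from all other nodes, so Eq. (9) reduces to a membership-weighted sum of IPRank scores; the function and variable names are ours.

def group_score(group, membership, R):
    # Eq. (9): GR = sum_{v in A} M_A(v) * (Z(v) + sum_u Z(u, v)) = sum_{v in A} M_A(v) * R(v)
    return sum(membership[v] * R[v] for v in group)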

4 Experimental Study
We report our experimental results to confirm the effectiveness of our IPRank on
both individual and group levels. We compare IPRank with four other centrality
measures on accuracy, and we use various synthetic datasets and a large real
co-authorship network from DBLP. All algorithms were implemented in Java,
and all experiments were run on a machine with a 2.8 GHz CPU.
Table 2. (a) Normalized centrality scores of different measures. (b) Normalized IPRank
scores while predefined importance of node b increases step by step.

(a)
CD        [0.20, 0.10, 0.30, 0.20, 0.20]
CB        [0.21, 0.21, 0.19, 0.19, 0.21]
CC        [0.29, 0.00, 0.33, 0.04, 0.33]
PageRank  [0.16, 0.12, 0.32, 0.23, 0.17]
IPRank    [0.16, 0.21, 0.30, 0.19, 0.14]

(b)
Z                           Normalized IPRank Scores (%)
[0.2, 0.0, 0.2, 0.2, 0.2]   [16.2, 5.66, 33.2, 25.8, 19.1]
[0.2, 0.2, 0.2, 0.2, 0.2]   [16.1, 11.6, 31.9, 23.2, 17.2]
[0.2, 0.4, 0.2, 0.2, 0.2]   [16.0, 15.6, 31.1, 21.4, 15.9]
[0.2, 0.6, 0.2, 0.2, 0.2]   [16.0, 18.4, 30.5, 20.2, 14.9]

IPRank Vs. Others: A Case Study: In this experiment we evaluate the


results produced by IPRank and other centrality measures based on degree,
betweenness, closeness, and eigenvector. The comparison was performed on the
small graph shown in Fig. 1(a). Note that for each centrality measure, there
are many variants, so we adopt the definition from Wikipedia [12]. For graph
G(V, E), the degree centrality CD(a) of node a is CD(a) = indegree(a)/(|V| − 1).
For the betweenness centrality, we define CB(a) = Σs≠a≠t σst(a)/σst, where σst is
the number of shortest paths from s to t and σst(a) is the number of such paths
that pass through node a. Closeness centrality is a little more complex, because the
shortest path between two nodes may not exist in directed graphs. So we adopt
the definition in [13], where the closeness centrality is CC(a) = Σt∈V 2^(−d(a,t)), in
which d(a, t) is the shortest distance from node a to t and d(a, t) = ∞ if it is
unreachable. Finally, we use PageRank in [5] to measure eigenvector centrality.
To make the results comparable, we normalize the centrality scores of the different
measures to sum to one and show them in Table 2(a). For IPRank, we set the pre-
defined importance Z = [0.2, 0.8, 0.2, 0.2, 0.2] and decay f(k) = 0.7^k. Intuitively,
degree centrality is similar to EVC but only considers direct neighbors. Between-
ness and closeness centrality are based on shortest distances and empha-
size the “prestige” rather than the “popularity” of a node. Thus, CB(c) < CB(b)
and CC(b) = 0 are not compliant with the human intuition of ranking, which is mainly
based on popularity. In IPRank, we set a higher predefined importance on b,
which contributes to a and finally makes R(a) > R(e), contrary to PageR-
ank. Moreover, we show different normalized IPRank scores in Table 2(b) while
Z(b) increases step by step. Increasing predefined importance of a node generally
results in a higher IPRank score of this node.
Decay functions also influence IPRank scores significantly. For example, based
on the same predefined importance Z = [0.2, 0.8, 0.2, 0.2, 0.2], we define f′(k) =
0.2 for k ≤ 3 and f′(k) = 0 for k ≥ 4, and alter the decay function from f(k) = 0.7^k
to f′(k). Finally, we obtain a new IPRank score vector [0.15, 0.34, 0.22, 0.16, 0.13],
which is quite different from [0.16, 0.21, 0.30, 0.19, 0.14] shown in Table 2(a).

Results on DBLP Co-Authorship Network: We use the author information


of the entire DBLP1 conference papers (a total of 745,593) to build a large co-
authorship network. This network consists of 534,058 authors (nodes), 1,589,343
co-author relationships (edges). There are 2,644 different conferences, and an
author is associated with a vector showing how many papers he/she contributes
1 http://dblp.uni-trier.de/xml/, last modified in September 2009.
Table 3. (a) Ranking without predefined importance. (b) IPRank on KDD area. (c)
IPRank on WWW area.

(a) Authors: Wei Li, Wei Wang, Wen Gao, Wei Zhang, Jun Zhang, Chin-Chen Chang,
    Li Zhang, Lei Wang, Alberto L. S-V, C. C. Jay Kuo
    Conferences: iscas, icra, icip, hicss, hci, chi, wsc, vtc, iccS, icc
(b) Authors: Jiawei Han, Philip S. Yu, Christos Faloutsos, Heikki Mannila, Padhraic Smyth,
    Bing Liu, Jian Pei, Vipin Kumar, Mohammed Javeed Zaki, Srinivasan Parthasarathy
    Conferences: kdd, icdm, icml, icde, sigmod, nips, aaai, sdm, vldb, www
(c) Authors: Wei-Ying Ma, Zheng Chen, C. Lee Giles, Ravi Kumar, Erik Wilde,
    Katsumi Tanaka, Yong Yu, Wolfgang Nejdl, Torsten Suel, Andrew Tomkins
    Conferences: www, icde, sigir, semweb, sigmod, cikm, chi, vldb, kdd, aaai

to conferences. The maximum number of co-authors is 361, by Philip S. Yu, and
nearly 54.3% of authors appear only once in DBLP. We set the number of co-authors as
the edge weight. A conference serves as a group, and the membership degree between
author a and conference C is decided by the proportion of a's papers published in C. The IPRank
algorithm follows a human intuition: if author a and author b have the same
number of co-authors but a's co-authors are more important, then according to our
IP model, a receives more influence from neighbors and earns a higher centrality
than b. Besides, we consider that the decay of influence propagation via
co-authorships should not be exponential, since an author means a lot to his co-
authors but little to authors several hops away. In this experiment,
we set the decay function f(k) = 1 − 0.3k for k ≤ 3 and f(k) = 0 for k ≥ 4.
To illustrate the necessity of predefined importance, we first consider every
author to be equally important and show the corresponding top-10 authors and con-
ferences in Table 3(a). Some authors rank high only because they have lots of
co-authors, and larger conferences result in higher rankings. Second, we bias
the ranking to a special area by predefining the importance of authors. In Table
3(b), the authors who published papers in KDD are given a higher predefined impor-
tance, and we quickly obtain the top-10 centrality ranking of authors and conferences
in the Knowledge Discovery and Data Mining area. Third, we bias IPRank to the
WWW area by giving a higher predefined importance to authors who published
in WWW, and show the results in Table 3(c). We can see that IPRank with predefined
importance produces reasonable results. Experiments show that IPRank-All
takes only 0.91 seconds to complete all three iterations.

Efficiency and Convergence Rate: PageRank does not provide ways to com-
pute the score of only one node. In contrast, IPRank-One can do this without
accuracy loss, and an advantage is that if we only need to obtain IPRank scores
of a few nodes, IPRank-One is more efficient than IPRank-All. We executed ex-
periments on a random graph with 1M nodes and 3M edges. IPRank-All takes
3.65s to perform all iterations, whereas IPRank-One needs only 0.01s to answer
the IPRank query for one node. IPRank-All+ provides a more accurate measure
Fig. 3. (a) Time cost of a traversal increasing with the number of steps K. (b) Performance of IPRank-
All as node size increases. (c) Convergence rate of IPRank-All on DBLP dataset.

than IPRank-All when the decay of some large predefined importance needs
more iterations. The algorithms show that IPRank-One and IPRank-All+ are both
based on a traversal of the nodes that reach the target node within K steps. Fig. 3(a)
shows that the time cost of such a traversal increases rapidly when K increases. We rec-
ommend IPRank-One for IPRank queries of a few nodes and IPRank-All+ for
more accurate IPRanking.
IPRank-All is suitable for most cases. We set |E|/|V| = 5 and let the
graph size |V| increase. The time cost of each IPRank-All iteration increases nearly
linearly and looks acceptable, as shown in Fig. 3(b). We test the convergence
rate of IPRank-All on the DBLP co-authorship network with decay f(k) = 0.7^k.
The precision of iteration k is defined by averaging Rk (a)/R(a) for every node
a. Fig. 3(c) shows that after 10 iterations, the error of precision is below 0.01.

5 Related Work
Historically, measuring the centrality of nodes (or individuals) in a network has
been widely studied. Freeman [2] reviewed and categorized these methods into
three conceptual foundations: degree, betweenness, and closeness. Accompanied
with eigenvector centrality (EVC) proposed by Bonacich [3], these four measures
dominate the empirical usage of centrality. A recent summary can be found in
[4]. Besides, Tong et al. [14] proposed cTrack to find central objects in a skewed
time-evolving bipartite graph, based on random walk with restart.
In recent years, the trend of exploiting structural context becomes prevalent
in network analysis. The crucial intuition behind this trend is that “individuals
relatively closer in a network are more likely to have similar characteristics”. A
typical example is PageRank [5], where the page importance flows and is mu-
tually reinforced along hyperlinks. Other examples and applications were explored
in recent works such as [10,11,15]. [15] analyzed the propagation of trust and
distrust on large networks consisting of people. [11] used a few labeled exam-
ples to discriminate irrelevant results by computing proximity from the relevant
nodes. Gyongyi et al. discovered other good pages by propagating the trust of a
small set of good pages [10].
418 P. Li et al.

Other studies that applied predefined importance to their measures include


[16] and [17]. [16] modified PageRank to be topic-sensitive by assigning im-
portance scores for each pages with respect to a particular topic. [17] assigned
PageRank scores to each page, and measured the similarity between web pages
by propagating their own similarity and receiving similarities from other pages.
On the other hand, most of works use exponential decay simply and there are few
studies on applying user-defined decay functions in random walking. Perhaps the
most explicit study on decay function is [18], which discussed three decay (or
damping) functions on link-based ranking and showed a linear approximation
to PageRank. We are the first to introduce predefined importance and decay
function into EVC under a well-established intuitive model.
We categorize group centrality measures into two types, which exploit the
inner and outer information of a group respectively. Approaches that sum or
average the centrality scores of individuals in a group belong to the first type. As
an example of the second type, [7] ranked group C by the nodes that are not
in C, but are neighbors to a member in C. Besides, there are some studies on
quasi-cliques [13,19], which can be viewed as a special kind of groups.

6 Conclusion
In this paper, we proposed a new influence propagation model to propagate
user-defined importance on nodes to other nodes along random walk paths, with user
control provided by user-defined decay functions. We proposed new algorithms
to measure the centrality of individuals and groups according to the user's view.
We tested our approaches using a large real dataset from DBLP, and confirmed
the effectiveness and efficiency of our approaches.

Acknowledgement: The work was supported in part by grants of the Research


Grants Council of the Hong Kong SAR, China No. 419109, and National Nature
Science Foundation of China No. 70871068, 70890083 and 60873017.

References

1. Zhang, H., Smith, M., Giles, C.L., Yen, J., Foley, H.C.: Snakdd 2008 social network
mining and analysis report. SIGKDD Explorations 10(2), 74–77 (2008)
2. Freeman, L.C.: Centrality in social networks: conceptual clarification. Social Net-
works 1, 215–239 (1978)
3. Bonacich, P.: Factoring and weighting approaches to status scores and clique iden-
tification. Journal of Mathematical Sociology 2(1), 113–120 (1972)
4. Newman, M.: The mathematics of networks. In: Blume, L., Durlauf, S. (eds.) The
New Palgrave Encyclopedia of Economics, 2nd edn. Palgrave MacMillan, Bas-
ingstoke (2008), http://www-ersonal.umich.edu/~ mejn/papers/palgrave.pdf
5. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking:
Bringing order to the web. Technical Report 1999-66, Stanford InfoLab (1999)
6. Kempe, D., Kleinberg, J.M., Tardos, É.: Maximizing the spread of influence
through a social network. In: KDD, pp. 137–146 (2003)
Ranking Individuals and Groups by Influence Propagation 419

7. Everett, M.G., Borgatti, S.P.: Extending centrality. In: Wasserman, S., Faust, K.
(eds.) Social network analysis: methods and applications, pp. 58–63. Cambridge
University Press, Cambridge (1994)
8. Motwani, R., Raghavan, P.: Randomized Algorithms. Cambridge University Press,
Cambridge (1995)
9. Valente, T.: Network Models of the Diffusion of Innovations. Hampton Press, New
Jersey (1995)
10. Gyöngyi, Z., Garcia-Molina, H., Pedersen, J.O.: Combating web spam with
trustrank. In: VLDB, pp. 576–587 (2004)
11. Sarkar, P., Moore, A.W.: Fast dynamic reranking in large graphs. In: WWW, pp.
31–40 (2009)
12. Centrality in Wikipedia, http://en.wikipedia.org/wiki/Centrality
13. Dangalchev, C.: Mining frequent cross-graph quasi-cliques. Physica A: Statistical
Mechanics and its Applications 365(2), 556–564 (2006)
14. Tong, H., Papadimitriou, S., Yu, P.S., Faloutsos, C.: Proximity tracking on time-
evolving bipartite graphs. In: SDM, pp. 704–715 (2008)
15. Guha, R.V., Kumar, R., Raghavan, P., Tomkins, A.: Propagation of trust and
distrust. In: WWW, pp. 403–412 (2004)
16. Haveliwala, T.H.: Topic-sensitive pagerank. In: WWW, pp. 517–526 (2002)
17. Lin, Z., Lyu, M.R., King, I.: Pagesim: a novel link-based measure of web page
similarity. In: WWW, pp. 1019–1020 (2006)
18. Baeza-Yates, R.A., Boldi, P., Castillo, C.: Generalizing pagerank: damping func-
tions for link-based ranking algorithms. In: SIGIR, pp. 308–315 (2006)
19. Jiang, D., Pei, J.: Mining frequent cross-graph quasi-cliques. TKDD 2(4) (2009)
Dynamic Ordering-Based Search Algorithm for Markov
Blanket Discovery

Yifeng Zeng, Xian He, Yanping Xiang, and Hua Mao

Department of Computer Science, Aalborg University, DK-9220 Aalborg, Denmark
Department of Computer Science, Uni. of Electronic Sci. and Tech. of China, P.R. China
{yfzeng,huamao}@cs.aau.dk, {hexian1987,xiangyanping}@gmail.com

Abstract. Markov blanket discovery plays an important role in both Bayesian
network induction and feature selection for classification tasks. In this paper,
we propose the Dynamic Ordering-based Search algorithm (DOS) for learning
a Markov blanket of a domain variable from statistical conditional independence
tests on data. The new algorithm orders conditional independence tests and up-
dates the ordering immediately after a test is completed. Meanwhile, the algo-
rithm exploits the known independence to avoid unnecessary tests by reducing
the set of candidate variables. This results in both efficiency and reliability ad-
vantages over the existing algorithms. We theoretically analyze the algorithm on
its correctness and empirically compare it with the state-of-the-art algorithm. Ex-
periments show that the new algorithm achieves computational savings of around
40% on multiple benchmarks while securing similar or even better accuracy.

Keywords: Graphical Models, Markov Blanket, Conditional Independence.

1 Introduction
A Bayesian network (BN) [1] is a type of statistical model that efficiently represents the
joint probability distribution of a domain. It is a directed acyclic graph where nodes rep-
resent domain variables of a subject of matter, and arcs between the nodes describe the
probabilistic relationship of variables. One problem that naturally arises is the learning
of such a model from data. Most of the existing algorithms fail to construct a network of
hundreds of variables in size. A reasonable strategy for learning a large BN is to firstly
discover the Markov blanket of variables, and then to guide the construction of the full
BN [2,3,4,5].
Markov blanket indeed is an important concept and possesses potential uses in nu-
merous applications. For every variable of interest T , the Markov blanket contains a set
of parents, children, and spouses (i.e., parents of common children) of T in a BN [1].
The parents and children reflect the direct cause and direct effect of T respectively
while the spouses represent the direct cause of T ’s direct effect. Such causal knowledge
is essential if domain experts desire to manipulate the data process, e.g. to perform a
troubleshooting on a faulty device, or to test the body reaction to a medicine, or to
study the symptom of a disease, etc. Furthermore, conditioned on its Markov blanket
variables, the variable T is probabilistically independent of all other variables in the
domain. Given this important property, the Markov blanket is inextricably connected to

Fig. 1. Markov blanket of the target node T in the BN. It includes the parents and children of T ,
PC(T) = {C, D, I}, and the spouses, SP(T) = {R, H}.

the feature selection problems. Koller and Sahami [6] showed the Markov blanket of T
is the theoretically optimal set of features to predict T ’s values. We show an instance
of Markov blanket within a small BN in Fig. 1. The goal of this paper is to identify the
Markov blanket of a target variable from data in an efficient and reliable manner.
Research on Markov blanket discovery is traced back to the Grow-Shrink algo-
rithm (GS) in Margaritis and Thrun’s work [7]. The Grow-Shrink algorithm is the first
Markov blanket discovery algorithm proved to be correct. Tsamardinos et al. [8,9] pro-
posed several variants of GS, like the incremental association Markov blanket (IAMB)
and Interleaved IAMB, that aim at improved speed and reliability. However, these
algorithms are still limited in achieving data efficiency. To overcome this limitation, at-
tempts have been made including the Max-Min Parents and Children (MMPC) [10] and
HITON-PC [11] algorithms for Markov blanket discovery. Neither of them is shown to
be correct. This motivates a new generation of algorithms like the Parent-Child based
search of Markov blanket (PCMB) [12] and the improved one - Iterative PCMB (IPC-
MB) [13]. Besides the proved soundness, IPC-MB inherits the searching strategy from
the MMPC and HITON-PC algorithms: it starts to learn both parents and children of the
target variable and then proceeds to identify spouses of the target variable. It results in
the Markov blanket from which we are able to differentiate direct causes (effects) from
indirect relations to the target variable. The differentiation of Markov blanket variables
is rather useful when the Markov blanket will be further analyzed to recover the causal
structure, e.g., providing a partial order to speed up the learning of the full BN. In a
similar vein, we will base the new algorithm on the IPC-MB and provide improvement
on both time and data efficiency.
In this paper, we propose a novel Markov blanket discovery algorithm, called Dy-
namic Ordering-based Search (DOS) algorithm. Akin to the existing algorithms, the
DOS takes an independence-based search to find a Markov blanket by assuming that
data were generated from a faithful BN modeling the domain. It conducts a series
of statistical conditional independence tests toward the goal of identifying a number of
Markov blanket variables (parents and children as well as spouses). Our main contri-
bution on developing the DOS is on two aspects. Firstly, we arrange the sequence of
independence tests by ordering variables not only in the candidate set, but also in the
conditioning sets. We order the candidates using an independence measurement such as
mutual information [14], the p-value returned by G² tests [15], etc. Meanwhile,
we order the conditioning variables in terms of the frequency with which the variables enter
into the conditioning set in the known independence tests. We re-order the variables im-
mediately when an independence test is completed. By ordering both types of variables,
we are able to detect true negatives effectively within a small number of conditional in-
dependence tests.
Secondly, we exploit the known conditional independence tests to remove true neg-
atives from the candidate set at the earliest time. By doing so, we need to test only a small
number of the conditioning sets (generated from the candidate set), thereby improving
time efficiency. In addition, we can limit the conditioning set to a small size in the
new independence tests, which achieves data efficiency. We further provide the proof
on the correctness of the new DOS algorithm. Experimental results show the benefit
of dynamically ordering independence tests and demonstrate the superior performance
over the IPC-MB algorithm.

2 Background and Notations


In the present paper we use uppercase letters (e.g., X, Y , Z) to denote random variables
and boldface uppercase letters (e.g., X, Y, Z) to represent sets of variables. We use U
to denote the set of variables in the domain. A “target” variable is denoted as T unless
stated otherwise. “Nodes” and “variables” will be used interchangeably.
We use I(X, Y |Z) to denote the fact that two nodes X and Y are conditionally
independent given the set of nodes Z. Using conditional independence, we may define
the Markov blanket of the target variable T , denoted by M B(T ), as follows.

Definition 1 (Markov Blanket). The Markov blanket of T is a minimal set of vari-


ables conditioned on which all other variables are independent of T , i.e., ∀X ∈ U −
{MB(T) ∪ T}, I(X, T | MB(T)).

Bayesian network (BN) [1] is a directed acyclic graph G where each node is annotated
with a conditional probability distribution (CPD) given any instantiation of its parents.
The multiplication of all CPDs constitutes a joint probability distribution P modeling
the domain. In a BN, a node is independent of its non-descendants conditioned on its
parents.

Definition 2 (Faithfulness). A BN G and a joint probability distribution P is faithful to


one another iff every conditional independence entailed by the graph G is also present
in P [1,15]. A BN is faithful if it is faithful to its corresponding distribution P , i.e.,
IG (X, Y |Z)=IP (X, Y |Z).

A graphical criterion for entailed conditional independence is that of d-separation [1]


in a BN. It is defined as follows.

Definition 3 (d-separation). Node X is d-separated from node Y conditioned on Z in


the graph G if, for all paths between X and Y , either of the following two conditions
holds:
1. The connection is serial or diverging and Z is instantiated.
2. The connection is converging, and neither Z nor any of Z’s descendants is instanti-
ated.
Due to the faithfulness assumption and d-separation criterion, we are able to learn a BN
from the data generated from the domain. We may utilize statistical tests to establish
conditional independence between variables that is structured in the BN. This moti-
vates the main idea of an independence-based (or constraint-based) search for learning
BN [15]. Most of current BN or Markov blanket learning algorithms are based on the
following theorem [15].

Theorem 1. If a BN G is faithful to a joint probability distribution P then:


1. Nodes X and Y are adjacent iff X and Y are conditionally dependent given any
other set of nodes.
2. For each triplet of nodes X, Y and Z in G such that X and Y are adjacent to Z,
but X and Y are not adjacent, X → Y ← Z is a sub-graph of G iff X and Y are
dependent conditioned on every other set of nodes that contains Z.

A faithful BN allows the Markov blanket to be graphically represented in G. Further-


more, Tsamardinos et al. [9] show the uniqueness of the Markov blanket in Theorem 2.

Theorem 2. If a BN G is faithful, then for every variable T , the Markov blanket of T


is unique and is the set of parents, children, and spouses of T .

We observe that the first part of Theorem 1 allows one to find parents and children of the
target node T , denoted by P C(T ), since there shall be an edge between P C(T ) and T ;
the second part provides possibility on identifying a spouse of T , denoted by SP (T ).
Hence Theorem 1 together with Theorem 2 provide a foundation for the Markov blanket
discovery.

3 DOS Algorithm
The Dynamic Ordering-based Search algorithm (DOS) discovers the Markov blanket
M B(T ) through two procedures. In the first procedure, the algorithm finds a candidate
set of parents and children of the target node T , called CP C(T ). It starts with the
whole set of domain variables and gradually excludes those that are independent of T
conditioned on a subset of the remained set. In the second procedure, the algorithm
identifies spouses of the target node, called SP (T ), and removes the false positives
from CP C(T ). The resulted CP C(T ) is the output M B(T ).
Prior to presenting the DOS algorithm, we introduce three functions. The first func-
tion, called Indep(X, T |S), measures the independence between the variable X and
the target variable T conditioned on a set of variables S. In our algorithm, we use G²
tests to compute the conditional independence and take the p-value (returned by the G²
test) as the independence measurement [15]. The smaller the p-value, the higher the
dependence. In practice, we compare the p-value to a confidence threshold 1 − α. More
precisely, we let Indep(X, T |S) be equivalent to the p-value so that we are able to con-
nect the independence measurement to conditional independence, i.e., I(X, T |S)=true
iff Indep(X, T |S) ≥ 1 − α. Notice that we assume independence tests are correct.
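For illustration only, the following is a minimal Python sketch of such an Indep function for discrete data: it computes the G2 (log-likelihood ratio) statistic over the contingency table of X and T within every configuration of S and converts it to a p-value. The data layout (a dictionary of discrete-valued columns) and all names are our own illustrative choices, not the authors' implementation.

# Illustrative sketch (not the authors' code): G2 conditional independence test for
# discrete data; returns the p-value used as Indep(X, T | S).
from collections import Counter
from math import log
from scipy.stats import chi2

def indep(data, x, t, s):
    """data: dict column -> list of discrete values; x, t: column names; s: iterable of columns."""
    n = len(data[x])
    cond = sorted(s)
    # group records by the configuration of the conditioning set S
    groups = {}
    for i in range(n):
        key = tuple(data[z][i] for z in cond)
        groups.setdefault(key, []).append((data[x][i], data[t][i]))
    g2, dof = 0.0, 0
    for rows in groups.values():
        n_s = len(rows)
        cnt_xt = Counter(rows)
        cnt_x = Counter(a for a, _ in rows)
        cnt_t = Counter(b for _, b in rows)
        for (a, b), n_ab in cnt_xt.items():
            expected = cnt_x[a] * cnt_t[b] / n_s
            g2 += 2.0 * n_ab * log(n_ab / expected)
        # degrees of freedom accumulated per configuration of S (kept >= 1 as a pragmatic choice)
        dof += max((len(cnt_x) - 1) * (len(cnt_t) - 1), 1)
    return chi2.sf(g2, dof)   # p-value: large values indicate independence

Following the convention above, I(X, T |S) would then be declared true iff indep(...) ≥ 1 − α.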
The second function, called F req(Y ), is a counter that measures how frequently a
variable Y enters the conditioning set S in previous conditional independence
tests Indep(X, T |S). A large F req(Y ) value implies a large probability of d-separating
X from T using Y in the conditioning set S. In general, the variables belonging to
P C(T ) have a large F req(Y ) value.
The third function, called GenSubset(V, k), generates all subsets of size k from the
set of variables V in the Banker’s sequence [16]. The Banker’s sequence is one way of
enumerating all subsets of a set. It examines subsets in monotonically increasing order
by size. For all subsets of size k, it constructs the subset by sequentially picking up k
elements from the set. We denote the resulting set by SS, i.e., SS = GenSubset(V, k).
Notice that SS contains a set of ordered subsets of identical size. For example, we may
have SS = GenSubset({A, B, C}, 2) = {{A, B}, {A, C}, {B, C}}.
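As an aside, for a fixed size k this enumeration coincides with Python's itertools.combinations over the ordered input, so a minimal sketch of GenSubset (illustrative only, not the authors' code) is:

# Illustrative sketch: all size-k subsets of the ordered sequence v, enumerated by
# sequentially picking k elements, as described above.
from itertools import combinations

def gen_subset(v, k):
    return [set(c) for c in combinations(v, k)]

# gen_subset(['A', 'B', 'C'], 2)  ->  [{'A', 'B'}, {'A', 'C'}, {'B', 'C'}]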

3.1 Algorithm Formulation


We present details of the DOS algorithm in Fig 2. As mentioned above, the new algo-
rithm uses two procedures, called GenCP C and Ref CP C respectively, to discover the
Markov blanket of the target variable T . It starts with the GenCP C procedure that aims
to find a candidate set of parents and children of T . The GenCP C procedure searches
the CP C(T ) by shrinking the set of T ’s adjacent variables called ADJ(T ). The initial
ADJ(T ) is the whole set of domain variables except T , i.e., ADJ(T ) = U − {T }.
The procedure then removes a Non-PC (non-parent and child) variable from ADJ(T )
if the variable is conditionally independent of T given a subset of the adjacent set (lines
7-9). We use the G2 test in the conditional independence tests (line 7), and check
the independence for each adjacent variable (line 4) by examining the empty condition-
ing set (cutsize=0) first, then all conditioning sets of size 1, later all those of size 2
and so on, until cutsize reaches |ADJ(T )| (lines 1 and 15). Recall that the number of data
instances required for reliable G2 tests is exponential in the size of the conditioning set S. Hence
the strategy of monotonically increasing the size of S improves
data efficiency.
We observe that the plain algorithm needs to iterate over every adjacent variable of T
and test the conditional independence possibly given all subsets of the adjacent set
ADJ(T ). Clearly we may speed up the procedure by reducing ADJ(T ) as early as
possible. In other words, we shall remove Non-PC variables from ADJ(T ) as early
as possible using effective conditional independence tests. This relates to two is-
sues: 1) selection of an adjacent variable (X ∈ ADJ(T )) that is most likely to be
a Non-PC variable; 2) selection of the conditioning set S that can effectively
d-separate the adjacent variable X from T . We solve the first issue by choosing the
variable that has the minimum relevance with T as measured by the p-values (line 4), i.e.,
X = argmax_{X∈ADJ(T )} Indep(X, T |S). A large p-value (= Indep(X, T |S)) implies a large
probability of claiming conditional independence. The selected variable X is the one
that has not been visited and has the largest p-value among all unvisited adjacent vari-
ables. Notice that we use the p-values already known from the previous independence tests,
where the size of S is one smaller than that in the new tests (line 7).
We solve the second issue by setting a counter function F req(Y ) to each variable
Y . The function records how often the variable Y (in the conditioning set) d-separates
a Non-PC variable from T . We update the counter immediately after an effective test
is executed, and order the adjacent variables in the descending order of counters (lines
12-13). We generate the conditioning sets SS, each of which has size cutsize, from

Dynamic Ordering-based Search algorithm (DOS)


Input: Data D, Target Variable T , Confidence Level 1-α
Output: Markov Blanket M B(T )

Main Procedure
1: Initialize the adjacent set of T : ADJ(T ) = U − {T }
2: Find the CP C(T ) through GenCP C:CP C(T ) = GenCP C(D, T, ADJ(T ))
3: Find the SP (T ) and remove the false positives through Ref CP C:M B(T ) =
Ref CP C(D, T, CP C(T ))
Sub-Procedure: Generate the CP C(T )
GenCP C(D, T, ADJ (T ))
1: Initialize the size of conditioning set S: cutsize=0
2: WHILE (|ADJ(T )| > cutsize) DO
3: Initialize the Non-PC set: N P C(T )=∅
4: FOR each X ∈ ADJ(T ) and
choose X = argmax Indep(X, T |S) DO
5: Generate the conditioning sets:
SS = GenSubset(ADJ(T ) − {X}, cutsize)
6: FOR each S ∈ SS DO
7: IF (Indep(X, T |S) ≥ 1 − α) THEN
8: N P C(T )=N P C(T ) ∪ X
9: ADJ(T ) = ADJ(T ) − N P C(T )
10: Keep the d-separate sets: Sepset(X, T )=S
11: FOR each Y ∈ S DO
12: Update F req(Y )
13: Order ADJ(T ) using F req(Y ) in the descending order
14: Break
15: cutsize = cutsize + 1
16: Return CP C(T ) = ADJ(T )
Sub-Procedure: Refine the CP C(T )
Ref CP C(D, T, CP C(T ))
1: FOR each X ∈ CP C(T ) DO
2: Find the CP C for X:
CP C(X) = GenCP C(D, X, U − {X})
3: IF T ∉ CP C(X) THEN
4: Remove the false positives: CP C(T ) = CP C(T ) − {X}
5: Continue
6: FOR each Y ∈ {CP C(X) − CP C(T )} DO
7: IF (Indep(Y, T |X ∪ Sepset(X, T )) < 1 − α) THEN
8: Add the spouse Y : SP (T ) = SP (T ) ∪ {Y }
9: CP C(T ) = CP C(T ) ∪ SP (T )
10: Return M B(T ) = CP C(T )

Fig. 2. The DOS algorithm contains two sub-procedures. The GenCP C procedure finds a can-
didate set of parents and children of T by efficiently removing Non-PC variables from the set of domain
variables, while the Ref CP C procedure mainly adds spouses of T and removes false positives.

ADJ(T ) using the GenSubSet function (line 5). Since we order the ADJ(T ) variables
and generate the subsets in the Banker’s sequence, the conditioning sets S(∈ SS) selected
first have a large probability of being P C(T ) or a subset of it. Consequently, we
may detect a Non-PC variable within a few tests. Once we identify a Non-PC variable
we immediately remove it from ADJ(T ) (lines 8-9). The reduced ADJ(T ) avoids
generating a large number of conditioning sets, as well as large conditioning sets,
in the new tests.
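To make the interplay of the three functions concrete, the following is a compact Python sketch of GenCP C that follows one plausible reading of Fig. 2 (in particular of where the Break in line 14 resumes); indep is assumed to return a p-value as described above, and all names and defaults are illustrative rather than the authors' reference implementation.

# Illustrative sketch of GenCPC (one reading of Fig. 2; not the authors' code).
from itertools import combinations

def gen_cpc(indep, t, variables, alpha=0.05):
    adj = [v for v in variables if v != t]     # ADJ(T) = U - {T}
    freq = {v: 0 for v in variables}           # Freq(Y): how often Y d-separated a Non-PC from T
    last_p = {v: 0.0 for v in adj}             # p-values from the previous level of tests
    sepset = {}                                # recorded d-separating sets Sepset(X, T)
    cutsize = 0
    while len(adj) > cutsize:
        # visit adjacent variables in decreasing order of their last p-value
        for x in sorted(list(adj), key=lambda v: -last_p[v]):
            if x not in adj:                   # already removed at this level
                continue
            # order the remaining candidates by Freq before generating conditioning sets
            others = sorted((v for v in adj if v != x), key=lambda v: -freq[v])
            for s in combinations(others, cutsize):
                p = indep(x, t, set(s))
                last_p[x] = p
                if p >= 1 - alpha:             # X independent of T given S -> Non-PC
                    adj.remove(x)              # remove immediately from ADJ(T)
                    sepset[x] = set(s)
                    for y in s:                # reward the separators
                        freq[y] += 1
                    break
        cutsize += 1
    return adj, sepset                         # CPC(T) and the recorded separating sets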
The GenCP C procedure returns the candidate set of T ’s parents and children that
excludes false negatives. However, it may include possible false positives. For instance,
in Fig. 1, the variable M still remains in the output CP C(T ) because M is d-separated
from T only conditioned on the set {R, I}. However, the variable R is removed early
since it is independent of T given the empty set. Hence the tests will not condition
on both R and I simultaneously. The problem is fixed by checking the symmetric re-
lation between T and T ’s PC, i.e., T shall be in the PC set of T ’s PC variable and
vice versa [2,12]. For example, we may find the candidate set of M ’s parents and chil-
dren CP C(M ). If T does not belong to CP C(M ) we could safely remove M from
CP C(T ). We present this solution in the procedure Ref CP C.
In the procedure Ref CP C, we start by searching the parents and children set for each
variable in CP C(T ) (line 2). If the candidate PC variable violates the symmetry (i.e.,
T ∉ CP C(X)) it will be removed from CP C(T ) (line 4). If T ∈ CP C(X), we
know that X is a true PC of T and CP C(X) may contain T ’s spouse candidates. A
spouse is not within CP C(T ), but shares common children with T . We again use G2
tests to detect the dependence between the spouse and T , and identify the true spouse
set SP (T ) (lines 7-9). We refine the CP C(T ) by removing the false positives and
retrieving the spouses, and finally return the true M B(T ).

3.2 Theoretical Analysis


The new algorithm DOS bases its search scheme on the state-of-the-art algorithm
IPC-MB. It embeds three functions (Indep, F req and GenSubSet) to improve
both time and data efficiency. Its correctness rests on the two procedures,
namely GenCP C and Ref CP C. The procedure GenCP C removes a Non-PC vari-
able X if X is independent of T conditioned on some subset of ADJ(T ) − {X}. For
the removal of false positives, the algorithm resorts to a check of the symmetric rela-
tion between T and each of T ’s PC. This additional check ensures a correct PC set of
T . Besides removing the false positives, the procedure Ref CP C adds T ’s spouses to
complete the M B(T ). Its correctness lies in the inference that a spouse Y is not a candi-
date of T ’s PC, but is dependent on T conditioned on their common children. We state the
correctness of the DOS algorithm below; a more technical proof is found in [13].

Theorem 3 (Correctness). The Markov blanket M B(T ) returned by the DOS algo-
rithm is correct and complete given two assumptions: 1) the data D are faithful to a
BN; and 2) the independence tests are correct.

The primary complexity of the DOS algorithm is due to the procedure GenCP C in
Fig. 2. Similar to the performance evaluation of BN learning algorithms, the complex-
ity is measured in the number of conditional independence tests executed [15]. The
procedure needs to calculate the independence function Indep(X, T |S) for each do-
main variable given all subsets of ADJ(T ) in the worst case. Hence the number of
tests is bounded by O(|U| · 2^|ADJ(T )|). Our strategy of selecting both the candidate
variable X and the conditioning set S quickly reduces ADJ(T ) by removing
Non-PC variables and, in most cases, tests only subsets of P C(T ). Ideally, we may
expect the complexity to be on the order of O(|U| · 2^|P C(T )|). This is a significant reduction
in complexity since |P C(T )| ≪ |ADJ(T )| in most cases.
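As a rough illustration using the networks of Table 1 (under the stated bounds, not a measurement): for ALARM we have |U| = 37 and thus |ADJ(T )| = 36 initially, so the worst-case bound is about 37 · 2^36 ≈ 2.5 × 10^12 tests, whereas with a maximum |P C| of 6 the ideal bound is only about 37 · 2^6 ≈ 2.4 × 10^3 tests.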

4 Experimental Results

We evaluate the performance of the DOS algorithm over three benchmark networks and com-
pare it with the state-of-the-art algorithm IPC-MB. To the best of our knowledge, the
IPC-MB is the best algorithm for Markov blanket discovery in the current literature. Both
algorithms are implemented in Java and the experiments are run on a Windows XP plat-
form with a Pentium(R) Dual-Core (2.60 GHz) processor and 2 GB of memory.
We describe the networks used in Table 1. The networks range from 20+ to 50+
variables in the domain and differ in connectivity as measured by both in/out-degree
and PC numbers. They provide useful tools in a wide range of practical applications and
have been proposed as benchmarks for evaluating both BN and Markov blanket learning
algorithms [2]. For each of the networks we randomly sample data from the probability
distribution of the network. We use both the DOS and IPC-MB algorithms to re-
construct the Markov blanket of every variable from the data.
We compare the algorithms in terms of speed, measured by both running time and the num-
ber of conditional independence (CI) tests executed, and accuracy, measured by both
precision and recall. Precision is the ratio of true positives in the output (returned by
the algorithms) while recall is the ratio of returned true positives in the true M B(T ).
In addition, we use a combined measure, the proximity of the algorithm's precision and recall
to perfect precision and recall, expressed as the Euclidean distance:
Distance = √((1 − precision)² + (1 − recall)²). The smaller the distance, the closer
the algorithm output is to the true Markov blanket.
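For reference, a minimal Python sketch of these three measures, given the discovered and the true Markov blankets as sets of variables (all names are illustrative), is:

# Illustrative computation of precision, recall and distance for a discovered Markov blanket.
from math import sqrt

def mb_scores(found, truth):
    tp = len(found & truth)                              # true positives
    precision = tp / len(found) if found else 0.0
    recall = tp / len(truth) if truth else 0.0
    distance = sqrt((1 - precision) ** 2 + (1 - recall) ** 2)
    return precision, recall, distance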
For a single experiment on a particular dataset we ran the algorithms using as targets
all variables in a network and computed the average values for each measurement. For
a particular size of dataset we randomly generated 10 sets and measured the average
performance of each algorithm. We set α = 0.05. Table 2 reports the experimental re-
sults for datasets of different sizes. Each entry in the table shows average and standard
deviation values over 10 datasets of a particular size. In the table, “Insts.” refers to data

Table 1. Bayesian networks used in the experiments

Networks |U| Max In/Out-degree Min/Max|P C|


Insurance 27 3/7 1/9
ALARM 37 4/5 1/6
Hailfinder 56 4/16 1/17

Table 2. Both speed and accuracy comparison between the DOS and IPC-MB algorithms

Networks Insts. Algs. Speed(Reduction) Accuracy(Improvement)


Times(sec.) # CI tests Precision Recall Distance
IPC-MB 82±6 10631±1090 0.75±0.05 0.24±0.04 0.86±0.04
300 DOS 49±4 6003±421 0.87±0.03 0.31±0.03 0.74±0.03
– 40.24% 43.53% 16.00% 29.17% 13.95%
IPC-MB 97±14 15308±2400 0.84±0.05 0.30±0.04 0.76±0.04
500 DOS 55±7 8288±1045 0.90±0.03 0.37±0.03 0.67±0.04
Insurance – 43.30% 46.25% 7.14% 23.33% 11.84%
IPC-MB 79±12 11605±902 0.93±0.05 0.36±0.04 0.66±0.03
1000 DOS 47±2 6535±327 0.95±0.04 0.42±0.03 0.60±0.03
– 41.77% 43.61% 2.15% 16.66% 9.09%
IPC-MB 143±18 15537±1491 0.97±0.05 0.46±0.03 0.54±0.03
2000 DOS 85±9 8988±734 0.98±0.04 0.51±0.03 0.50±0.02
– 40.56% 42.15% 1.03% 10.86% 7.41%
IPC-MB 78±6 11329±678 0.76±0.06 0.44±0.03 0.65±0.06
500 DOS 52±2 7209±235 0.80±0.03 0.49±0.02 0.59±0.04
– 33.33% 36.37% 5.26% 11.36% 9.23%
IPC-MB 115±5 15143±811 0.78±0.06 0.55±0.04 0.55±0.05
1000 DOS 71±6 8862±98 0.83±0.01 0.58±0.02 0.49±0.01
ALARM – 38.26% 41.48% 6.41% 9.09% 10.91%
IPC-MB 183±17 19538±886 0.89±0.03 0.67±0.01 0.39±0.04
2000 DOS 110±6 10812±429 0.91±0.01 0.68±0.01 0.36±0.02
– 38.89% 44.66% 2.25% 1.49% 7.69%
IPC-MB 416±20 24781±897 0.98±0.01 0.85±0.02 0.15±0.02
5000 DOS 234±13 12406±238 0.99±0.01 0.87±0.02 0.13±0.02
– 43.75% 49.94% 1.02% 2.35% 13.33%
IPC-MB 63±6 9952±142 0.85±0.01 0.38±0.03 0.63±0.03
500 DOS 51±4 7852±86 0.88±0.01 0.41±0.02 0.59±0.02
– 19.05% 21.10% 3.53% 7.89% 6.64%
IPC-MB 88±6 12046±327 0.91±0.02 0.48±0.03 0.53±0.04
1000 DOS 63±3 8363±274 0.94±0.02 0.50±0.03 0.50±0.03
Hailfinder – 28.41% 30.57% 3.30% 4.17% 5.66%
IPC-MB 144±20 15269±486 0.94±0.02 0.55±0.02 0.50±0.01
2000 DOS 98±7 10056±217 0.95±0.03 0.57±0.02 0.48±0.01
– 31.94% 34.14% 1.05% 3.63% 4.00%
IPC-MB 255±10 20152±524 0.98±0.02 0.67±0.01 0.37±0.01
5000 DOS 148±4 11327±301 0.98±0.02 0.67±0.01 0.37±0.01
– 41.96% 43.79% 0.00% 0.00% 0.00%

instances and “Algs.” to both algorithms. For the speed comparison purpose, “# CI
tests” denotes the total number of conditional independence tests. “Reduction” shows the
percentage by which the DOS algorithm reduces the running time and number of CI tests relative to
the IPC-MB algorithm. For the accuracy comparison purpose, “Improvement” refers to
the improvement of the DOS algorithm over the IPC-MB algorithm in terms of accuracy
measurements like precision, recall and distance.

In the middle part of Table 2, we show the speed comparison between the DOS and
IPC-MB algorithms over four different datasets on three networks. The DOS algorithm
executes much faster than the IPC-MB for discovering the Markov blanket. This re-
sults from a significant reduction in the required CI tests in the DOS algorithm. As
Table 2 shows, the DOS requires on average 40% fewer CI tests than the
IPC-MB. In some cases (e.g., the ALARM network with 5000 data instances) the reduction is
up to 49.94%. The improved time efficiency is mainly due to our ordering strategy, which
enables the DOS algorithm to quickly spot true negatives and reduce T ’s adjacent set,
thereby avoiding unnecessary CI tests.
In the right part of Table 2, we show the accuracy of both algorithms in discovering
the Markov blanket. As expected, both algorithms perform better (smaller distance)
with a larger number of data instances. In most cases, the DOS algorithm has better
performance than the IPC-MB algorithm. It has around 8% improvement in terms of
the distance measurement compared with the IPC-MB algorithm. The improvement
is mainly due to more true positives found by the DOS algorithm (shown by the larger im-
provement in the recall measurement).
More importantly, the DOS demonstrates a larger improvement in distance for
smaller numbers of data instances. For example, on the Insurance network the distance
improvement is 13.95% with 300 data instances while it is 7.41% with 2000 data in-
stances. This implies more reliable CI tests in the DOS algorithm. The significant re-
duction of CI tests (shown in Table 2) also indicates improved test reliability for the
DOS algorithm. The reliability advantage appears because the DOS algorithm always
conditions on small conditioning sets by removing true
negatives as early as possible.

5 Related Work
Margaritis and Thrun [7] proposed the first provably correct Markov blanket discovery
algorithm - the Grow-Shrink (GS) algorithm. As implied by its name, the GS algorithm con-
tains two phases: a growing phase and a shrinking phase. It attempts to first add po-
tential variables into the Markov blanket and then remove false positives in the following
phase. As the GS conducts statistical independence tests conditioned on supersets of the
Markov blanket, and many false positives may be included in the growing phase, it turns
out to be inefficient and does not scale to large applications. However, its soundness
makes it a proven foundation for subsequent research.
The IAMB [8] was proposed to improve the GS on the time and data efficiency. It
orders the set of variables each time when a new variable is included into the Markov
blanket in the growing phase. By doing so, the IAMB is able to add fewer false positives
in the first phase. However, the independence tests are still conditioned on the whole (possibly
large) Markov blanket set, which does not really improve the data efficiency. More-
over, the computation of conditional information values for sorting the variables in each
iteration is rather expensive in the IAMB. Yaramakala and Margaritis [17] proposed
a new heuristic function to determine the independence tests and order the variables.
However, as reported, there is no fundamental difference from the IAMB.
Later, several IAMB variants appeared to address the IAMB’s limited data ef-
ficiency, such as the Max-Min Parents and Children (MMPC) [10], HITON-PC [11] and
so on. Unfortunately, both algorithms (MMPC and HITON-PC) were proved incor-
rect [12], but they did introduce a new approach to identifying the Markov blanket. The
algorithms find the Markov blanket by searching for T ’s parents and children first, and then
discover T ’s spouses. This novel strategy allows independence tests to be conditioned
on a subset of T ’s neighboring or adjacent nodes instead of the whole Markov
blanket.
Following the same idea of MMPC and HITON-PC, Pena et. al. [12] proposed the
PCMB to conquer the data efficiency problem of the IAMB. More importantly, the
PCMB is proved correct in a theoretical way. Recently, Fu and Desmarais [13] pro-
posed the IPC-MB that always conducts statistical independence tests conditioned on
the minimum set of T ’s neighbors, which improves the PCMB on both the time and
data efficiency. However, both algorithms need to iterate a large number of subsets of
T ’s neighboring nodes in most cases and do not update the set of neighboring nodes
immediately after a true negative is detected. This allows our improvement as presented
in this paper.

6 Conclusion and Future Work

We presented a new algorithm for Markov blanket discovery, called Dynamic Ordering-
based Search (DOS). The DOS algorithm orders conditional independence tests through
a strategic selection of both the candidate variable and the conditioning set. The selec-
tion is achieved by exploiting the known independence tests to order the variables. By
doing so, the new algorithm can efficiently remove true negatives so that it avoids un-
necessary conditional independence tests and the remaining tests condition only on small sets.
We analyzed the correctness of the DOS algorithm as well as its complexity in terms
of the number of conditional independence tests. Our empirical results show that the
DOS algorithm performs much faster and more reliably than the state-of-the-art al-
gorithm IPC-MB. The reliability advantage is more evident with a small number of
data instances. A potential research direction is investigating the utility of our ordering
scheme in independence-based algorithms for BN learning.

Acknowledgment

The first author acknowledges partial support from National Natural Science Founda-
tion of China (No. 60974089 and No. 60975052). Yanping Xiang thanks the support
from National Natural Science Foundation of China (No. 60974089).

References

1. Pearl, J.: Probabilistic reasoning in intelligent systems: networks of plausible inference. Mor-
gan Kaufmann Publishers Inc., San Francisco (1988)
2. Tsamardinos, I., Brown, L.E., Aliferis, C.F.: The max-min hill-climbing bayesian network
structure learning algorithm. Machine Learning 65(1), 31–78 (2006)
3. Zeng, Y., Poh, K.L.: Block learning bayesian network structure from data. In: Proceedings
of the Fourth International Conference on Hybrid Intelligent Systems (HIS 2004), pp. 14–19
(2004)
4. Zeng, Y., Hernandez, J.C.: A decomposition algorithm for learning bayesian network struc-
tures from data. In: Washio, T., Suzuki, E., Ting, K.M., Inokuchi, A. (eds.) PAKDD 2008.
LNCS (LNAI), vol. 5012, pp. 441–453. Springer, Heidelberg (2008)
5. Zeng, Y., Xiang, Y., Hernandez, J.C., Lin, Y.: Learning local components to understand large
bayesian networks. In: Proceedings of The Ninth IEEE International Conference on Data
Mining (ICDM), pp. 1076–1081 (2009)
6. Koller, D., Sahami, M.: Toward optimal feature selection. In: Proceedings of the Thirteenth
International Conference on Machine Learning, pp. 284–292 (1996)
7. Margaritis, D., Thrun, S.: Bayesian network induction via local neighborhoods. Advances in
Neural Information Processing Systems 12, 505–511 (1999)
8. Tsamardinos, I., Aliferis, C.F., Statnikov, A.R.: Algorithms for large scale markov blanket
discovery. In: Proceedings of the Sixteenth International Florida Artificial Intelligence Re-
search Society Conference, pp. 376–381 (2003)
9. Tsamardinos, I., Aliferis, C.: Towards principled feature selection: Relevancy, filters and
wrappers. In: Proceedings of the Ninth International Workshop on Artificial Intelligence and
Statistics (2003)
10. Tsamardinos, I., Aliferis, C., Statnikov, A.: Time and sample efficient discovery of markov
blankets and direct causal relations. In: KDD, pp. 673–678 (2003)
11. Aliferis, C., Tsamardinos, I., Statnikov, A.: Hiton: A novel markov blanket algorithm for
optimal variable selection. In: Proceedings of American Medical Informatics Association
Annual Symposium (2003)
12. Pena, J.M., Nilsson, R., Bjorkegren, J., Tegner, J.: Towards scalable and data efficient learn-
ing of markov boundaries. International Journal of Approximate Reasoning 45(2), 211–232
(2007)
13. Fu, S., Desmarais, M.C.: Fast markov blanket discovery algorithm via local learning within
single pass. In: Proceedings of the Twenty-First Canadian Conference on Artificial Intelli-
gence, pp. 96–107 (2008)
14. Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley-Interscience,
New York (2006)
15. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction, and Search. MIT Press, Cam-
bridge (2000)
16. Loughry, J., van Hemert, J., Schoofs, L.: Efficiently enumerating the subsets of a set. Depart-
ment of Mathematics and Computer Science, University of Antwerp, RUCA, Belgium, pp.
1–10 (2000)
17. Yaramakala, S., Margaritis, D.: Speculative markov blanket discovery for optimal feature
selection. In: Proceedings of the Fifth IEEE International Conference on Data Mining, pp.
809–812 (2005)
Mining Association Rules for Label Ranking

Cláudio Rebelo de Sá1, Carlos Soares1,2, Alípio Mário Jorge1,3,
Paulo Azevedo5, and Joaquim Costa4
1 LIAAD-INESC Porto L.A., Rua de Ceuta 118-6, 4050-190, Porto, Portugal
2 Faculdade de Economia, Universidade do Porto
3 DCC - Faculdade de Ciencias, Universidade do Porto
4 DM - Faculdade de Ciencias, Universidade do Porto
5 CCTC, Departamento de Informática, Universidade do Minho
claudio@liaad.up.pt, csoares@fep.up.pt, amjorge@fc.up.pt,
pja@uminho.pt, jpcosta@fc.up.pt

Abstract. Recently, a number of learning algorithms have been adapted


for label ranking, including instance-based and tree-based methods. In
this paper, we propose an adaptation of association rules for label rank-
ing. The adaptation, which is illustrated in this work with APRIORI
Algorithm, essentially consists of using variations of the support and
confidence measures based on ranking similarity functions that are suit-
able for label ranking. We also adapt the method to make a prediction
from the possibly conflicting consequents of the rules that apply to an
example. Despite having made our adaptation from a very simple vari-
ant of association rules for classification, the results clearly show that
the method is making valid predictions. Additionally, they show that it
competes well with state-of-the-art label ranking algorithms.

1 Introduction

Label ranking is an increasingly popular topic in the machine learning litera-


ture [12,7,25]. Label ranking studies the problem of learning a mapping from
instances to rankings over a finite number of predefined labels. It can be con-
sidered as a variant of the conventional classification problem [7]. In contrast to
a classification setting, where the objective is to assign examples to a specific
class, in label ranking we are interested in assigning a complete preference order
of the labels to every example.
There are two main approaches to the problem of label ranking. Decomposition
methods decompose the problem into several simpler problems (e.g., multiple
binary problems). Direct methods adapt existing algorithms or develop new ones
to treat the rankings as target objects without any transformation. An example
of the former is the ranking by pairwise comparisons [12]. Examples of algorithms
that were adapted to deal with rankings as the target objects include decision
trees [24,7], k -Nearest Neighbor [5,7] and the linear utility transformation [13,9].
This second group of algorithms can be divided into two approaches. The first one
contains methods (e.g., [7]) that are based on statistical distributions of rankings,

such as Mallows [17]. The other group of methods are based on measures of
similarity or correlation between rankings (e.g., [24,2]).
In this paper, we propose an adaptation of association rules mining for label
ranking based on similarity measures. Association rules mining is a very impor-
tant and successful task in data mining. Although its original purpose was only
descriptive, several adaptations have been proposed for predictive problems.
The paper is organized as follows: sections 2 and 3 introduce the label ranking
problem and the task of association rule mining, respectively; section 4 describes
the measures proposed here; section 5 presents the experimental setup and dis-
cusses the results; finally, section 6 concludes this paper.

2 Label Ranking

The formalization of the label ranking problem given here follows the one pro-
vided in [7].1 In classification, given an instance x from the instance space X, the
goal is to predict the label (or class) λ to which x belongs, from a pre-defined
set L = {λ1 , . . . , λk }. In label ranking the goal is to predict the ranking of the
labels in L that are associated with x. We assume that the ranking is a total
order over L defined on the permutation space Ω. A total order can be seen as a
permutation π of the set {1, . . . , k}, such that π(a) is the position of λa in π. Let
us also denote π −1 as the result of inverting the order in π. As in classification,
we do not assume the existence of a deterministic X → Ω mapping. Instead,
every instance is associated with a probability distribution over Ω. This means
that, for each x ∈ X, there exists a probability distribution P (·|x) such that, for
every π ∈ Ω, P (π|x) is the probability that π is the ranking associated with x.
The goal in label ranking is to learn the mapping X → Ω. The training data is
a set of instances T = {< xi , πi >}, i = 1, . . . , n, where xi are the independent
variables describing instance i and πi is the corresponding target ranking.
As an example, given a scenario where we have financial analysts making
predictions about the evolution of volatile markets, it would be advantageous
to be able to predict which analysts are more profitable in a certain market
context [2]. Moreover, if we could have beforehand the full ordered list of the best
analysts, this would certainly increase the chances of making good investments.
Given the ranking π̂ predicted by a label ranking model for an instance x,
which is, in fact, associated with the true label ranking π, we need to evaluate
the accuracy of the prediction. For that, we need a loss function on Ω. One such
function is the number of discordant label pairs,

D(π, π̂) = #{(i, j)|π(i) > π(j) ∧ π̂(i) < π̂(j)}

which, if normalized into the interval [−1, 1], is equivalent to Kendall’s τ coeffi-
cient. The latter is a correlation measure whose value is 1 for identical rankings
and −1 for inverted rankings. We obtain a loss function by averaging this function over a set of exam-
ples. We will use it as the evaluation measure in this paper, as it has been used in
recent studies [7]. However, other distance measures could have been used, like
Spearman’s rank correlation coefficient [22].
1 An alternative formalization can be found in [25].
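For illustration, a minimal Python sketch of this loss and its normalized form (assuming total orders, i.e. no ties, and representing a ranking π as a tuple where π[a] is the position of label a) is:

# Illustrative sketch: number of discordant label pairs, normalized to Kendall's tau in [-1, 1].
from itertools import combinations

def kendall_tau(pi, pi_hat):
    k = len(pi)
    discordant = sum(1 for i, j in combinations(range(k), 2)
                     if (pi[i] - pi[j]) * (pi_hat[i] - pi_hat[j]) < 0)
    return 1 - 4 * discordant / (k * (k - 1))

# kendall_tau((1, 2, 3), (1, 2, 3)) == 1.0 ; kendall_tau((1, 2, 3), (3, 2, 1)) == -1.0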

3 Association Rules Mining



An association rule (AR) is an implication: A → C where A ∩ C = ∅, A, C ⊆
desc (X), where desc (X) is the set of descriptors of instances in X, typically
pairs ⟨attribute, value⟩. We also denote desc (xi ) as the set of descriptors of
instance xi .
Association rules are typically characterized by two measures, support and
confidence. The support of rule A → C in T is sup if sup% of the cases in it
contain A and C. Additionally, it has a confidence conf in T if conf % of cases
in T that contain A also contain C.
The original method for induction of AR is the APRIORI algorithm that was
proposed in 1994 [1]. APRIORI identifies all AR that have a support and confi-
dence higher than a given minimal support threshold (minsup) and a minimal
confidence threshold (minconf ), respectively. Thus, the model generated is a set
of AR of the form A → C, where A, C ⊆ desc (X), and sup(A → C) ≥ minsup
and conf (A → C) ≥ minconf . For a more detailed description see [1].
Despite the usefulness and simplicity of APRIORI, it runs a time-consuming
candidate generation process and needs space and memory that is proportional
to the number of possible combinations in the database. Additionally, it needs
multiple scans of the database and typically generates a very large number of
rules. Because of this, many new methods were proposed in order to avoid these
problems, such as hashing [19], dynamic itemset counting [6], parallel and
distributed mining [20], and relational database systems integrated with mining [23].
Association rules were originally proposed for descriptive purposes. However,
they have been adapted for predictive tasks such as classification (e.g., [18]).
Given that label ranking is a predictive task, we describe some useful notation
from an adaptation of AR for classification in Section 3.2.

3.1 Pruning
AR algorithms typically generate a large number of rules (possibly tens of thou-
sands), some of which represent only small variations from others. This is known
as the rule explosion problem [4]. It is due to the fact that the algorithm might
find rules for which the confidence can be marginally improved by adding further
conditions to the antecedent.
Pruning methods are usually employed to reduce the amount of rules, without
reducing the quality of the model. A common pruning method is based on the
improvement that a refined rule yields in comparison to the original one [4]. The
improvement of a rule is defined as the smallest difference between the confidence
of a rule and the confidence of all sub-rules sharing the same consequent. More
formally, for a rule A → C

imp(A → C) = min (∀A′ ⊂ A, conf (A → C) − conf (A′ → C))



As an example, if one defines minImp = 0.1%, the rule A1 → C will be kept
if, and only if, conf (A1 → C) − conf (A → C) ≥ 0.001 for every A ⊂ A1 .
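A minimal sketch of this pruning criterion, assuming a lookup table conf from (antecedent, consequent) pairs to confidence values (including, under a literal reading of the definition, the empty antecedent), might look as follows; all names are illustrative only.

# Illustrative sketch of improvement-based pruning (not the authors' code).
from itertools import combinations

def improvement(antecedent, consequent, conf):
    a = frozenset(antecedent)
    # all proper subsets of the antecedent, including the empty set
    subsets = [frozenset(c) for k in range(len(a)) for c in combinations(a, k)]
    return min(conf[(a, consequent)] - conf[(sub, consequent)] for sub in subsets)

# A rule is kept iff improvement(...) >= minImp, e.g. 0.001 in the example above.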

3.2 Class Association Rules

Classification Association Rules (CAR) were proposed as part of the Classifica-
tion Based on AR (CBA) algorithm [18]. A class association rule (CAR) is an

tion Based on AR (CBA) algorithm [18]. A class association rule (CAR) is an
implication of the form: A → λ where A ⊆ desc (X), and λ ∈ L, which is the
class label. A rule A → λ holds in T with confidence conf if conf % of cases in
T that contain A are labeled with class λ, and with support sup in T if sup%
of the cases in it contain A and are labeled with class λ.
CBA takes a tabular data set T = {xi , λi }, where xi is a set of items and λi
the corresponding class, and looks for all frequent ruleitems of the form ⟨A, λ⟩,
where A is a set of items and λ ∈ L. The algorithm aims to choose a set of high
accuracy rules Rλ to match T . Rλ matches an instance < xi , λi >∈ T if there is
at least one rule A → λ ∈ Rλ , with A ⊆ desc(xi ), xi ∈ X, and λ ∈ L. If the rules
cannot classify all examples, a default class is given to them (e.g., the majority
class in the training data).

4 Association Rules for Label Ranking

We define a Label Ranking Association Rule (LRAR) as a straightforward adap-


tation of class association rules (CAR):

A→π

where A ⊆ desc (X) and π ∈ Ω. The only difference is that the label λ ∈ L is
replaced by the ranking of the labels, π ∈ Ω. Similarly to the prediction
made in CBA, when an example matches the rule A → π, the predicted ranking
is π. In this regard, we can use the same basic principle of the ruleitem for CARs
in LRARs, which is ⟨A, π⟩, where A is a set of items and π ∈ Ω.
This approach has two important problems. First, the number of classes can
be extremely large, up to a maximum of k!, where k is the size of the set of
labels, L. This means that the amount of data required to learn a reasonable
mapping X → Ω is too big.
The second disadvantage is that this approach does not take into account
the differences in nature between label rankings and classes. In classification,
two examples either have the same class or not. In this regard, label ranking is
more similar to regression than to classification. This property can be used in
the induction of prediction models. In regression, a large number of observations
with a given target value, say 5.3, increases the probability of observing similar
values, say 5.4 or 5.2, but not so much for very different values, say -3.1 or
100.2. A similar reasoning can be made in label ranking. Let us consider the
case of a data set in which ranking πa = {A, B, C, D, E} occurs in 1% of the
examples. Treating rankings as classes would mean that P (πa ) = 0.01. Let us
further consider that the rankings πb = {A, B, C, E, D}, πc = {B, A, C, D, E}
and πd = {A, C, B, D, E} occur in 50% of the examples. Taking into account the
stochastic nature of these rankings [7], P (πa ) = 0.01 seems to underestimate the
probability of observing πa . In other words it is expected that the observation of
πb , πc and πd increases the probability of observing πa and vice-versa, because
they are similar to each other.
This affects even rankings which are not observed in the available data. For
example, even though πe = {A, B, D, C, E} is not present in the data set it
would not be entirely unexpected to see it in future data.

4.1 Similarity-Based Support and Confidence

To take this characteristic into account, we can argue that the support of a rank-
ing π increases with the observation of similar rankings and that the variation
is proportional to the similarity. Given a measure of similarity between rankings
s(πa , πb ), we can adapt the concept of support of the rule A → π as follows:

suplr (A → π) = (1/n) · Σ_{i: A⊆desc(xi )} s(πi , π)
Essentially, what we are doing is assigning a weight to each target ranking πi in
the training data that represents its contribution to the probability that π
may be observed. Some instances xi ∈ X give a full contribution to the support
count (i.e., 1), while others may give a partial or even a null contribution.
Any function that measures the similarity between two rankings or permuta-
tions can be used, such as Kendall’s τ [16] or Spearman’s ρ [22]. The function
used here is of the form:
s(πa , πb ) = s′(πa , πb ) if s′(πa , πb ) ≥ θsup , and 0 otherwise    (1)

where s′ is a similarity function. This general form assumes that, below a given
threshold θsup , it is not useful to discriminate between different similarity values,
as the corresponding rankings are too different from πa . This means that the support sup of ⟨A, πa⟩ will
have contributions from all the ruleitems of the form ⟨A, πb⟩, for all πb where
s′(πa , πb ) ≥ θsup . Again, many functions can be used as s′.
The confidence of a rule A → π is obtained simply by replacing the measure
of support with the new one:

conflr (A → π) = suplr (A → π) / sup(A)

Given that the loss function that we aim to minimize is known beforehand, it
makes sense to use it to measure the similarity between rankings. Therefore, we
use Kendall’s τ . In this case, we think that θsup = 0 would be a reasonable value,
given that it separates the negative from the positive contributions. Table 1
shows an example of a label ranking dataset represented following this approach.

Table 1. An example of a label ranking dataset to be processed by the APRIORI-LR algorithm

π1 π2 π3
TID A1 A2 A3 (1, 3, 2) (2, 1, 3) (2, 3, 1)
1 L XL S 0.33 0.00 1.00
2 XXL XS S 0.00 1.00 0.00
3 L XL XS 1.00 0.00 0.33

Algorithm 1. APRIORI-LR - APRIORI for Label Ranking


Require: minsup and minconf
Ck : Candidate ruleitems of size k
Fk : Frequent ruleitems of size k
T = {< xi , πi >}: Transactions in the database
F1 = {⟨A, π⟩ : #A = 1 AND suplr (⟨A, π⟩) ≥ minsup}
k=1
while Fk ≠ ∅ do
Ck+1 = {cand = ⟨A1 ∪ A2 , π⟩ : ⟨A1 , π⟩, ⟨A2 , π⟩ ∈ Fk , #(A1 ∪ A2 ) = k + 1}
Fk+1 = {c : c ∈ Ck+1 ∧ suplr (c) ≥ minsup}
k = k+1
end while
F = ∪_{i=1}^{k} Fi
Rπ = {A → π : ⟨A, π⟩ ∈ F ∧ conflr (A → π) ≥ minconf }
return Rπ

To give a clearer interpretation of the example in Table 1, the in-
stance {A1 = L, A2 = XL, A3 = S} (TID=1) contributes to the support count
of the ruleitem ⟨{A1 = L, A2 = XL, A3 = S}, π3 ⟩ with 1. The same instance
will also give a small contribution of 0.33 to the support count of the ruleitem
⟨{A1 = L, A2 = XL, A3 = S}, π1 ⟩, given their similarity. On the other hand, it gives no
contribution to the support count of the ruleitem ⟨{A1 = L, A2 = XL, A3 = S}, π2 ⟩,
since the two rankings are clearly different.
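The weights in Table 1 can be reproduced with a few lines of Python, using Kendall's τ as s′ and θsup = 0 as suggested above; the ranking encoding follows Section 2 and everything else is illustrative.

# Illustrative reproduction of the similarity weights of Table 1.
from itertools import combinations

def kendall_tau(pi_a, pi_b):
    k = len(pi_a)
    disc = sum(1 for i, j in combinations(range(k), 2)
               if (pi_a[i] - pi_a[j]) * (pi_b[i] - pi_b[j]) < 0)
    return 1 - 4 * disc / (k * (k - 1))

def s(pi_a, pi_b, theta_sup=0.0):
    tau = kendall_tau(pi_a, pi_b)
    return tau if tau >= theta_sup else 0.0

rankings = {'pi1': (1, 3, 2), 'pi2': (2, 1, 3), 'pi3': (2, 3, 1)}
# TID 1 has target ranking pi3; its contribution to each column of Table 1:
print({name: round(s(rankings['pi3'], pi), 2) for name, pi in rankings.items()})
# -> {'pi1': 0.33, 'pi2': 0.0, 'pi3': 1.0}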

4.2 APRIORI-LR Algorithm


Using the proposed definitions of support and confidence, the adaptation of any AR
learning algorithm for label ranking is simple. However, for illustration purposes,
we will present an adaptation of the APRIORI algorithm, called APRIORI-LR.
Given a training set T = {< xi , πi >}, i = 1, . . . , n, frequent ruleitems are
generated with Algorithm 1 and transformed into LRARs.
Let Rπ be the set of all the generated label ranking association rules. The
algorithm aims to create a set of high accuracy rules rπ ∈ Rπ to cover T . The
classifier has the following format:

< rπ1 , rπ2 , . . . , rπn >



However, if these are insufficient to rank the given examples, a default ranking
is used. The default ranking can be the average ranking [5], which is often used
for this purpose.
This approach has two problems. The first is that it can only predict rankings
which were present in the training set (except when no rules apply and the
predicted ranking is the default ranking). The second problem is that it solves
conflicts between rankings without taking into account the “continuous” nature
of rankings, which was illustrated earlier. The problem of generating a single
permutation from a set of conflicting rankings has been studied in the context
of consensus rankings.
It has been shown in [15] that a ranking obtained by ordering the average ranks
of the labels across all rankings minimizes the Euclidean distance to all those
rankings. In other words, it maximizes the similarity according to Spearman’s ρ
[22]. Given m rankings πi (i = 1, . . . , m) we aggregate them by computing for
each item j (j = 1, . . . , k):
rj = (1/m) Σ_{i=1}^{m} πi,j
The predicted ranking π̂ is obtained by ranking the items according to the value
of rj .
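A minimal sketch of this aggregation step (illustrative only; ties are broken by label index here) is:

# Illustrative sketch: consensus ranking by ordering the average ranks of the labels.
def consensus(rankings):
    k = len(rankings[0])
    avg = [sum(pi[j] for pi in rankings) / len(rankings) for j in range(k)]
    order = sorted(range(k), key=lambda j: avg[j])   # labels sorted by average rank
    prediction = [0] * k
    for rank, j in enumerate(order, start=1):
        prediction[j] = rank                         # rank 1 = smallest average rank
    return tuple(prediction)

# consensus([(1, 3, 2), (2, 3, 1)]) -> (1, 3, 2)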
We can take advantage of this in the ranker builder in the following way: the
final predicted label ranking is the consensus of all the label rankings in the
consequents of the rules rπ triggered by the test example.
To implement pruning based on improvement for LR, some adaptation is re-
quired as well. Given that the relation between target values is different from
classification, as discussed in Section 4.1, we only compare rules with different
consequents π and π′ if the similarity function satisfies s′(π, π′) ≥ θimp .

4.3 Parameter Tuning


Due to the intrinsic nature of each different dataset, or even of the pre-processing
methods used to prepare the data (e.g., the discretization method), the maximum
minsup/minconf needed to obtain a rule set Rπ that matches all or at least most
of the examples, may vary significantly. We used a greedy method to define the
minimum confidence. As stated earlier, a rule set Rπ matches an example if
at least one rule (A → π) ∈ Rπ has A ⊆ desc(xi ), xi ∈ X. Then, our goal
is to obtain a rule set Rπ that maximizes the number of examples that are
matched, here defined as M . Additionally, we want the best rules, the rules with
the highest confidence values.
The parameter tuning method (Algorithm 2) determines the minconf that
obtains the rule set according to those criteria. To set the step value we consider
that, on one hand, a suitable minconf must be found as soon as possible. On
the other hand, this very same value should be as high as possible. Therefore,
5% seems a reasonable step value.
The ideal value for minsup is as close to 1% as possible. However, in some
datasets, namely those with a larger number of attributes, frequent ruleitem

Table 2. Summary of the datasets

Datasets type #examples #labels #attributes


authorship A 841 4 70
bodyfat B 252 7 7
calhousing B 20640 4 4
cpu-small B 8192 5 6
elevators B 16599 9 9
fried B 40769 5 9
glass A 214 6 9
housing B 506 6 6
iris A 150 3 4
pendigits A 10992 10 16
segment A 2310 7 18
stock B 950 5 5
vehicle A 846 4 18
vowel A 528 11 10
wine A 178 3 13
wisconsin B 194 16 16

Algorithm 2. Parameter tuning Algorithm


minconf = 100%
minsup = 1
while M < 100% do
minconf = minconf − 5%
Run Algorithm 1 with (minsup,minconf ) and determine M
end while
return minconf

generation can be a very time consuming task. In this case, minsup must be set
to a value larger than 1%. In this work, one such example is authorship, which
has 70 attributes.
This procedure has the important advantage that it does not take into account
the accuracy of the rule sets generated, thus reducing the risk of over-fitting.
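For completeness, a minimal Python sketch of this tuning loop is given below; run_apriori_lr is a hypothetical helper assumed to run Algorithm 1 with the given thresholds and return the fraction M of training examples matched by the resulting rule set (the initial run at 100% confidence and the lower bound on minconf are our own safeguards).

# Illustrative sketch of the greedy parameter tuning of Algorithm 2 (not the authors' code).
def tune_minconf(run_apriori_lr, minsup=0.01, step=0.05):
    minconf = 1.0
    coverage = run_apriori_lr(minsup, minconf)   # fraction M of matched examples
    while coverage < 1.0 and minconf - step >= 0.0:
        minconf -= step
        coverage = run_apriori_lr(minsup, minconf)
    return minconf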

5 Experimental Results
The data sets in this work were taken from the KEBI Data Repository at the Philipps
University of Marburg [7] (Table 2). Continuous variables were discretized with
two distinct methods: (1) recursive minimum entropy partitioning criterion ([11])
with the minimum description length (MDL) as stopping rule, motivated by [10]
and (2) equal width bins.
The evaluation measure is Kendall’s τ and the performance of the method was
estimated using ten-fold cross-validation. The performance of APRIORI-LR is
compared with a baseline method, the default ranking (explained earlier) and
RPC [14]. For the generation of frequent ruleitems we used CAREN [3]. The
base learner used in RPC is the Logistic Regression Algorithm, with the default
configurations of the function Logit from the Stats package of the R Programming
Language [21].
Additionally, we compare the performance of our algorithm with the re-
sults obtained with constraint classification (CC), instance-based label ranking
(IBLR) and ranking trees (LRT), that were presented in [7]. We note that we did
not run experiments with these methods and simply compared our results with
the published results of the other methods. Thus, they were probably obtained
with different partitions of the data and cannot be compared directly. However,
they provide some indication of the quality of our method, when compared to
the state-of-the-art.
The value θimp was set to 0 in all experiments. This option may not be as
intuitive as it is for θsup . However, since the focus of this work is the reduction
of the number of generated rules, this value is suitable.

5.1 Results
Table 3 shows that the method obtains results with both discretization methods
that are clearly better than the ones obtained by the baseline method. This
means that the APRIORI-LR is identifying valid patterns that can predict label
rankings.
Table 4 presents the results obtained with pruned rules using the same minsup
and minconf values as in the previous experiments and compares them to RPC
using Logistic Regression as a base learner. Rd represents the percentage re-
duction in the number of rules due to pruning. The results presented clearly show that the
minImp constraint, set to 0.00 and 0.01, succeeded in reducing the number of
rules. However, there was no improvement in accuracy, although it also did not
decrease. Further tests are required to understand how this parameter affects
the accuracy of the models.
Finally, Table 5 compares APRIORI-LR with state-of-the-art methods based
on published results [7]. Given that the methods were not compared under
the same conditions, this simply gives us a rough idea of the quality of the
method proposed here. It indicates that, despite the simplicity of the adapta-
tion, APRIORI-LR is a competitive method. We expect that the results can
be significantly improved, for instance, by implementing more complex pruning
methods.

Table 3. Results obtained with minimum entropy discretization and with equal width
discretization with 3 bins for each attribute

Minimum entropy Equal width (3 bins)


τ τbaseline minsup minconf #rules M τ τbaseline minsup minconf #rules M
authorship .608 .568 20 60 3717 100% NA - - - - -
bodyfat .059 -.064 1 15 3289 98% .161 -.064 1 25 16222 100%
calhousing .291 .048 1 35 221 97% .139 .048 1 20 889 100%
cpu-small .439 .234 1 35 2774 100% .279 .234 1 30 1559 100%
elevators .643 .289 1 60 1864 98% .623 .289 1 60 18160 100%
fried .774 -.005 1 35 1959 97% .676 -.005 1 35 14493 100%
glass .871 .684 1 85 485 99% .794 .684 1 75 11385 100%
housing .758 .058 1 60 2547 96% .577 .058 1 45 5027 100%
iris .960 .089 1 90 115 100% .883 .089 1 80 69 100%
pendigits NA - - - - - .684 .451 10 75 18590 90%
segment .829 .372 4 85 4949 96% .496 .372 35 75 4688 49%
stock .890 .070 1 75 1606 100% .836 .070 1 65 1168 100%
vehicle .774 .179 7 80 10480 99% .675 .179 15 80 6662 83%
vowel .680 .195 1 70 21419 99% .709 .195 1 70 143882 100%
wine .844 .329 15 95 5960 100% .910 .329 1 95 165263 100%
wisconsin .031 -.031 1 0 1224 92% .280 -.031 5 20 404773 100%

Table 4. Comparison of APRIORI-LR with RPC

Minimum entropy Equal width (3 bins)


APRIORI-LR RPC APRIORI-LR RPC
τ mImp=0 Rd(%) mImp=1 Log. R. τ mImp=0 Rd(%) mImp=1 Log. R.
authorship 0.608 0.634 -40 0.637 0.900 NA NA - NA 0.905
bodyfat 0.059 0.057 -20 0.058 0.264 0.161 0.156 -98 0.156 0.175
calhousing 0.291 0.299 -54 0.300 0.227 0.139 0.112 -83 0.110 0.132
cpu-small 0.439 0.421 -91 0.418 0.446 0.279 0.271 -97 0.271 0.286
elevators 0.643 0.647 -93 0.651 0.650 0.623 0.620 -98 0.621 0.621
fried 0.774 0.731 -71 0.730 0.827 0.676 0.674 -93 0.676 0.671
glass 0.871 0.834 -83 0.833 0.898 0.794 0.767 -98 0.776 0.846
housing 0.758 0.753 -84 0.753 0.648 0.577 0.559 -96 0.562 0.552
iris 0.960 0.961 -75 0.961 0.862 0.883 0.876 -63 0.881 0.756
pendigits NA NA NA NA NA 0.684 0.682 -96 0.685 0.879
segment 0.829 0.828 -89 0.828 0.935 0.496 0.496 -100 0.500 0.878
stock 0.890 0.875 -75 0.874 0.795 0.836 0.822 -88 0.822 0.675
vehicle 0.774 0.781 -91 0.775 0.841 0.675 0.675 -97 0.674 0.820
vowel 0.680 0.686 -91 0.685 0.670 0.709 0.718 -97 0.721 0.571
wine 0.844 0.871 -96 0.884 0.925 0.910 0.877 -99 0.875 0.892
wisconsin 0.031 0.030 -8 0.031 0.612 0.280 0.286 -99 0.293 0.478

Table 5. Comparison of APRIORI-LR with state-of-the-art methods

APRIORI-LR
EW ME CC IBLR LRT
authorship NA 0.608 0.920 0.936 0.882
bodyfat 0.161 0.059 0.281 0.248 0.117
calhousing 0.139 0.291 0.250 0.351 0.324
cpu-small 0.279 0.439 0.475 0.506 0.447
elevators 0.623 0.643 0.768 0.733 0.760
fried 0.676 0.774 0.999 0.935 0.890
glass 0.794 0.871 0.846 0.865 0.883
housing 0.577 0.758 0.660 0.745 0.797
iris 0.883 0.960 0.836 0.966 0.947
pendigits 0.684 NA 0.903 0.944 0.935
segment 0.496 0.829 0.914 0.959 0.949
stock 0.836 0.890 0.737 0.927 0.895
vehicle 0.675 0.774 0.855 0.862 0.827
vowel 0.709 0.680 0.623 0.900 0.794
wine 0.910 0.844 0.933 0.949 0.882
wisconsin 0.280 0.031 0.629 0.506 0.343

6 Conclusions

In this paper we present a simple adaptation of an association rules algorithm


for label ranking. This adaptation essentially consists of 1) ensuring that rules
have label rankings in their consequent, 2) using variations of the support and
confidence measures that are suitable for label ranking and 3) generating the
model with parameters selected by a simple greedy algorithm.
These results clearly show that this is a viable label ranking method. It out-
performs a simple baseline and competes well with RPC, which means that,
despite its simplicity, it is inducing useful patterns.
Additionally, the results obtained indicate that the choice of the discretization
method and the number of bins per attribute play an important role in the
accuracy of the models. The tests indicate that the supervised discretization
method (minimum entropy), gives better results than equal width partitioning.
This is, however, not the main focus of this work.
Improvement-based pruning was successfully implemented and reduced the
number of rules substantially. This plays an important role in gener-
ating models with higher interpretability.

The new framework proposed in this work, based on distance functions, is


consistent with the classical concepts underlying association rules. Furthermore,
although it was developed in the context of the label ranking task, it can also be
adapted for other tasks such as regression and classification. In fact, Classifica-
tion Association Rules can be regarded as a special case of distance-based AR,
where the distance function is 0-1 loss.
This work uncovered several possibilities that could be better studied in or-
der to improve the algorithm’s performance. They include: improving the pre-
diction generation method; implementing better pruning methods; developing a
discretization method that is suitable for label ranking; and the choice of pa-
rameters.
For evaluation, we have used a measure that is typically used in label rank-
ing. However, it is important to give more importance to higher ranks than to
lower ones which can be done, for instance, with the weighted rank correlation
coefficient [8].
Additionally, it is essential to test the methods on real label ranking problems.
The KEBI datasets are adapted from UCI classification problems. We plan to
test our methods on other problems including algorithm selection and predicting
the rankings of financial analysts [2]. In terms of real-world applications, these
can be adapted to rank analysts, based on their past performance, and also radios,
based on users’ preferences.

Acknowledgments

This work was partially supported by project Rank! (PTDC/EIA/81178/2006)


from FCT and Palco AdI project Palco3.0 financed by QREN and Fundo Eu-
ropeu de Desenvolvimento Regional (FEDER). We thank the anonymous referees
for useful comments.

References

1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large
databases. In: VLDB, pp. 487–499 (1994)
2. Aiguzhinov, A., Soares, C., Serra, A.P.: A similarity-based adaptation of naive
bayes for label ranking: Application to the metalearning problem of algorithm
recommendation. In: Pfahringer, B., Holmes, G., Hoffmann, A. (eds.) DS 2010.
LNCS, vol. 6332, pp. 16–26. Springer, Heidelberg (2010)
3. Azevedo, P.J., Jorge, A.M.: Ensembles of jittered association rule classifiers. Data
Min. Knowl. Discov. 21(1), 91–129 (2010)
4. Bayardo, R., Agrawal, R., Gunopulos, D.: Constraint-based rule mining in large,
dense databases. Data Mining and Knowledge Discovery 4(2), 217–240 (2000)
5. Brazdil, P., Soares, C., Costa, J.: Ranking Learning Algorithms: Using IBL and
Meta-Learning on Accuracy and Time Results. Machine Learning 50(3), 251–277
(2003)
6. Brin, S., Motwani, R., Ullman, J.D., Tsur, S.: Dynamic itemset counting and im-
plication rules for market basket data. In: Proceedings of the 1997 ACM SIGMOD
international conference on Management of data - SIGMOD 1997, pp. 255–264
(1997)
7. Cheng, W., Hühn, J., Hüllermeier, E.: Decision tree and instance-based learning
for label ranking. In: ICML 2009: Proceedings of the 26th Annual International
Conference on Machine Learning, pp. 161–168. ACM, New York (2009)
8. Pinto da Costa, J., Soares, C.: A weighted rank measure of correlation. Australian
& New Zealand Journal of Statistics 47(4), 515–529 (2005)
9. Dekel, O., Manning, C.D., Singer, Y.: Log-linear models for label ranking. Advances
in Neural Information Processing Systems (2003)
10. Dougherty, J., Kohavi, R., Sahami, M.: Supervised and unsupervised discretiza-
tion of continuous features. In: Machine Learning - International Workshop Then
Conference, pp. 194–202 (1995)
11. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued at-
tributes for classification learning. In: IJCAI, pp. 1022–1029 (1993)
12. Fürnkranz, J., Hüllermeier, E.: Preference learning. KI 19(1), 60 (2005)
13. Har-Peled, S., Roth, D., Zimak, D.: Constraint classification: A new approach to
multiclass classification. In: Cesa-Bianchi, N., Numao, M., Reischuk, R. (eds.) ALT
2002. LNCS (LNAI), vol. 2533, pp. 365–379. Springer, Heidelberg (2002)
14. Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by learning
pairwise preferences. Artif. Intell. 172(16-17), 1897–1916 (2008)
15. Kemeny, J., Snell, J.: Mathematical Models in the Social Sciences. MIT Press,
Cambridge (1972)
16. Kendall, M., Gibbons, J.: Rank correlation methods. Griffin, London (1970)
17. Lebanon, G., Lafferty, J.D.: Conditional Models on the Ranking Poset. In: NIPS,
pp. 415–422 (2002)
18. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining.
In: Knowledge Discovery and Data Mining, pp. 80–86 (1998)
19. Park, J.S., Chen, M.S., Yu, P.S.: An effective hash-based algorithm for mining
association rules. ACM SIGMOD Record 24(2), 175–186 (1995)
20. Park, J.S., Chen, M.S., Yu, P.S.: Efficient parallel and data mining for association
rules. In: CIKM, pp. 31–36 (1995)
21. R Development Core Team: R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna, Austria (2010),
http://www.R-project.org ISBN 3-900051-07-0
22. Spearman, C.: The proof and measurement of association between two things.
American Journal of Psychology 15, 72–101 (1904)
23. Thomas, S., Sarawagi, S.: Mining generalized association rules and sequential pat-
terns using sql queries. In: KDD, pp. 344–348 (1998)
24. Todorovski, L., Blockeel, H., Džeroski, S.: Ranking with Predictive Clustering
Trees. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) ECML 2002. LNCS (LNAI),
vol. 2430, pp. 444–455. Springer, Heidelberg (2002)
25. Vembu, S., Gärtner, T.: Label Ranking Algorithms: A Survey. In: Fürnkranz, J.,
Hüllermeier, E. (eds.) Preference Learning. Springer, Heidelberg (2010)
Tracing Evolving Clusters
by Subspace and Value Similarity

Stephan Günnemann1, Hardy Kremer1,
Charlotte Laufkötter2, and Thomas Seidl1
1 Data Management and Data Exploration Group,
RWTH Aachen University, Germany
{guennemann,kremer,seidl}@cs.rwth-aachen.de
2 Institute of Biogeochemistry and Pollutant Dynamics,
ETH Zürich, Switzerland
charlotte.laufkoetter@env.ethz.ch

Abstract. Cluster tracing algorithms are used to mine temporal evo-


lutions of clusters. Generally, clusters represent groups of objects with
similar values. In a temporal context like tracing, similar values corre-
spond to similar behavior in one snapshot in time. Each cluster can be
interpreted as a behavior type and cluster tracing corresponds to track-
ing similar behaviors over time. Existing tracing approaches are designed
for datasets satisfying two specific conditions: The clusters appear in all
attributes, i.e. fullspace clusters, and the data objects have unique iden-
tifiers. These identifiers are used for tracking clusters by measuring the
number of objects two clusters have in common, i.e. clusters are traced
based on similar object sets.
These conditions, however, are strict: First, in complex data, clusters
are often hidden in individual subsets of the dimensions. Second, map-
ping clusters based on similar objects sets does not reflect the idea of
tracing similar behavior types over time, because similar behavior can
even be represented by clusters having no objects in common. A tracing
method based on similar object values is needed. In this paper, we in-
troduce a novel approach that traces subspace clusters based on object
value similarity. Neither subspace tracing nor tracing by object value
similarity has been done before.
1 Introduction
Temporal properties of patterns and their analysis are under active research [5].
A well known type of pattern are clusters, corresponding to similarity-based
groupings of data objects. A good example for clusters are customer groups.
Clusters can change in the course of time and understanding this evolution can
be used to guide future decisions [5], e.g. predicting whether a specific customer
behavior will occur. The evolution can be mined by cluster tracing algorithms
that find mappings between clusters of consecutive time steps [8,13,14].
The existing algorithms have a severe limitation: Clusters are mapped if the
corresponding object sets are similar, i.e. the algorithms check whether the pos-
sibly matching clusters have a certain fraction of objects in common; they are
unable to map clusters with different objects, even if the objects have similar
attribute values. Our novel method, however, maps clusters only if their corre-
sponding object values are similar, independently of object identities. That is,
we trace similar behavior types, which is a fundamentally different concept. This
is a relevant scenario, as the following two examples illustrate.
Consider scientific data of the earth’s surface with the attributes temperature
and smoke degree. The latter correlates with forest fire probability. The attribute
values are recorded over several months. In this dataset, at some point in time
a high smoke degree and high temperatures occur in the northern hemisphere;
six months later the same phenomenon occurs in the southern hemisphere, as
the seasons on the hemispheres are shifted half-yearly to each other. Another
example is the customer behavior of people in different countries. Often it is
similar, but shifted in time. For example, the customer behavior in Europe is
similar to the behavior in North America, but only some months later. Obviously,
a cluster tracing algorithm should detect these phenomena; however, existing
methods do not, since the observed populations, i.e. the environment and the
people respectively, stay at the same place, and thus there are no shared objects
between clusters — only the behavior migrates.
With today’s complex data, patterns are often hidden in different subsets
of the dimensions; for detecting these clusters with locally relevant dimensions,
subspace clustering was introduced. However, despite that many temporal data
sets are of this kind, e.g. gridded scientific data, subspace clustering has never
been used in a cluster tracing scenario. The existing cluster tracing methods
can only cope with fullspace clusters, and thus cannot exploit the information
mined by subspace clustering algorithms. Our novel tracing method measures
the subspace similarity of clusters and thus handles subspace clusters by design.
Summarized, we introduce a method for tracing behavior types in tempo-
ral data; the types are represented by clusters. The decision, which clusters of
consecutive time steps are mapped is based on a novel distance function that
tackles the challenges of object value similarity and subspace similarity. Our
approach can handle the following developments: emerging or disappearing be-
havior as well as distinct behaviors that converge into uniform behavior and
uniform behavior that diverges into distinct behaviors. By using subspaces, we
enable the following evolutions: Behavior can gain or lose characteristics; i.e.,
the representing subspace clusters can gain or lose dimensions over time, and
clusters that have different relevant dimensions can be similar. Varying behavior
can be detected; that is, to some extent the values of the representing clusters
can change.
Clusterings of three time steps are illustrated in Fig. 1. The upper part shows
the objects; the lower part abstracts from the objects and illustrates possible
clusterings of the datasets and tracings between the corresponding clusters. Note
that the three time steps do not share objects, i.e. each time step corresponds to
a different database from the same attribute domain {d1 , d2 }; to illustrate the
different objects, we used varying object symbols. An example for behavior that
gains characteristics is the mapping of Cluster C1,1 to C2,1 , i.e. the cluster gains
Fig. 1. Top: databases of consecutive time steps; bottom: possible clusterings and exemplary cluster tracings

Fig. 2. Example of a mapping graph, illustrating four time steps

one dimension. Varying behavior is illustrated by the mapping from C1,2 to C2,2 ;
the values of the cluster have changed. If the databases were spatial, this could
be interpreted as a movement. A behavior divergence can be seen from time step
t + 1 to t + 2: the single cluster C2,1 is mapped to the two clusters C3,1 and C3,2 .

2 Related Work
Several temporal aspects of data are regarded in the literature [5]. In stream clus-
tering scenarios, clusters are adapted to reflect changes in the observed data, i.e.
the distribution of incoming objects changes [2]. A special case of stream cluster-
ing is for moving objects [10], focusing on spatial attributes. Stream clustering
in general, however, gives no information about the actual cluster evolution over
time [5]. For this, cluster tracing algorithms were introduced [8,13,14]; they rely
on mapping clusters of consecutive time steps. These tracing methods map clus-
ters if the corresponding object sets are similar, i.e. they are based on shared
objects. We, in contrast, map clusters only if their corresponding object values
are similar, independently of shared objects. That is, we trace similar types of
behavior, which is a fundamentally different concept.
Clustering of trajectories [7,15] can be seen as an even more limited variant
of cluster tracing with similar object sets, as trajectory clusters have constant
object sets that do not change over time.
The work in [1] analyzes multidimensional temporal data based on dense
regions that can be interpreted as clusters. The approach is designed to detect
substantial changes of dense regions; however, tracing of evolving clusters that
slightly change their position or subspace is not possible, especially when several
time steps are observed.
A further limitation of existing cluster tracing algorithms is that they can
only cope with fullspace clusters. Fullspace clustering models use all dimensions
in the data space [6]. For finding clusters hidden in individual dimensions, sub-
space clustering was introduced [4]. An overview of different subspace clustering
approaches can be found in [9], and the differences between subspace clustering
approaches are evaluated in [11]. Until now, subspace clusters were only ap-
plied in streaming scenarios [3], but never in cluster tracing scenarios; deciding
whether subspace clusters of varying dimensionalities are similar is a challenging
issue. Our algorithm is designed for this purpose.

3 A Novel Tracing Model
Our main objective is to trace behavior types and their developments over time.
First, some basic notations: For each time step t ∈ {1, . . . , T } of our temporal
data we have a D-dim. database DBt ⊆ RD . We assume the data to be normal-
ized between [0, 1]. A subspace cluster Ct,i = (Ot,i , St,i ) at time step t is a set
of objects Ot,i ⊆ DBt along with a set of relevant dimensions St,i ⊆ {1, . . . , D}.
The objects are similar within these relevant dimensions. The set of all subspace
clusters {Ct,1 , . . . , Ct,k } at time step t is denoted subspace clustering Clust , and
each included subspace cluster represents a behavior type (e.g. a person group).

3.1 Tracing of Behavior Types
In this section, we determine whether a typical behavior in time step t continues
in t + 1. Customer behavior, for example, can be imitated by another population
in the next time step. Other kinds of temporal developments are the disappear-
ance of a behavior or a split-up into different behaviors. We have to identify
these temporal developments for effective behavior tracing. Formally, we need
a mapping function that maps each cluster at a given time step to a set of its
successors in the next time step; we denote these successors as temporal contin-
uations. Two clusters Ct,i and Ct+1,j are mapped if they are identified as similar
behaviors. We use a cluster distance function, introduced in Sec. 3.2, to measure
these similarities. If the distance is small enough, the mapping is performed.
Definition 1. Mapping function. Given a distance function dist for two clus-
ters, the mapping function Mt : Clust → P(Clust+1 ) that maps a cluster to its
temporal continuations is defined by
Mt (Ct,i ) = {Ct+1,j | dist(Ct,i , Ct+1,j ) < τ }
A cluster can be mapped to zero, one, or several clusters (1:n), and several map-
pings to the same cluster are possible (m:1), enabling detection of disappearance
or convergence of behaviors. We describe pairs of mapped clusters by a binary
relation: Rt = {(Ct,i , Ct+1,j ) | Ct+1,j ∈ Mt(Ct,i )} ⊆ Clust × Clust+1 .
Each tuple corresponds to one cluster mapping, i.e. for a behavior type in t we
have a similar one in the next time step t + 1. These mappings and the clusters
can be represented by a mapping graph. Reconsider that it is possible to map a
behavior to several behaviors in the next time step (cf. Fig. 1, t+1 → t+2). These
mappings, however, are not equally important. We represent this by edge weights
within the mapping graph; the weights indicate the strength of the temporal
continuation. We measure similarity based on distances, and thus small weights
denote a strong continuation and high weights reflect a weaker one. Formally,
Definition 2. Mapping graph. A mapping graph G = (V, E, w) is a directed
and weighted graph with the following properties:
– Nodes represent clusters, i.e. V = ⋃_{t=1}^{T} Clus_t
– Edges represent cluster mappings, i.e. E = ⋃_{t=1}^{T−1} R_t
– Edge weights indicate the strength of the temporal continuations, i.e.
∀(Ci, Cj) ∈ E : w(Ci, Cj) = dist(Ci, Cj)

Figure 2 illustrates an exemplary mapping graph with edge weights. A mapping
graph allows us to categorize temporal developments:

Definition 3. Kinds of temporal developments. Given a mapping graph
G = (V, E, w), the behaviors represented by clusters C ∈ V can be categorized:
– a behavior disappears, if outdegree(C) = 0
– a behavior emerges, if indegree(C) = 0
– a behavior diverges, if outdegree(C) > 1
– different behaviors converge into a single behavior, if indegree(C) > 1

These categories show whether a behavior appears in similar ways in the sub-
sequent time step. Since the characteristics of a behavior can naturally change
over time, we also trace single behaviors over several time steps, denoted as an
evolving cluster and described by a path in the mapping graph.
Evolving clusters are correctly identified if specific evolution criteria are
accounted for in our distance function. These are presented in the following section.
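To make Definitions 1-3 concrete, the following is a minimal Python sketch (not the authors' implementation) of how the mapping relation and mapping graph could be assembled, assuming per-time-step clusterings and a cluster distance function dist as introduced in Sec. 3.2; all identifiers are illustrative.

```python
def build_mapping_graph(clusterings, dist, tau):
    """clusterings[t] is the list of subspace clusters found at time step t."""
    edges = {}  # ((t, i), (t + 1, j)) -> weight; the relations R_t plus edge weights
    for t in range(len(clusterings) - 1):
        for i, c in enumerate(clusterings[t]):
            for j, c_next in enumerate(clusterings[t + 1]):
                d = dist(c, c_next)
                if d < tau:                          # Definition 1: map if distance < tau
                    edges[((t, i), (t + 1, j))] = d  # Definition 2: weight = distance
    return edges


def classify(node, edges):
    """Definition 3: categorize the cluster identified by node = (t, i)."""
    out_deg = sum(1 for (src, _dst) in edges if src == node)
    in_deg = sum(1 for (_src, dst) in edges if dst == node)
    kinds = []
    if out_deg == 0:
        kinds.append("disappears")
    if in_deg == 0:
        kinds.append("emerges")
    if out_deg > 1:
        kinds.append("diverges")
    if in_deg > 1:
        kinds.append("converges")
    return kinds
```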
3.2 Cluster Distance Measure

Our objective is to identify similar behaviors. Technically, a distance measure is
needed that determines the similarity of two given clusters. Keep in mind that
measuring the similarity simply based on the fraction of shared objects cannot
satisfy our objective, since even totally different populations can show a similar
behavior in consecutive time steps.
We have to distinguish two kinds of evolution: A cluster can gain or lose
characteristics, i.e. the relevant dimensions of a subspace cluster can evolve, and
within the relevant dimensions the values can change. Our distance function has
to reflect both aspects for effective similarity measurement of evolving clusters.
Similarity based on subspaces. Each cluster represents a behavior type,
and because we are considering subspace clusters, the characteristics of a be-
havior are restricted to a subset of the dimensions. If a behavior remains stable
over time, its subspace remains also unchanged. The relevant dimensions of the
underlying clusters are identical. Consider the clusters Ct,i = (Ot,i , St,i ) and
Ct+1,j = (Ot+1,j , St+1,j ) of time steps t and t + 1: the represented behaviors are
very similar if the dimensions St,i are also included in St+1,j .
However, a behavior can lose some of its characteristics. In Fig. 1, for example,
the dimension d1 is no longer relevant in time step t+2 for the behavior depicted
on the bottom. Accordingly, a distance measure is reasonable if behavior types
are considered to be similar even if they lose some relevant dimensions. That is,
the smaller the term 1 − |St,i ∩ St+1,j| / |St,i|, the more similar are the clusters.
This formula alone, however, would prevent an information gain: If a cluster
Ct,i evolves to Ct+1,j by spanning more relevant dimensions, this would not
be assessed positively. We would get the same distance for a cluster with the
same shared dimensions as Ct,i and without additional relevant dimensions as
Ct+1,j. Since more dimensions mean more information, we do consider this.
Consequently, the smaller the term 1 − |St+1,j \ St,i| / |St+1,j|, the more similar the clusters.
Usually it is more important for tracing that we retain relevant dimensions.
Few shared dimensions and many new ones normally do not indicate similar be-
havior. Thus, we need a trade-off between retained dimensions and new (gained)
dimensions. This is achieved by a linear combination of the two introduced terms:
Definition 4. Distance w.r.t. subspaces. The similarity w.r.t. subspaces
between two clusters Ct,i = (Ot,i, St,i) and Ct+1,j = (Ot+1,j, St+1,j) is defined by

S(Ct,i, Ct+1,j) = α · (1 − |St+1,j \ St,i| / |St+1,j|) + (1 − α) · (1 − |St,i ∩ St+1,j| / |St,i|)

with the trade-off factor α ∈ [0, 1]. By choosing α ≪ 1 − α we achieve that the
similarity between two behaviors is primarily rated based on their shared dimensions.
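As an illustration, a minimal Python sketch of the subspace distance of Definition 4, with relevant dimensions as plain sets and alpha as the trade-off factor; the code is illustrative, not the authors' implementation.

```python
def subspace_distance(S_t, S_t1, alpha=0.1):
    """Definition 4: distance w.r.t. subspaces between clusters at t and t+1."""
    gained = len(S_t1 - S_t)      # dimensions newly spanned at t+1
    shared = len(S_t & S_t1)      # dimensions retained from t
    return (alpha * (1.0 - gained / len(S_t1))
            + (1.0 - alpha) * (1.0 - shared / len(S_t)))

# e.g. a cluster keeping {1, 2} and gaining {3} is closer than one keeping only {1}:
# subspace_distance({1, 2}, {1, 2, 3}) < subspace_distance({1, 2}, {1, 4})
```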
Similarity based on statistical characteristics. Besides the subspace
similarity, the actual values within these dimensions are important. E.g., al-
though two clusters share a dimension like ’income’, they can differ in their
values extremely (high vs. low income); these behaviors should not be mapped.
A small change in the values, however, is possible for evolving behaviors. For a
spatial dimension, this change would correspond to a slight cluster movement.
Given a cluster C = (O, S), we denote the set of values in dimension d with
v(C, d) = {o[d] | o ∈ O}. The similarity between two clusters Ct,i and Ct+1,j is
thus achieved by analyzing the corresponding sets v(Ct,i , d) and v(Ct+1,j , d). By
deducing two normal distributions Xd and Yd with means μx, μy and standard deviations σx,
σy from the two sets, the similarity can be measured by the information theoretic
Kullback-Leibler divergence (KL). Informally, we calculate the expected number
of bits required to encode a new distribution of values at time step t + 1 (Yd )
given the original distribution of the values at time step t (Xd ). Formally,
KL(Yd ‖ Xd) = ln(σx / σy) + (σy² + (μy − μx)²) / (2σx²) − 1/2  =:  KL(Ct,i, Ct+1,j, d)
By using the KL, we do not just account for the absolute deviation of the
means, but we have also the advantage of including the variances. A behavior
with a high variance in a single dimension allows a higher evolution of the means
for successive similar behaviors. A small variance of the values, however, only
permits a smaller deviation of the means.
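A small sketch of the per-dimension Kullback-Leibler term defined above, assuming the value sets v(C, d) are available as 1-D NumPy arrays; illustrative only.

```python
import numpy as np

def kl_dimension(values_t, values_t1):
    """KL(Y_d || X_d) for the Gaussians fitted to the values at t (X) and t+1 (Y)."""
    mu_x, sigma_x = values_t.mean(), values_t.std()    # X_d: distribution at time t
    mu_y, sigma_y = values_t1.mean(), values_t1.std()  # Y_d: distribution at time t+1
    # assumes sigma_x, sigma_y > 0, i.e. the values are not constant in dimension d
    return (np.log(sigma_x / sigma_y)
            + (sigma_y ** 2 + (mu_y - mu_x) ** 2) / (2.0 * sigma_x ** 2)
            - 0.5)
```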
Fig. 3. Core dimensions in a 7-dim. space. (left: databases; right: clusters)

We use the KL for the similarity per dimension, and the overall similarity is
attained by cumulating over several dimensions. Apparently, we just have to use
dimensions that are in the intersection of both clusters. The remaining dimen-
sions are non-relevant for at least one cluster and hence are already penalized by
our subspace distance function. Our first approach for computing the similarity
based on statistical characteristics is

V(Ct,i, Ct+1,j, I) = ( Σ_{d∈I} KL(Ct,i, Ct+1,j, d) ) / |I|     (1)

with I = St,i ∩ St+1,j for averaging.
In a perfect scenario this distance is a good way to trace behaviors. In prac-
tice, however, we face the following problem: Consider Fig. 3 (note the 7-dim.
space) with the cluster C1,2 at time step t and the cluster C2,2 with the same
relevant dimensions in t+1. However, C2,2 is shifted in dimensions d1 and d2 ; the
distance function proposed above (Eq. 1) would determine a very high value and
hence the behaviors would not be mapped. A large part {d3 , ..., d7 } of the shared
relevant dimensions {d1 , ..., d7 }, however, show nearly the same characteristics
in both clusters. The core of the behaviors is completely identical, and thus a
mapping is reasonable; as illustrated by the mapping in the right part of Fig. 3.
Consider another example: the core of the customer behaviors of North Ameri-
cans and Europeans is identical; however, North Americans and Europeans have
further typical characteristics like their favorite sport (baseball vs. soccer). These
additional, non-core, dimensions provide us with further information about the
single clusters at their current time step. They are mainly induced by the individ-
ual populations. For the continuation of the behavior, however, these dimensions
are not important. Note that non-core dimensions are a different concept than
non-relevant ones; non-core dimensions are shared relevant ones with differing
values. In other words, there are two different kinds of relevant dimensions: one
for subspace clusters and one for tracing of subspace clusters.
An effective distance function between clusters has to identify the core of the
behaviors and incorporate it into the distance. We achieve this by using a sub-
set Core ⊆ St,i ∩ St+1,j for comparing the values in Eq. 1 instead of the whole
intersection. Unfortunately, this subset is not known in advance, and it is not
reasonable to exclude dimensions from the distance calculation by a fixed thresh-
old if the corresponding dissimilarity is too large. Thus, we develop a variant to
automatically determine the core. We choose the ’best’ core among all possible
cores for the given two clusters. That is, for each possible core we determine the

distance w.r.t. their value distributions, and we additionally penalize dimensions
not included in the core. The core with the smallest overall distance is selected,
i.e. we trade off the size of the core against the value V (Ct,i , Ct+1,j , Core):

Definition 5. The core-based distance function w.r.t. values for two
clusters Ct,i = (Ot,i, St,i) and Ct+1,j = (Ot+1,j, St+1,j) is defined by

V(Ct,i, Ct+1,j) = min_{Core ⊆ St,i ∩ St+1,j, |Core| > 0} [ β · |NonCore| / |St,i ∩ St+1,j| + (1 − β) · V(Ct,i, Ct+1,j, Core) ]

with the penalty factor β ∈ [0, 1] for the dimensions NonCore = (St,i ∩ St+1,j) \ Core.
By selecting a smaller core, the first part of the distance formula enlarges. The
second part, however, gains the possibility of determining a smaller value. The
core must comprise at least one dimension; otherwise, we could map two clusters
even if they have no dimensions with similar characteristics.
Overall distance function. To correctly identify the evolving clusters in
our temporal data we have to consider evolutions in the relevant dimensions as
well as in the value distributions. Thus, we have to use both distance measures
simultaneously. Again, we require that two potentially mapped clusters share at
least one dimension; otherwise, these clusters cannot represent similar behaviors.
Definition 6. The Overall distance function for clusters Ct,i = (Ot,i , St,i )
and Ct+1,j = (Ot+1,j , St+1,j ) with |St,i ∩ St+1,j | > 0 is defined by
dist(Ct,i , Ct+1,j ) = γ · V (Ct,i , Ct+1,j ) + (1 − γ) · S(Ct,i , Ct+1,j )
with γ ∈ [0, 1]. In the case of |St,i ∩ St+1,j | = 0, the distance is set to ∞.
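Putting Definitions 5 and 6 together, a minimal sketch that enumerates all cores exhaustively (exponential in the number of shared dimensions); a cluster is assumed to be a pair of a per-dimension value map and a set of relevant dimensions, and kl_dimension and subspace_distance are the sketches given earlier. Illustrative only.

```python
from itertools import combinations

def value_distance(c_t, c_t1, beta=0.1):
    """Definition 5: core-based distance w.r.t. values."""
    values_t, S_t = c_t
    values_t1, S_t1 = c_t1
    shared = sorted(S_t & S_t1)
    best = float("inf")
    for size in range(1, len(shared) + 1):          # the core must be non-empty
        for core in combinations(shared, size):
            avg_kl = sum(kl_dimension(values_t[d], values_t1[d]) for d in core) / len(core)
            penalty = beta * (len(shared) - len(core)) / len(shared)  # non-core dimensions
            best = min(best, penalty + (1.0 - beta) * avg_kl)
    return best


def overall_distance(c_t, c_t1, alpha=0.1, beta=0.1, gamma=0.3):
    """Definition 6: gamma trades off value similarity against subspace similarity."""
    _, S_t = c_t
    _, S_t1 = c_t1
    if not (S_t & S_t1):                            # no shared dimension at all
        return float("inf")
    return (gamma * value_distance(c_t, c_t1, beta)
            + (1.0 - gamma) * subspace_distance(S_t, S_t1, alpha))
```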

3.3 Clustering for Improved Tracing Quality
Until now, we have assumed a given clustering per time step such that we can de-
termine the distances and the mapping graph. In general, our tracing model is
independent of the used clustering method. However, since there are temporal
relations between consecutive time steps, we develop a clustering method whose
accuracy is improved by these relations and that avoids totally different clus-
terings in consecutive time steps. A direct consequence is an improved tracing
effectiveness. We adapt the effective cell-based clustering paradigm [12,16,11],
where clusters are approximated by hypercubes with at least minSup many
objects. The extent of a hypercube is restricted to w in its relevant dimensions.
Definition 7. Hypercube and valid subspace cluster. A hypercube HS with
the relevant dimensions S is defined by lower and upper bounds
HS = [low1 , up1 ] × [low2 , up2 ] × . . . × [lowD , upD ]
with upi − lowi ≤ w ∀i ∈ S and lowi = −∞, upi = ∞ ∀i ∉ S. The mean
of HS is called mHS . The hypercube HS represents all objects Obj(HS ) ⊆ DB
with o ∈ Obj(HS ) ⇔ ∀d ∈ {1, . . . , D} : lowd ≤ o[d] ≤ upd . A subspace cluster
C = (O, S) is valid ⇔ ∃HS with Obj(HS ) = O and |Obj(HS )| ≥ minSup.
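A small sketch of the validity check in Definition 7, with the database as an (n × D) NumPy array and the hypercube given by its bound vectors (±∞ outside the relevant dimensions); illustrative only.

```python
import numpy as np

def objects_in_cube(data, low, up):
    """Objects Obj(H_S) covered by the hypercube [low_1, up_1] x ... x [low_D, up_D]."""
    mask = np.all((data >= low) & (data <= up), axis=1)
    return data[mask]

def is_valid_cluster(data, low, up, min_sup):
    """A subspace cluster is valid if its hypercube covers at least minSup objects."""
    return len(objects_in_cube(data, low, up)) >= min_sup
```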

We now introduce how temporal relations between time steps can be exploited.
Predecessor information. We assume an initial clustering at time step
t = 1. (We discuss this later.) Caused by the temporal aspect of the data,
clusters at a time step t occur with high probability in t + 1 — not identical,
but similar. Given a cluster and the corresponding hypercube HS at time step
t, we try to find a cluster at the next time step in a similar region. We use a
Monte Carlo approach, i.e. we draw a random point mt+1 ∈ RD that represents
the initiator of a new hypercube and that is nearby the mean mHS of HS . After
inducing a hypercube by an initiator, the corresponding cluster’s validity is
checked. The quantity of initiators is calculated by a formula introduced in [16].
Definition 8. Initiator of a hypercube. A point p ∈ RD , called initiator,
together with a width w and a subspace S induces a hypercube HSw (p) defined by
∀d ∈ S : lowd = p[d] − w/2, upd = p[d] + w/2 and ∀i ∉ S : lowi = −∞, upi = ∞.
Formally, the initiator mt+1 is drawn from the region HS2w (mHS ), permitting a
change of the cluster. The new hypercube is then HSw (mt+1 ). With this method
we detect changes in the values; however, also the relevant dimensions can
change: The initiator mt+1 can induce different hypercubes for different rele-
vant dimensions S. Accordingly, beside the initiator, we have to determine the
relevant subspace of the new cluster. The next section discusses both issues.
Determining the best cluster. A first approach is to use a quality function
[12,16]: μ(HS) = |Obj(HS)| · k^|S|. The more objects or the more relevant dimen-
sions are covered by the cluster, the higher is its quality. These objectives are
contrary: a trade-off is realized with the parameter k. In time step t + 1 we could
choose the subspace S that maximizes μ(HSw (mt+1 )).
This method, however, optimizes the quality of each single cluster; it is not
intended to find good tracings. Possibly, the distance between each cluster from
the previous clustering Clust and our new cluster is large, and we would find no
similar behaviors. Our solution is to directly integrate the distance function dist
into the quality function. Consequently, we choose the subspace S such that the
hypercube HSw (mt+1 ) maximizes our novel distance based quality function.
Definition 9. Distance based quality function. Given the hypercube HS in
subspace S and a clustering Clust , the distance based quality function is
q(HS) = μ(HS) · (1 − min_{Ct ∈ Clust} {dist(Ct, CS)})
where CS indicates the induced subspace cluster of the hypercube HS .
We enhance the quality of the clustering by selecting a set of possible initiators
M from the specified region; this is also important as the direction of a cluster
change is not known in advance. From the resulting set of potential clusters, we
select the one that has the highest quality.
Overall, we ensure that for each cluster C ∈ Clust a potential temporal con-
tinuation is identified in time step t + 1. Nonetheless it is also possible that no
valid hypercube is identified for a single cluster C ∈ Clust . This indicates that
a behavior type has disappeared in the current time step.
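The following sketch illustrates Definitions 8 and 9: an initiator is drawn near the mean of the predecessor's hypercube, the induced cube is built, and candidates are scored with the distance-based quality. It reuses objects_in_cube from the sketch above; k is the trade-off parameter of μ, dist the overall cluster distance, and make_cluster a placeholder that turns the covered objects and a subspace into a cluster object. All names are assumptions for illustration.

```python
import numpy as np

def induce_cube(p, S, w, D):
    """Definition 8: hypercube H_S^w(p) induced by initiator p."""
    low, up = np.full(D, -np.inf), np.full(D, np.inf)
    dims = list(S)
    low[dims] = p[dims] - w / 2.0    # bounded only in the relevant dimensions
    up[dims] = p[dims] + w / 2.0
    return low, up

def draw_initiator(rng, cube_mean, S, w):
    """Draw m_{t+1} from H_S^{2w}(m_{H_S}), i.e. within +/- w of the mean in S."""
    p = cube_mean.copy()
    dims = list(S)
    p[dims] += rng.uniform(-w, w, size=len(dims))
    return p

def distance_based_quality(data, low, up, S, k, prev_clusters, dist, make_cluster):
    """Definition 9: q(H_S) = mu(H_S) * (1 - min over C_t of dist(C_t, C_S))."""
    objs = objects_in_cube(data, low, up)
    mu = len(objs) * k ** len(S)                      # mu(H_S) = |Obj(H_S)| * k^|S|
    c_new = make_cluster(objs, S)
    return mu * (1.0 - min(dist(c, c_new) for c in prev_clusters))
```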

Uncovered objects and the initial clustering. When behavior emerges
or disappears, there will be some objects of the current time step that are not
part of any identified cluster: if we denote the set of clusters generated so far by
Clust+1, the set Remaint+1 := DBt+1 \ ⋃_{Ci = (Oi, Si) ∈ Clust+1} Oi can still contain objects
and therefore clusters. Especially for the initial clustering at time step t = 1
we have no predecessor information and hence Clus1 = ∅. To discover as many
patterns as possible, we have to check if the objects within Remaint+1 induce
novel clusters. We draw a set of initiators M ⊆ Remaint+1 , where each m ∈ M
induces a set of hypercubes HSw (m) in different subspaces. Finally, we choose
the hypercube that maximizes our quality function. If this hypercube is a valid
cluster, we add it to Clust+1 , and thus Remaint+1 is reduced. This procedure is
repeated until no valid cluster is identified or the set Remaint+1 is empty. Note
that our method has the advantage of generating overlapping clusters.
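A sketch of the loop over the uncovered objects Remain_{t+1}, which doubles as the initial clustering at t = 1; it reuses induce_cube and is_valid_cluster from the sketches above, and candidate_subspaces and score stand in for the subspace enumeration and the quality function. Purely illustrative.

```python
import numpy as np

def cluster_remaining(data, w, min_sup, rng, candidate_subspaces, score, n_init=20):
    clusters, covered = [], np.zeros(len(data), dtype=bool)
    D = data.shape[1]
    while not covered.all():
        remain = np.flatnonzero(~covered)
        idx = rng.choice(remain, size=min(n_init, len(remain)), replace=False)
        best = None
        for p in data[idx]:                          # initiators drawn from Remain
            for S in candidate_subspaces:
                low, up = induce_cube(p, S, w, D)
                if is_valid_cluster(data, low, up, min_sup):
                    cand = (low, up, S)
                    if best is None or score(cand) > score(best):
                        best = cand
        if best is None:                             # no valid cluster found -> stop
            break
        clusters.append(best)
        low, up, _ = best                            # covered objects leave Remain;
        covered |= np.all((data >= low) & (data <= up), axis=1)  # clusters may overlap
    return clusters
```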

4 Experiments
Setup. We use real world and synthetic data for evaluation. Real world data
are scientific grid data reflecting oceanographic characteristics as temperature
and salinity of the oceans1 . It contains 20 time steps, 8 dimensions, and 71,430
objects. The synthetic data cover 24 time steps and 20 dimensions. On average,
each time step contains 10 clusters with 5-15 relevant dimensions. We hide de-
velopments (emerge, converge, diverge, or disappear) and evolutions (subspace
and value changes) within the data. In our experiments we concentrate on the
quality of our approach. For synthetic data, the correct mappings between the
clusters are given. Based on the detected mappings we calculate the precision
and recall values: we check whether all but only the true mappings between
clusters are detected. For tracing quality we use the F1 value corresponding to
the harmonic mean of recall and precision. Our approach tackles the problem of
tracing clusters with varying subspaces and is based on object-value-similarity.
Even if we constrained our approach to handle only fullspace clusters like existing
solutions, a comparison would only be possible if we artificially added object ids
to the data (to be used by these solutions). Tracing clusters based on such
artificial object ids, however, cannot reflect the ground truth in the data. In
summary, comparisons to other approaches are not performed since they would be
unfair. We use Opteron 2.3GHz CPUs and Java6 64bit.
Tracing quality. First, we analyze how the parameters affect the tracing
effectiveness. For lack of space, we only present a selection of the experiments.
For α, a default value of 0.1 was empirically determined. γ is evaluated in Fig. 4
for three different τ values using synthetic data. By γ we determine the trade-off
between subspace similarity and value similarity in our overall distance function.
Obviously we want to prevent extreme cases for effective tracing, i.e. subspace
similarity with no attribute similarity at all (γ → 0), or vice versa. This is
confirmed by the figure, as the tracing quality highly degrades, when γ reaches
0 or 1 for all τ values. As γ = 0.3 enables a good tracing quality for all three τ ,
1
Provided by the Alfred Wegener Institute for Polar and Marine Research, Germany.
Fig. 4. Tracing quality for different γ & τ

Fig. 5. Evaluation of the core dimension concept

we use this as default. Note that with the threshold τ we can directly influence
how many cluster mappings are created. τ = 0.1 is a good trade-off and is used
as default. With a bigger τ the tracing quality worsens: too many mappings are
created and we cannot distinguish between meaningful and meaningless mappings.
The same holds for τ → 0: no clusters are mapped and thus the tracing quality
reaches zero; we therefore excluded the plots for τ → 0.
The core dimension concept is evaluated in Fig. 5. We analyze the influence on
the tracing quality (left axis) with a varying β on the x-axis; i.e., we change the
penalty for non-core dimensions. Note, non-core dimensions are a different con-
cept than non-relevant ones; non-core dimensions are shared relevant dimensions
with differing values. The higher the penalty, the more dimensions are included
in the dimension core; i.e., more shared dimensions are used for the value-based
similarity. In a second curve, we show the absolute number of non-core dimen-
sions (right axis) for the different penalties: the number decreases with higher
penalties. In this experiment the exact number of non-core dimensions in the
synthetic data is 10. We can draw the following conclusions regarding tracing
quality: A forced usage of a full core (β → 1) is a bad choice, as there can be
some shared dimensions with different values. By lowering the penalty we allow
some dimensions to be excluded from the core and thus we can increase the
tracing quality. With β = 0.1 the highest tracing quality is obtained; this is
plausible as the number of non-core dimensions corresponds to the number that
is existent in the data. A too low penalty, however, results in excluding nearly
all dimensions from the core (many non-core dimensions, β → 0) and dropping
quality. In the experiments, we use β = 0.1 as default.
Detection of behavior developments. Next we analyze whether our model
is able to detect the different behavior developments. Up to now, we used our
enhanced clustering method that utilizes the predecessor information and the
distance based quality function. Now, we additionally compare this method with
a variant that performs clustering of each step independently. In Fig. 6 we use
the oceanographic dataset and we determine for each time step the number of
disappeared behaviors for each clustering method. The experiment indicates that
the number of unmapped clusters for the approach without any predecessor or
distance information is larger than for our enhanced approach. By transferring
the clustering information between the time steps, a larger number of clusters
from one time step to the next can be mapped. We map clusters over a longer
time period, yielding a more effective tracing of evolving clusters.
Fig. 6. Effects of predecessor information & distance quality function

Fig. 7. Cumulated number of evolutions & developments over 24 time steps

Fig. 8. Number of evolutions & developments on real world data; left: cumulated over 20 time steps, right: for each time step

The aim of tracing is not just to map similar clusters but also to identify
different kinds of evolution and development. In Fig. 7 we plot the number of
clusters that gain or lose dimensions and the four kinds of development cumu-
lated over all time steps. Beside the numbers our approach detects, we show the
intended number based on this synthetic data. The first four bars indicate that
our approach is able to handle dimension gains or losses; i.e., we enable sub-
space cluster tracing, which is not considered by other models. The remaining
bars show that also the developments can be accurately detected. Overall, the
intended transitions are found by our tracing. In Fig. 8 we perform a similar
experiment on real world data. We report only the detected number of patterns
because exact values are not given. On the left we cumulate over all time steps.
Again, our approach traces clusters with varying dimensions. Accordingly, on
real world data it is a relevant scenario that subspace clusters lose some of their
characteristics, and it is mandatory to use a tracing model that handles these
cases. The developments are also identified in this real world data. To show
that the effectiveness is not restricted to single time steps, we analyze the de-
tected patterns for each time step individually on the right. Based on the almost
constant slopes of all curves, we can see that our approach performs effectively.

5 Conclusion
In this paper, we proposed a model for tracing evolving subspace clusters in high
dimensional temporal data. In contrast to existing methods, we trace clusters

based on their behavior; that is, clusters are not mapped based on the fraction of
objects they have in common, but on the similarity of their corresponding object
values. We enable effective tracing by introducing a novel distance measure that
determines the similarity between clusters; this measure comprises subspace and
value similarity, reflecting how much a cluster has evolved. In the experimental
evaluation we showed that high quality tracings are generated.
Acknowledgments. This work has been supported by the UMIC Research
Centre, RWTH Aachen University.

References
1. Aggarwal, C.C.: On change diagnosis in evolving data streams. TKDE 17(5), 587–
600 (2005)
2. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for clustering evolving
data streams. In: VLDB, pp. 81–92 (2003)
3. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A framework for projected clustering
of high dimensional data streams. In: VLDB, pp. 852–863 (2004)
4. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clus-
tering of high dimensional data for data mining applications. In: SIGMOD, pp.
94–105 (1998)
5. Böttcher, M., Höppner, F., Spiliopoulou, M.: On exploiting the power of time in
data mining. SIGKDD Explorations 10(2), 3–11 (2008)
6. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering
clusters in large spatial databases with noise. In: KDD, pp. 226–231 (1996)
7. Gaffney, S., Smyth, P.: Trajectory clustering with mixtures of regression models.
In: KDD, pp. 63–72 (1999)
8. Kalnis, P., Mamoulis, N., Bakiras, S.: On discovering moving clusters in spatio-
temporal data. In: Anshelevich, E., Egenhofer, M.J., Hwang, J. (eds.) SSTD 2005.
LNCS, vol. 3633, pp. 364–381. Springer, Heidelberg (2005)
9. Kriegel, H.P., Kröger, P., Zimek, A.: Clustering high-dimensional data: A sur-
vey on subspace clustering, pattern-based clustering, and correlation clustering.
TKDD 3(1), 1–58 (2009)
10. Li, Y., Han, J., Yang, J.: Clustering moving objects. In: KDD, pp. 617–622 (2004)
11. Müller, E., Günnemann, S., Assent, I., Seidl, T.: Evaluating clustering in subspace
projections of high dimensional data. In: VLDB, pp. 1270–1281 (2009)
12. Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.M.: A monte carlo algorithm
for fast projective clustering. In: SIGMOD, pp. 418–427 (2002)
13. Rosswog, J., Ghose, K.: Detecting and tracking spatio-temporal clusters with adap-
tive history filtering. In: ICDM Workshops, pp. 448–457 (2008)
14. Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., Schult, R.: MONIC - modeling and
monitoring cluster transitions. In: KDD, pp. 706–711 (2006)
15. Vlachos, M., Gunopulos, D., Kollios, G.: Discovering similar multidimensional tra-
jectories. In: ICDE, pp. 673–684 (2002)
16. Yiu, M.L., Mamoulis, N.: Frequent-pattern based iterative projected clustering. In:
ICDM, pp. 689–692 (2003)
An IFS-Based Similarity Measure to Index Electroencephalograms

Ghita Berrada and Ander de Keijzer

MIRA - Institute for Biomedical Technology and Technical Medicine
University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
{g.berrada,a.dekeijzer}@utwente.nl

Abstract. EEG is a very useful neurological diagnosis tool, inasmuch
as the EEG exam is easy to perform and relatively cheap. However, it
generates large amounts of data, not easily interpreted by a clinician.
Several methods have been tried to automate the interpretation of EEG
recordings. However, their results are hard to compare since they are
tested on different datasets. This means a benchmark database of EEG
data is required. However, for such a database to be useful, we have to
solve the problem of retrieving information from the stored EEGs with-
out having to tag each and every EEG sequence stored in the database
(which can be a very time-consuming and error-prone process). In this
paper, we present a similarity measure, based on iterated function sys-
tems, to index EEGs.

Keywords: clustering, indexing, electroencephalograms (EEG), iterated
function systems (IFS).

1 Introduction

An electroencephalogram (EEG) captures the brain’s electric activity through
several electrodes placed on the scalp1 . The result is a multidimensional time
series2 . An EEG signal can be classified into several types of cerebral waves char-
acterised by their frequencies, amplitudes, morphology, stability, topography and
reactivity. The interpretation of the sequence of cerebral waves, their localisa-
tion and context of occurrence (eg eyes closed EEG or sleep EEG) leads to a
diagnosis. The complexity of the sequences of cerebral waves, the non-specificity
of EEG recordings (for example, without any context being given, the EEG
recording of a chewing artifact can be mistaken for that of a seizure (see figure
1)) and the amount of data generated make the interpretation process a difficult,
time-consuming and error-prone one. Consequently, the interpretation process is
being automated, in part at least, through several methods mostly consisting in
extracting features from EEGs and applying classification algorithms to the sets
1
Usually 21 in the International 10/20 System.
2
19 channels in the International 10/20 System.


Fig. 1. EEG with a chewing artifact

of extracted features to discriminate between two different patient states (usu-
ally the ”normal” state and a pathological state). For example, empirical mode
decomposition and Fourier-Bessel expansion are used in [13] to discriminate be-
tween ictal EEGs (i.e EEGs of an epileptic seizure) and seizure-free EEGs. The
interpretation methods are usually tested on different datasets. To make them
comparable, a benchmark database of EEGs is required. Such a database has
to be designed so as to be able to handle queries in natural language such as the
following sample queries:
1. find EEGs of non-convulsive status epilepticus
2. find EEGs showing rhythms associated with consumption of benzodiazepines
and remove all artefacts from them
Obtaining a simple answer to this set of queries would require the EEG dataset
to be heavily and precisely annotated and tagged. But what if the annotations
are scarce or not available? Furthermore, the whole process of annotating and
tagging each and every sequence of the EEG dataset is time-consuming and error-
prone. This means that feature extraction techniques are necessary to solve all of
these queries since they can help define a set of clinical features representative of
a particular pathology (query 1) or detect particular sets of patterns and process
the EEG based on them (query 2). EEG recordings correspond to very diverse
conditions ( eg. ”normal” state, seizure episodes, Alzheimer disease). Therefore,
a generic method to index EEGs without having to deal with disease-specific
features is required3 . Generic methods to index time series often rely on the def-
inition of a similarity measure. Some of the similarity measures proposed include
a function interpolation step, be it piecewise linear interpolation or interpolation
with AR (as in [8] to distinguish between normal EEGs and EEGs originating
from the injured brain undergoing transient global ischemia) or ARIMA models,
that can followed by a feature extraction step (eg. computation of LPC cep-
stral coefficients from the ARIMA model of the time series as in [9]). However,
ARIMA/AR methods assume that the EEG signal is stationary, which is not
a valid assumption. In fact, EEG signals can only be considered as stationary
during short intervals, especially intervals of normal background activity, but
3
As the number of disease-specific classifiers grows exponentially.

the stationarity assumption does not hold during episodes of physical or men-
tal activity, such as changes in alertness and wakefulness, during eye blinking
and during transitions between various ictal states. Therefore, EEG signals are
quasi-stationary. In view of that, we propose a similarity measure based on IFS
interpolation to index EEGs in this paper, as fractal interpolation does not as-
sume stationarity of the data and can adequately model complex structures.
Moreover, using fractal interpolation makes computing features such as the frac-
tal dimension simple (see theorem 21 for the link between fractal interpolation
parameters and fractal dimension) and the fractal dimension of EEGs is known
to be a relevant marker for some pathologies such as dementia (see [7]).

2 Background
2.1 Fractal Interpolation
Fractal dimension. Any given time series can be viewed as the observed data
generated by an unknown manifold or attractor. One important property of this
attractor is its fractal dimension. The fractal dimension of an attractor counts
the effective number of degrees of freedom in the dynamical system and therefore
quantifies its complexity. It can also be seen as the statistical quantity that gives
an indication of how completely a fractal object appears to fill space, as one
zooms down to finer and finer scales. Another dimension, called the topological
dimension or Lebesgue Covering dimension, is also defined for any object and a
fortiori for the attractor. A space has Lebesgue Covering dimension n if for every
open cover4 of that space, there is an open cover that refines it such that the
refinement5 has order at most n + 1. For example, the topological dimension of
the Euclidean space Rn is n. The attractor of a time series can be fractal (ie its
fractal dimension is higher than its topological dimension) and is then called a
strange attractor. The fractal dimension is generally a non-integer or fractional
number. Typically, for a time series, the fractal dimension lies between
1 and 2 since the (topological) dimension of a plane is 2 and that of a line is 1.
The fractal dimension has been used to:
– uncover patterns in datasets and cluster data ([10,2,15])
– analyse medical time series ([14,6]) such as EEGs ([1,7])
– determine the number of features to be selected from a dataset for a similarity
search while obviating the ”dimensionality curse” ([12])

Iterated function systems. We denote as K a compact metric space for which
a distance function d is defined and as C(K) the space of continuous functions
4 A covering of a subset S is a collection C of open subsets in X whose union contains
at least all of S. A subset S ⊂ X is open if it is an arbitrary union of open balls in
X. This means that every point in S is surrounded by an open ball which is entirely
contained in X. An open ball in a metric space X is defined as a subset of X of
the form B(x0, ε) = {x ∈ X | d(x, x0) < ε}, where x0 is a point of X and ε a radius.
5 A refinement of a covering C of S is another covering C′ of S such that each set B
in C′ is contained in some set A in C.

on K. We define over K a finite collection of mappings W = {wi}_{i∈[1,n]} and their
associated probabilities {pi}_{i∈[1,n]} such that

pi ≥ 0  and  Σ_{i=1}^{n} pi = 1.

We also define an operator T on C(K) as (Tf)(x) = Σ_{i=1}^{n} pi (f ◦ wi)(x). If T
maps C(K) into itself, then the pair (wi, pi) is called an iterated function system
on (K, d). The condition on T is satisfied for any set of probabilities pi if the
transformations wi are contracting, in other words, if, for any i, there exists a
δi < 1 such that d(wi(x), wi(y)) ≤ δi d(x, y) ∀x, y ∈ K. The IFS is also denoted as
hyperbolic in this case.

Principle of fractal interpolation. If we define a set of points (xi , Fi ) ∈ R2 :
i = 0, 1, ..., n with x0 < x1 < ... < xn , then an interpolation function corre-
sponding to this set of points is a continuous function f : [x0 , xn ] → R such that
f (xi ) = Fi for i ∈ [0, n]. In fractal interpolation, the interpolation function is
often constructed with n affine maps of the form:
wi(x, y) = (ai x + ei, ci x + di y + fi),   i = 1, 2, ..., n

where di is constrained to satisfy −1 ≤ di ≤ 1. Furthermore, we have the
following constraints:

wi(x0, y0) = (xi−1, yi−1)  and  wi(xn, yn) = (xi, yi)

After determining the contraction parameter di , we can estimate the four re-
maining parameters (namely ai ,ci ,ei ,fi ):
ai = (xi − xi−1) / (xn − x0)                                          (1)
ci = (yi − yi−1) / (xn − x0) − di (yn − y0) / (xn − x0)               (2)
ei = (xn xi−1 − x0 xi) / (xn − x0)                                    (3)
fi = (xn yi−1 − x0 yi) / (xn − x0) − di (xn y0 − x0 yn) / (xn − x0)   (4)

di can be determined using the geometrical approach given in [11]. Let t be a
time series with end-points (x0, y0) and (xn, yn), and (xp, yp) and (xq, yq) two
consecutive interpolation points so that the map parameters desired are those
defined for wp. We also define α as the maximum height of the entire function
measured from the line connecting the end-points (x0, y0) and (xn, yn), and β
as the maximum height of the curve measured from the line connecting (xp, yp)
and (xq, yq). α and β are positive (respectively negative) if the maximum value is
reached above the line (respectively below the line). The contraction factor dp
is then defined as β/α. This procedure is also valid when the contraction factor is
computed for an interval instead of for the whole function. The end-points are
then taken as being the end-points of the interval. For more details on fractal
interpolation, see [3,11].
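As an illustration, a sketch of how one map could be estimated for the interval between two consecutive interpolation points (indices p and q into the series), following the equations and the geometric estimate of the contraction factor above; x and y are 1-D NumPy arrays, and the code is a simplified reading of the procedure, not the authors' implementation.

```python
import numpy as np

def signed_max_height(xs, ys):
    """Signed maximum deviation of the curve from the chord joining its end-points."""
    chord = ys[0] + (ys[-1] - ys[0]) * (xs - xs[0]) / (xs[-1] - xs[0])
    dev = ys - chord
    return dev[np.argmax(np.abs(dev))]

def contraction_factor(x, y, p, q):
    alpha = signed_max_height(x, y)                   # w.r.t. (x_0, y_0) and (x_n, y_n)
    beta = signed_max_height(x[p:q + 1], y[p:q + 1])  # w.r.t. (x_p, y_p) and (x_q, y_q)
    return beta / alpha

def map_parameters(x, y, p, q, d):
    """a, c, e, f of the map for the interval [(x_p, y_p), (x_q, y_q)], cf. Eqs. (1)-(4)."""
    x0, xn, y0, yn = x[0], x[-1], y[0], y[-1]
    xp, xq, yp, yq = x[p], x[q], y[p], y[q]
    a = (xq - xp) / (xn - x0)
    e = (xn * xp - x0 * xq) / (xn - x0)
    c = (yq - yp) / (xn - x0) - d * (yn - y0) / (xn - x0)
    f = (xn * yp - x0 * yq) / (xn - x0) - d * (xn * y0 - x0 * yn) / (xn - x0)
    return a, c, e, f
```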

Estimation of the fractal dimension from a fractal interpolation. The
theorem that links the fractal interpolation function and its fractal dimension is
given in [3]. The theorem is as follows:

Theorem 21. Let n be a positive integer greater than 1, {(xi, Fi) ∈ R2 : i = 1, 2, ..., n}
a set of points and {R2; wi, i = 1, 2, ..., n} an IFS associated with the set of points where

wi(x, y) = (ai x + ei, ci x + di y + fi)  for i = 1, 2, ..., n.

The vertical scaling factors di satisfy 0 ≤ di < 1 and the constants ai, ci, ei and
fi are defined as in section 2.1 (in equations 1, 2, 3 and 4) for i = 1, 2, ..., n.
We denote by G the attractor of the IFS, such that G is the graph of a fractal
interpolation function associated with the set of points.
If Σ_{i=1}^{n} |di| > 1 and the interpolation points do not lie on a straight line, then
the fractal dimension of G is the unique real solution D of Σ_{i=1}^{n} |di| ai^{D−1} = 1.

2.2 K-Medoid Clustering
An m × m symmetric similarity matrix S can be associated to the EEGs to be
indexed (with m being the number of EEGs to index):
S = [ d11 d12 ... d1m ; d12 d22 ... d2m ; ... ; d1m d2m ... dmm ],  where dnm is the distance between EEGs n and m   (5)

Given the computed similarity matrix S (defined by equation 5), we can use the
k-medoids algorithm to cluster the EEGs. This algorithm requires the number
of clusters k to be known. We describe our choice of the number of clusters
below, in section 2.3. The k-medoids algorithm is similar to k-means and can be
applied through the use of the EM algorithm. k random elements are, initially,
chosen as representatives of the k clusters. At each iteration, a representative
element of a cluster is replaced by a randomly chosen nonrepresentative element
of the cluster if the selected criterion (e.g. mean-squared error) is improved by
this choice. The data points are then reassigned to their closest cluster, given
the new cluster representative elements. The iterations are stopped when no
reassignment is possible. We use the PyCluster function kmedoids described in
[5] to make our k-medoids clustering.
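For illustration, a minimal sketch of this clustering step, assuming the m × m similarity matrix S of Eq. (5) is available as a 2-D array and that PyCluster [5] is installed; the kmedoids call follows the PyCluster interface, but the snippet is a sketch rather than the exact code used in the experiments.

```python
import Pycluster

def cluster_eegs(similarity_matrix, k, npass=100):
    """k-medoids over the pairwise EEG similarity matrix; returns, for each EEG,
    the index of the medoid representing its cluster, and the within-cluster error."""
    clusterid, error, nfound = Pycluster.kmedoids(similarity_matrix,
                                                  nclusters=k, npass=npass)
    return clusterid, error
```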

2.3 Choice of Number of Clusters
The number of clusters in the dataset is estimated based on the similarity matrix
obtained following the steps in section 3 and using the method described in [4].
The method described in [4] takes the similarity matrix and outputs a vector
called envelope intensity associated to the similarity matrix. The number of
distinct regions in the plot of the envelope intensity versus the index gives an
estimation of the number of clusters. For details on how the envelope intensity
vector is computed, see [4].

3 An IFS-Based Similarity Measure

3.1 Fractal Interpolation Step

We interpolate each channel of each EEG (except the annotations channel) us-
ing piecewise fractal interpolation. For this purpose, we split each EEG channel
into windows and then estimate the IFS for each window. The previous descrip-
tion implies that a few parameters, namely the window size and therefore the
embedding dimension, have to be determined before estimating the piecewise
fractal interpolation function for each channel. The embedding dimension is de-
termined thanks to Takens’ theorem which states that, for the attractor of a
time series to be reconstructed correctly (i.e the same information content is
found in the state (latent) and observation spaces), the embedding dimension
denoted m satisfies : m > 2D + 1 where D is the dimension of the attractor, in
other words its fractal dimension. Since the fractal dimension of a time series
is between 1 and 2, we can get a satisfactory embedding dimension as long as
m > 2 ∗ 2 + 1 i.e m > 5. We therefore choose an embedding dimension equal to
6. And we choose the lag τ between different elements of the delay vector to be
equal to the average duration of an EEG data record i.e 1s. Therefore, we split
our EEGs in (non-overlapping) windows of 6 seconds. A standard 20-minutes
EEG (which therefore contains about 1200 data records of 1 second) would then
be split in about 200 windows of 6 seconds. Each window is subdivided into
intervals of one second each and the end-points of these intervals are taken as
interpolation points. This means there are 7 interpolation points per interval:
the starting point p0 of the window, the point one second away from p0 , the
point two seconds from p0 , the point three seconds away from p0 , the point four
seconds away from p0 , the point five seconds away from p0 and the last point of
the window. The algorithm6 to compute the fractal interpolation function per
window is as follows:

1. Choose, as an initial point, the starting point of the interval considered (the
first interval considered is the interval corresponding to the first second of
the window).
2. Choose, as the end point of the interval considered, the next interpolation
point.
3. Compute the contraction factor d for the interval considered.
4. If |d| > 1 go to 2, otherwise go to 5.
5. Form the map wi associated with the interval considered. In other words,
compute the a, c, e and f parameters associated to the interval (see equa-
tions).
Apply the map to the entire window (i.e. the six-second window) to yield
wi(x, y) for all x in the window.
6. Compute and store the distance between
the original values of the time series on the interval considered (i.e the inter-
val constructed in steps 2 and 3) and the values given by wi on that interval.
A possible distance is the Euclidean distance.
6
Inspired from [11].

7. Go to 2 until the end of the window is reached.
8. Store the interpolation points and contraction factor which yield the minimum
distance between the original values on the interval and the values yielded by
the computed map under the influence of each individual map in steps 5 and 6.
9. Repeat steps 1 to 8 for each window of the EEG channel.
10. Apply steps 1 to 9 to all EEG channels.

3.2 Fractal Dimensions Estimation

After this fractal interpolation step, each window of each signal is represented by
5 parameters instead of by signal frequency · window duration points. The
dimension of the analysed time series is therefore reduced in this step. For a stan-
dard 20-minutes EEG containing 23 signals of frequency 250 Hz, this amounts to
representing each signal with 1000 values instead of 50000 and the whole EEG with
23000 values instead of 1150000, thus to reducing the number of signal values
by almost 98%. This dimension reduction may be exploited in future work to
compress EEGs and store compressed representations of EEGs in the database
instead of raw EEGs as the whole EEGs can be reconstructed from their fractal
interpolations. Further work needs to be done on the compression of EEG data
using fractal interpolation and the loss of information that may result from this
compression. Then, for each EEG channel and for each window, we compute the
fractal dimension thanks to theorem 21. The equation of theorem 21 is solved
heuristically for each 6-second interval of each EEG signal using a bisection al-
gorithm. As we know that the fractal dimension for a time series is between 1
and 2, we search a root of the equation of theorem 21 in the interval [1,2] and
split the search interval by half at each iteration until the value of the root is
approached by an ε-margin (ε being the admissible error on the desired root7).
Therefore, for each EEG channel, we have the same number of computed frac-
tal dimensions as the number of windows. This feature extraction
step (fractal dimension computations) further reduces the dimensionality of the
analysed time series. In fact, the number of values representing the time series is
divided by 5 in this step. This leads to representing a standard 20-minute EEG
containing 23 signals of frequency 250 Hz by 4600 values instead of the initial
1150000 points.
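A sketch of the bisection described above, solving the equation of Theorem 21, Σ_i |d_i| a_i^{D−1} = 1, for D in [1, 2]; a and d are the per-interval map parameters of one window, and the code is illustrative only.

```python
def fractal_dimension(a, d, eps=1e-4):
    """Unique root D of sum_i |d_i| * a_i**(D - 1) = 1, searched in [1, 2] by bisection."""
    def g(D):
        return sum(abs(di) * ai ** (D - 1.0) for ai, di in zip(a, d)) - 1.0
    lo, hi = 1.0, 2.0
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        # every a_i < 1, so g is decreasing in D; g(1) > 0 by the theorem's assumption
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```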

3.3 Similarity Matrix Computation

We only compare EEGs that have at least a subset of identical channels (i.e
having the same labels). When two EEGs don’t have any channels (except the
annotations channel) in common, the similarity measure between them is set to
1 (as the farther (resp. closer) the distance between two EEGs, the higher (resp.
lower) and the closer to 1 (resp. closer to 0) the similarity measure). If, for the
two EEGs compared, the matching pairs of feature vectors (i.e vectors made of
7 We choose ε = 0.0001 in our experiments.

the fractal dimensions computed for each signal) do not have the same dimension
then the vector of highest dimension is approximated by a histogram and the
M most frequent values according to the histogram (M being the dimension of
the shortest vector) are taken as representatives of that vector and the distance
between the two feature vectors is approximated by the distance between the
shortest feature vector and the vector formed with the M most frequent values
of the longest vector. The similarity measure between two EEGs is given by:
    Σ_{i=1}^{N} (1/N) · (d(ch_i^{EEG1}, ch_i^{EEG2}) − dmin) / (dmax − dmin)

where N is the number of EEG channels, d(ch_i^{EEG1}, ch_i^{EEG2}) the distance be-
tween the fractal dimensions extracted from channels with the same label in the
two EEGs compared and dmin and dmax respectively the minimum and maxi-
mum distances between two EEGs in the analysed set. We choose as metrics (d)
the Euclidean distance and the normalized mutual information.
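A sketch of this comparison for the Euclidean variant of d is given below. The histogram approximation for feature vectors of unequal length is illustrated with simple equal-width binning, and d_min/d_max are assumed to have been computed beforehand over the whole analysed set; neither choice is fixed by the text above, so both are assumptions of the sketch.

import numpy as np

def align(fd_a, fd_b, bins=50):
    # If the two fractal-dimension vectors differ in length, replace the longer
    # one by the M most frequent values of its histogram (M = shorter length).
    if len(fd_a) == len(fd_b):
        return fd_a, fd_b
    short, long_ = sorted((fd_a, fd_b), key=len)
    counts, edges = np.histogram(long_, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    representatives = centers[np.argsort(counts)[::-1][:len(short)]]
    return short, representatives

def eeg_similarity(eeg1, eeg2, d_min, d_max):
    # eeg1, eeg2 map channel labels to per-window fractal-dimension vectors.
    # Returns (1/N) * sum_i (d(ch_i) - d_min) / (d_max - d_min) over the N
    # channels the two EEGs share, or 1.0 when no channel is shared.
    common = sorted(set(eeg1) & set(eeg2))
    if not common:
        return 1.0
    total = 0.0
    for ch in common:
        a, b = align(np.asarray(eeg1[ch], float), np.asarray(eeg2[ch], float))
        d = np.linalg.norm(a - b)
        total += (d - d_min) / (d_max - d_min)
    return total / len(common)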

4 Description of the Dataset and Experiments


We interpolate (with fractal interpolation, as described in section 3) 476 EEGs8
whose durations range from 1 minute 50 seconds to 5 hours 21 minutes and
whose sizes are between 1133KB and 138 MB. All signals in all these files have
a frequency of 250Hz. Of the files used, 260 have a duration between 15 and 30
minutes (54.6%)-which is the most frequent duration range for EEGs-, 40 files
(8.4%) a duration below 15 minutes and 176 files (37%) a duration higher than
30 minutes. Moreover, 386 files contain 23 signals (81.1 %), 63 20 signals (13.2
%), 13 19 signals (2.7 %), 7 25 signals (1.5 %), 3 28 signals (0.6 %), 1 12 signals
(0.2 %), 2 13 signals (0.4 %) and 1 2 signals (0.2 %). The experiments were run
on an openSuSe 10.3(x86-64) (kernel version 2.6.22.5-31) server (RAM 32GB,
Intel® Quad-Core Xeon® E5420 @ 2.50 GHz processor). The files for which the
diagnosis conclusion is either unknown or known to be abnormal without any
further details are not considered in the distance computation and clustering
steps described in section 3. This means that the distance computation and
clustering steps are performed on a subset of 328 files of the original 476 files.
The similarity matrix obtained is a 328 × 328 matrix. The files contained in the
subset chosen for clustering can be separated in 4 classes: normal EEG (195 files
i.e 59.5%), EEG of epilepsy(64 files i.e 19.5%), EEG of encephalopathy(31 files
i.e 9.5%) and EEG of brain damage (vascular damage, infarct, or ischemia)(34
files i.e 10.4%). Figure 2 shows the plot of the envelope intensity versus the
index for the euclidean-distance-based similarity measure and the plot of the
envelope intensity versus the index for the mutual-information-based similarity
measure. The plot for the Euclidean-distance based similarity matrix exhibits
2 distinct regions whereas the plot for the mutual-information based similarity
matrix exhibits 4 distinct regions. We therefore cluster the data first in 2 different
clusters using the Euclidean-based similarity matrix and then in 4 clusters using
8 Unprocessed and unnormalised.

Fig. 2. Envelope intensity of the dissimilarity matrices: (a) Euclidean distance-based
matrix; (b) mutual information-based matrix

the mutual-information based matrix. As we can see, the mutual information-


based measure yields the correct number of clusters while the Euclidean distance-
based similarity measure isn’t spread enough to yield the correct number of
clusters. We compare the performance of the IFS-based similarity measure with
an autoregressive (AR)-based similarity measure inspired from [9]:
– An AR model is fitted to each of the signals of each of the EEG files consid-
ered (at this stage 476). The order of the AR model fitted is selected using
the AIC criterion. The order is equal to 4 for our dataset.
– The LPC cepstrum coefficients are computed based on the AR model fitted
to each signal using the formulas given in [9]. The number of coefficients
selected is the greatest common divisor (GCD) of the numbers of points of all signals from all files.
– The Euclidean distance, as well as the mutual information between the com-
puted cepstral coefficients are computed in the same way as with the fractal
dimension-based distances for the subset of 328 files for which the diagnosis
are known. The resulting similarity matrices (328 × 328 matrices) are used
to perform k-medoid clustering.
Finally, we use the similarity matrices to cluster the EEGs (see Section 3.3).

5 Results

Figure 3 illustrates the relation between the duration of the EEG and the time it
takes to interpolate EEGs. It shows that the increase of the fractal interpolation
time with respect to the interpolated EEG’s duration is less than linear.

Fig. 3. Execution times of the fractal interpolation as a function of the EEG duration,
compared to the AR modelling of the EEGs. The red triangles represent the fractal
interpolation execution times, the blue crosses the AR modelling execution times, and
the black stars the fit of the measured fractal interpolation execution times with the
function 1.14145161064 ∗ (1 − exp(−(0.5 ∗ x)^2.0)) + 275.735500586 ∗
(1 − exp(−(0.000274218988011 ∗ x)^2.12063087537)), obtained using the Levenberg-Marquardt
algorithm

In comparison, AR modelling execution times increase almost linearly with


the EEG duration. Therefore, fractal interpolation is a scalable method and is
more scalable than AR modelling. In particular, the execution times for files of
durations between 15 and 30 minutes are between 8.8 seconds and 131.7 seconds,
that is, execution times between 6.8 and 204.5 times lower than the duration of
the original EEGs. Furthermore, the method doesn’t impose any condition on
the signals to be compared as it handles the cases where EEGs to be compared
have no or limited common channels and have signals of different lengths. More-
over, fractal interpolation doesn’t require model selection as AR modelling does,
which considerably speeds up EEG interpolation. Moreover, with our dataset,
the computation of the Euclidean distance between the cepstrum coefficients
calculated based on the EEGs AR models leads to a matrix of NaN9 : the AR
modelling method is therefore less stable than the fractal interpolation-based
method. Table 1 summarises the clustering results for all similarity matrices. The
low sensitivity obtained for the abnormal EEGs (epilepsy,encephalopathy,brain
damage) can be be explained through the following reasons:
9 The same happens when the mutual information is used instead of the Euclidean distance (all programs are written in Python 2.6).

Table 1. Specificity and sensitivity of the EEG clusterings

Euclidean distance-based matrix (2 clusters):
                 Specificity       Sensitivity
normal EEG       0.312             0.770833333333
abnormal EEG     0.770833333333    0.312

Mutual information-based matrix (4 clusters):
                 Specificity       Sensitivity
normal EEG       0.297752808989    0.657534246575
epilepsy         0.65564738292     0.183006535948
encephalopathy   0.838709677419    0.051724137931
brain damage     0.818713450292    0.114285714286

– most of the misclassified abnormal EEGs represent mild forms of the pathology
in question, and therefore their deviation from a normal EEG is minimal;
– most of the misclassified abnormal EEGs (in particular for epilepsy and brain
damage) exhibit abnormalities on only a restricted number of channels (lo-
calised versions of the pathologies considered). The similarity measures, giving
equal weights to all channels, are not sensitive enough to abnormalities affect-
ing one channel. In future work, we will explore the influence of weights on
the clustering performance.
About 76% of the normal EEGs are well classified. The remaining misclassified
EEGs are misclassified because they exhibit artifacts, age-specific patterns and/or
sleep-specific patterns that distort the EEGs significantly enough to make them
seem abnormal. Filtering artifacts before computing the similarity measures and
incorporating metadata knowledge in the similarity measure would improve the
clustering results.

6 Conclusion

In this paper, we considered the problem of defining a similarity measure for


EEGs that would be generic enough to cluster EEGs without having to build
an exponential number of disease-specific classifiers. We use fractal interpolation
followed by fractal dimension computation to define a similarity measure. Not
only does the fractal interpolation provide a very compact representation of
EEGs (which may be used later on to compress EEGs) but it also yields execution
times that grow less than linearly with the EEG duration and is therefore a highly
scalable method. It is a method that can compare EEGs of different lengths
containing at least a common subset of channels. It also overcomes several of
the shortcomings of an AR modelling-based measure as it doesn’t require model
selection and is more stable and scalable than AR modelling-based measures.
Furthermore, the mutual-information based measure is more sensitive to the
correct number of clusters than the Euclidean distance-based one. In future
work, we will explore other entropy-based measures. It was also shown that the
shortcomings of the similarity measure when it comes to clustering abnormal
EEGs can be overcome through pre-processing the EEGs before interpolation
to remove artifacts, tuning the weight parameters in the measure to account for
small localised abnormalities and incorporating qualitative metadata knowledge
to the measure. All those solutions constitute future work.

References
1. Accardo, A., Affinito, M., Carrozzi, M., Bouquet, F.: Use of the fractal dimen-
sion for the analysis of electroencephalographic time series. Biological Cybernet-
ics 77(5), 339–350 (1997)
2. Barbará, D., Chen, P.: Using the fractal dimension to cluster datasets. In: KDD,
pp. 260–264 (2000)
3. Barnsley, M.: Fractals everywhere. Academic Press Professional, Inc., San Diego
(1988)
4. Climescu-Haulica, A.: How to Choose the Number of Clusters: The Cramer Mul-
tiplicity Solution. In: Decker, R., Lenz, H.J. (eds.) Advances in Data Analysis,
Proceedings of the 30th Annual Conference of the Gesellschaft für Klassifikation
e.V., Freie Universität Berlin, March 8-10. Studies in Classification, Data Analysis,
and Knowledge Organization, pp. 15–22. Springer, Heidelberg (2006)
5. De Hoon, M., Imoto, S., Nolan, J., Miyano, S.: Open source clustering software.
Bioinformatics 20, 1453–1454 (2004),
http://portal.acm.org/citation.cfm?id=1092875.1092876
6. Eke, A., Herman, P., Kocsis, L., Kozak, L.: Fractal characterization of complexity
in temporal physiological signals. Physiological Measurement 23(1), R1–R38 (2002)
7. Goh, C., Hamadicharef, B., Henderson, G.T., Ifeachor, E.C.: Comparison of Fractal
Dimension Algorithms for the Computation of EEG Biomarkers for Dementia. In:
Proceedings of the 2nd International Conference on Computational Intelligence in
Medicine and Healthcare (CIMED 2005), Costa da Caparica, Lisbon, Portugal,
June 29-July 1 (2005)
8. Hao, L., Ghodadra, R., Thakor, N.V.: Quantification of Brain Injury by EEG
Cepstral Distance during Transient Global Ischemia. In: Proceedings - 19th Inter-
national Conference - IEEE/EMBS, Chicago, IL., USA, October 30-November 2
(1997)
9. Kalpakis, K., Gada, D., Puttagunta, V.: Distance Measures for Effective Clus-
tering of ARIMA Time-Series. In: ICDM 2001: Proceedings of the 2001 IEEE
International Conference on Data Mining, pp. 273–280. IEEE Computer Society,
Washington, DC (2001)
10. Lin, G., Chen, L.: A Grid and Fractal Dimension-Based Data Stream Clustering
Algorithm. In: International Symposium on Information Science and Engineering,
vol. 1, pp. 66–70 (2008)
11. Mazel, D.S., Hayes, M.H.: Fractal modeling of time-series data. In: Conference
Record of the Twenty-Third Asilomar Conference on Signals, Systems and Com-
puters, pp. 182–186 (1989)
12. Malcok, M., Aslandogan, Y.A., Yesildirek, A.: Fractal dimension and similarity
search in high-dimensional spatial databases. In: IRI, pp. 380–384 (2006)
13. Pachori, R.B.: Discrimination between ictal and seizure-free EEG signals using
empirical mode decomposition. Res. Let. Signal Proc. 2008, 1–5 (2008)
14. Sarkar, M., Leong, T.Y.: Characterization of medical time series using fuzzy
similarity-based fractal dimensions. Artificial Intelligence in Medicine 27(2), 201–
222 (2003)
15. Yan, G., Li, Z.: Using cluster similarity to detect natural cluster hierarchies. In:
FSKD (2), pp. 291–295 (2007)
DISC: Data-Intensive Similarity Measure for
Categorical Data

Aditya Desai, Himanshu Singh, and Vikram Pudi

International Institute of Information Technology-Hyderabad,


Hyderabad, India
{aditya.desai,himanshu.singh}@research.iiit.ac.in, vikram@iiit.ac.in
http://iiit.ac.in

Abstract. The concept of similarity is fundamentally important in al-


most every scientific field. Clustering, distance-based outlier detection,
classification, regression and search are major data mining techniques
which compute the similarities between instances and hence the choice
of a particular similarity measure can turn out to be a major cause of
success or failure of the algorithm. The notion of similarity or distance
for categorical data is not as straightforward as for continuous data and
hence, is a major challenge. This is due to the fact that different values
taken by a categorical attribute are not inherently ordered and hence a
notion of direct comparison between two categorical values is not pos-
sible. In addition, the notion of similarity can differ depending on the
particular domain, dataset, or task at hand. In this paper we present a
new similarity measure for categorical data DISC - Data-Intensive Simi-
larity Measure for Categorical Data. DISC captures the semantics of the
data without any help from domain expert for defining the similarity. In
addition to these, it is generic and simple to implement. These desirable
features make it a very attractive alternative to existing approaches. Our
experimental study compares it with 14 other similarity measures on 24
standard real datasets, out of which 12 are used for classification and 12
for regression, and shows that it is more accurate than all its competitors.

Keywords: Categorical similarity measures, cosine similarity, knowl-


edge discovery, classification, regression.

1 Introduction
The concept of similarity is fundamentally important in almost every scientific
field. Clustering, distance-based outlier detection, classification and regression
are major data mining techniques which compute the similarities between in-
stances and hence choice of a particular similarity measure can turn out to be a
major cause of success or failure of the algorithm. For these tasks, the choice of
a similarity measure can be as important as the choice of data representation or
feature selection. Most algorithms typically treat the similarity computation as
an orthogonal step and can make use of any measure. Similarity measures can
be broadly divided in two categories: similarity measures for continuous data
and categorical data.


The notion of similarity measure for continuous data is straightforward due


to inherent numerical ordering. Minkowski distance and its special case, the
Euclidean distance are the two most widely used distance measures for contin-
uous data. However, the notion of similarity or distance for categorical data is
not as straightforward as for continuous data and hence is a major challenge.
This is due to the fact that the different values that a categorical attribute takes
are not inherently ordered and hence a notion of direct comparison between two
categorical values is not possible. In addition, the notion of similarity can differ
depending on the particular domain, dataset, or task at hand.
Although there is no inherent ordering in categorical data, there are other
factors like co-occurrence statistics that can be effectively used to define what
should be considered more similar and vice-versa. This observation has motivated
researchers to come up with data-driven similarity measures for categorical at-
tributes. Such measures take into account the frequency distribution of different
attribute values in a given data set but most of these algorithms fail to capture
any other feature in the dataset apart from frequency distribution of different
attribute values in a given data set. One solution to the problem is to build a
common repository of similarity measures for all commonly occurring concepts.
As an example, let the similarity values for the concept “color” be determined.
Now, consider 3 colors red, pink and black. Consider the two domains as follows:
– Domain I: The domain is say determining the response of cones of the eye
to the color, then it is obvious that the cones behave largely similarly to red
and pink as compared to black. Hence similarity between red and pink must
be high compared to the similarity between red and black or pink and black.
– Domain II: Consider another domain, for example the car sales data. In
such a domain, it may be known that the pink cars are extremely rare as
compared to red and black cars and hence the similarity between red and
black must be larger than that between red and pink or black and pink in
this case.
Thus, the notion of similarity varies from one domain to another and hence the
assignment of similarity must involve a thorough understanding of the domain.
Ideally, the similarity notion is defined by a domain expert who understands the
domain concepts well. However, in many applications domain expertise is not
available and the users don’t understand the interconnections between objects
well enough to formulate exact definitions of similarity or distance.
In the absence of domain expertise it is conceptually very hard to come up
with a domain independent solution for similarity. This makes it necessary to
define a similarity measure based on latent knowledge available from data instead
of a fit-to-all measure and is the major motivation for this paper.
In this paper we present a new similarity measure for categorical data, DISC
– Data-Intensive Similarity Measure for Categorical Data. DISC captures the
semantics of the data without any help from domain experts for defining sim-
ilarity. It achieves this by capturing the relationships that are inherent in the
data itself, thus making the similarity measure “data-intensive”. In addition, it
is generic and simple to implement.

The remainder of the paper is organized as follows. In Section 2 we discuss


related work and problem formulation in Section 3. We present the DISC algo-
rithm in Section 4 followed by experimental evaluation and results in Section 5.
Finally, in Section 6, we summarize the conclusions of our study and identify
future work.

1.1 Key Contributions


– Introducing a notion of similarity between two values of a categorical at-
tribute based on co-occurrence statistics.
– Defining a valid similarity measure for capturing such a notion which can be
used out-of-the-box for any generic domain.
– Experimentally validating that such a similarity measure provides a signifi-
cant improvement in accuracy when applied to classification and regression
on a wide array of dataset domains. The experimental validation is especially
significant since it demonstrates a reasonably large improvement in accuracy
by changing only the similarity measure while keeping the algorithm and its
parameters constant.

2 Related Work
Determining similarity measures for categorical data is a much studied field as
there is no explicit notion of ordering among categorical values. Sneath and
Sokal were among the first to put together and discuss many of the categorical
similarity measures and discuss this in detail in their book [2] on numerical
taxonomy.
The specific problem of clustering categorical data has been actively stud-
ied. There are several books [3,4,5] on cluster analysis that discuss the problem
of determining similarity between categorical attributes. The problem has also
been studied recently in [17,18]. However, most of these approaches do not offer
solutions to the problem discussed in this paper, and the usual recommenda-
tion is to “binarize” the data and then use similarity measures designed for
binary attributes. Most work has been carried out on development of clustering
algorithms and not similarity functions. Hence these works are only marginally
or peripherally related to our work. Wilson and Martinez [6] performed a de-
tailed study of heterogeneous distance functions (for categorical and continuous
attributes) for instance based learning. The measures in their study are based
upon a supervised approach where each data instance has class information in
addition to a set of categorical/continuous attributes.
There have been a number of new data mining techniques for categorical data
that have been proposed recently. Some of them use notions of similarity which
are neighborhood-based [7,8,9], or incorporate the similarity computation into
the learning algorithm[10,11]. These measures are useful to compute the neigh-
borhood of a point and neighborhood-based measures but not for calculating
similarity between a pair of data instances. In the area of information retrieval,
Jones et al. [12] and Noreault et. al [13] have studied several similarity measures.

Another comparative empirical evaluation for determining similarity between


fuzzy sets was performed by Zwick et al. [14], followed by several others [15,16].
In our experiments we have compared our approach with the methods dis-
cussed in [1]; which provides a recent exhaustive comparison of similarity mea-
sure for categorical data.

3 Problem Formulation
In this section we discuss the necessary conditions for a valid similarity measure.
Later, in Section 4.5 we describe how DISC satisfies these requirements and
prove the validity of our algorithm. The following conditions need to hold for a
distance metric “d” to be valid where d(x, y) is the distance between x and y.
1. d(x, y) ≥ 0
2. d(x, y) = 0 if and only if x=y
3. d(x, y) = d(y, x)
4. d(x, z) ≤ d(x, y) + d(y, z)
In order to come up with conditions for a valid similarity measure we use sim =
1/(1 + dist), a distance-similarity mapping used in [1]. Based on this mapping we come
up with the following definitions for valid similarity measures:
1. 0 ≤ Sim(x, y) ≤ 1
2. Sim(x, y) = 1 if and only if x = y
3. Sim(x, y) = Sim(y, x)
4. 1/Sim(x, y) + 1/Sim(y, z) ≥ 1 + 1/Sim(x, z)
where Sim(x, y) is the similarity between x and y.

4 DISC Algorithm
In this section we present the DISC algorithm. First in Section 4.1 we present
the motivation for our algorithm followed by data-structure description in Sec-
tion 4.2 and a brief overview of the algorithm in Section 4.3. We then describe the
algorithm for similarity matrix computation in Section 4.4. Finally in Section 4.5
we validate our similarity measure.

4.1 Motivation and Design


As can be seen from the related work, current similarity (distance) measures
for categorical data only examine the number of distinct categories and their
counts without looking at co-occurrence statistics with other dimensions in the
data. Thus, there is a high possibility that, the latent information that comes
along is lost during the process of assigning similarities. Consider the example
in Table 1, let there be a 3-column dataset where the Brand of a car and Color
are independent variables and the Price of the car is a dependent variable. Now
there are three brands a, b, c with average price 49.33, 32.33, 45.66. It can be
intuitively said that based on the available information that similarity between
a and c is greater than that between the categories a and b. This is true in

Table 1. Illustration

Brand Color Price


a red 50
a green 48
a blue 50
b red 32
b green 30
b blue 35
c red 47
c green 45
c blue 45

real life where a, c, b may represent low, medium and high end cars and hence
the similarity between a low-end and a medium-end car will be more than the
similarity between a low-end and a high-end car. Now the other independent
variable is Color. The average prices corresponding to the three colors namely
red, green and blue are 43, 41, 43.33. As can be seen, there is a small difference
in their prices which is in line with the fact that the cost of the car is very loosely
related to its color.
It is important to note that a notion of similarity for categorical variables has a
cognitive component to it and as such each one is debatable. However, the above
explained notion of similarity is the one that best exploits the latent information
for assigning similarity and will hence give predictors of high accuracy. This claim
is validated by the experimental results. Extracting these underlying semantics
by studying co-occurrence data forms the motivation for the algorithm presented
in this section.

4.2 Data Structure Description


We first construct a data structure called the categorical information table (CI).
The function of the CI table is to provide a quick-lookup for information re-
lated to the co-occurrence statistics. Thus, for the above example CI[Brand:a]
[Color:red] = 1, as for only a single instance, Brand:a co-occurs with Color:red.
For a categorical-numeric pair, e.g. CI[Brand:a][Price] = 49.33 the value is the
mean value of the attribute P rice for instances whose value for Brand is a.
Now, for every value v of each categorical attribute Ak a representative point
τ (Ak : v) is defined. The representative point is a vector consisting of the means
of all attributes other than Ak for instances where attribute Ak takes value v:
τ (Ak : v) =< μ(Ak : v, A1 ), . . . , μ(Ak : v, Ad ) > (1)
It may be noted that the term μ(Ak : v, Ak ) is skipped in the above expression.
As there is no standard notion of mean for categorical attributes we define it as
μ(Ak : v, Ai ) =< CI[Ak : v][Ai : vi1 ], . . . , CI[Ak : v][Ai : vin ] > (2)

where domain(Ai) = {vi1, . . . , vin}. It can thus be seen that the mean itself is a
point in a n-dimensional space having dimensions as vi1 ,. . . ,vin with magnitudes:
< CI[Ak : v][Ai : vi1 ], . . . , CI[Ak : v][Ai : vin ] >.
Initially all distinct values belonging to the same attribute are conceptually
vectors perpendicular to each other and hence the similarity between them is 0.
For, the given example, the mean for dimension Color when Brand : a is
denoted as μ(Brand : a, Color). As defined above, the mean in a categorical
dimension is itself a point in a n-dimensional space and hence, the dimensions
of mean for the attribute Color are red, blue, green and hence
μ(Brand : a, Color) = {CI[Brand : a][Color : red], CI[Brand : a][Color :
blue], CI[Brand : a][Color : green]}
Similarly, μ(Brand : a, Price) = {CI[Brand : a][Price]}.
Thus the representative point for the value a of attribute Brand is given by:
τ(Brand : a) = < μ(Brand : a, Color), μ(Brand : a, Price) >
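As an illustration of the CI lookup (not the authors' implementation), the table for the car example of Table 1 can be built with plain dictionaries:

from collections import defaultdict

# Table 1 data: (Brand, Color, Price); Price is the only numeric attribute.
ROWS = [("a", "red", 50), ("a", "green", 48), ("a", "blue", 50),
        ("b", "red", 32), ("b", "green", 30), ("b", "blue", 35),
        ("c", "red", 47), ("c", "green", 45), ("c", "blue", 45)]
ATTRS = ["Brand", "Color", "Price"]
CATEGORICAL = {"Brand", "Color"}

def build_ci(rows):
    # CI[(Ai, v)][(Aj, w)] = co-occurrence count when Aj is categorical,
    # CI[(Ai, v)][Aj]      = mean of Aj over the matching rows when Aj is numeric.
    ci = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        rec = dict(zip(ATTRS, row))
        for ai in CATEGORICAL:
            key = (ai, rec[ai])
            for aj in ATTRS:
                if aj == ai:
                    continue
                if aj in CATEGORICAL:
                    ci[key][(aj, rec[aj])] += 1
                else:
                    ci[key][aj] += rec[aj]      # accumulate, divide below
                    counts[key][aj] += 1
    for key in counts:                          # turn sums into means
        for aj in counts[key]:
            ci[key][aj] /= counts[key][aj]
    return ci

CI = build_ci(ROWS)
print(CI[("Brand", "a")][("Color", "red")])     # 1.0
print(round(CI[("Brand", "a")]["Price"], 2))    # 49.33

The representative point τ(Brand : a) is then simply the collection of these per-attribute entries, exactly as written out above.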

4.3 Algorithm Overview

Initially we calculate the representative points for all values of all attributes. We
then initialize similarity in a manner similar to the overlap similarity measure
where matches are assigned similarity value 1 and the mismatches are assigned
similarity value 0. Using the representative points calculated above, we assign a
new similarity between each pair of values v, v′ belonging to the same attribute
Ak as equal to the average of the cosine similarity between their means for each
dimension. Now the cosine similarity between v and v′ in dimension Ai is denoted
by CS(v : v′, Ai) and is equal to the cosine similarity between the vectors μ(Ak : v, Ai)
and μ(Ak : v′, Ai). Thus, the similarity between Ak : v and Ak : v′ is:

    ( Σ_{l=1, l≠k}^{d} CS(v : v′, Al) ) / (d − 1)

Thus, for the above example, the similarity between Brand:a and Brand:b is the
average of the cosine similarity between their respective means in dimensions Color
and Price. Thus Sim(a, b) is given as: (CS(a : b, Color) + CS(a : b, Price)) / 2

An iteration is said to have been completed, when similarity between all pairs
of values belonging to the same attribute (for all attributes) are computed using
the above methodology. These, new values are used for cosine similarity compu-
tation in the next iteration.

4.4 DISC Computation

In this section, we describe the DISC algorithm and hence the similarity ma-
trix construction. The similarity matrix construction using DISC is described as
follows:

1. The similarity matrix is initialized in a manner similar to overlap similarity


measure where ∀i,j,k Sim(vij, vik) = 1 if vij = vik, and Sim(vij, vik) = 0
if vij ≠ vik.

Table 2. Cosine similarity computation between vij, vik

Similarity_m =
    1 − |CI[Ai : vij][Am] − CI[Ai : vik][Am]| / (Max[Am] − Min[Am]),     if Am is Numeric
    CosineProduct(CI[Ai : vij][Am], CI[Ai : vik][Am]),                   if Am is Categorical

where CosineProduct(CI[Ai : vij][Am], CI[Ai : vik][Am]) is defined as follows:

    Σ_{vml, vml̄ ∈ Am} CI[Ai : vij][Am : vml] · CI[Ai : vik][Am : vml̄] · Sim(vml̄, vml)
    -------------------------------------------------------------------------------
                          NormalVector1 · NormalVector2

NormalVector1 = ( Σ_{vml, vml̄ ∈ Am} CI[Ai : vij][Am : vml] · CI[Ai : vij][Am : vml̄] · Sim(vml, vml̄) )^(1/2)
NormalVector2 = ( Σ_{vml, vml̄ ∈ Am} CI[Ai : vik][Am : vml] · CI[Ai : vik][Am : vml̄] · Sim(vml, vml̄) )^(1/2)

Sim(vij, vik) = (1 / (d − 1)) · Σ_{m=1, m≠i}^{d} Similarity_m

2. Consider a training dataset consisting of n tuples. The value of the


feature variable Aj corresponding to the ith tuple is given as T rainij . We
construct a data-structure “Categorical Information” which for any categor-
ical value (vil ) of attribute Ai returns number of co-occurrences of value vjk
taken by feature variable Aj if Aj is categorical and returns the mean value
of feature variable Aj for the corresponding set of instances if it is numeric.
Let this data-structure be denoted by CI. The value corresponding to the
number of co-occurrences of categorical value vjk when feature variable Ai
takes value vil is given by CI[Ai , vil ][Aj , vjk ] when Aj is categorical. Also,
when Aj is numeric, CI[Ai , vil ][Aj ] corresponds to the mean of values taken
by attribute Aj when Ai takes value vil .
3. The Sim(vij ,vik ) (Similarity between categorical values vij and vik ) is now
calculated as the average of the per-attribute cosine similarity between their
means (Similaritym ), where the means have a form as described above.
The complicated nature of cosine product arises due to the fact that, the
transformed space after the first iteration has dimensions which are no longer
orthogonal (i.e. Sim(vij , vik ) is no longer 0).
4. The matrix is populated using the above equation for all combinations
∀i,j,k Sim(vij , vik ). To test the effectiveness of the similarity matrix, we plug
the similarity values in a classifier (the nearest neighbour classifier in our
case) in case of classification and compute its accuracy on a validation set.
If the problem domain is regression we plug in the similarity values into a
regressor (the nearest neighbour regressor in our case) and compute the cor-
responding root mean square error. Thus, such an execution of 3 followed by
4 is termed an iteration.
5. Step 3 is iterated again using the new similarity values until the
accuracy parameter stops increasing. The matrix obtained at this iteration
is the final similarity matrix that is used for testing. (In case of regression
we stop when the root mean square error increases.)
In addition, the authors have observed that most of the improvement
takes place in the first iteration and hence, in domains like clustering
(unsupervised) or in domains with tight limits on training time, the algorithm
can be halted after the first iteration (one such iteration is sketched below).
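The update of a single pair Sim(vij, vik), as defined in Table 2, can be sketched as follows. The sketch reuses the CI dictionary of the previous listing; sim is the current similarity table keyed by value pairs (initialised with the overlap values of step 1), numeric_range holds Min[Am] and Max[Am] for the numeric attributes, and all names are illustrative rather than taken from the paper.

import math

def cosine_product(mu1, mu2, sim):
    # Cosine similarity between two categorical 'means' (dicts value -> count)
    # in the non-orthogonal space defined by the current similarities sim[(x, y)].
    def dot(p, q):
        return sum(p[x] * q[y] * sim.get((x, y), 1.0 if x == y else 0.0)
                   for x in p for y in q)
    denom = math.sqrt(dot(mu1, mu1)) * math.sqrt(dot(mu2, mu2))
    return dot(mu1, mu2) / denom if denom else 0.0

def similarity_m(ci, sim, ai, vj, vk, am, numeric_range):
    # One term Similarity_m of Table 2 for attribute Am.
    if am in numeric_range:                                    # numeric attribute
        lo, hi = numeric_range[am]
        return 1.0 - abs(ci[(ai, vj)][am] - ci[(ai, vk)][am]) / (hi - lo)
    mu1 = {key[1]: c for key, c in ci[(ai, vj)].items()        # categorical attribute
           if isinstance(key, tuple) and key[0] == am}
    mu2 = {key[1]: c for key, c in ci[(ai, vk)].items()
           if isinstance(key, tuple) and key[0] == am}
    return cosine_product(mu1, mu2, sim)

def updated_similarity(ci, sim, ai, vj, vk, attrs, numeric_range):
    # Sim(vij, vik) = (1/(d-1)) * sum of Similarity_m over all attributes Am != Ai.
    others = [am for am in attrs if am != ai]
    return sum(similarity_m(ci, sim, ai, vj, vk, am, numeric_range)
               for am in others) / len(others)

One iteration of step 3 recomputes updated_similarity for every pair of values of every attribute and writes the results back into sim; the next iteration then uses these new values inside cosine_product.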

4.5 Validity of Similarity Measure


The similarity measure proposed in this paper is basically a mean of cosine simi-
larities derived for individual dimensions in non-orthogonal spaces. The validity
of the similarity measure can now be argued as follows:
1. As the similarity measure is a mean of cosine similarities which have a range
from 0-1, it is implied that the range of values output by the similarity
measure will be between 0-1 thus satisfying the first constraint.
2. For the proposed similarity measure Sim(X, Y ) = 1, if and only if
Simk(Xk, Yk) = 1 for all feature variables Ak. Now, constraint 2 will be vio-
lated if X ≠ Y and Sim(X, Y) = 1. This implies that there exists an Xk, Yk
such that Xk ≠ Yk and for which Sim(Xk, Yk) = 1. Now, Sim(Xk, Yk) = 1
implies that the cosine product of CI[Ak : Xk][Am] and CI[Ak : Yk][Am] is 1 for all
Am which implies that CI[Ak : Xk ][Am ], CI[Ak : Yk ][Am ] are parallel and
hence can be considered to be equivalent with respect to the training data.
3. As cosine product is commutative, the third property holds implicitly.
4. It may be noted that the resultant similarity is a mean of similarities com-
puted for each dimension. Also, the similarity for each dimension is in essence
a cosine product and hence, the triangle inequality holds for each component
of the sum. Thus the fourth property is satisfied.

5 Experimental Study
In this section, we describe the pre-processing steps and the datasets used in
Section 5.1 followed by experimental results in Section 5.2. Finally in Section 5.3
we provide a discussion on the experimental results.

5.1 Pre-processing and Experimental Settings


For our experiments we used 24 datasets out of which 12 were used for classi-
fication and 12 for regression. We compare our approach with the approaches
discussed in [1], which provides a recent exhaustive comparison of similarity
measures for categorical data.
Eleven of the datasets used for classification were purely categorical and one
was numeric (Iris). Different methods can be used to handle numeric attributes
in datasets like discretizing the numeric variables using the concept of minimum
description length [20] or equi-width binning. Another possible way to handle a
mixture of attributes is to compute the similarity for continuous and categorical
attributes separately, and then do a weighted aggregation. For our experiments
we used MDL for discretizing numeric variables for classification datasets.
Nine of the datasets used for regression were purely numeric, two (Abalone
and Auto Mpg) were mixed while one (Servo) was purely categorical. It may
be noted that the datasets used for regression were discretized using equi-width

binning using the following weka setting: “weka.filters.unsupervised.attribute.Discretize
-B 10 -M -1.0 -R first-last”. The k-Nearest Neighbours (kNN)
was implemented with number of neighbours 10. The weight associated with
each neighbour was the similarity between the neighbour and the input tuple.
The class with the highest weighted votes was the output class for classification
while the output for regression was a weighted sum of the individual responses.
The results have been presented for 10-folds cross-validation. Also, for our
experiments we used the entire train set as the validation set. The numbers in
brackets indicate the rank of DISC versus all other competing similarity mea-
sures. For classification, the values indicate the accuracy of the classifier where a
high value corresponds to high percentage accuracy and hence such a similarity
measure is assigned a better (higher) rank. On the other hand, for regression
Root Mean Square Error (RMSE) value has been presented where a compara-
tively low value indicates lower error and better performance of the predictor
and hence such a similarity measure is assigned a better rank. It may be noted
that a rank of 1 indicates best performance with the relative performance being
poorer for lower ranks.
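For concreteness, the similarity-weighted voting described above can be sketched as below; this is an illustration of the described setup rather than the authors' code, and the regression output is shown as a normalised weighted sum.

def knn_predict(neighbours, task="classification"):
    # neighbours: list of (similarity, target) pairs for the 10 nearest neighbours,
    # where similarity is the DISC (or competing) similarity to the input tuple.
    if task == "classification":
        votes = {}
        for s, label in neighbours:
            votes[label] = votes.get(label, 0.0) + s     # weight = similarity
        return max(votes, key=votes.get)                 # highest weighted vote wins
    total = sum(s for s, _ in neighbours)
    return sum(s * y for s, y in neighbours) / total     # weighted response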

5.2 Experimental Results

The experimental results for classification and regression are presented in Ta-
ble 3, 4 and Table 5, 6 respectively. In these tables each row represents compet-
ing similarity measure and the column represents different datasets. In Table 3
and 4, each cell represents the accuracy for the corresponding dataset and simi-
larity measure respectively. In Table 5 and 6, each cell represents the root mean
square error (RMSE) for the corresponding dataset and similarity measure re-
spectively.

5.3 Discussion of Results

As can be seen from the experimental results, DISC is the best similarity measure
for classification for all datasets except Lymphography, Primary Tumor and
Hayes Roth Test where it is the third best for the first two and the second
best for the last one. On the basis of overall mean accuracy, DISC outperforms
the nearest competitor by about 2.87% where we define overall mean accuracy
as the mean of accuracies over all classification datasets considered for our
experiments. For regression, DISC is the best performing similarity measure on
the basis of Root Mean Square Error (RMSE) for all datasets.
For classification datasets like Iris, Primary Tumor and Zoo the algorithm
halted after the 1st iteration while for datasets like Balance, Lymphography,
Tic-Tac-Toe, Breast Cancer the algorithm halted after the 2nd iteration. Also,
for Car-Evaluation, Hayes Roth, Teaching Assistant and Nursery the algorithm
halted after the 3rd iteration while it halted after the 4th iteration for Hayes Roth
Test. For regression, the number of iterations was less than 5 for all datasets ex-
cept Compressive Strength for which it was 9. Thus, it can be seen that the
number of iterations for all datasets is small. Also, the authors observed that

Table 3. Accuracy for k-NN with k = 10

Sim. Measure   Balance   Breast Cancer   Car Evaluation   Hayes Roth   Iris   Lymphography
DISC 90.4(1) 76.89(1) 96.46(1) 77.27(1) 96.66(1) 85.13(3)
Overlap 81.92 75.81 92.7 64.39 96.66 81.75
Eskin 81.92 73.28 91.2 22.72 96.0 79.72
IOF 79.84 76.89 91.03 63.63 96.0 81.75
OF 81.92 74.0 90.85 17.42 95.33 79.05
Lin 81.92 74.72 92.7 71.96 95.33 84.45
Lin1 81.92 75.09 90.85 18.93 94.0 82.43
Goodall1 81.92 74.36 90.85 72.72 95.33 86.48
Goodall2 81.92 73.28 90.85 59.09 96.66 81.08
Goodall3 81.92 73.64 90.85 39.39 95.33 85.13
Goodall4 81.92 74.72 91.03 53.78 96.0 81.08
Smirnov 81.92 71.48 90.85 59.84 94.0 85.81
Gambaryan 81.92 76.53 91.03 53.03 96.0 82.43
Burnaby 81.92 70.39 90.85 19.69 95.33 75.0
Anderberg 81.92 72.2 90.85 25.0 94.0 80.4

Table 4. Accuracy for k-NN with k = 10

Sim. Measure   Primary Tumor   Hayes Roth Test   Tic Tac Toe   Zoo   Teaching Assist.   Nursery   Mean Accuracy
DISC 41.66(3) 89.28(2) 100.0(1) 91.08(1) 58.94(1) 98.41(1) 83.51(1)
Overlap 41.66 82.14 92.48 91.08 50.33 94.75 78.81
Eskin 41.36 75.0 94.46 90.09 50.33 94.16 74.19
IOF 38.98 71.42 100.0 90.09 47.01 94.16 77.57
OF 40.17 60.71 84.96 89.1 43.7 95.74 71.08
Lin 41.66 67.85 95.82 90.09 56.95 96.04 79.13
Lin1 42.26 42.85 82.56 91.08 54.96 93.54 70.87
Goodall1 43.15 89.28 97.07 89.1 51.65 95.74 80.64
Goodall2 38.09 92.85 91.54 88.11 52.98 95.74 78.52
Goodall3 41.66 71.42 95.51 89.1 50.99 95.74 75.89
Goodall4 32.73 82.14 96.24 89.1 55.62 94.16 77.38
Smirnov 42.55 78.57 98.74 89.1 54.3 95.67 78.57
Gambaryan 39.58 89.28 98.74 90.09 50.33 94.16 78.59
Burnaby 3.86 60.71 83.29 71.28 40.39 90.85 65.30
Anderberg 37.79 53.57 89.14 90.09 50.33 95.74 71.75

the major bulk of the accuracy improvement is achieved with the first iteration
and hence for domains with time constraints in training the algorithm can be
halted after the first iteration. The reason for the consistently good performance
can be attributed to the fact that a similarity computation is a major component
in nearest neighbour classification and regression techniques, and DISC captures
similarity accurately and efficiently in a data driven manner.

Table 5. RMSE for k-NN with k = 10

Sim. Measure   Comp. Strength   Flow   Abalone   Bodyfat   Housing   Whitewine
DISC 4.82(1) 13.2(1) 2.4(1) 0.6(1) 4.68(1) 0.74(1)
Overlap 6.3 15.16 2.44 0.65 5.4 0.74
Eskin 6.58 16.0 2.45 0.66 6.0 0.77
IOF 6.18 15.53 2.42 0.76 5.48 0.75
OF 6.62 14.93 2.41 0.66 5.27 0.75
Lin 6.03 16.12 2.4 0.63 5.3 0.74
Lin1 7.3 16.52 2.41 0.87 5.41 0.74
Goodall1 6.66 14.97 2.41 0.64 5.27 0.74
Goodall2 6.37 15.09 2.43 0.66 5.33 0.75
Goodall3 6.71 14.96 2.41 0.65 5.27 0.74
Goodall4 5.98 15.67 2.47 0.71 6.4 0.78
Smirnov 6.89 15.5 2.4 0.67 5.17 0.74
Gambaryan 6.01 15.46 2.46 0.67 5.73 0.76
Burnaby 6.63 15.23 2.41 0.65 5.32 0.74
Anderberg 7.15 15.16 2.42 0.67 5.84 0.75

Table 6. RMSE for k-NN with k = 10

Dataset Slump Servo Redwine Forest Fires Concrete Auto Mpg


DISC 6.79(1) 0.54(1) 0.65(1) 65.96(1) 10.29(1) 2.96(1)
Overlap 7.9 0.78 0.67 67.13 11.61 3.58
Eskin 8.12 0.77 0.68 67.49 11.15 3.98
IOF 8.11 0.77 0.68 67.95 11.36 3.71
OF 7.72 0.8 0.68 67.76 12.55 3.3
Lin 8.33 0.76 0.67 67.16 10.99 3.74
Lin1 8.42 1.1 0.68 67.96 12.16 3.89
Goodall1 7.76 0.77 0.66 67.97 11.45 3.5
Goodall2 7.82 0.81 0.67 68.64 12.21 3.39
Goodall3 7.75 0.78 0.67 68.48 11.52 3.39
Goodall4 7.87 0.95 0.71 70.28 12.96 3.92
Smirnov 8.22 0.78 0.69 67.07 11.59 3.39
Gambaryan 7.8 0.83 0.69 69.54 12.38 3.75
Burnaby 7.89 0.8 0.68 67.73 12.62 3.28
Anderberg 7.94 0.9 0.7 66.63 12.66 3.53

The computational complexity for determining the similarity measure is equiv-


alent to the computational complexity of computing cosine similarity for each
pair of values belonging to the same categorical attribute. Let the number of
pairs of values, the number of tuples, number of attributes and the average
number of values per attribute be V , n, d and v respectively. It can be seen that,
construction of the categorical information table is O(nd). Also, for all pairs of values V,
we compute the similarity as the mean of the cosine similarity of their representa-
tive points for each dimension. This is essentially O(v²d) for each pair and hence
the computational complexity of this step is O(V·v²·d), and hence the overall complexity

is O(nd + V·v²·d). Once the similarity values are computed, using them in any
classification, regression or a clustering task is a simple table look up and is
hence O(1).

6 Conclusion
In this paper we have presented and evaluated DISC, a similarity measure for
categorical data. DISC is data intensive, generic and simple to implement. In
addition to these features, it doesn’t require any domain expert’s knowledge.
Finally our algorithm was evaluated against 14 competing algorithms on 24
standard real-life datasets, out of which 12 were used for classification and 12 for
regression. It outperformed all competing algorithms on almost all datasets. The
experimental results are especially significant since it demonstrates a reasonably
large improvement in accuracy by changing only the similarity measure while
keeping the algorithm and its parameters constant.
Apart from classification and regression, similarity computation is a pivotal
step in a number of application such as clustering, distance-based outliers detec-
tion and search. Future work includes applying our algorithm for these techniques
also. We also intend to develop a weighing measure for different dimensions for
calculating similarity which will make the algorithm more robust.

References
1. Boriah, S., Chandola, V., Kumar, V.: Similarity Measures for Categorical Data: A
Comparative Evaluation. In: Proceedings of SDM 2008. SIAM, Atlanta (2008)
2. Sneath, P.H.A., Sokal, R.R.: Numerical Taxonomy: The Principles and Practice of
Numerical Classification. W. H. Freeman and Company, San Francisco (1973)
3. Anderberg, M.R.: Cluster Analysis for Applications. Academic Press, London
(1973)
4. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Englewood
Cliffs (1988)
5. Hartigan, J.A.: Clustering Algorithms. John Wiley & Sons, New York (1975)
6. Wilson, D.R., Martinez, T.R.: Improved heterogeneous distance functions. J. Artif.
Intell. Res. (JAIR) 6, 1–34 (1997)
7. Biberman, Y.: A context similarity measure. In: Bergadano, F., De Raedt, L. (eds.)
ECML 1994. LNCS, vol. 784, pp. 49–63. Springer, Heidelberg (1994)
8. Das, G., Mannila, H.: Context-based similarity measures for categorical databases.
In: Zighed, D.A., Komorowski, J., Żytkow, J.M. (eds.) PKDD 2000. LNCS (LNAI),
vol. 1910, pp. 201–210. Springer, Heidelberg (2000)
9. Palmer, C.R., Faloutsos, C.: Electricity based external similarity of categorical
attributes. In: Whang, K.-Y., Jeon, J., Shim, K., Srivastava, J. (eds.) PAKDD
2003. LNCS (LNAI), vol. 2637, pp. 486–500. Springer, Heidelberg (2003)
10. Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with
categorical values. Data Mining and Knowledge Discovery 2(3), 283–304 (1998)
11. Ganti, V., Gehrke, J., Ramakrishnan, R.: CACTUS–clustering categorical data
using summaries. In: KDD 1999. ACM Press, New York (1999)
12. Jones, W.P., Furnas, G.W.: Pictures of relevance: a geometric analysis of similarity
measures. J. Am. Soc. Inf. Sci. 38(6), 420–442 (1987)

13. Noreault, T., McGill, M., Koll, M.B.: A performance evaluation of similarity mea-
sures, document term weighting schemes and representations in a boolean environ-
ment. In: SIGIR 1980: Proceedings of the 3rd Annual ACM Conference on Research
and Development in Information Retrieval, Kent, UK, pp. 57–76. Butterworth &
Co. (1981)
14. Zwick, R., Carlstein, E., Budescu, D.V.: Measures of similarity among fuzzy con-
cepts: A comparative analysis. International Journal of Approximate Reason-
ing 1(2), 221–242 (1987)
15. Pappis, C.P., Karacapilidis, N.I.: A comparative assessment of measures of simi-
larity of fuzzy values. Fuzzy Sets and Systems 56(2), 171–174 (1993)
16. Wang, X., De Baets, B., Kerre, E.: A comparative study of similarity measures.
Fuzzy Sets and Systems 73(2), 259–268 (1995)
17. Gibson, D., Kleinberg, J.M., Raghavan, P.: Clustering categorical data: An ap-
proach based on dynamical systems. VLDB Journal 8(34), 222–236 (2000)
18. Guha, S., Rastogi, R., Shim, K.: ROCK – a robust clustering algorithm for categorical
attributes. In: Proceedings of IEEE International Conference on Data Engineering
(1999)
19. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and tech-
niques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
20. Fayyad, U.M., Irani, K.B.: On the handling of continuous-valued attributes in
decision tree generation. Machine Learning 8, 87–102 (1992)
ListOPT: Learning to Optimize for XML Ranking

Ning Gao1, Zhi-Hong Deng1,2, Hang Yu1, and Jia-Jian Jiang1




1
Key Laboratory of Machine Perception (Ministry of Education),
School of Electronic Engineering and Computer Science, Peking University
2
The State Key Lab of Computer Science, Institute of Software,
Chinese Academy of Sciences, Beijing 100190, China
Abstract. Many machine learning classification technologies such as boosting,
support vector machine or neural networks have been applied to the ranking prob-
lem in information retrieval. However, since the purpose of these learning-to-
rank methods is to directly acquire the sorted results based on the features of
documents, they are unable to combine and utilize the existing ranking meth-
ods proven to be effective such as BM25 and PageRank. To solve this defect,
we conducted a study on learning-to-optimize, which is to construct a learning
model or method for optimizing the free parameters in ranking functions. This
paper proposes a listwise learning-to-optimize process ListOPT and introduces
three alternative differentiable query-level loss functions. The experimental re-
sults on the XML dataset of Wikipedia English show that these approaches can
be successfully applied to tuning the parameters used in an existing highly cited
ranking function BM25. Furthermore, we found that the formulas with optimized
parameters indeed improve the effectiveness compared with the original ones.

Keywords: learning-to-optimize, ranking, BM25, XML.

1 Introduction
Search engines have become an indispensable part of life and one of the key issues on
search engine is ranking. Given a query, the ranking modules can sort the retrieval doc-
uments for maximally satisfying the user’s needs. Traditional ranking methods aim to
compute the relevance of a document to a query, according to the factors, term frequen-
cies and links for example. The search result is a ranked list in which the documents
are sequenced by their relevance score in descending order. These kinds of methods
include the content based functions such as TF*IDF [1] and BM25 [2], and link based
functions such as PageRank [3] and HITS [4].
Recently, machine learning technologies have been successfully applied to informa-
tion retrieval, known and named as “learning-to-rank”. The main procedure of “learning-
to-rank” is as follows: in the learning module, a set of queries is given, and each of the
queries is associated with a ground-truth ranking list of documents. The process targets
Corresponding author.


at creating a ranking model that can precisely predict the order of documents in the
ground-truth list. Many learning-to-rank approaches have been proposed and based on
the di erences of their learning samples, these methods can be classified into three cate-
gories [5]: pointwise, pairwise and listwise. Taking single document as learning object,
the pointwise based methods intend to compute the relevance score of each document
with respect to its closeness to the ground-truth. On the other side, pairwise based
approaches take the document pair as learning sample, and rephrase the learning prob-
lem as a classification problem. Listwise based approaches take a ranked list as learning
sample, and measure the differences between the current result list and the ground-truth
list using a loss function. The learning purpose of listwise methods is to minimize
the loss. The experimental results in [5] [11] [12] show that the listwise based methods
perform the best among these three kinds of methods.
It is worth noting that, from the perspective of ranking, the aforementioned learning-
to-rank methods belong to the learning based ranking technologies. Here the search
results are directly obtained from the learning module, without considering the tradi-
tional content based or link based ranking functions. However, there is no evidence to
confirm that the learning based methods perform better than all the other classic content
based or link based methods. Accordingly, to substitute the other two kinds of ranking
technologies with the learning based methods might not be appropriate.
We hence consider a learning-to-optimize method ListOPT that can combine and
utilize the benefits of learning-to-rank methods and traditional content based meth-
ods. Here the ranking method is the extension to the widely known ranking function
BM25. Due to previous studies, experiments are conducted on selecting the parameters
of BM25 with the best performance, typically after thousands of runs. However, this
simple but exhaustive procedure is only applicable to the functions with few free pa-
rameters. Besides, whether the best parameter values are in the testing set is also under
suspect. To attack this defect, a listwise learning method to optimize the free parameters
is introduced.
Same as learning-to-rank methods, the key issue of learning-to-optimize method is
the definition of the loss function. In this paper, we discuss the effect of three distinct defini-
tions of loss in the learning process, and the experiments show that all three loss functions
converge. The experiments also reveal that the ranking function using tuned parameter
set indeed performs better.
The primary contributions of this paper include: (1) proposed a learning-to-optimize
method which combine and utilize the traditional ranking function BM25 and listwise
learning-to-rank method, (2) introduced the definition of three query-level loss func-
tions on the basis of cosine similarity, Euclidean distance and cross entropy, confirmed
to converge by experiments, (3) the verified the e ectiveness of the learning-to-optimize
approach on a large XML dataset Wikipedia English[6].
The paper is organized as follows. In section 2, we introduce the related work. Sec-
tion 3 gives the general description on learning-to-optimize approach ListOPT. The
definition of the three loss functions are discussed in section 4. Section 5 reports our
experimental results. Section 6 is the conclusion and future work.

2 Related Work

2.1 Learning-to-Rank

In recent years, many machine learning methods were applied to the problem of
ranking for information retrieval. The existing learning-to-rank methods fall into three
categories, pointwise, pairwise and listwise. The pointwise approaches [7] are firstly
proposed, transforming the ranking problem into regression or classification on single
candidate documents. On the other side, pairwise approaches, published later, regard
the ranking process as a classification of document pairs. For example, given a query Q
and an arbitrary document pair P = (d1, d2) in the data collection, where di means the
i-th candidate document, if d1 shows higher relevance than d2, then the pair P is labeled
as a positive instance, otherwise P is labeled as a negative instance. The advantage of pointwise and pair-
wise approaches is that the existing classification or regression theories can be directly
applied. For instance, borrowing support vector machine, boosting and neural network
as the classification model leads to the methods of Ranking SVM [8], RankBoost [9]
and RankNet [10].
However, the objective of pointwise and pairwise learning methods is to minimize
errors in classification of single document or document pairs rather than to minimize
errors in ranking of documents. To overcome this drawback of the aforementioned two
approaches, listwise methods, such as ListNet [5], RankCosine [11] and ListMLE [12],
are proposed. In listwise approaches, the learning object is the result list and various
kinds of loss functions are defined to measure the similarity of the predict result list
and the ground-truth result list. ListNet, the first listwise approach proposed by Cao et
al., uses the cross entropy as loss function. Qin et al. discussed about another listwise
method called RankCosine, where the cosine similarity is defined as loss function. Xia
et al. introduced likelihood loss as loss function in the listwise learning-to-rank method
ListMLE.

2.2 Ranking Function BM25

In information retrieval, BM25 is a highly cited ranking function used by search en-
gines to rank matching documents according to their relevance to a given search query.
It is based on the probabilistic retrieval framework developed in the 1970s and 1980s.
Though BM25 is proposed to rank the HTML format documents originally, it was in-
troduced to the area of XML documents ranking in recent years. In the last three years
of INEX1 [6] Ad Hoc track2 [17][18][19], all the search engines that performed the best
used BM25 as their basic ranking function. To improve the performance of BM25, Taylor et
al. introduced the pairwise learning-to-rank method RankNet to tune the parameters in
BM25, named as RankNet Tuning method [13] in this paper. However, as mentioned in
2.1, the inherent disadvantages of pairwise methods had a pernicious influence on the
1 Initiative for the Evaluation of XML retrieval (INEX), a global evaluation platform, was launched
in 2002 for organizations from Information Retrieval, Database and other related research
fields to compare the effectiveness and efficiency of their XML search engines.
2 In the Ad Hoc track, participants are organized to compare the retrieval effectiveness of their XML
search engines.

approach. Experiments in section 5 will compare the effectiveness of RankNet Tuning


with the other methods proposed in this paper.

3 ListOPT: A Learning-to-Optimize Approach

In this section, we describe the details of the learning-to-optimize approach. First,
we give the formal definition of the ranking function BM25 used in XML retrieval
and analyze the parameters in the formula. Then, the training process of the listwise
learning-to-optimize approach ListOPT is presented in 3.2.

3.1 BM25 in XML Retrieval

Unlike HTML retrieval, the retrieval results in XML retrieval are elements; the definition
of BM25 is thus different from the traditional BM25 formula used in HTML ranking.
The formal definition is as follows:

ps(e, Q) = \sum_{t \in Q} W_t \cdot \frac{(k + 1) \cdot tf(t, e)}{k \cdot \bigl(1 - b + b \cdot \frac{len(e)}{avel}\bigr) + tf(t, e)}   (1)

W_t = \log \frac{N_d}{n(t)}
In the formula, tf(t, e) is the frequency with which keyword t appears in element e; N_d is the
number of files in the collection; n(t) is the number of files that contain keyword t;
len(e) is the length of element e; avel is the average length of elements in the collection; Q
is a set of keywords; ps(e, Q) is the predicted relevance score of element e with respect
to query Q; b and k are two free parameters.
As observed, the parameters in BM25 fall into three categories: constant parameters,
fixed parameters and free parameters. Parameters describing the features of the data
collection, like avel and N_d, are defined as constant parameters. Given a certain
query and a candidate element, tf(t, e) and len(e) in the formula are fixed values; this
kind of parameter is called a fixed parameter. Moreover, free parameters, such as k
and b in the function, are set to make the formula more adaptable to various kinds of
data collections. Therefore, the ultimate objective of the learning-to-optimize approach is
to learn the optimal set of free parameters.
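To make the roles of the three parameter categories concrete, the following sketch (ours, not the authors' implementation; all function and variable names are illustrative) computes ps(e, Q) of formula (1) with the two free parameters k and b exposed explicitly:

```python
import math

def bm25_score(element_tf, element_len, avg_len, doc_freq, num_docs, k=2.0, b=0.75):
    """Compute ps(e, Q) of formula (1) for one candidate element.

    element_tf : dict mapping each query keyword t to tf(t, e)   (fixed parameter)
    element_len: len(e), the length of the element                (fixed parameter)
    avg_len    : avel, average element length in the collection   (constant parameter)
    doc_freq   : dict mapping each keyword t to n(t)               (constant parameter)
    num_docs   : Nd, the number of files in the collection         (constant parameter)
    k, b       : the two free parameters to be tuned
    """
    score = 0.0
    norm = 1.0 - b + b * element_len / avg_len        # length normalization term
    for t, tf in element_tf.items():
        w_t = math.log(num_docs / doc_freq[t])        # W_t = log(Nd / n(t))
        score += w_t * (k + 1.0) * tf / (k * norm + tf)
    return score
```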

3.2 Training Process


In training, there is a set of queries Q = {q_1, q_2, ..., q_m}. Each query q_i is associated with a
list of candidate elements E^i = (e^i_1, e^i_2, ..., e^i_{n(i)}), where e^i_j denotes the j-th candidate
element for query q_i and n(i) is the size of E^i. The candidate elements are defined as the
elements that contain at least one occurrence of each keyword in the query. Moreover,
each candidate element list E^i is associated with a ground-truth list G^i = (g^i_1, g^i_2, ..., g^i_{n(i)}),
indicating the relevance score of each element in E^i. Given that the data collection
we used only contains information on whether or not the passages in a document are

relevant, we apply the F measure [14] to evaluate the ground-truth score. Given a
query q_i, the ground-truth score of the j-th candidate element is defined as follows:
precision = \frac{relevant}{relevant + irrelevant}

recall = \frac{relevant}{REL}

g_j^i = \frac{(1 + 0.1^2) \cdot precision \cdot recall}{0.1^2 \cdot precision + recall}   (2)
In the formula, relevant is the length of the relevant content highlighted by the user in e,
while irrelevant stands for the length of the irrelevant parts. REL indicates the total length
of relevant content in the data collection. The general bias parameter β is set to 0.1,
denoting that the weight of precision is ten times as much as that of recall.
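As a small illustration of formula (2), the following sketch (hypothetical argument names; a minimal example, not the evaluation system itself) turns the highlighted lengths into a ground-truth score with β = 0.1:

```python
def ground_truth_score(relevant_len, irrelevant_len, total_relevant_len, beta=0.1):
    """F-measure based ground-truth score g_j^i of formula (2)."""
    if relevant_len == 0:
        return 0.0                                   # no highlighted content in e
    precision = relevant_len / (relevant_len + irrelevant_len)
    recall = relevant_len / total_relevant_len       # REL = total relevant length
    return ((1 + beta**2) * precision * recall) / (beta**2 * precision + recall)
```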
Furthermore, for each query q_i, we use the ranking function BM25 described in 3.1
to obtain the predicted relevance score of each candidate element, recorded in
R^i = (r^i_1, r^i_2, ..., r^i_{n(i)}).
Each ground-truth score list G^i and predicted score list R^i then form an "instance". The
loss function is defined as the "distance" between the ground-truth score lists G^i and the
predicted score lists R^i:
\sum_{i=1}^{m} L(G^i, R^i)   (3)
In each training epoch, the ranking function BM25 is used to compute the predicted
scores R^i. The learning module then replaces the current free parameters with new
parameters tuned according to the loss between G^i and R^i. The process stops
either when the iteration limit is reached or when the parameters no longer change.
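The training process can be outlined as in the following schematic sketch (ours, not the authors' code): `score_fn` is assumed to compute the BM25 score of formula (1), and `adjustment_fn` is assumed to return the accumulated adjustments of one of the loss functions introduced in Section 4.

```python
def train_listopt(train_queries, ground_truth, score_fn, adjustment_fn,
                  k=2.0, b=0.75, eta_k=1e-3, eta_b=1e-3, max_epochs=100, tol=1e-6):
    """Tune the free BM25 parameters k and b, following formula (4).

    train_queries : dict mapping each query q_i to its candidate element list E^i
    ground_truth  : dict mapping each query q_i to its ground-truth list G^i
    score_fn      : computes the BM25 score of one element for a query (formula (1))
    adjustment_fn : returns the accumulated adjustments (Delta_k, Delta_b) of the
                    chosen loss function (formulas (8)/(10), (13)/(14) or (16)/(17))
    """
    for _ in range(max_epochs):
        # 1. predict a score list R^i for every training query with the current k, b
        predicted = {q: [score_fn(e, q, k, b) for e in elems]
                     for q, elems in train_queries.items()}
        # 2. adjustments derived from the loss between G^i and R^i
        delta_k, delta_b = adjustment_fn(ground_truth, predicted, k, b)
        # 3. update the free parameters; eta_k and eta_b control the learning speed
        k_new, b_new = k + eta_k * delta_k, b + eta_b * delta_b
        if abs(k_new - k) < tol and abs(b_new - b) < tol:
            break                        # parameters no longer change
        k, b = k_new, b_new
    return k, b
```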

4 Loss Functions
In this section, three query-level loss functions and the corresponding tuning formulas
are discussed. The three definitions of loss are based on cosine similarity, Euclidean
distance and cross entropy, respectively. After computing the loss between the ground
truth G^i and the prediction R^i, the two free parameters k and b in BM25 are tuned
according to formula (4), in which \eta_1 and \eta_2 control the learning speed and
\Delta k and \Delta b are the adjustments derived from the loss:

k \leftarrow k + \eta_1 \cdot \Delta k, \qquad b \leftarrow b + \eta_2 \cdot \Delta b   (4)

4.1 Cosine Similarity


Widely used in text mining and information retrieval, cosine similarity is a measure of
similarity between two vectors by finding the cosine of the angle between them. The


definition of the query level loss function based on cosine similarity is:

  
L(G^i, R^i) = \frac{1}{2} \left( 1 - \frac{\sum_{j=1}^{n(i)} g_j^i \, r_j^i}{\sqrt{\sum_{j=1}^{n(i)} (g_j^i)^2} \, \sqrt{\sum_{j=1}^{n(i)} (r_j^i)^2}} \right)   (5)

Note that in a large data collection, the number of relevant documents for a query is
usually much smaller than the number of irrelevant documents. A penalty function
\varphi_j^i is therefore introduced to avoid biasing the learning towards irrelevant documents:
formula (6) gives the weight of relevant documents in the learning procedure, while
formula (7) gives the weight of irrelevant documents. The formal definition is as follows:
\varphi_j^i = \frac{NR^i + NIR^i}{NR^i} \quad \text{if } g_j^i > 0   (6)

\varphi_j^i = \frac{NR^i + NIR^i}{NIR^i} \quad \text{if } g_j^i = 0   (7)
where NR^i is the number of relevant elements for query q_i and NIR^i is the number of
irrelevant ones. After measuring the loss between the ground-truth results and the
predicted results, the adjustments \Delta k and \Delta b are determined from the derivatives
of the loss with respect to k and b.
With respect to k:

\Delta k = -\sum_{q=1}^{m} \frac{\partial L(G^q, R^q)}{\partial k}
         = \frac{1}{2} \sum_{q=1}^{m} \left( \frac{\sum_{j=1}^{n(q)} g_j^q \, \frac{\partial r_j^q}{\partial k}}{\sqrt{\sum_{j=1}^{n(q)} (r_j^q)^2} \, \sqrt{\sum_{j=1}^{n(q)} (g_j^q)^2}} - \frac{\bigl(\sum_{j=1}^{n(q)} r_j^q g_j^q\bigr) \bigl(\sum_{j=1}^{n(q)} r_j^q \, \frac{\partial r_j^q}{\partial k}\bigr)}{\sqrt{\sum_{j=1}^{n(q)} (g_j^q)^2} \, \bigl(\sum_{j=1}^{n(q)} (r_j^q)^2\bigr)^{3/2}} \right)   (8)

in which

\frac{\partial r_j^q}{\partial k} = \sum_{t \in Q} W_t \, \frac{tf(t,e) \bigl(tf(t,e) + k (1 - b + b \frac{len(e)}{avel})\bigr) - tf(t,e) (k+1) (1 - b + b \frac{len(e)}{avel})}{\bigl(tf(t,e) + k (1 - b + b \frac{len(e)}{avel})\bigr)^2}   (9)

With respect to b, analogously:

\Delta b = -\sum_{q=1}^{m} \frac{\partial L(G^q, R^q)}{\partial b}
         = \frac{1}{2} \sum_{q=1}^{m} \left( \frac{\sum_{j=1}^{n(q)} g_j^q \, \frac{\partial r_j^q}{\partial b}}{\sqrt{\sum_{j=1}^{n(q)} (r_j^q)^2} \, \sqrt{\sum_{j=1}^{n(q)} (g_j^q)^2}} - \frac{\bigl(\sum_{j=1}^{n(q)} r_j^q g_j^q\bigr) \bigl(\sum_{j=1}^{n(q)} r_j^q \, \frac{\partial r_j^q}{\partial b}\bigr)}{\sqrt{\sum_{j=1}^{n(q)} (g_j^q)^2} \, \bigl(\sum_{j=1}^{n(q)} (r_j^q)^2\bigr)^{3/2}} \right)   (10)

in which

\frac{\partial r_j^q}{\partial b} = \sum_{t \in Q} W_t \, \frac{tf(t,e) \, (k+1) \, k \, \bigl(1 - \frac{len(e)}{avel}\bigr)}{\bigl(tf(t,e) + k (1 - b + b \frac{len(e)}{avel})\bigr)^2}   (11)
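For concreteness, the two partial derivatives of formulas (9) and (11) can be sketched as follows (illustrative Python reusing the same hypothetical quantities as the scoring sketch in 3.1; not the authors' code):

```python
import math

def bm25_partials(element_tf, element_len, avg_len, doc_freq, num_docs, k, b):
    """Return (dr/dk, dr/db) for one element, following formulas (9) and (11)."""
    norm = 1.0 - b + b * element_len / avg_len            # 1 - b + b * len(e)/avel
    dr_dk, dr_db = 0.0, 0.0
    for t, tf in element_tf.items():
        w_t = math.log(num_docs / doc_freq[t])            # W_t
        denom = (tf + k * norm) ** 2
        # formula (9): derivative of the per-term score with respect to k
        dr_dk += w_t * (tf * (tf + k * norm) - tf * (k + 1.0) * norm) / denom
        # formula (11): derivative of the per-term score with respect to b
        dr_db += w_t * (tf * (k + 1.0) * k * (1.0 - element_len / avg_len)) / denom
    return dr_dk, dr_db
```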

4.2 Euclidean Distance


The Euclidean distance can also be used to define the loss function. The penalty
parameter \varphi_j^i is defined exactly as in formulas (6) and (7). The loss function
based on Euclidean distance is given in formula (12).


L(G^i, R^i) = \sqrt{\sum_{j=1}^{n(i)} (\varphi_j^i)^2 \, (r_j^i - g_j^i)^2}   (12)

As with the cosine similarity loss, we derive the derivatives of the Euclidean-distance
loss with respect to k and b. The definitions of \partial r_j^q / \partial k and
\partial r_j^q / \partial b are the same as in formulas (9) and (11), respectively.


With respect to k:

\Delta k = -\sum_{q=1}^{m} \frac{\partial L(G^q, R^q)}{\partial k} = -\sum_{q=1}^{m} \frac{\sum_{j=1}^{n(q)} (\varphi_j^q)^2 \, (r_j^q - g_j^q) \, \frac{\partial r_j^q}{\partial k}}{\sqrt{\sum_{j=1}^{n(q)} (\varphi_j^q)^2 \, (r_j^q - g_j^q)^2}}   (13)

With respect to b, analogously:

\Delta b = -\sum_{q=1}^{m} \frac{\partial L(G^q, R^q)}{\partial b} = -\sum_{q=1}^{m} \frac{\sum_{j=1}^{n(q)} (\varphi_j^q)^2 \, (r_j^q - g_j^q) \, \frac{\partial r_j^q}{\partial b}}{\sqrt{\sum_{j=1}^{n(q)} (\varphi_j^q)^2 \, (r_j^q - g_j^q)^2}}   (14)

4.3 Cross Entropy

When cross entropy is used as the metric, the loss function becomes formula (15). The
penalty parameter \varphi_j^i is again defined as in formulas (6) and (7), the tuning
adjustments \Delta k and \Delta b are given in formulas (16) and (17), respectively, and
the definitions of \partial r_j^q / \partial k and \partial r_j^q / \partial b are the same
as in formulas (9) and (11).

L(G^i, R^i) = -\sum_{j=1}^{n(i)} \varphi_j^i \, r_j^i \, \log(g_j^i)   (15)

With respect to k:


\Delta k = -\sum_{q=1}^{m} \frac{\partial L(G^q, R^q)}{\partial k} = \sum_{q=1}^{m} \sum_{j=1}^{n(q)} \varphi_j^q \, \log(g_j^q) \, \frac{\partial r_j^q}{\partial k}   (16)

With respect to b, analogously:

\Delta b = -\sum_{q=1}^{m} \frac{\partial L(G^q, R^q)}{\partial b} = \sum_{q=1}^{m} \sum_{j=1}^{n(q)} \varphi_j^q \, \log(g_j^q) \, \frac{\partial r_j^q}{\partial b}   (17)

5 Experiment
In this section, the XML data set used in the comparison experiments is first introduced.
Then, in Section 5.2, we compare the effectiveness of the optimized ranking function
BM25 under two evaluation criteria: MAP [15] and NDCG [16]. Additionally,

in Section 5.3, we focus on the relationship between the number of training queries
and the tuning performance under the MAP criterion.

5.1 Data Collection


The data collection used in the experiments consists of 2,666,190 English XML files
from Wikipedia, as used by the INEX Ad Hoc track. The total size of these files is 50.7 GB.
The query set consists of 68 different topics from the competition topics of the Ad Hoc track,
of which 40 queries are used as training queries and the others as test queries. Each
query in the evaluation system is bound to a standard set of highlighted "relevant
content", which is identified manually by the participants of INEX. In the experiments,
the training process regards this highlighted "relevant content" as the ground-truth result.

5.2 Effect of BM25 Tuning


To explore the effect of the learning-to-optimize method ListOPT, we evaluate the
effectiveness of different parameter sets. In the comparison experiments, Traditional Set
stands for the widely used traditional setting k = 2, b = 0.75; RankNet Tuning stands for
the tuning method proposed in [10]; cosine similarity, Euclidean distance and cross
entropy are the learning-to-optimize methods using cosine similarity, Euclidean distance
and cross entropy as the loss function, respectively.
The evaluation system of the Ad Hoc track is used as the standard experimental platform
for the effectiveness comparison. We evaluate the retrieval effectiveness of the
aforementioned five methods under two criteria: MAP and NDCG. For the MAP
evaluation, we choose interpolated precision at 1% recall (iP[0.01]), interpolated
precision at 10% recall (iP[0.10]) and MAiP as the evaluation measures, while for the
NDCG evaluation we test the retrieval effectiveness on NDCG@1 to NDCG@10.
Figure 1 illustrates the comparison results under the MAP measure. As shown, the
three learning-to-optimize methods proposed in this paper perform best. This may
look confusing, since search engines in the INEX 2009 Ad Hoc Focused track reached
iP[0.01] = 0.63, compared to 0.34, the highest value in our plot. However, such
effectiveness is usually obtained by combining various ranking technologies, such as
a two-layer strategy, a re-ranking strategy and a title tag bias, rather than by BM25
alone. Given that our study focuses on BM25, it would be pointless to apply such a
combination. Under this condition, the retrieval effectiveness scores (iP[0.01]) in this
experiment cannot reach the level of the competitive engines in INEX.
The results presented in Figure 2 show that the learning-to-optimize methods are
indeed more robust in ranking tasks. The performance of the ranking methods becomes
better as more results are returned. From the perspective of users, this phenomenon
can be explained by the fact that INEX queries are all informational queries, meaning
that the user's purpose is to find more relevant information. On the contrary, if the query
is a navigational query, the user only needs the exact webpage; the first result is then of
highest importance and the evaluation score may decrease accordingly.

Fig. 1. Effectiveness Comparison on MAiP

Fig. 2. Effectiveness Comparison on NDCG

5.3 Number of Training Queries


Figure 3 shows the relationship between the tuning effectiveness and the number of
training queries. In this experiment, the number of training queries varies from 1 to
40. As illustrated, all MAiP score lines rise sharply over the first few training queries;
after that, the fluctuation of the performance remains at a low level. This behavior agrees
with learning theory: when there are only a few queries in the training set, the learning
overfits. As the number of query samples increases, the performance of the learning
procedure improves until the most suitable parameters for the data collection are found.

Fig. 3. Number of Training Queries

6 Conclusions and Future Work

In this paper, we proposed a learning-to-optimize method, ListOPT, which combines
the benefits of listwise learning-to-rank technology and the traditional ranking function
BM25. In the learning process, three query-level loss functions, based on cosine
similarity, Euclidean distance and cross entropy respectively, are introduced. The
experiments on the English Wikipedia XML data set confirm that the learning-to-optimize
method indeed leads to a better parameter set.
As future work, we will first try to tune W_t in the formula. We would also like to
extend the learning-to-optimize approach ListOPT to other tuning tasks, such as
tuning the parameters of other ranking functions or ranking methods. Furthermore,
a comparison of ListOPT with other learning and ranking methods, such as ListNet,
XRank and XReal, will be carried out on benchmark data sets.

Acknowledgement

This work was supported by the National High-Tech Research and Development Plan
of China under Grant No.2009AA01Z136.

References
1. Carmel, D., Maarek, Y.S., Mandelbrod, M., et al.: Searching XML documents via XML
fragments. In: SIGIR, pp. 151–158 (2003)
2. Theobald, M., Schenkel, R., Weikum, G.: An Efficient and Versatile Query Engine for TopX
Search. In: VLDB, pp. 625–636 (2005)
3. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order
to the web. Technical report, Stanford University (1998)
4. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. JACM, 604–632 (1998)
5. Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., Li, H.: Learning to rank: from pairwise approach to
listwise approach. In: ICML, pp. 129–136 (2007)
6. INEX,
7. Nallapati, R.: Discriminative models for information retrieval. In: SIGIR, pp. 64–71 (2004)

8. Cao, Y., Xu, J., Liu, T., Li, H., Huang, Y., Hon, H.: Adapting ranking SVM to document
retrieval. In: SIGIR, pp. 186–193 (2006)
9. Freund, Y., Iyer, R., Schapire, R.E., Singer, Y.: An efficient boosting algorithm for combining
preferences. JMLR, 933–969 (2003)
10. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.:
Learning to Rank using Gradient Descent. In: ICML, pp. 89–96 (2005)
11. Qin, T., Zhang, X.D., Tsai, M.F., Wang, D.S., Liu, T.Y., Li, H.: Query-level loss functions
for information retrieval. Information Processing and Management, 838–855 (2007)
12. Xia, F., Liu, T.Y., Wang, J., Zhang, W., Li, H.: Listwise approach to learning to rank: theory
and algorithm. In: ICML, pp. 1192–1199 (2008)
13. Taylor, M., Zaragoza, H., Craswell, N., Robertson, S., Burges, C.: Optimisation Methods for
Ranking Functions with Multiple Parameters. In: CIKM, pp. 585–593 (2006)
14. van Rijsbergen, C.J.: Information Retrieval. Butterworths, London (1979)
15. Baeza-Yates, R., Ribeiro-Neto, B.: Modern information retrieval (1999)
16. Jarvelin, K., Kekalainen, J.: IR evaluation methods for retrieving highly relevant documents.
In: SIGIR, pp. 41–48 (2000)
17. Geva, S., Kamps, J., Lethonen, M., Schenkel, R., Thom, J.A., Trotman, A.: Overview of the
INEX 2009 Ad Hoc Track. In: Geva, S., Kamps, J., Trotman, A. (eds.) INEX 2009. LNCS,
vol. 6203, pp. 4–25. Springer, Heidelberg (2010)
18. Itakura, K.Y., Clarke, C.L.A.: University of waterloo at INEX 2008: Adhoc, book, and link-
the-wiki tracks. In: Geva, S., Kamps, J., Trotman, A. (eds.) INEX 2008. LNCS, vol. 5631,
pp. 132–139. Springer, Heidelberg (2009)
19. Liu, J., Lin, H., Han, B.: Study on Reranking XML Retrieval Elements Based on Combining
Strategy and Topics Categorization. In: INEX, pp. 170–176 (2007)
Item Set Mining Based on Cover Similarity

Marc Segond and Christian Borgelt

European Centre for Soft Computing


Calle Gonzalo Gutiérrez Quirós s/n, E-33600 Mieres (Asturias), Spain
{marc.segond,christian.borgelt}@softcomputing.es

Abstract. While in standard frequent item set mining one tries to find
item sets the support of which exceeds a user-specified threshold (mini-
mum support) in a database of transactions, we strive to find item sets
for which the similarity of their covers (that is, the sets of transactions
containing them) exceeds a user-specified threshold. Starting from the
generalized Jaccard index we extend our approach to a total of twelve
specific similarity measures and a generalized form. We present an effi-
cient mining algorithm that is inspired by the well-known Eclat algorithm
and its improvements. By reporting experiments on several benchmark
data sets we demonstrate that the runtime penalty incurred by the more
complex (but also more informative) item set assessment is bearable and
that the approach yields high quality and more useful item sets.

1 Introduction

Frequent item set mining and association rule induction are among the most
intensely studied topics in data mining and knowledge discovery in databases.
The enormous research efforts devoted to these tasks have led to a variety of
sophisticated and efficient algorithms, among the best-known of which are Apri-
ori [1], Eclat [27,28] and FP-growth [13]. However, these approaches, which find
item sets whose support exceeds a user-specified minimum in a given transac-
tion database, have the disadvantage that the support does not say much about
the actual strength of association of the items in the set: a set of items may
be frequent simply because its elements are frequent and thus their frequent
co-occurrence can even be expected by chance. As a consequence, the (usually
few) interesting item sets drown in a sea of irrelevant ones.
In order to improve this situation, we propose in this paper to change the
selection criterion, so that fewer irrelevant items sets are produced. For this
we draw on the insight that for associated items their covers—that is, the set
of transactions containing them—are more similar than for independent items.
Starting from the Jaccard index to illustrate this idea, we explore a total of
twelve specific similarity measures that can be generalized from pairs of sets
(or, equivalently, from pairs of binary vectors) as well as a generalized form.
By applying an Eclat-based mining algorithm to standard benchmark data sets
and to the 2008/2009 Wikipedia Selection for schools, we demonstrate that the
search times are bearable and that high quality item sets are produced.


2 Frequent Item Set Mining


Frequent item set mining was originally developed for market basket analysis,
aiming at finding regularities in the shopping behavior of the customers of su-
permarkets, mail-order companies and online shops. Formally, we are given a
set B of items, called the item base, and a database T of transactions. Each item
represents a product, and the item base represents the set of all products on
offer. The term item set refers to any subset of the item base B. Each transac-
tion is an item set and represents a set of products that has been bought by an
actual customer. Note that two or even more customers may have bought the
exact same set of products. Note also that the item base B is usually not given
explicitly, but only implicitly as the union of all transactions.
We write T = (t1 , . . . , tn ) for a database with n transactions, thus distinguish-
ing equal transactions by their position in the vector. In order to refer to the
index set, we introduce the abbreviation INn := {k ∈ IN | k ≤ n} = {1, . . . , n}.
Given an item set I ⊆ B and a transaction database T , the cover KT (I) of I
w.r.t. T is defined as KT (I) = {k ∈ INn | I ⊆ tk }, that is, as the set of indices of
transactions that contain I. The support sT (I) of an item set I ⊆ B is the num-
ber of transactions in the database T it is contained in, that is, sT (I) = |KT (I)|.
Given a user-specified minimum support smin ∈ IN, an item set I is called fre-
quent in T iff sT (I) ≥ smin . The goal of frequent item set mining is to identify
all item sets I ⊆ B that are frequent in a given transaction database T .
A standard approach to find all frequent item sets w.r.t. a given database T
and a support threshold smin , which is adopted by basically all frequent item
set mining algorithms (except those of the Apriori family), is a depth-first search
in the subset lattice of the item base B. Viewed properly, this approach can
be interpreted as a simple divide-and-conquer scheme. All subproblems that
occur in this scheme can be defined by a conditional transaction database and
a prefix. The prefix is a set of items that has to be added to all frequent item
sets that are discovered in the conditional database, from which all items in the
prefix have been removed. Formally, all subproblems are tuples S = (TC , P ),
where TC is a conditional transaction database and P ⊆ B is a prefix. The
initial problem, with which the recursion is started, is S = (T, ∅), where T is
the given transaction database to mine and the prefix is empty. A subproblem
S0 = (T0 , P0 ) is processed as follows: Choose an item i ∈ B0 , where B0 is the
set of items occurring in T0 . This choice is arbitrary, but usually follows some
predefined order of the items. If sT0 (i) ≥ smin , then report the item set P0 ∪ {i}
as frequent with the support sT0 (i), and form the subproblem S1 = (T1 , P1 ) with
P1 = P0 ∪{i}. The conditional transaction database T1 comprises all transactions
in T0 that contain the item i, but with the item i removed. This also implies that
transactions that contain no other item than i are entirely removed: no empty
transactions are ever kept. If T1 is not empty, process S1 recursively. In any
case (that is, regardless of whether sT0 (i) ≥ smin or not), form the subproblem
S2 = (T2 , P2 ), where P2 = P0 and the conditional transaction database T2
comprises all transactions in T0 (including those that do not contain the item i),
but again with the item i removed. If T2 is not empty, process S2 recursively.
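The divide-and-conquer scheme just described can be sketched in a few lines of Python (a simplified illustration that operates on plain transaction sets; the function and variable names are ours, and actual implementations rely on more refined data structures):

```python
def mine_frequent(transactions, smin, prefix=None, report=print):
    """Recursive divide-and-conquer search for frequent item sets.

    transactions : list of sets of items (the conditional database T0)
    smin         : minimum support
    prefix       : items already fixed on the current search path (P0)
    """
    prefix = prefix or []
    # items occurring in the conditional database, in some fixed order
    items = sorted({i for t in transactions for i in t})
    for i in items:
        containing = [t - {i} for t in transactions if i in t]   # transactions with i, i removed
        if len(containing) >= smin:
            report(prefix + [i], len(containing))                # P0 ∪ {i} is frequent
            remaining = [t for t in containing if t]             # never keep empty transactions
            if remaining:
                mine_frequent(remaining, smin, prefix + [i], report)
        # the second subproblem (item i removed everywhere) is realized by
        # shrinking the database and continuing with the next item
        transactions = [t - {i} for t in transactions if t - {i}]
```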

3 Jaccard Item Sets


We base our item set mining approach on the similarity of item covers rather than
on item set support. In order to measure the similarity of a set of item covers, we
start with the Jaccard index [16], which is a well-known statistic for comparing
sets. For two arbitrary sets A and B it is defined as J(A, B) = |A ∩ B|/|A ∪ B|.
Obviously, J(A, B) is 1 if the sets coincide (i.e. A = B) and 0 if they are disjoint
(i.e. A∩B = ∅). For overlapping sets its value lies between 0 and 1. The core idea
of using the Jaccard index for item set mining lies in the insight that the covers
of (positively) associated items are likely to have a high Jaccard index, while
a low Jaccard index indicates independent or even negatively associated items.
However, since we consider also item sets with more than two items, we need a
generalization to more than two sets (here: item covers). In order to achieve this,
we define the carrier LT (I) of an item set I w.r.t. a transaction database T as

LT (I) = {k ∈ INn | I ∩ tk ≠ ∅} = {k ∈ INn | ∃i ∈ I : i ∈ tk } = ∪i∈I KT ({i}).

The extent rT (I) of an item set I w.r.t. a transaction database T is the size of its
carrier, that is, rT (I) = |LT (I)|. Together with the notions of cover and support
(see above), we can define the generalized Jaccard index of an item set I w.r.t.
a transaction database T as its support divided by its extent, that is, as

JT (I) = sT (I) / rT (I) = |KT (I)| / |LT (I)| = |∩i∈I KT ({i})| / |∪i∈I KT ({i})|.

Clearly, this is a very natural and straightforward generalization of the Jaccard


index. Since for an arbitrary item a ∈ B it is obviously KT (I ∪ {a}) ⊆ KT (I)
and equally obviously LT (I ∪ {a}) ⊇ LT (I), we have sT (I ∪ {a}) ≤ sT (I) and
rT (I ∪ {a}) ≥ rT (I). From these two relations it follows JT (I ∪ {a}) ≤ JT (I) and
thus that the generalized Jaccard index w.r.t. a transaction database T over an
item base B is an anti-monotone function on the partially ordered set (2B , ⊆).
Given a user-specified minimum Jaccard value Jmin , an item set I is called
Jaccard-frequent if JT (I) ≥ Jmin . The goal of Jaccard item set mining is to
identify all item sets that are Jaccard-frequent in a given transaction database T .
Since the generalized Jaccard index is anti-monotone, this task can be addressed
with the same basic scheme as the task of frequent item set mining. The only
problem to be solved is to find an efficient scheme for computing the extent rT (I).
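The generalized Jaccard index can be computed directly from the item covers, as the following sketch illustrates (names are ours; an efficient scheme that avoids materializing the union is what Section 5 develops):

```python
def generalized_jaccard(itemset, covers):
    """J_T(I) = |intersection of the item covers| / |union of the item covers|.

    itemset : non-empty iterable of items I
    covers  : dict mapping each item i to K_T({i}), a set of transaction indices
    """
    item_covers = [covers[i] for i in itemset]
    support = len(set.intersection(*item_covers))   # s_T(I) = |K_T(I)|
    extent = len(set.union(*item_covers))           # r_T(I) = |L_T(I)|
    return support / extent if extent else 0.0

# tiny usage example
covers = {'a': {1, 2, 3}, 'b': {2, 3, 4}}
print(generalized_jaccard(['a', 'b'], covers))      # 2 / 4 = 0.5
```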

4 The Eclat Algorithm


Since we will draw on the basic scheme of the well-known Eclat algorithm for
mining Jaccard item sets, we briefly review some of its core ideas. Eclat [27] uses a
purely vertical representation of conditional transaction databases, that is, it uses
lists of transaction indices, which represent the cover of an item or an item set.
It then exploits the obvious relation KT (I ∪ {a, b}) = KT (I ∪ {a}) ∩ KT (I ∪ {b}),
which allows to extend an item set by an item. This is used in the recursive
496 M. Segond and C. Borgelt

divide-and-conquer scheme described above by intersecting the list of transaction


indices associated with the split item with the lists of transaction indices of all
items that have not yet been considered in the recursion.
An alternative to the intersection approach, which is particularly useful for
mining dense transaction databases, relies on so-called difference sets (or diffsets
for short) [28]. The diffset DT (a | I) of an item a w.r.t. an item set I and a
transaction database T is defined as DT (a | I) = KT (I) − KT (I ∪ {a}). That
is, a diffset DT (a | I) lists the indices of all transactions that contain I, but
not a. Since sT (I ∪ {a}) = sT (I) − |DT (a | I)|, diffsets are equally effective
for finding frequent item sets, provided one can derive a formula that allows
to compute diffsets with a larger conditional item set I without going through
covers (using the above definition of a diffset). However, this is easily achieved,
because DT (b | I ∪ {a}) = DT (b | I) − DT (a | I) [28]. This formula allows to
formulate the search entirely with the help of diffsets.
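The diffset recursion can be sketched as follows (an illustration of the two relations quoted above, with names of our choosing; the original dEclat implementation stores these structures far more compactly):

```python
def diffset_children(parent_support, diffsets, order):
    """One recursion step of diffset-based support computation.

    parent_support : s_T(I) of the current prefix I
    diffsets       : dict mapping each candidate item a to D_T(a | I)
    order          : processing order of the candidate items

    Yields (split item a, s_T(I ∪ {a}), diffsets conditioned on I ∪ {a}).
    """
    for idx, a in enumerate(order):
        support_a = parent_support - len(diffsets[a])      # s_T(I∪{a}) = s_T(I) - |D_T(a|I)|
        child = {b: diffsets[b] - diffsets[a]              # D_T(b|I∪{a}) = D_T(b|I) - D_T(a|I)
                 for b in order[idx + 1:]}
        yield a, support_a, child
```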

5 The JIM Algorithm (Jaccard Item Set Mining)

The diffset approach as it was reviewed in the previous section can easily be
transferred in order to find an efficient scheme for computing the carrier and
thus the extent of item sets. To this end we define the extra set ET (a | I) as

ET (a | I) = KT ({a}) − ∪i∈I KT ({i}) = {k ∈ INn | a ∈ tk ∧ ∀i ∈ I : i ∉ tk }.

That is, ET (a | I) is the set of indices of all transactions that contain a, but
no item in I, and thus identifies the extra transaction indices that have to be
added to the carrier if item a is added to the item set I. For extra sets we have
ET (a | I ∪ {b}) = ET (a | I) − ET (b | I), which corresponds to the analogous
formula for diffsets reviewed above. This relation is easily verified as follows:

ET (a | I) − ET (b | I)
= {k ∈ INn | a ∈ tk ∧ ∀i ∈ I : i ∉ tk } − {k ∈ INn | b ∈ tk ∧ ∀i ∈ I : i ∉ tk }
= {k ∈ INn | a ∈ tk ∧ ∀i ∈ I : i ∉ tk ∧ ¬(b ∈ tk ∧ ∀i ∈ I : i ∉ tk )}
= {k ∈ INn | a ∈ tk ∧ ∀i ∈ I : i ∉ tk ∧ (b ∉ tk ∨ ∃i ∈ I : i ∈ tk )}
= {k ∈ INn | (a ∈ tk ∧ ∀i ∈ I : i ∉ tk ∧ b ∉ tk )
             ∨ (a ∈ tk ∧ ∀i ∈ I : i ∉ tk ∧ ∃i ∈ I : i ∈ tk )}
  (the second disjunct is false)
= {k ∈ INn | a ∈ tk ∧ ∀i ∈ I : i ∉ tk ∧ b ∉ tk }
= {k ∈ INn | a ∈ tk ∧ ∀i ∈ I ∪ {b} : i ∉ tk }
= ET (a | I ∪ {b})

In order to see how extra sets can be used to compute the extent of item sets,
let I = {i1 , . . . , im }, with some arbitrary, but fixed order of the items that is
indicated by the index. This will be the order in which the items are used as

Table 1. Quantities in terms of which the considered similarity measures are specified,
together with their behavior as functions on the partially ordered set (2B , ⊆)

quantity                                     behavior
nT                                           constant
sT (I) = |KT (I)| = |∩i∈I KT ({i})|          anti-monotone
rT (I) = |LT (I)| = |∪i∈I KT ({i})|          monotone
qT (I) = rT (I) − sT (I)                     monotone
zT (I) = nT − rT (I)                         anti-monotone

split items in the recursive divide-and-conquer scheme. It is


LT (I) = ∪_{k=1}^{m} KT ({ik }) = ∪_{k=1}^{m} ( KT ({ik }) − ∪_{l=1}^{k−1} KT ({il }) ) = ∪_{k=1}^{m} ET (ik | {i1 , . . . , ik−1 }),

and since the terms of the last union are clearly all disjoint, we have immediately

rT (I) = Σ_{k=1}^{m} |ET (ik | {i1 , . . . , ik−1 })| = rT (I − {im }) + |ET (im | I − {im })|.

Thus we have a simple recursive scheme to compute the extent of an item set
from its parent in the search tree (as defined by the divide-and-conquer scheme).
The mining algorithm can now easily be implemented as follows: initially
we create a vertical representation of the given transaction database. The only
difference to the Eclat algorithm is that we have two transaction lists per item i:
one represents KT ({i}) and the other ET (i | ∅), which happens to be equal to
KT ({i}). (That is, for the initial transaction database the two lists are identical,
which, however, will obviously not be maintained in the recursive processing.)
In the recursion the first list for the split item is intersected with the first list
of all other items to form the list representing the cover of the corresponding
pair. The second list of the split item is subtracted from the second lists of all
other items, thus yielding the extra sets of transactions for these items given the
split item. From the sizes of the resulting lists the support and the extent of the
enlarged item sets and thus their generalized Jaccard index can be computed.
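A compact sketch of this search scheme is given below (ours, not the published C implementation). It keeps, for every item, both the cover list and the extra-set list, and prunes with the minimum Jaccard threshold, which is justified by the anti-monotonicity shown in Section 3.

```python
def jim(covers, jmin, smin=1):
    """Mine item sets whose generalized Jaccard index is at least jmin.

    covers : dict mapping each item to its cover K_T({i}) as a set of transaction ids
    """
    results = []

    def recurse(prefix, support_sets, extra_sets, extent, items):
        for idx, i in enumerate(items):
            cover = support_sets[i]                    # K_T(prefix ∪ {i})
            extent_i = extent + len(extra_sets[i])     # r_T grows by the extra transactions
            support = len(cover)
            if support < smin or support / extent_i < jmin:
                continue                               # anti-monotone: prune this branch
            itemset = prefix + [i]
            results.append((itemset, support, support / extent_i))
            rest = items[idx + 1:]
            if rest:
                # first lists: intersect with the split item's cover
                new_support = {j: cover & support_sets[j] for j in rest}
                # second lists: subtract the split item's extra set
                new_extra = {j: extra_sets[j] - extra_sets[i] for j in rest}
                recurse(itemset, new_support, new_extra, extent_i, rest)

    items = sorted(covers)
    recurse([], dict(covers), {i: set(covers[i]) for i in covers}, 0, items)
    return results
```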

6 Other Similarity Measures


Up to now we focused on the (generalized) Jaccard index to measure the simi-
larity of sets (covers). However, there is a large number of alternatives. Recent
extensive overviews for the pairwise case include [5] and [6].
The JIM algorithm (as presented above) allows us to easily compute the
quantities listed in Table 1. With these quantities a wide range of similarity
measures for sets or binary vectors can be generalized. Exceptions are those
measures that refer explicitly to the number of cases in which a vector x is 1
while the other vector y is 0, and distinguish this number from the number

Table 2. Considered similarity measures for sets/binary vectors

Measures derived from the inner product:

  Russel & Rao [21]:                      SR = s/n = s/(r + z)
  Kulczynski [19]:                        SK = s/q = s/(r − s)
  Jaccard [16] / Tanimoto [26]:           SJ = s/(s + q) = s/r
  Dice [8] / Sørensen [25] /
    Czekanowski [7]:                      SD = 2s/(2s + q) = 2s/(r + s)
  Sokal & Sneath 1 [24,22]:               SS = s/(s + 2q) = s/(r + q)

Measures derived from the Hamming distance:

  Sokal & Michener / Hamming [23,15]:     SM = (s + z)/n = (n − q)/n
  Faith [10]:                             SF = (2s + z)/(2n) = (s + ½z)/n
  AZZOO [5], σ ∈ [0, 1]:                  SZ = (s + σz)/n
  Rogers & Tanimoto [20]:                 ST = (s + z)/(n + q) = (n − q)/(n + q)
  Sokal & Sneath 2 [24,22]:               SN = 2(s + z)/(n + s + z) = (n − q)/(n − ½q)
  Sokal & Sneath 3 [24,22]:               SO = (s + z)/q = (n − q)/q
  Baroni-Urbani & Buser [3]:              SB = (√(sz) + s)/(√(sz) + r)

of cases in which y is 1 and x is 0. This distinction is difficult to generalize


beyond the pairwise case, because the number of possible assignments of zeros
and ones to the different vectors, each of which one would have to consider for
a generalization, grows exponentially with the number of these vectors (here:
covers, and thus: items) and therefore becomes quickly infeasible.
By collecting from [6] similarity measures that are specified in terms of the
quantities listed in Table 1, we compiled Table 2. Note that the index T and the
argument I are omitted to make the formulas more easily readable. Note also
that the Hamann measure SH = (s + z − q)/n = (n − 2q)/n [14] listed in [6] is equivalent to
the Sokal & Michener measure SM , because SH + 1 = 2SM , and hence omitted.
Likewise, the second Baroni-Urbani & Buser measure SU = (√(sz) + s − q)/(√(sz) + r) listed in [6]
is equivalent to the one given in Table 2, because SU + 1 = 2SB . Finally, note
that all of the measures listed in Table 2 have range [0, 1] except SK (Kulczynski)
and SO (Sokal & Sneath 3), which have range [0, ∞).
Table 2 is split into two parts depending on whether the numerator of a
measure refers only to the support s or to both the support s and the number z
of transactions that do not contain any of the items in the considered set. The
former are often referred to as based on the inner product, because in the pairwise
case s is the value of the inner (or scalar) product of the binary vectors that
are compared. The latter measures (that is, those referring to both s and z)
are referred to as based on the Hamming distance, because in the pairwise case
q is the Hamming distance of the two vectors and n − q = s + z their Hamming
similarity. The decision whether for a given application the term z should be
considered in the numerator of a similarity measure or not is difficult. Discussions
of this issue for the pairwise case can be found in [22] and [9].

Note that the Russel & Rao measure is simply normalized support, demon-
strating that our framework comprises standard frequent item set mining as a
special case. The Sokal & Michener measure is simply the normalized Hamming
similarity. The Dice/Sørensen/Czekanowski measure may be defined without the
factor 2 in the numerator, changing the range to [0, 0.5]. The Faith measure is
equivalent to the AZZOO measure (alter zero zero one one) for σ = 0.5, and
the Sokal & Michener measure results for σ = 1. AZZOO is meant to introduce
flexibility in how much weight should be placed on z, the number of transactions
which lack all items in I (zero zero), relative to s (one one).
All measures listed in Table 2 are anti-monotone on the partially ordered
set (2B , ⊆), where B is the underlying item base. This is obvious if in at least
one of the formulas given for a measure the numerator is (a multiple of) a
constant or anti-monotone quantity or a (weighted) sum of such quantities, and
the denominator is (a multiple of) a constant or monotone quantity or a (weighted)
sum of such quantities (see Table 1). This is the case for all but SD , SN and SB .
That SD is anti-monotone can be seen by considering its reciprocal value
SD⁻¹ = (2s + q)/(2s) = 1 + q/(2s). Since q is monotone and s is anti-monotone, SD⁻¹ is clearly
monotone and thus SD is anti-monotone. Applying the same approach to SB ,
we arrive at SB⁻¹ = (√(sz) + r)/(√(sz) + s) = (√(sz) + s + q)/(√(sz) + s) = 1 + q/(√(sz) + s). Since q is monotone and
both s and √(sz) are anti-monotone, SB⁻¹ is clearly monotone and thus SB is anti-
monotone. Finally, SN can be written as SN = (2n − 2q)/(2n − q) = 1 − q/(2n − q) = 1 − q/(n + s + z).
Since q is monotone, the numerator is monotone, and since n is constant and s
and z are anti-monotone, the denominator is anti-monotone. Hence the fraction
is monotone and, since it is subtracted from 1, SN is anti-monotone.
Note that all measures in Table 2 can be expressed as

S = (c0·s + c1·z + c2·n + c3·√(sz)) / (c4·s + c5·z + c6·n + c7·√(sz))     (1)

by specifying appropriate coefficients c0 , . . . , c7 . For example, we obtain SJ for
c0 = c6 = 1, c5 = −1 and c1 = c2 = c3 = c4 = c7 = 0, since SJ = s/r = s/(n − z).
Similarly, we obtain SO for c0 = c1 = c6 = 1, c4 = c5 = −1 and c2 = c3 = c7 = 0,
since SO = (s + z)/q = (s + z)/(n − s − z). This general form allows for a flexible specification of
various similarity measures. Note, however, that not all selections of coefficients
lead to an anti-monotone measure and hence one has to carefully check this
property before using a measure that differs from the pre-specified ones.
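The general form (1) lends itself to a direct implementation. The sketch below (with coefficient tuples of our choosing) evaluates a measure from the quantities of Table 1 and reproduces the Jaccard and Sokal & Sneath 3 settings given above; anti-monotonicity of a user-supplied coefficient set still has to be verified separately, as noted in the text.

```python
from math import sqrt

def general_similarity(s, z, n, coeffs):
    """S = (c0*s + c1*z + c2*n + c3*sqrt(s*z)) / (c4*s + c5*z + c6*n + c7*sqrt(s*z))."""
    c = coeffs
    root = sqrt(s * z)
    num = c[0] * s + c[1] * z + c[2] * n + c[3] * root
    den = c[4] * s + c[5] * z + c[6] * n + c[7] * root
    return num / den

# coefficient settings for two of the pre-specified measures
jaccard        = (1, 0, 0, 0, 0, -1, 1, 0)   # S_J = s / (n - z) = s / r
sokal_sneath_3 = (1, 1, 0, 0, -1, -1, 1, 0)  # S_O = (s + z) / (n - s - z) = (s + z) / q
```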

7 Experiments
We implemented the described item set mining approach as a C program that
was derived from an Eclat implementation by adding the second transaction
identifier list for computing the extent of item sets. All similarity measures listed
in Table 2 are included as well as the general form (1). This implementation has
been made publicly available under the GNU Lesser (Library) Public License.1
1
See http://www.borgelt.net/jim.html

In a first set of experiments we applied the program to five standard bench-


mark data sets, which exhibit different characteristics, and compared it to a
standard Eclat search. We used BMS-Webview-1 (a web click stream from a
leg-care company that no longer exists, which has been used in the KDD cup
2000 [17]), T10I4D100K (an artificial data set generated with IBM’s data gen-
erator [29]), census (a data set derived from an extract of the US census bureau
data of 1994, which was preprocessed by discretizing numeric attributes), chess
(a data set listing chess end game positions for king vs. king and rook), and
mushroom (a data set describing poisonous and edible mushrooms by different
attributes). The first two data sets are available in the FIMI repository [11],
the last three in the UCI machine learning repository [2]. The discretization of
the numeric attributes in the census data set was done with a shell/gawk script
that can be found on the web page given in footnote 1 (previous page). For the
experiments we used an Intel Core 2 Quad Q9650 (3GHz) machine with 8 GB
main memory running Ubuntu Linux 10.4 (64 bit) and gcc version 4.4.3.
The goal of these experiments was to determine how much the computation
of the carrier/extent of an item set affected the execution time. Therefore we
ran the JIM algorithm with Jmin = 0, using only a minimum support threshold.
As a consequence, JIM and Eclat always found exactly the same set of frequent
item sets and any difference in execution time comes from the additional costs
of the carrier/extent computation. In addition, we checked which item order
(ascending or descending w.r.t. their frequency) yields the shortest search times.
The results are depicted in the diagrams in Figure 1. We observe that pro-
cessing the items in increasing order of frequency always works better for Eclat
(black and grey curves)—as expected. For JIM, however, the best order depends
on the data set: on census, BMS-Webview-1 and T10I4D100K descending order
is better (red curve is lower than blue), on chess ascending order is better (blue
curve is lower than red), while on mushroom it depends on the minimum support
which order yields the shorter time (red curve intersects blue).
We interpret these findings as follows: for the support computation (which
is all Eclat does) it is better to process the items in ascending order, because
this reduces the average length of the transaction id lists. By intersecting with
short lists early, the lists processed in the recursion tend to be shorter and thus
are processed faster. However, for the extent computation the opposite order is
preferable. Since it works on extra sets, it is advantageous to add frequent items
as early as possible to the carrier, because this increases the size of the already
covered carrier and thus reduces the average length of the extra lists. Therefore,
since there are different preferences, it depends on the data set which operation
governs the complexity and thus which item order is better.
From Figure 1 we conjecture that dense data sets (high fraction of ones in
a bit matrix representation), like chess and mushroom, favor ascending order,
while sparse data sets, like census, BMS-Webview-1 and T10I4D100K, favor
descending order. This is plausible, because in dense data sets intersection lists
tend to be long, so it is important to reduce them. In sparse data sets, however,
extra lists tend to be long, so here it is more important to focus on them.

[Figure 1 comprises five panels — census, chess, webview1, mushroom and T10I4D100K — each plotting the logarithm of the execution time over the minimum support for jim asc., jim desc., eclat asc. and eclat desc.]

Fig. 1. Logarithms of execution times, measured in seconds, over absolute minimum
support for Jaccard item set mining compared to standard Eclat frequent item set
mining. Items were processed in ascending or descending order w.r.t. their frequency.
Jaccard item set mining was executed with Jmin = 0, thus ensuring that exactly the
same item sets are found.

Naturally, the execution times of JIM are always greater than those of the
corresponding Eclat runs (with the same order of the items), but the execution
times are still bearable. This shows that even if one does not use a similarity
measure to prune the search, this additional information can be computed fairly
efficiently. However, it should be kept in mind that the idea of the approach is to
set a threshold for the similarity measure, which can effectively prune the search,
so that the actual execution times found in applications are much lower. In our
own practice we basically always achieved execution times that were lower than
for the Eclat algorithm (but, of course, with a different output).

Table 3. Jaccard item sets found in the 2008/2009 Wikipedia Selection for schools

item set sT JT
Reptiles, Insects 12 1.0000
phylum, chordata, animalia 34 0.7391
planta, magnoliopsida, magnoliophyta 14 0.6667
wind, damag, storm, hurrican, landfal 23 0.1608
tournament, doubl, tenni, slam, Grand Slam 10 0.1370
dinosaur, cretac, superord, sauropsida, dinosauria 10 0.1149
decai, alpha, fusion, target, excit, dubna 12 0.1121
conserv, binomi, phylum, concern, animalia, chordata 14 0.1053

In another experiment we used an extract from the 2008/2009 Wikipedia


Selection for schools2 , which consisted of 4861 web pages. Each of these web
pages was taken as a transaction and processed with standard text processing
methods (name detection, stemming, stop word removal etc.) to extract a total
of 59330 terms/keywords. The terms occurring on a web page are the items
occurring in the corresponding transaction. The resulting data file was then
mined for Jaccard item sets with a threshold of Jmin = 0.1. Some examples of
found term associations are listed in Table 3.
Clearly, there are several term sets with surprisingly high Jaccard indices and
thus strongly associated terms. For example, “Reptiles” and “Insects” always
appear together (on a total of 12 web pages) and never alone. A closer inspection
revealed, however, that this is an artifact of the name detection, which extracts
these terms from the Wikipedia category title “Insects, Reptiles and Fish” (but
somehow treats “Fish” not as a name, but as a normal word). All other item
sets contain normal terms, though (only “Grand Slam” is another name), and
are no artifacts of the text processing step. The second item set captures several
biology pages, which describe different vertebrates, all of which belong to the
phylum “chordata” and the kingdom “animalia”. The third set indicates that
this selection contains a surprisingly high number of pages referring to magnolias.
The remaining item sets show that term sets with five or even six terms can
exhibit a quite high Jaccard index, even though they have a fairly low support.
An impression of the filtering power can be obtained by comparing the size
of the output to standard frequent item set mining: for smin = 10 there are
83130 frequent item sets and 19394 closed item sets with at least two items.
A threshold of Jmin = 0.1 for the (generalized) Jaccard index reduces the output
to 5116 (frequent) item sets. From manual inspection, we gathered the impression
that the Jaccard item sets contained more meaningful sets and that the Jaccard
index was a valuable additional piece of information. It has to be conceded,
though, that whether item sets are more “meaningful” or “interesting” is difficult
to assess, because this requires an objective measure, which is not available.
However, the usefulness of our method is indirectly supported by a successful
application of the Jaccard item set mining approach for concept detection, for
2
See http://schools-wikipedia.org/

which standard frequent item set mining did not yield sufficiently good results.
This was carried out in the EU FP7 project BISON3 and is reported in [18].

8 Conclusions

We introduced the notion of a Jaccard item set as an item set for which the
(generalized) Jaccard index of its item covers exceeds a user-specified threshold.
In addition, we extended this basic idea to a total of twelve similarity measures
for sets or binary vectors, all of which can be generalized in the same way and
can be shown to be anti-monotone. By exploiting an idea that is similar to
the difference set approach for the well-known Eclat algorithm, we derived an
efficient search scheme that is based on forming intersections and differences of
sets of transaction indices in order to compute the quantities that are needed
to compute the similarity measures. Since it contains standard frequent item
set mining as a special case, mining item sets based on cover similarity yields a
flexible and versatile framework. Furthermore, the similarity measures provide
highly useful additional assessments of found item sets and thus help us to select
the interesting ones. By running experiments on standard benchmark data sets
we showed that mining item sets based on cover similarity can be done fairly
efficiently, and by evaluating the results obtained with a threshold for the cover
similarity measure we demonstrated that the output is considerably reduced,
while expressive and meaningful item sets are preserved.

Acknowledgements

This work was supported by the European Commission under the 7th Framework
Program FP7-ICT-2007-C FET-Open, contract no. BISON-211898.

References
1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Proc.
20th Int. Conf. on Very Large Databases (VLDB 1994), Santiago de Chile, pp.
487–499. Morgan Kaufmann, San Mateo (1994)
2. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. School of Infor-
mation and Computer Science, University of California at Irvine, CA, USA (2007),
http://www.ics.uci.edu/~mlearn/MLRepository.html
3. Baroni-Urbani, C., Buser, M.W.: Similarity of Binary Data. Systematic Zool-
ogy 25(3), 251–259 (1976)
4. Bayardo, R., Goethals, B., Zaki, M.J. (eds.): Proc. Workshop Frequent Item Set
Mining Implementations (FIMI 2004), Brighton, UK. CEUR Workshop Proceed-
ings 126, Aachen, Germany (2004), http://www.ceur-ws.org/Vol-126/
5. Cha, S.-H., Tappert, C.C., Yoon, S.: Enhancing Binary Feature Vector Similarity
Measures. J. Pattern Recognition Research 1, 63–77 (2006)
3
See http://www.bisonet.eu/
504 M. Segond and C. Borgelt

6. Choi, S.-S., Cha, S.-H., Tappert, C.C.: A Survey of Binary Similarity and Distance
Measures. Journal of Systemics, Cybernetics and Informatics 8(1), 43–48 (2010)
7. Czekanowski, J.: Zarys metod statystycznych w zastosowaniu do antropologii [An
Outline of Statistical Methods Applied in Anthropology]. Towarzystwo Naukowe
Warszawskie, Warsaw (1913)
8. Dice, L.R.: Measures of the Amount of Ecologic Association between Species. Ecol-
ogy 26, 297–302 (1945)
9. Dunn, G., Everitt, B.S.: An Introduction to Mathematical Taxonomy. Cambridge
University Press, Cambirdge (1982)
10. Faith, D.P.: Asymmetric Binary Similarity Measures. Oecologia 57(3), 287–290
(1983)
11. Goethals, B. (ed.): Frequent Item Set Mining Dataset Repository. University of
Helsinki, Finland (2004), http://fimi.cs.helsinki.fi/data/
12. Goethals, B., Zaki, M.J. (eds.): Proc. Workshop Frequent Item Set Mining Imple-
mentations (FIMI 2003), Melbourne, FL, USA. CEUR Workshop Proceedings 90,
Aachen, Germany (2003), http://www.ceur-ws.org/Vol-90/
13. Han, J., Pei, H., Yin, Y.: Mining Frequent Patterns without Candidate Generation.
In: Proc. Conf. on the Management of Data (SIGMOD 2000), Dallas, TX, pp. 1–12.
ACM Press, New York (2000)
14. Hamann, V.: Merkmalbestand und Verwandtschaftsbeziehungen der Farinosae. Ein
Beitrag zum System der Monokotyledonen 2, 639–768 (1961)
15. Hamming, R.V.: Error Detecting and Error Correcting Codes. Bell Systems Tech.
Journal 29, 147–160 (1950)
16. Jaccard, P.: Étude comparative de la distribution florale dans une portion des Alpes
et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37, 547–579
(1901)
17. Kohavi, R., Bradley, C.E., Frasca, B., Mason, L., Zheng, Z.: KDD-Cup 2000 Or-
ganizers’ Report: Peeling the Onion. SIGKDD Exploration 2(2), 86–93 (2000)
18. Kötter, T., Berthold, M.R.: Concept Detection. In: Proc. 8th Conf. on Computing
and Philosophy (ECAP 2010). University of Munich, Germany (2010)
19. Kulczynski, S.: Classe des Sciences Mathématiques et Naturelles. Bulletin Int. de
l’Acadamie Polonaise des Sciences et des Lettres Série B (Sciences Naturelles) (Sup-
plement II), 57–203 (1927)
20. Rogers, D.J., Tanimoto, T.T.: A Computer Program for Classifying Plants. Sci-
ence 132, 1115–1118 (1960)
21. Russel, P.F., Rao, T.R.: On Habitat and Association of Species of Anopheline
Larvae in South-eastern Madras. J. Malaria Institute 3, 153–178 (1940)
22. Sneath, P.H.A., Sokal, R.R.: Numerical Taxonomy. Freeman Books, San Francisco
(1973)
23. Sokal, R.R., Michener, C.D.: A Statistical Method for Evaluating Systematic Re-
lationships. University of Kansas Scientific Bulletin 38, 1409–1438 (1958)
24. Sokal, R.R., Sneath, P.H.A.: Principles of Numerical Taxonomy. Freeman Books,
San Francisco (1963)
25. Sørensen, T.: A Method of Establishing Groups of Equal Amplitude in Plant Soci-
ology based on Similarity of Species and its Application to Analyses of the Vegeta-
tion on Danish Commons. Biologiske Skrifter / Kongelige Danske Videnskabernes
Selskab 5(4), 1–34 (1948)
26. Tanimoto, T.T.: IBM Internal Report, November 17 (1957)

27. Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W.: New Algorithms for Fast Dis-
covery of Association Rules. In: Proc. 3rd Int. Conf. on Knowledge Discovery and
Data Mining (KDD 1997), Newport Beach, CA, pp. 283–296. AAAI Press, Menlo
Park (1997)
28. Zaki, M.J., Gouda, K.: Fast Vertical Mining Using Diffsets. In: Proc. 9th ACM
SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2003), Wash-
ington, DC, pp. 326–335. ACM Press, New York (2003)
29. Synthetic Data Generation Code for Associations and Sequential Patterns. Intelli-
gent Information Systems, IBM Almaden Research Center,
http://www.almaden.ibm.com/software/quest/Resources/index.shtml
Learning to Advertise:
How Many Ads Are Enough?

Bo Wang1 , Zhaonan Li2 , Jie Tang2 , Kuo Zhang3 , Songcan Chen1 ,


and Liyun Ru3
1
Department of Computer Science and Engineering, Nanjing University of
Aeronautics and Astronautics, China
bowang@nuaa.edu.cn
2
Department of Computer Science, Tsinghua University, Beijing, China
jietang@tsinghua.edu.cn
3
Sohu Inc. R&D center, Beijing, China
zhangkuo@sohu-rd.com

Abstract. Sponsored advertisement (ad) has already become the major source of
revenue for most popular search engines. One fundamental challenge facing all search
engines is how to achieve a balance between the number of displayed ads and the
potential annoyance to the users. Displaying more ads would improve the chance of the
user clicking an ad. However, when the ads are not really relevant to the users' interests,
displaying more may annoy them and even "train" them to ignore ads. In this paper, we
study the interesting problem of how many ads should be displayed for a given query.
We use statistics on real ad click-through data to show the existence of the problem and
the possibility of predicting the ideal number. There are two main observations: 1) when
the click entropy of a query exceeds a threshold, the CTR of that query will be very near
zero; 2) the threshold of click entropy can be automatically determined when the number
of removed ads is given. Further, we propose a learning approach to rank the ads and
to predict the number of displayed ads for a given query. The experimental results on a
commercial search engine dataset validate the effectiveness of the proposed approach.

1 Introduction

Sponsored search places ads on the result pages of web search engines for dif-
ferent queries. All major web search engines (Google, Microsoft, Yahoo!) derive
significant revenue from such ads. However, the advertisement problem is often
treated as the same problem as traditional web search, i.e., to find the most
relevant ads for a given query. One different and also usually ignored problem
is “how many ads are enough for a sponsored search”. Recently, a few research
works have been conducted on this problem [5,6,8,17]. For example, Broder et
al. study the problem of “whether to swing”, that is, whether to show ads for an
incoming query [3]; Zhu et al. propose a method to directly optimize the revenue
in sponsored search [22]. In most existing search engines, the problem has been


treated as an engineering issue. For example, some search engines always display a
fixed number of ads, while others use heuristic rules to determine the number of
displayed ads. However, the key question is still open: how do we optimize the number
of displayed ads for an incoming query?
Motivation Example. Figure 1 (a) illustrates an example of sponsored search. The
query is "house"; the first result is a suggested ad with a yellow background, and the
search results are listed at the bottom of the page. Our goal is to predict the number of
displayed ads for a given query. The problem is not easy, as it is usually difficult to
accurately define the relevance between an ad and the query. We conducted several
statistical studies on the log data of a commercial search engine. The procedure has two
stages: first, for each query, we obtain all the ads returned by the search engine; second,
we use some method to remove several unnecessarily displayed ads (detailed in Section 4).
Figure 1 (b) and (c) show the statistical results on a large click-through data set (the
DS BroadMatch dataset in Section 3). The number of "removed ads" refers to the total
number of ads cut off in the second stage over all queries. Figure 1(b) shows how #clicks
and the Click-Through Rate (CTR) vary with the number of removed ads. We see that
as the number of removed ads increases, #clicks decreases, while CTR clearly increases.
This matches our intuition well: displaying more ads gains more clicks, but if many of
them are irrelevant, CTR suffers. Figure 1(c) further shows how CTR increases as
#clicks decreases, which is very interesting. It has also been reported that many clicks
on the first displayed ad happen before the users realize that it is not the first search
result. The basic idea here is that we can remove some displayed ads to achieve better
CTR performance.

[(a) Sponsored search  (b) #removed ads vs. #clicks & CTR  (c) #clicks vs. CTR]

Fig. 1. Motivation example

Thus, the problem becomes how to predict the number of displayed ads for
an incoming query, which is non-trivial and poses two unique challenges:
• Ad ranking. For a given query, a list of related ads will be returned. Ads
displayed at the top positions should be more relevant to the query. Thus,
the first challenge is how to rank these ads.
• Ad Number Prediction. After we get the ranking list of ads, it is necessary
to answer the question “how many ads should we show?”.

Contributions. To address the above two challenges, we propose a learning-


based framework. To summarize, our contributions are three-fold:

• We performed a deep analysis of the click-through data and found that when
the click entropy of a query exceeds a threshold, the CTR of that query will
be very near zero.
• We developed a method to determine the number of displayed ads for a given
query by an automatically selected threshold of click entropy.
• We conducted experiments on a commercial search engine and experimental
results validate the effectiveness of the proposed approach.

2 Problem Definition
Suppose we have click-through data collected from a search engine. Each
record can be represented by a triple {q, adq (p), cq (p)}, where for each query q,
adq (p) is the ad at position p returned by the search engine and cq (p) is a
binary indicator which is 1 if this ad is clicked under this query and 0 otherwise.
For each ad adq (p), there is an associated feature vector xq (p) extracted from the
query-ad pair (q, adq (p)), which can be utilized for learning the ranking model.
Ad Ranking: Given the training data denoted by L = {q, AD_q, C_q}_{q∈Q}, in
which Q is the query collection, for each q ∈ Q, AD_q = {ad_q(1), ..., ad_q(n_q)}
is its related ad list and C_q = {c_q(1), ..., c_q(n_q)} are the click indicators, where
n_q is the total number of displayed ads. Similarly, the test data can be denoted
by T = {q', AD_{q'}}_{q'∈Q'}, where Q' is the test query collection. In this task, we try
to learn a ranking function for displaying the query-related ads by relevance.
For each query q' ∈ Q', the output of this task is the ranked ad list R_{q'} =
{ad_{q'}(i_1), ..., ad_{q'}(i_{n_{q'}})}, where (i_1, ..., i_{n_{q'}}) is a permutation of (1, ..., n_{q'}).
Ad Number Prediction: Given the ranked ad list R_{q'} for query q', in this
task we try to determine the number of displayed ads k and then display the
top-k ads. The output of this task can be denoted by a tuple O = {q', R^k_{q'}}_{q'∈Q'},
where R^k_{q'} are the top-k ads from R_{q'}.
Our problem is quite different from existing work on advertisement recommendation.
Zhu et al. propose a method to directly optimize the revenue in
sponsored search [22]. However, they only consider how to maximize the revenue
and ignore the user experience. When no ads are relevant to the users' interests,
displaying irrelevant ads may lead to many complaints from users and may even
train them to ignore ads. Broder et al. study the problem of "whether to swing",
that is, whether to show any ads for an incoming query [3]. However, they simplify
the problem to binary classification, while in most real cases the problem is more
complex and often requires a dynamic number of displayed ads. Little work has
been done on dynamically predicting the number of displayed ads for a given query.

3 Data Insight Analysis


3.1 Data Set
In this paper, we use one month of click-through data collected from the log of
Sogou1, a well-known Chinese search engine operated by the search department of
the Sohu company, a premier online brand in China that is part of the daily life of
millions of Chinese users. In the data set, each record consists of the user's query,
the ad's keyword, the ad's title, the ad's description, the displayed position of the
ad, and the ad's bidding price. For the training dataset DS BroadMatch, the total
size is around 3.5 GB, containing about 4 million queries, 60k keywords, and 80k
ads with 25 million records and 400k clicks. In this dataset, an ad is triggered when
there are common words between its keyword and the user's search query. We also
have a smaller training dataset DS ExactMatch, a subset of DS BroadMatch, which
contains about 28k queries, 29k keywords, and 53k ads with 4.4 million records
and 150k clicks. In DS ExactMatch, an ad is triggered only when its keywords
exactly match words in the user's search query. For the test set, the total size is
about 90 MB with 430k records and 1k clicks.

3.2 Position vs. Click-Through Rate (CTR)


Figure 2 illustrates how CTR varies with position on the dataset
DS ExactMatch. We can see that clicks mainly fall on the top
three positions of the ad list for a query, so clicks are position-dependent.
Fig. 2. How CTR varies with the positions
Fig. 3. How the number of removed ads varies with the click entropy of a query

3.3 Click Entropy


In this section we conduct several data analyses based on the measure called
click entropy. For a given query q, the click entropy is defined as follows [11]:

ClickEntropy(q) = -\sum_{ad \in \mathcal{P}(q)} P(ad \mid q) \log_2 P(ad \mid q)    (1)

where \mathcal{P}(q) is the collection of ads clicked on query q and P(ad \mid q) = |Clicks(q, ad)| / |Clicks(q)|
is the ratio of the number of clicks on ad to the number of clicks on query q.
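To make the computation concrete, the following is a minimal Python sketch of Eq. (1). It assumes the click log can be represented as simple (query, ad, clicked) triples; this data layout and the function name are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def click_entropy(click_log):
    """Compute ClickEntropy(q) of Eq. (1) for every query in a click log.

    click_log: iterable of (query, ad, clicked) triples, clicked being 0/1.
    Returns a dict mapping each query with at least one click to its entropy.
    """
    clicks_per_query_ad = defaultdict(Counter)
    for query, ad, clicked in click_log:
        if clicked:
            clicks_per_query_ad[query][ad] += 1   # |Clicks(q, ad)|

    entropy = {}
    for query, ad_counts in clicks_per_query_ad.items():
        total = sum(ad_counts.values())           # |Clicks(q)|
        h = 0.0
        for count in ad_counts.values():
            p = count / total                     # P(ad | q)
            h -= p * math.log2(p)
        entropy[query] = h
    return entropy

# Toy usage: "house" concentrates clicks on fewer ads, so its entropy is lower.
log = [("house", "ad1", 1), ("house", "ad1", 1), ("house", "ad2", 1),
       ("mp3 player", "ad3", 1), ("mp3 player", "ad4", 1), ("mp3 player", "ad5", 1)]
print(click_entropy(log))
```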
1 http://www.sogou.com

Fig. 4. How Max-Clicked-Position varies with the click entropy on the two datasets: (a) DS ExactMatch, (b) DS BroadMatch

A smaller click entropy means that the majority of users agree with each
other on a small number of ads, while a larger click entropy indicates a bigger
query diversity, that is, many different ads are clicked for the same query.
Click Entropy vs. #Removed Ads. Figure 3 shows how the number of removed
ads varies with the click entropy of a query on the dataset DS BroadMatch. Given
this distribution, if we want to remove a given number of ads for a query, we can
automatically obtain the corresponding threshold of click entropy, which can be
utilized to help determine the number of displayed ads.
Click Entropy vs. Max-Clicked-Position. For a query, the Max-Clicked-Position
is the last (deepest) position at which an ad is clicked. Figure 4 shows how the
Max-Clicked-Position varies with the click entropy on the two datasets. The
observations are as follows:

• As the click entropy increases, the Max-Clicked-Position becomes larger.
• The values of click entropy on the dataset DS ExactMatch lie in a smaller
range than on DS BroadMatch, which implies that when the query and keywords
are exactly matched, the click actions of users are more consistent.
• On the dataset DS ExactMatch, the clicked positions vary from 1 to 10, while
on the dataset DS BroadMatch, the clicked positions only vary from 1 to 4.
The intuition behind this observation is that when query and keyword are
exactly matched, users will scan all the ads because of the high relevance, but
when query and keyword are broadly matched, users will only scan the top
four ads and ignore the others. This implies that for a query broadly matched
with the ads' keywords, we should display fewer ads than for one exactly
matched with the ads' keywords.

Click Entropy vs. QueryCTR. Figure 5 shows how QueryCTR varies with
the click entropy of a query. QueryCTR is the ratio of the number of clicks on
a query to the number of impressions of that query. We can conclude that when
the click entropy of a query is greater than 3, the QueryCTR will be very near
zero. This observation is useful: since the click entropy of a query is a summation
over its clicked ads, we can utilize it to help determine the
number of displayed ads for a given query.

Fig. 5. How QueryCTR varies with the click entropy on the two datasets: (a) DS ExactMatch, (b) DS BroadMatch

4 Ad Ranking and Number Prediction

4.1 Basic Idea

We propose a two-stage approach corresponding to the two challenges of our
problem. First, we learn a function for predicting CTR based on the click-through
data, by which the ads can be ranked. Second, we propose a heuristic method to
determine the number of displayed ads based on the click entropy of the query. For
a query, the click entropy is a summation over the clicked ads, so we consider the
ads in a top-down manner: once the addition of one more ad makes the click entropy
exceed a predefined threshold, we cut off the remaining ads. In this way, we can
automatically determine the number of displayed ads.

4.2 Learning Algorithm

Ad Ranking: In this task, we aim to rank all the related ads of a given query by
relevance. Specifically, given each record {q, adq (p), cq (p)} from the click-through
data L = {q, ADq , Cq }q∈Q , we can first extract its associated feature vector xq (p)
from the query-ad pair, then obtain one training instance {xq (p), cq (p)}. Simi-
larly, we can generate the whole training data L = {xq (p), cq (p)}q∈Q,p=1,···,nq ∈
Rd × {0, 1} from the click-through data where d is the number of features.
Let (x, c) ∈ L be an instance from the training data where x ∈ Rd is the
feature vector and c ∈ {0, 1} is the associated click indicator. In order to predict
the CTR of an ad, we can learn a logistic regression model as follows whose
output is the probability of that ad being clicked:
P(c = 1 \mid x) = \frac{1}{1 + e^{-\sum_i w_i x_i}}    (2)

where xi is the i-th feature of x and wi is the weight for that feature. P (c = 1|x)
is the predicted CTR of that ad whose feature vector is x.
For training, we can use the maximum likelihood method for parameter learning;
at test time, given a query, we use the learned logistic regression model to
predict the CTR of each related ad.
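As an illustration only (the paper does not specify an implementation), a CTR predictor of the form in Eq. (2) could be sketched in Python with scikit-learn; the synthetic feature matrix, labels, and default solver settings below are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: one row of the d = 30 query-ad features of Table 1 per impression,
# y: 1 if the ad was clicked, 0 otherwise. Random data stands in for the log.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))
y = (rng.random(1000) < 0.05).astype(int)

# Maximum-likelihood-style fit of P(c = 1 | x) = 1 / (1 + exp(-sum_i w_i x_i)).
# Note: scikit-learn adds an intercept and L2 regularization by default,
# a slight departure from the plain model of Eq. (2).
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Predicted CTRs for the ads returned for a new query; rank by descending CTR.
X_new = rng.normal(size=(8, 30))
predicted_ctr = model.predict_proba(X_new)[:, 1]
ranking = np.argsort(-predicted_ctr)
print(predicted_ctr[ranking])
```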

Algorithm 1. Learning to Advertise

Input: Training set L = {q, AD_q, C_q}_{q∈Q};
       Test set T = {q', AD_{q'}}_{q'∈Q'};
       Threshold of click entropy η
Output: The number of displayed ads k and O = {q', R_{q'}}_{q'∈Q'}

Ad Ranking:
1: Learn the CTR prediction function from L:
   P(c = 1 | x) = 1 / (1 + e^{-\sum_i w_i x_i})

Ad Number Prediction:
2: for each q' ∈ Q' do
3:   Rank AD_{q'} by the predicted CTRs P(c = 1 | x)
4:   Set the number k = 0 and the click entropy CE = 0
5:   while CE ≤ η do
6:     k = k + 1
7:     if ad_{q'}(k) is predicted to be clicked then
8:       CE = CE - P(ad_{q'}(k) | q') log_2 P(ad_{q'}(k) | q')
9:     end if
10:  end while
11:  R_{q'} = AD_{q'}(1 : k)
12:  Output k and O = {q', R_{q'}}_{q'∈Q'}
13: end for

Ad Number Prediction: Given a query q, we incrementally add one ad at a time to
the set of displayed ads in a top-down manner, and the clicked ads contribute
to the click entropy. We repeat this process until the click entropy exceeds a
predefined threshold, and then stop. At that point, the size of the set is exactly the
number of displayed ads for that query.
It is also worth noting how to automatically determine the threshold
of click entropy. Figure 3 demonstrates the relationship between the click entropy of
a query and the number of removed ads. According to
this relationship, we can learn a fitting model (e.g., a regression model) from the
statistics of the data; then, for a given number of ads to be cut down, we can use the
learned model to predict the threshold of click entropy.
The method can also be applied to a new query. Based on the learned logistic
model, we first predict the CTR for each ad related to the new query [17],
and then predict the number of ads based on the click entropy for the new query.
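A minimal Python sketch of the Ad Number Prediction step of Algorithm 1 is shown below. It assumes the candidate ads' predicted CTRs are already sorted by the ranking step and approximates P(ad | q) by normalizing the predicted CTRs as described in Section 5; the cutoff used to decide when an ad is "predicted to be clicked" is an assumption, since the paper does not spell out that rule.

```python
import math

def predict_ad_number(predicted_ctrs, entropy_threshold, click_prob_cutoff=0.0):
    """Return k, the number of ads to display, for one query.

    predicted_ctrs: predicted CTRs of the candidate ads, sorted descending.
    entropy_threshold: the click-entropy threshold eta of Algorithm 1.
    click_prob_cutoff: assumed cutoff deciding whether an ad counts as
        "predicted to be clicked" (illustrative, not from the paper).
    """
    total = sum(predicted_ctrs)
    if total <= 0:
        return 0
    ce, k = 0.0, 0
    while ce <= entropy_threshold and k < len(predicted_ctrs):
        ctr = predicted_ctrs[k]
        k += 1
        if ctr > click_prob_cutoff:
            p = ctr / total                 # P(ad | q) via CTR normalization
            ce -= p * math.log2(p)          # accumulate click entropy
    return k

# Toy usage with five ranked candidate ads and an entropy threshold of 1.5.
print(predict_ad_number([0.20, 0.15, 0.05, 0.04, 0.01], entropy_threshold=1.5))
```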

4.3 Feature Definition

Table 1 lists all the 30 features extracted from the query-ad title pair, the query-ad
pair, and the query-keyword pair, which can be divided into three categories: Relevance-related,
CTR-related, and Ads-related.
Relevance-related features. The relevance-related features consist of low-
level and high-level ones. The low-level features include highlight, TF, TF*IDF

Table 1. Feature definitions between query and ads

Relevance-related (1-20):
1  Highlight of title: ratio of highlighted query terms within the title to the length of the title
2  Highlight of ad: ratio of highlighted query terms within the ad title+description to the length of the ad title+description
3  TF of title: term frequency between query and the title of the ad
4  TF of ad: term frequency between query and the title+description of the ad
5  TF of keyword: term frequency between query and the keyword of the ad
6  TF*IDF of title: TF*IDF between query and the title of the ad
7  TF*IDF of ad: TF*IDF between query and the title+description of the ad
8  TF*IDF of keyword: TF*IDF between query and the keyword of the ad
9  Overlap of title: 1 if query terms appear in the title of the ad; 0 otherwise
10 Overlap of ad: 1 if query terms appear in the title+description of the ad; 0 otherwise
11 Overlap of keyword: 1 if query terms appear in the keywords of the ad; 0 otherwise
12 cos sim of title: cosine similarity between query and the title of the ad
13 cos sim of ad: cosine similarity between query and the title+description of the ad
14 cos sim of keyword: cosine similarity between query and the keywords of the ad
15 BM25 of title: BM25 value between query and the title of the ad
16 BM25 of ad: BM25 value between query and the title+description of the ad
17 BM25 of keyword: BM25 value between query and the keywords of the ad
18 LMIR of title: LMIR value between query and the title of the ad
19 LMIR of ad: LMIR value between query and the title+description of the ad
20 LMIR of keyword: LMIR value between query and the keywords of the ad

CTR-related (21-25):
21 keyCTR: CTR of the keywords
22 titleCTR: CTR of the title of the ad
23 adCTR: CTR of the title+description of the ad
24 keyTitleCTR: CTR of keyword+title of the ad
25 keyAdCTR: CTR of keyword+title+description of the ad

Ads-related (26-30):
26 title length: the length of the title of the ad
27 ad length: the length of the title+description of the ad
28 bidding price: bidding price of the keyword
29 match type: match type between query and ad (exact match, broad match)
30 position: position of the ad in the ad list

and the overlap, which can be used to measure the relevance based on keyword
matching. The high-level features include cosine similarity, BM25 and LMIR,
which can be used to measure the relevance beyond keyword matching.
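As a rough illustration of how such relevance features can be computed, here is a small Python sketch of a TF-style count and a bag-of-words cosine similarity between a query and an ad text; the whitespace tokenization and exact definitions are assumptions rather than the authors' feature extractors.

```python
import math
from collections import Counter

def term_frequency(query, text):
    """Simple TF-style feature: how often query terms occur in the ad text
    (one possible reading of the TF features in Table 1)."""
    terms = Counter(text.lower().split())
    return sum(terms[t] for t in query.lower().split())

def cosine_similarity(query, text):
    """Cosine similarity between bag-of-words vectors of query and ad text."""
    q, d = Counter(query.lower().split()), Counter(text.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

ad_title = "cheap house for sale - best house prices"
print(term_frequency("house", ad_title), cosine_similarity("house price", ad_title))
```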
CTR-related features. AdCTR is defined as the ratio of the number of clicks
on an ad to the total number of impressions of that ad. Similarly, we define keyCTR
and titleCTR: keyCTR corresponds to multiple ads sharing the same keyword,
and titleCTR corresponds to multiple ads sharing the same title. We also introduce
the features keyTitleCTR and keyAdCTR because the assignment of a keyword to
an ad is usually determined by the sponsors and the search engine company, and
the quality of this assignment affects the ad's CTR.
Ads-related features. We also introduce some features of the ads themselves, such as
the length of the ad title, the bidding price, the match type, and the position.

5 Experimental Results

5.1 Evaluation, Baselines and Experiment Setting

Evaluation. We evaluate all the methods by the total number of clicks over all
queries in the test dataset, where #click(q) = \sum_{p=1}^{n_q} c_q(p).

Fig. 6. (a) How the total number of clicks varies with the number of removed ads for all three methods; (b) how the total number of clicks and the total number of removed ads vary with the threshold of click entropy

For evaluation, we first remove a certain number of ads for a query in the
test dataset by different ways, and then find the way which leads to the least
reduction of the number of clicks.
Baselines. In order to quantitatively evaluate our approach, we compare our
method with two other baselines. Assume that we want to cut down N ads
in total. For the first baseline, LR CTR, for each query in the test dataset we
predict the CTRs of the query-related ads, then pool the returned ads over all
queries, re-rank them by the predicted CTRs, and finally remove the N ads with
the lowest CTRs. The major problem with LR CTR is that it cannot be run in an
online manner; that is, we need to know the predicted CTRs for all queries in the
test dataset in advance, which makes it impossible to determine the removed ads
for a single given query. For the second baseline, LR RANDOM, we predict the
CTRs of the query-related ads for each query in the test dataset and then remove
the last ad with some probability for each query. We can tune this probability to
remove a certain number of ads; the disadvantage is that there is no explicit
correspondence between the two. For our proposed
approach LR CE, we first automatically determine the threshold of click entropy
for a query and then use Algorithm 1 to remove the ads. Our approach does not
suffer from the disadvantages of the above two baselines.
Experiment Setting. All the experiments are carried out on a PC running
Windows XP with an AMD Athlon 64 X2 processor (2 GHz) and 2 GB RAM.
We use the predicted CTRs from the ad ranking task to approximate the
term P(ad \mid q) in Eq. 1 as P(ad \mid q) = CTR(ad) / \sum_i CTR(ad_i), where CTR(ad) and
CTR(ad_i) are the predicted CTRs of the current ad and of the i-th related ad for
query q, respectively. For training, we use the feature "position"; for
testing, we set the feature "position" to zero for all instances.

5.2 Results and Analysis

#Removed ads vs. #Clicks. Figure 6(a) shows all the results of two baselines
and our approach. From that, the main observations are as follows:

• Performance. The method LR CTR obtains the optimal solution under the
#click measure. Our approach LR CE is near the optimal solution, and the
baseline LR RANDOM is the worst.
• User specification. From the viewpoint of the search engines, they may
want to cut down a specific number of ads to reduce the number of irrelevant
impressions while preserving the relevant ones. To address this, our
approach LR CE can first automatically determine the threshold of click
entropy via the relationship in Figure 3 and then determine the displayed
ads. This case cannot be handled by LR CTR, because it needs to know all
the click-through information in advance and then make a global analysis for
removing irrelevant ads; further, for a specific query, it cannot determine
exactly which ads should be displayed.

#Removed ads vs. CTR and #Clicks. Figure 6(b) shows how the total
number of clicks and the number of removed ads vary with the threshold of click
entropy. As the threshold of click entropy increases, the total number of clicks
increases while the number of removed ads decreases.

5.3 Feature Contribution Analysis


All the following analyses are conducted on the dataset DS ExactMatch.
Features vs. keyTitleCTR. Figure 7 shows some statistics of the ad click-through
data. As the values of the features in Figure 7(a) and (c) increase, keyTitleCTR
also increases, while as the feature in Figure 7(b) increases, keyTitleCTR first
increases and then decreases.

Fig. 7. How keyTitleCTR varies with three different features: (a) Highlight, (b) TF, (c) Cosine

Feature Ranking. Recursive feature elimination (RFE) uses a greedy strategy for
feature selection [22]. At each step, the algorithm tries to find the most useless
feature and eliminates it. In this analysis, we use the Akaike Information
Criterion (AIC) to select useful features: after excluding one feature, the lower
the increase of AIC, the more useless the removed feature is. The process is
repeated until only one feature is left. Finally, we obtain a ranking list of our
features, of which the top three are keyTitleCTR, position, and cos sim of title.

6 Related Work

CTR-based advertisement. In this category, people try to predict CTRs, by
which the query-related ads can be ranked. These methods can be divided into
two main categories: click models [12,23] and regression models [15].
two main categories: click model [12,23] and regression model [15].
Regarding click models, Agarwal et al. propose a spatio-temporal model to
estimate CTR by a dynamic Gamma-Poisson model [1]. Craswell et al. propose
four simple hypotheses for explaining the position bias, and find that the cas-
cade model is the best one [9]. Chapelle and Zhang propose a dynamic Bayesian
network to provide an unbiased estimation of relevance from the log data [5].
Guo et al. propose the click chain model based on Bayesian modeling [13].
Regarding regression models, Richardson et al. propose a positional model and
leverage logistic regression to predict the CTR for new ads [17]. Chen et al. de-
sign and implement a highly scalable and efficient algorithm based on a linear
Poisson regression model for behavioral targeting in MapReduce framework [6].
There are also many other works [14]. For example, Dembczyński et al. propose
a method based on decision rules [10].
Revenue-based advertisement. In this category, people try to take relevance
or revenue into consideration rather than CTR while displaying ads.
Radlinski et al. propose a two-stage approach to select ads which are both
relevant and profitable by rewriting queries [16]. Zhu et al. propose two novel
learning-to-rank methods to maximize search engine revenue while preserving
high quality of displayed ads [22]. Ciaramita et al. propose three online learning
algorithms to maximize the number of clicks based on preference blocks [8].
Streeter et al. formalize the sponsored search problem as an assignment of items
to positions which can be efficiently solved in the no-regret model [19]. Carterette
and Jones try to predict document relevance from the click data [4].
Threshold-based methods. In this category, people try to utilize thresholds
for determining whether to display ads or where to cut off the ranking list.
Broder et al. propose a method based on global threshold to determine whether
to show ads for a query because showing irrelevant ads will annoy the user [3].
Shanahan et al. propose a parameter free threshold relaxation algorithm to ensure
that support vector machine will have excellent precision and relatively high recall
[18]. Arampatzis et al. propose a threshold optimization approach for determining
where to cut off a ranking list based on score distribution [2].

7 Conclusion
In this paper, we study an interesting problem: how many ads should be
displayed for a given query? There are two challenges: ad ranking and ad number
prediction. First, we conduct extensive analyses on real ad click-through data,
and the two main observations are that 1) when the click entropy of a query exceeds
a threshold, the CTR of that query will be very near zero; and 2) the threshold of
click entropy can be automatically determined when the number of removed ads

is given. Second, we propose a learning approach to rank the ads and to predict
the number of displayed ads for a given query. Finally, the experimental results
on a commercial search engine validate the effectiveness of our approach.
Learning to recommend ads in sponsored search presents a new and interesting
research direction. One interesting issue is how to predict the user intention
before recommending ads [7]. Another interesting issue is how to exploit click-
through data in different domains where the click distributions may be different
for refining ad ranking [21]. It would also be interesting to study how collective
intelligence (social influence between users for sentiment opinions on an ad) can
help improve the accuracy of ad number prediction [20].

Acknowledgments. Songcan Chen and Bo Wang are supported by NSFC
(60773061) and Key NSFC (61035003). Jie Tang is supported by NSFC (61073073,
60703059, 60973102), the Chinese National Key Foundation Research (60933013,
61035004), and the National High-tech R&D Program (2009AA01Z138).

References
1. Agarwal, D., Chen, B.-C., Elango, P.: Spatio-temporal models for estimating click-
through rate. In: WWW 2009, pp. 21–30 (2009)
2. Arampatzis, A., Kamps, J., Robertson, S.: Where to stop reading a ranked list?:
threshold optimization using truncated score distributions. In: SIGIR 2009, pp.
524–531 (2009)
3. Broder, A., Ciaramita, M., Fontoura, M., Gabrilovich, E., Josifovski, V., Metzler,
D., Murdock, V., Plachouras, V.: To swing or not to swing: learning when (not) to
advertise. In: CIKM 2008, pp. 1003–1012 (2008)
4. Carterette, B., Jones, R.: Evaluating search engines by modeling the relationship
between relevance and clicks. In: NIPS 2007 (2007)
5. Chapelle, O., Zhang, Y.: A dynamic bayesian network click model for web search
ranking. In: WWW 2009, pp. 1–10 (2009)
6. Chen, Y., Pavlov, D., Canny, J.F.: Large-scale behavioral targeting. In: KDD 2009,
pp. 209–218 (2009)
7. Cheng, Z., Gao, B., Liu, T.-Y.: Actively predicting diverse search intent from user
browsing behaviors. In: WWW 2010, pp. 221–230 (2010)
8. Ciaramita, M., Murdock, V., Plachouras, V.: Online learning from click data for
sponsored search. In: WWW 2008, pp. 227–236 (2008)
9. Craswell, N., Zoeter, O., Taylor, M., Ramsey, B.: An experimental comparison of
click position-bias models. In: WSDM 2008, pp. 87–94 (2008)
10. Dembczyński, K., Kotlowski, W., Weiss, D.: Predicting ads’ click-through rate with
decision rules. In: TROA 2008, Beijing, China (2008)
11. Dou, Z., Song, R., Wen, J.: A large-scale evaluation and analysis of personalized
search strategies. In: WWW 2007, pp. 581–590 (2007)
12. Dupret, G.E., Piwowarski, B.: A user browsing model to predict search engine click
data from past observations. In: SIGIR 2008, pp. 331–338. ACM, New York (2008)
13. Guo, F., Liu, C., Kannan, A., Minka, T., Taylor, M., Wang, Y.-M., Faloutsos, C.:
Click chain model in web search. In: WWW 2009, pp. 11–20 (2009)
14. Gupta, M.: Predicting click through rate for job listings. In: WWW 2009, pp.
1053–1054 (2009)

15. König, A.C., Gamon, M., Wu, Q.: Click-through prediction for news queries. In:
SIGIR 2009, pp. 347–354 (2009)
16. Radlinski, F., Broder, A.Z., Ciccolo, P., Gabrilovich, E., Josifovski, V., Riedel, L.:
Optimizing relevance and revenue in ad search: a query substitution approach. In:
SIGIR 2008, pp. 403–410 (2008)
17. Richardson, M., Dominowska, E., Ragno, R.: Predicting clicks: estimating the click-
through rate for new ads. In: WWW 2007, pp. 521–530 (2007)
18. Shanahan, J., Roma, N.: Boosting support vector machines for text classification
through parameter-free threshold relaxation. In: CIKM 2003, pp. 247–254 (2003)
19. Streeter, M., Golovin, D., Krause, A.: Online learning of assignments. In: NIPS
2009, pp. 1794–1802 (2009)
20. Tang, J., Sun, J., Wang, C., Yang, Z.: Social influence analysis in large-scale net-
works. In: KDD 2009, pp. 807–816 (2009)
21. Wang, B., Tang, J., Fan, W., Chen, S., Yang, Z., Liu, Y.: Heterogeneous cross
domain ranking in latent space. In: CIKM 2009, pp. 987–996 (2009)
22. Zhu, Y., Wang, G., Yang, J., Wang, D., Yan, J., Hu, J., Chen, Z.: Optimizing
search engine revenue in sponsored search. In: SIGIR 2009, pp. 588–595 (2009)
23. Zhu, Z.A., Chen, W., Minka, T., Zhu, C., Chen, Z.: A novel click model and its
applications to online advertising. In: WSDM 2010, pp. 321–330 (2010)
TeamSkill: Modeling Team Chemistry in Online
Multi-player Games

Colin DeLong, Nishith Pathak, Kendrick Erickson, Eric Perrino, Kyong Shim, and Jaideep Srivastava

Department of Computer Science, University of Minnesota
200 Union St SE, Minneapolis, MN
{delong,pathak,kjshim,srivasta}@cs.umn.edu,
{kendrick,perr0273}@umn.edu
http://www.cs.umn.edu

Abstract. In this paper, we introduce a framework for modeling elements of "team chemistry" in the skill assessment process using the
performances of subsets of teams and four approaches which make use of
this framework to estimate the collective skill of a team. A new dataset
based on the Xbox 360 video game, Halo 3, is used for evaluation. The
dataset is comprised of online scrimmage and tournament games played
between professional Halo 3 teams competing in the Major League Gam-
ing (MLG) Pro Circuit during the 2008 and 2009 seasons. Using the
Elo, Glicko, and TrueSkill rating systems as “base learners” for our ap-
proaches, we predict the outcomes of games based on subsets of the over-
all dataset in order to investigate their performance given differing game
histories and playing environments. We find that Glicko and TrueSkill
benefit greatly from our approaches (TeamSkill-AllK-EV in particular),
significantly boosting prediction accuracy in close games and improving
performance overall, while Elo performs better without them. We also
find that the ways in which each rating system handles skill variance
largely determines whether or not it will benefit from our techniques.

Keywords: Player rating systems, competitive gaming, Elo, Glicko, TrueSkill.

1 Introduction

Skill assessment has long been an active area of research. Perhaps the most well-
known application is to the game of chess, where the need to gauge the skill
of one player versus another led to the development of the Elo rating system
[1]. Although mathematically simple, Elo performed well in practice, treating
skill assessment for individuals as a paired-comparison estimation problem, and
was subsequently adopted by the US Chess Federation (USCF) in 1960 and the
World Chess Federation (FIDE) in 1970. Other ranking systems have since been
developed, notably Glicko [2], [3], a generalization of Elo which sought to address


Elo’s ratings reliability issue, and TrueSkill [4], the well-known Bayesian model
used for player/team ranking on Microsoft’s Xbox Live gaming service.
With hundreds of thousands to millions of players competing on networks such
as Xbox Live, accurate estimations of skill are crucial because unbalanced games
- those giving a distinct advantage to one player or team over their opponent(s) -
ultimately lead to player frustration, reducing the likelihood they will continue to
play. For multiplayer-focused games, this is a particularly relevant issue as their
success or failure is tied to player interest sustained over a long period of time.
While previous work in this area [4] has been evaluated using data from a
general population of players, less attention has been paid to certain boundary
conditions, such as the case where the entire player population is highly-skilled
individually. As in team sports [5], [6], less tangible notions, such as “team chem-
istry”, are often cited as key differentiating factors, particularly at the highest
levels of play. However, in existing skill assessment approaches, player perfor-
mances are assumed to be independent from one another, summing individual
player ratings in order to arrive at an overall team rating.
In this work, we describe four approaches (TeamSkill-K, TeamSkill-AllK,
TeamSkill-AllK-EV, and TeamSkill-AllK-LS) which make use of the observed
performances of subsets of players on teams as a means of capturing “team chem-
istry” in the ratings process. These techniques use ensembles of ratings of these
subsets to improve prediction accuracy, leveraging Elo, Glicko, and TrueSkill as
“base learners” by extending them to handle entire groups of players rather than
strictly individuals. To the best of our knowledge, no similar approaches exist in
the domain of skill assessment.
For evaluation, we introduce a rich dataset compiled over the course of 2009
based on the Xbox 360 game Halo 3, developed by Bungie, LLC in Kirkland,
WA. Halo 3 is a first-person shooter (FPS) played competitively in Major League
Gaming (MLG), the largest professional video game league in the world, and is
the flagship game for the MLG Pro Circuit, a series of tournaments taking place
throughout the year in various US cities. Our evaluation shows that, in general,
predictive performance can be improved through the incorporation of subgroup
ratings into a team’s overall rating, especially in high-level gaming contexts,
such as tournaments, where teamwork is likely more prevalent. Additionally,
the modeling of variance in each rating system is found to play a large role
in determining what gain (or loss) in performance one can expect from
using subgroup rating information. Elo, which uses a fixed variance, is found
to perform worse when used in concert with any TeamSkill approach. However,
when the Glicko and TrueSkill rating systems are used as base learners (both
of which model variance as player-level variables), several TeamSkill variants
achieve the highest observed prediction accuracy, particularly TeamSkill-AllK-
EV. Upon further investigation, we find this performance increase is especially
apparent for “close” games, consistent with the competitive gaming environment
in which the matches occur.
The paper is structured as follows. Section 2 reviews some of the relevant re-
lated work in the fields of player and team ratings/ranking systems and

competitive gaming. In Section 3, we introduce our proposed approaches, TeamSkill-K, TeamSkill-AllK, TeamSkill-AllK-EV, and TeamSkill-AllK-LS. For
Section 4, we describe the Halo 3 dataset, how it was compiled, its charac-
teristics, and where it can be found should other researchers be interested in
studying it. In Section 5, we evaluate the TeamSkill approaches and compare
them to “vanilla” versions of Elo, Glicko, and TrueSkill, in game outcome pre-
diction accuracy. Finally, in Section 6 we provide a number of conclusions and
discuss our future work.

2 Related Work
In games, the question of how to rank (or provide ratings of) players is old, trac-
ing its roots to the work of Louis Leon Thurstone in the mid-1920’s and Bradley-
Terry-Luce models in the 1950’s. In 1927 [7], Thurstone proposed the “law of
comparitive judgement”, a means of measuring the mean distance between two
physical stimuli, Sa and Sb . Thurstone, working with stimuli such as the sense-
distance between levels of loudness, asserted that the distribution underlying
each stimulus process is normal and that as such, the mean difference between
the stimuli Sa and Sb can therefore be quantified in terms of their standard devi-
ation. This work laid the foundation for the formulation of Bradley-Terry-Luce
(BTL) models in 1952 [8], a logistic variant of Thurstone’s model which pro-
vided a rigorous mathematical examination of the paired comparison estimation
problem, using taste preference measurements as its experimental example.
The BTL model framework provided the basis for the Elo rating system,
introduced by Arpad Elo in 1959 [1]. Elo, himself a master chess player, developed
the Elo rating system to replace the US Chess Federation’s Harkness rating
system with one more grounded in statistical theory. Like Thurstone, the Elo
rating system assumes each player’s skill is normally distributed, where player
i’s expected performance is pi ∼ N (μi , β 2 ). Notably, though, Elo also assumes
players’ skill distributions share a constant variance β 2 , greatly simplifying the
mathematical calculation at the expense of capturing the relative certainty of
each player’s skill.
In 1993 [3], Mark Glickman sought to improve upon the Elo rating system by
addressing the ratings reliability issue in the Glicko rating system. By introduc-
ing a dynamic variance for each player, the confidence in a player’s skill rating
could be adjusted to produce more conservative skill estimates. However, the
inclusion of this information at the player level also incurred significant com-
putational cost in terms of updates, and so an approximate Bayesian updating
scheme was devised which estimates the marginal posterior distribution P r(θ|s),
where θ and s correspond to the player strengths and the set of game outcomes
observed thus far, respectively.
With the advent of large-scale console-based multiplayer gaming on the Mi-
crosoft Xbox in 2002 via Xbox Live, there was a growing need for a more gener-
alized ratings system not solely designed for individual players, but teams - and
any number of them - as well. TrueSkill [4], published in 2006 by Ralf Herbrich
and Thore Graepel of Microsoft Research, used a factor graph-based approach

to accomplish this. Like Glicko, TrueSkill also maintains a notion of variance for
each player, but unlike it, TrueSkill samples an expected performance pi given
a player’s expected skill, which is then summed for all players on i’s team to
represent the collective skill of that team. This expected performance pi is also
assumed to be distributed normally, but similar to Elo, a constant variance is
assumed across all players. Of note, TrueSkill’s summation of expected player
performances in quantifying a team’s expected performance assumes player per-
formances are independent of one another. In the case of team games, especially
those occurring at high levels of competition where team chemistry and cooper-
ative strategies play much larger roles, this assumption may prove problematic
in ascertaining which team has the true advantage a priori. We explore this topic
in more depth later on.
Other variants of the aforementioned approaches have also been proposed.
Coulom’s Whole History Rating (WHR) method [9] is, like other rating systems
such as Elo, based on the dynamic BTL model. Instead of incrementally updating
the skill distributions of each player after a match, it approximates the maximum
a posteri over all previous games and opponents, resulting in a more accurate skill
estimation. This comes at the cost of some computational ease and efficiency,
which the authors argue is still minimal if deployed on large-scale game servers.
Others [10] have extended the BTL model to use group comparisons instead
of paired comparisons, but also assume player performance independence by
defining a team’s skill as the sum of its players’.
Birlutiu and Heskes [11] develop and evaluate variants of expectation propaga-
tion techniques for analysis of paired comparison data by rating tennis players,
stating that the methods are generalizable to more complex models such as
TrueSkill. Menke, et al. [12] develop a BTL-based model based on the logistic
distribution, asserting that weaker teams are more likely to win than what a
normally-distributed framework would predict. They also conclude that models
based on normal distributions, such as TrueSkill, lead to an exponential increase
in team ratings when one team has more players than another.
The field of game theory includes a number of related concepts, such as the
Shapley value [13], which considers the problem of how to fairly allocate gains
among a coalition of players in a game. In the traditional formulation of skill
assessment approaches, however, gains or losses are implicitly assumed to be
equal for all players given the limitation to win/loss/team formation history
during model construction and evaluation. That is, no additional information is
available to measure the contribution of each player to a team’s win or loss.

3 Proposed Approaches
As discussed, the characteristic common to existing skill assessment approaches
is that the estimated skill of a team is quantified by summing the individual skill
ratings of each player on the team. Though understandable from the perspective
of minimizing computational costs and/or model complexity, the assumption is
not well-aligned with either intuition or research in sports psychology [5], [6].
Only in cases where the configuration of players remains constant throughout a
TeamSkill: Modeling Team Chemistry in Online Multi-player Games 523

team’s game history can the summation of individual skill ratings be expected
to closely approximate a team’s true skill. Where that assumption cannot be
made, as is the case in the dataset under study in this paper, it is difficult to
know how much of a player’s skill rating can be attributed to the individual and
how much is an artifact of the players he/she has teamed with in the past.
Closely related to this issue is the notion of team chemistry. “Team chemistry”
or “synergy” is a well-known concept [5], [6] believed to be a critical component of
highly-successful teams. It can be thought of as the overall dynamics of a team
resulting from a number of difficult-to-quantify qualities, such as leadership,
confidence, the strength of player/player relationships, and mutual trust. These
qualities are also crucial to successful Halo teams, which is sometimes described
by its players as “real-time chess”, where teamwork is believed to be the key
factor separating good teams from great ones.
The integration of any aspect of team chemistry into the modeling process
doesn’t suggest an obvious solution, though. However, a key insight is that one
need not maintain skill ratings only for individual players - they can be main-
tained for groups of players as well. The skill ratings of these groups can then be
combined to estimate the overall skill of a team. Here, we describe four methods
which make use of this approach - TeamSkill-K, TeamSkill-AllK, TeamSkill-
AllK-EV, and TeamSkill-AllK-LS.

3.1 TeamSkill-K

At a high level, this approach is simple: for a team of K players, choose a sub-
group size k ≤ K, calculate the average skill rating for all k-sized player groups
for that team using some “base learner” (such as Elo, Glicko, or TrueSkill), and
finally scale this average skill rating up by K/k to arrive at the team’s skill
rating. For k = 1, this approach is equivalent to simply summing the individual
player skill ratings together. As such, TeamSkill-K can be thought of as a gen-
eralized approach for combining skill ratings for any K-sized team given player
subgroup histories of size k.
Formally, let s*_i be the estimated skill of team i and f_i(k) be a function
returning the set of skill ratings for player subgroups of size k in team i. Let
each member of the set of skill ratings returned by f_i(k) be denoted s_{ikl},
corresponding to the l-th configuration of size k for team i. Here, s_{ikl} is assumed
to be a random variable drawn from some underlying distribution. Then, given
some k, the collective strength of a team of size K can be estimated as follows:

s_i^* = \frac{K}{k}\, E[f_i(k)] = \frac{(k-1)!\,(K-k)!}{(K-1)!} \sum_{l=1}^{K!/(k!(K-k)!)} s_{ikl}    (3.1)

Though simple to implement and useful as a generalized approach for estimating
a team's skill given ratings for player subgroups of size k, this choice of k
introduces a potentially problematic trade-off between two desirable skill estimation
properties - game history availability and player subgroup specificity. As
k becomes larger, less history is available, and as k becomes smaller, subgroups
capture lower-level interaction information.
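For concreteness, the following is a minimal Python sketch of Eq. (3.1); the group_rating callable standing in for the Elo/Glicko/TrueSkill base learner and the toy rating table are assumptions made for illustration.

```python
from itertools import combinations

def teamskill_k(players, k, group_rating):
    """TeamSkill-K (Eq. 3.1): average the base learner's rating over all
    k-sized subsets of the team, then scale the mean by K/k.

    players: the K players on the team.
    group_rating: assumed callable mapping a frozenset of players to the
        base learner's skill estimate for that subgroup.
    """
    K = len(players)
    subsets = [frozenset(c) for c in combinations(players, k)]
    mean_rating = sum(group_rating(s) for s in subsets) / len(subsets)
    return (K / k) * mean_rating

# Toy usage with a made-up rating table for a 4-player team and k = 2.
ratings = {frozenset(s): r for s, r in [
    (("a", "b"), 52.0), (("a", "c"), 48.0), (("a", "d"), 50.0),
    (("b", "c"), 55.0), (("b", "d"), 47.0), (("c", "d"), 49.0)]}
print(teamskill_k(["a", "b", "c", "d"], 2, lambda s: ratings[s]))
```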

3.2 TeamSkill-AllK
To address this issue, a second approach was developed. Here, all available player
subgroup information, 1 ≤ k ≤ K, is used to estimate the skill rating of a
team. The general idea is to model a team’s skill rating as a recursive summa-
tion over all player subgroup histories, building in the (k − 1)-level interactions
present in a player subgroup of size k in order to arrive at the final rating
estimate.
This approach can be expressed as follows. Let s*_{ikl} be the estimated skill rating
of the l-th configuration of size k for team i and g_i(k) be a function returning
the set of estimated skill ratings s*_{ikl}, where 1 ≤ l ≤ K!/(k!(K-k)!), for player sets of
size k in team i. When k = 0, g_i(k) = {∅} and s*_{ikl} = 0. As before, let s_{ikl} be
the skill rating of the l-th configuration of size k for team i. Additionally, let
α_k be a user-specified parameter in the range [0, 1] signifying the weight of the
k-th level of estimated skill ratings. Then,
s_{ikl}^* = \alpha_k\, s_{ikl} + (1 - \alpha_k)\, \frac{k}{k-1}\, E[g_i(k-1)]
          = \alpha_k\, s_{ikl} + (1 - \alpha_k)\, \frac{k}{k-1}\, \frac{\sum_{s_{i(k-1)l}^* \in g_i(k-1)} s_{i(k-1)l}^*}{|g_i(k-1)|}    (3.2)

To compute s∗i , let s∗i = s∗ikl where k = K and l = 1 (since there is only one
player subset rating when k = K). This recursive approach ensures that all
player subset history is used.
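A recursive Python sketch of Eq. (3.2) is given below; the group_rating interface, the alpha weights, the default_rating fallback, and the treatment of the k = 1 base case as the individual rating are all assumptions made for illustration.

```python
from itertools import combinations

def teamskill_allk(players, k, group_rating, alpha, default_rating):
    """TeamSkill-AllK (Eq. 3.2, sketch): blend the rating of a k-sized subgroup
    with the k/(k-1)-scaled mean of its recursively estimated (k-1)-level subgroups.

    group_rating: assumed callable returning the base learner's rating for a
        frozenset of players, or None when no history exists for that subgroup.
    alpha: dict mapping level k to the blending weight alpha_k in [0, 1].
    default_rating: callable giving a default rating for an unseen group of size k.
    Call with k = len(players) to obtain the team estimate s*_i.
    """
    r = group_rating(frozenset(players))
    s_kl = r if r is not None else default_rating(k)
    if k == 1:
        # Base case interpreted here as the individual player's rating.
        return s_kl
    subs = [teamskill_allk(c, k - 1, group_rating, alpha, default_rating)
            for c in combinations(players, k - 1)]
    mean_sub = sum(subs) / len(subs)          # E[g_i(k-1)]
    return alpha[k] * s_kl + (1 - alpha[k]) * (k / (k - 1)) * mean_sub
```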

3.3 TeamSkill-AllK-EV
In TeamSkill-AllK, if no history is available for a particular subgroup, default
values (scaled to size k) are used instead in order to continue the recursion.
Problematically, cases where limited player subset history is available will pro-
duce team skill ratings largely dominated by default rating values, potentially
resulting in inaccurate skill estimates. As such, another approach was developed,
called TeamSkill-AllK-EV. The core idea behind TeamSkill-AllK - the usage of
all available player subgroup histories - was retained, but the new implementa-
tion eschewed default values for all player subsets save those of individual players
(consistent with existing skill assessment approaches), instead focusing on the
evidence drawn solely from game history. Re-using notation, TeamSkill-AllK-EV
is as follows:

s_i^* = \frac{1}{\sum_{k=1}^{K} \mathbb{1}[h_i(k) \neq \emptyset]} \sum_{k=1}^{K} \frac{K}{k}\, E[h_i(k)]
      = \frac{K}{\sum_{k=1}^{K} \mathbb{1}[h_i(k) \neq \emptyset]} \sum_{k=1}^{K} \frac{E[h_i(k)]}{k}    (3.3)

Here, hi (k) = fi (k) where there exists at least one player subset history of size
k, else ∅ is returned.
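The sketch below illustrates Eq. (3.3) in Python under the same assumed group_rating interface, with None standing in for subgroups that have no game history; it is illustrative rather than the authors' implementation.

```python
from itertools import combinations

def teamskill_allk_ev(players, group_rating):
    """TeamSkill-AllK-EV (Eq. 3.3, sketch): average the K/k-scaled mean rating
    over every subgroup size k for which some game history exists.

    group_rating: assumed callable returning the base learner's rating for a
        frozenset of players, or None if that subgroup has no game history
        (individual ratings are assumed to always be available).
    """
    K = len(players)
    level_estimates = []
    for k in range(1, K + 1):
        observed = [group_rating(frozenset(c)) for c in combinations(players, k)]
        observed = [r for r in observed if r is not None]   # h_i(k)
        if observed:                                         # h_i(k) is non-empty
            mean_k = sum(observed) / len(observed)           # E[h_i(k)]
            level_estimates.append((K / k) * mean_k)
    return sum(level_estimates) / len(level_estimates) if level_estimates else None
```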

3.4 TeamSkill-AllK-LS
In this context, it is natural to hypothesize that the most accurate team skill
ratings could be computed using the largest possible player subsets covering all
members of a team. That is, given some player subset X and its associated rating,
ratings for subsets of X should be disregarded since they represent lower-level
interaction information X would have already captured in its rating. Formally,
such an approach can be represented as follows:

s_i^* = \frac{1}{\sum_{m=K}^{1} \mathbb{1}\big[\{h_i(m) \subseteq h_i(m < j \leq K)\} \neq \emptyset\big]} \sum_{k=K}^{1} E\big[h_i(k) \subseteq h_i(k < j \leq K)\big]    (3.4)

One obvious advantage to this approach is its speed, since this method prunes
away from consideration ratings of subsets of previously-used supersets.

4 Dataset
The data under study in this paper was collected throughout 2009 as part of a
larger project to produce a high-quality, competitive gaming dataset. Halo 3, re-
leased in September 2007 on the Xbox 360 game console, is played professionally as
the flagship game in Major League Gaming (as were its predecessors Halo: Combat
Evolved and Halo 2). Major League Gaming (MLG) is the largest video gaming
league in the world and has grown rapidly since its inception in 2002, with Internet
viewership for 2009 events topping 750,000. After its release, Halo 3 replaced Halo
2 beginning with the 2008 season (known as the Pro Circuit).
The dataset contains Halo 3 multiplayer games between two teams of four
players each. Each game was played in one of two environments - over the Internet
on Microsoft’s Xbox Live service in custom games (known as scrimmages) or on
a local area network at an MLG tournament. Information on each game includes
the players and teams involved, the date of the game, the map and game type, the
result (win/loss) and score, and per-player statistics such as kills, deaths, assists
(where one player helps another player from the same team kill an opponent),
and score.
The dataset has several interesting characteristics, such as the high frequency
of team changes from one tournament to the next. With four players per team, it
is not uncommon for a team with a poor showing in one tournament to replace

one or two players before the next. As such, the resulting dataset lends itself
to analyses of skill at the group level since the diversity of player assignments
can aid in isolating interesting characteristics of teams who do well versus those
who do not. Additionally, since the players making up the top professional and
semi-professional teams are all highly-skilled individually, “basic” game famil-
iarity (such as control mechanics) are not considered as important a factor in
winning/losing as overall team strategy, execution, and adaptation to the opposi-
tion. This focus also helps mitigate issues pertaining to the personal motivations
of players since all must be dedicated to winning in order to have earned a spot
in the top 32 teams in the league, winnowing out those who might intentionally
lose games for their teams (as is commonplace in standard Halo 3 multiplayer
gaming). Taken together, these elements make for a very high quality research
dataset for those interested in studying competitive gaming, skill ratings sys-
tems, and teamwork.
The dataset has been made available on the HaloFit web site in two formats.
The first, http://stats.halofit.org, contains several views into the dataset similar
to statistics pages of professional sports leagues such as Major League Baseball.
Users can drill down into the dataset using a series of filters to find data rele-
vant to favorite teams or players. The second, http://halofit.org, contains partial
and full comma-separated exports of the dataset. The dataset currently houses
information on over 9,100 games, 566 players, and 186 teams.

5 Experimental Analysis
The four proposed TeamSkill approaches were evaluated by predicting the out-
comes of games occurring prior to 10 Pro Circuit tournaments and comparing
their accuracy to unaltered versions (k = 1) of their base learner rating systems
- Elo, Glicko, and TrueSkill. For TeamSkill-K, all possible choices of k for teams
of 4, 1 ≤ k ≤ 4, were used. Given two teams, t1 and t2, the prior probability of t1
winning is a straightforward derivation from the negative CDF at 0 of the distri-
bution describing the difference between two independent, normally-distributed
random variables:

P(t_1 > t_2) = 1 - F(0;\, \mu_1 - \mu_2,\, \sigma_1^2 + \sigma_2^2)
             = 1 - \tfrac{1}{2}\Big(1 + \operatorname{erf}\big(\tfrac{0 - (\mu_1 - \mu_2)}{\sqrt{2(\sigma_1^2 + \sigma_2^2)}}\big)\Big)
             = \tfrac{1}{2}\Big(1 - \operatorname{erf}\big(\tfrac{\mu_2 - \mu_1}{\sqrt{2(\sigma_1^2 + \sigma_2^2)}}\big)\Big)    (5.1)
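A small Python sketch of Eq. (5.1) follows, assuming each team's collective skill is summarized by a normal mean and standard deviation; the parameter values in the usage line are made up.

```python
import math

def win_probability(mu1, sigma1, mu2, sigma2):
    """Prior probability that team 1 beats team 2 (Eq. 5.1), with each team's
    skill modeled as an independent normal N(mu, sigma^2)."""
    return 0.5 * (1.0 - math.erf((mu2 - mu1)
                                 / math.sqrt(2.0 * (sigma1**2 + sigma2**2))))

# Toy usage: a slightly stronger, more certain team 1 gives a probability > 0.5.
print(win_probability(mu1=27.0, sigma1=2.0, mu2=25.0, sigma2=3.0))
```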
For each tournament, we evaluated each rating approach using:
– 3 types of training data sets - games consisting only of previous tournament
data, games from online scrimmages only, and games of both types.
– 3 periods of game history - all data except for the data between the test
tournament and the one preceding it (“long”), all data between the test
tournament and the one preceding it (“recent”), and all data before the test
tournament (“complete”).
– 2 types of games - the full dataset and those games considered “close” (i.e.,
prior probability of one team winning close to 50%).
In the case where only tournament data is used as training set data, the most
recent tournament preceding the test tournament replaced the inter-tournament
scrimmage data for the “long” and “recent” game history configurations. Simi-
larly, “recent” game history when considering both tournament and scrimmage
data included the most recent tournament. "Close" games were defined using a
slightly modified version of the “challenge” method [4] in which the top 20%
closest games were selected for one rating system and presented to the other
(and vice versa). In this evaluation, the closest games from the “vanilla” ver-
sions of each rating system (i.e., k = 1) were presented to each of the TeamSkill
approaches while the closest games from TeamSkill-AllK-EV were presented to
the "vanilla" versions. These two were chosen because all the TeamSkill
approaches are intended to improve upon their respective "vanilla" versions and
because repeated testing had shown TeamSkill-AllK-EV to be the best-performing
approach on full datasets in many cases. The default values used during the
evaluation of Elo (α = 0.07, β = 193.4364, μ_0 = 1500, σ_0^2 = β^2), Glicko
(q = log(10)/400, μ_0 = 1500, σ_0^2 = 100^2), and TrueSkill (ε = 0.5, μ_0 = 25,
σ_0^2 = (μ_0/3)^2, β = σ_0^2/2) correspond to the defaults outlined in [4] and [3].
Additionally, for Glicko, a rating period of one game was assigned due to the
continuity of game history over the course of 2008 and 2009, as well as to ap-
proximate an “apples to apples” comparison with respect to Elo and TrueSkill.
In the interest of space, a subset of the 3,780 total evaluations is presented,
corresponding to the "complete" cases. The "long" results essentially mirrored
the “complete” results while the “recent” results were virtually identical across
all TeamSkill variations for all non-close games and produced no clear patterns
for close games (with differences only emerging after one or two tournaments, as
can be seen in the “complete” results).

5.1 Findings and Analysis


The results in figures 1, 2, and 3 show that in general, Glicko and TrueSkill ben-
efit from the incorporation of team chemistry components and tend to improve
the prediction accuracy overall in comparison to the “vanilla” versions (k = 1).
The TeamSkill-AllK and TeamSkill-AllK-EV approaches - TeamSkill-AllK-EV
in particular - outperform k = 1 in nearly all cases. TeamSkill-AllK-LS, on the
other hand, shows no similar performance gain, nor do any of the TeamSkill
versions in the range 1 < k ≤ 4. These results suggest that group-level ratings
alone are insufficient for accurately assessing the strength of a team - player-level
ratings must be incorporated as well.
No similarly positive effect is observed for Elo, although TeamSkill-AllK-EV’s
accuracy does approach that of k = 1. In fact, the accuracy for all non-k = 1
approaches is, at best, equal to k = 1. Interestingly, Elo still performs well
for k = 1, in some cases outperforming Glicko and TrueSkill. Considering Elo

Fig. 1. Prediction accuracy for both tournament and scrimmage/custom games using complete history (panels: TeamSkill with Elo, Glicko, and TrueSkill base learners; curves: k = 1 to 4, AllK, AllK-EV, AllK-LS)

Fig. 2. Prediction accuracy for tournament games using complete history

was developed in the mid-1950s, that it still competes with state-of-the-art approaches is an impressive result unto itself.
As to the source of Glicko and TrueSkill’s improved overall performance, fur-
ther inspection (figures 4, 5, and 6) reveals significant performance increases
(with respect to k = 1) in close games. At times, the margin of difference is
as much as 8%. It can also be seen that over time, this margin tends to widen.
Taken together, these results indicate that the group-level ratings have the effect
of better distinguishing which team has the true advantage in close match-ups,
a key finding well-aligned with prior research [5], [6].

5.2 Discussion

As mentioned, Elo doesn’t benefit from the inclusion of group-level ratings in-
formation. The reason stems from Elo’s use of a constant variance and as such,
Elo is not sensitive to the dynamics of a player’s skill over time. For groups of
players, this issue is compounded since the higher the k under consideration, the
less prior game history can be drawn on to infer their collective skill. With the
TeamSkill approaches, the net effect is that incorporating (k > 1)-level group
ratings 'dilutes' the overall team rating, resulting in a higher number of closer
games since there is no provision for Elo’s constant variance to differ depending
on the size of the group under consideration.
Similarly, variance also accounts for much of the differences between Glicko
and TrueSkill’s performances. Both make use of player-level variances (and, thus,
group-level variances using the TeamSkill approaches). However, TrueSkill also

Fig. 3. Prediction accuracy for scrimmage/custom games using complete history

Fig. 4. Prediction accuracy for both tournament and scrimmage/custom games using complete history, close games only

maintains a constant "performance" variance, β^2, across all players, which is
applied just prior to computing the predicted ordering of teams during updates.
β^2 is a user-provided parameter which, when increased, similarly increases the
probability of TrueSkill believing teams will draw, discounting the potentially
small differences between them in collective skill. As such, this "performance"
variance has a similar 'dilution' effect as in Elo, but it is less pronounced because
TrueSkill also maintains player/group-level variances.
These results highlight the critical role played by skill variance in estimating
the skill of a group of players. The ways in which it is maintained can result in
certain biases which arise when models’ prior beliefs are different relative to the
observations. Methods for tackling such an issue could consist of maintaining a
prior distribution over the skill variance itself or using a mixture model for the
skill variance. Such extensions to Glicko or TrueSkill could result in techniques
that can better assimilate new observations with prior beliefs in order to generate
superior predictions.
Additionally, given the ensemble methodology employed by the TeamSkill approaches, a logical next step is to consider boosted (or online) versions of the TeamSkill framework to see if any further gains can be made. The additional computational cost of boosting in this context could render it infeasible in a real-world deployment, but it would be of academic interest with respect to studying how accurate skill assessment can be using only win/loss/team formation information. Given these real-world constraints, a fully online ensemble framework for TeamSkill would be ideal, and as such, our future work is concerned with developing this idea more fully.

Fig. 5. Prediction accuracy for tournament games using complete history, close games only [three panels: TeamSkill (Elo base), TeamSkill (Glicko base), TeamSkill (TrueSkill base); curves: k=1, k=2, k=3, k=4, AllK, AllK−EV, AllK−LS]

Fig. 6. Prediction accuracy for scrimmage/custom games using complete history, close games only [three panels: TeamSkill (Elo base), TeamSkill (Glicko base), TeamSkill (TrueSkill base); curves: k=1, k=2, k=3, k=4, AllK, AllK−EV, AllK−LS]

6 Conclusions
Our experiments demonstrate that in many cases, the proposed TeamSkill approaches can outperform the “vanilla” versions of their respective base learners, particularly in close games. We find that the way in which skill variance is addressed by each base learner has a large effect on the prediction accuracy of the TeamSkill approaches, with the results suggesting that those employing a dynamic variance (i.e., Glicko, TrueSkill) can benefit from group-level ratings. In our future work, we intend to investigate ways of better representing skill uncertainty, possibly by modeling the uncertainty itself as a distribution, and to construct an online version of TeamSkill in order to improve skill estimates.

Acknowledgments
We would like to thank the Data Analysis and Management Research group, as well
as the reviewers, for their feedback and suggestions. We would also like to thank
Major League Gaming for making their 2008-2009 tournament data available.

References
1. Elo, A.: The Rating of Chess Players, Past and Present. Arco Publishing, New
York (1978)
2. Glickman, M.: Paired Comparison Model with Time-Varying Parameters. PhD
thesis. Harvard University, Cambridge (1993)
3. Glickman, M.: Parameter estimation in large dynamic paired comparison experiments. Applied Statistics 48, 377–394 (1999)
4. Herbrich, R., Graepel, T.: TrueSkill: A Bayesian skill rating system. Microsoft Research, Tech. Rep. MSR-TR-2006-80 (2006)
5. Yukelson, D.: Principles of effective team building interventions in sport: A di-
rect services approach at penn state university. Journal of Applied Sport Psychol-
ogy 9(1), 73–96 (1997)
6. Martens, R.: Coaches guide to sport psychology. Human Kinetics, Champaign
(1987)
7. Thurstone, L.: Psychophysical analysis. American Journal of Psychology 38, 368–
389 (1927)
8. Bradley, R.A., Terry, M.: Rank analysis of incomplete block designs: I. the method
of paired comparisons. Biometrika 39(3/4), 324–345 (1952)
9. Coulom, R.: Whole-history rating: A bayesian rating system for players of time-
varying strength. Computer and Games, Beijing (2008)
10. Huang, T., Lin, C., Weng, R.: Ranking individuals by group comparisons. Journal
of Machine Learning Research 9, 2187–2216 (2008)
11. Birlutiu, A., Heskes, T.: Expectation propagation for rating players in sports
competitions. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S.,
Mladenič, D., Skowron, A. (eds.) PKDD 2007. LNCS (LNAI), vol. 4702, pp. 374–
381. Springer, Heidelberg (2007)
12. Menke, J.E., Reese, C.S., Martinez, T.R.: Hierarchical models for estimating indi-
vidual ratings from group competitions. American Statistical Association (2007)
(in preparation)
13. Shapley, L.S.: A value for n-person games. Contributions to the Theory of Games
(Annals of Mathematics Studies) 2(28), 307–317 (1953)
Learning the Funding Momentum
of Research Projects

Dan He and D.S. Parker

Computer Science Dept., Univ. of California, Los Angeles, CA, 90095-1596, USA
{danhe,stott}@cs.ucla.edu

Abstract. In developing grant proposals for funding agencies like NIH


or NSF, it is often important to determine whether a research topic is
gaining momentum — where by ‘momentum’ we mean the rate of change
of a certain measure such as popularity, impact or significance — to eval-
uate whether the topic is more likely to receive grants. Analysis of data
about past grant awards reveals interesting patterns about successful
grant topics, suggesting it is sometimes possible to measure the degree
to which a given research topic has ‘increasing momentum’. In this pa-
per, we develop a framework for quantitative modeling of the funding
momentum of a project, based on the momentum of the individual top-
ics in the project. This momentum follows certain patterns that rise and
fall in a predictable fashion. To our knowledge, this is the first attempt
to quantify the momentum of research topics or projects.

Keywords: Grants, Momentum, Bursts, Technical stock analysis, Classification.

1 Introduction
Research grants are critical to the development of science and the economy. Every
year billions of dollars are invested in diverse scientific research topics, yet there is
far from sufficient funding to support all researchers and their projects. Funding
of research projects is highly selective. For example, in the past roughly only
20% of submitted projects have been funded by NIH. As a result, to maximize
their chances of being funded, researchers often feel pressured to submit grant
projects on topics that have ‘increasing momentum’ — where ‘momentum’ is
defined as the rate of change of a certain measure such as popularity, impact or
significance. It would be helpful if one could model this pressure quantitatively.

1.1 Basic Models of Topic Popularity


In [6], indicators such as popularity, impact, and significance can serve as impor-
tant measures of a research topic. Popularity can be measured by the number
of publications relevant to the topic, reflecting the volume of attention devoted

This research supported by NIH grants RL1LM009833 (Hypothesis Web) and
UL1DE019580, in the UCLA Consortium for Neuropsychiatric Phenomics.


to it. Impact is usually defined in terms of the number of citations to publica-


tions involving the topic, which intuitively measures influence. Significance can
be defined as the number of citations per article or the number of highly-cited
articles involving a topic, giving a measure of overall visibility.
In this work, we focus on using a momentum indicator to study successful
funding awards over the past few decades from NIH on specific biomedical re-
search topics. We want to identify how the momentum of research topics changes
over time, using it as an indicator of success in gaining awards. Can we model
the momentum of a research topic, or of funding success? We believe the answer
is yes: changes in topic momentum in awards have followed measurable trends.
Topic modeling is not new, and there has been considerable work in modeling
the popularity of a topic in a way that reflects increasing momentum. Popu-
larity of a topic is often measured using frequency of occurrence, and trends in
frequency are used as measures of trends in popularity. However, this leads to
relatively naive analysis, such as detecting simple increases or decreases in a trend. Another weakness of this analysis is that it focuses on the direction
of changes, and not on more specific features of change. For example, it cannot
answer questions like: How ‘rapidly’ is the popularity of a topic increasing?
We can answer questions like this by adapting more powerful trend models,
such as models of ‘bursts’ (intervals of elevated occurrence [10]). The popularity
in burst periods usually increases ‘rapidly’ and therefore the occurrence of a
burst indicates a significant change. More detailed discussion is given later.

1.2 Modeling Funding Momentum

We are interested in applying trend models to study trends in research funding.


For example the NIH collects extensive information on its funding awards in
RePORTER [2], a database tracking biomedical topics for successful (funded)
proposals since 1986; each award includes a set of topic keywords from a prede-
fined set (‘terms’). We can use this historical data to model trends in research
interests over the past 25 years. Thus, in successful awards, we can study not
only trends for single topics, but also trends for projects (sets of topics).
In this paper, we first propose a definition of momentum for research topics.
According to our definition, the momentum of a topic is a measure of its burst
strength over time, computing derivatives of moving averages as a model of its
momentum. To calculate momentum, we adapt technical analysis methods used
for analyzing stocks. Our experimental results show that our method is able to
model momentum for a research topic, by training classifiers on historical data.
To our knowledge, this is the first attempt to develop quantitative models for
occurrence of research topics in successful grant awards. Other indicators such
as impact and significance can be easily integrated into this model to evaluate
different aspects of momentum.
With a model for the momentum of individual research topics, we study fund-
ing momentum of research projects. As just mentioned, each project involves
a set of research topics. We can therefore model the funding momentum of a
research project in terms of the funding momentum of each research topic it

contains. This gives what we believe is the first quantitative definition of fund-
ing momentum of a research project.
The remainder of the paper is structured as follows. In Section 2 we summarize
previous work on analyzing grant databases, and explore the relevance of stock
market momentum models to development of ‘research topic portfolios’ — such
as identifying burst periods and momentum trends. In Section 3, we define a
model of funding momentum for research topics and projects. We propose a
framework to predict funding momentum in Section 4. We report experimental
results and comparisons in Section 5, and conclude the paper in Section 6.

2 Related Work
There has been a great deal of work in the analysis of grant portfolios, such as
in the MySciPI project [1] and the NIH RePORTER system [2]. However, these
works have focused on development of databases to store information about
awards. For example, MySciPI [1] provides a database of projects that can be
queried by keywords or by research topics. When given a research topic, extensive
information about all projects related to the topic can be extracted. However,
beyond the extraction of project information, only basic statistics (such as the
number of hits of the topic in the database, etc.) are shown. The ability to mine
information in these databases has been lacking.
We try to analyze this information based on the indicators such as popular-
ity, impact and significance. We can easily calculate these indicators over a time
window for certain research topics. We are interested in the problem of identi-
fication and prediction of periods in which the indicators have a strong upward
momentum movement, or ‘burst’. Lots of work have been done to identify bursts
of a topic over certain time series. Kleinberg [14] and Leskovec et al. [11] use
an automaton to model bursts, in which each state represents the arrival rate
of occurrences of a topic. Shasha and co-workers [19] defined bursts based on
hierarchies of fixed-length time windows. He and Parker [10] model bursts with
a ‘topic dynamics’ model, where bursts are intervals of positive change in mo-
mentum. Then they apply trend analysis indicators such as EMA and MACD
histogram, which are well-developed momentum computation methods used in
evaluating stocks, to identify burst periods. They show that their topic dynamics
model is successful in identifying bursts for individual topics, while Kleinberg’s
model is more appropriate to identify bursts from a set of topics. He and Parker
also point out that the topic dynamics model may permit adaptation of the
multitude of technical analysis methods used in analyzing market trends.
Of course, an enormous amount of work has gone into prediction of stock
prices and of market trends (such as upward or downward movement, turning
points, etc.), and a multitude of models has been proposed. For example, Al-
Qaheri et al. [5] recently used rough sets to derive rules for predicting stock prices. Classical methods apply neural networks (Lawrence [15], Gryc [8], Nikooa et al. [17], Saad [18]) to forecasting stock prices. Hassan and Nath
[9] applied Hidden Markov Models to forecast prices. Genetic Algorithms have

Table 1. Key symbols used in the paper

Symbol Description
FM(T, m) Funding Momentum for the topic or project T in a period of m months
BS(T, m) Burst Strength for the topic or project T in a period of m months
FI(T, m) Frequency Increase indicator for topic or project T in a period of m months
F(T )i Frequency of the topic or project T at month i
CF(T ) Current frequency, or Start Frequency of the topic or project T
histogram(T )i MACD Histogram value of the topic or project T at month i
cor(A, B) correlation between two topics A, B
co(A, B) co-occurrences between two topics A, B
f req(A) frequency of the topic A

also been applied [13], [16]. Recently, Agrawal et al. [4] developed an Adaptive Neuro-Fuzzy Inference System (ANFIS) to analyze stock momentum. Bao and Yang [7] built an intelligent stock trading system using confirmation of turning points and probabilistic reasoning. This paper shows how this wealth of mining experience might be adapted to analyzing grant funding histories.

3 Funding Momentum
3.1 Technical Analysis Indicators of Momentum
Key symbols in the paper are summarized in Table 1. To define the momentum of
research topics and projects, we adapt the stock market trend analysis indicators.
Here we first include very well-established background about technical analysis
indicators of momentum.
– EMA — the exponential moving average of the momentum, or the ‘first derivative’ of the momentum over a time period:

  $$\mathrm{EMA}(n)[x]_t = \alpha x_t + (1-\alpha)\,\mathrm{EMA}(n-1)[x]_{t-1} = \alpha \sum_{k=0}^{n} (1-\alpha)^k x_{t-k}$$

  where $x$ has a corresponding discrete time series $\{x_t \mid t = 0, 1, \ldots\}$ and $\alpha$ is a smoothing factor. We also write $\mathrm{EMA}(n)[x]_t = \mathrm{EMA}(n)$ for an $n$-time unit period EMA.
– MACD — the difference between EMAs over two different time periods:

  $$\mathrm{MACD}(n_1, n_2) = \mathrm{EMA}(n_1) - \mathrm{EMA}(n_2)$$

– MACD Histogram — the difference between MACD and its moving average, or the ‘second derivative’ of the momentum:

  $$\mathrm{signal}(n_1, n_2, n_3) = \mathrm{EMA}(n_3)[\mathrm{MACD}(n_1, n_2)]$$
  $$\mathrm{histogram}(n_1, n_2, n_3) = \mathrm{MACD}(n_1, n_2) - \mathrm{signal}(n_1, n_2, n_3)$$

  where $\mathrm{EMA}(n_3)[\mathrm{MACD}(n_1, n_2)]$ denotes the $n_3$-time unit moving average of the sequence $\mathrm{MACD}(n_1, n_2)$;
– RSI — the relative strength of the upwards movement versus the downwards movement of the momentum over a time period:

  $$\mathrm{RS}(n) = \frac{\mathrm{EMA}_U(n)}{\mathrm{EMA}_D(n)}, \qquad \mathrm{RSI}(n) = 100 - \frac{100}{1 + \mathrm{RS}(n)}$$

  $$U_t = \begin{cases} x_t - x_{t-1} & x_t > x_{t-1} \\ 0 & \text{otherwise} \end{cases} \qquad D_t = \begin{cases} x_{t-1} - x_t & x_t < x_{t-1} \\ 0 & \text{otherwise} \end{cases}$$

  where $\mathrm{EMA}_U(n)$ and $\mathrm{EMA}_D(n)$ are the EMA for the time series $U$ and $D$, respectively.
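To make these definitions concrete, the following sketch (our illustration, not code from the paper) computes the EMA, the MACD histogram and the RSI for a monthly frequency series; the (12, 26, 9) histogram parameters match those used later in the paper, while the smoothing factor α = 2/(n+1) is a common convention we assume here.

```python
import numpy as np

def ema(x, n, alpha=None):
    """Exponential moving average; alpha defaults to 2/(n+1) (assumed convention)."""
    alpha = 2.0 / (n + 1) if alpha is None else alpha
    out = np.empty(len(x))
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1 - alpha) * out[t - 1]   # recursive EMA definition
    return out

def macd_histogram(x, n1=12, n2=26, n3=9):
    """MACD(n1, n2) = EMA(n1) - EMA(n2); histogram = MACD - signal(n1, n2, n3)."""
    macd = ema(x, n1) - ema(x, n2)
    signal = ema(macd, n3)
    return macd - signal

def rsi(x, n=14):
    """Relative Strength Index of the series x over an n-unit period."""
    diff = np.diff(x, prepend=x[0])
    u = np.where(diff > 0, diff, 0.0)     # upward movements U_t
    d = np.where(diff < 0, -diff, 0.0)    # downward movements D_t
    rs = ema(u, n) / (ema(d, n) + 1e-12)  # small constant avoids division by zero
    return 100 - 100 / (1 + rs)

# Example: monthly frequencies of a term -> momentum indicators
freq = np.array([3, 4, 6, 9, 15, 14, 16, 20, 18, 17, 21, 25], dtype=float)
print(macd_histogram(freq)[-1], rsi(freq)[-1])
```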

3.2 Funding Momentum for Research Topics


The degree to which a research topic has ‘increasing momentum’ (funding-wise)
can depend on many characteristics. As mentioned earlier, traditional measures
of topic popularity rest on the topic’s frequency of occurrence in the literature.
However, simple measures do not permit us to model things such as intervals in
which popularity of the topic increases ‘sharply’.
In this paper we characterize a topic’s funding success in terms of momentum.
Specifically, we follow previous work of He and Parker [10], which measures
momentum — using the well-known MACD histogram value from technical stock
analysis — in order to define a ‘burst’. The histogram value measures change
in momentum, so they define the period over which it is positive as the burst
period, and the value of the histogram indicates how strong the burst is. The
stronger the burst, the greater its momentum.
For funding, one is often interested in selecting the best time to start in-
vestment in a topic. If momentum plunges after we invest, we have entered the
market at a bad time, and should perhaps have waited. We are also interested
in the ‘staying power’ of a topic, so that the frequency of a topic remains high
even after bursts. Rapid drops in frequency after a burst can make a topic a
poor investment (toward a funded portfolio of topics). Therefore, the funding
momentum of a topic can depend on the presence of bursts, their strength, and
the frequency afterwards. With this in mind we define the funding momentum
FM of a topic for a period of m months as follows:


$$\mathrm{FM}(topic, m) = 1 - e^{-\frac{\mathrm{BS}(topic,m) \times \mathrm{FI}(topic,m)}{\alpha}} \qquad (1)$$

$$\mathrm{BS}(topic, m) = \sum_{i=1}^{m} H_i(topic) \qquad (2)$$

$$\mathrm{FI}(topic, m) = \begin{cases} 1 & \mathrm{CF}(topic) < \min_{1 \le i \le m} \mathrm{F}(topic)_i \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

$$H_i(topic) = \begin{cases} \mathrm{histogram}(topic)_i & \mathrm{histogram}(topic)_i > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

The burst strength of a topic over an m-month period, BS(topic, m), is the sum of the histogram values H for the topic in burst periods, i.e., periods where the values are all positive. The frequency increase indicator FI(topic, m) indicates whether the frequency of the topic increases or drops within the m-month period. We define the value of the funding momentum of a topic over an m-month period, FM(topic, m), with an exponential model to normalize the value within the range [0, 1], with α as a decay parameter. In this model, a higher burst strength or a higher momentum increase ratio yields higher funding momentum, with α controlling the rate of increase with respect to these factors.
The definition encodes the following intuition about funding momentum: (1) if
there is no burst in the m-month period, then no matter how high the frequency
or popularity, the m-month period has no funding momentum; (2) if there is a
burst, but prior to the burst there is a drop in momentum, the m-month period
has no funding momentum. (Hence, it may be advantageous to invest in the
topic after the drop ends.) (3) if there is a burst, but after the burst there is a

significant drop in popularity, the m-month period has no funding momentum.


Instead, we may reduce the investment period to a smaller interval m. We also set
a threshold h (0 ≤ h ≤ 1) modeling whether a topic has increasing momentum
or not given its funding momentum. If the momentum is greater than h, we say
the topic has increasing momentum. As explained below, in our experiments,
setting α as 10 and h as 0.2 and following the three selection criteria above has
yielded results that are consistent with the RePORTER data.
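As an illustration of Formulas 1–4, the sketch below (written under our reading of the indexing, reusing the `macd_histogram` helper from the earlier sketch, and with the settings α = 10 and h = 0.2 above) computes the funding momentum of a topic over an m-month period starting at the current month.

```python
import numpy as np

def funding_momentum(freq, t, m=6, alpha=10.0):
    """Sketch of Formulas 1-4: freq is the full monthly frequency series of a topic,
    t indexes the current month, and the m-month period covers months t+1 .. t+m."""
    hist = macd_histogram(freq)                  # MACD histogram from the earlier sketch
    window_hist = hist[t + 1:t + 1 + m]          # histogram values in the window
    window_freq = freq[t + 1:t + 1 + m]          # frequencies F(topic)_i in the window
    cf = freq[t]                                 # current/start frequency CF(topic)
    bs = float(np.sum(np.where(window_hist > 0, window_hist, 0.0)))  # burst strength (2), (4)
    fi = 1.0 if cf < window_freq.min() else 0.0                      # frequency increase (3)
    return 1.0 - np.exp(-bs * fi / alpha)                            # funding momentum (1)

# A topic is said to have increasing momentum when FM exceeds the threshold h = 0.2:
# funding_momentum(freq, t=len(freq) - 7, m=6) > 0.2
```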
This model formalizes three intuitive ideas: (1) research topics have bursts
of popularity, and bursts can be defined in terms of momentum (as in [10]);
(2) funded research proposals often explore novel potential correlations between
popular research topics; (3) real data (like RePORTER) can be used to learn
quantitative models of successful funding. The concept of funding momentum of
a set of topics developed in this paper draws directly on these ideas.

3.3 Funding Momentum of Research Projects: Percentage Model


Although we have explored other models, in this paper we consider only what
we call the percentage model for funding momentum: a research project
has increasing momentum if it contains sufficiently many topics that have in-
creasing momentum for that period. Specifically, we say a research project has
increasing momentum whenever the percentage of these topics exceeds a thresh-
old parameter t. In our experiments, this definition with t=0.2 has been adequate
to accurately identify increasing momentum intervals.
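A minimal sketch of the percentage model follows (reusing the `funding_momentum` helper above; `topic_freqs`, the per-topic monthly frequency series of a project, is an assumed input).

```python
def project_has_momentum(topic_freqs, t, m=6, h=0.2, threshold=0.2):
    """Percentage model: a project has increasing momentum when the fraction of its
    topics whose funding momentum exceeds h is greater than the threshold (t = 0.2)."""
    flags = [funding_momentum(f, t, m) > h for f in topic_freqs]
    return sum(flags) / len(flags) > threshold
```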

4 Methods
As mentioned earlier, we have adapted methods of technical analysis to compute
momentum. In the stock market, despite claims to the contrary, a common as-
sumption is that past performance is an indicator of likely future achievement.
Of course, this assumption is often violated, since the market is news-driven and
fluctuates rapidly. As we show in our experiments, for our definitions of mo-
mentum and funding momentum, the assumption often works well. Therefore,
training classifiers on past funding momentum makes sense, and in some cases
may even be adequate to forecast future funding momentum. Gryc [8] shows that
indicators such as EMA, MACD histogram, RSI can help to improve prediction
accuracy.
Although we are not able to validate whether the definitions can accurately identify intervals that do not have increasing momentum, the prediction methods we propose work well for the selection criteria encoded in Formulas 1–4. In other words, when we select the increasing momentum topics and projects based on these criteria, the prediction methods have good accuracy. Again, when negative information about funding success becomes available to reveal more criteria, or to modify the existing criteria, it can be assimilated into our model.
In our experiments, we have used four kinds of classifier to model whether
a topic has increasing momentum: Linear Regression, Decision Tree, SVM and
Artificial Neural Networks (ANN). The linear regression classifier estimates the

output as a linear combination of the attributes, fit with a least squares approach.


However, trends of the indicators are usually non-linear. Therefore, non-linear
classifiers are generally able to achieve better performance. ANN is a popular
technique for predicting stock market prices, with excellent capacity for learning
non-linear relationships without prior knowledge. SVM is another well-known
non-linear classifier that has been widely applied in stock analysis. Compared
with ANN and SVM, the Decision tree is much less popular, but as we will show
later for determining whether a topic has increasing momentum or not, in our
experiments the decision tree classifier performed as well as the SVM and ANN.
(We used classifiers implemented in Weka [3], a well-known machine learning and
data mining tool, with default parameter settings. The corresponding classifier
implementations in Weka for C4.5 decision trees, ANN and SVM are the J48,
the MultiLayerPerceptron, and LibSVM classifiers, respectively.)
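For illustration only, the following sketch mirrors this setup with scikit-learn stand-ins for the Weka classifiers (J48 → DecisionTreeClassifier, LibSVM → SVC, MultiLayerPerceptron → MLPClassifier); the attribute construction follows our reading of Section 5.3, and `topic_freqs` (per-topic monthly frequency series from RePORTER) is assumed to be prepared beforehand.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def make_dataset(topic_freqs, t, m=6, h=0.2):
    """Attributes: frequency, MACD histogram and RSI at the current month (our reading);
    class: whether the following m-month period has increasing momentum."""
    X, y = [], []
    for freq in topic_freqs:
        hist, r = macd_histogram(freq), rsi(freq)   # helpers from the earlier sketches
        X.append([freq[t], hist[t], r[t]])
        y.append(int(funding_momentum(freq, t, m) > h))
    return np.array(X), np.array(y)

# topic_freqs: assumed list of monthly frequency arrays, one per research topic
X, y = make_dataset(topic_freqs, t=200)
for name, clf in [("decision tree", DecisionTreeClassifier()),
                  ("SVM", SVC()),
                  ("ANN", MLPClassifier(max_iter=1000))]:
    print(name, "error rate:", 1 - cross_val_score(clf, X, y, cv=10).mean())
```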

5 Experimental Results
5.1 Analyzing Bursts for Research Topics and Projects
The RePORTER database [2] provides information for NIH grant awards since
1986. Each award record includes a set of project terms, which can be considered
as research topics in this project (we use ‘topic’ and ‘term’ interchangeably).
Since the total number of years is only 24, we consider the funding momentum
for the terms at each month, and term frequencies are calculated by month.
As RePORTER imposes limits on volume of downloaded data, we considered
awards only for the state of California — a dataset containing 12,378 unique
terms and 119,079 awards.
We use technical analysis indicators such as the MACD histogram value and
RSI to compute momentum and identify burst periods for the terms, adapting
the definition of bursts as intervals of increasing momentum studied in He and
Parker [10]. Burst periods for the terms ‘antineoplastics’, ‘complementary-DNA’,
and ‘oncogenes’ are shown in Figure 1 — as well as their frequencies and funding
momentum. The funding momentum covers the 6-month period beyond any time point. We set the threshold value to 0.2 for selection of increasing momentum years (we call these years increasing momentum years). Clearly, increasing
momentum years are highly correlated with bursts. Strong bursts usually define
intervals that have increasing momentum. Weak bursts, such as the one for ‘gene
expression’ around 1993, are omitted by the threshold filter. However, it is not
necessarily the case that a strong burst period has increasing momentum. For
example, ‘oncogenes’ has a strong burst from year 1997 to 1999, but the increas-
ing momentum years associated with the burst extended only from 1997 to the
middle of 1998, because the burst levels off. According to criterion (3) in section
3.1, we say after the middle of year 1998, ‘oncogenes’ does not have increas-
ing momentum. We can also observe how criteria (1) and (2) affect increasing
momentum. For example, for ‘antineoplastics’, the increasing momentum years
start after 1997, which is the end of its frequency plunge; criterion (2) avoids
labeling years in a plunge followed by a burst as having increasing momentum.

[Figure 1 panels: ‘antineoplastics’, ‘gene expression’, ‘oncogenes’; x-axis: month (1986–2009); y-axis: value; curves: histogram(12,26,9), term frequency, investment potentials, funding momentum.]

Fig. 1. Burst periods for terms ‘antineoplastics’, ‘complementary-DNA’, ‘oncogenes’ with histogram parameters (12, 26, 9), with funding momentum for all terms. The dashed line shows the threshold 0.2 for selection of increasing momentum years. All values are scaled to fit within one figure.

5.2 Experimental Validation of the Funding Momentum Definition

To validate our definition of the funding momentum, we have checked years


labeled as increasing momentum against historical events. For example, during
the early 1990s, advancements in genetics boosted deeper research on DNA. Two
well-known projects, the Human Genome Project and the cloning of Dolly the
sheep, started in 1990 and 1996, respectively. We checked the increasing mo-
mentum years for all ‘genetic’ terms, and most have increasing momentum years
consistent with the year of the development of these two projects. For illustra-
tion purposes, we randomly drew five terms and summarized their increasing
momentum years in Table 2.
There has been a steady increase of research on HIV and AIDS since 1986,
when the FDA approved the first anti-retroviral drug to treat AIDS [12]. There-
fore we checked the increasing momentum years for terms related to ‘AIDS’,
and obtained the results for the five randomly drawn terms shown in Table 2.
We observed similar consistency between our increasing momentum years and
the periods of expansion in HIV and AIDS research, supporting our funding
momentum definition.
Following the percentage model of funding momentum, we do only the simple
evaluation in which a project has increasing momentum if the percentage of the
topics it contains that have increasing momentum is greater than a threshold
t. This situation for research topics appears to differ significantly from that

Table 2. The increasing momentum years for all terms related to ‘genetic’ and ‘AIDS’

‘genetic’ Terms Increasing Momentum ‘AIDS’ Terms Increasing Momentum


pharmacogenetics 1996 - 1998 AIDS/HIV diagnosis 1988 - 1990
genetic mapping 1986 - 1988, 1996 - 1998 AIDS 1986 - 1990
genetic transcription 1986 - 1991, 1996 - 1999 AIDS/HIV test 1986 - 1990
genetic regulation 1986 - 1993, 1996 - 1998 AIDS therapy 1988 - 1989
virus genetics 1986 - 1991, 1996 - 1998 antiAIDS agent 1988 - 1990

Table 3. Accuracy of the funding momentum definition of research projects

t = 0.2 t = 0.5 t = 0.8 average term #
83.89% 43.13% 11.12% 16

Table 4. The prediction error rate of classifier J48, SVM, Linear Regression, ANN
and Naive for 1,000 randomly selected terms for different M (month) period from the
current month
Classifier M=6 M=7 M=8 M=9 M=10 M=11 M=12
J48 0.084 0.085 0.086 0.086 0.088 0.088 0.088
SVM 0.084 0.085 0.086 0.086 0.089 0.091 0.092
Linear Regression 0.112 0.119 0.125 0.132 0.138 0.143 0.146
ANN 0.108 0.105 0.111 0.108 0.115 0.112 0.121
Naive 0.217 0.235 0.240 0.254 0.256 0.272 0.269

for news ‘memes’ presented in [11], where topics are more correlated and their
clustering is stronger.
We can define the accuracy of our funding momentum definition for research projects as the percentage of consistency between the increasing momentum years of the projects and their grant years: the more consistent they are with one another, the higher the accuracy. Varying the threshold over 0.2, 0.5 and 0.8 and randomly selecting 1,000 projects, we obtained the results shown in Table 3, which also shows the average number of terms contained by each project, around 16. Therefore, a threshold t = 0.2 requires at least 3 topics of a project to have increasing momentum, while a threshold t = 0.8 requires at least 13 such topics. Accuracy drops as the threshold increases. For t = 0.2, the accuracy was sufficiently high that we adopted t = 0.2 in all remaining experiments.

5.3 Prediction of Funding Momentum for Research Topics


In order to predict the funding momentum of research topics, we create a dataset whose attributes are the momentum of the topic and the technical analysis indicators, the MACD histogram value and the RSI value. The class is binary: whether the m-month period from the current month has increasing momentum or not. We apply the J48, SVM, ANN and Linear Regression classifiers to 1,000 randomly selected research topics. We also vary the length of the period m from half a year (6 months) to one year. We conduct a ten-fold cross-validation, and the error rate of each classifier is shown in Table 4. As we can see, J48 and SVM achieve low error rates in general, and their error rates remain almost constant for different time periods. ANN is superior to Linear Regression, but both classifiers incur drops in performance as m increases.
Figure 2 shows observed sensitivity (true positive rate) and specificity (1 - false
positive rate) of all classifiers. ANN achieved the highest sensitivity. J48, SVM
and ANN all had much better sensitivity than LR. All classifiers had specificity
almost 1. SVM had the highest specificity, while ANN had the lowest. These
results suggest that all classifiers tend to make wrong predictions on positive instances.

Fig. 2. Sensitivity vs. Specificity for classifiers J48, SVM, Linear Regression and ANN [two panels: Sensitivity and Specificity vs. Month (6–12); curves: J48, ANN, SVM, LR]

This performance may be caused by the unbalanced nature of our dataset, in which most instances are negative. This is reasonable since, generally speaking, burst periods are much shorter than non-burst periods, and only periods with strong bursts are reported as having increasing momentum.
Earlier, in the related work section, we mentioned a few techniques for stock
price prediction. However, these are not easily adapted to our problem of fund-
ing momentum prediction. We therefore used a Naive method to predict funding
momentum as the baseline. The Naive method simply assigns the funding mo-
mentum of the next month as the funding momentum of the current month,
with the assumption that the trend movement usually lasts for a while. The
performance of the Naive method, reported in Table 4, was much worse than
any of the classifiers, indicating that much better performance can be achieved
with intelligent classifier design.
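A minimal sketch of the Naive baseline as we read it (the input is an assumed sequence of monthly increasing-momentum flags for a topic):

```python
def naive_error(flags):
    """Predict next month's flag as the current month's flag; return the error rate."""
    predictions, truth = flags[:-1], flags[1:]
    return sum(p != t for p, t in zip(predictions, truth)) / len(truth)
```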
Next we compared the performance of the classifiers both with and without the TA attributes (technical analysis indicators: the MACD histogram value and RSI). Since the overall error rates of the J48 and SVM classifiers were superior to the others, we reviewed only the performance of these two classifiers. We again randomly selected 1,000 research topics and conducted a ten-fold cross-validation. Comparing the prediction accuracy only for m = 6, the experimental results reported in Table 5 show that when the technical analysis indicators were included, the performance of both classifiers improved significantly. The main improvement when TAs are included is in sensitivity; when TAs were not included, the two classifiers tended to make incorrect predictions for almost all positive instances.

Table 5. Prediction error rates with J48 and SVM for 1,000 randomly selected terms,
with and without TA (technical analysis indicators) for 6-month periods extending
beyond the current month

Classifier Error Without TA Error With TA Sensitivity Without TA Sensitivity With TA


J48 13.1% 8.4% 0.065 0.451
SVM 13.3% 8.4% 0.10 0.373

Table 6. Prediction accuracy with classifier J48, SVM, ANN, Linear Regression and
Naive for 1,000 randomly selected projects with threshold 0.2 for the 6 month period
from the current month

J48 SVM ANN Linear Regression Naive


87.9% 88.0% 83.2% 82.4% 58.9%

5.4 Prediction of Funding Momentum for Projects


To predict the funding momentum of a project, we first predicted the funding momentum of the terms in the project. Then we determined whether a project had increasing momentum according to the percentage of its terms with increasing momentum. (If the percentage is greater than the threshold t, the project has increasing momentum.) Again we compared the performance of the J48, SVM, ANN, Linear Regression and Naive classifiers. We conducted the experiments on 1,000 randomly selected projects with threshold 0.2 for the 6-month period from the current month. Results are reported in Table 6: the performance of the classifiers was positively related to their performance on research topics. Among these classifiers, the performance of J48 and SVM remained the best. However, since the variance of the classification error on terms accumulated, the performance of each classifier worsened, especially for the Naive method.

6 Conclusion and Future Work

In this paper, by analyzing historical NIH grant award data (in RePORTER
[2]), we were able to model occurrence patterns of biomedical topics in successful
grant awards with a corresponding measure that we call funding momentum. We
also developed a classification method to predict funding momentum for these
topics in projects. We were able to show that this method achieved good predic-
tion accuracy. It seems possible that indicators such as impact and significance
could be addressed with variations on funding momentum. To our knowledge,
this is the first quantitative model of funding momentum for research projects.
We also showed in this work that the classification problem is highly unbalanced; therefore, the sensitivity of all the classifiers is not satisfactory, and unbalanced classification techniques might be used to improve performance.
We proposed a percentage model for the funding momentum of research projects. There can be other models. For example, in the percentage model, the topics a project contains may have semantic correlations, in that some topics always show up together; a more complicated model may be needed to define the momentum of such a research project. Another possible model is, instead of considering the percentage of increasing momentum topics in the project, to add the frequencies of the topics together as the ‘frequency’ of the project. We can then apply the same trend models to identify intervals of increasing momentum for the project. The intuition behind this additive model comes from stock market

analogies, considering topics as independent stocks and the project as a ‘sector’,


with a sector index defined as the sum of the stocks it covers. Similarly, we add the frequencies of the topics to approximate the frequency of the project.

References
1. MySciPI (2010), http://www.usgrd.com/myscipi/index.html
2. RePORTER (2010), http://projectreporter.nih.gov/reporter.cfm
3. Weka (2010), http://www.cs.waikato.ac.nz/ml/weka
4. Agrawal, S., Jindal, M., Pillai, G.N.: Momentum analysis based stock market pre-
diction using adaptive neuro-fuzzy inference system (anfis). In: Proc. of the In-
ternational MultiConference of Engineers and Computer Scientists, IMECS 2010
(2010)
5. Al-Qaheri, H., Hassanien, A.E., Abraham, A.: Discovering Stock Price Prediction
Rules Using Rough Sets. Neural Network World Journal (2008)
6. Andelin, J., Naismith, N.C.: Research Funding as an Investment: Can We Measure
the Returns? U.S. Government Printing Office, Washington, DC (1986)
7. Bao, D., Yang, Z.: Intelligent stock trading system by turning point confirming
and probabilistic reasoning. Expert Systems with Applications 34, 620–627 (2008)
8. Gryc, W.: Neural Network Predictions of Stock Price Fluctuations. Technical re-
port, http://i2r.org/nnstocks.pdf (accessed July 02, 2010)
9. Hassan, M.R., Nath, B.: Stock market forecasting using hidden Markov model: a
new approach.
10. He, D., Parker, D.S.: Topic Dynamics: an alternative model of ‘Bursts’ in Streams
of Topics. In: The 16th ACM SIGKDD Conference on Knowledge Discovery and
Data Mining, SIGKDD 2010, July 25-28 (2010)
11. Kleinberg, J., Leskovec, J., Backstrom, L.: Meme-tracking and the dynamics of the
news cycle. In: Proceedings of the Fifteenth ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining, Paris, France (July 2009)
12. Johnston, M.I., Hoth, D.F.: Present status and future prospects for HIV therapies.
Science 260(5112), 1286–1293 (1993)
13. Kaboudan, M.A.: Genetic programming prediction of stock prices. Computational
Economics 16(3), 207–236 (2000)
14. Kleinberg, J.M.: Bursty and hierarchical structure in streams. Data Min. Knowl.
Discov. 7(4), 373–397 (2003)
15. Lawrence, R.: Using neural networks to forecast stock market prices. University of
Manitoba (1997)
16. Li, J., Tsang, E.P.K.: Improving technical analysis predictions: an application of
genetic programming. In: Proceedings of The 12th International Florida AI Re-
search Society Conference, Orlando, Florida, pp. 108–112 (1999)
17. Nikooa, H., Azarpeikanb, M., Yousefib, M.R., Ebrahimpourb, R., Shahrabadia, A.:
Using A Trainable Neural Network Ensemble for Trend Prediction of Tehran Stock
Exchange. IJCSNS 7(12), 287 (2007)
18. Saad, E.W., Prokhorov, D.V., Wunsch, D.C.: Comparative study of stock trend
prediction using time delay, recurrent and probabilistic neural networks. IEEE
Transactions on Neural Networks 9(6), 1456–1470 (1998)
19. Zhu, Y., Shasha, D.: Efficient elastic burst detection in data streams. In: Proceed-
ings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, Washington, DC, USA, August 24-27, pp. 336–345 (2003)
Local Feature Based Tensor Kernel for Image
Manifold Learning

Yi Guo1 and Junbin Gao2
1 Freelance Researcher
yg au@yahoo.com.au
2 School of Computing and Mathematics,
Charles Sturt University, Bathurst, NSW 2795, Australia
jbgao@csu.edu.au

Abstract. In this paper, we propose a tensor kernel on images that are described as sets of local features, and then apply a novel dimensionality reduction algorithm called Twin Kernel Embedding (TKE) [1] to it for image manifold learning. The local features of the images, extracted by feature extraction methods like SIFT [2], are represented as tuples in the form of coordinates and feature descriptors, which are regarded as highly structured data. In [3], different kernels were used for intra- and inter-image similarity. This is problematic because different kernels refer to different feature spaces and hence represent different measures. This heterogeneity, embedded in the kernel Gram matrix that was input to a dimensionality reduction algorithm, is transferred to the image embedding space and therefore leads to an unclear understanding of the embedding. We address this problem by introducing a tensor kernel which treats different sources of information in a uniform kernel framework. The kernel Gram matrix generated by the tensor kernel is homogeneous, that is, all elements come from the same measure. The image manifold learned from this kernel is more meaningful. Experiments on image visualization are used to show the effectiveness of this method.

1 Introduction
Conventionally, raster grayscale images can be represented by vectors by stacking pixel brightness values row by row. This is convenient for computer processing and storage of image data. However, it is not natural for recognition and perception. Human brains are more likely to handle images as collections of features lying on a highly nonlinear manifold [4]. In recent research, learning image manifolds has attracted great interest in the computer vision and machine learning communities. There are two major strategies towards manifold learning for within-class variability and appearance manifolds from different views [5]: (1) Local Feature based methods; and (2) Key Points based methods.
There exist quite a few feature extraction algorithms such as colour histogram [6], auto-associator [7], shape context [8], etc. Among them, local appearance

The author to whom all correspondence should be addressed.


based methods such as SIFT have drawn a lot of attention due to their success in generic object recognition [3]. Nevertheless, it was also pointed out in [3] that it is quite difficult to study the image manifold from the local feature point of view; moreover, the descriptor itself poses an obstacle to learning a smooth manifold because it is not in a vector space. The authors of [3] proposed a learning framework called Feature Embedding (FE), which takes local features of images as input and constructs an interim layer of embedding in a metric space where a dis/similarity measure can be easily defined. This embedding is then utilized in subsequent processing such as classification, visualization, etc. As another main stream, key points based methods have been successful in shape modeling, matching and recognition, as demonstrated by the Active Shape Models (ASM) [9] and the Shape Contexts [10].
Generally speaking, key points focus on the spatial information/arrangement of the objects of interest in images, while local features detail object characterization. An ideal strategy for image manifold learning should incorporate both kinds of information to assist the learning procedure. Actually, the combination of spatial and feature information has been applied to object recognition in recent work, e.g., visual scene recognition in [11]. This kind of approach has a close link with the task of learning from multiple sources; see [12,13]. Kernel methods are among the approaches that can be easily adapted to multiple sources via the so-called tensor kernels [14] and additive kernels [15,16]. The theoretical assumption is that the new kernel function is defined over a tensor space determined by the multiple source spaces. In our scenario, there are two sources, i.e., the spatial source, denoted by y, and the local feature source, denoted by f. Thus each hyper "feature" is the tensor y ⊗ f. For the purpose of learning image manifolds, we aim at constructing appropriate kernels for the hyper features in this paper.
The paper is organized as follows. Section 2 proposes tensor kernels suit-
able for learning image manifold. Section 3 gives a simple introduction to the
Twin Kernel Embedding which is used for manifold embedding. In Section 4, we
present several examples of using proposed tensor kernels for visualization.

2 Tensor Kernels Built on Local Features


In the sequel, we assume that a set of $K$ images $\{P_k\}_{k=1}^{K}$ is given, each of which, $P_k$, is represented by a data set, namely a collection of tensors of spatial and local features $\{y_i^k \otimes f_i^k\}_{i=1}^{N_k}$, where $f_i^k$ is the local feature extracted at the location $y_i^k$ and $N_k$ is the number of features extracted from $P_k$. $y_i^k \otimes f_i^k$ is the $i$-th hyper feature in image $k$, and $\{y_i^k \otimes f_i^k\}_{i=1}^{N_k}$ can be regarded as a tensor field.
A tensor kernel was implicitly defined in [3] based on two kernels, $k_y(y_i, y_j)$ and $k_f(f_i, f_j)$, for the spatial space and the feature space respectively; the kernel between two hyper features is defined as

$$k_p(y_i^k \otimes f_i^k, y_j^l \otimes f_j^l) = \begin{cases} k_y(y_i^k, y_j^l), & k = l \\ k_f(f_i^k, f_j^l), & k \neq l \end{cases} \qquad (1)$$

where $i = 1, \ldots, N_k$, $j = 1, \ldots, N_l$ and $k, l = 1, \ldots, K$. In other words, when two hyper features $y_i^k \otimes f_i^k$ and $y_j^l \otimes f_j^l$ are from two different images, $k_p(\cdot,\cdot)$ evaluates only the features; otherwise $k_p(\cdot,\cdot)$ focuses only on the coordinates. If we denote by $K_y$ the kernel Gram matrix of $k_y(\cdot,\cdot)$, and $K_f$, $K_p$ likewise, then $K_p$ has $K_y$ blocks on the main diagonal and $K_f$ entries elsewhere.
A kernel is associated with a particular feature mapping function from the object space to a feature space [15], and the kernel function is the inner product of the images of the objects in that feature space. There is seldom any proof that two kernels share the same feature mapping, and hence different kernels work on different feature spaces. If we see kernels as similarity measures [17], we conclude that every kernel represents a unique measure for its input. This leads to the observation that the above $k_p$ is problematic, since two kernels are integrated together without any treatment. $k_y$ and $k_f$ operate on different domains, coordinates and features respectively. Thus the similarity of any two hyper features is determined by the similarity of either the spatial or the feature information, while the joint contribution of the two sources is totally ignored in the above construction of the tensor kernel. A subsequent dimensionality reduction algorithm based on the above separated tensor kernel may therefore erroneously evaluate the embeddings in an uninformed measure. Every feature in the images will be projected onto this lower dimensional space as a point. However, under this framework, the relationships among projected features from one image are not comparable with those among projected features from different images. As a consequence, the manifold learnt by the dimensionality reduction is distorted and therefore neither reliable nor interpretable. To tackle this problem, we need a universal kernel for the images, bearing in mind that we have two sources of information, i.e. spatial information as well as feature description. The multiple source integration property of tensor kernels [14] brings a homogeneous measurement of the similarity between hyper features. What follows is then how to construct a suitable tensor kernel. Basically, we have two options to choose from, the productive and the additive tensor kernel, which are stated below:

$$k_t(y_i^k \otimes f_i^k, y_j^l \otimes f_j^l) = \begin{cases} \rho_y k_y(y_i^k, y_j^l) + \rho_f k_f(f_i^k, f_j^l) & \text{additive tensor} \\ k_y(y_i^k, y_j^l) \times k_f(f_i^k, f_j^l) & \text{productive tensor} \end{cases} \qquad (2)$$

where $i = 1, \ldots, N_k$, $j = 1, \ldots, N_l$, and $\rho_y + \rho_f = 1$, $\rho_y \geq 0$, $\rho_f \geq 0$.

As we can see, the tensor kernel unifies the spatial and feature information in harmony. It is symmetric, positive semi-definite and still normalized.
We are particularly interested in the additive tensor kernel. The reason is that the productive tensor kernel tends to produce very small values, forcing the Gram matrix to be close to the identity matrix in practice. This brings numerical difficulties for dimensionality reduction. The additive tensor kernel does not have this problem. However, the additive tensor kernel $k_t$ defined in (2) takes into account the spatial similarity between two different images, which makes little sense in practice. So we adopt a revised version of the additive tensor kernel:

$$k_t(y_i^k \otimes f_i^k, y_j^l \otimes f_j^l) = \begin{cases} \rho_y k_y(y_i^k, y_j^l) + \rho_f k_f(f_i^k, f_j^l), & k = l \\ k_f(f_i^k, f_j^l), & k \neq l \end{cases} \qquad (3)$$
In both (2) and (3) we need to determine two extra parameters, $\rho_y$ and $\rho_f$. To the best knowledge of the authors, there is no principled way to resolve this problem. In practice, we can optimize them using cross validation.
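As an illustration (our sketch, not the authors' implementation), the Gram matrix of the revised additive tensor kernel (3) can be assembled as follows, assuming Gaussian kernels for both $k_y$ and $k_f$ and image data given as arrays of keypoint coordinates and descriptors:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-sigma * ||a - b||^2); used here for both k_y and k_f."""
    return float(np.exp(-sigma * np.sum((a - b) ** 2)))

def tensor_gram(images, rho_y=0.3, rho_f=0.7):
    """images: list of (coords, descs) pairs, where coords has shape (N_k, 2) and
    descs has shape (N_k, d), giving the hyper features y_i^k (x) f_i^k of image k.
    Returns the Gram matrix K_t of the revised additive tensor kernel (3)."""
    feats = [(k, y, f) for k, (Y, F) in enumerate(images) for y, f in zip(Y, F)]
    n = len(feats)
    K = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            k, y_a, f_a = feats[a]
            l, y_b, f_b = feats[b]
            kf = gaussian_kernel(f_a, f_b)
            if k == l:    # same image: weighted spatial + feature similarity
                K[a, b] = rho_y * gaussian_kernel(y_a, y_b) + rho_f * kf
            else:         # different images: feature similarity only
                K[a, b] = kf
    return K
```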

3 Manifold Learning with Twin Kernel Embedding


Given a tensor kernel kt defined in either (2) or (3) and a set of images, a kernel
matrix Kt can be calculated. Kt contains all the similarity information among
hyper features contained in the given images. The matrix can then be sent to a
kernel-based dimensionality reduction algorithm to find the embedding.
We start with a brief introduction of the TKE algorithm and then proceed to image manifold learning with the tensor kernel described in the last section. In this section, for the sake of simplicity, we use $o_i$ to denote the hyper feature data we are dealing with and $x_i$ the corresponding embedding of $o_i$. So $o_i$ could be an image $P_i$ as a collection of features, as mentioned before.

3.1 Twin Kernel Embedding


Twin Kernel Embedding (TKE) preserves the similarity structure of input data
in the latent space by matching the similarity relations represented by two kernel
gram matrices, i.e. one for input data and the other for embedded data. It simply
minimizes the following objective function with respect to the $x_i$'s:

$$L = -\sum_{ij} k(x_i, x_j)\, k_t(o_i, o_j) + \lambda_k \sum_{ij} k^2(x_i, x_j) + \lambda_x \sum_i \|x_i\|^2, \qquad (4)$$

where $k(\cdot,\cdot)$ is the kernel function on the embedded data and $k_t(\cdot,\cdot)$ the kernel function on the hyper feature data of the images. The first term performs the similarity matching; it shares some traits with Laplacian Eigenmaps in that it replaces $W_{ij}$ by $k_t(\cdot,\cdot)$ and the Euclidean distance $\|x_i - x_j\|^2$ on the embedded data by $k(\cdot,\cdot)$. The second and third terms are regularization terms to control the norms of the kernel and the embeddings. $\lambda_k$ and $\lambda_x$ are tunable positive parameters to control the strength of the regularization. The logic is to preserve the similarities among the input data and reproduce them in the lower dimensional latent space, expressed again as similarities among the embedded data.
$k(\cdot,\cdot)$ is normally a Gaussian kernel, i.e.

$$k(x_i, x_j) = \gamma \exp\{-\sigma \|x_i - x_j\|^2\}, \qquad (5)$$

because of its analytical form and its strong relationship with the Euclidean distance. A gradient-based algorithm has to be employed for the minimization of (4). The conjugate gradient (CG) algorithm [18] can be applied to obtain the optimal X, which is the matrix of the embeddings, $X = [x_1, \ldots, x_N]^\top$. The hyper-parameters
of the kernel function $k(\cdot,\cdot)$, $\gamma$ and $\sigma$, can also be optimized in the minimization procedure, which frees us from setting too many parameters. To start the CG, an initial state should be provided; any other dimensionality reduction method could supply it. However, if applicability to non-vectorial data is desired, only a few of them, such as KPCA [19] and KLE [20], are suitable.
It is worth explaining the locality preserving mechanism in TKE, which is implemented via k-nearest neighbors. Given an object $o_i$, for any other input $o_j$, $k_t(o_i, o_j)$ is artificially set to 0 if $o_j$ is not one of the k nearest neighbors of $o_i$. The parameter k (> 1) in the k-nearest neighboring controls the locality that the algorithm will preserve. This process is a kind of filtering that retains what we are interested in while leaving out minor details. However, the algorithm also works without filtering, in which case TKE turns out to be a global approach.
The out-of-sample problem [21] can be easily solved by introducing a kernel mapping

$$X = K_t A \qquad (6)$$

where $K_t$ is the Gram matrix of the kernel $k_t(\cdot,\cdot)$ and A is a parameter matrix to be determined. Substituting (6) into TKE and optimizing the objective function with respect to A instead of X gives a mapping from the original space to the lower dimensional space. Given a new input, its embedding can be found by

$$x_{\mathrm{new}} = k_t(o_{\mathrm{new}}, O) A$$

where O denotes the collection of all the given training data. This algorithm is called BCTKE in [22], where details are provided.
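The following sketch (our illustration, with the regularization weights set to the values used later in the experiments) evaluates the TKE objective (4) with the Gaussian kernel (5); a conjugate gradient or any other gradient-based routine, e.g. scipy.optimize.minimize, could then minimize it over the entries of X, optionally also over γ and σ.

```python
import numpy as np

def tke_loss(X, Kt, gamma=1.0, sigma=1.0, lam_k=0.005, lam_x=0.001):
    """Objective (4): X is the (N, d) matrix of embeddings, Kt the (N, N) tensor
    kernel Gram matrix (possibly k-NN filtered). Returns the scalar loss L."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # ||x_i - x_j||^2
    K = gamma * np.exp(-sigma * sq)                              # Gaussian kernel (5)
    return (-np.sum(K * Kt)            # similarity matching term
            + lam_k * np.sum(K ** 2)   # kernel norm regularizer
            + lam_x * np.sum(X ** 2))  # embedding norm regularizer
```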

3.2 Manifold Learning Process


An elegant feature of TKE is that it can handle non-vectorial data since in its
objective function, it involves only the kernels that can accept non-vectorial in-
puts. It is particularly useful in this case since the only available information
about the images is the tensor kernel which is built on the local features rep-
resented in non-vectorial form. In the last section, we discussed the additive tensor kernel $k_t$. For each hyper feature in every image, expressed as $y_i \otimes f_i$, we can find a point $x_i$ in a d-dimensional space through TKE, where d is pre-specified. This yields a collection of coordinates $\{x_i^k\}_{i=1}^{N_k}$ for the k-th image, denoted $\hat{P}_k$, which is the projection of $P_k$ into the so-called feature embedding space.
This feature embedding space is only an interim layer of the final image manifold learning. Its purpose is to transform the information in the form of local features into objects in a metric space where a distance can be easily defined. There are several distance metrics for comparing two sets of coordinates [23]. A Hausdorff-based distance is a suitable candidate since it handles the situation where the cardinalities of the two sets are different, which is common in real applications. Once we have the distance between two sets of coordinates, i.e. two images, we can proceed to manifold learning using TKE again. Suppose the distance is $d(\hat{P}_i, \hat{P}_j)$;
we revise the objective function of TKE as follows:

$$L = \sum_{ij} k(z_i, z_j)\, d(\hat{P}_i, \hat{P}_j) + \lambda_k \sum_{ij} k^2(z_i, z_j) + \lambda_z \sum_i \|z_i\|^2, \qquad (7)$$
which differs from (4) in that the kernel $k_t(\cdot,\cdot)$ is replaced by a distance metric. We can still minimize (7) with respect to $z_i$. The logic is that when two images are close in the feature embedding space, they are also close on the manifold.
Another, easier way to learn the manifold using TKE is to convert the distance to a kernel by

$$k(\hat{P}_i, \hat{P}_j) = \exp\{-\sigma_k d(\hat{P}_i, \hat{P}_j)\} \qquad (8)$$

and substitute this kernel into (4) in TKE, where $\sigma_k$ is a positive parameter. So we minimize the following objective function:

$$L = -\sum_{ij} k(z_i, z_j) \exp\{-\sigma_k d(\hat{P}_i, \hat{P}_j)\} + \lambda_k \sum_{ij} k^2(z_i, z_j) + \lambda_z \sum_i \|z_i\|^2. \qquad (9)$$

As a conclusion of this section, we restate the procedure here: 1. we apply the tensor kernel $k_t$ to the images as collections of hyper features; 2. we use TKE with $K_t$ to learn the feature embedding space and the projections of the images, i.e. the $\hat{P}_i$'s; 3. finally, we obtain the image manifold by using TKE again or other DR methods. In the following experiments, we use KLE and KPCA for comparison. In step 2, we could also use other kernel-applicable methods such as KPCA, KLE, etc. Interestingly, if we integrate steps 1 and 2 and use (8), the Gram matrix of the kernel $k(\hat{P}_i, \hat{P}_j)$ can be seen as a kernel Gram matrix on the original images, and therefore steps 1 and 2 together amount to a kernel construction on histograms (collections of local features). The dimensionality of the feature embedding space and of the image manifold could be detected using automated algorithms such as rank priors [24]. If visualization is the purpose, a 3D or 2D manifold is preferable.
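A sketch of the glue between steps 2 and 3 (our illustration): a symmetric Hausdorff distance between the projected feature sets of two images, and its conversion into the image-level kernel (8) whose Gram matrix is then fed to TKE, KLE or KPCA.

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between point sets A (n, d) and B (m, d);
    the two sets may have different cardinalities."""
    D = np.sqrt(np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1))
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def image_kernel(projected, sigma_k=1.0):
    """Gram matrix of kernel (8) over the projected images (lists of embedded
    feature coordinates from the feature embedding space)."""
    n = len(projected)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = np.exp(-sigma_k * hausdorff(projected[i], projected[j]))
    return K
```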

4 Experimental Results
We applied the tensor kernel and TKE to image manifold learning on several image data sets: the ducks from the COIL data set, the Frey faces and handwritten digits. They are widely available online for machine learning and image processing tests. For TKE, we fixed λx = 0.001 and λk = 0.005 as stated in the original paper. We chose a Gaussian kernel, Eq. (5), as described in Section 3; its hyperparameters were set to 1 and were updated at runtime. We used the additive tensor kernel (3) and set ρy = 0.3 and ρf = 0.7, which were picked by repeating the same experiment with different ρy and ρf until the best combination was found. This reflects a preference for local features over coordinates. The dimensionality of the feature embedding space, de, and the number of features extracted from the images were maximized according to the capability of the computational platform. For demonstration purposes, we chose to visualize the images in the 2D plane to see the structure of the data.

4.1 COIL 3D Images


We used 36 greyscale duck images of size 128×128 and extracted 60 features from each image. TKE projected all the features into an 80-dimensional feature embedding space. Since all the images are perfectly aligned and noise free, traditional methods such as PCA and MDS can achieve a good embedding using the vectorial representation.

Fig. 1. Ducks: (a) TKE, (b) KLE, (c) KPCA

As we can see from Fig. 1, the tensor kernel on local features captures the intrinsic structure of the duck images, namely the horizontal rotation of the toy duck. This is revealed most clearly by KLE, which gives an almost perfect circular embedding in which the order of the images follows the rotation. TKE appears to focus more on classification information: its embedding shows three connected linear components, each representing a different facing direction. KPCA attempts the same as KLE, but the result is less satisfactory.

4.2 Frey Faces


In this subsection, the objects are 66 images extracted from a data set of 1,965 grayscale images (each 28 × 20 pixels) of a single person's face. The data set originates from a digital movie and was also used in [25]. Two parameters control the images: the face direction and the facial expression. Ideally, the 2D plane should therefore contain two axes, one for face direction from left to right and one for expression from happy to sad.

Fig. 2. Frey faces: (a) TKE, (b) KLE, (c) KPCA

However, such an interpretation is somewhat artificial and may not be close to the truth; we only hope that the algorithms reveal some trace of these two dimensions. In this case, d_e = 30 and 80 features were extracted from each image. The choice of d_e reflects the high computational cost of TKE, which is a major drawback of the algorithm: the number of unknowns to be optimized grows linearly with the number of samples, so when the number of images doubles, d_e has to be halved under the same computational budget. Here KLE does not reveal any meaningful pattern (see Fig. 2). In contrast, TKE's classification property is exhibited very well: it separates happy and unhappy expressions into two distinct groups, and within each group the face direction turns from right to left as we move from top to bottom. We can thus draw two perpendicular axes on TKE's result, a horizontal one for mood and a vertical one for face direction. KPCA reveals a similar pattern to TKE; the only difference is that TKE's result shows a clearer cluster structure.

Fig. 3. Handwritten digits: (a) TKE, (b) KLE, (c) KPCA

4.3 Handwritten Digits

In this section, a subset of handwritten digit images was extracted from the binary alphadigits database, which contains 20×16 binary images of the digits "0" through "9" and the capital letters "A" through "Z", with 39 examples per class. It comes from the Algoval system (available at http://algoval.essex.ac.uk/). Because of the limited resources of our computation platform, we used only the digits 0 to 4 with 20 images per class. We extracted 80 features from each image and cast them into a feature embedding space with d_e = 20.
feature embedding space.
Compared with the previous two experiments, this one is much harder for dimensionality reduction algorithms. It is not clear what the intrinsic dimensionality is; if we choose a dimension that is too small, the DR methods have to discard too much information. In fact, we do not know what the manifold should be, since the digit images are not governed by a few simple parameters as in the previous experiments. We can therefore only expect to see clusters of digits, which is an intuitive interpretation.

It is worth mentioning that for TKE with the tensor kernel we used KPCA in the last step instead of TKE, owing to the computational difficulty. Fig. 3 shows the final image manifolds learnt by the three algorithms with the tensor kernel. TKE's classification capability is even clearer in this experiment: all classes form clear dominant clusters with some overlap. Interestingly, a close examination of TKE's visualization shows that the digit "1" class has two subclasses corresponding to two different writing styles. They are properly separated, and yet, because both look like "1" from the local-feature point of view, the two subclasses lie close to each other and together form the whole digit "1" class. KLE does a very good job of separating digit "1" from the others, but the remaining classes overlap significantly. KPCA produces clear "2" and "4" classes, while the other classes are not distinguishable.

This experiment once again confirms the classification ability of TKE and the effectiveness of the tensor kernel on local features in depicting the structural relationships between images in terms of classification, recognition and perception.

5 Conclusion

In this paper, we proposed using a tensor kernel on local features together with TKE for image manifold learning. The tensor kernel provides a homogeneous kernel solution for images described as collections of local features instead of the conventional vector representation. Its most attractive advantage is that it integrates multiple sources of information in a unified measure framework, so that subsequent algorithms can be applied without difficulty in their theoretical interpretation.

TKE shows very strong potential for classification when used in conjunction with a kernel focused on local features, such as the tensor kernel, so it would be interesting to explore further applications of this method in other areas such as bioinformatics. One drawback of TKE that may limit its applicability is its high computational cost: the number of parameters to be optimized is about O(n²), where n is the product of the target dimension and the number of samples. Further research on whether an efficient approximation is achievable would be very worthwhile.

References
1. Guo, Y., Gao, J., Kwan, P.W.: Twin kernel embedding. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(8), 1490–1495 (2008)
2. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings
of the International Conference on Computer Vision, pp. 1150–1157 (1999)
3. Torki, M., Elgammal, A.: Putting local features on a manifold. In: CVPR (2010)
4. Seung, H., Lee, D.: The manifold ways of perception. Science 290(22), 2268–2269
(2000)
5. Murase, H., Nayar, S.: Visual learning and recognition of 3D objects from appear-
ance. International Journal of Computer Vision 14, 5–24 (1995)

6. Swain, M.J., Ballard, D.H.: Indexing via color histograms. In: Proceedings of the
International Conference on Computer Vision, pp. 390–393 (1990)
7. Verma, B., Kulkarni, S.: Texture feature extraction and classification. LNCS, pp.
228–235 (2001)
8. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(4), 509–522 (2002)
9. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models: Their
training and application. Computer Vision and Image Understanding 61(1), 38–59
(1995)
10. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using
shape contexts. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(4),
509–522 (2002)
11. Sudderth, E.B., Torralba, A., Freeman, W.T., Willsky, A.S.: Describing visual
scenes using transformed objects and parts. International Journal of Computer
Vision 77(1-3), 291–330 (2008)
12. Crammer, K., Kearns, M., Wortman, J.: Learning from multiple sources. Journal
of Machine Learning Research 9, 1757–1774 (2008)
13. Cesa-Bianchi, N., Hardoon, D.R., Leen, G.: Guest editorial: Learning from multiple
sources. Machine Learning 79, 1–3 (2010)
14. Hardoon, D.R., Shawe-Taylor, J.: Decomposing the tensor kernel support vector
machine for neuroscience data with structured labels. Machine Learning 79, 29–46
(2010)
15. Schölkopf, B., Smola, A.: Learning with Kernels: Support Vector Machines, Regu-
larization, Optimization, and Beyond. The MIT Press, Cambridge (2002)
16. Evgeniou, T., Micchelli, C.A., Pontil, M.: Learning multiple tasks with kernel meth-
ods. Journal of Machine Learning Research 6, 615–637 (2005)
17. Gärtner, T., Lloyd, J.W., Flach, P.A.: Kernels for structured data. In: Proceedings
of the 12th International Conference on Inductive Logic Programming (2002)
18. Nabney, I.T.: NETLAB: Algorithms for Pattern Recognition. In: Advances in Pat-
tern Recognition. Springer, London (2004)
19. Schölkopf, B., Smola, A.J., Müller, K.: Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation 10, 1299–1319 (1998)
20. Guo, Y., Gao, J., Kwan, P.W.: Kernel laplacian eigenmaps for visualization of non-
vectorial data. In: Sattar, A., Kang, B.-h. (eds.) AI 2006. LNCS (LNAI), vol. 4304,
pp. 1179–1183. Springer, Heidelberg (2006)
21. Bengio, Y., Paiement, J., Vincent, P., Delalleau, O., Roux, N.L., Ouimet, M.: Out-
of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering. In:
Advances in Neural Information Processing Systems, vol. 16
22. Guo, Y., Gao, J., Kwan, P.W.: Twin Kernel Embedding with back constraints. In:
HPDM in ICDM (2007)
23. Cuturi, M., Fukumizu, K., Vert, J.P.: Semigroup kernels on measures. Journal of
Machine Learning Research 6, 1169–1198 (2005)
24. Geiger, A., Urtasun, R., Darrell, T.: Rank priors for continuous non-linear di-
mensionality reduction. In: IEEE Conference on Computer Vision and Pattern
Recognition, pp. 880–887 (2009)
25. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear em-
bedding. Science 290(22), 2323–2326 (2000)
Author Index

Adams, Brett I-136 Fan, Xiannian II-309


Akinaga, Yoshikazu I-525 Fang, Gang I-338
Anand, Rajul II-51 Fujimoto, Hiroshi I-525
Azevedo, Paulo II-432 Fujiwara, Yasuhiro II-38
Fung, Pui Cheong Gabriel I-26
Bai, Kun I-500
Baquero, Carlos II-123 Gallagher, Marcus II-135
Bench-Capon, Trevor II-357 Gao, Jun II-270
Berrada, Ghita II-457 Gao, Junbin II-544
Bhat, Harish S. I-399 Gao, Liangcai I-500
Bhattacharyya, Dhruba Kumar I-225 Gao, Ning II-482
Borgelt, Christian II-493 Garg, Dinesh I-13
Bow, Mark I-351 Gong, Shaogang II-296
Bruza, Peter I-363 Greiner, Russell I-124
Buza, Krisztian II-149 Günnemann, Stephan II-444
Guns, Tias II-382
Cao, Wei II-370 Guo, Yi II-544
Caragea, Doina II-75 Guo, Yuanyuan I-100
Chawla, Sanjay II-345 Gupta, Sunil Kumar I-136
Chen, Hui-Ling I-249
Chen, Songcan II-506 He, Dan II-532
Cheng, Victor I-75 He, Dongxiao II-123
Coenen, Frans II-357 He, Jiangfeng II-258
Costa, Joaquim II-432 He, Jing I-375
He, Jun II-407
Dai, Bi-Ru I-1 He, Qinming II-258
de Keijzer, Ander II-457 He, Xian II-420
DeLong, Colin II-519 Hirose, Shuichi II-26
Deng, Zhi-Hong II-482 Hospedales, Timothy M. II-296
De Raedt, Luc II-382 Hsu, Shu-Ming I-1
de Sá, Cláudio Rebelo II-432 Hu, Weiming II-270
Desai, Aditya II-469 Hu, Xuegang I-313
De Smet, Wim I-549 Huang, Bingquan I-411
Di, Nan I-537 Huang, Guangyan I-375
Ding, Chris I-148 Huang, Hao II-258
Ding, Zhiming I-375 Huang, Heng I-148
Dobbie, Gillian I-387 Huang, Houkuan I-38
Du, Jun II-395 Huang, Joshua I-171
Du, Xiaoyong II-407 Huang, Xuanjing I-50
Huang, Ying I-411
Erickson, Kendrick II-519 Huynh, Dat I-476
Etoh, Minoru I-525
Inge, Meador II-198
Faloutsos, Christos II-13 Ivanović, Mirjana I-183
Fan, Jianping II-87 Iwai, Hiroki II-185

Jia, Peifa I-448 Luo, Chao II-370


Jiang, Jia-Jian II-482 Luo, Dan II-370
Jin, Di II-123 Luo, Dijun I-148
Jing, Liping I-38, I-171 Luo, Jun II-87
Jorge, Alı́pio Mário II-432 Luo, Wei II-135

Kang, U II-13 Ma, Lianhang II-258


Kantarcioglu, Murat II-198 Ma, Wanli I-476, II-246
Kasabov, Nikola II-161 Makris, Dimitrios II-173
Kashima, Hisashi I-62, II-222 Mao, Hua II-420
Kechadi, M.-T. I-411 Mao, Xianling I-537
Khoshgoftaar, Taghi M. I-124 Marukatat, Sanparith I-160
Kimura, Daisuke I-62 Masada, Tomonari I-435
Kinno, Akira I-525 Mayers, André I-265
Kitsuregawa, Masaru II-38 Meeder, Brendan II-13
Koh, Yun Sing I-387 Mladenić, Dunja I-183
Kremer, Hardy II-444 Moens, Marie-Francine I-549
Kuboyama, Tetsuji I-62 Monga, Ernest I-265
Kudo, Mineichi II-234 Morstatter, Fred I-26
Kumar, Vipin I-338 Muzammal, Muhammad II-210
Kutty, Sangeetha I-488 Nakagawa, Hiroshi I-87
Nakamura, Atsuyoshi II-234
Lau, Raymond Y.K. I-363 Nanopoulos, Alexandros II-149
Laufkötter, Charlotte II-444 Napolitano, Amri I-124
Le, Huong Thanh I-512 Nayak, Richi I-488, II-99
Le, Trung II-246 Nebel, Jean-Christophe II-173
Lewandowski, Michal II-173 Nguyen, Thien Huu I-512
Li, Chao II-87 Nguyen, Thuy Thanh I-512
Li, Chun-Hung I-75, I-460 Nijssen, Siegfried II-382
Li, Jhao-Yin II-111
Li, Lian II-63 Oguri, Kiyoshi I-435
Li, Nan I-423 Okanohara, Daisuke II-26
Li, Pei II-407 Onizuka, Makoto II-38
Li, Peipei I-313
Li, Xiaoming I-537 Pan, Junfeng I-289
Li, Yuefeng I-363, I-488 Parimi, Rohit II-75
Li, Yuxuan II-321 Parker, D.S. II-532
Li, Zhaonan II-506 Pathak, Nishith II-519
Liang, Qianhui I-313 Pears, Russel I-387, II-161
Ling, Charles X. II-395 Perrino, Eric II-519
Liu, Bing I-448 Phung, Dinh I-136
Liu, Da-You I-249 Pudi, Vikram II-469
Liu, Dayou II-123 Qing, Xiangyun I-301
Liu, Hongyan II-407 Qiu, Xipeng I-50
Liu, Huan I-26 Qu, Guangzhi I-209
Liu, Jie I-249
Liu, Wei II-345 Radovanović, Miloš I-183
Liu, Xiaobing I-537 Raman, Rajeev II-210
Liu, Ying I-500 Reddy, Chandan K. II-51
Lu, Aidong II-1 Ru, Liyun II-506

Sam, Rathany Chan I-512 Widiputra, Harya II-161


Sarmah, Rosy Das I-225 Wu, Hui I-209
Sarmah, Sauravjyoti I-225 Wu, Jianxin I-112
Sato, Issei I-87 Wu, Leting II-1
Schmidt-Thieme, Lars II-149 Wu, Ou II-270
Segond, Marc II-493 Wu, Xindong I-313
Seidl, Thomas II-444 Wu, Xintao II-1
Sharma, Dharmendra I-476, II-246 Wyner, Adam II-357
Shevade, Shirish I-13
Shibata, Yuichiro I-435 Xiang, Tao II-296
Shibuya, Tetsuo I-62 Xiang, Yanping II-420
Shim, Kyong II-519 Xiong, Tengke I-265
Singh, Himanshu II-469 Xu, Hua I-448
Sinthupinyo, Wasin I-160 Xu, Yue I-363
Soares, Carlos II-432 Xue, Gui-Rong I-289
Spencer, Bruce I-100
Srivastava, Jaideep II-519 Yamanishi, Kenji II-185
Steinbach, Michael I-338 Yan, Hongfei I-537
Su, Xiaoyuan I-124 Yang, Bo I-249, II-123
Sun, Xu II-222 Yang, Jing II-63
Sundararajan, Sellamanickam I-13 Yang, Pengyi II-333
Yang, Weidong I-423
Tabei, Yasuo II-26 Yeh, Mi-Yen II-111
Takamatsu, Shingo I-87 Yin, Jianping I-237
Takasu, Atsuhiro I-435 Ying, Xiaowei II-1
Tang, Jie I-549, II-506 Yoo, Jin Soung I-351
Tang, Ke II-309 Yu, Hang II-482
Tomašev, Nenad I-183 Yu, Haoyu I-338
Tomioka, Ryota II-185, II-222 Yu, Jeffrey Xu II-407
Tran, Dat I-476, II-246 Yu, Jian I-38, I-171
Tsai, Flora S. II-284 Yu, Yong I-289
Tsuda, Koji II-26 Yun, Jiali I-38, I-171

Ueda, Naonori II-222 Zaelit, Daniel I-399


Urabe, Yasuhiro II-185 Zeng, Yifeng II-420
Zhai, Zhongwu I-448
Venkatesh, Svetha I-136 Zhan, Tian-Jie I-460
Zhan, Yubin I-237
Wan, Xiaojun I-326 Zhang, Chengqi II-370
Wang, Baijie I-196 Zhang, Harry I-100
Wang, Bin I-100 Zhang, Kuo II-506
Wang, Bo II-506 Zhang, Xiaoqin II-270
Wang, Gang I-249 Zhang, Xiuzhen II-321
Wang, Shengrui I-265 Zhang, Yanchun I-375
Wang, Su-Jing I-249 Zhang, Yi II-284
Wang, Xin I-196 Zhang, Yuhong I-313
Wang, Xingyu I-301 Zhang, Zhongfei (Mark) II-270
Wang, Yang I-289 Zhang, Zili II-333
Wardeh, Maya II-357 Zhao, Yanchang II-370
Weise, Thomas II-309 Zhao, Zhongying II-87

Zhou, Bing B. II-333 Zhu, Guansheng I-423


Zhou, Jinlong I-50 Zhu, Hao I-423
Zhou, Xujuan I-363 Zhu, Xingquan I-209
Zhou, Yan II-198 Žliobaitė, Indrė I-277
Zhou, Zhi-Hua II-1 Zomaya, Albert Y. II-333
